... <Mapping> <Scheduler="fork">
Fig. 4. Gauss-Seidel MetaPL description
Fig. 5. Gauss-Seidel simulation (plot legends: "gauss.globus", "gauss.globus.sim", "gauss.globus.err")
5 Conclusions This paper has presented a performance-oriented approach to GRID application development, founded on the adoption of a simulation environment (HeSSE) and a prototype language (MetaPL) for performance evaluation of the application at every step of the target software development. We developed GRID simulation components for the existing simulation environment, able to reproduce the performance behavior of the main Globus components. The proposed components were validated on a test-bed built upon an SMP cluster. Then a simple application was developed in the prototype language (MetaPL), adopting the newly developed GRID extensions, in order to drive the simulations, and its performance was predicted.
To validate the approach, we developed the application and compared the results with the predicted ones, showing that the simulated behavior corresponds to the real system evolution.
Acknowledgment This work was partially supported by the "Centro di Competenze" of Regione Campania. We want to thank Raffaele Vitolo for his technical support and Rocco Aversa for his contribution.
Towards a Bulk-Synchronous Distributed Shared Memory Programming Environment for Grids
Håkan Mattsson 1 and Christoph Kessler 2
1 Department of Natural Science and Technology, Gotland University, Sweden, [email protected]
2 Programming Environments Laboratory (PELAB), Department of Information and Computer Science, Linköping University, Sweden, [email protected]
Abstract. The current practice in grid programming uses message passing, which unfortunately leads to code that is difficult to understand, debug and optimize. Hence, for grids to become commonly accepted, also as general-purpose parallel computation platforms, suitable parallel programming environments need to be developed. In this paper we propose an approach to realize a distributed shared memory programming environment for computational grids called GridNestStep, by adopting NestStep, a structured parallel programming language based on the Bulk Synchronous Parallel model of parallel computation.
1 Introduction Programming for grids [1] is currently not an easy task, which is partly due to the distributed memory view that grids provide, leaving the programmer with the burden of handling communication between grid nodes explicitly. Today, programming for the grid often involves using message passing, for example MPICH-G2 [2], on top of a grid middleware (like the Globus Toolkit [3]). As stated in [1], MPICH-G2, the Globus Toolkit and other tools “. . . make it possible to write Grid programs, they do not make it easy.” Much work is still left to the programmer, such as handling communication, synchronization, and load balancing. In order to ease the burden of programming for a grid environment, alternative programming models are needed. One such model is distributed shared memory, in which the programmer sees the grid environment as a single (virtual) parallel computer with a single (virtual) shared memory. Providing this shared view of memory would, ideally, make programming for the grid as easy as for an ordinary shared memory computer, avoiding the error-prone use of message passing. To achieve this, either the application must be translated into a program utilizing the message passing model, or the message passing is done within an underlying system software layer that emulates a shared memory. Currently, the most suitable kinds of applications for grids are embarrassingly parallel ones, such as parameter studies that simultaneously run multiple copies of the same sequential program, each with different input parameters. Another example application, although in this case in a peer-to-peer computing (P2PC) environment, is the SETI@Home
project (Search for Extraterrestrial Intelligence) [4], where workpackages, containing data, can be downloaded and processed by voluntary participants. However, applications like parameter studies or SETI@Home exhibit only a trivial kind of parallelism, since the workpackages are uniformly structured, no communication is required between the participating grid nodes, and inter-process coordination is limited to dispatching jobs to the grid and collecting the returned results. For applications with a less trivial structure of parallelism, generation of efficient, scalable code for grid and peer-to-peer computing systems is still an unsolved problem. This research aims at the development of a high-level programming environment called GridNestStep, which will provide a distributed shared memory layer on top of a computational grid system. Processor coordination, the shared memory consistency model, and the communication structure for remote memory accesses in GridNestStep follow the bulk-synchronous parallel (BSP) model of computation. GridNestStep will be built upon previous work by adopting the structured parallel programming language NestStep [5,6].
2 The BSP Model and NestStep NestStep is a parallel programming language for the Bulk Synchronous Parallel (BSP) [7] model of parallel computation. Adopting a single-program, multiple-data (SPMD) execution style, BSP organizes a parallel computation of a group of p processors into supersteps that are separated by barrier synchronizations, see Figure 1.
Fig. 1. A BSP superstep (processors perform a local computation phase using local data only, followed by a communication phase with message passing; supersteps are delimited by global barriers)
Each superstep consists of a first phase in which a processor performs computation on its local data and locally cached copies of remote data, and a second phase where the processor sends requests for reading or updating remote data. After performing a (conceptual) group-wide barrier synchronization the processor handles the messages received during the previous superstep and proceeds into the next superstep. Originally the BSP model defined communication in terms of explicit message passing between processors. In contrast, the BSP-based programming language NestStep
[5,6] was designed to support software emulation of shared variables for the BSP model. NestStep also supports nested parallelism by providing language constructs for static and dynamic nesting of supersteps and for performing synchronization of processor subsets. This is done by organizing processors into groups, which can be dynamically divided and restored during the execution of a program, thus following the (nested) structure of the supersteps. NestStep is defined as a set of language constructs that can extend any imperative programming language, as long as unrestricted pointers are avoided. NestStep is implemented as a precompiler that translates the NestStep constructs into the base language with calls to a runtime system. The base language compiler is then used to compile these source files and link them to the NestStep runtime library. Currently supported languages are Java, using the Java API for TCP/IP socket communication, and C, using MPI for communication.
In NestStep, variables, arrays and objects are either private to a process or shared between a group of processes. For shared variables, arrays and objects, NestStep guarantees that at superstep boundaries all copies contain the same value (superstep consistency). NestStep defines two variants of sharing:
– A replicated shared variable, array or object exists as a local copy on each processor in the group declaring it. Write accesses to the local copy are automatically committed to all remote copies at the end of the superstep.
– A distributed shared array is partitioned between the processors in the group, and each processor has direct access to its partition. Read accesses to remote elements are done by prefetching in the preceding superstep's communication phase. Remote write accesses translate to update messages that are committed at the end of a superstep.
In both cases, message passing for updating remote copies of shared variables and array elements occurs only in the communication phase at the end of a superstep, which follows the BSP model.
NestStep denotes BSP supersteps by the step and neststep statements. step {statements} groups a set of statements to be executed as one superstep (see Figure 1), while neststep (...) {statements} partitions the current processor group into a user-defined number of subgroups such that each subgroup executes the statements independently from the others, as a superstep. For example, the statement neststep(2, ...) splits the current processor group, say G, into two separate groups G1 and G2. After the split, the two processor groups execute their respective supersteps independently of each other, such that changes to shared variables are only committed within each subgroup. Finally, when the subgroups have executed their supersteps, the parent group is restored and subgroup changes to shared variables defined in the parent group are committed. This supports statically and dynamically nested parallelism and maps immediately to parallel divide-and-conquer computations.
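As a purely illustrative sketch of these constructs (not taken from [5,6]; the concrete NestStep-C declaration and combine syntax is not shown in this paper, so the sh qualifier and the helper functions below are assumptions):

    sh int sum = 0;             /* assumed syntax: a replicated shared variable      */
    int local = compute_part(); /* private per-processor value (hypothetical helper) */

    step {                      /* one BSP superstep executed by the whole group     */
        sum += local;           /* local write; combined and committed to all copies */
    }                           /* at the implicit group-wide barrier                */

    neststep(2, ...) {          /* split the current group G into subgroups G1, G2   */
        /* each subgroup runs this as its own superstep; changes to shared           */
        /* variables are committed only within the subgroup                          */
        solve_subproblem();     /* hypothetical divide-and-conquer branch            */
    }                           /* parent group restored; subgroup changes to shared */
                                /* variables of the parent group are committed       */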
Summarizing, the step and neststep statements maintain two kinds of invariants during the execution of a NestStep program:
– Superstep synchronicity: all processors of the same (active) processor group are working on the same superstep. This implies an implicit group-wide barrier synchronization at the beginning and the end of the statement.
– Superstep consistency: before a (nest)step statement is performed, all processors belonging to the same group have the same value in their local copies of a shared variable. At the end of a (nest)step statement, a combine phase is carried out to combine and commit changes made to copies of replicated shared variables and remote shared array elements, re-establishing superstep consistency.
Communication between processors in a group is structured as a tree. During the combine phase, messages are sent from the leaf processors towards the root of the tree. The result of the combine phase (where several global operations such as reductions and parallel prefix can be performed on-the-fly) is then committed to the processors of the group by sending messages in the opposite direction.
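The combine/commit phase can be pictured with plain MPI collectives, which internally use similar tree topologies: an upward reduction followed by a downward broadcast. The C sketch below is only an analogy for one replicated shared integer updated by additive contributions; it is not the NestStep runtime itself.

    #include <mpi.h>

    /* Emulate the end-of-superstep combine for one replicated shared variable:
       contributions travel up a tree (reduction), the combined result travels
       back down (broadcast), so every copy holds the same value afterwards.   */
    int commit_replicated_sum(int local_contribution, MPI_Comm group)
    {
        int combined = 0;
        /* leaves -> root: combine local contributions (here: a sum)           */
        MPI_Reduce(&local_contribution, &combined, 1, MPI_INT, MPI_SUM, 0, group);
        /* root -> leaves: commit the combined value to all group members      */
        MPI_Bcast(&combined, 1, MPI_INT, 0, group);
        return combined;
    }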
3 Mapping the BSP Model to a Grid Environment We propose to adapt the BSP model with the NestStep language to grid computing environments. As Figure 2 illustrates, this could be done, e.g., for NestStep-C, that is, using C/C++ with NestStep language extensions. A precompiler translates the NestStep constructs to run-time system calls. The result is an explicitly parallel program decomposed into supersteps. For simplicity, we consider for now only flat (non-nested) supersteps.
Fig. 2. Running NestStep programs on a grid system (a C program using the NestStep runtime library; the current superstep is divided into workpackages, which a scheduler sends to the clusters of the grid platform)
The (virtual) BSP processors' computations within a single superstep are embarrassingly parallel, which perfectly matches the constraints encountered in a grid environment. Moreover, by superstep synchronicity, all grid processors involved will, at any time, work on the same superstep. Hence, we can define a workpackage by the computation of a single superstep and a set of (virtual) BSP processors to execute that superstep.
Following the hierarchical structure of grids, we partition the computation work of a superstep into workpackages of appropriate size. Within each superstep S we define macro-workpackages w_1^S, ..., w_P^S by aggregating the work (program code with input data sets and a set of global BSP processor indices) for P disjoint processor subsets to be allocated for execution of S, which are to be farmed out to the P grid nodes (the clusters in Figure 2) available for the parallel computation. The sizes of these subsets are to be chosen depending on the current load of the grid nodes (if known); otherwise, load balancing must be handled by the grid middleware. Assuming that each grid node j gets one macro-workpackage w_j^S, then w_j^S can be locally decomposed into smaller workpackages, one for each available processor of j, which are then executed locally, thereby exploiting more fine-grained parallelism within the grid nodes. After partitioning, the macro-workpackages are sent by a scheduler via the grid middleware to the grid nodes, see Figure 2. The results of executing the macro-workpackages are delivered via the scheduler to the client (NestStep) program. As mentioned in Section 2, NestStep can define replicated shared variables. Since the value of each of these variables must be identical on all processors of the current group at the beginning of each superstep, the processors must communicate their locally held value of the shared variable at the end of each superstep. One problem, therefore, is how to realize this communication efficiently in the grid environment, and to ensure that this phase does not take an arbitrary amount of time, which could happen if some of the processors are still busy for a while with other work before processing their workpackage, thus delaying the entire superstep computation. In order to combine write accesses to shared variables, and broadcast updated values to all processors of a group at the beginning of the next superstep, we apply the same principle of combine trees as implemented in the NestStep runtime system [5], where the topmost level of the combine tree, including the root (client), spans the different grid nodes and the subtree for each grid node spans cluster-local processors only. Note that these trees are reconstructed for each superstep, as the number of available processors per cluster may change. In the ideal case, the compiled code for the GridNestStep application program is already available on each processor on each grid node, such that a workpackage simply contains a reference to the current superstep and the range of (virtual) BSP processors assigned to each grid node. Note that a physical processor may emulate several (virtual) BSP processors if necessary. Otherwise, we also need to ship the superstep code along with the workpackage to the grid nodes. In grids with heterogeneous processor hardware, this involves the problem of distributing the application in different versions, compiled for the respective node. One solution might be to precompile the application into a library of binary codes for the different computer architectures. However, this means we will only support a fixed set of architectures. Another possibility is to ship "raw" source code and compile it at the nodes.
This would be a very flexible solution, since we can use any computational resources available at run time, but it requires a compiler along with all required libraries at each node and adds the overhead of compilation time to the time for processing the data. A variant of this scenario ships target-independent byte code that is compiled by a Just-In-Time compiler at the nodes. This has the advantage that we can perform
optimization on the code during run time. A further problem is the security aspect, i.e., the prevention of distributing malicious code. However, at this stage we are not concerned with this problem. The most challenging problem is how to perform load balancing in the system. As nodes in the grid are allocated work, we need a way to monitor the load on each of them, in order to prevent unbalanced work. One idea is to communicate status information to the scheduler along with the computational results from each node. This information is then analyzed to make load balancing decisions for the next superstep. An alternative could be decentralized load balancing. For instance, a two-tier system with local load balancers on each grid node involved may primarily focus on intra-cluster load balancing but allow for inter-cluster task migration when an entire cluster runs out of local work. Cluster-aware load balancing strategies for multi-cluster environments have been investigated by van Nieuwpoort et al. [8], albeit for more fine-grained and non-BSP computations. In computational grids with low reliability, as in P2PC systems, the question arises of what should happen when a node, for some reason, becomes unavailable during the processing of a superstep. This can be handled (precisely as in SETI@Home) by giving out multiple copies of the same workpackage to several grid nodes and tracking incoming results such that, especially in the case of accumulative updates to shared variables (reductions), only one of the contributions is considered.
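Summing up the information a macro-workpackage has to carry, a purely illustrative C layout might look as follows; all field names are hypothetical and are not part of any existing GridNestStep interface.

    #include <stddef.h>

    /* Hypothetical descriptor of a macro-workpackage w_j^S for superstep S.      */
    typedef struct {
        int superstep_id;        /* S: which superstep of the program to run       */
        int first_bsp_proc;      /* first (virtual) BSP processor index assigned   */
        int num_bsp_procs;       /* size of the disjoint BSP processor subset      */
        int replica_id;          /* > 0 if this is a redundant copy (fault tol.)   */
        const char *code_ref;    /* reference to precompiled superstep code, or    */
                                 /* to shippable source/byte code                  */
        const void *input_data;  /* serialized input data set for this subset      */
        size_t input_size;       /* size of the serialized input in bytes          */
    } macro_workpackage;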
4 Towards a Prototype Implementation Currently we are working on the first steps towards an implementation of GridNestStep by adapting a single-cluster implementation of the NestStep run-time system on top of MPI, and adding further system components (partitioner, scheduler, precompiler) needed for a grid environment. As grid platform we will use the SweGrid hardware of the current Swedish grid initiative [10] together with the NorduGrid middleware [11]. In this case, the scheduler resides on the client machine, too. When jobs are submitted, the NorduGrid middleware will make sure that the receiving cluster matches the application-specific requirements on, for example, processor type and availability, memory size, operating system, and installed software. These requirements can be specified by the user in xRSL (Extended Resource Specification Language) [11].
5 Application: GridModelica GridModelica is a project aiming at creating a grid-enabled extension and implementation of the modeling language Modelica [9]. Modelica is an object-oriented, equation-based language for modeling physical systems. From a model specification, the Modelica compiler generates a large system of ordinary and partial differential equations that is submitted to an iterative solver. Executing the solver corresponds to a simulation of the system. There exist parallel versions of such iterative solver algorithms that have a bulk-synchronous parallel structure and thus can be formulated as NestStep programs. The GridModelica project builds a two-tier system for grid-based multi-domain modeling and simulation, by extending the Modelica language with additional constructs for
distributed model component libraries and grid-enabled model composition. At the simulation layer, the GridNestStep programming environment will constitute the computational platform for executing the simulation programs generated by the GridModelica compiler.
6 Summary Grid computing has become an area of extensive research. However, currently grid programming is error-prone and much work is still left to the programmer. One (partial) solution to this problem is the introduction of a distributed shared memory environment with a simple, grid-compliant shared memory consistency model. We have proposed a language-based approach that follows the bulk-synchronous parallel model, and sketched how the superstep structure of BSP programs can be mapped to computational grids. Beyond conceptual work on program transformations, communication optimization, load balancing, and workpackage layout, several implementation aspects need further work. We will investigate different ways for the scheduler to get and exploit status information from the grid middleware. In the first implementation we will only be concerned with handling of flat supersteps. However, later on we would like to investigate the possibility to express nested parallelism in GridNestStep and exploit intra-cluster locality by mapping entire processor subgroups to the same grid node.
Acknowledgements This work is partly supported by VINNOVA, project GridModelica.
References
1. Ian Foster and Carl Kesselman. The Grid 2: Blueprint for a New Computing Infrastructure. Morgan Kaufmann Publishers, second edition, 2003.
2. Nicholas T. Karonis, Brian Toonen, and Ian Foster. MPICH-G2: A Grid-Enabled Implementation of the Message Passing Interface. Journal of Parallel and Distributed Computing, 63(5):551–563, May 2003.
3. The Globus Alliance. http://www.globus.org
4. David P. Anderson, Jeff Cobb, Eric Korpela, Matt Lebofsky, and Dan Werthimer. SETI@Home: an experiment in public-resource computing. Communications of the ACM, 45(11):56–61, 2002.
5. C. Kessler. Managing distributed shared arrays in a bulk-synchronous parallel programming environment. Concurrency and Computation: Practice and Experience, 16:133–153, 2004.
6. C. W. Keßler. NestStep: Nested Parallelism and Virtual Shared Memory for the BSP Model. The Journal of Supercomputing, 17(3):245–262, November 2000.
7. L. G. Valiant. A Bridging Model for Parallel Computation. Communications of the ACM, 33(8):103–111, August 1990.
8. R. V. van Nieuwpoort, T. Kielmann, and H. E. Bal. Efficient Load Balancing for Wide-Area Divide-and-Conquer Applications. Proceedings of the Eighth ACM SIGPLAN Symposium on Principles and Practices of Parallel Programming, 2001.
9. Peter Fritzson. Principles of Object-Oriented Modeling and Simulation with Modelica 2.1. Wiley-IEEE Press, 2004.
10. SweGrid. http://www.swegrid.se
11. P. Eerola et al. The NorduGrid architecture and tools. Proceedings of Computing in High Energy and Nuclear Physics, 2003.
High-Performance Computing in Earth- and Space-Science: An Introduction
Organizer: Peter Messmer
Tech-X Corporation, 5621 Arapahoe Avenue, Suite A, Boulder, CO 80303, USA
[email protected]
1 Introduction The major goals of earth- and space-science (ESS) investigations are the observation, understanding, modeling and prediction of effects on earth and in space. Advanced computing is required at all of these stages: the increasing amount of data recorded by sensors has to be processed, archived, and retrieved. In order to extract knowledge from these observations, advanced algorithms are required to assist the scientist in filtering this data in a multitude of ways. Finally, in order to develop predictive capabilities, models are required that can capture all the important effects in a physical system. Early modeling efforts were mainly driven by effects discovered in new observations. The algorithms were then developed to include the necessary physics to understand these observations. However, the physical systems under investigation in ESS typically involve a multitude of effects on a variety of scales. Modern algorithms therefore try to span broader ranges of scales and capture a multitude of physical processes. This will finally lead to predictive capabilities, one of the goals of, e.g., the space-weather simulations sponsored by NASA's Sun-Earth Connection Division. However, the choice of the most appropriate algorithms is only the first step toward successful large-scale models. The codes have to be carefully designed to utilize the sophisticated hardware they run on. This can be taken into account relatively easily when designing a new code from scratch. But often large-scale codes originate from legacy single-effect models. Combining these codes into multi-physics models requires the choice of appropriate data structures to avoid excessive data transformation. And finally, advanced tools are needed for the analysis, reduction and management of the large data sets generated by these codes. The goal of this mini-symposium was to bring together physicists and computer scientists, presenting both computational problems faced by the ESS communities, as well as solutions and tools from high-performance computing.
2 Applications Over the past few decades, a multitude of algorithms have been developed to model physical processes at various scales involved in ESS models. The assumptions underlying each of these models prove useful in their particular domain and can yield important
scientific results. In this mini-symposium, results of several of these models were presented. Monte-Carlo-based methods are an attractive means to investigate phenomena of a statistical nature, like diffusive processes. In order to obtain physically meaningful results, a good sampling of the initial conditions and the configuration space is required. The wide availability of PC clusters enables these simulations without the extensive costs and complications of high-performance computing platforms. First-principles studies are an additional field that draws a lot of attention, especially due to the simplicity of the physical assumptions. These codes usually yield good parallel performance via domain decomposition, allowing them to work efficiently on large-scale systems. Increasing computing performance therefore enables first-principles models on scales which were otherwise the domain of more approximate methods. An example is given by fully kinetic electromagnetic models of dusty plasmas, which are usually treated with hybrid kinetic-fluid algorithms. Nevertheless, for most problems in ESS, first-principles models are still computationally too expensive. Often, all the details of the kinetic development are not required and a fluid approximation is valid. Studies like the interaction of the solar wind with a comet or a planet are examples where fluid or MHD models can be very helpful tools. Finally, for a variety of problems, hybrid approaches can provide the physical fidelity of kinetic models while keeping the speed benefit of fluid models. E.g., for reactive flows, the chemical processes can be modeled using kinetic equations, while the motion is based on fluid equations. The fast execution of these codes and high-performance computing techniques allow modeling such processes faster than in real time.
3 Algorithms and Tools While many exciting results can be generated using well-established algorithms, a lot of models are still limited by present-day computing resources. E.g., particle-based first-principles models are often computationally too demanding. In particular, constraints on numerical stability, like the Courant condition, prohibit large-scale simulations over extended periods of time. New algorithms, sometimes based on fundamentally new approaches, are therefore required to relax those constraints. Improved algorithms are the most advantageous approach to decrease the computational cost. However, these algorithms tend to be complex by themselves, and their parallel implementation can add an unnecessary burden on the developers. Tools are therefore required that simplify these implementations. This is particularly true for multi-physics models. Most of these models were developed as standalone applications, and in order to couple them, an infrastructure is required to mediate between the quantities of the different models. E.g., to model magnetic reconnection, it would be beneficial to model the actual reconnection region in a fully kinetic way, whereas in the outer regions fluid or hybrid approximations are sufficient. Another example is global climate modeling, where oceanic and atmospheric models have to be coupled and quantities have to be converted between the different codes. Providing an infrastructure for these tasks is therefore highly desirable.
4 Conclusion The mini-symposium on High-Performance Computing in Earth and Space-Science brought together an exciting mix of papers from both physics and computer science. It became clear that cooperation between these communities is important in order to fully exploit the potential of present-day high-performance computing platforms. I hereby wish to thank the PARA04 organizers for hosting this mini-symposium, all the referees for reviewing the contributed papers and finally all the authors who submitted their papers to the workshop.
Applying High Performance Computing Techniques in Astrophysics
Francisco Almeida 1, Evencio Mediavilla 2, Alex Oscoz 2, and Francisco de Sande 1
1 Dept. de Estadística, Investigación Operativa y Computación, Univ. de La Laguna, 38271 La Laguna, Spain, {falmeida,fsande}@ull.es
2 Instituto de Astrofísica de Canarias (IAC), c/ Vía Láctea s/n, 38271 La Laguna, Spain, {emg,aoscoz}@ll.iac.es
Abstract. When trying to improve the execution time of scientific applications using parallelism, two alternatives appear as the most widespread: using explicit message passing or using a shared memory address space. MPI and OpenMP are nowadays the industrial standards for each alternative. We address the parallelization of an astrophysics code used to measure various properties of the accretion disk around a black hole. Different parallel approaches have been implemented: pure MPI and OpenMP versions using cyclic and block distributions, a hybrid MPI+OpenMP parallelization, and an MPI Master-Slave strategy. A broad computational experience on a ccNUMA SGI Origin 3000 architecture is presented. From the scientific point of view, the most profitable conclusion is the confirmation of the robustness of the technique that the original code implements.
1 Introduction This work collects the experiences of a collaboration between researchers coming from two different fields: astrophysics and parallel computing. We deal with different approaches to the parallelization of a scientific code that solves an important problem in astrophysics, the detection of supermassive black holes. To accomplish this, light curves of Quasi-stellar Object (QSO) images must be analytically modeled. The scientific aim is to find the values that minimize the error between this theoretical model and the observational data according to a chi-square criterion. The robustness of this procedure depends on the sampling size m. However, even for relatively small values of m, the determination of the minimum takes a long execution time, since m^5 starting points of the grid must be considered (for example, 4^5 = 1024 points for m = 4 and 5^5 = 3125 points for m = 5), as will be shown later in Section 2. We will show that parallelization can reduce the total execution time, allowing us to greatly increase the sampling and, consequently, to improve the robustness of the evaluated minimum. The sequential code lends itself to different parallel implementations, varying from message-passing (MPI) to shared-memory (OpenMP) codes and hybrid solutions combining both (MPI+OpenMP). One of the aims of this work is to obtain the maximum reduction in the execution time of the original sequential code. With this goal in mind,
This work has been partially supported by the EC (FEDER) and the Spanish MCyT (Plan Nacional de I+D+I, TIC2002-04498-C05-05 and TIC2002-04400-C03-03).
we must also consider that the amount of time invested in code development is a serious limitation for scientific programmers who are not experts in parallel programming. To balance both objectives, we compare the development effort required by pure MPI and OpenMP codes and by mixed MPI+OpenMP programs. The parallel approaches have been tested, and a broad computational experience on a ccNUMA SGI Origin 3000 will be presented. The conclusions and the knowledge acquired with our experiments establish the basis for future developments on similar codes. The paper has been structured as follows. Section 2 describes the astrophysics problem and the code to be parallelized. This sequential code is slightly modified to facilitate the four parallel developments introduced in Section 3. The computational experience is presented in Section 4. We conclude that, for the application considered, no significant differences have been found among the performances of the different parallelizations. However, the development effort required by the different versions can differ appreciably. We conclude the paper in Section 5 with some remarks and future lines of work.
2 The Problem Supermassive black holes (SMBHs: objects with masses in the range 10^7 – 10^9 solar masses) are supposed to exist in the nuclei of many if not all galaxies. It is also assumed that some of these objects are surrounded by a disk of material continuously spiraling (accretion disk) towards the deep gravitational potential well of the SMBH and releasing huge quantities of energy, giving rise to the phenomena known as quasars. We are interested in objects of dimensions comparable to the Solar System in galaxies very far away from the Milky Way. Objects of this size cannot be directly imaged, and alternative observational methods are used to study their structure. One of these methods is the observation of Quasi-stellar Object (QSO) images affected by a microlensing event to study the unresolved structure of the accretion disk. If light from a QSO passes through a galaxy located between the QSO and the observer, it is possible that a star in the intervening galaxy crosses the QSO light beam. The gravitational field of the star can then amplify the light emission coming from the accretion disk (gravitational microlensing). As the star is moving with respect to the QSO light beam, the amplification varies during the crossing. The curve representing the change in luminosity of the QSO with time (QSO light curve) depends on the position of the star and on the structure of the accretion disk. The goal in this work is to model light curves of QSO images affected by a microlensing event in order to study the unresolved structure of the accretion disk. According to this objective, we have fitted the peak of a high-magnification microlensing event recently observed in one quadruply-imaged quasar, Q 2237+0305. We have modeled the light curve corresponding to a standard accretion disk [5] amplified by a microlens crossing the image of the accretion disk. Leaving aside the physical meaning of the different variables (for details see [2]), the function modeling the dependence of the observed flux, F_ν, on time, t, can be written as

  F_ν(t) = A_ν ∫_1^{ξ_max} {1 + (B/√ξ) G[C(t − t_0)/ξ]} ξ dξ / (e^{D_ν ξ^{3/4} (1 − 1/√ξ)^{−1/4}} − 1)    (2.1)

where ξ_max is the ratio between the outer and inner radii of the accretion disk (we will adopt ξ_max = 100), and G is the function
  G(q) = ∫_{−1}^{+1} H(q − y) dy / (√(q − y) √(1 − y²)) ,
To speed up the computation, G has been approximated by using MATHEMATICA (for details, see the appendix in [2]). Therefore, the goal is to estimate the values of the parameters A_ν, B, C, D_ν, and t_0 by fitting F_ν to the observational data. Specifically, to find those values of the parameters that minimize

  χ² = Σ_{i=1}^{N} (F_i^{obs} − F_ν(t_i))² / σ_i²    (2.2)

where N is the number of data points (F_i^{obs}) corresponding to times t_i (i = 1, ..., N), F_ν(t_i) is the theoretical function evaluated at t_i, and σ_i is the observational error associated with each data value. To minimize χ² we used the NAG [4] routine e04ccf. This routine [3] only requires evaluation of the function and not of its derivatives. As the determination of the minimum in the 5-parameter space depends on the initial conditions, we needed to consider a 5-dimensional grid of starting points. If we consider m sampling intervals in each variable, the number of starting points in the grid is m^5. For each one of the points of the grid we computed a local minimum. Finally, we select the absolute minimum among them. Even for relatively small values of m, the determination of the minimum takes a long execution time. Figure 1 shows the original sequential version of the code. There are five nested loops corresponding to the traversal of the 5-dimensional grid of starting points. We minimize the jic2 function using the e04ccf NAG function. The starting point x and the evaluation of jic2(x) (fx) are supplied as input/output parameters to the NAG function.
 1       program seq_black_hole
 2       double precision t2(100), s(100), ne(100), fa(100), efa(100)
 3       common/data/t2, fa, efa, length
 4       double precision t, fit(5)
 5       common/var/t, fit
 6 c     Data input
 7 c     Initialize best solution
 8       do k1=1, m
 9       do k2=1, m
10       do k3=1, m
11       do k4=1, m
12       do k5=1, m
13 c     Initialize starting point x(1), ..., x(5)
14       call jic2(nfit,x,fx)
15       call e04ccf(nfit,x,fx,ftol,niw,w1,w2,w3,w4,w5,w6,jic2,
16      x            monit,maxcal,ifail)
17       if (fx improves best fx) then
18       update(best (x, fx))
19       endif
20       enddo
21       enddo
22       enddo
23       enddo
24       enddo
Fig. 1. Original sequential pseudocode
Inside the jic2 function, the integral in Eq. (2.1) is computed using the d01ajf and d01alf NAG functions. These functions are general-purpose integrators that calculate an approximation to the integral of a function. The common blocks in lines 3 and 5 of the listing are used to pass additional parameters from the jic2 function to these integrators. Observe that jic2 is itself passed as a parameter to the minimization function e04ccf.
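For illustration, a C sketch of the integrand of Eq. (2.1) is given below, written as one might pass it to a generic quadrature routine; the actual code evaluates it in Fortran inside jic2 through the NAG integrators, and the function G() stands for the MATHEMATICA-derived approximation mentioned above. The function and parameter names are ours, not the original code's.

    #include <math.h>

    /* Approximation of the kernel G(q) of Eq. (2.1); assumed to be provided
       elsewhere (in the original code it is the MATHEMATICA-derived fit).    */
    double G(double q);

    /* Integrand of Eq. (2.1) for fixed time t and parameters B, C, Dnu, t0.
       The outer factor Anu and the integration over xi in [1, xi_max] are
       applied by the caller; the integrand vanishes at the inner edge xi=1.  */
    double flux_integrand(double xi, double t,
                          double B, double C, double Dnu, double t0)
    {
        double amplification = 1.0 + (B / sqrt(xi)) * G(C * (t - t0) / xi);
        double planck_arg    = Dnu * pow(xi, 0.75)
                               * pow(1.0 - 1.0 / sqrt(xi), -0.25);
        return amplification * xi / (exp(planck_arg) - 1.0);
    }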
3 Parallelization The parallelization was constrained by two main requirements: to introduce the minimum amount of changes in the original code and to preserve the use of the NAG functions. At the beginning of our collaboration, one objective was that the kind of parallelization to be introduced in the code should be easily understood and reproduced by other non-parallel programmers. The only modification introduced in the original sequential version was an ad hoc transformation of the iteration space, reducing the five nested loops of Figure 1 to a single one. This modification changes neither the underlying semantics of the sequential code nor the order in which the points are evaluated. The transformation can be easily achieved since there are no data dependencies among the different points of the iteration domain. The reason for introducing this transformation is to ease and clarify the parallelizations. Pure MPI and OpenMP versions can be directly obtained from that code. The MPI parallelization is the most straightforward: after inserting in the code the MPI initialization and finalization calls, we simply divide the single loop iteration space among the available set of processors using a cyclic or block distribution. In the pure OpenMP parallelization, the main difficulty is to identify the shared/private variables in the code. Once they are located, the corresponding OpenMP parallel do pragma can be introduced in the main loop. Variables included in common blocks deserve particular consideration: to avoid concurrent read/write accesses, the threadprivate pragma was included. The mixed MPI/OpenMP version simply merges both former versions. Each MPI process spawns a certain number of OpenMP threads that handle the iteration chunk of that MPI process. As will be shown in the next section, none of these versions presents a linear speedup. This was the motivation to introduce an MPI Master-Slave approach. This alternative has the disadvantage of requiring more expertise in parallel programming, and therefore it was introduced only to correct the load imbalance of the previous versions.
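The strategy can be sketched as follows. The original code is Fortran 77 plus NAG calls; the C sketch below, with hypothetical helper names, only illustrates the single-loop decomposition: the five nested loops are collapsed into one loop over m^5 iterations, which is distributed cyclically among MPI processes.

    #include <mpi.h>

    /* Hypothetical helpers standing in for the original Fortran routines.   */
    void   decode_start_point(long iter, int m, double x[5]); /* iter -> grid point */
    double minimize_from(const double x[5]);                  /* jic2 + e04ccf call */

    void search_grid(int m, double *best_fx, double best_x[5])
    {
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        long total = 1;
        for (int d = 0; d < 5; d++) total *= m;   /* m^5 starting points      */

        /* Cyclic distribution of the flattened iteration space; a block
           distribution would instead assign a contiguous chunk of iterations
           to each process.                                                   */
        for (long iter = rank; iter < total; iter += size) {
            double x[5];
            decode_start_point(iter, m, x);       /* k1..k5 via div/mod       */
            double fx = minimize_from(x);         /* local minimum from x     */
            if (fx < *best_fx) {                  /* keep best local optimum  */
                *best_fx = fx;
                for (int d = 0; d < 5; d++) best_x[d] = x[d];
            }
        }
        /* A final reduction (e.g. MPI_Allreduce with MPI_MINLOC) selects the
           global best point among the processes.                             */
    }

In the pure OpenMP variant, the same flattened loop is simply preceded by a parallel do directive, with the common-block data made threadprivate as described above.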
4 Computational Results This section is devoted to investigating and quantifying the performance obtained with the different parallel approaches taken. On a Sun Blade 100 workstation running Solaris 5.0 and using the native Fortran77 Sun compiler (v. 5.0) with full optimizations, the code takes 5.89 hours and 12.45 hours for sampling intervals of size m = 4 and m = 5, respectively.
Table 1. Time for parallel codes, sampling intervals: m = 4

                      Time (secs.)
  P     B-MPI     B-OMP     C-MPI     MPI-OMP   MS-MPI
  2    1652.34   1670.51   1674.85   1671.59   1641.94
  4     923.40    930.64    951.12    847.37    821.64
  8     481.81    485.69    496.05    481.89    412.44
 16     256.14    257.20    256.95    255.27    211.54
 32     133.00    133.97    133.87    133.70    113.74
Table 2. Time for the mixed (MPI-OpenMP) code for different combinations of OpenMP threads (nTh) and MPI processes. Sampling intervals: m = 4. For P processors, the number of MPI processes is P/nTh. The minimum time for each P is marked with an asterisk.

 nTh \ P      32         16         8          4          2
   1        133.70*    257.55     497.12     953.34    1679.17
   2        137.80     262.26     481.89*    847.37*   1671.59*
   4        151.84     278.10     485.05     938.15       -
   8        156.56     255.27*    484.77        -
  16        139.03     256.34        -
  32        133.88        -
The target architecture for all the parallel executions has been a Silicon Graphics Origin 3000 with 128 R14000 (600 MHz) processors. Only 32 processors were used in the computational experiments. We restricted our executions to this architecture because it is the only one available to us with the NAG libraries installed. Using the native MIPSpro Fortran77 compiler (v. 7.4) and the NAG Fortran Library Mark 19 on the SGI Origin 3000 with full optimizations, the sequential running time is 0.91 hours and 2.74 hours for sampling intervals of size m = 4 and m = 5, respectively. Table 1 shows the execution time (in seconds) of the parallel codes with sampling intervals of size m = 4. Labels B-MPI and C-MPI correspond to the MPI code with block and cyclic distributions of iterations, respectively. The label B-OMP corresponds to a block distribution using OpenMP. The mixed-mode code (label MPI-OMP) corresponds to the minimum times obtained for different combinations of MPI processes/OpenMP threads (see Table 2). The column labeled MS-MPI shows the time corresponding to the MPI Master-Slave parallel version. Since we do not have exclusive-mode access to the architecture, the times collected in all cases correspond to the minimum time over five different executions. Except for the Master-Slave version, times do not show a significant variation among the different parallel approaches taken. The speedups achieved reduce the sequential running time almost linearly with the number of processors involved. In Figure 2 we observe that in both the MPI and OpenMP pure versions (labels C-MPI and B-OMP, respectively), as the number of processors increases, the growth of the speedup diminishes (the MPI and OpenMP speedup curves appear overlapped in the figure). The reason for this behaviour is the call to the e04ccf
NAG routine inside the loop. The convergence run time of this routine differs depending on the starting point. This fact introduces load imbalance, since each iteration may consume a different run time and some processors may become overloaded. For each processor, Figure 3 shows the run time consumed in the loop in an execution with 32 processors. We observe a large variation in the time for the MPI and OpenMP versions as a consequence of the imbalance, while the Master-Slave code (label MS in Figure 2) shows an almost constant time for each processor. In the case of OpenMP, we have also tried dynamic scheduling strategies. Figure 4 compares the load imbalance introduced by this scheme against the Master-Slave balanced approach. We consider that, with the current implementation of the OpenMP compiler, the overhead introduced by dynamic scheduling is still unaffordable.

Table 3. Time and speedup for parallel codes, sampling intervals: m = 5

              Time (secs.)                            Speedup
  P     MPI      OMP      MPI-OMP     MS       MPI    OMP    MPI-OMP    MS
  2    4957.0   5085.9    4973.1    4906.4     2.0    1.9      2.0      2.0
  4    2504.9   2614.5    2513.0    2455.0     3.9    3.8      3.9      4.0
  8    1261.2   1372.4    1265.1    1231.5     7.8    7.2      7.8      8.0
 16     640.3    755.8     642.9     619.3    15.4   13.1     15.4     15.9
 32     338.1    400.1     339.2     325.4    29.2   24.7     29.1     30.3
For the mixed MPI/OpenMP parallel version, we have considered in the computational experiments different combinations of MPI processes and OpenMP threads. Table 2 shows the running time of this parallel version. An execution with P processors involves nTh OpenMP threads and P/nTh MPI processes. The minimum running times are marked with an asterisk in Table 2. As could be expected, the execution time using P MPI processes and 1 OpenMP thread coincides with the corresponding pure MPI version. Analogously, the same behaviour is observed with 1 MPI process and P threads with respect to the pure OpenMP version. A regular pattern for the optimal combination of MPI processes and OpenMP threads is not observed. Table 3 collects the time and speedup corresponding to sampling intervals of size m = 5. We observe no significant variations among the different versions. This effect is due to the better load balance produced by the increase in the iteration space size. Almost linear speedup is observed in all cases. The lowest speedup values are obtained with the OpenMP version, and again no significant improvement is achieved with the mixed-mode MPI/OpenMP approach.
5 Conclusions and Future Work We have concluded a first step of cooperation in applying parallel techniques to improve the performance of astrophysics codes. The scientific aim of applying high performance computing to computationally intensive codes in astrophysics has been successfully achieved. The relevance of our results does not come directly from the particular
Fig. 2. Speedup achieved by different parallel approaches (curves: Linear, B-OMP, C-MPI, MS; speedup vs. number of processors)
Fig. 3. Load balance: Master-Slave, MPI and OpenMP (run time per processor in a 32-processor execution)
application chosen, but from showing that parallel computing techniques are the key to tackling large real problems in this scientific field. For the case of non-expert users and the kind of codes we have been dealing with, we believe that MPI parallel versions are easier and safer. In the case of OpenMP, the proper usage of data scope attributes for the variables involved may be a handicap for users without parallel programming expertise. The higher current portability of the MPI version is another factor to be considered. The mixed MPI/OpenMP parallel version is more demanding in terms of expertise. Nevertheless, as has been stated by several authors [1], [6], and even for the case of a hybrid
Fig. 4. Load balance: Master-Slave vs. OpenMP with dynamic schedule (run time per processor; curves: MS, OMP-dynamic)
architecture like the SGI Origin 3000, this version does not offer any clear advantage, and it has the disadvantage that the combination of processes/threads has to be tuned. The Master-Slave strategy appears as the most efficient when dealing with load imbalance, even in the case of simple codes. This approach provides a satisfactory solution from the parallel programmer's point of view, but it requires more parallel programming knowledge, and therefore it is not advisable for non-expert users. OpenMP does not provide easy solutions to address the load imbalance situation. From the scientific point of view, the most profitable conclusion is the confirmation, by the computational results, of the robustness of the method. From now on we plan to continue this fruitful collaboration by applying parallel computing techniques to other challenging astrophysics problems.
References
1. F. Cappello, D. Etiemble, MPI versus MPI+OpenMP on IBM SP for the NAS benchmarks, in: Proceedings of Supercomputing'2000 (CD-ROM), IEEE and ACM SIGARCH, Dallas, TX, 2000, LRI.
2. L. J. Goicoechea, D. Alcalde, E. Mediavilla, J. A. Muñoz, Determination of the properties of the central engine in microlensed QSOs, Astronomy and Astrophysics 397 (2003) 517–525.
3. J. A. Nelder, R. Mead, A simplex method for function minimization, The Computer Journal 7 (4) (1965) 308–313.
4. Numerical Algorithms Group, NAG Fortran library manual, mark 19, NAG, Oxford, UK (1999).
5. N. I. Shakura, R. A. Sunyaev, Black holes in binary systems. Observational appearance, Astronomy and Astrophysics 24 (1973) 337–355.
6. L. Smith, M. Bull, Development of mixed mode MPI/OpenMP applications, Scientific Programming 9 (2–3) (2001) 83–98.
Statistical Properties of Dissipative MHD Accelerators
Kaspar Arzner 1, Loukas Vlahos 2, Bernard Knaepen 3, and Nicolas Denewet 3
1 Paul Scherrer Institut, Laboratory for Astrophysics, CH-5232 Villigen PSI, Switzerland, [email protected]
2 Aristotle University, Institute of Astronomy, Dept. of Physics, 54006 Thessaloniki, Greece, [email protected]
3 Université Libre de Bruxelles, Mathematical Physics Dept., CP231, Boulevard du Triomphe, 1050 Bruxelles, Belgium, [email protected]
Abstract. We use exact orbit integration to investigate particle acceleration in a Gauss field proxy of magnetohydrodynamic (MHD) turbulence. Regions where the electric current exceeds a critical threshold are declared to be ‘dissipative’ and endowed with super-Dreicer electric field EΩ = ηj. In this environment, test particles (electrons) are traced and their acceleration to relativistic energies is studied. As a main result we find that acceleration mostly takes place within the dissipation regions, and that the momentum increments have heavy (non-Gaussian) tails, while the waiting times between the dissipation regions are approximately exponentially distributed with intensity proportional to the particle velocity. No correlation between the momentum increment and the momentum itself is found. Our numerical results suggest an acceleration scenario with ballistic transport between independent ‘black box’ accelerators.
1 Introduction Astrophysical high-energy particles manifest as cosmic rays or, indirectly, as radio waves, X-rays, and gamma rays. These often occur in transients, and with distinctly non-equilibrium energy distributions. A prominent source of sporadic radio and X-ray emission is the Sun during the active phase of its 11-year cycle. Among the numerous mechanisms proposed for accelerating solar particles to high energies (see [1] for an overview), stochastic ones have attracted particular attention because they require generic input data and do not rely on special geometrical assumptions. In stochastic acceleration [2,3,4,5], particles move in random electromagnetic fields, where they become repeatedly deflected and, on average, accelerated. The electromagnetic fields are thought to arise from magneto-hydrodynamic (MHD) turbulence (e.g., [6]), perhaps excited by the broadband echo of a magnetic collapse. The turbulence may host shocks and other forms of dissipation if critical velocities [7] or electric current densities [8,9] are exceeded. Associated with dissipation are (collisional or anomalous) resistivity and non-conservative
electric fields, which sustain, locally, the electric current against dissipative drag in order to meet the global constraints. However, a detailed balance on the level of individual charge carriers is impossible because the dissipative drag depends on particle position and velocity, whereas the electric field is a function of position only. Thus the electric field may compensate the bulk drag, but a (high-energy) population can be left over and exposed to acceleration [10]. This lack of detailed balance is at the heart of dissipative acceleration mechanisms. In plasmas, dissipation occurs at 'ruptures' of the magnetic structure, and is therefore localized around critical points of the magnetic field. The above scenario, first envisaged by Parker [11] for the solar atmosphere, has since been explored in a large number of numerical studies [12,13,14,6,15,16,17]. On the theoretical side, most stochastic acceleration theories [2,18,19] are based on Fokker-Planck approaches, thus transferring two-point functions of the electromagnetic fields into drift and diffusion coefficients of particles by probing the fields along unperturbed trajectories. Dynamical particle averages are then replaced by field ensemble averages, neglecting the fact that real particles move in one realization of the random field. As a result, diffusive behaviour may be predicted even if particles are trapped in a single realization of the random field. In order to investigate the full diversity of orbit behaviour one must resort to numerical simulations. In the present contribution we analyze the behaviour of test particles in resistive MHD turbulence with localized dissipation regions, with particular emphasis on the validity of a Fokker-Planck description [20]. We use exact orbit integration, and thus avoid any guiding-centre approximations [21,22]. The price for rigour is computational cost, which makes the scheme feasible only with the aid of high-performance computing.
2 Acceleration Environment The MHD turbulence has been modeled by full 3D spectral MHD simulations and by Gauss field proxies [23,17]. We concentrate here on the latter, which is computed from the vector potential A(x, t) = Σ_k a_k cos(k · x − ω(k)t − φ_k) by means of tabulated trigonometric calls. This allows the fields to be determined continuously at the exact particle position and avoids any real-space discretization artifacts, but the computational overhead restricts the k sum to a few hundred Fourier modes a_k. They are taken from the shell min(l_i^{−1}) < |k| < 10^{−2} r_L^{−1}, with r_L the rms thermal ion Larmor radius and l_i the outer scale of the power spectral density |a_k|² ∝ (1 + l_x² k_x² + l_y² k_y² + l_z² k_z²)^{−ν}. The electromagnetic fields are then obtained from

  B = ∇ × A                        (2.1)
  E = −∂_t A + η(j) j              (2.2)
where µ0 j = ∇ × B and η(j) = η0 θ(|j| − j_c) is an anomalous resistivity switched on above the critical current j_c ∼ e n c_s. Here, c_s and n are the sound speed and number density of the background plasma. The Gauss field A must satisfy the MHD constraints
E · B = 0 if η(j) = 0   and   E/B ∼ v_A ,   (2.3)
with v_A the Alfvén velocity. Equation (2.3) can be achieved in several ways. For instance, one can use Euler potentials of which only one is time-dependent, or force A to point along a single direction. A somewhat more flexible way, used here, is the axial gauge a_k · v_A = 0 with dispersion relation ω(k) = k · v_A. A constant magnetic field B_0 along v_A can be included without violating Eq. (2.3), and we set |v_A|^2 = (B_0^2 + σ_B^2)/(µ0 n m_p), with σ_B the rms magnetic fluctuations and m_p the proton rest mass. In the present simulation, v_A and the background magnetic field are along the z direction. The total magnetic field B = (B_0^2 + σ_B^2)^{1/2} is a free parameter, which defines the scales of the particle orbits. In order to represent coronal turbulence we choose v_A ∼ 2 · 10^6 m/s, ν = 1.5, B ∼ 10^{-2} T, σ_B = 10^{-2} T, B_0 = 10^{-3} T, and l_x = l_y = 10^3 m, l_z = 10^4 m. The current threshold j_c is exceeded in about 7% of the total volume. Note that our choice represents strong (σ_B ≫ B_0) and anisotropic (l_z ≫ l_x, l_y) turbulence. To embed our simulation in the real solar atmosphere one should associate l_z with the radial direction in order to reproduce the predominant orientation of coronal filaments.
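The mode-sum representation above translates directly into code. The following Fortran 90 fragment is an illustrative sketch only (it is not the authors' code, and the routine and argument names are invented): it evaluates A, B = ∇ × A and the inductive part of E = −∂_t A at an arbitrary position from tabulated mode data; the resistive contribution η(j) j of Eq. (2.2) is not included.

module gauss_field
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  ! A(x,t) = sum_k a_k cos( k.x - omega_k*t - phi_k ), with B = curl A
  ! and the inductive electric field -dA/dt evaluated mode by mode.
  subroutine eval_fields(nk, kvec, amp, omega, phi, x, t, avec, bvec, evec)
    integer,  intent(in)  :: nk                       ! number of Fourier modes
    real(dp), intent(in)  :: kvec(3,nk), amp(3,nk)    ! wave vectors and amplitudes a_k
    real(dp), intent(in)  :: omega(nk), phi(nk)       ! frequencies k.v_A and random phases
    real(dp), intent(in)  :: x(3), t
    real(dp), intent(out) :: avec(3), bvec(3), evec(3)
    real(dp) :: arg, s, c
    integer  :: i
    avec = 0.0_dp;  bvec = 0.0_dp;  evec = 0.0_dp
    do i = 1, nk
       arg  = dot_product(kvec(:,i), x) - omega(i)*t - phi(i)
       c    = cos(arg)
       s    = sin(arg)
       avec = avec + amp(:,i)*c
       bvec = bvec - cross(kvec(:,i), amp(:,i))*s     ! curl of a_k cos(arg)
       evec = evec - amp(:,i)*omega(i)*s              ! -dA/dt (resistive term omitted)
    end do
  end subroutine eval_fields

  pure function cross(a, b) result(c)
    real(dp), intent(in) :: a(3), b(3)
    real(dp) :: c(3)
    c(1) = a(2)*b(3) - a(3)*b(2)
    c(2) = a(3)*b(1) - a(1)*b(3)
    c(3) = a(1)*b(2) - a(2)*b(1)
  end function cross
end module gauss_field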
3 Particle Dynamics 3.1 Physical Scaling Time is measured in units of the (non-relativistic) gyro period Ω^{-1} = m/(qB); velocity in units of the speed of light; distance in units of cΩ^{-1}. Particle momentum is measured in units of mc; the vector potential in units of mc/q; the magnetic field in units of B; the electric current density in units of ΩB/(µ0 c), so that the dimensionless threshold current is j_c' = (m/m_p) c_s c/v_A^2; and the electric field is measured in units of cB, so that the dimensionless Dreicer [10] field is E_D = (v_e/τ)(m_e/m), with v_e the electron thermal velocity and τ the electron-ion collision time. The dimensionless equations of motion are
dx'/dt' = v' ,   (3.4)
d(γv')/dt' = v' × B' − ∂A'/∂t' + η'(|j'|) j' ,   (3.5)
with γ the Lorentz factor, B' = ∇ × A', and j' = ∇ × B' the electric current. The dimensionless resistivity η' is characterized by the resulting dissipative electric field E_Ω = η0 j relative to the Dreicer field E_D. We chose η such that E_Ω/E_D ∼ 10^4. 3.2 Particle Initial Conditions We consider electrons as test particles. The initial positions are uniformly distributed in space, and the velocities are drawn from the tail v ≥ 3 v_th of a Maxwellian at 10^6 K, which is typical for the solar corona. Coulomb collisions are neglected, which is a good approximation once the acceleration has set in, but is not strictly correct at the beginning of the simulation. 3.3 Numerical Implementation and Simulation Management Equations (3.4) and (3.5) are integrated by traditional leapfrog and Runge-Kutta schemes. The test particle code is written in FORTRAN 90/95 and compiled by the
Fig. 1. Evolution of electron kinetic momentum. Top panel: 200 sample orbits; adiabatic (a) and accelerated (b) cases. Bottom panel: electric current density along the orbits a) and b). The critical current density (|j| > j_c) is marked by a dotted line. The present simulation is an extension of the simulation of [17]
Portland Group’s Fortran 90 compiler (pgf90). Diagnostics and visualization use IDL as a graphical back-end. The code is run on the MERLIN cluster of the Paul Scherrer Institut and on the ANIC-2 cluster of the Université Libre de Bruxelles. The ANIC-2 cluster has 32 single Pentium IV nodes, a total of 48 GByte of memory, and Ethernet connections. The MERLIN cluster consists of 56 mostly dual Athlon nodes with a total of 80 GByte of memory, operated under Linux and connected by Myrinet and Ethernet links. MERLIN jobs are managed by the Load Sharing Facility (LSF) queueing system. Parallelization is done on a low level only, with different (and independent) test particles assigned to different CPUs. MPICH/MPI is used to ensure crosstalk-free file I/O. The field data are computed on each CPU for the actual particle position. Random numbers are needed in the generation of the Fourier amplitudes and phases of the electromagnetic fields, and in the particle initial data; they are taken from the intrinsic random number generator of pgf90.
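To make the integration scheme concrete, the fragment below sketches one second-order (midpoint) Runge-Kutta step of the dimensionless equations of motion (3.4)-(3.5). It is a minimal illustration, not the authors' code: the field routine get_fields is a hypothetical interface that returns B' and the total electric field (inductive plus resistive part), and u denotes the kinetic momentum γv'.

module orbit_push
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  subroutine rk2_step(x, u, t, dt, get_fields)
    ! One midpoint step of dx/dt = v, du/dt = v x B + E, with u = gamma*v
    ! and v = u/sqrt(1 + |u|^2) in units where c = 1.
    real(dp), intent(inout) :: x(3), u(3)
    real(dp), intent(in)    :: t, dt
    interface
       subroutine get_fields(x, t, b, e)            ! user-supplied field evaluation
         real(kind(1.0d0)), intent(in)  :: x(3), t
         real(kind(1.0d0)), intent(out) :: b(3), e(3)
       end subroutine get_fields
    end interface
    real(dp) :: b(3), e(3), v(3), k1x(3), k1u(3), k2x(3), k2u(3), xm(3), um(3)
    v   = u / sqrt(1.0_dp + dot_product(u, u))
    call get_fields(x, t, b, e)
    k1x = v
    k1u = cross(v, b) + e
    xm  = x + 0.5_dp*dt*k1x
    um  = u + 0.5_dp*dt*k1u
    v   = um / sqrt(1.0_dp + dot_product(um, um))
    call get_fields(xm, t + 0.5_dp*dt, b, e)
    k2x = v
    k2u = cross(v, b) + e
    x   = x + dt*k2x
    u   = u + dt*k2u
  end subroutine rk2_step

  pure function cross(a, b) result(c)
    real(dp), intent(in) :: a(3), b(3)
    real(dp) :: c(3)
    c(1) = a(2)*b(3) - a(3)*b(2)
    c(2) = a(3)*b(1) - a(1)*b(3)
    c(3) = a(1)*b(2) - a(2)*b(1)
  end function cross
end module orbit_push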
4 Diagnostics and Results In order to characterize the relativistic acceleration process we consider the evolution of the kinetic momentum P = γv . This quantity is directly incremented by the equation
Fig. 2. Frequency distribution of the energy jumps ∆γ of Fig. 1 (black line), together with a best-fit Lévy density (gray line) with parameters α0 = 0.75, β0 = −0.26, and C0 = 0.035 (crosses). Insets: cuts of the likelihood surface at (α0, β0, C0). The 99% confidence level is marked in boldface
Fig. 3. Left: travel times ∆tn = tn+1 −tn between acceleration regions versus velocity vn . Right: energy gain ∆γn = γn+1 − γn versus velocity. While ∆tn scales with inverse velocity (solid line: best-fit), there is no clear trend in the energy gain ∆γn
of motion (3.5), and – ignoring quantum effects – can grow to arbitrarily large values, so that it can serve as a diagnostic of diffusive behaviour. Alternatively, we may use the kinetic energy γ.
Fig. 4. Histogram of the quantity |vn |∆tn , together with an exponential fit (dashed)
The results of the orbit simulations are shown in Figs. 1–4. When initially superthermal (v ≳ 3 v_th) electrons move in the turbulent electromagnetic fields (Eqns. 2.1, 2.2), some of them may become stochastically accelerated. From a population of 600 electrons we find that 35% of the particles are accelerated, while the other 65% remain adiabatic [22,24] during the simulation (Ωt ≤ 8 · 10^5). The two cases are illustrated in Fig. 1 (top). The orbit a) conserves energy adiabatically during the whole simulation, while the orbit b) does not. The orbits of 200 randomly chosen particles are also shown (gray) to trace out the full population. The bottom panel of Fig. 1 shows the electric current density along the orbits a) and b). Time intervals where the critical current (dotted line) is exceeded correspond to visits to the dissipation regions. As can be seen, acceleration (or deceleration) occurs predominantly within the dissipation regions. Accordingly, the orbit a), which never enters a dissipation region, remains adiabatic. As a benchmark we have set η0 = 0 and found that no acceleration takes place at all, thus reproducing the ‘injection problem’ [25]. Smaller η yield smaller (than 35%) fractions of accelerated particles. A glance at graph b) of Fig. 1 shows that P(t') is poorly represented by a Brownian motion [20] with continuous sample paths. Rather, P changes intermittently and in large jumps. Indeed, if we consider the energy change ∆γ_n = γ_{n+1} − γ_n across a dissipation region, we find that its distribution P(∆γ_n) has heavy tails and a convex shape which deviates from a Gaussian law (Fig. 2, black line). In order to characterize P(∆γ_n) we have tried to fit it by a (skew) Lévy stable distribution P_L(x). The latter is defined in terms of its Fourier transform [26]
φ_L(s) = exp{ −C|s|^α [ 1 + iβ (s/|s|) tan(πα/2) ] } ,   (4.6)
with 0 < α ≤ 2, −1 ≤ β ≤ 1, and C > 0. The parameter α determines the asymptotic decay P_L(x) ∼ x^{-1-α} at x ≫ C^{1/α}, and β determines its skewness. The probability
density function (PDF) belonging to Eq. (4.6) has the ‘stability’ property that the sum of independent identically Lévy distributed variates is Lévy distributed as well. While rapidly converging series expansions [26] of P_L(x) are available for 1 < α ≤ 2 or large arguments x, the evaluation of P_L(x) at small arguments and α < 1 is more involved. We use here a strategy where P_L(x) is obtained from direct computation of the Fourier inverse P_L(x) = (2π)^{-1} ∫ e^{-isx} φ_L(s) ds, with the integrand split into regimes of different approximations. At small |s|, both the exponential in φ_L(s) (Eq. 4.6) and the Fourier factor e^{-isx} are expanded; at larger |s|, φ_L(s) is piecewise expanded while e^{-isx} is retained. In both cases, the s-integration can be done analytically, and the pieces are summed numerically. The resulting (Poisson) maximum-likelihood estimates of the parameters (α, β, C) are α0 = 0.75, β0 = −0.26, C0 = 0.035. The predicted frequencies are shown in Fig. 2 (gray line), and insets represent sections of constant likelihood in (α, β, C)-space, with the 99% confidence region enclosed by a boldface line. The finding α0 < 2 agrees with the presence of large momentum jumps. In the next step we have investigated the waiting times ∆t_n = t_{n+1} − t_n between subsequent encounters with the dissipation regions. Simple ballistic transport between randomly positioned dissipation regions would predict a PDF of the form f(|v_n|∆t_n), with v_n the particle velocity and f(x) the PDF of distances between (magnetically connected) dissipation regions. (There is no Jacobian d(∆t)/dx since we are dealing with discrete events.) This is in fact the case. Figure 3 (left) shows a scatter plot of the actual velocity v_n versus waiting time ∆t_n. Gray crosses represent all simulated encounters with dissipation regions, including all particles and all simulated times. There is a clear trend for ∆t_n to scale with v_n^{-1}, and the black solid line represents a best fit of the form ∆t_n = L/v_n with L = 9 · 203 c/Ω. When a similar scatter plot of velocity versus energy gain is created (Fig. 3, right), no clear correlation is seen: the energy gain is apparently independent of energy. In this sense the dissipation regions ‘erase’ the memory of the incoming particles. Returning to the waiting times, we may ask for the shape of the function f(|v_n|∆t_n). This can be determined from a histogram of the quantity |v_n|∆t_n (Fig. 4, solid line). The decay is roughly exponential (dashed: best fit), although the limited statistics certainly does not allow other forms to be excluded.
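For readers who wish to reproduce the fit qualitatively, the Fortran 90 sketch below evaluates the Lévy density by brute-force numerical inversion of Eq. (4.6), i.e. a simple trapezoidal quadrature of P_L(x) = π^{-1} ∫_0^∞ Re[e^{-isx} φ_L(s)] ds on a truncated s-range, rather than by the piecewise expansions used above; the routine name and the truncation and step parameters are illustrative choices only.

module levy_pdf
  implicit none
  integer,  parameter :: dp = kind(1.0d0)
  real(dp), parameter :: pi = 3.141592653589793_dp
contains
  function levy_density(x, alpha, beta, c) result(p)
    ! phi_L(s) = exp( -c*s**alpha * (1 + i*beta*tan(pi*alpha/2)) ) for s > 0;
    ! the density follows from the real part of the inverse Fourier integral.
    real(dp), intent(in) :: x, alpha, beta, c
    real(dp) :: p
    integer,  parameter :: n    = 200000      ! quadrature points (illustrative)
    real(dp), parameter :: smax = 200.0_dp    ! truncation of the s-integral
    real(dp)    :: s, ds, w
    complex(dp) :: phi
    integer     :: i
    ds = smax / real(n, dp)
    p  = 0.0_dp
    do i = 0, n
       s   = real(i, dp)*ds
       phi = exp( -c * s**alpha * cmplx(1.0_dp, beta*tan(0.5_dp*pi*alpha), dp) )
       w   = merge(0.5_dp, 1.0_dp, i == 0 .or. i == n)   ! trapezoidal weights
       p   = p + w * real( exp( cmplx(0.0_dp, -s*x, dp) ) * phi )
    end do
    p = p * ds / pi
  end function levy_density
end module levy_pdf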
5 Summary and Discussion We have performed exact orbit integrations of electrons in a Gauss field proxy of MHD turbulence with super-Dreicer electric fields localized in dissipation regions. It was found that the electrons remain adiabatic (for the duration of the simulation) if no dissipation regions are encountered, and can become accelerated if such regions are met. The resulting acceleration is intermittent and is not well described by a diffusion process, even if the underlying electromagnetic fields are Gaussian. On time scales which are large compared to the gyro time, the kinetic momentum performs a Lévy flight rather than a classical Brownian motion. The net momentum increments in the dissipation regions are independent of the in-going momentum, and have heavy tails which may be approximated by a stable law of index 0.75. The waiting times between subsequent encounters with dissipation regions are approximately exponentially distributed, P(∆t) ∼ e^{−v∆t/L}, indicating that the dissipation regions are randomly placed along the magnetic field
lines. This one-dimensional Poisson behaviour is most likely caused by the Gauss field approximation, and the waiting times in true MHD turbulence are expected to behave differently. An ongoing study is devoted to these questions, and results will be reported elsewhere. In summary, our numerical results suggest that the acceleration process may be modeled by a continuous-time random walk with finite or infinite mean waiting time, and infinite variance of the momentum increments. Such models can be described in terms of fractional versions [27,28] of the Fokker-Planck equation.
References 1. Miller, J., Cargill, P. J., Emslie, A. G., Holman, G. D. Dennis, B. R., LaRosa, T. N., Winglee, R. M., Benka, S. G., Tsuneta, S., 1997, J. Geophys. Res, 102, 14.631 2. Karimabadi, Menyuk, C.R., Sprangle, P., & Vlahos, L. 1987, Astrophys. J., 316, 462 3. Miller J., & R. Ramaty, 1987, Solar Phys., 113, 195 4. Miller J., LaRosa, T.N., & Melrose, R. L., 1996, Astrophys. J., 461, 445 5. Miller, J., J.A., 1997, Astrophys. J., 491, 939 6. Biskamp, D., & M¨uller, W.C., 2000, Phys. Plasmas, 7, 4889 7. Treumann, R., and Baumjohann, W., 1997, Basic Space Plasma Physics, Imperial College Press 8. Papadopoulos, K. 1997, in: Dynamics of the Magnetosphere, ed. Akasofu & Reidel, Dordrecht 9. Parker, E. N., 1993, Astrophys. J., 414, 389 10. Dreicer, H. 1960, Phys. Rev., 117, 329 11. Parker, E. N., 1983, Astrophys. J., 264, 635 12. Matthaeus, W.H. & Lamkin, S.L., 1986, Phys. Fluids, 29, 2513 13. Ambrosiano, J., Mattheus, W.H., Goldstein, M.L., and Plante, D., 1988, J. Geophys. Res., 93, 14.383 14. Anastasiadis, A., Vlahos, L., & Georgoulis, M. K., 1997 Astrophys. J., 489, 367 15. Dimitruk, P., Matthaeus, W.H., Seenu, N., & Brown, M.R. 2003, Astrophys. J. Lett., 597, L81 16. Moriyashu, S., Kudoh, T., Yokoyama, T., Shibata, K., 2004, Astrophys. J, 601, L107 17. Arzner, K., and Vlahos, L., 2004, Astrophys. J. Lett., 605, L69 18. Karimabadi, H. N. Omidi, & C.R. Menyuk 1990, Phys. Fluids B, 2, 606 19. Schlickeiser R. 2002, Cosmic Ray Astrophysics, Berlin, Springer 20. Gardiner C. W. 1985, Handbook of stochastic methods, Springer. 21. Littlejohn, R. G., 1982, J. Math. Phys., 23(5), 742 22. Littlejohn, R. G., 1983, J. Plasma Physics, 29, 111 23. Adler, R. J., 1981, The Geometry of Random Fields, John Wiley & Sons 24. B¨uchner, J., and Zelenyi, L. M, 1989, J. Geophys. Res., 94, 11.821 25. Cargill, P. J., 2001, in: Encyclopedia of Astronomy and Astrophysics – Solar flares: Particle Acceleration Mechanisms, Nature Publ. Group, IOPP, Bristol, UK 26. Lukacs, E., 1960, Characteristic Functions, Charles Griffin, London 27. Uchaikin, V. V., 2000 Int. J. of Theoretical Physics, 39, no 8, 2087 28. Meerschaert M. M. and Benson, D. A, 2002, Phys. Rev. E, 66, 060102(R)
A Simulation Model for Forest Fires Gino Bella1, Salvatore Filippone1, Alessandro De Maio2, and Mario Testa2 1 Università di Roma “Tor Vergata”, Rome, Italy {bella,salvatore.filippone}@uniroma2.it 2 Numidia s.r.l., Rome, Italy
Abstract. The control of forest fires is a very important problem for many countries around the world. Proper containment and risk management depend on the availability of reliable forecasts of the flame front propagation under the prevailing wind conditions, taking into account the terrain features and other environmental variables. In this paper we discuss our initial development of the central part of an integrated system, namely a tool to simulate the advancement of the flame front. We discuss the pyrolysis model, the wood characteristics taken into account, and the modeling of heat exchange phenomena; this modeling tool is derived from the well-known fluid dynamics code KIVA, originally developed for engine design applications. Finally, we give an overview of the future directions of development for this activity, with special attention to the models of combustion employed.
1 Introduction Wood combustion as encountered in a forest fire is a very complex phenomenon with many interactions. One of the main tools employed so far in the study of fires is the use of a special purpose wind tunnel with a combustive bed, i.e. a layer of material reproducing the combustion characteristics of the wood mix in the area under consideration. The simulation tool that has been developed has now reached a development status sufficient to replicate most results from a wind tunnel experiment; this is of course the first step towards the ability to simulate a fire in a given terrain scenario. The simulation engine is coupled with pre and post processing tools, to provide a complete visualization of the simulation scenario. Our simulation tool is based on the Kiva code [1]; KIVA is a code capable of simulating chemically reacting fluid flows, originally developed in the context of engine simulations.
2 The Wind Tunnel Fire Bed Simulation The fire bed is a layer of combustive material with a parallelepiped shape. It is considered to be composed of spherical particles; the material is characterized by the percentage of cellulose, water and inert matter, and makes up a permeable bed, with a user-specified ratio between solid and gas components.
The base environment is the KIVA-3 code, originally developed at Los Alamos for the simulation of internal combustion engines and related gas-fuel spray interactions. Its mathematical governing equations and numerical method are described below. To simulate wood combustion, we had to expand the base code by integrating models for heat exchange between gas and wood particles, and for pyrolysis. The current simulation model is already adequate to represent undergrowth fire; in the next development cycle we plan to introduce the model for massive solid wood entities, i.e. trees. 2.1 The Mathematical Governing Equations The mathematical model of KIVA-3 is the complete system of general time-dependent Navier-Stokes equations, coupled with chemical kinetic and spray droplet dynamic models. The fluid motion equations are:
– Species continuity:
∂ρ_m/∂t + ∇ · (ρ_m u) = ∇ · [ρD ∇(ρ_m/ρ)] + ρ̇_m^c + ρ̇_m^s
where ρ_m is the mass density of species m, ρ is the total mass density, D is the diffusion coefficient, u is the fluid velocity, ρ̇_m^c is the mass source term due to chemistry, and ρ̇_m^s is an additional mass source term. By summing the previous equation over all species we obtain the total fluid density equation:
– Total mass conservation:
∂ρ/∂t + ∇ · (ρu) = ρ̇^s
– Momentum conservation:
∂(ρu)/∂t + ∇ · (ρu u) = −(1/α^2) ∇p − A_0 ∇(2ρk/3) + ∇ · σ + F^s + ρg
where α is a dimensionless quantity used for the numerical PGS method [1], σ is the viscous stress tensor, F^s is the rate of momentum gain per unit volume due to interactions between gas and other phases, and g is the constant specific body force. The quantity A_0 is zero in laminar calculations and unity when turbulence is considered; k is the turbulent kinetic energy and ε its dissipation rate.
– Internal energy conservation:
∂(ρI)/∂t + ∇ · (ρIu) = −p∇ · u + (1 − A_0) σ : ∇u − ∇ · J + A_0 ρε + Q̇^c + Q̇^s
where I is the specific internal energy, the symbol : indicates the matrix product, J is the heat flux vector, Q̇^c is the source term due to chemical heat release, and Q̇^s is an additional source term.
Furthermore, two additional transport equations are considered. These are the standard k–ε equations for the turbulence, including terms due to interaction with different phases [1].
2.2 Numerical Method The numerical method employed in KIVA-3 is based on a variable-step implicit Euler temporal finite difference scheme, where the timesteps are chosen using accuracy criteria. Each time step defines a cycle divided into three phases, corresponding to a physical splitting approach. In the first phase the chemical kinetic equations are solved, providing most of the source terms; the other two phases are devoted to the solution of the fluid motion equations. The spatial discretization of the equations is based on a finite volume method, called the Arbitrary Lagrangian-Eulerian method, using a mesh in which the positions of the vertices of the cells may be arbitrarily specified functions of time. This approach allows a mixed Lagrangian-Eulerian flow description. In the KIVA Lagrangian phase, the vertices of the cells move with the fluid velocity and there is no convection across cell boundaries. In this phase, the diffusion terms and the terms associated with pressure wave propagation are implicitly solved by a modified version of the SIMPLE (Semi-Implicit Method for Pressure-Linked Equations) algorithm [2]. This algorithm, well known in the CFD community, is an iterative procedure to compute the velocity, temperature and pressure fields. Upon convergence on pressure values, the implicit solution of the diffusion terms in the turbulence equations is approached. After the SIMPLE solver has converged, we have an explicit convection phase, in which the grid is adapted to the new physical condition if needed. In our fire-bed simulation we have an advantage with respect to the standard code applications, in that there is no rezoning, i.e. the computational mesh does not change during the simulation. However, we also have to account for the heat exchange by radiation in the first phase. 2.3 Heat Exchange Models Heat exchange between solid wood particles and the surrounding gas in a fire simulation happens through conduction, convection and radiation. The heat conduction part is modeled by a fifth-order polynomial expression. The convection effect is controlled by the value assigned to the Nusselt number; experimental evidence shows that a value around 2 is appropriate. The radiation heat exchange is modeled through the use of a vision factor [4]; this in turn is computed from the solid angle under which the cell under consideration “perceives” a solid body within the cell (see Fig. 1). To limit the computational requirements we define a cutoff distance beyond which the effect of radiation exchange is heuristically negligible and would be only a computational cost. The source energy is computed with the formula
Q̇(T, species) = 4σ Σ_i p_i a_{p,i} (T^4 − T_b^4)
where
– σ = 5.6669 × 10^{-8} W m^{-2} K^{-4} is the Stefan-Boltzmann constant;
– the sum is extended over all the (gaseous) chemical species produced in the combustion;
– p_i is the partial pressure of the i-th species;
– a_{p,i} is the Planck absorption coefficient for the i-th species;
– T is the local flame temperature;
– T_b is the absolute temperature of the space surrounding the flame.
Fig. 1. Radiation heat exchange model
Similarly to the cutoff distance, we introduce a cutoff temperature below which the radiated thermal flux is zeroed. It is also possible to introduce a correction factor F_a for the attenuation of the radiation efficiency; the attenuation is modeled with a flat region, followed by a linearly decreasing region. 2.4 Pyrolysis Model Due to thermal exchange, the particle temperature increases and the water in it vaporizes, providing a source term in the mass, momentum and energy equations. After vaporisation is complete the pyrolysis process starts. For our purposes, pyrolysis corresponds to the thermal decomposition of organic molecules taking place without oxidizing agents, following the chemical equation [5,6,7]:
C6H10O5 → 3.74 C + 2.65 H2O + 1.17 CO2 + 1.08 CH4
The pyrolysis rate, as well as that of the oxidation process, is modeled with an Arrhenius-type kinetic equation for the mass rate of change,
dm/dt = A_m e^{−E_a/(RT)}
where A_m is the pyrolysis reactivity and E_a is the activation energy necessary to trigger the process. A similar model controls the char creation process; for further details see [8]. All the mass, momentum and energy contributions of these processes are included in the corresponding governing equations.
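Both source terms are simple pointwise expressions; the Fortran 90 fragment below is an illustrative sketch (not the authors' code) of how they could be evaluated, assuming the activation energy is given in J/mol so that R is the universal gas constant, and that partial pressures and Planck absorption coefficients are supplied by the caller. Routine and argument names are hypothetical.

module fire_sources
  implicit none
  integer,  parameter :: dp    = kind(1.0d0)
  real(dp), parameter :: sigma = 5.6669e-8_dp   ! Stefan-Boltzmann constant [W m^-2 K^-4]
  real(dp), parameter :: rgas  = 8.314_dp       ! universal gas constant [J mol^-1 K^-1]
contains
  ! Arrhenius-type pyrolysis rate, dm/dt = A_m * exp(-E_a/(R*T))
  pure function pyrolysis_rate(a_m, e_a, t) result(dmdt)
    real(dp), intent(in) :: a_m    ! pyrolysis reactivity
    real(dp), intent(in) :: e_a    ! activation energy [J/mol] (assumed unit)
    real(dp), intent(in) :: t      ! particle temperature [K]
    real(dp) :: dmdt
    dmdt = a_m * exp(-e_a/(rgas*t))
  end function pyrolysis_rate

  ! Radiative source, Q = 4*sigma * sum_i p_i * a_p,i * (T^4 - T_b^4),
  ! zeroed below the cutoff temperature described in Section 2.3.
  pure function radiation_source(p, a_p, t, t_b, t_cut) result(q)
    real(dp), intent(in) :: p(:)    ! partial pressures of the gaseous species
    real(dp), intent(in) :: a_p(:)  ! Planck absorption coefficients
    real(dp), intent(in) :: t       ! local flame temperature [K]
    real(dp), intent(in) :: t_b     ! temperature of the surrounding space [K]
    real(dp), intent(in) :: t_cut   ! cutoff temperature [K]
    real(dp) :: q
    if (t < t_cut) then
       q = 0.0_dp
    else
       q = 4.0_dp * sigma * sum(p*a_p) * (t**4 - t_b**4)
    end if
  end function radiation_source
end module fire_sources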
Fig. 2. Fire-bed discretization mesh geometry; dimensions in cm
3 Computational Experiments The test case we present has been built to reproduce the experimental setting described in [9]. The corresponding computational mesh is shown in Fig. 2; it has 27489 total cells, containing 20000 combustive particles. The combustive material bed, composed of “Ponderosa Pine” needles, is 0.076 m thick, with a packing ratio of 0.063 and moisture (ratio of water mass to dry mass) of 0.264. The air speed is 2.68 m/s, and the experimental flame front speed is 0.008 m/s. The simulation results are shown in Figs. 3 and 4, where we can see the flame front position at two different simulated times; the flame front is identified with the isosurface of burnt gases at a temperature of 500 K. In Fig. 5 we show the flame front propagation speed as a function of simulated time; the simulated propagation speed agrees nicely with the experimental result of 0.008 m/s. The computational requirements are quite heavy; the simulation covering about 800 s shown above has taken over 100 hours on a standard workstation equipped with an Intel Pentium IV processor at 2.8 GHz, with 1 GB of main memory. The introduction of the advanced solvers used in [3] has already given some benefits, as detailed in Fig. 6 where we show the time needed to simulate one second of the physical phenomenon for the old
Fig. 3. Flame front visualization at 270 seconds
Fig. 4. Flame front visualization at 370 seconds
and new version of the code, over the first 30 time steps. As we can see, the new solvers are more efficient as soon as the combustion starts; since the overall simulation takes more than 20000 time steps, dominated by the combustion, the algorithmic change alone gives almost a factor of 2 speedup. Moreover the solvers are suitable to run in parallel;
Fig. 5. Flame front velocity (calculated from the burning rate)
Fig. 6. Simulation runtime
we are currently actively working on the parallelization of the heat exchange model, to obtain a reasonable run time for the physical cases of interest.
4 Development Directions The first development direction is related to the implementation of the computational model engine; we have already started the parallelization of the code by merging the solver modules developed for the engine application [3]. From the modeling point of view the most important issues to be tackled are:
– Modeling of trees
– Realistic terrain shapes
These are under consideration, and we plan to address them in medium-term developments. The long-term vision for this project is to embed this engine into an integrated system comprising real-time satellite data feeds and interfaces with GIS databases to provide a flexible tool for risk management; this is a very challenging undertaking that will require substantial computational resources as well as a significant amount of code development.
References 1. A.A. Amsden, KIVA 3: A KIVA Program with Block Structured Mesh for Complex Geometries, Los Alamos National Lab., Tech. Rep. LA-12503-MS, 1993. 2. S.V. Patankar, Numerical Heat Transfer and Fluid Flow, Hemisphere Publ. Corp., 1980. 3. S. Filippone, P. D’Ambra, and M. Colajanni. Using a Parallel Library of Sparse Linear Algebra in a Fluid Dynamics Applications Code on Linux Clusters. In G. Joubert, A. Murli, F. Peters, M. Vanneschi eds., Parallel Computing - Advances & Current Issues, Imperial College Press Pub., pp. 441–448, 2002. 4. W. L. Grosshandler, RADCAL: A narrow-Band Model for Radiation Calculations in a Combustion Environment, NIST techical note 1402, 1993. 5. F. Shafizadeh, The Chemistry of pyrolysis and combustion. In: R.M. Rowell (ed.) 1984 The Chemistry of Solid Wood, Advances in Chemistry Series 207. American Chemical Society, Washington, DC, pp 489–530. 6. A. J. Stamm, Thermal degradation of wood and cellulose. Presented at the Symp. On Degratadion of Cellulose and Cellulose Derivatives, 127th National Meeting of the American Chemical Society, Cincinnati Ohio, 1955. 7. F. Shafizadeh, Chemistry of Pyrolysis and Combustion of Wood, in Proceedings of 1981 International Conference on Residential Solid Fuels - Environmental Impacts and Solutions, Cooper JA, Malek Deds, Oregon Graduate Center, Beverton: pp 746–771. 8. J. Hong, Modelling of char oxidation as a function of pressure using an intrinsic Langmuir rate equation, Tech. Rep. Brigham Young University, 2000. 9. W.R. Catchpole, E.A. Catchpole, B.W. Butler, R.C. Rothermel, G.A. Morris, D.J. Latham, Rate of Spread of Free-Burning Fires in Woody Fuels in a Wind Tunnel USDA Forest Service, Rocky Mountain Research Station, Intermountain Fire Sciences Laboratory, Missoula.
MHD Modeling of the Interaction Between the Solar Wind and Solar System Objects Andreas Ekenbäck and Mats Holmström Swedish Institute of Space Physics (IRF), P.O. Box 812, 98134 Kiruna, Sweden {andreas.ekenback,mats.holmstrom}@irf.se
Abstract. Magnetohydrodynamic (MHD) models of plasma can be used to model many phenomena in the solar system. In this work we investigate the use of a general MHD solver - the Flash code - for the simulation of the interaction between the solar wind and solar system objects. As a test case we simulate the three-dimensional solar wind interaction with a simplified model of a comet and describe the modifications of the code. The simulation results are found to be consistent with previously published ones. We investigate the performance of the code by varying the number of processors and the number of grid cells. The code is found to scale well. Finally we outline how to simulate the solar wind interaction with other objects using the Flash code.
1 Introduction The solar wind is a plasma consisting mostly of protons and electrons that are ejected from the Sun’s corona, and streams radially outward through the solar system at speeds of several hundred kilometers per second. Embedded into the solar wind is also the interplanetary magnetic field. When this highly supersonic solar wind meets with solar system objects, such as planets, comets and asteroids, an interaction region is formed. The type of the interaction depends on the object. Around planets with a strong internal magnetic field, such as Earth, a cavity is formed, a magnetosphere, shielding the upper atmosphere from direct interaction with the solar wind. This shielding is missing at planets without a strong intrinsic field, e.g., at Mars. Then currents in the planet’s ionosphere form an obstacle that diverts the solar wind flow. Objects without any significant atmosphere, such as the Moon, are basically just physical obstacles to the flow, and a wake is formed behind the object. Comets have a very small nucleus compared to the size of the surrounding cloud of ions and neutrals, and their interaction with the solar wind is governed by photoionization of neutrals and charge-exchange between ions and neutrals. In space physics, the numerical modeling of the interaction between the solar wind and solar system objects is an important tool in understanding the physics of the interaction region. The results of simulations are also important as inputs to further modeling, e.g., to predict loss of planetary atmospheres due to the interaction. Although self-consistent particle and kinetic models best capture the actual physics, the computational cost at present limits the number of particles and the refinement of
grids. Since the computational cost of using a fluid model is less, more refined grids can be used. Thus, the modeling of the interaction between the solar wind and solar system objects is often done by using magnetohydrodynamic (MHD) fluid models. In this work we investigate the possibility of using an existing open source application - the Flash code [1] developed at University of Chicago - to make MHD simulations of the interaction between the solar wind and a non-magnetized object. We model a solar wind and a comet - in the form of a photoionization source - and outline how the simulation model could be generalized to simulate other objects in the solar system.
2 Magnetohydrodynamics Over small spatial volumes, the average properties of a plasma can be described using the basic conservation laws for a fluid [2]. The plasma is however conducting, so the effects of electric and magnetic fields and currents are added to ordinary hydrodynamics. In the presence of a magnetic field, the added conductivity of a fluid will cause a different behavior due to the interaction between the magnetic field and the motion; electric currents induced in the fluid (as a result of its motion) modify the field, and at the same time the flow of the currents in the field causes mechanical forces which modify the motion. 2.1 MHD Equations The MHD model used for most simulations is valid under a number of assumptions, listed in full in [3], the fundamental ones being that the time scale of interest must be long compared to microscopic particle motions and that the spatial scale of interest must be large compared to the Debye length and the thermal gyroradius [2]. MHD models have nevertheless proved adequate to describe several physical processes in the solar system [4,5,3], including the solar wind-comet interaction [6,7]. In what follows we use the notation:
r — position
ρ — mass density
v — velocity of the fluid
B — magnetic field
µ0 — permeability in vacuum = 4π × 10^{-7} Vs/Am
p — plasma pressure
ε — internal energy
g — external forces such as gravity
T — temperature of the fluid
t — elapsed time
It is common to use the MHD equations in their dimensionless, or normalized, form. We choose to use a reference velocity v0, a reference density ρ0 and a reference length R0. These three factors are then used to scale all quantities in the MHD equations. Letting an overbar indicate the variables in SI units we set
r = r̄/R0 ,   ρ = ρ̄/ρ0 ,   v = v̄/v0 ,
which for the other interesting parameters leads to
t = (v0/R0) t̄ ,   B = B̄/√(µ0 ρ0 v0²) ,   T = T̄/v0² ,   p = p̄/(ρ0 v0²) .
Approximating the plasma as collisionless and without resistivity we can use the equations of ideal MHD [8,2]. Using the ideal limit of the one-fluid MHD equations in conservative form, we will solve in Flash
∂ρ/∂t + ∇ · (ρv) = 0 ,   (2.1)
∂(ρv)/∂t + ∇ · (ρvv − BB) = −∇(p + B²/2) + ρg ,   (2.2)
∂(ρE)/∂t + ∇ · [v(ρE + p + B²/2) − B(v · B)] = ρg · v ,   (2.3)
∂B/∂t + ∇ · (vB − Bv) = 0 ,   (2.4)
where E = v²/2 + ε + B²/(2ρ).
The MHD system is in the Flash code complemented by the equation of state, relating the pressure to internal energy or temperature:
p̄ = (γ − 1) ρ̄ ε̄ ,   p̄ = (Na kB / M) ρ̄ T̄ ,
where
γ — the adiabatic index
Na — the Avogadro number = 6.022 × 10^23
M — the mean atomic mass [kg]
kB — the Boltzmann constant = 1.38 × 10^{-23} J/K
MHD Modeling of the Interaction Between the Solar Wind and Solar System Objects
557
initial conditions, boundary conditions and source terms for the problem at hand. The various examples provided with Flash facilitates the task of setting up a simulation model. We here explain how to add a new problem to Flash by using the solar wind-comet interaction as an example. Where needed we also indicate how to model other possible situations. 3.1 Creating a Problem Directory The Flash code was written with the intention of users being able to easily add their specific problems. The code is hence well prepared for adding a new problem; in the source code there is a directory $FLASHHOME$/setups where we create a directory comet_mhd for our problem. There are two files that must be included in a problem setup directory: Config where it is specified which of the modules in Flash that should be used. It also lists (and sets default values to) required runtime parameters. In case we specify a module in our Config file with several submodules, there are default ones which are used unless we also specify the submodule. init_block.F90 to initialize data in each of the blocks of the mesh. We give the initial values in terms of ordinary geometrical coordinates and Flash takes care of the mapping to blocks. The code first initializes the blocks at the lowest level of refinement and then where needed refines the mesh and initializes the newly created blocks. If necessary this refinement procedure is repeated until we reach a highest level of refinement. 3.2 Boundary Conditions To specify the boundary conditions it is possible to use some of the standard ones (“outflow”,”periodic” or “reflecting”) by a line for each boundary in the runtime parameter file. By stating a boundary as “user” Flash will use a file user_bnd.F90 in the problem directory to set specified variable values there. Our solar wind inflow is modeled by setting velocity, density, magnetic field and temperature at the left x-boundary and setting the boundary condition at the right x-boundary as “outflow”. The simulation is periodic in the y- and z-directions. 3.3 Alternation to Standard Algorithms In order to add the comet source terms for the MHD equations as specified in Section 4 it was necessary to make alterations in the main file for MHD calculations. This main file is mhd_sweep.F90 and it was copied to our problem directory. Alternations are then made in our mhd_sweep.F90 and this one will be used instead of the standard one. Calls to a created file comet_source_mhd.F90, that do additions to the rand hand side of the equations (2.1) and (2.2), were inserted in mhd_sweep.F90. To see that our non-standard file comet_source_mhd.F90 is compiled into the final executable it is necessary to have a Makefile in our problem directory stating the target comet_source_mhd.o and its dependencies.
558
Andreas Ekenb¨ack and Mats Holmstr¨om
3.4 Runtime Parameter File The final file needed to run our comet simulation is the runtime parameter file flash.par. It should be placed in the directory from where Flash is run. Besides assigning values to parameters and specifying boundary conditions, it controls the output. Checkfiles are dumped at intervals specified in flash.par and can be used to restart simulations from. We can also control when smaller files, to be used with the enclosed visualization tool, containing only a desired selection of the variables are dumped.
4 A Test Case: The Solar Wind-Comet Interaction To investigate the use of the Flash code for simulating the interaction between solar system objects and the solar wind we choose a comet as a test problem. We model the comet using a single fluid MHD. The setup follows Ogino et al. [6] where the comet is modeled as a spherically symmetric source of ions. Although much more refined comet models have been used later, e.g. by Gombosi et al. [7], we use this simpler model since we in this work primarily are interested in the performance and correctness of the code and our modifications. Because of the small mass (and the resulting small gravitational forces) of the cometary nucleus, a large amount of gas evaporates from the nucleus. The gas extends into the solar wind and is ionized by charge-exchange processes and by photoionization caused by solar ultraviolet radiation. In the MHD equations the interaction process is, following Ogino et al. [6] (originally by Schmidt et al. [12]), modeled as a plasma production centered at the cometary nucleus. We set the cometary plasma production rate ¯ −¯r/λ¯ c ¯ r ) = Qp e A(¯ ¯ c r¯2 4π λ
(4.5)
where ¯p Q r¯ ¯c λ
is the mass production rate of ions = 2.656×104 s−1 kg is the distance to the center of the comet [m] is the ionization distance = 3.03×105 m, that controls the width of the source
for which we make the following normalizations (in order to make A¯ dimensionless): Qp =
¯p Q ρ0 v0 R02
r=
r¯ R0
λc =
¯c λ R0
We modify the continuity equation (2.1) to ∂ρ + ∇ · (ρv) = A(r) ∂t
(4.6)
The cometary ions will also cause a modification to the momentum equation (2.1) so that we have ∂ρv B2 + ∇ · (ρvv − BB) = −∇ p + + ρg − A(r)v (4.7) ∂t 2 in the same region.
MHD Modeling of the Interaction Between the Solar Wind and Solar System Objects
559
Fig. 1. Density at steadystate for our 3D simulation of a comet with parameter settings nsw = 15 cm−3 , vsw = vx = 500 km/s, Bsw = Bz = 6 nT and source terms according to (4.5)-(4.7). This steady state occured ≈ 1 second after we add the comet as a source term, and the figure shows the density distribution 1.4 seconds after the encounter. We here show the x-z plane and thus have the solar wind coming in from the left and its magnetic field in the upward direction. The minimal refinement level was set to 2 and the maximal to 5, resulting in a peak of 1,024,000 in the number of cells
We use the same normalization as in [6], using the earth radius RE = 6.37 × 106 m as R0 . Other quantities are scaled with typical solar wind values: ρ0 = 2.51 × 10−20 m−3 kg (corresponding to a number density of the solar wind nsw of 15 × 106 m−3 ) and ¯ =B ¯z = 6 nT and v0 = 500 km/s. The magnetic field of the solar wind is set so that B we use a solar wind temperature T¯sw = 2 × 105 K. We use an adiabatic index γ = 5/3 and include no external forces (g = 0).
5 Results for the Test Case We here present results for the test case of the solar wind interaction with a comet, as described in Section 4. We first present the simulation results in the physical context and compare them with expected results and previously done simulations. Thereafter we present how the code scales with problem size and when we increase the number of used processors. From a physical point of view the simulations give result in agreement with Ogino et al. [6]. Figure 1 shows the result of a typical simulation in 3D. The characteristic weak bow shock is seen in all simulations. As seen in Figure 2, the solar wind is decelerated by the cometary ions and the magnetic field is draped and obtains peak values near the comet nucleus. The code seems to scale well for our test problem. The execution time grows almost linearly with the problem size as seen in Figure 3. There is also profit to be made by using more processors for a larger problem as seen to the right in Figure 3 - the deviation from the linear speedup appears later when we increase the problem size. As expected the speedup is sub-linear as we increase the number of processors due to increased communication time.
560
Andreas Ekenb¨ack and Mats Holmstr¨om
Fig. 2. Velocity (left) and magnetic (right) field vectors for our 3D simulation of a comet for comparison with Ogino et al. [6]. This corresponds to their Figure 3. Parameter values, length and density scale are the same as in Figure 1 35
44640* cells 110304 cells Line of scalability
1000 30
900
25
700
Speed−up
Total CPU time
800
600
20
15
500 10
400
5
300 200 0
2
4
6 8 Number of cells
10
12 4 x 10
0
12
4
8
16
32
Number of processors
Fig. 3. Left: Execution time as function of the number of cells in the problem. To the right we show the speedup as a function of the number of processors for two problem sizes
The mesh refinement works fine for our test problem; refinements are made in critical regions and only there. The automatic refinement is thus much satisfactory. The code is also capable of handling the discontinuities of our test problem. In our simulations we start with a flowing solar wind and then insert the production of cometary ions in the middle of our simulation box. This does not cause any problem in spite of the much lower velocity of these ions. Both the automatic refinement mechanism and discontinuities are shown in Figure 4 where we show a simulation at t=0.2,0.4 and 0.6 seconds. This simulation was done in 2D so that we could have large refinements differences and still keep simulation time reasonable.
MHD Modeling of the Interaction Between the Solar Wind and Solar System Objects
561
Fig. 4. The start of a simulation in 2D. At t=0 we add the source term A to the MHD equations as explained in Section 4. This shows the density at t=0.2s (left), t=0.4s (middle) and t=0.6s. Note how the mesh is refined (only) where needed. The minimal refinement level was set to 1 and the maximal to 8, resulting in a peak number of 128,000 cells
The simulations were performed at the Seth Linux Cluster at the High Performance Computing Center North (HPC2N) [13]. The cluster has AMD Athlon MP2000+ processors in 120 dual nodes with each CPU running at 1.667 GHz. The network has 667 Mbytes/s bandwidth and an application level latency of 1.46 µs. The network connects the nodes in a 3-dimensional torus organized as a 4x5x6 grid. Each node is also equipped with fast Ethernet. For all our simulations the limitation is CPU speed (and communication as we increase the number of nodes), as we do not use the full 1 GB memory of each node.
6 Conclusions and Discussion In this work we have shown the feasibility of using a general MHD solver for the simulation of the interaction between the solar wind and solar system objects. The open source Flash code needed small changes to specify the initial conditions, boundary conditions and source terms to solve our selected model - the three-dimensional interaction between the solar wind and a spherical symmetric source region, a simplified model of a comet. We assured the correctness of our results by comparison with previously published ones. The simulations showed that the code scales well, both as a function of problem size, and as a function of number of computing nodes. Thus, the Flash code seems to have the possibility of beeing used as a general tool for the investigation of the interaction between solar system objects and the solar wind. Some areas of future investigation is the possibility of solving multi fluid MHD problems (e.g., H+ and O+ ), test particle trajectory integration in the MHD solution field, and the way to handle planetary bodies.
Acknowledgments The software used in this work was in part developed by the DOE-supported ASC /Alliance Center for Astrophysical Thermonuclear Flashes at the University of Chicago, USA.
562
Andreas Ekenb¨ack and Mats Holmstr¨om
References 1. Alliances Center for Astrophysical Thermonuclear Flashes (ASCI), http://flash.uchicago.edu 2. M.G. Kivelson and C.T. Russell (editors). Introduction to Space Physics, University Press, Cambridge, UK, 1995 ISBN 0 521 45104 4 3. E.R. Priest and A.W. Hood (editors). Advances in Solar System Magnetohydrodynamics, University Press, Cambridge, UK, 1991 ISBN 0 521 40325 1 4. H. Matsumoto and Y. Omura (editors). Computer Space Plasma Physics, Terra Scientific Publishing Company, Tokyo, Japan, 1993 ISBN 4 88704 111 X 5. K. Kabin, K.C. Hansen, T.I. Gombosi, M.R. Combi, T.J. Linde, D.L. DeZeeuw, C.P.T. Groth, K.G. Powell and A.F. Nagy. Global MHD simulations of space plasma environments: heliosphere, comets, magnetospheres of planets and satellites Astrophysics and Space Science, 274:407-421, 2000 6. T. Ogino, R.J. Walker and M. Ashour-Abdalla. A Three-Dimensional MHD Simulation of the Interaction of the Solar Wind With Comet Halley Journal of Geophysical Research, vol. 93, NO. A9, pages 9568-9576, September 1988 7. T.I. Gombosi, D.L. De Zeeuw, R.M. H¨aberli and K.G. Powell. Three-dimensional multiscale MHD model of cometary plasma environments Journal of Geophysical Research, vol. 101, no. A7, pages 15233-15232, July 1996 8. T. I Gombosi. Physics of the Space Environment, University Press, Cambridge, UK, 1998 ISBN 0 521 59264 X 9. The Message Passing Interface (MPI), http://www-unix.mcs.anl.gov/mpi/ 10. Parallel Adaptive Mesh Refinement (PARAMESH), http://ct.gsfc.nasa.gov/paramesh/Users manual/amr.html 11. P. MacNeice, K.M. Olson, C. Mobarry, R. deFainchtein and C. Packer. PARAMESH: A parallel adaptive mesh refinement community toolkit Computer Physics Communications, vol. 126, pages 330-354, 2000 12. H.U. Schmidt and R. Wegmann. Plasma flow and magnetic fields in comets Comets: Gases, Ices, Grains and Plasma edited by L.L. Wilkening. University of Arizona Press, Tuscon, USA 13. High Performance Computing Center North, Ume˚a, Sweden, http://www.hpc2n.umu.se
Implementing Applications with the Earth System Modeling Framework Chris Hill1 , Cecelia DeLuca2 , V. Balaji3 , Max Suarez4 , Arlindo da Silva4 , William Sawyer4,7 , Carlos Cruz4 , Atanas Trayanov4, Leonid Zaslavsky4 , Robert Hallberg3 , Byron Boville2 , Anthony Craig2 , Nancy Collins2 , Erik Kluzek2 , John Michalakes2 , David Neckels2 , Earl Schwab2 , Shepard Smithline2 , Jon Wolfe2 , Mark Iredell6 , Weiyu Yang6 , Robert Jacob5 , and Jay Larson5 1
Massachusetts Institute of Technology National Center for Atmospheric Research 3 Geophysical Fluid Dynamics Laboratory 4 NASA Global Modeling and Assimilation Office 5 Argonne National Laboratory 6 National Center for Environmental Prediction 7 Swiss Federal Institute of Technology [email protected] 2
Abstract. The Earth System Modeling Framework (ESMF) project is developing a standard software platform for Earth system models. The standard defines a component architecture superstructure and a support infrastructure. The superstructure allows earth scientists to develop complex software models with numerous components in a coordinated fashion. The infrastructure allows models to run efficiently on high performance computers. It offers capabilities that are commonly needed in Earth Science applications, for example, support for a broad range of discrete grids, regridding functions, and a distributed grid class which represents the data decomposition. We illustrate these features through a simplified finite-volume atmospheric model, and report the parallel performance of the underlying ESMF components.
1 Introduction Earth Science applications — for example those simulating the time evolution of the atmosphere, ocean, earth crust and mantle, or climate — traditionally have been large monolithic codes usually written by a research group at one institution. The increase in the number of environmental aspects which are simulated, along with the access to ever more powerful parallel computers, has led to a prodigious increase in the complexity of these applications. In the last decade, researchers have acknowledged the need to share software components to overcome the software complexity. A prime example of such collaboration is a coupled ocean-atmosphere climate model, e.g. [3]. Such a model is developed by at least two groups, working for the most part independently of each other. It is key that the interfaces be clearly described, so that the separate components can be connected with a minimum of effort. Moreover, it is required that the final product runs efficiently on parallel computers. This challenge can be eased by a software framework which J. Dongarra, K. Madsen, and J. Wa´sniewski (Eds.): PARA 2004, LNCS 3732, pp. 563–572, 2006. c Springer-Verlag Berlin Heidelberg 2006
564
Chris Hill et al.
supports the various grids used by the components, transfers data (which are naturally distributed over the parallel machine) between grids, provides common functionality for I/O, logging, time management, and offers high-level interfaces for the interaction between components. These features are among those offered by the Earth System Modeling Framework (ESMF), a standards-based, open-source software platform for Earth Science applications. The architecture of ESMF has been described elsewhere, e.g., [2,7]. Thus we summarize the architecture only briefly in Section 2. In Section 3, we describe the implementation of a simple atmospheric model which predates ESMF, but has been revised for ESMF compliance. The resulting application consists of a number of ESMF components and thus serves as a useful example for the efficacy of ESMF paradigms. Results from performance studies are presented in Section 4. Section 5 contains a brief discussion of our experiences implementing ESMF applications.
2 Architectural Overview Software frameworks to ease development of complex applications are not new. There are several in Earth Science community alone: the GFDL Flexible Modeling System [4], the Goddard Earth Modeling System [13], the Weather Research and Forecast Model [10], the MIT Wrapper toolkit [6], the PILGRIM communications library [12] and the Model Coupling Toolkit [8]. ESMF unifies and standardizes these efforts, while offering most of the functionality of the union. ESMF consists of a superstructure and an infrastructure. The superstructure provides a software layer encompassing user code and describes the component interconnection through input and output data streams. The infrastructure is a standard support library which developers can use to speed application development. It supplies functionality for describing physical grids and fields, managing time and calendars, logging and profiling, performing communication, regridding data, among other things. Figure 1 depicts the architecture succinctly. Both ESMF layers provide complete support for parallel platforms. The grids used in the applications are distributed over a set of decomposition elements (DEs). The associated complexity of the data distribution is hidden in the superstructure, while the infrastructure has explicit routines operating on DEs. 2.1 Superstructure Layer The superstructure layer offers a unifying context for the interconnection of components. User components which abide by the layer’s standard can be inserted into other ESMFcompliant applications. Components are split into two classes: gridded components which perform work on a given discrete physical grid, and coupler components which map one state to another. The latter therefore provides the connectivity or “glue” for the former. Central to the superstructure is the ESMF state class. This defines a container for component data exchanges. Thus an ESMF component accepts an import state and produces an export state. The ESMF clock provides consistent notions of time between
Implementing Applications with the Earth System Modeling Framework
565
Fig. 1. The ESMF “sandwich” architecture, on left, has an upper-level (superstructure) and a lower-level (infrastructure) layer. Arrows illustrate that the superstructure-layer code (as well as the user code) can call the infrastructure layer. On right, examples of some of the horizontal grids for which ESMF provides built-in regridding support
components. Thus the import/export states and clock make up the interface for a component. An ESMF application can be summarized by the following points: 1. The application is structured into a set of gridded and coupler components. 2. The underlying ESMF component (either gridded or coupler) offers the methods Initialize, Run and Finalize. 3. Data transferred between components are packed into ESMF import and export states. 4. Time information is packed into ESMF time management data structures. 5. ESMF superstructure functionality is used to instantiate the active gridded and coupler components. 6. Each component’s Initialize, Run and Finalize methods are registered with the ESMF framework in the component’s SetServices method. 7. The actual application consists only of a call to an ESMF application driver. 2.2 Infrastructure Layer The infrastructure layer supplies functionality for tasks which are needed in a wide range of Earth Science applications. In this sense it can be considered a software library. It contains a number of classes which describe commonly used objects. For example, the import and export states mentioned in Section 2.1 consist of ESMF fields, which in turn consist of ESMF arrays and ESMF grids on which the data are defined. The grid is a pairing of an ESMF physical grid and a distributed grid object. The physical grid describes the discretization of continuous three-dimensional space for Earth system models. There are a number of common physical grids supported, a subset of which is depicted in Figure 1. The physical grid class is extensible, allowing external groups to develop further grids for new applications.
566
Chris Hill et al.
The distributed grid class represents the data structure’s decomposition into subdomains — typically for parallel processing. The class holds information for halo exchange of tiled decompositions in finite-difference, finite-volume and finite-element codes. The regrid class allows a field F containing data on a given physical grid P and distribution grid D to be remapped to a field F , physical grid P and distribution D . In other words, Rg : (F, P, D) −→ (F , P , D ). If the P = P , the values are interpolated, with the possibility of conserving physical quantities during the mapping. If P = P , the mapping is a redistribution, for which optimized routines are available. The decomposition element layout class is an abstraction for the parallel platform, which could range from a network of workstations to a distributed cluster of shared memory nodes. Parallelism is allowed with both threads and processes, both referred to abstractly as persistent execution threads, or PETs. ESMF gridded and coupling components are synchronized through a common notion of time referred to as a clock. The time and calendar classes support a wide range of concepts of time used in Earth Science applications. For example, some research scenarios call for a year of 360 “days”, while others require time to be carried in rational fractions to avoid numerical rounding and associated clock drift. The infrastructure also defines a set of I/O classes for the storage and retrieval of field and grid information. These support the standard formats in the community, namely NETCDF, HDF5 and GRIB.
3 The Finite-Volume Held-Suarez Atmospheric Model
The Finite-Volume Held-Suarez (FVHS) application is a simple atmospheric model which incorporates a finite-volume dynamical core [9] to simulate atmospheric flow and idealized Held-Suarez Physics [5], which provides a rough approximation of the tendencies of physical processes influencing the atmosphere. The coupling of FV with HS, in particular with respect to their parallel execution on supercomputers, illustrates several challenges in the development of ESMF applications, and as such serves as a case study for the transition of existing applications to ESMF compliance.
3.1 Creating ESMF Components
The FV and HS components were already split logically into three phases of execution:
Initialize
  Finite-Volume: Read configuration; Allocate internal state; Read restart file; Initialize variables; Initialize comm. patterns
  Held-Suarez: Read configuration; Allocate internal state; Read restart file; Initialize variables
Run
  Finite-Volume: Get newest input data; Regrid data to internal vars.; Update atmospheric state; Deliver output vars.
  Held-Suarez: Get newest input data; Calculate tendencies; Deliver output vars.
Finalize
  Finite-Volume: Save internal state; Deallocate variables
  Held-Suarez: Save internal state; Deallocate variables
The compartmentalization of the pre-ESMF code into these three areas was fortuitous for the construction of the ESMF Initialize, Run and Finalize methods. In the FVHS application, these are registered in the framework (Code Example 1) in the component’s SetServices routine, which is the only public access point to the component.
Code Example 1: Registration of methods and allocation/linking of the internal state in the FV component. The wrap variable allows access to a user-defined part of the component’s internal state. Thus, it is possible for the developer to extend and tailor the state information for the particular application.

call ESMF_GridCompSetEntryPoint ( gcFV, ESMF_SETINIT, Initialize, ESMF_SINGLEPHASE, status )
call ESMF_GridCompSetEntryPoint ( gcFV, ESMF_SETRUN, Run, ESMF_SINGLEPHASE, status )
call ESMF_GridCompSetEntryPoint ( gcFV, ESMF_SETFINAL, Finalize, ESMF_SINGLEPHASE, status )
allocate( dyn_internal_state, stat=status )
wrap%dyn_state => dyn_internal_state
call ESMF_UserCompSetInternalState ( gcFV, 'FVstate', wrap, status )
call ESMF_GridCompGetInternalState ( gcFV, GENWRAP, STATUS )
The main program is responsible for creating the gridded components, couplers and import/export states, registering the components’ methods with a call to ESMF_GridCompSetServices, then initializing, running and finalizing the component with the corresponding ESMF calls. See Code Example 2.
Code Example 2: FVHS initializations related to the FV component. The component is created, as well as its import and export state. Two couplers are created to connect the FV and HS components in both directions. The Initialize, Run and Finalize services (whose entry points were described previously) are registered with the framework with ESMF_GridCompSetServices. Thereafter it is possible to access these services with calls to the framework.

FV = ESMF_GridCompCreate ( 'FV', layout=loFV, mtype=ESMF_ATM, configfile=cf_file, rc=rc )
impFV = ESMF_StateCreate ( 'FV_Imports', ESMF_STATEIMPORT, compname='FV dyn', rc=rc )
expFV = ESMF_StateCreate ( 'FV_Exports', ESMF_STATEEXPORT, compname='FV dyn', rc=rc )
ccFVxHS = ESMF_CplCompCreate ( 'FV to HS', layout=loAppl, configfile='FVxHS.rc', rc=rc )
ccHSxFV = ESMF_CplCompCreate ( 'HS to FV', layout=loAppl, configfile='HSxFV.rc', rc=rc )
call ESMF_GridCompSetServices ( gcFV, FV_SetServices, rc )
:
call ESMF_GridCompInitialize ( gcFV, impFV, expFV, clock, ESMF_SINGLEPHASE, rc )
:
call ESMF_GridCompRun ( gcFV, impFV, expFV, clock, ESMF_SINGLEPHASE, rc )
:
call ESMF_GridCompFinalize ( gcFV, impFV, expFV, clock, rc )
3.2 Physical Grids
Both the FV dynamical core and the HS physics employ a latitude-longitude grid. FV, however, uses a staggered grid on which winds are defined on the boundaries between cells, while in HS all values are defined in the cell center. ESMF provides mechanisms for describing these and other staggerings, as well as regridding software to interpolate values between staggered grids in a physically consistent manner. In Code Example 3, the physical (latitude-longitude) grid for FV is defined; the distributed grid is implied in the decompositions defined in the two dimensions, countsPerDEDecomp1 and countsPerDEDecomp2.
Code Example 3: Creation of an ESMF grid. The global dimensions of the domain, the description of the domain, and interval sizes are specified. The grid is defined as a latitude-longitude grid spanning the sphere, which is periodic in longitude. The decomposition is specified by the distributions in both dimensions, countsPerDEDecomp1 and countsPerDEDecomp2. A grid staggering ESMF_GridStagger_D is specified for this case where wind speed values are defined on the cell’s boundaries, not in its center. An ESMF grid is returned.

GRID%GRIDXY = ESMF_GridCreateLogRectUniform(numDims=2, counts = (/grid%im, grid%jm/), &
    minGlobalCoordPerDim=(/-PI-delta1(1)/2,-PI/2-delta2(1)/2/), deltaPerDim=(/delta1,delta2/), &
    layout=layout, periodic=(/ESMF_TRUE, ESMF_FALSE/), horzGridKind=ESMF_GridKind_LatLon, &
    horzStagger=ESMF_GridStagger_D, horzCoordSystem=ESMF_CoordSystem_Spherical, &
    countsPerDEDecomp1=imxy, countsPerDEDecomp2=jmxy, name="FVCORE horizontal grid", rc=status)
3.3 Data Decomposition and Redistribution
The FV and HS components distribute their data in fundamentally different manners. HS operations are strictly in the vertical, and therefore it is logical to distribute the data in both latitude and longitude. FV, on the other hand, performs a vertical integration and contains east-west/north-south data dependencies as well. Near the poles the zonal grid intervals become smaller, and the east-west dependencies heavily outweigh the north-south. Thus FV distributes data in latitude and level for most calculations and in latitude and longitude for vertical integration. While these distributions optimize the parallel performance for the individual components, the data must be redistributed at the component boundaries. As described in Section 2.2, ESMF provides extensive support for regridding data. Code Example 4 illustrates the data redistribution: routes (communication patterns) are derived from the XY- and YZ-decomposed fields during the initialization step. These can be applied repeatedly to the fields.
Code Example 4: Data redistribution between XY and YZ decompositions. ESMF fields are created for arrays in each of the two employed decompositions. ESMF routes xy_to_yz and yz_to_xy are defined describing the communication pattern between the two fields. These routes can be applied repeatedly when the data needs to be redistributed between the fields.

fieldXY = ESMF_FieldCreate( GRID%GRIDXY, ArrayXY, horzRelloc=ESMF_CELL_CENTER, &
                            haloWidth=hWidth, name="field2", rc=rc )
fieldYZ = ESMF_FieldCreate( GRID%GRIDYZ, ArrayYZ, vertRelloc=ESMF_CELL_CENTER, &
                            haloWidth=hWidth, name="field2", rc=rc )
call ESMF_FieldRedistStore(fieldXY, fieldYZ, delayout, xy_to_yz, rc=rc)
call ESMF_FieldRedistStore(fieldYZ, fieldXY, delayout, yz_to_xy, rc=rc)
:
call ESMF_FieldRedist(fieldXY, fieldYZ, xy_to_yz, rc=rc)
call ESMF_FieldGetArray(fieldYZ, arrayYZ, rc)
:
call ESMF_FieldRedist(fieldYZ, fieldXY, yz_to_xy, rc=rc)
call ESMF_FieldGetArray(fieldXY, arrayXY, rc)
3.4 FV Subcomponents The component structure can be recursive, that is, an ESMF component can be constructed from further ESMF components. Such is the case for the finite-volume dynamical core, which consists of a solver for the primitive equations and an algorithm for the advection of tracers (atmospheric constituents, e.g. water vapor or particulates, which are carried by the winds). The primitive equation (PE) component and the tracer advection (TA) component are connected by couplers which are internal to FV. Just as the main FVHS program
creates the FV and HS components and their couplers, the FV Initialize method creates the PE and TA components and their couplers.
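A rough sketch (hypothetical Python, not ESMF) of this recursive composition, in which a parent component's Initialize creates and initializes its child components, as FV does with PE and TA:

class NestedComponent:
    """Hypothetical sketch: a component whose Initialize creates child components."""
    def __init__(self, name, child_factories=()):
        self.name = name
        self.child_factories = child_factories
        self.children = []
    def initialize(self):
        # like FV: build the subcomponents and their internal couplers here
        self.children = [make() for make in self.child_factories]
        for child in self.children:
            child.initialize()
        print("initialized", self.name, "with children", [c.name for c in self.children])
    def run(self):
        for child in self.children:
            child.run()

fv = NestedComponent("FV", child_factories=(
    lambda: NestedComponent("PE"),
    lambda: NestedComponent("TA"),
    lambda: NestedComponent("PE/TA couplers"),
))
fv.initialize()
fv.run()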
4 Performance Studies While FVHS is not a full general circulation model (GCM), its scalability on massively parallel computers gives a good indication of the performance which can be expected of an ESMF-compliant GCM. Three hour simulations were performed with the FVHS model using a time step of 30 minutes. This is sufficiently long to determine the execution time of one time step in a longer production simulation. In Figure 2 the timings for the FV component are given. Since the Held-Suarez Physics is a much simpler calculation than the dynamics, FV makes up most of the overall computation.
Fig. 2. Even for a relatively low-resolution 2° × 2.5° × 32-level problem using 4-byte arithmetic, the ESMF dynamical core component scales adequately on the HP/Compaq 3800 and SGI Origin 3800 at NASA Goddard Space Flight Center. A 3 hr. atmospheric simulation is performed, and the figure shows the scalability of the FV dynamical core component, which includes communication both in the form of halo exchanges and transposes. One curve corresponds to a one-dimensional domain decomposition. Additional parallelism with two subdomains comes at some additional cost: the redistribution (transpose) must be performed on several arrays for each time step. However, for four (∗) and eight (+) subdomains, there is no additional overhead, and the code scales to roughly 100 CPUs
The results for this “small” problem indicate that the ESMF overhead for each invocation of the Run method is scalable. However, they do not make a quantitative statement about the magnitude of the overhead itself. Besides attaining scalability, the goal of the ESMF implementation is naturally to retain the performance of the underlying application.
In order to determine the introduced overhead, timings of the FV dynamical core [11] in the existing Community Atmosphere Model (CAM) were made. This is unfortunately not a precise comparison, since (1) the CAM grid has 26 vertical levels instead of 32 for FVHS, and (2) calculations are performed with 8-byte arithmetic in the former instead of the 4-byte arithmetic used in the latter. These differences are imposed by the data sets which are used for the respective CAM and FVHS runs. Experience shows, however, that the performance gain from computing fewer levels is roughly compensated by the expense of the double-precision calculation. The results, illustrated in Figure 3, in fact imply that the ESMF overhead is well below the 10% allowed by the ESMF requirements document [1].
Fig. 3. The graph on the left side illustrates the performance of the dynamical core (not ESMF-compliant) at 2° × 2.5° × 26 levels with 8-byte arithmetic in the Community Atmosphere Model on the SGI Origin 3800. The meaning of the symbols is identical to the corresponding SGI graph in Figure 2. The pie chart at right indicates the breakdown of times for the components of the full CAM, when run at 0.5° × 0.625° × 26 level resolution. The dynamical core comprises slices 1, 2, 3 and 8, and accounts for nearly half of the overall computation
5 Summary of Experiences
The components mentioned in Section 3 predate ESMF, thus the codes had to be rewritten for ESMF compliance. The fundamental design of the original software lent itself well to the division into Initialize, Run and Finalize methods, thus the code restructuring was kept to a minimum. In spite of this, there were impediments to a quick implementation. First, the import and export states of the FV dynamical core were not entirely appropriate for the Held-Suarez component. Normally the mapping to the required FV import state and from the HS export state would have been programmed in the couplers, but in this case it was decided to modify the import and export states themselves, bringing them in line with other dynamical cores. We suspect it is normal that existing software components will
sometimes require revisions in the import and export states to make them general enough for use in other ESMF applications. The ensuing development overhead unfortunately cannot be alleviated by a software framework. Secondly, some planned ESMF functionality needed in FVHS had not been implemented when FVHS was first integrated. Thus we wrote our own temporary implementations of the regrid interpolations between staggered and unstaggered grids, I/O operations, and other utilities. These deficiencies have since been corrected by the ESMF core team. On the other hand, the framework-specific tasks, such as the integration of components through couplers and the creation of physical and decomposition grids, were straightforward and took little time. The performance of the FV component proved to be comparable to that of the original dynamical core, indicating that the overhead of an ESMF-compliant application was in this case not significant.
Acknowledgments The NASA ESTO program supports the development of ESMF through contracts under NASA Cooperative Agreement Notice CAN-00-OES-01.
References 1. V. Balaji, T. Bettge, B. Boville, T. Craig, C. Cruz, A. da Silva, C. DeLuca, B. Eaton, R. Hallberg, C. Hill, M. Iredell, R. Jacob, P. Jones, B. Kauffman, J. Larson, J. Michalakes, D. Neckels, J. Rosinski, S. Smithline, M. Suarez, J. Wolfe, W. Yang, M. Young, L. Zaslavsky. Earth System Modeling Framework; ESMF Requirements. http://www.esmf.ucar.edu/esmf docs/ESMF last reqdoc/. 2. V. Balaji, B. Boville, N. Collins, T. Craig, C. Cruz, A. da Silva, C. DeLuca, R. Hallberg, C. Hill, M. Iredell, R. Jacob, P. Jones, B. Kauffman, E. Kluzek, J. Larson, J. Michalakes, D. Neckels, W. Sawyer, E. Schwab, S. Smithline, M. Suarez, B. Womack, W. Yang, M. Young, L. Zaslavsky. Earth System Modeling Framework; DRAFT ESMF Architecture. http://www.esmf.ucar.edu/esmf docs/ESMF archdoc/. 3. M. B. Blackmon, B. Boville, F. Bryan, R. Dickinson, P. Gent, J. Kiehl, R. Moritz, D. Randall, J. Shukla, S. Solomon, G. Bonan, S. Doney, I. Fung, J. Hack, E. Hunke, and J. Hurrell. The Community Climate System Model. BAMS, 82(11):2357–2376, 2001. 4. Geophysical Fluid Dynamics Laboratory. The Flexible Modeling System. http://www.gfdl.noaa.gov/fms. 5. I. Held and M. Suarez. A Proposal for the Intercomparison of the Dynamical Cores of Atmospheric General Circulation Models. BAMS, 75(10):1825–1830, 1994. 6. C. Hill, A. Adcroft, D. Jamous, and J. Marshall. A strategy for tera-scale climate modeling. In Proc. 9th ECMWF Workshop on the Use of HPC in Meteorology. World Scientific Press, 2001. 7. C. Hill, C. DeLuca, Balaji, M. Suarez, and A. da Silva. The Architecture of the Earth System Modeling Framework. Computing in Science & Engineering, 6(1):18–28, 2004. 8. J. Larson, R. Jacob, I. Foster, and J. Guo. The Model Coupling Toolkit. In Proc. Int’l Conf. Computer Science, pages 185–194. Springer-Verlag, 2001. Lecture Notes in Computer Science 2073.
9. S.-J. Lin and R. B. Rood. A ’vertically Lagrangian’ Finite-Volume Dynamical Core for Global Models. Monthly Weather Review, 2004. Submitted for publication. 10. J. Michalakes, S. Chen, J. Dudhia, L. Hart, J. Klemp, J. Middlecoff, and W. Skamarock. Development of a next generation regional weather research and forecast model. In Proc. 9th ECMWF Workshop on the Use of HPC in Meteorology. World Scientific Press, 2001. 11. W. Sawyer and A. Mirin. A scalable implementation of a finite-volume dynamical core in the Community Atmosphere Model. In Proc. Parallel and Distributed Computing and Systems, ACTA Press, 2004. Accepted for publication. 12. W. Sawyer and P. Messmer. Parallel Grid Manipulations for General Circulation Models. In Parallel Processing and Applied Mathematics : 4th International Conference, pages 564–571. Springer-Verlag, 2002. Lecture Notes in Computer Science 2328. 13. D. Schaffer and M. Suarez. Design and Performance Analysis of a Massively Parallel Atmospheric General Circulation Model. J. Sci. Programming, 8:49–57, 2000.
Parallel Discrete Event Simulations of Grid-Based Models: Asynchronous Electromagnetic Hybrid Code
Homa Karimabadi¹, Jonathan Driscoll¹, Jagrut Dave¹,², Yuri Omelchenko¹, Kalyan Perumalla², Richard Fujimoto², and Nick Omidi¹
¹ SciberNet, Inc., Solana Beach, CA, 92075, USA {homak,driscoll,yurio,omidi}@scibernet.com
² Georgia Institute of Technology, Atlanta, GA, 30332, USA {jagrut,kalyan,fujimoto}@cc.gatech.edu
Abstract. The traditional technique to simulate physical systems modeled by partial differential equations is by means of a time-stepped methodology where the state of the system is updated at regular discrete time intervals. This method has inherent inefficiencies. Recently, we proposed [1] a new asynchronous formulation based on a discrete-event-driven (as opposed to time-driven) approach, where the state of the simulation is updated on a “need-to-be-done-only” basis. Using a serial electrostatic implementation, we obtained more than two orders of magnitude speedup compared with traditional techniques. Here we examine issues related to the parallel extension of this technique and discuss several different parallel strategies. In particular, we present in some detail a newly developed discrete-event based parallel electromagnetic hybrid code and its performance using conservative synchronization on a cluster computer. These initial performance results are encouraging in that they demonstrate very good parallel speedup for large-scale simulation computations containing tens of thousands of cells, though overheads for inter-processor communication remain a challenge for smaller computations.
1 Introduction Computer simulations of many important complex physical systems have reached a barrier as existing techniques are ill-equipped to deal with the multi-physics, multiscale nature of such systems. An example is the solar wind interaction with the Earth’s magnetosphere. This interaction leads to a highly inhomogeneous system consisting of discontinuities and boundaries and involves coupled processes operating over spatial and temporal scales spanning several orders of magnitude. Inclusion of such disparate scales is beyond the scope of existing codes [2]. We have taken a new approach [1] to the simulation of such complex systems. The conventional time-stepped grid-based Particle-In-Cell (PIC) models provide the sequential execution of synchronous (time-driven) field and particle updates. In a synchronous simulation, the distributed field cells and particles undergo simultaneous state transitions
Research was supported by NSF ITR grant 0325046 at SciberNet Inc. and 0326431 at Georgia Institute of Technology. Some of the computations were performed at the San Diego Supercomputing Center.
at regular discrete time intervals. In contrast, we propose a new, asynchronous type of PIC simulation based on a discrete-event-driven (as opposed to time-driven) approach, where particle and field time updates are carried out in space on a “need-to-be-done-only” basis. In these simulations, particle and field information “events” are queued and continuously executed in time. The technique has some similarity to Cellular Automata (CA) in that complex behaviors result from interaction of adjacent cells [3]. However, unlike CA, the interactions between cells are governed by a full set of partial differential equations rather than the simple rules as are typically used in CA. The power of this technique is in its asynchronous nature as well as the elimination of unnecessary computations in regions where there is no significant change in time. This is in contrast to CA, which are largely based on synchronous execution (e.g. [4]); to date, asynchronous parallel discrete event simulation of CA has only been applied to relatively simple phenomena such as Ising spin [5]. Using a serial electrostatic model, we have shown [1] that the discrete event technique can lead to more than two orders of magnitude speedup compared to conventional techniques. In the following, we discuss issues associated with the extension of this technique to parallel architectures. We then demonstrate, through a newly developed parallel hybrid code, that parallel processing can provide an additional order of magnitude improvement in performance.
2 Parallel Computation Issues Discrete Event Simulation (DES) offers substantial benefits compared to conventional explicit time-driven simulation by reducing the amount of computation that must be performed. However, by itself, DES is not sufficient to achieve the desired performance and scalability. Parallel DES (PDES) can help address this issue. However, the irregular nature of PDES computations leads to difficulties. Synchronization overhead, the number of concurrent computations, load distribution and event processing rate impact PDES performance significantly [6]. As in conventional (time-driven) simulations, the parallelization of asynchronous (event-driven) continuous PIC models is realized by decomposing the global computation domain into subdomains. In each subdomain, individual cells and particles are aggregated into containers that may be mapped to different processors. The parallel execution of time-driven simulations is commonly achieved by copying field information from the inner lattice cells to ghost cells of neighboring subdomains and exchanging out-of-bounds particles between the processors at the end of each update cycle. By contrast, in parallel asynchronous PIC simulations both particle and field events are not synchronized by the global clock (i.e. they do not take place at the same time intervals throughout the simulation domain), but occur at arbitrary time intervals, introducing synchronization problems. Unless precautions are taken, a process may receive an event message from a neighbor with a simulation time stamp that is in its past. In the following, we assume that the parallel simulation is composed of a collection of Simulation Processes (SPs) that communicate by exchanging time stamped event messages. Broadly, synchronization approaches may be classified as conservative or optimistic. Conservative synchronization ensures that each simulation process never
receives an event in its past [8, 9]. Runtime performance is critically dependent on a priori determination of an application property called lookahead (a time interval), which is roughly dependent on the degree to which the computation can predict future interactions with other processes without global information. On the other hand, the optimistic approach allows a process to receive a message in its past, but uses a rollback mechanism to recover [7]. Further discussion can be found in [10,11,12].
Another important issue concerns load balancing. As with any parallel or distributed application, the computation must be evenly balanced across processors and interprocessor communication should be minimized to achieve the best performance. Often these are conflicting goals. This is particularly challenging in PDES because of its irregular, unpredictable nature. Load balancing can greatly affect the efficiency of synchronization mechanisms (e.g. poor load distribution can lead to excessive rollbacks in optimistic systems). Automated schemes that balance workload at runtime using process migration present new challenges in this area [13,15].
Finally, it is desirable to decouple the parallel simulation engine that handles synchronization and communication from the application/models. This reduces the burden of the application developer, by not requiring an understanding of underlying PDES synchronization mechanisms. We have used an extensible simulation engine that provides multiple synchronization and event delivery mechanisms through a single interface, named µsik [16].
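As a generic illustration of conservative synchronization with lookahead (a hedged Python sketch; this is not the µsik interface), a simulation process may only execute events whose time stamps lie below a bound derived from its neighbors' known simulation times plus the lookahead:

import heapq

def conservative_step(pending, local_time, neighbor_times, lookahead):
    """Execute every pending event that is provably safe under conservative
    synchronization: no neighbor can still send an event with a time stamp
    earlier than min(neighbor_times) + lookahead.

    pending        -- heap of (timestamp, tie_breaker, callback) tuples
    neighbor_times -- latest known simulation times of neighboring processes
    lookahead      -- minimum delay before a neighbor's action can affect us
    """
    safe_until = min(neighbor_times) + lookahead
    while pending and pending[0][0] <= safe_until:
        timestamp, _, callback = heapq.heappop(pending)
        local_time = timestamp
        callback()        # may schedule further events, locally or remotely
    return local_time

events = []
heapq.heappush(events, (0.3, 0, lambda: print("event at t=0.3")))
heapq.heappush(events, (1.2, 1, lambda: print("event at t=1.2")))
# With neighbors known to be at t=1.0 and t=1.5 and a lookahead of 0.1,
# only the t=0.3 event is safe to execute now.
t = conservative_step(events, 0.0, neighbor_times=[1.0, 1.5], lookahead=0.1)
print("local time advanced to", t)

A larger lookahead raises this bound and allows more events to be processed between synchronizations, which is consistent with the lookahead sensitivity reported in Sect. 4.1.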
3 DES Model
We have developed a general architecture for parallel discrete event modeling of grid-based models. Details will be presented elsewhere. Here we present a simplified version of our technique that illustrates the salient features of our model without getting bogged down in all the details. In the following, we show results from a 1D parallel hybrid code (light version) that we have developed and tested using µsik as the simulation engine. This code is used to highlight unique parallel issues that are encountered in the DES modeling of plasmas. The light version does not strictly conserve flux. However, the lack of strict local flux conservation does not change the result significantly in the problem of interest here.
3.1 Hybrid Algorithm
Electromagnetic hybrid algorithms with fluid electrons and kinetic ions are ideally suited for physical phenomena that occur on ion time and spatial scales. Maxwell’s equations are solved by neglecting the displacement current in Ampere’s law (Darwin approximation), and by explicitly assuming charge neutrality. There are several variations of electromagnetic hybrid algorithms with fluid electrons and kinetic ions [18]. Here we use the one-dimensional resistive formulation [19] which casts the field equations in terms of the vector potential. The model problem uses the piston method, where incoming plasma moving with a flow speed larger than its thermal speed is reflected off the piston located on the rightmost boundary. This leads to the generation of a shock wave that propagates to the left. In this example, we use a flow speed large enough to form a fast magnetosonic
shock. In all the runs shown here, the plasma is injected with a velocity of 1.0 (normalized to the upstream Alfvén speed), the background magnetic field is tilted at an angle of 30°, and the ion and electron betas are set to 0.1.
The simulation domain is divided into cells [1], and the ions are uniformly loaded into each cell. We conducted experiments ranging from 4,096 to 65,536 cells, and initialized each simulation to have 100 ions per cell. Each cell is modeled as an SP in µsik and the state of each SP includes the cell’s field variables. The main tasks in the simulation are to a) initialize fields, b) initialize particles, c) calculate the exit time of each particle, d) sort IonQ (see below), e) push particle, f) update fields, g) recalculate exit time, and h) reschedule. This is accomplished through a combination of priority queues and three main classes of events.
The ions are stored in either one of two priority queues as illustrated in Fig. 1. Ions are initialized within cells in an IonQ. As ions move out of the leftmost cell, new ions are injected into that cell in order to keep the flux of incoming ions fixed at the left boundary. MoveTime is the time at which an ion is to be moved next. The placement and removal of ions in IonQ and PendQ is controlled by comparing their MoveTimes to the current time and lookahead. Ions with MoveTimes more than current time + 2*lookahead have not yet been scheduled and are kept in the IonQ. A wakeup occurs when the fields in a given cell change by more than a certain threshold and MoveTimes of particles in the cell need to be updated. On a wakeup, only the ions in this queue recalculate their MoveTimes. Because ions in the IonQ have not yet been scheduled, a wakeup requires no event retractions. If an ion’s MoveTime becomes less than current time + 2*lookahead in the future, the ion is scheduled to move, and is removed from the IonQ and placed in the PendQ. Thus, the front of the IonQ is at least one lookahead period ahead of the current time. This guarantees that each ion move will be scheduled at least one lookahead period in advance. The PendQ is used to keep track of ions that have already been scheduled to exit, but have not yet left the cell. These particles have MoveTimes that are less than the current time. Ions in the PendQ with MoveTimes earlier than the current time have already left the cell and must be removed before cell values such as density and temperature are calculated.
Fig. 1. At any moment, the ions in a cell are stored in one of two queues, the IonQ and the PendQ. Both are priority queues, sorted so that the ion with the earliest exit time is at the top
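A minimal sketch (hypothetical Python; not the actual code) of the IonQ/PendQ bookkeeping described above, assuming both queues are binary heaps ordered by MoveTime:

import heapq

def reschedule_cell(ion_q, pend_q, now, lookahead, schedule_move):
    """Hypothetical sketch of one cell's queue bookkeeping.

    ion_q, pend_q -- heaps of (move_time, key, ion) entries, earliest MoveTime first
    schedule_move -- callback that schedules the ion's move event in the engine
    """
    horizon = now + 2 * lookahead
    # Ions whose MoveTime enters the scheduling horizon: IonQ -> PendQ.
    while ion_q and ion_q[0][0] < horizon:
        move_time, key, ion = heapq.heappop(ion_q)
        schedule_move(move_time, ion)
        heapq.heappush(pend_q, (move_time, key, ion))
    # Ions already past their MoveTime have left the cell and are dropped
    # before cell quantities such as density and temperature are computed.
    while pend_q and pend_q[0][0] <= now:
        heapq.heappop(pend_q)

# e.g. two ions, one exiting soon and one much later
ion_q, pend_q = [(0.05, 1, "ion-1"), (0.90, 2, "ion-2")], []
heapq.heapify(ion_q)
reschedule_cell(ion_q, pend_q, now=0.0, lookahead=0.07,
                schedule_move=lambda t, ion: print("scheduling", ion, "at", t))
print(len(ion_q), "unscheduled,", len(pend_q), "pending")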
Events can happen at any simulation time and are managed separately by individual cells of the simulation. The flow of the program including functions of events and their interaction with µsik is illustrated in Fig. 2. In this simulation, each cell handles three different types of events. SendIon Event: This event is first run on each cell when the simulation is initialized,
and is responsible for sending ions from one cell to the next. This is accomplished by scheduling the complementary “AddIon” events for neighboring cells. The SendIon event schedules an AddIon event corresponding to every ion which exits within two lookahead periods, and always schedules at least one SendIon event. In addition, the SendIon event checks to see if the fields have changed by some tolerance, waking up particles in that cell if necessary. SendIon events occur frequently and as a whole are computationally significant.
AddIon Event: This event is used to add a single ion to a cell. The ion’s new exit time is calculated, and then it is added to the IonQ. The fields in the cell are then updated and Notify events are scheduled for the left and right neighbor cells to inform them of the field change. The AddIon event causes state changes and occurs sporadically in large batches.
NotifyEvent: This event updates the vector potential and temperature for the two neighbors.
Fig. 2. Flow diagram of the parallel hybrid code
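The event flow of Fig. 2 can be paraphrased as a toy, serial sketch (hypothetical Python; the one-directional ion transport, the random stand-in for the orbit solve, and the omitted field update are simplifications of the actual parallel, µsik-based code):

import heapq, itertools, random

LOOKAHEAD, N_CELLS, T_END = 0.1, 8, 5.0
_seq = itertools.count()
_future = []                                     # (time, seq, handler, args)

def schedule(time, handler, *args):
    heapq.heappush(_future, (time, next(_seq), handler, args))

class Cell:
    def __init__(self, idx):
        self.idx, self.ion_q = idx, []           # ion_q holds (exit_time, ion_id)

def send_ion(now, cells, i):
    # SendIon: ship every ion exiting within 2*lookahead to the right neighbour,
    # then reschedule itself ahead of the next unscheduled exit.
    cell = cells[i]
    while cell.ion_q and cell.ion_q[0][0] < now + 2 * LOOKAHEAD:
        exit_time, ion = heapq.heappop(cell.ion_q)
        schedule(max(now, exit_time), add_ion, cells, (i + 1) % N_CELLS, ion)
    nxt = cell.ion_q[0][0] if cell.ion_q else now + 2 * LOOKAHEAD
    schedule(max(now + LOOKAHEAD, nxt - LOOKAHEAD), send_ion, cells, i)

def add_ion(now, cells, i, ion):
    # AddIon: receive an ion, assign a new exit time (random stand-in for the
    # analytic orbit solve) and notify both neighbours of the field change.
    heapq.heappush(cells[i].ion_q, (now + random.uniform(0.5, 1.5), ion))
    for nb in ((i - 1) % N_CELLS, (i + 1) % N_CELLS):
        schedule(now + LOOKAHEAD, notify, cells, nb)

def notify(now, cells, i):
    pass                                         # a real cell would update its fields here

cells = [Cell(i) for i in range(N_CELLS)]
for i, cell in enumerate(cells):
    cell.ion_q = [(random.uniform(0.2, 1.0), ion) for ion in range(3)]
    heapq.heapify(cell.ion_q)
    schedule(0.0, send_ion, cells, i)            # prime the simulation loop, as in Fig. 2

while _future and _future[0][0] < T_END:
    t, _, handler, args = heapq.heappop(_future)
    handler(t, *args)
print(sum(len(c.ion_q) for c in cells), "ions still queued at t =", T_END)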
Exit Time. We take the electric and magnetic fields to be constant within a cell, with arbitrary orientation and magnitude. In this case, a charged particle will have an equation of motion that can be calculated analytically and has the general form R(t) = At² + Bt + r_c sin(ω_c t + φ) + C, where R(t) is the position of the particle. Newton’s method is used to solve for the exact exit time.
Lookahead. If the typical velocity of a particle is v, and a typical cell width is x, then the time it takes for a particle to cross a cell is x/v. Lookahead must be a factor smaller than this time so that a particle covers a small fraction of the cell width in one lookahead period. On the other hand, if the lookahead is too small, the parallel performance will be poor. This happens when there are few event computations during a lookahead period. Synchronization overhead becomes larger than the computational load. We use the time it takes for the first particle to exit a cell to set the lookahead.
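A hedged sketch of the exit-time solve: Newton iteration on f(t) = R(t) − x_boundary, with illustrative coefficients (in the actual code, A, B, r_c, ω_c, φ and C follow from the local fields and the ion's initial conditions; the names and values below are hypothetical):

import math

def exit_time(A, B, rc, wc, phi, C, x_boundary, t0=1e-3, tol=1e-12, max_iter=50):
    """Solve R(t) = x_boundary for t with Newton's method, where
    R(t) = A*t**2 + B*t + rc*sin(wc*t + phi) + C  (1D sketch)."""
    def f(t):
        return A*t*t + B*t + rc*math.sin(wc*t + phi) + C - x_boundary
    def df(t):
        return 2*A*t + B + rc*wc*math.cos(wc*t + phi)
    t = t0
    for _ in range(max_iter):
        step = f(t) / df(t)
        t -= step
        if abs(step) < tol:
            return t
    raise RuntimeError("Newton iteration did not converge")

# e.g. a cell of width 1 with weak fields: an ion starting at x=0 moving right
print(exit_time(A=0.05, B=1.0, rc=0.1, wc=2.0, phi=0.0, C=0.0, x_boundary=1.0))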
Fig. 3. (a) Comparison of time-stepped and event-driven simulations of a fast magnetosonic shock. (b) The ratio of the total magnetic field from lookahead runs of 0.07 and 0.15 relative to the field from zero lookahead run
4 Results
Figure 3(a) compares results of a traditional time-stepped hybrid simulation and our event-stepped simulation for a single processor. We have plotted the y and z (transverse) components of the magnetic field, the total magnetic field, and the plasma density versus x, after the shock wave has separated from the piston on the right hand side. The match between the two simulations is remarkable as DES captures the (i) correct shock wave speed, and (ii) details of the wavetrain associated with the shock wave. This match is impressive considering the fact that the differences seen in Fig. 3(a) are within statistical fluctuations associated with changes in the noise level in hybrid codes.
4.1 Effect of Lookahead
Next, we consider the effects of changing the lookahead on both the accuracy of the results as well as the execution time. The hardware for the runs shown here was a high-performance cluster at the Georgia Tech High Performance Computing laboratory. The cluster has 8 nodes, each with four 2.8 GHz Xeon processors and 4 GB of RAM. The simulation uses 4 µsik Federates (one per processor), each with 2 Regions and 512 cells per Region. A Region is a grouping of cells for efficient load distribution, described in Sect. 4.3. Figure 3(b) shows variations in the spatial profile of Btot with lookahead. The zero lookahead run yields the most accurate result and is treated as a baseline. In the hybrid algorithm, the maximum lookahead must be less than the exit time of the earliest scheduled particle, which is approximately 0.15 for our choice of parameters. Deviations of the profile from the baseline are less than 10%, even when the maximum lookahead value is used.
Figure 4(a) shows the speedup in execution time relative to the zero lookahead run. The important point from this figure is that even small departures from zero lookahead lead to substantial improvements in speed. In fact, the most dramatic speedup (a factor of 3) is achieved when lookahead is changed from 0 to 0.005. Further changes in lookahead do improve performance, but at a much slower rate. For example, increasing the lookahead by an order of magnitude from 0.005 to 0.05 leads to only an additional 15% speedup.
4.2 Scaling with the Number of Processors
The 1D modeling of the shock problem in Fig. 3(a) can be easily performed with a serial version of our code. However, our ultimate goal is to develop a 3D version of the code. As a simple means of evaluating the parallel execution in a 3D model, we have considered cell numbers as large as 65,536. This is sufficient to identify the key issues of parallel execution. Figure 4(b) shows the speedup as a function of the number of processors up to 128. The speedup is measured with respect to a sequential run. These runs were made on a 17-node cluster, with each node having eight 550 MHz CPUs and 4 GB of RAM. The simulation domain consisted of 8,192 cells in one case and 65,536 cells in the other.
Fig. 4. (a) Speedup with increased lookahead. (b) Scaling with the number of processors. The dashed line is a linear scaling curve. Speedup for 8,192 cells is in black and 65,536 cells is in gray. For each domain size, there are two curves - speedup considering the overall execution time and speedup without considering communication time. The scaling is superlinear if communication time is removed
As is evident from Fig. 4(b), the parallel speedup is good up to eight processors, but declines as more processors are added. This is due to the architecture of the cluster, which uses a collection of 8-processor computers communicating through TCP/IP. With up to 8 processors, the entire simulation runs through shared memory, and the communication overheads are low. However, with more than 8 processors, the overheads associated with TCP/IP begin to offset the speed gained by using more processors. This reduces the slope of the curve. For 8,192 cells, the speedup does not increase significantly after 32 processors. This is because of increased synchronization costs, which negate the gains from parallel processing. Processors do not have a sufficient computational load between
global synchronizations and spend a greater fraction of time waiting on other processors. For 65,536 cells, there is enough computation between global synchronizations to obtain good speedup up to 128 processors. Since the overheads associated with inter-processor communication become relatively smaller as the simulation size increases, we do not anticipate this effect being as pronounced with larger, 3D simulations. In 3D simulations, each processor would have several orders of magnitude more cells, making the relative overheads associated with TCP/IP much less. To test the idea that the poor scaling is the result of the communication overheads, we have also plotted speedup with the communication time subtracted out in Fig. 4(b). The speedup in this case is better than expected, and is in fact super-linear. In other words, doubling the number of processors more than doubles the execution speed. This is the result of a peculiar feature of DES. Execution time does not necessarily scale linearly with the simulation size, even on one processor. This is in part due to the fact that as the event queue becomes larger (contains more pending events), the time associated with scheduling and retrieving each event increases. So, as the simulation gets distributed over more and more processors, each processor is effectively dealing with a smaller piece of the simulation, making the scaling non-linear. Memory performance (specifically, cache performance) can also lead to super-linear speedup. By keeping the same size problem but distributing it over more processors, the memory footprint in each processor shrinks. The total amount of cache memory increases in proportion with the number of processors used: with enough processors, one can, for example, fit the entire computation into the processors’ caches. An extreme case of this is when the problem is so large that it does not fit into the memory of a single machine, causing excessive paging. Although both effects could be causing the super-linear scaling seen in Fig. 4(b), the data structure performance appears to be dominant. In a no-load test of µsik, we changed the number of cells from ten to a million. The time to process a single event increased from 6.50 to 22.15 microseconds, indicating a scaling behavior of N log N. Figure 5(a) shows the percentage of time spent in communication and blocking in each case. There is a significant increase in the fraction of time spent in communication and blocking for more than 8 processors. For 65,536 cells, the percentage settles to around 60% for higher numbers of processors. However, for 8,192 cells, the percentage keeps increasing, reaching 90% for 128 processors.
4.3 Load Balancing
Figure 5(b) shows the variation in execution time as a function of the number of Regions per processor, as distributed by the Region Deal algorithm. In this scheme, the simulation is broken into small Regions which are then “dealt” out much like a card game among processors. These simulation runs were performed on the first cluster mentioned earlier. The curves show a different trend for higher numbers of processors (4, 8, 16) than for 2 processors. For higher numbers of processors, the variation in execution time because of the Region Deal load balancing scheme is less pronounced. The best execution times are close to the execution time for 1 Region per processor, with variations of less than 1,000 seconds. Also, increasing the number of Regions per processor increases the execution time in most cases.
This is because of the increased synchronization overhead that negates the benefits of the load distribution scheme. For 2 processors, having more
Fig. 5. (a) Percentage of time spent in communication. (b) Performance of load balancing algorithm
Regions per processor leads to better load distribution and hence reduced execution time. The execution time settles around 14,000 seconds for 16 Regions or more. In this case too, contiguous cells are assigned to different processors and incur greater synchronization overhead for a higher number of Regions per processor.
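The Region Deal scheme can be paraphrased as dealing contiguous blocks of cells ("Regions") to processors in round-robin fashion; a hypothetical Python sketch (the exact assignment used by the actual code may differ):

def region_deal(n_cells, n_procs, regions_per_proc):
    """Deal contiguous Regions of cells to processors like cards in a card game.

    Returns a list mapping each processor to the cell indices it owns."""
    n_regions = n_procs * regions_per_proc
    region_size = n_cells // n_regions
    owner = [[] for _ in range(n_procs)]
    for r in range(n_regions):
        cells = range(r * region_size, (r + 1) * region_size)
        owner[r % n_procs].extend(cells)     # deal Region r to processor r mod P
    return owner

# 16 cells, 2 processors, 2 Regions each: proc 0 gets cells 0-3 and 8-11,
# proc 1 gets cells 4-7 and 12-15.
print(region_deal(16, 2, 2))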
References 1. Karimabadi, H, Driscoll, J, Omelchenko, Y.A. and N. Omidi, A New Asynchronous Methodology for Modeling of Physical Systems: Breaking the Curse of Courant Condition, J. Computational Physics, (2005), in press. 2. Karimabadi, H. and N. Omidi. Latest Advances in Hybrid Codes and their Application to Global Magnetospheric Simulations. in GEM, http://www-ssc.igpp.ucla.edu/gem/tutorial/ index.html) (2002). 3. Ilachinski, A., Cellular Automata, A Discrete Universe, World Scientific, 2002. 4. Smith, L., R. Beckman, et al. (1995). TRANSIMS: Transportation Analysis and Simulation System. Proceedings of the Fifth National Conference on Transportation Planning Methods. Seattle, Washington, Transportation Research Board. 5. Lubachevsky, B. D. (1989). Efficient Distributed Event-Driven Simulations of Multiple-Loop Networks. Communications of the ACM 32(1): 111-123. 6. Fujimoto, R.M., Parallel and Distributed Simulation Systems. (2000): Wiley Interscience. 7. Jefferson, D., Virtual Time, ACM Transactions on Programming Languages and Systems, (1985), 7(3):pp. 404-425. 8. Chandy, K. and J. Misra (1979). Distributed Simulation: A case study in design and verification of distributed programs. IEEE Transactions on Software Engineering. 9. Chandy, K. and J. Misra (1981). Asynchronous distributed simulation via a sequence of parallel computations. Communications of the ACM. 24. 10. Fujimoto, R. M. (1999), Exploiting Temporal Uncertainty in Parallel and Distributed Simulations, Proceedings of the 13th Workshop on Parallel and Distributed Simulation: 46-53. 11. Rao, D. M., N. V. Thondugulam, et al. (1998). Unsynchronized Parallel Discrete Event Simulation. Proceedings of the Winter Simulation Conference: 1563-1570. 12. Rajaei, H., R. Ayani, et al. (1993). The Local Time Warp Approach to Parallel Simulation. Proceedings of the 7th Workshop on Parallel and Distributed Simulation: 119-126. 13. Boukerche, A., and S. K. Das, Dynamic Load Balancing Strategies for Conservative Parallel Simulations, Workshop on Parallel and Distributed Simulation, 1997.
14. Carothers, C. D., and R. M. Fujimoto, “Efficient Execution of Time Warp Programs on Heterogeneous, NOW Platforms,” IEEE Transactions on Parallel and Distributed Systems, Vol. 11, No. 3, pp. 299-317, March 2000. 15. Gan,B. P., et al., Load balancing for conservative simulation on shared memory multiprocessor systems, Workshop on Parallel and Distributed Simulation, 2000 16. Perumalla, K.S., µsik – A Micro-Kernel for Parallel and Distributed Simulation Systems, to appear in the Workshop on Principles of Advanced and Distributed Simulation, May, 2005. 17. Bagrodia, R., R. Meyer, et al. (1998). Parsec: A Parallel Simulation Environment for Complex Systems. IEEE Computer 31(10): 77-85. 18. Karimabadi, H., D. Krauss-Varban, J. Huba, and H. X. Vu, On magnetic reconnection regimes and associated three-dimensional asymmetries: Hybrid, Hall-less hybrid, and Hall-MHD simulations, J. Geophys. Res., Vol. 109, A09205,(2004). 19. Winske, D. and N. Omidi, Hybrid codes: Methods and Applications, in Computer Space Plasma Physics: Simulation Techniques and Software, H. Matsumoto and Y. Omura, Editors. (1993), Terra Scientific Publishing Company. p. 103-160.
Electromagnetic Simulations of Dusty Plasmas
Peter Messmer
Tech-X Corporation, 5621 Arapahoe Avenue, Suite A, Boulder, CO 80303, USA
[email protected]
Abstract. Dusty plasmas are ionized gases containing small particles of solid matter, occurring e.g. in space, fusion devices or plasma processing discharges. While understanding the behavior of these plasmas is important for a variety of applications, an analytic treatment is difficult and numerical methods are required. Ideally, the dusty plasma would be treated kinetically, modeling each particle individually. However, this is in general far too expensive and approximations have to be used, like representing the electrons as fluids or treating Maxwell’s equations in the electrostatic limit. Here we report on fully kinetic electromagnetic simulations of dusty plasmas, using the parallel kinetic plasma simulation code VORPAL. A brief review of the particle-in-cell (PIC) algorithm, including the model used for dust grains, and first results of single dust grain simulations are presented.
1 Introduction Dust - plasma interactions play an important role in space plasma systems. Comets, rings in planetary magnetospheres, exposed dusty surfaces, or the zodiacal dust cloud are all examples where dusty plasma effects shape the size and spatial distribution of small dust particles. Simultaneously, dust is often responsible for the composition, density and temperature of its plasma environment. The dynamics of charged dust particles can be surprisingly complex, fundamentally different from the well understood limits of gravity dominated motion (vanishing charge-to-mass ratio) or the adiabatic motion of electrons and ions. Dust particles in plasmas are unusual charge carriers. They are many orders of magnitude heavier than any other plasma particles and they can have orders of magnitude larger (negative or positive) time-dependent charges. Dusty plasma systems are characterized by new spatial and temporal scales (grain radius, average distance between grains, dust gyro-radius and frequency, etc.), in addition to the customary scales (Debye length, plasma frequency, etc.). Dust particles can communicate non-electromagnetic effects (gravity, drag, radiation pressure) to the plasma, representing new free energy sources. Their presence can influence the collective plasma behavior, altering the wave modes and triggering new instabilities.
2 PIC Algorithm
An efficient way to model kinetic plasmas self-consistently is the well-established particle-in-cell (PIC) algorithm [1,2], which follows the trajectories of a representative
number of particles, so called macro-particles, in the electromagnetic field. In the following, we briefly review this algorithm: The relativistic motion of charged particles in an electromagnetic field is governed by the Newton-Lorentz law

d(γv)/dt = (q/m) (E + (v/c) × B),    dx/dt = v    (2.1)

where x, v and q/m are position, velocity and charge to mass ratio of each particle, γ = (1 − v²/c²)^(−1/2) and c is the speed of light. The evolution of the electric field E and the magnetic inductivity B is determined by Maxwell’s equations,
∇·E = ρ/ε₀    (2.2)
∂E/∂t = c² ∇×B − j/ε₀    (2.3)
∇·B = 0    (2.4)
∂B/∂t = −c ∇×E    (2.5)
where j denotes the current density and ε₀ is the permittivity of free space. To simulate a plasma, the equation of motion for the particles and Maxwell’s equations have to be solved concurrently. In the PIC algorithm, the field quantities E and B are therefore discretized in space, allowing the use of finite differences for the curl operators in Eqs. (2.3) and (2.5). The particles, on the other hand, are located anywhere in the computational domain. Starting from an initial distribution of particles and fields, the system is evolved in time. Each time-step consists of two alternating phases: In the particle push phase, the new particle positions and velocities are determined according to Eq. (2.1). In the following field solve phase, the fields are updated using the new particle positions. In order to accelerate the particles in the particle push, the field quantities have to be determined at the particle location, using e.g. linear interpolation. The actual time integration is then done by a leap-frog scheme. To update the fields, a variety of algorithms and approximations can be used. Very common for dusty plasma simulations is the electrostatic approximation, which neglects the time dependency of B. The electric field is then determined via Poisson’s equation, Eq. (2.2), using the charge density derived from the particle positions. For systems with fast particles, or to model the interaction with electromagnetic waves, the time dependency of B cannot be neglected. The full set of Maxwell’s equations has then to be solved. The fields are coupled to the particles through the current density term in Eq. (2.3) and the charge density term in Eq. (2.2). Different algorithms are known to determine the current density from the particle motion. One example are charge-conserving current deposition algorithms [3], which determine the current flux through a cell surface from the charge carried across this surface due to the particle motion in one time step; these have the nice property that they satisfy the continuity equation,

∂ρ/∂t = −∇·j.    (2.6)
The advantage of this algorithm is that Eq. (2.2) remains satisfied if it was satisfied initially, as can be seen by taking the divergence of Eq. (2.3) and applying the continuity equation,
∂(∇·E)/∂t = −∇·(j/ε₀) = (1/ε₀) ∂ρ/∂t.    (2.7)

Therefore, if an electromagnetic PIC code with charge-conserving current deposition is started with a self-consistent solution of E and B, the evolution of the fields is entirely determined by Eqs. (2.3) and (2.5). The particle charge is only needed to determine the current density, and no explicit solution of Poisson’s equation is required. The field quantities are placed at different positions in space: the E field components are defined on the cell edges, whereas the B field components are known on the cell surfaces. The components are staggered in a so-called Yee grid [4]. In this setup, e.g. the Bz, Ex and Ey components reside on the same z-plane. This allows computing the z-component of ∇×E at the location of Bz by means of finite differences. Analogously, all other curl operators can be computed. Providing the current flux at the locations of the E field grid, the B and E fields can be updated by a leap-frog scheme. All finite difference operators are time and space centered to guarantee second order accuracy, and the whole scheme requires only local information, making it very suitable for parallel implementations. A drawback is, however, that the time step ∆t and the grid cell spacing, ∆x, ∆y, ∆z, are coupled via a CFL condition

c∆t ≤ (∆x^(−2) + ∆y^(−2) + ∆z^(−2))^(−1/2)    (2.8)
which states that the propagation of light waves has to be resolved. This can be very restrictive, especially if the main focus of the simulation is on the particle behavior and the particle velocity is much smaller than the speed of light.
2.1 Parallel PIC Algorithm
The PIC algorithm per se is inherently parallel, as the same procedures (move particles under the influence of the Newton-Lorentz force, deposit the particle charge or currents to the field) have to be performed for a large number of particles. In the electrostatic case, parallelization is complicated due to the solution of Poisson’s equation, which requires global information. Highly optimized iterative methods can be used to solve this problem efficiently. The parallelization of the electromagnetic problem turns out to be simpler than the electrostatic case. As only local information is required to update the fields and the particles, parallelization via domain decomposition can lead to good speedup.
2.2 Parallel Plasma Simulation Code VORPAL
VORPAL [5] is a plasma simulation framework under development at the University of Colorado (CU) and Tech-X Corporation. The original domain of application was laser-plasma interaction, where its results have contributed to leading scientific publications, including a recent article in Nature [6]. Due to the flexibility built into the code, it has grown into many additional areas, like breakdown models in microwave guides or dusty plasmas. The VORPAL framework can currently model the interaction of electromagnetic fields, charged particles, and charged fluids.
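As a small numerical illustration of the CFL restriction in Eq. (2.8) (the grid spacings and safety factor below are arbitrary example values, not taken from the paper):

import math

C = 299_792_458.0                      # speed of light [m/s]

def cfl_timestep(dx, dy, dz, safety=0.95):
    """Largest stable time step for the explicit Yee/leap-frog update, Eq. (2.8)."""
    return safety / (C * math.sqrt(dx**-2 + dy**-2 + dz**-2))

# A 1 mm uniform grid forces a time step of roughly 2 ps, even if the
# particles of interest move far more slowly than light.
print(cfl_timestep(1e-3, 1e-3, 1e-3))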
Fig. 1. Speedup figures for the VORPAL code. Left: Speedup relative to the execution time on 8 processors for an electromagnetic 3D simulation on a 200 × 100 × 100 cell grid and 5 particles per cell. Right: Speedup relative to the execution time on 16 processors for an electrostatic 3D simulation on a 1026 × 65 × 65 cell grid, with different preconditioners used to solve Poisson’s equation: Gauss-Seidel (stars) and Algebraic Multigrid (diamonds). In both panels, perfect speedup is indicated by a dotted line. Speedup figures are taken from References [5] and [7]
The design of VORPAL originates from the understanding that both 2D and 3D simulations will be required in the future: the shorter completion time of 2D simulations allows rapid investigations of physical situations which can then be followed up by more extensive simulations in 3D. The availability of large-scale computing facilities, such as the 6000-CPU IBM-SP at the National Energy Research Supercomputer Center as well as low-cost Beowulf clusters affordable to universities, small corporations and institutes, allows a broad range of users to benefit from this code. VORPAL uses C++ template meta-programming in order to generate both a 2D and a 3D code from the same code base. It was specifically designed to be a parallel hybrid code, while capable of running as a pure particle or fluid code. All the parallelization aspects of the code are handled transparently. The code uses recursive coordinate bisection, allowing for efficient load balancing even in the case of highly inhomogeneous particle distributions. Passing field, fluid and particle data between processors is efficiently handled, using set theory techniques. Figure 1 shows the speedup of VORPAL for both a purely electromagnetic PIC simulation and for an electrostatic problem. In the electrostatic case, Poisson’s equation is solved for a potential, using the parallel iterative solvers of the AZTEC library [8].
3 Modeling Dusty Plasmas An interesting physical question is the temporal evolution of the charge on a dust grain immersed in a two component plasma consisting of electrons and ions. Electrons and ions stream onto the dust grain and ’stick’ to it. An uncharged dust grain will be initially charged negatively, due to the higher thermal velocity of the electrons. The resulting strong electric field will then pull ions to the grain, reducing the net charge. In general, analytic solutions are only available for the steady state. This problem is usually treated
Fig. 2. 2D VORPAL simulations of dusty plasmas. Left: Dust charge per particle (stars) as a function of ion flow speed, compared to the theoretical prediction (dotted). Right: Average dust charge as a function of dust density
in the electrostatic limit, where the dust grain is modeled as a particle-absorbing boundary condition. However, if the particle velocities become relativistic, the full electromagnetic treatment is desirable. How can a dust grain be modeled in an electromagnetic PIC code? As shown in Sec. 2, the only quantity that is required to update the electromagnetic field self-consistently is the current density, which is inferred from the particle motion. An immobile particle therefore does not contribute to the evolution of the system. Stopping a particle after it enters a cell effectively increases the charge of that cell. An extended, infinitely heavy, perfectly dielectric dust grain can therefore be modeled as a simple particle-absorbing area in the computational domain.
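A minimal sketch of this idea (hypothetical Python, not VORPAL): particles found inside the dust region are frozen, so their charge remains assigned to the dust cells while they no longer contribute any current.

import numpy as np

def push_with_dust(x, v, dt, dust_min, dust_max):
    """Advance 1D particle positions, freezing any particle inside the dust region.

    x, v               -- arrays of particle positions and velocities
    dust_min, dust_max -- extent of the (infinitely heavy, absorbing) dust grain
    """
    absorbed = (x >= dust_min) & (x <= dust_max)
    v = np.where(absorbed, 0.0, v)     # absorbed particles stop moving ...
    x = x + v * dt                     # ... so they deposit charge but no current
    return x, v

x = np.array([0.10, 0.49, 0.80])
v = np.array([1.0, 1.0, -1.0])
for _ in range(5):
    x, v = push_with_dust(x, v, dt=0.05, dust_min=0.45, dust_max=0.55)
print(x, v)   # the middle particle is captured by the dust grain; the others keep moving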
4 Preliminary Results
In this section, we present some preliminary results of electromagnetic dust simulations. The dust grains are modeled as stationary, spatially extended cylindrical or spherical particle sinks. The charge-neutral background consists of thermal electrons and protons.
4.1 Asymmetric Charging
Figure 2, left, shows the dependency of the average charge on a single dust grain as a function of the ion flow speed. The simulation was performed in 2D, with a single dust grain in a 2 × 2 cm box on a 100 × 100 cell grid. The electron temperature is 5 times the ion temperature. The initial electron Debye length is approximately 1 cm. The theoretical results were taken from Lapenta et al. [9], considering the dimensionality of the simulation. The simulations are in good agreement with the theoretical results. For this simulation, no correction for the depletion of the plasma due to the absorption by the dust grain was performed. This issue will be addressed in further kinetic studies.
4.2 Density Dependent Charging
In a second example, the average charge per dust particle as a function of dust density was investigated. The number of dust grains was varied between 1.5 and 25 per electron
Fig. 3. Dust grain (light gray) in a plasma of streaming ions and thermal electrons. The ions flow from top-left to bottom right. Downstream of the grain, a negatively charged wake forms (dark gray), whereas a positively charged region appears upstream
The number of dust grains was varied between 1.5 and 25 per electron Debye length. The setup geometry was identical to the previous case, but the electron temperature was chosen to be the same as the ion temperature. Theory [10] predicts a decrease of the average dust charge for increasing dust density. Figure 2, right, shows this effect reproduced with 2D VORPAL simulations, roughly following a power law. A major difference between our simulations and results published e.g. by Young et al. [11] is that the dust grains in our case are spatially resolved, which makes it possible to investigate additional effects, such as wake formation.

4.3 Wake Formation
A third example of our preliminary investigations is wake formation in a system of streaming ions and a thermal electron population. Figure 3 shows the wake in a 3D VORPAL simulation. Clearly visible is the negatively charged tail behind the single dust grain. Also visible is a positively charged bow in front of the dust grain.
5 Summary
In the previous paragraphs, a model for dust grains in an electromagnetic particle-in-cell code was introduced. It was shown that a particle-absorbing region in an electromagnetic PIC code with charge-conserving current deposition can act as an infinitely heavy dust grain.
This model of a dust grain requires only local information and preserves the parallelism of the electromagnetic PIC algorithm. Good speedup is therefore expected, which will make it possible to perform studies of spatially extended dust grains in 3D. First simulations of single dust grains immersed in a thermal plasma were presented. The results are in good agreement with previously published results. Future studies will focus on the mutual influence of several dust grains, depending on their relative position. In addition, the time-dependent charging of dust grains in 3D will be investigated. All these questions are still unresolved in dusty plasma physics [12].
References
1. Birdsall, C.K., Langdon, A.B., Plasma Physics via Computer Simulation, Adam Hilger, 1991.
2. Hockney, R.W., Eastwood, J.W., Computer Simulation Using Particles, Adam Hilger, 1988.
3. Villasenor, J., Buneman, O., Comput. Phys. Commun., 69, 306, 1989.
4. Yee, K.S., Numerical solution of initial boundary value problems involving Maxwell's equations in isotropic media, IEEE Trans. Antennas Propagat., 14, 302, 1966.
5. Nieter, C., Cary, J.R., VORPAL: a versatile plasma simulation code, J. Comp. Phys., 196 (2), 448, 2004.
6. Geddes, C.G.R., Toth, Cs., van Tilborg, J., Esarey, E., Schroeder, C.B., Bruhwiler, D., Nieter, C., Cary, J.R., High-quality electron beams from a laser wakefield accelerator using plasma-channel guiding, Nature, 431, 538, 2004.
7. Messmer, P., Bruhwiler, D., An electrostatic solver for the VORPAL code, Comp. Phys. Comm., 164, 118, 2004.
8. Tuminaro, R.S., Heroux, M., Hutchinson, S.A., Shadid, J.N., Official Aztec User's Guide: Version 2.1, Technical Report SAND99-8801J, 1999.
9. Lapenta, G., Simulation of charging and shielding of dust particles in drifting plasmas, Phys. of Plasmas, 6 (5), 1442, 1999.
10. Goertz, C.K., Dusty plasmas in the solar system, Rev. Geophys., 27 (2), 271, 1989.
11. Young, B., Cravens, T.E., Armstrong, T.P., Friauf, R.J., A two-dimensional particle-in-cell model of a dusty plasma, J. Geophys. Res., 99, 2255, 1994.
12. Tsytovich, V.N., Morfill, G.E., Thomas, H., Complex Plasmas: I. Complex Plasmas as Unusual State of Matter, Plasma Phys. Reports, 28 (8), 625, 2002.
Advanced Algorithms and Software Components for Scientific Computing: An Introduction Organizer: Padma Raghavan Department of Computer Science and Engineering The Pennsylvania State University 343K IST Bldg., University Park, PA 16802, USA [email protected]
Abstract. Scientific computing concerns the design, analysis and implementation of algorithms and software to enable computational modeling and simulation across disciplines. Recent research in scientific computing has been spurred both by advances in multiprocessor systems and by the wider acceptance of computational modeling as a third mode of scientific investigation, in addition to theory and experiment. We introduce the next six chapters on advanced algorithms and software components for scientific computing on parallel computers. We review recent results toward enabling the scalable solution of large-scale modeling applications.
1 Introduction
In the last decade, in addition to theory and experiment, computational modeling and simulation has been widely embraced as the third mode of scientific investigation and engineering design. Consequently, the associated computational challenges are being faced by an increasingly larger group of scientists and engineers from a variety of disciplines. This in turn has motivated research in scientific computing aimed at delivering advanced algorithms and software tools that can enable computational science and engineering without specialized knowledge of numerical analysis, computer architectures, and interoperable software components.

In this chapter, we introduce some recent research in advanced scientific computing to enable large-scale computational modeling and simulation. In Section 2, we provide an introduction to the next six chapters covering a range of topics. The first chapter concerns algorithms and software for a scientific database for efficient function approximation in computational fluid dynamics. The next three chapters focus on techniques for scalable, parallel sparse linear system solution. The latter is often the computationally dominant step in simulations of models based on nonlinear partial differential equations using an implicit or semi-implicit scheme with finite-element or finite-difference discretizations. The final two chapters focus on advanced software environments, object-oriented and component frameworks for high-performance scientific software development. We end with concluding remarks in Section 3.
This work was supported in part by the National Science Foundation through grants CCF044345, EIA-0221916, and DMR-0205232.
2 Advanced Algorithms and Software for Scientific Computing
In this section, we provide an introduction to the following six chapters, which represent emerging themes of research in scientific computing aimed at meeting the challenges of multi-scale computational modeling on high-performance computing systems.

The first chapter by Plassmann and Veljkovic concerns parallel algorithms for a scientific database for efficient function approximation. The problem is related to the modeling of reacting flows with complex chemistry involving multiple scales, for example, through high-order Direct Numerical Simulation (DNS). The authors consider the development of a scientific database for approximating computationally expensive functions by archiving previously computed exact values. Earlier, they had demonstrated that sequential algorithms and software for the database problem can be effective in decreasing the running time of complex reacting flow simulations. Now, they consider parallel algorithms and software based on a partitioning of the search space. Their approach involves maintaining a global BSP tree which can be searched on each processor in a manner analogous to the sequential algorithm. They also consider a variety of heuristics for coordinating the distributed management of the database. They establish through experiments that a hybrid of these heuristics exhibits the best performance and scalability by successfully balancing the computational load.

The next three chapters focus on effective parallel sparse linear system solution. Sparse linear solvers can be broadly classified as being ‘direct,’ ‘iterative,’ ‘domain decomposition,’ or ‘multilevel/multigrid.’ Additionally, there are ‘preconditioning’ schemes which can be used with iterative solvers to accelerate their convergence. Each class contains a large number of algorithms, and there is multiplicative growth in the number of methods when special forms of multilevel, domain-decomposition, or direct methods are used as preconditioners. The performance of all these methods, including convergence, reliability, execution time, parallel efficiency, and scalability, can vary dramatically depending on the interactions between a specific method, the problem instance, and the computational architecture. The chapter by D’Ambra, di Serafino, and Filippone concerns building effective preconditioners using schemes from domain decomposition. They focus on constructing parallel Schwarz preconditioners by extending Parallel Sparse BLAS (PSBLAS), a library of routines providing basic linear algebra operations for building iterative sparse linear system solvers on distributed-memory multiprocessors. The next two chapters are related in the sense that they both concern using techniques from sparse direct solvers to develop preconditioners. The chapter by Hénon, Pellegrini, Ramet, Roman and Saad concerns using ‘blocked’ forms of incomplete factorization to develop robust preconditioners. The chapter by Teranishi and Raghavan concerns parallel preconditioners using incomplete factorizations that are ‘latency tolerant,’ i.e., preconditioners that can achieve high efficiency despite the relatively large latency of inter-processor communication on message-passing multiprocessors and clusters. All three chapters provide empirical results to demonstrate the efficiency of the preconditioners.
These three chapters are representative of an emerging trend in scientific computing where the focus is on hybrid algorithms to provide efficient, scalable and robust sparse solvers that can meet the demands of a range of large-scale applications.
The last two chapters concern software environments for scientific computing. The emphasis is on the design of ‘plug-and-play’ environments for software development. In such systems, the application developer has access to a wealth of software implementations of a variety of algorithms for different key kernels common to a wide range of simulations. This enables developers to build software specific to their modeling needs, with relative ease and without sacrificing performance, by selecting from tuned implementations of advanced algorithms.

The chapter by Heroux and Sala discusses the design of the ‘Trilinos’ system, a two-level software structure designed around a collection of ‘packages.’ Each package focuses on a particular area of research, such as linear and nonlinear solvers or algebraic preconditioners, and is usually developed by a small team of experts in that area. Packages exist underneath the Trilinos top level, which provides a common look-and-feel to application developers while allowing package interoperability. The final chapter by Norris concerns software architectures for advanced scientific computing. The Common Component Architecture (CCA) has been developed with the goal of managing the increasing complexity of scientific software development. The CCA attempts to provide relatively simple component interface requirements with a minimal performance penalty. The chapter explores some concepts and approaches that can contribute to making CCA-based applications easier to design, implement, and maintain. When properly designed, CCA applications can also benefit from hybrid schemes and adaptive method selection to enhance performance.
3 Conclusions
Recent advances in computing hardware include high-performance multiprocessors like the BlueGene/L from IBM, commodity Beowulf clusters, and their organization into wide-area distributed computational grids. These hold the potential for discovery and design through computational modeling and simulation. However, the raw processing power of the hardware cannot be utilized without the development of sophisticated algorithms and software systems for scientific computing. Recent decades of research in scientific computing and numerical analysis have resulted in many algorithms for each of the main steps of applications based on models described by nonlinear partial differential equations (PDEs). The first four of the following six chapters concern techniques for developing hybrids of algorithms that were developed earlier, in order to improve performance or enhance scalability. The last two chapters concern advanced software architectures in which such algorithms can be implemented with relative ease within computational modeling and simulation applications.

Although many algorithms have been developed for key computational steps in PDE-based applications, such as sparse linear system solution and mesh generation and improvement, a growing consensus is that it is neither possible nor practical to predict a priori which algorithm for a given numerical kernel performs best across applications. We therefore anticipate an increased focus on advanced algorithms and software environments that can be used to compose hybrids, and to adaptively select methods to improve application performance. The results in the following six chapters reflect this emerging trend.
Extending PSBLAS to Build Parallel Schwarz Preconditioners

Alfredo Buttari (1), Pasqua D'Ambra (2), Daniela di Serafino (3), and Salvatore Filippone (1)

(1) Department of Mechanical Engineering, University of Rome "Tor Vergata", Viale del Politecnico, I-00133, Rome, Italy, {alfredo.buttari,salvatore.filippone}@uniroma2.it
(2) Institute for High-Performance Computing and Networking, CNR, Via Pietro Castellino 111, I-80131 Naples, Italy, [email protected]
(3) Department of Mathematics, Second University of Naples, Via Vivaldi 43, I-81100 Caserta, Italy, [email protected]
Abstract. We describe some extensions to Parallel Sparse BLAS (PSBLAS), a library of routines providing basic Linear Algebra operations needed to build iterative sparse linear system solvers on distributed-memory parallel computers. We focus on the implementation of parallel Additive Schwarz preconditioners, widely used in the solution of linear systems arising from a variety of applications. We report a performance analysis of these PSBLAS-based preconditioners on test cases arising from automotive engine simulations. We also make a comparison with equivalent software from the well-known PETSc library.
1 Introduction
Effective numerical simulations in many application fields, such as Computational Fluid Dynamics, require fast and reliable numerical software to perform Sparse Linear Algebra computations. The PSBLAS library was designed to provide the kernels needed to build iterative methods for the solution of sparse linear systems on distributed-memory parallel computers [10]. It includes parallel versions of most of the Sparse BLAS computational kernels proposed in [7] and a set of auxiliary routines for the creation and management of distributed sparse matrix structures. The library also provides application routines implementing some sparse Krylov solvers with Jacobi and block Jacobi preconditioners. The library routines have proved to be powerful and flexible in restructuring a complex CFD code, improving the accuracy and efficiency of the solution of the sparse linear systems and implementing the boundary data exchanges arising in the numerical simulation process [11]. PSBLAS is based on an SPMD programming model and uses the Basic Linear Algebra Communication Subprograms (BLACS) [6] to manage interprocess data exchange. The library is internally written in C and Fortran 77, but provides high-level routine interfaces in Fortran 95. Current development is devoted to extending PSBLAS with preconditioners suitable for fluid flow problems in automotive engine modeling. A basic development guideline is to preserve as much as possible the original PSBLAS software architecture while moving towards the new SBLAS standard [8]. In this work we focus on the extension for building Additive Schwarz preconditioners, widely used in the parallel iterative solution of linear systems arising from PDE discretizations.

The paper is organized as follows. In Section 2 we give a brief description of Additive Schwarz preconditioners. In Section 3 we discuss how the set of PSBLAS data structures and routines has been extended to implement these preconditioners by reusing existing PSBLAS computational routines. In Section 4 we present the results of experiments devoted to analyzing the performance of PSBLAS-based Schwarz preconditioners with Krylov solvers on test matrices arising from automotive engine simulations; we also show the results of comparisons with the widely used PETSc library [1]. Finally, in Section 5, we draw some conclusions and outline future work.
2 An Overview of Additive Schwarz Preconditioners
Schwarz preconditioners are based on domain decomposition ideas originally introduced in the context of the variational solution of Partial Differential Equations (see [5,15]). We focus on the algebraic formulation of Additive Schwarz preconditioners for the solution of general sparse linear systems [2,14]; these are widely used with parallel Krylov subspace solvers.

Given the linear system $Ax = b$, where $A = (a_{ij}) \in \mathbb{R}^{n \times n}$ is a nonsingular sparse matrix with a symmetric non-zero pattern, let $G = (W, E)$ be the adjacency graph of $A$, where $W = \{1, 2, \ldots, n\}$ and $E = \{(i,j) : a_{ij} \neq 0\}$ are the vertex set and the edge set, respectively. Two vertices are neighbours if an edge connects them. A 0-overlap partition of $W$ is just an ordinary partition of the graph, i.e. a set of $m$ disjoint nonempty subsets $W_i^0 \subset W$ such that $\cup_{i=1}^{m} W_i^0 = W$. A $\delta$-overlap partition of $W$ with $\delta > 0$ can be defined recursively by considering the sets $W_i^\delta \supset W_i^{\delta-1}$ obtained by including the vertices that are neighbours of the vertices in $W_i^{\delta-1}$. Let $n_i^\delta$ be the size of $W_i^\delta$ and $R_i^\delta$ the $n_i^\delta \times n$ matrix formed by the row vectors $e_j^T$ of the $n \times n$ identity matrix, with $j \in W_i^\delta$. For each $v \in \mathbb{R}^n$, $R_i^\delta v$ is the vector containing the components of $v$ corresponding to the vertices in $W_i^\delta$; hence $R_i^\delta$ can be viewed as a restriction operator from $\mathbb{R}^n$ to $\mathbb{R}^{n_i^\delta}$. Likewise, the transpose matrix $(R_i^\delta)^T$ can be viewed as a prolongation operator from $\mathbb{R}^{n_i^\delta}$ to $\mathbb{R}^n$. The Additive Schwarz (AS) preconditioner, $M_{AS}$, is then defined by

$$M_{AS}^{-1} = \sum_{i=1}^{m} (R_i^\delta)^T (A_i^\delta)^{-1} R_i^\delta,$$

where the $n_i^\delta \times n_i^\delta$ matrix $A_i^\delta = R_i^\delta A (R_i^\delta)^T$ is obtained by considering the rows and columns of $A$ corresponding to the vertices in $W_i^\delta$. When $\delta = 0$, $M_{AS}$ is the well-known block Jacobi preconditioner.
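For concreteness, the following Python/SciPy sketch grows a $\delta$-overlap partition from a 0-overlap partition using only the sparsity pattern of $A$, following the recursive definition above. It is an illustrative toy, not PSBLAS code, and the function name overlap_partition is an invention of this sketch.

# Sketch of growing W_i^delta from W_i^0 on the adjacency graph of A.
import numpy as np
import scipy.sparse as sp

def overlap_partition(A, W0, delta):
    """Grow each W_i^0 into W_i^delta by repeatedly adding neighbouring vertices."""
    G = (abs(A) > 0).astype(int)            # adjacency pattern of A (symmetric)
    n = A.shape[0]
    sets = []
    for Wi in W0:
        mask = np.zeros(n, dtype=bool)
        mask[np.asarray(Wi)] = True
        for _ in range(delta):
            # vertices reachable through one edge from the current set
            mask |= (G @ mask.astype(int)) > 0
        sets.append(np.flatnonzero(mask))
    return sets

# Tiny example: 1D Laplacian on 8 vertices, split into two halves (0-overlap).
n = 8
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
W0 = [np.arange(0, 4), np.arange(4, 8)]
print([w.tolist() for w in overlap_partition(A, W0, delta=1)])
# expected: [[0, 1, 2, 3, 4], [3, 4, 5, 6, 7]]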
The convergence theory for the AS preconditioner is well developed in the case of symmetric positive definite matrices (see [5,15] and references therein). Roughly speaking, when the AS preconditioner is used in conjunction with a Krylov subspace method, the convergence rapidly improves as the overlap $\delta$ increases, while it deteriorates as the number $m$ of subsets $W_i^\delta$ grows. Theoretical results are available also in the case of nonsymmetric and indefinite problems (see [4,15]). Recently some variants of the AS preconditioner have been developed that outperform the classical AS for a large class of matrices, in terms both of convergence rate and of computation and communication time on parallel distributed-memory computers [3,9,12]. In particular, the Restricted Additive Schwarz (RAS) preconditioner, $M_{RAS}$, and the Additive Schwarz preconditioner with Harmonic extension (ASH), $M_{ASH}$, are defined by

$$M_{RAS}^{-1} = \sum_{i=1}^{m} (\tilde{R}_i^0)^T (A_i^\delta)^{-1} R_i^\delta, \qquad M_{ASH}^{-1} = \sum_{i=1}^{m} (R_i^\delta)^T (A_i^\delta)^{-1} \tilde{R}_i^0,$$

where $\tilde{R}_i^0$ is the $n_i^\delta \times n$ matrix obtained by zeroing the rows of $R_i^\delta$ corresponding to the vertices in $W_i^\delta \setminus W_i^0$.

The application of the AS preconditioner requires the solution of $m$ independent linear systems of the form

$$A_i^\delta w_i = R_i^\delta v \qquad (1)$$

and then the computation of the sum

$$\sum_{i=1}^{m} (R_i^\delta)^T w_i. \qquad (2)$$

In the RAS preconditioner, $R_i^\delta$ in (2) is replaced by $\tilde{R}_i^0$; hence, in a parallel implementation where each processor holds the rows of $A$ with indices in $W_i^0$ and the corresponding components of the right-hand side and solution vectors, this sum does not require any communication. Analogously, in the ASH preconditioner, $R_i^\delta$ in equation (1) is replaced by $\tilde{R}_i^0$; therefore, the computation of the right-hand side does not involve any data exchange among processors. In the applications, the exact solution of system (1) is often prohibitively expensive. Thus, it is customary to substitute the matrix $(A_i^\delta)^{-1}$ with an approximation $(K_i^\delta)^{-1}$, computed by incomplete factorizations, such as ILU, or by iterative methods, such as SSOR or Multigrid (see [5]).
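The following NumPy sketch applies $M_{AS}^{-1}$ and $M_{RAS}^{-1}$ once to a vector, with index sets standing in for the restriction matrices $R_i^\delta$ and exact local solves standing in for the approximations $K_i^\delta$. It is a dense toy for illustration only, not the PSBLAS implementation discussed next.

# Sketch (not PSBLAS) of one application of the AS and RAS preconditioners.
import numpy as np

def apply_as(A, v, W_delta, W0, restricted=False):
    """Return M^{-1} v for AS (restricted=False) or RAS (restricted=True)."""
    z = np.zeros_like(v)
    for Wd, W0i in zip(W_delta, W0):
        Ai = A[np.ix_(Wd, Wd)]            # A_i^delta = R_i^delta A (R_i^delta)^T
        wi = np.linalg.solve(Ai, v[Wd])   # local solve on the overlapped block
        if restricted:
            # prolongate with R~_i^0: keep only the non-overlapped components
            keep = np.isin(Wd, W0i)
            z[np.asarray(Wd)[keep]] += wi[keep]
        else:
            z[Wd] += wi                   # standard AS prolongation and sum
    return z

# Small SPD example: 1D Laplacian, two subdomains with 1-overlap.
n = 8
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
W0      = [np.arange(0, 4), np.arange(4, 8)]
W_delta = [np.arange(0, 5), np.arange(3, 8)]
v = np.ones(n)
print(apply_as(A, v, W_delta, W0))                  # AS
print(apply_as(A, v, W_delta, W0, restricted=True)) # RAS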
3 Building and Applying AS Preconditioners in PSBLAS
We now review the basic operations involved in the Additive Schwarz preconditioners from the point of view of parallel implementation through PSBLAS routines. We begin by drawing a distinction between:

Preconditioner Setup: the set of basic operations needed to identify $W_i^\delta$, to build $A_i^\delta$ from $A$, and to compute $K_i^\delta$ from $A_i^\delta$;

Preconditioner Application: the set of basic operations needed to apply the restriction operator $R_i^\delta$ to a given vector $v$, to compute (an approximation of) $w_i$ by applying $(K_i^\delta)^{-1}$ to the restricted vector, and, finally, to obtain sum (2).
Before showing how PSBLAS has been extended to implement the Additive Schwarz preconditioners described in Section 2, we briefly describe the main data structures involved in the PSBLAS library. A general row-block distribution of sparse matrices is supported by PSBLAS, with conformal distribution of dense vectors.

Sparse Matrix: a Fortran 95 derived data type, called D_SPMAT; it includes all the information about the local part of a distributed sparse matrix and its storage mode, following the Fortran 95 implementation of a sparse matrix in the sparse BLAS standard of [7].

Communication Descriptor: a Fortran 95 derived data type, DESC_TYPE; it contains a representation of the sets of indices involved in the parallel data distribution, including the 1-overlap indices, i.e. the set $W_i^1 \setminus W_i^0$, which is preprocessed for the data exchange needed in sparse matrix-vector products.

Existing PSBLAS computational routines implement the operations needed for the application phase of AS preconditioners, provided that a representation of the $\delta$-partition is built and packaged into a new suitable data structure during the phase of preconditioner setup. The next two sections are devoted to these issues.

3.1 PSBLAS Implementation of Preconditioner Application
To compute the right-hand side in (1), the restriction operator $R_i^\delta$ must be applied to a vector $v$ distributed among parallel processes conformally to the sparse matrix $A$. On each process the action of $R_i^\delta$ corresponds to gathering the entries of $v$ with indices belonging to the set $W_i^\delta \setminus W_i^0$. This is the semantics of the PSBLAS F90_PSHALO routine, which updates the halo components of a vector, i.e. the components corresponding to the 1-overlap indices. The same code can apply an arbitrary $\delta$-overlap operator, if a suitable auxiliary descriptor data structure is provided. Likewise, the computation of sum (2) can be implemented through a suitable call to the PSBLAS computational routine F90_PSOVRL; this routine can compute either the sum, the average or the square root of the average of the vector entries that are replicated in different processes, according to an appropriate descriptor. Finally, the computation of $(K_i^\delta)^{-1} v_i^\delta$, where $v_i^\delta = R_i^\delta v$ or $v_i^\delta = \tilde{R}_i^0 v$, can be accomplished by two calls to the sparse block triangular solve routine F90_PSSPSM, given a local (incomplete) factorization of $A_i^\delta$. Therefore, the functionalities needed to implement the application phase of the AS, RAS and ASH preconditioners, in the routine F90_ASMAPPLY, are provided by existing computational PSBLAS routines, if an auxiliary descriptor DESC_DATA is built. Thus, the main effort in implementing the preconditioners lies in the definition of a preconditioner data structure and of routines for the preconditioner setup phase, as discussed in Section 3.2.

3.2 PSBLAS Implementation of Preconditioner Setup
To implement the application of AS preconditioners we defined a data structure PREC_DATA that includes in a single entity all the items involved in the application of the preconditioner:
TYPE PREC_DATA
  INTEGER                   :: PREC
  INTEGER                   :: N_OVR
  TYPE(D_SPMAT)             :: L, U
  REAL(KIND(1.D0)), POINTER :: D(:)
  TYPE(DESC_TYPE)           :: DESC_DATA
END TYPE PREC_DATA

Fig. 1. Preconditioner data type
– a preconditioner identifier, PREC, and the number $\delta$ of overlap layers, N_OVR;
– two sparse matrices L and U, holding the lower and upper triangular factors of $K_i^\delta$ (the diagonal of the upper factor is stored in a separate array, D);
– the auxiliary descriptor DESC_DATA, built from the sparse matrix $A$ according to N_OVR.

This data structure has the Fortran 95 definition shown in Figure 1. Note that the sparse matrix descriptor is kept separate from the preconditioner data; with this choice the sparse matrix operations needed to implement a Krylov solver are independent of the choice of the preconditioner. The algorithm to set up an instance P of the PREC_DATA structure for AS, RAS or ASH, with overlap N_OVR, is outlined in Figure 2. By definition the submatrices $A_i^0$ identify the vertices in $W_i^1$; the relevant indices are stored into the initial communication descriptor. Given the set $W_i^1$, we may request the column indices of the non-zero entries in the rows corresponding to $W_i^1 \setminus W_i^0$; these in turn identify the set $W_i^2$, and so on. All the communication is performed in steps 4.2 and 6, while the other steps are performed locally by each process. A new auxiliary routine, F90_DCSRSETUP, has been developed to execute steps 1–6. To compute the triangular factors of $K_i^\delta$ (step 7), the existing PSBLAS routine F90_DCSRLU, performing an ILU(0) factorization of $A_i^\delta$, is currently used. The two previous routines have been wrapped into a single PSBLAS application routine, named F90_ASMBUILD. It would be possible to build the matrices $A_i^\delta$ while building the auxiliary descriptor DESC_DATA. Instead, we separated the two phases, thus providing the capability to reuse the DESC_DATA component of an already computed preconditioner; this allows efficient handling of common application situations where we have to solve multiple linear systems with the same structure.
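The design choice just described (symbolic setup reusable across matrices with the same pattern, numeric factorization redone for each new matrix) can be illustrated with a deliberately simplified sketch; the class below is a hypothetical block Jacobi ($\delta = 0$) stand-in written in Python, not the PSBLAS code.

# Sketch (assumption: simplified, not the PSBLAS design) of separating the
# pattern-only setup from the numeric factorization so the former can be reused.
import numpy as np

class BlockJacobiPrec:
    def __init__(self, blocks):
        self.blocks = blocks          # "descriptor": index sets, pattern-only work
        self.factors = None

    def factor(self, A):              # numeric phase, repeated for each new A
        self.factors = [np.linalg.cholesky(A[np.ix_(b, b)]) for b in self.blocks]

    def apply(self, v):
        z = np.zeros_like(v)
        for b, L in zip(self.blocks, self.factors):
            y = np.linalg.solve(L, v[b])
            z[b] = np.linalg.solve(L.T, y)
        return z

# Reuse the same index sets for two SPD matrices with the same pattern.
n = 6
A1 = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
A2 = 4 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
prec = BlockJacobiPrec([np.arange(0, 3), np.arange(3, 6)])
for A in (A1, A2):
    prec.factor(A)                    # only the numeric phase is redone
    print(prec.apply(np.ones(n)))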
1. Initialize the descriptor P%DESC_DATA by copying the matrix descriptor DESC_A.
2. Initialize the overlap level: $\eta = 0$.
3. Initialize the local vertex set, $W_i^\eta = W_i^0$, based on the current descriptor.
4. While ( $\eta$ < N_OVR )
   4.1 Increase the overlap: $\eta = \eta + 1$.
   4.2 Build $W_i^\eta$ from $W_i^{\eta-1}$, by adding the halo indices of $A_i^{\eta-1}$, and exchange with the other processors the column indices of the non-zero entries in the rows corresponding to $W_i^\eta \setminus W_i^{\eta-1}$.
   4.3 Compute the halo indices of $A_i^\eta$ and store them into P%DESC_DATA.
   Endwhile
5. If ( N_OVR > 0 ) Optimize the descriptor P%DESC_DATA and store it in its final format.
6. Build the enlarged matrix $A_i^\delta$, by exchanging rows with the other processors.
7. Compute the triangular factors of the approximation $K_i^\delta$ of $A_i^\delta$.

Fig. 2. Preconditioner setup algorithm

Fig. 3. Sparsity patterns of the test matrices: kivap1 (n = 86304, nnz = 1575568) and kivap2 (n = 56904, nnz = 1028800)

4 Numerical Experiments
The PSBLAS-based Additive Schwarz preconditioners, coupled with a BiCGSTAB Krylov solver built on top of PSBLAS, were tested on a set of matrices from automotive engine simulations. These matrices arise from the pressure correction equation in the implicit phase of a semi-implicit algorithm (ICED-ALE [13]) for the solution of unsteady compressible Navier-Stokes equations, implemented in the KIVA application software, as modified
in [11]. The test case is a simulation of a commercial direct injection diesel engine featuring a bowl-shaped piston, with a mesh containing approximately 100000 control volumes; during the simulation, mesh layers are activated/deactivated following the piston movement. We show measurements for two matrices, kivap1 and kivap2, corresponding to two different simulation stages. They have sizes 86304 and 56904, respectively, with symmetric sparsity patterns and up to 19 non-zeroes per row (see Figure 3).

Numerical experiments were carried out on a Linux cluster with 16 PCs connected via a Fast Ethernet switch, at the Department of Mathematics of the Second University of Naples. Each PC has a 600 MHz Pentium III processor, a memory of 256 MB and an L2 cache of 256 KB. PSBLAS was installed on top of the BLAS reference implementation, BLACS 1.1 and mpich 1.2.5, using the gcc C compiler, version 2.96, and the Intel Fortran compiler, version 7.1. All the tests were performed using a row-block distribution of the matrices. Right-hand side and solution vectors were distributed conformally.
Table 1. Iteration counts of RAS-preconditioned BiCGSTAB

                      kivap1                                    kivap2
          PSBLAS               PETSc                PSBLAS               PETSc
 np   δ=0 δ=1 δ=2 δ=4    δ=0 δ=1 δ=2 δ=4     δ=0 δ=1 δ=2 δ=4    δ=0 δ=1 δ=2 δ=4
  1    12  12  12  12     12  12  12  12      33  33  33  33     33  33  33  33
  2    16  13  11  12     16  12  12  12      50  34  33  34     46  34  32  32
  4    16  13  12  13     16  12  11  12      47  36  34  34     50  35  32  31
  6    18  13  13  12     18  12  12  12      48  37  38  34     52  35  34  29
  8    20  14  14  12     20  12  13  12      50  37  37  35     53  36  32  32
 10    19  13  12  13     19  13  12  12      54  38  35  34     51  33  32  33
 12    21  14  12  12     21  13  13  12      54  34  38  33     53  35  32  33
 14    20  14  13  14     20  12  13  12      51  36  38  33     52  35  32  32
 16    22  15  14  13     22  13  12  12      58  37  38  36     59  38  32  30
The number $m$ of vertex sets $W_i^\delta$ used by the preconditioner was set equal to the number of processors; the right-hand side and the initial approximation of the solution of each linear system were set equal to the unit vector and the null vector, respectively. The preconditioners were applied as right preconditioners. The BiCGSTAB iterations were stopped when the 2-norm of the residual was reduced by a factor of $10^{-6}$ with respect to the initial residual, with a maximum of 500 iterations. We also compared the PSBLAS implementation of the RAS-preconditioned BiCGSTAB solver with the corresponding implementation provided by the well-known PETSc library [1], on the same test problems and with the same stopping criterion. The experiments were performed using PETSc 2.1.6, compiled with gcc 2.96 and installed on top of mpich 1.2.5 and of the BLAS and LAPACK implementations provided with the Linux Red Hat 7.1 distribution. We report performance results with RAS; this was in general the most effective Additive Schwarz variant, in accordance with the literature cited in Section 2.

In Table 1 we show the iteration counts of RAS-preconditioned BiCGSTAB for PSBLAS and PETSc on both test problems, varying the number np of processors and the overlap $\delta$. We see that the number of iterations significantly decreases in passing from a 0-overlap to a 1-overlap partition, especially for kivap2, while it does not change significantly with a further growth of the overlap. We note also that, for $\delta > 0$, the number of iterations is almost stable as the number of processes increases. This behaviour appears to be in agreement with the sparsity pattern of the test matrices. The number of iterations of PSBLAS is generally comparable with that of PETSc. For kivap2, in a few cases a difference of 5 or 6 iterations is observed, which is due to some differences in the row-block distribution of the matrix and in the row ordering of the enlarged matrices $A_i^\delta$ used by the two solvers.

In Figure 4 we show the execution times of the RAS-preconditioned BiCGSTAB on the selected test cases, varying the number of processors. These times also include the preconditioner setup times. As expected, the time usually decreases as the number of processors increases; a slight time increase can be observed in a few cases, which can be mainly attributed to the row-block distribution of the matrices.
Fig. 4. Execution times of PSBLAS RAS-preconditioned BiCGSTAB (total time in seconds vs. number of processors np, for N_OVR = 0, 1, 2, 4; left panel: kivap1, right panel: kivap2)
A more regular behaviour is expected by applying suitable graph partitioning algorithms to the initial data distribution. For kivap1 the times are generally lower for overlap 0. Indeed, the small decrease in the number of BiCGSTAB iterations as the overlap grows is not able to balance the cost of the preconditioner setup phase. For kivap2, the reduction of the number of iterations obtained with overlap 1 is able to compensate for the setup cost, thus leading to the smallest execution times. Finally, in Figure 5 we compare the execution times of PSBLAS and PETSc. The performance of the two solvers is generally comparable. PSBLAS is always faster on a small number of processors, whereas for higher levels of overlap (2 and 4) PETSc requires less execution time as the number of processors increases. A more detailed analysis has shown that this behaviour is due to the smaller preconditioner setup time of PETSc. This issue is currently under investigation, taking into account the different choices implemented by PSBLAS and PETSc in the setup of the preconditioners.
5 Conclusions and Future Work
We presented some results of an ongoing activity devoted to updating and extending PSBLAS, a parallel library providing basic Linear Algebra operations needed to build iterative sparse linear system solvers on distributed-memory parallel computers. Motivations for our work come from the flexibility and effectiveness shown by PSBLAS in restructuring and parallelizing a legacy CFD code [11], still widely used in the automotive engine application world, and also from the appearance of a new proposal for the serial Sparse BLAS standard [8]. In this paper we focused on the extension of PSBLAS to implement different versions of Additive Schwarz preconditioners. On test problems from automotive engine simulations the preconditioners showed performance comparable with that of other well-established software. We are currently working on design and implementation issues concerning the addition of a coarse-grid solution step to the basic Additive Schwarz preconditioners, in order to build a two-level Schwarz preconditioning module.
Fig. 5. Comparison of execution times of PSBLAS and PETSc (total time in seconds vs. number of processors np; one panel for each overlap level N_OVR = 0, 1, 2, 4; curves for kivap1 and kivap2 with each solver)
References
1. S. Balay, K. Buschelman, W. D. Gropp, D. Kaushik, M. Knepley, L. Curfman McInnes, B. F. Smith, H. Zhang. PETSc Users Manual. Tech. Rep. ANL-95/11, Revision 2.1.6, Argonne National Laboratory, August 2003.
2. X. C. Cai, Y. Saad. Overlapping Domain Decomposition Algorithms for General Sparse Matrices. Num. Lin. Alg. with Applics., 3(3):221–237, 1996.
3. X. C. Cai, M. Sarkis. A Restricted Additive Schwarz Preconditioner for General Sparse Linear Systems. SIAM J. Sci. Comput., 21(2):792–797, 1999.
4. X. C. Cai, O. B. Widlund. Domain Decomposition Algorithms for Indefinite Elliptic Problems. SIAM J. Sci. Stat. Comput., 13(1):243–258, 1992.
5. T. Chan, T. Mathew. Domain Decomposition Algorithms. In A. Iserles, editor, Acta Numerica 1994, pages 61–143, 1994. Cambridge University Press.
6. J. J. Dongarra, R. C. Whaley. A User's Guide to the BLACS v. 1.1. LAPACK Working Note 94, Tech. Rep. UT-CS-95-281, University of Tennessee, March 1995 (updated May 1997).
7. I. Duff, M. Marrone, G. Radicati, C. Vittoli. Level 3 Basic Linear Algebra Subprograms for Sparse Matrices: a User Level Interface. ACM Trans. Math. Softw., 23(3):379–401, 1997.
8. I. Duff, M. Heroux, R. Pozo. An Overview of the Sparse Basic Linear Algebra Subprograms: the New Standard from the BLAS Technical Forum. ACM Trans. Math. Softw., 28(2):239–267, 2002.
9. E. Efstathiou, J. G. Gander. Why Restricted Additive Schwarz Converges Faster than Additive Schwarz. BIT Numerical Mathematics, 43:945–959, 2003.
10. S. Filippone, M. Colajanni. PSBLAS: A Library for Parallel Linear Algebra Computation on Sparse Matrices. ACM Trans. Math. Softw., 26(4):527–550, 2000.
11. S. Filippone, P. D'Ambra, M. Colajanni. Using a Parallel Library of Sparse Linear Algebra in a Fluid Dynamics Applications Code on Linux Clusters. In G. Joubert, A. Murli, F. Peters, M. Vanneschi, editors, Parallel Computing - Advances & Current Issues, pages 441–448, 2002. Imperial College Press.
12. A. Frommer, D. B. Szyld. An Algebraic Convergence Theory for Restricted Additive Schwarz Methods Using Weighted Max Norms. SIAM J. Num. Anal., 39(2):463–479, 2001.
13. C. W. Hirt, A. A. Amsden, J. L. Cook. An Arbitrary Lagrangian-Eulerian Computing Method for All Flow Speeds. J. Comp. Phys., 14:227–253, 1974.
14. Y. Saad. Iterative Methods for Sparse Linear Systems. SIAM, 2nd edition, 2003.
15. B. Smith, P. Bjorstad, W. Gropp. Domain Decomposition: Parallel Multilevel Methods for Elliptic Partial Differential Equations. Cambridge University Press, 1996.
A Direct Orthogonal Sparse Static Methodology for a Finite Continuation Hybrid LP Solver

Pablo Guerrero-García and Ángel Santos-Palomo

Department of Applied Mathematics, Complejo Tecnológico, University of Málaga, Campus de Teatinos s/n, 29071 Málaga, Spain, {pablito,santos}@ctima.uma.es
Abstract. A finite continuation method for solving linear programs (LPs) has been recently put forward by K. Madsen and H. B. Nielsen which, to improve its performance, can be thought of as a Phase-I for a non-simplex active-set method (also known as a basis-deficiency-allowing simplex variation); this avoids having to start the simplex method from a highly degenerate square basis. An efficient sparse implementation of this combined hybrid approach to solve LPs requires the use of the same sparse data structure in both phases, and a way to proceed in Phase-II when a non-square working matrix is obtained after Phase-I. In this paper a direct sparse orthogonalization methodology based on Givens rotations and a static sparsity data structure is proposed for both phases, with a LINPACK-like downdating without resorting to hyperbolic rotations. Its sparse implementation (recently put forward by us) is of reduced-gradient type, regularization is not used in Phase-II, and occasional refactorizations can take advantage of row orderings and parallelizability issues to decrease the computational effort.

Keywords: linear programming, sparse direct methods, orthogonalization, Givens rotations, static sparsity structure, sparse update & downdate, non-simplex active-set methods, basis-deficiency-allowing simplex variations, Phase-I, finite continuation hybrid method, reduced-gradient, non-square working matrix.
1 Introduction
In this paper a two-phase hybrid method for solving linear programs (LPs) like that described by Nielsen in [16, §5.5] is considered, the main difference being that a non-simplex active-set method [18] (also known as a basis-deficiency-allowing simplex variation, cf. [17]) is used in Phase-II instead of the simplex method, to avoid having to start it from a highly degenerate square basis. This proposed linear programming (LP) framework is briefly described in §2.

The main goal of the paper is to develop a useful sparse implementation methodology which could be taken as the stepping stone to a robust sparse implementation of this LP method, one able to take advantage of parallel architectures when refactorizations of the working matrix have to be done; the latter sections are devoted to describing the sparse orthogonal factorizations used to accomplish such a goal. Hence a static data structure comes into play, and after describing in §3 the original technique used in Phase-II, the modification of this technique to be used also in Phase-I is given in §4.
The key point is that the same sparse static data structure is then used in all phases. Additional features are included in §5.

The notation used here is as follows. Let us consider the usual unsymmetric primal-dual pair of LPs, using a non-standard notation (we have deliberately exchanged the usual roles of $b$ and $c$, $x$ and $y$, $n$ and $m$, and $(P)$ and $(D)$, in a clear resemblance with the work of Madsen, Nielsen and Pinar [8, p. 614]):

$$(P)\quad \min\ \ell(x) = c^T x,\ x \in \mathbb{R}^n, \quad \text{s.t. } A^T x \geq b,$$
$$(D)\quad \max\ L(y) = b^T y,\ y \in \mathbb{R}^m, \quad \text{s.t. } Ay = c,\ y \geq O,$$

where $A \in \mathbb{R}^{n \times m}$ with $m \geq n$ and $\mathrm{rank}(A) = n$. We denote with $F$ and $G$ the feasible regions of $(P)$ and $(D)$, respectively, and with $A_k$ the current working set matrix formed by a subset of the columns of the fixed matrix $A$. For example, the sparse structure of the $A$ matrix of the AFIRO problem from NETLIB is shown in the left part of Figure 1, with $n = 27$ and $m = 51$; moreover, the working set matrix $A_{25}$ corresponding to the working set $B^{(25)}$ in iteration 25 could be

$$A_{25} = A(:, B^{(25)}) = A(:, [9\ 16\ 18\ 22\ 29\ 40\ 43\ 44\ 50\ 51]), \qquad (1)$$

where MATLAB notation has been used. We shall indicate with $m_k$ the number of elements of the ordered basic set $B^{(k)}$, with $N^{(k)} = [1:m] \setminus B^{(k)}$, and with $A_k \in \mathbb{R}^{n \times m_k}$ and $N_k \in \mathbb{R}^{n \times (m-m_k)}$ the submatrices of $A$ formed with the columns included in $B^{(k)}$ and $N^{(k)}$, respectively; $m_k \leq n$ and $\mathrm{rank}(A_k) = m_k$ are only required in Phase-II. As usual, the $i$th column of $A$ is denoted by $a_i$, and consistently the matrix whose $i$th row is $a_i^T$ is denoted by $A^T$. We shall indicate with $Z_k \in \mathbb{R}^{n \times (n-m_k)}$ a matrix whose columns form a basis (not necessarily orthonormal) of the null space of $A_k^T$. The superindex $(k)$ will be omitted when it is clear from context.
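As a small illustration of this notation, the following Python/NumPy lines mimic the MATLAB expression A(:, B^(25)) on a placeholder matrix of the AFIRO sizes; the random matrix is only a stand-in, since the actual AFIRO data is not reproduced here, and 1-based set indices become 0-based in this analogue.

# Notation sketch only: extracting the working set matrix A_k = A(:, B^(k)).
import numpy as np

rng = np.random.default_rng(6)
n, m = 27, 51                                    # sizes quoted for AFIRO
A = rng.standard_normal((n, m))                  # placeholder for the real AFIRO matrix
B25 = np.array([9, 16, 18, 22, 29, 40, 43, 44, 50, 51]) - 1   # B^(25), shifted to 0-based
A25 = A[:, B25]                                  # A_25 = A(:, B^(25))
N25 = np.setdiff1d(np.arange(m), B25)            # N^(25) = [1:m] \ B^(25)
print(A25.shape, len(N25))                       # (27, 10) and 41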
2 Proposed LP Framework
A two-phase hybrid method like that described by Nielsen in [16, §5.5] is considered, where Phase-I is based on computing, by Newton's method, a minimizer of a piecewise quadratic; see [15].
Fig. 1. Sparse structures. Left: the A matrix of the AFIRO problem from the NETLIB collection (nz = 102). Right: the Cholesky factor of P A_25 A_25^T P^T, with A_25 = A(:,[9,16,18,22,29,40,43,44,50,51]), on top of that of P A A^T P^T (nz = 19).
The dual feasibility is maintained during his Phase-I, as pointed out by Nielsen [16, p. 90]; then a crossover to obtain a vertex of $G$ is accomplished, and finally a non-simplex active-set method is employed to obtain primal feasibility (and hence optimality, as dual feasibility is maintained through both the crossover and Phase-II). Nielsen [16, pp. 97–98] has shown that all we have to do during this Phase-I is to devise an efficient way to deal with a sequence of sparse linear systems of equations whose coefficient matrix is $\sigma I_n + A_k A_k^T$ (constant $\sigma$), which is positive definite.

Once a $y \in G$ is obtained, a vertex of $G$ has to be obtained for Phase-II to start with. We resort to the well-known constructive result about the existence of a basic feasible solution for a system of linear equality constraints in nonnegative variables (cf. [12, §3.5.4], [13, §11.4.1.5], [9, §2.1.1]), which is closely related to the first stage of the celebrated crossover procedure of Megiddo [11]. Instead of a Gaussian-like methodology, the constraints determined by Phase-I are first rotated one after another into the sparse static structure described in §3, and a dual ascent direction is then computed by projecting the appropriate subvector of $b$ onto the null space of the current working set. A deletion or an exchange is performed accordingly in the working set, as well as a systematic updating of $y \in G$ after the usual primal min-ratio test. The crucial point to be shown here is that the final (full-column-rank) working matrix after the completion of this crossover is not restricted to be square. As an illustration, let us consider the following crossover examples.

Example 1 [12, p. 122]: $y^{(0)} = [1; 2; 3; 4; 0; 0; 0]$,
$$\begin{bmatrix} b^T & \\ A & c \end{bmatrix} = \begin{bmatrix} 10 & -4 & -6 & -2 & -4 & -8 & -10 & \\ 1 & 1 & 3 & 4 & 0 & 0 & 0 & 28 \\ 0 & -1 & -1 & -2 & 1 & 0 & 0 & -13 \\ 0 & 1 & 1 & 2 & 0 & 1 & 0 & 13 \\ 1 & 0 & 2 & 2 & 0 & 0 & -1 & 15 \end{bmatrix}, \qquad \begin{array}{ccc} k & B^{(k)} & L^{(k)} \\ 0 & \{1,2,3,4\} & -24.0 \\ 1 & \{1,2,4\} & +10.9 \\ 2 & \{1,2\} & +98.0 \end{array}$$
Note that $n = 4$ but Phase-II has to be started with $m_2 = 2$; hence, if we had to proceed with the simplex method as in [16, §5.5], a highly degenerate square basis would arise. This is the reason why we prefer a basis-deficiency-allowing simplex variation like that described at the end of this section.

Example 2 [13, pp. 475–476]: $y^{(0)} = [5; 12; 13; 1; 2; 0; 0]/2$,
$$\begin{bmatrix} b^T & \\ A & c \end{bmatrix} = \begin{bmatrix} 10 & -4 & -6 & -2 & -4 & -8 & -10 & \\ 1 & 0 & 0 & 1 & 0 & 1 & -1 & 3 \\ 0 & 1 & 0 & 0 & -1 & 2 & -1 & 5 \\ 0 & 0 & 1 & -1 & 1 & 1 & -2 & 7 \end{bmatrix}, \qquad \begin{array}{ccc} k & B^{(k)} & L^{(k)} \\ 0 & \{1,2,3,4,5\} & -43.0 \\ 1 & \{1,2,3,5\} & -33.3 \\ 2 & \{1,2,3\} & -32.0 \end{array}$$
In this case n = 3 and m2 = 3, so we proceed with the usual simplex method, although we want to use the same sparse data structure as that used in Phase-I.
Example 3 [12, p. 123, Ex. 3.24a]: $y^{(0)} = [1; 2; 3; 4; 5; 0; 0]$,

$$\begin{bmatrix} b^T & \\ A & c \end{bmatrix} = \begin{bmatrix} -1 & -1 & -1 & -2 & -2 & -2 & -4 & \\ 1 & 1 & 0 & 2 & 1 & 0 & 1 & 16 \\ 0 & 1 & 1 & 1 & 2 & 0 & 5 & 19 \\ 1 & 0 & 1 & 1 & 1 & 2 & 2 & 13 \end{bmatrix}, \qquad \begin{array}{ccc} k & B^{(k)} & L^{(k)} \\ 0 & \{1,2,3,4,5\} & -24.0 \end{array}$$

In this case $n = 3$ and $m_0 = 5$, but we do not have to proceed to Phase-II because optimality has been detected during the crossover.

Example 4 [12, p. 123, Ex. 3.24b]: $y^{(0)} = [1; 2; 3; 4; 5; 0; 0]$,
$$\begin{bmatrix} b^T & \\ A & c \end{bmatrix} = \begin{bmatrix} -1 & -1 & -1 & -2 & -2 & 0 & -4 & \\ 1 & 1 & 0 & 2 & 1 & 0 & 1 & 16 \\ 0 & 1 & 1 & 1 & 2 & 0 & 5 & 19 \\ 1 & 0 & 1 & 1 & 1 & 2 & 2 & 13 \end{bmatrix}, \qquad \begin{array}{ccc} k & B^{(k)} & L^{(k)} \\ 0 & \{1,2,3,4,5\} & -24.0 \\ 1 & \{2,3,4,5,6\} & -23.2 \\ 2 & \{2,4,5,6\} & -22.4 \\ 3 & \{2,5,6\} & -19.0 \end{array}$$
The goal of this example is twofold, because it illustrates the facts that an exchange can take place during the crossover, and that the final working set obtained after the crossover is not necessarily a subset of that obtained after Phase-I.

Finally, the non-simplex active-set method (cf. Santos-Palomo and Guerrero-García [18]) is a primal-feasibility search loop in which a violated constraint is chosen to enter the working set, and a constraint is selected for leaving the working set only when the normal of the entering constraint is in the range space of the current working matrix. Note that an addition or an exchange is thus done in each iteration, whereas an exchange would always be done in the simplex method. A preliminary check of the independence of the constraint to be rotated next is performed to decide whether an addition or an exchange takes place. We omit here the scheme of this Phase-II, and refer the interested reader to [18] for additional details.
3 Original Technique Used in Phase-II
The direct orthogonal sparse methodology that we are about to describe dispenses with an explicit orthogonal (dense) factor because of sparsity considerations. It works on top of a sparse triangular matrix, namely the static structure of the Cholesky factor $R$ of $AA^T$, and a permutation $P \in \mathbb{R}^{n \times n}$ of the rows of $A$ takes place a priori to improve the sparsity of $R$. Denoting with $R_k$ a Cholesky factor of $A_k A_k^T$, a well-known result [4] from sparse linear algebra is that $R_k \subset R$ in a structural sense, namely that we have enough memory already allocated in the static structure, which has been set up a priori, whatever $A_k$ (and consequently $R_k$) is.

As an illustration, the sparse structure of a (transposed) Cholesky factor $R_{25}^T$ of $P A_{25} A_{25}^T P^T$ in (1) is shown in the right part of Figure 1 (big dots) on top of that of $P A A^T P^T$ (small dots). Note that when all the small dots are set to zero, a column-echelon permuted lower trapezoidal matrix would arise if null columns were deleted; the transpose of this matrix is what we have to maintain.
Bearing the static structure previously described in mind, let us show how the implementation of Phase-II is accomplished with reduced-gradient type techniques described by Santos-Palomo and Guerrero-García [19], where the sparse QR factorization of $A_k^T$ is used to maintain a sparse Cholesky factor $R_k$ of $A_k A_k^T$, with $A_k$ now being a full column rank matrix. In particular, there exists an implicitly defined (by the staircase shape of the structure) permutation $\Pi_k^T$ of the columns of $A_k^T$ such that

$$A_k^T \Pi_k^T = Q_k R_k = Q_k \begin{bmatrix} R_i & R_d \end{bmatrix}, \qquad Q_k, R_i \in \mathbb{R}^{m_k \times m_k} \text{ and } R_d \in \mathbb{R}^{m_k \times (n-m_k)},$$

with $R_k$ upper trapezoidal, $R_i$ upper triangular and nonsingular, and $Q_k$ orthogonal. Nevertheless, the permutation $\Pi_k^T$ is not explicitly performed and $Q_k$ is avoided due to sparsity considerations; thus the row-echelon permuted upper trapezoidal matrix $R_k \Pi_k$ is what we have to maintain. As an example, $\Pi_k^T = [e_1, e_3, e_6, e_2, e_4, e_5, e_7]$ in

    qr( [ X X X X X X X ] )   [ X X X ]   [ X X X X X X X ]
        [ X X X X X X X ]   = [ X X X ] . [ O O X X X X X ]
        [ X X X X X X X ]     [ X X X ]   [ O O O O O X X ]
              A_k^T               Q_k          R_k . Pi_k

Note that a reordering of the columns of $A_k$ has no impact on $R_k$, and that $\Pi_k A_k A_k^T \Pi_k^T = R_k^T Q_k^T Q_k R_k = R_k^T R_k$, since $Q_k^T Q_k = I$.
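The identity $\Pi_k A_k A_k^T \Pi_k^T = R_k^T R_k$ can be checked numerically with a small dense sketch (NumPy; the paper's implementation is sparse and never forms $Q_k$ explicitly, and the permutation below is an arbitrary choice made only for illustration).

# Dense numerical check of the QR-based identity; not the sparse static code.
import numpy as np

rng = np.random.default_rng(4)
n, mk = 7, 3
Ak = rng.standard_normal((n, mk))                  # a full-column-rank working matrix
perm = np.array([0, 2, 5, 1, 3, 4, 6])             # arbitrary Pi^T applied to the columns of A_k^T

AkT_P = Ak.T[:, perm]                              # A_k^T Pi^T
Q, Rk = np.linalg.qr(AkT_P)                        # R_k = [R_i  R_d], mk x n upper trapezoidal
Ri, Rd = Rk[:, :mk], Rk[:, mk:]

print(np.allclose(Ri, np.triu(Ri)))                            # R_i is upper triangular
print(np.allclose(AkT_P.T @ AkT_P, Rk.T @ Rk))                 # Pi A_k A_k^T Pi^T = R_k^T R_k
print(np.allclose(AkT_P.T @ AkT_P, (Ak @ Ak.T)[np.ix_(perm, perm)]))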
Adding rows to and deleting rows from $A_k^T$ is allowed, which is respectively known as row updating and downdating the factorization. Therefore, what we have done is to adapt Saunders' techniques for square matrices [20] to matrices with more rows than columns, using the static data structure of George and Heath [4] but allowing a LINPACK-like row downdating on it. A MATLAB toolbox based on this technique has been developed and tested by Guerrero-García and Santos-Palomo [6] in a dual non-simplex active-set framework, and the interested reader can find the details of these $O(n^2)$ row updating and downdating processes in [19].

This sparse orthogonal factorization allows us to obtain a basis of each fundamental subspace involved, because the columns of $R_k^T$ form a basis of $\mathcal{R}(\Pi_k A_k)$ and those of $Z_k$ a basis of $\mathcal{N}(A_k^T)$, with

$$Z_k = \Pi_k^T \begin{bmatrix} Z^T \\ I \end{bmatrix}, \quad \text{where} \quad Z = -R_d^T R_i^{-T}.$$

Thus, the independence check can be done as Björck and Duff [1] did, as follows:
$$v \in \mathcal{R}(A_k) \iff \forall s \in \mathcal{N}(A_k^T),\ v^T s = 0 \iff Z_k^T v = O,$$
$$v \notin \mathcal{R}(A_k) \iff Z_k^T v \neq O \quad (\|Z_k^T v\| > \epsilon),$$
$$\Pi_k v = \begin{bmatrix} v_1 \\ v_2 \end{bmatrix},\ v_1 \in \mathbb{R}^{m_k} \;\Rightarrow\; Z_k^T v = Z v_1 + v_2 = -R_d^T R_i^{-T} v_1 + v_2.$$

Furthermore, a basic solution of $A_k^T x = b_B$ can be obtained by using some special "seminormal equations":

$$A_k A_k^T x = \Pi_k^T R_k^T R_k \Pi_k x = A_k b_B \;\Rightarrow\; R_k^T (R_k (\Pi_k x)) = \Pi_k (A_k b_B),$$

and analogously for $A_k y_B = c$ with $y_B = A_k^T z$, but with $A_k A_k^T z = c$:

$$A_k A_k^T z = \Pi_k^T R_k^T R_k \Pi_k z = c \;\Rightarrow\; R_k^T (R_k (\Pi_k z)) = \Pi_k c.$$

Note that $A_k A_k^T \geq 0$ in these special seminormal equations, whereas ordinary seminormal equations use $A_k^T A_k > 0$; moreover, fixed-precision iterative refinement will improve the results. We do not have to write an ad-hoc routine for trapezoidal solves, because the zero diagonals of $R$ are temporarily set to 1 before a usual triangular solve. The same data structure is used for the crossover, in spite of the fact that $A_k$ does not have full column rank in that case. Compatibility of $A_k^T x = b_B$ is checked at each crossover iteration to decide whether a deletion or an exchange is performed.
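A dense NumPy sketch of the null-space basis and of the independence check described above (identity permutation $\Pi_k = I$ for simplicity; this is an illustration, not the sparse static-structure code).

# Null-space basis Z_k and the independence test Z_k^T v = 0 (dense toy).
import numpy as np

rng = np.random.default_rng(5)
n, mk = 6, 3
Ak = rng.standard_normal((n, mk))
_, Rk = np.linalg.qr(Ak.T)                 # Pi = I here, so A_k^T = Q_k R_k
Ri, Rd = Rk[:, :mk], Rk[:, mk:]

Z = -Rd.T @ np.linalg.inv(Ri).T            # Z = -R_d^T R_i^{-T}
Zk = np.vstack([Z.T, np.eye(n - mk)])      # Z_k = [Z^T; I]  (Pi = I)
print(np.allclose(Ak.T @ Zk, 0.0))         # True: columns of Z_k lie in N(A_k^T)

v_in  = Ak @ rng.standard_normal(mk)       # a vector in R(A_k)
v_out = rng.standard_normal(n)             # generically not in R(A_k)
print(np.linalg.norm(Zk.T @ v_in))         # ~ 0  -> dependent constraint (exchange)
print(np.linalg.norm(Zk.T @ v_out))        # > 0  -> independent constraint (addition)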
4 The Modification Used in Phase-I
The main goal was to use the same sparse static data structure in all phases, so let us describe how we have to proceed during Phase-I. As previously noted, in Phase-I we have to work with the Cholesky factor $R_k$ of $\sigma I_n + A_k A_k^T > 0$ (constant $\sigma > 0$), instead of with the Cholesky factor $R_k$ of $A_k A_k^T \geq 0$. Our proposal relies on the fact that the Cholesky factor of $\sigma I_n + A_k A_k^T$ coincides (barring signs) with the triangular factor of the QR factorization of $[A_k, \sqrt{\sigma} I_n]^T$, for

$$\begin{bmatrix} A_k^T \\ \sqrt{\sigma} I_n \end{bmatrix} = Q_k \begin{bmatrix} R_k \\ O_k \end{bmatrix}, \qquad R_k^T R_k = \begin{bmatrix} R_k \\ O_k \end{bmatrix}^T Q_k^T Q_k \begin{bmatrix} R_k \\ O_k \end{bmatrix} = \begin{bmatrix} A_k, \sqrt{\sigma} I_n \end{bmatrix} \begin{bmatrix} A_k^T \\ \sqrt{\sigma} I_n \end{bmatrix} = \sigma I_n + A_k A_k^T.$$

The presence of the Marquardt-like term $\sigma I_n$ in the expression $\sigma I_n + A_k A_k^T$ to be factorized allows us to dispense with rank requisites on $A_k$, and these diagonal elements do not destroy the predicted static structure for $A_k A_k^T$ described below. Taking advantage of this sparse Marquardt-like regularization has also been done by Guerrero-García and Santos-Palomo [7] in a quite different context.
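The identity above can be verified numerically with a few NumPy lines; this is a dense sketch under arbitrary sizes and an arbitrary $\sigma$, not the sparse static-structure implementation.

# Check: the R factor of qr([A_k^T; sqrt(sigma) I_n]) is, up to signs, the
# Cholesky factor of sigma*I_n + A_k A_k^T (no rank assumptions on A_k needed).
import numpy as np

rng = np.random.default_rng(1)
n, mk, sigma = 5, 3, 0.1
Ak = rng.standard_normal((n, mk))

stacked = np.vstack([Ak.T, np.sqrt(sigma) * np.eye(n)])   # (mk + n) x n
R = np.linalg.qr(stacked, mode="r")                       # n x n triangular factor
C = np.linalg.cholesky(sigma * np.eye(n) + Ak @ Ak.T).T   # upper Cholesky factor

print(np.allclose(np.abs(R), np.abs(C)))   # True: equal up to row signs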
The crucial point is that in Phase-I we also work on top of the static structure of the Cholesky factor $R$ of $AA^T$ (because this structure coincides with that of the Cholesky factor of $\sigma I_n + AA^T$, cf. [10, p. 105]). An initialization from $R = \sqrt{\sigma} I_n$ in the diagonal of the static structure is done at the start of Phase-I; thus each row of $A_k^T$ will be linearly dependent on the rows already present in the structure, and hence a sequence of Givens rotations has to be performed to annihilate all nonzero elements (and eventual fill-ins) of each row of $A_k^T$. Since all rows of $\sqrt{\sigma} I_n$ remain in the structure during Phase-I, downdating is done as described by Santos-Palomo and Guerrero-García [19].

Being able to use one initial ANALYSE phase is important when dealing with sparse matrix aspects. Previous work does not take advantage of this fact [16, p. 99], and a dense implementation [14], based on the work of Fletcher and Powell [3], has been shown to be useful to obtain encouraging results in several continuation algorithms. A sparse updatable implementation of this Phase-I could also be done with the multifrontal orthogonal dynamic approach with hyperbolic downdating recently proposed by Edlund [2], although this would entail the use of regularization techniques in the remaining two phases of the hybrid algorithm.
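The Givens-based row updating referred to here can be illustrated with a dense toy: appending a column $a$ to $A_k$ places the row $a^T$ below the current factor, and a sweep of Givens rotations restores triangular form so that $R_{new}^T R_{new} = R^T R + a a^T$. The helper names below are invented for this sketch, which ignores sparsity and the static structure.

# Dense sketch of updating an upper triangular factor when a new row is added.
import numpy as np

def givens(a, b):
    """c, s such that [[c, s], [-s, c]] @ [a, b]^T = [r, 0]^T."""
    r = np.hypot(a, b)
    return (1.0, 0.0) if r == 0.0 else (a / r, b / r)

def row_update(R, a):
    R, row = R.copy(), a.astype(float).copy()
    for j in range(R.shape[0]):
        c, s = givens(R[j, j], row[j])
        Rj, wj = R[j, j:].copy(), row[j:].copy()
        R[j, j:] = c * Rj + s * wj
        row[j:] = -s * Rj + c * wj       # entry j of the new row is annihilated
    return R

rng = np.random.default_rng(3)
n = 5
R = np.triu(rng.standard_normal((n, n))) + 3 * np.eye(n)   # a well-conditioned factor
a = rng.standard_normal(n)
R_new = row_update(R, a)
print(np.allclose(R_new.T @ R_new, R.T @ R + np.outer(a, a)))   # True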
5 Additional Features
A FORTRAN implementation of the updating process (without the downdating) can be found in the SPARSPAK package, which is currently being translated into C++ as described by George and Liu in [5]. The availability of this SPARSPAK++ package would imply a rapid prototyping of a low-level implementation of the sparse proposal given here, and then meaningful comparative computational results for solving LPs could be obtained. As pointed out by a referee, this new LP approach is condemned to face fierce competition with professional implementations of both simplex and interior-point methods, but being able to update/downdate a factorization on a static data structure is crucial to become competitive in a distributed memory HPC environment.

A decrease of the computational effort when a refactorization (i.e., recomputing the factorization from scratch) has to be done (because of the accumulation of rounding errors in its maintenance) can be accomplished by taking into account a suitable row ordering of $A^T$; note that doing so every iteration would imply a prohibitive $O(n^3)$ cost per iteration. The key point is that the column order of $A_k$ does not affect the density of the Cholesky factor of $\sigma I_n + A_k A_k^T$ nor that of $A_k A_k^T$. Our MATLAB implementation takes advantage of this fact by using the sparse option of the qr command, which is an executable file (originally written in C, no source code available) implementing the updating process (without the downdating) referred to in §3. Moreover, we could also take advantage of the parallelizability of this refactorization.
Acknowledgements The authors gratefully acknowledge the referees for their helpful corrections and remarks, which have improved the quality of the paper.
References
1. Åke Björck and Iain S. Duff. A direct method for the solution of sparse linear least squares problems. Linear Algebra Appl., 34:43–67, 1980.
2. Ove Edlund. A software package for sparse orthogonal factorization and updating. ACM Trans. Math. Software, 28(4):448–482, December 2002.
3. Roger Fletcher and Michael J. D. Powell. On the modification of LDL^T factorizations. Math. Comp., 29:1067–1087, 1974.
4. J. Alan George and Michael T. Heath. Solution of sparse linear least squares problems using Givens rotations. Linear Algebra Appl., 34:69–83, 1980.
5. J. Alan George and Joseph W.-H. Liu. An object-oriented approach to the design of a user interface for a sparse matrix package. SIAM J. Matrix Anal. Applics., 20(4):953–969, 1999.
6. Pablo Guerrero-García and Ángel Santos-Palomo. Solving a sequence of sparse compatible systems. Technical report MA-02-03, Department of Applied Mathematics, University of Málaga, November 2002. Submitted for publication.
7. Pablo Guerrero-García and Ángel Santos-Palomo. A sparse orthogonal factorization technique for certain stiff systems of linear ODEs. In J. Marín and V. Koncar (eds.), Industrial Simulation Conference 2004 Proceedings, pp. 17–19, June 2004.
8. Kaj Madsen, Hans B. Nielsen and Mustafa Ç. Pinar. A new finite continuation algorithm for linear programming. SIAM J. Optim., 6(3):600–616, August 1996.
9. István Maros. Computational Techniques of the Simplex Method. Kluwer Academic Publisher, 2003.
10. Pontus Matstoms. Sparse linear least squares problems in optimization. Comput. Optim. Applics., 7(1):89–110, 1997.
11. Nimrod Megiddo. On finding primal- and dual-optimal bases. ORSA J. Computing, 3(1):63–65, Winter 1991.
12. Katta G. Murty. Linear Programming. John Wiley and Sons, 1983.
13. Katta G. Murty. Linear Complementarity, Linear and Nonlinear Programming. Heldermann, 1988.
14. Hans B. Nielsen. AAFAC: A package of FORTRAN 77 subprograms for solving A^T A x = c. Tech. report, Institute for Numerical Analysis, Technical University of Denmark, Lyngby, Denmark, March 1990.
15. Hans B. Nielsen. Computing a minimizer of a piecewise quadratic. Tech. report, Department of Mathematical Modelling, Technical University of Denmark, Lyngby, Denmark, September 1998.
16. Hans B. Nielsen. Algorithms for linear optimization, 2nd edition. Tech. report, Department of Mathematical Modelling, Technical University of Denmark, Lyngby, Denmark, February 1999.
17. Ping-Qi Pan. A basis-deficiency-allowing variation of the simplex method for linear programming. Computers Math. Applic., 36(3):33–53, August 1998.
18. Ángel Santos-Palomo and Pablo Guerrero-García. A non-simplex active-set method for linear programs in standard form. Stud. Inform. Control, 14(2), June 2005.
19. Ángel Santos-Palomo and Pablo Guerrero-García. Updating and downdating an upper trapezoidal sparse orthogonal factorization. Accepted for publication in IMA J. Numer. Anal.
20. Michael A. Saunders. Large-scale linear programming using the Cholesky factorization. Tech. report CS-TR-72-252, Department of Computer Science, Stanford University, Stanford, CA, January 1972.
Applying Parallel Direct Solver Techniques to Build Robust High Performance Preconditioners
Pascal Hénon¹, François Pellegrini¹, Pierre Ramet¹, Jean Roman¹, and Yousef Saad²
¹ ScAlApplix Project, INRIA Futurs and LaBRI UMR 5800, Université Bordeaux 1, 33405 Talence Cedex, France, {henon,pelegrin,ramet,roman}@labri.fr
² Dept. of Computer Science & Eng., University of Minnesota, 200 Union St. SE, Minneapolis, MN 55155, USA, [email protected]
Abstract. The purpose of our work is to provide a method which exploits the parallel blockwise algorithmic approach used in the framework of high performance sparse direct solvers in order to develop robust preconditioners based on a parallel incomplete factorization. The idea is then to define an adaptive blockwise incomplete factorization that is much more accurate (and numerically more robust) than the scalar incomplete factorizations commonly used to precondition iterative solvers.
1 Introduction Solving large sparse linear systems by iterative methods [18] has often been unsatisfactory when dealing with practical “industrial” problems. The main difficulty encountered by such methods is their lack of robustness and, generally, the unpredictability and inconsistency of their performance when they are used over a wide range of different problems; some methods work quite well for certain types of problems but can fail completely on others. Over the past few years, direct methods have made significant progress thanks to research studies on both the combinatorial analysis of the Gaussian elimination process and the design of parallel block solvers optimized for high-performance computers. It is now possible to solve real-life three-dimensional problems having on the order of several million equations in a very effective way with direct solvers. This is achievable by exploiting superscalar effects of modern processors and taking advantage of computer architectures based on networks of SMP nodes (IBM SP, DEC-Compaq, for example) [1,7,9,10,12]. However, direct methods will still fail to solve very large three-dimensional problems, due to the large amount of memory needed for these cases. Some improvements to the classical scalar incomplete factorization have been studied to reduce the gap between the two classes of methods. In the context of domain decomposition, some algorithms that can be parallelized in an efficient way have been investigated in [14]. In [16], the authors proposed to couple incomplete factorization with a selective inversion to replace the triangular solutions (that are not as scalable as
the factorization) by scalable matrix-vector multiplications. The multifrontal method has also been adapted for incomplete factorization with a threshold dropping in [11] or with a fill level dropping that measures the importance of an entry in terms of its updates [2]. In [3], the authors proposed a block ILU factorization technique for block tridiagonal matrices. Our goal is to provide a method which exploits the parallel blockwise algorithmic approach used in the framework of high performance sparse direct solvers in order to develop robust parallel incomplete-factorization-based preconditioners [18] for iterative solvers. The originality of our work is to use a supernodal algorithm, which allows us to drop some blocks during the elimination process on the quotient graph. Unlike multifrontal approaches, for which parallel threshold-based ILU factorization has been studied, the supernodal method permits building, at low cost, a dense block structure for an ILU(k) factorization with an important amount of fill-in. Indeed, using a dense block formulation is crucial to achieving high performance. The idea is then to define an adaptive blockwise incomplete factorization that is much more accurate (and numerically more robust) than the scalar incomplete factorizations commonly used to precondition iterative solvers. Such incomplete factorizations can take advantage of the latest breakthroughs in sparse direct methods and can therefore be very competitive in terms of CPU time due to the effective usage of CPU power. At the same time this approach does not suffer from the memory limitation encountered by direct methods. Therefore, we can expect to be able to solve systems on the order of a hundred million unknowns on current platforms. Another goal of this paper is to analyze the chosen parameters that can be used to define the block sparse pattern in our incomplete factorization. The remainder of the paper is organized as follows: Section 2 describes the main features of our method. We provide some experiments in Section 3. Finally, we give some conclusions in Section 4.
2 Methodology The driving rationale for this study is that it is easier to incorporate incomplete factorization methods into direct solution software than it is to develop new incomplete factorizations. As a starting point, we can take advantage of the algorithms and the software components of PaStiX [10], which is a parallel high performance supernodal direct solver for sparse symmetric positive definite systems and for sparse unsymmetric systems with a symmetric pattern. These different components are (see Fig. 1):
1. the ordering phase, which computes a symmetric permutation of the initial matrix such that the factorization process will exhibit as much concurrency as possible while incurring low fill-in. In this work, we use a tight coupling of the Nested Dissection and Approximate Minimum Degree algorithms [15];
2. the block symbolic factorization phase, which determines the block data structure of the factorized matrix associated with the partition resulting from the ordering phase. This structure consists of several column-blocks, each of them containing a dense diagonal block and a set of dense rectangular off-diagonal blocks [5];
3. the block repartitioning and scheduling phase, which refines the previous partition by splitting large supernodes in order to exploit concurrency within dense block computations in addition to the parallelism provided by the block elimination tree, both induced by the block computations in the supernodal solver. In this phase, we compute a mapping of the matrix blocks (that can be made by column block (1D) or by block (2D)) and a static optimized scheduling of the computational and communication tasks according to BLAS and communication time models calibrated on the target machine. This static scheduling will drive the parallel factorization and the backward and forward substitutions [8,10].
Fig. 1. Phases for the parallel complete block factorization: Scotch (ordering & amalgamation) → Fax (block symbolic factorization) → Blend (refinement & mapping) → Sopalin (factorization & solution), taking the graph associated with matrix A and the supernodal partition to the block symbolic matrix L, its distributed form, the distributed factorized block matrix L, and the distributed solution
Our approach consists of computing symbolically the block structure of the factors that would have been obtained with a complete factorization, and then deciding to drop some blocks of this structure according to relevant criteria. This incomplete factorization induced by the new sparse pattern is then used in a preconditioned GMRES or Conjugate Gradient solver [18]. Our main goal at this point is to achieve a significant reduction of the memory needed to store the incomplete factors while keeping enough fill-in to make the use of BLAS3 primitives cost-effective. Naturally, we must still have an ordering phase and a mapping and scheduling phase (this phase is modified to suit the incomplete factorization block computation) to ensure an efficient parallel implementation of the block preconditioner computation and of the forward and backward substitutions in the iterations. We then have the new processing chain given in Fig. 2.
Fig. 2. Phases for the parallel incomplete block factorization: Scotch (ordering & amalgamation & refinement) → iFax (block symbolic factorization & dropping) → Blend (mapping & scheduling) → Sopalin (incomplete factorization), producing the distributed incomplete factorized block matrix L used by the CG/GMRES iteration to obtain the distributed solution
The first crucial point is then to find a good initial partition that can be used in the dropping step after the block symbolic factorization. It cannot be the initial supernodal partition computed by the ordering phase (phase 1) because it would be too costly to consider the diagonal blocks as dense blocks like in a complete factorization. Therefore,
we resort to a refined partition of this supernodal partition which will then define the elementary column blocks of the factors. We obtain the refined partition by splitting the column blocks issued from the supernodal partition according to a maximum blocksize parameter. This allows more options for dropping some blocks in the preconditioner. This blocksize parameter plays a key role in finding a good trade-off as described above. An important result from theoretical analysis is that if we consider nested dissection ordering based on separator theorems [13], and if we introduce some asymptotically refined partitions, one can show that the total number of blocks computed by the block symbolic factorization and the time to compute these blocks are quasi-linear (these quantities are linear for the supernodal partition). The second crucial point concerns the various criteria that are used to drop some blocks from the blockwise symbolic structure induced by the refined partition. The dropping criterion we use is based on a generalization of the level-of-fill [18] metric that has been adapted to the elimination process on the quotient graph induced by the refined partition. One of the most common ways to define a preconditioning matrix M is through Incomplete LU (ILU) factorizations. ILU factorizations are obtained from an approximate Gaussian elimination. When Gaussian elimination is applied to a sparse matrix A, a large number of nonzero elements may appear in locations originally occupied by zero elements. These fill-ins are often small elements and may be dropped to obtain approximate LU factorizations. The simplest of these procedures, ILU(0), is obtained by performing the standard LU factorization of A and dropping all fill-in elements that are generated during the process. In other words, the L and U factors have the same pattern as the lower and upper triangular parts of A (respectively). More accurate factorizations denoted by ILU(k) and IC(k) have been defined which drop fill-ins according to their “levels”. Level-1 fill-ins for example are generated from level-zero fill-ins (at most). So, for example, ILU(1) consists of keeping all fill-ins that have level zero or one and dropping any fill-in whose level is higher. We now provide a few details. In level-based ILUs, originally introduced by Watts III [19], a level-of-fill is associated with every entry in the working factors L and U. Conceptually these factors together are the L and U parts of a certain working matrix A which initially contains the original matrix. At the start of the procedure, every zero entry has a level-of-fill of infinity and every nonzero entry has a level-of-fill of zero. Whenever an entry is modified by the standard Gaussian elimination update a_ij := a_ij - a_ik * a_kj / a_kk, its level-of-fill is updated by the formula lev_ij = min{lev_ij, lev_ik + lev_kj + 1}. In practice, these levels are computed in a symbolic phase first and used to define the patterns of L and U a priori. As the level-of-fill increases, the factorization becomes more expensive but more accurate. In general, the robustness of the iterative solver will typically improve as the level-of-fill increases. It is common practice in standard
preconditioned Krylov methods to use a very low level-of-fill, typically no greater than 2. Now it is easy to see what will be the main limitation of this approach: the level-of-fill is based entirely on the adjacency graph of the original matrix, and its definition implicitly assumes some form of diagonal dominance in order for the dropped entries to be indeed small. It cannot work for matrices that are highly indefinite, for example. There are two typical remedies, each with its own limitations. The first is to replace the level-of-fill strategy by one that is based on numerical values. This yields the class of ILUT algorithms [17,18]. Another is to resort to a block algorithm, i.e., one that is based on a block form of Gaussian elimination. Block ILU methods have worked quite well for harder problems, see for example [4,6]. In the standard block-ILU methods, the blocking of the matrix simply represents a grouping of sets of unknowns into one single unit. The simplest situation is when A has a natural block structure inherited from the blocking of unknowns by degrees of freedom at each mesh-point. Extending block ILU to these standard cases is fairly easy. It suffices to consider the quotient graph obtained from the blocking. The situation with which we must deal is more complex because the block partitioning of rows varies with the block columns. An illustration is shown in Figure 3. The main difference between the scalar and the block formulation of the level-of-fill algorithm is that, in general, a block contribution may update only a part of the block (see Figure 3(a)). As a consequence, a block is split according to its level-of-fill modification. Nevertheless, we only allow the splitting along the row dimension in order to preserve an acceptable blocksize (see Figure 3(b)).
Fig. 3. Computation of the block level-of-fill in the block elimination process: (a) the block A_ij is modified as A_ij := A_ij - L_ik·(L_jk)^T, and the contribution may update only part of A_ij; (b) the updated part is split along the row dimension and receives level lev_3 = min(lev, lev_1 + lev_2 + 1)
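To make the scalar level-of-fill rule above concrete, the following sketch computes the symbolic ILU(k) pattern of a small matrix by propagating lev(i,j) = min{lev(i,j), lev(i,k) + lev(k,j) + 1} over a dense level array. It is an illustration only and is not taken from the PaStiX-based implementation described in this paper; the function name and data layout are chosen purely for the example.

#include <vector>
#include <limits>
#include <algorithm>

// Symbolic ILU(k): given the sparsity pattern of a small n x n matrix A
// (true = structural nonzero), return the level of every entry, marking
// as dropped (-1) those whose level exceeds k. Levels start at 0 for
// structural nonzeros and at "infinity" for zeros, and are updated as the
// (symbolic) Gaussian elimination proceeds.
std::vector<std::vector<int>> iluk_levels(
    const std::vector<std::vector<bool>>& pattern, int k) {
  const int n = static_cast<int>(pattern.size());
  const int INF = std::numeric_limits<int>::max() / 2;  // avoid overflow in sums
  std::vector<std::vector<int>> lev(n, std::vector<int>(n, INF));
  for (int i = 0; i < n; ++i)
    for (int j = 0; j < n; ++j)
      if (pattern[i][j]) lev[i][j] = 0;

  // Pivot column kcol updates the trailing submatrix.
  for (int kcol = 0; kcol < n; ++kcol)
    for (int i = kcol + 1; i < n; ++i)
      for (int j = kcol + 1; j < n; ++j)
        lev[i][j] = std::min(lev[i][j], lev[i][kcol] + lev[kcol][j] + 1);

  // Entries with level > k are dropped from the ILU(k) pattern.
  for (int i = 0; i < n; ++i)
    for (int j = 0; j < n; ++j)
      if (lev[i][j] > k) lev[i][j] = -1;
  return lev;
}

The blockwise variant used in this paper applies the same recurrence to the blocks of the quotient graph, splitting a block along its rows when only part of it is updated, as illustrated in Fig. 3.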
As shown in Table 1, two consecutive levels-of-fill can produce a large increase of the fill ratio in the factors (NNZ_A is the number of non-zeros in A, NNZ_L is the number of non-zeros that would have been obtained with a direct factorization, and OPC_L is the number of operations required for the direct factorization).
Table 1. Fill rate for a 47x47x47 3D mesh (finite element, 27 connectivity)

level of fill   % NNZ_L   ×NNZ_A   % OPC_L
≤ 0             9.28      3.53     0.29
≤ 1             23.7      9.02     2.33
≤ 2             38.4      14.6     7.61
≤ 3             54.9      20.9     19.0
≤ 4             66.1      25.2     31.5
≤ 5             75.4      28.7     45.1
≤ 6             83.2      31.6     58.5
...             ...       ...      ...
≤ 15            100       38.1     100
In order to choose intermediate ratios of fill between two levels, we have introduced a second criterion to drop some blocks inside a level-of-fill. We allow choosing a fill ratio according to the formula NNZ_prec = (1 - α)·NNZ_A + α·NNZ_L, where α ∈ [0, 1] and NNZ_prec is the number of non-zeros in the factors for the incomplete factorization. To reach the exact number of non-zeros corresponding to a selected α, we consider the blocks allowed in the fill-in pattern in two steps. If we denote by NNZ_k the number of non-zeros obtained by keeping all the blocks with levels-of-fill ≤ k, then we find the first value λ such that NNZ_λ ≥ NNZ_prec. In a second step, until we reach NNZ_prec, we drop the blocks, among those having a level-of-fill λ, which undergo the fewest updates by previous eliminations.
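A compact way to see this two-step selection is as a sort-and-truncate procedure over the candidate blocks. The sketch below is an illustration only (the Block structure, field names, and update counter are hypothetical and not taken from the implementation described here): it first admits whole levels until the NNZ budget derived from α is reached, then discards blocks of the last admitted level λ in increasing order of how many updates they receive.

#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical description of a candidate off-diagonal block in the
// symbolic structure of the factors.
struct Block {
  int level;         // block level-of-fill
  std::size_t nnz;   // number of scalar non-zeros the block would hold
  int updates;       // number of updates it undergoes during elimination
  bool kept = false;
};

// Keep blocks so that the total kept nnz approaches
// nnz_target = (1 - alpha) * NNZ_A + alpha * NNZ_L.
void select_blocks(std::vector<Block>& blocks, double alpha,
                   std::size_t nnz_A, std::size_t nnz_L) {
  const std::size_t nnz_target =
      static_cast<std::size_t>((1.0 - alpha) * nnz_A + alpha * nnz_L);

  // Step 1: find the first level lambda such that keeping every block of
  // level <= lambda meets the budget, and keep all of those blocks.
  int max_level = 0;
  for (const Block& b : blocks) max_level = std::max(max_level, b.level);
  int lambda = max_level;
  std::size_t kept_nnz = 0;
  for (int lev = 0; lev <= max_level; ++lev) {
    for (const Block& b : blocks)
      if (b.level == lev) kept_nnz += b.nnz;
    if (kept_nnz >= nnz_target) { lambda = lev; break; }
  }
  for (Block& b : blocks) b.kept = (b.level <= lambda);

  // Step 2: among the blocks of level lambda, drop those that undergo the
  // fewest updates by previous eliminations until we are within the budget.
  std::vector<Block*> last_level;
  for (Block& b : blocks)
    if (b.level == lambda) last_level.push_back(&b);
  std::sort(last_level.begin(), last_level.end(),
            [](const Block* a, const Block* b) { return a->updates < b->updates; });
  for (Block* b : last_level) {
    if (kept_nnz <= nnz_target) break;
    b->kept = false;
    kept_nnz -= b->nnz;
  }
}

Here the updates counter plays the role of the number of contributions a block receives during the symbolic block elimination; the actual selection in the paper operates on the block symbolic structure produced by the iFax phase.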
3 Tests In this section, we give an analysis of some first convergence and scalability results for practical large systems. We consider two difficult problems for direct solvers (see Table 2). The AUDI matrix (symmetric) corresponds to an automotive crankshaft model and the MHD1 matrix is a magnetohydrodynamic problem (unsymmetric). The ratio between the number of non-zeros in the complete factor and the number of non-zeros in the initial matrix A is about 31 for the AUDI test case and about 67 for the MHD1 one.

Table 2. Description of our test problems

Name   Columns   NNZ_A      NNZ_L      OPC_L
AUDI   943695    39297771   1.21e+09   5.3e+12
MHD1   485597    24233141   1.62e+09   1.6e+13
Numerical experiments were run on 28 IBM NH2 nodes (16 Power3+ processors per node, 375 MHz, 1.5 Gflops, 16 GB) located at CINES (Montpellier, France), with a network based on a Colony switch. All computations are performed in double precision and all time results are given in seconds. We use the PaStiX software, with recent improvements such as an efficient MPI/thread implementation, to compute the preconditioner. The stopping criterion for GMRES iterations uses the relative residual norm and is set to 10^-7.
Table 3 presents results for different values of the fill rate parameter α for the AUDI problem. The blocksize (for the refined partition) is set to 8 and the results are obtained on 16 processors.

Table 3. AUDI problem with blocksize=8

α     inc.fact.   nb.iter.   time/iter.   ×NNZ_A   % OPC_L   tot.time
0.1   24          429        0.71         3.77     0.51      328.6
0.2   39          293        1.30         6.75     2.91      419.9
0.3   55          279        1.64         9.75     5.97      512.5
0.4   80          144        2.12         12.84    8.32      385.3
0.5   135         49         3.13         15.92    15.63     288.4
0.6   195         36         3.70         18.99    24.10     328.2
For each run, Tables 3 and 4 give the time to compute the incomplete factorization (inc.fact.), the number of iterations (nb.iter.), the time to perform one iteration (time/iter.), the ratio between the number of non-zeros in the incomplete factors and the number of non-zeros in the matrix A (×NNZ_A), the percentage of the number of operations needed to compute the incomplete factorization compared with the number of operations required for the direct factorization (% OPC_L), and the total time (tot.time). The results for a blocksize set to 16 can be found in Table 4. These results can be compared with the time required by the direct solver: on 16 processors, PaStiX needs about 482s to solve the problem, and the solution has an accuracy (relative residual norm) of about 10^-15.

Table 4. AUDI problem with blocksize=16

α     inc.fact.   nb.iter.   time/iter.   ×NNZ_A   % OPC_L   tot.time
0.1   11          214        0.84         5.29     0.97      190.8
0.2   19          196        1.20         6.50     2.16      254.2
0.3   40          177        1.87         9.57     6.82      371.0
0.4   62          186        2.13         12.76    11.93     458.2
0.5   77          140        2.41         15.55    15.57     414.4
0.6   130         42         3.45         19.01    24.39     274.9
As expected, the time for the incomplete factorization and for the iterations increases with the fill rate parameter whereas the number of iterations decreases. We can see that
the best result is obtained with α set to 0.1 and a blocksize set to 16. Thus, we can solve the problem 2.5 times faster than with our direct solver and with only 17.2% of NNZ_L (about 5.3 × NNZ_A). We have also tested higher values of the blocksize (32): block computations are more efficient, but these blocksizes do not allow dropping enough entries to be competitive. For the next results the blocksize is set to 16 and α to 0.1. With such a fill rate parameter, the number of iterations for the MHD1 problem is small (5 iterations to reduce the residual norm by 10^-7). This problem is easier than the AUDI problem, and in that case the advantage of our approach will be less important compared with traditional iterative solvers. Table 5 shows that the run-time scalability is quite good for up to 64 processors for both the incomplete factorization and the iteration phase. We recall that the number of iterations is independent of the number of processors in our approach.

Table 5. Scalability results with α = 0.1 and blocksize=16 (times in seconds)

                        Number of processors
                  1      2      4      8      16     32     64
AUDI inc.fact.    90.9   52.5   29.2   15.7   10.6   5.9    3.3
AUDI time/iter.   10.5   5.56   3.06   1.49   0.84   0.45   0.33
MHD1 inc.fact.    48.1   26.2   15.6   9.1    5.1    3.0    2.2
MHD1 time/iter.   3.82   1.97   1.06   0.61   0.40   0.32   0.25

So on 64 processors, for a relative precision set to 10^-7, the total time is 74s, which is twice as fast as the 152s needed by the direct solver.
4 Conclusion In conclusion, we have shown a methodology for bridging the gap between direct and iterative solvers by blending the best features of both types of techniques: low memory costs from iterative solvers and effective reordering and blocking from direct methods. The approach taken is aimed at producing robust parallel iterative solvers that can handle very large 3-D problems which arise from realistic applications. The preliminary numerical examples shown indicate that the algorithm performs quite well for such problems. Robustness relative to standard (non-block) preconditioners is achieved by extracting more accurate factorizations via higher amounts of fill-in. The effective use of hardware is enabled by a careful block-wise processing of the factorization, which yields fast factorization and preconditioning operations. We plan on performing a wide range of experiments on a variety of problems, in order to validate our approach and to understand its limitations. We also plan on studying better criteria for dropping certain blocks from the blockwise symbolic structure.
Acknowledgment The work of the authors was supported by NSF grant INT-0003274. The work of the last author was also supported by NSF grants ACI-0305120 and by the Minnesota Supercomputing Institute.
References
1. P. R. Amestoy, I. S. Duff, S. Pralet, and C. Vömel. Adapting a parallel sparse direct solver to architectures with clusters of SMPs. Parallel Computing, 29(11-12):1645–1668, 2003.
2. Y. Campbell and T. A. Davis. Incomplete LU factorization: A multifrontal approach. http://www.cise.ufl.edu/~davis/techreports.html
3. T. F. Chang and P. S. Vassilevski. A framework for block ILU factorizations using block-size reduction. Math. Comput., 64, 1995.
4. A. Chapman, Y. Saad, and L. Wigton. High-order ILU preconditioners for CFD problems. Int. J. Numer. Meth. Fluids, 33:767–788, 2000.
5. P. Charrier and J. Roman. Algorithmique et calculs de complexité pour un solveur de type dissections emboîtées. Numerische Mathematik, 55:463–476, 1989.
6. E. Chow and M. A. Heroux. An object-oriented framework for block preconditioning. Technical Report umsi-95-216, Minnesota Supercomputer Institute, University of Minnesota, Minneapolis, MN, 1995.
7. Anshul Gupta. Recent progress in general sparse direct solvers. Lecture Notes in Computer Science, 2073:823–840, 2001.
8. P. Hénon. Distribution des Données et Régulation Statique des Calculs et des Communications pour la Résolution de Grands Systèmes Linéaires Creux par Méthode Directe. PhD thesis, LaBRI, Université Bordeaux I, France, November 2001.
9. P. Hénon, P. Ramet, and J. Roman. PaStiX: A High-Performance Parallel Direct Solver for Sparse Symmetric Definite Systems. Parallel Computing, 28(2):301–321, January 2002.
10. P. Hénon, P. Ramet, and J. Roman. Efficient algorithms for direct resolution of large sparse systems on clusters of SMP nodes. In SIAM Conference on Applied Linear Algebra, Williamsburg, Virginia, USA, July 2003.
11. G. Karypis and V. Kumar. Parallel Threshold-based ILU Factorization. Proceedings of the IEEE/ACM SC97 Conference, 1997.
12. X. S. Li and J. W. Demmel. A scalable sparse direct solver using static pivoting. In Proceedings of the Ninth SIAM Conference on Parallel Processing for Scientific Computing, San Antonio, Texas, March 22-24, 1999.
13. R. J. Lipton, D. J. Rose, and R. E. Tarjan. Generalized nested dissection. SIAM Journal of Numerical Analysis, 16(2):346–358, April 1979.
14. M. Magolu monga Made and A. Van der Vorst. A generalized domain decomposition paradigm for parallel incomplete LU factorization preconditionings. Future Generation Computer Systems, 17(8):925–932, 2001.
15. F. Pellegrini, J. Roman, and P. Amestoy. Hybridizing nested dissection and halo approximate minimum degree for efficient sparse matrix ordering. Concurrency: Practice and Experience, 12:69–84, 2000.
16. P. Raghavan, K. Teranishi, and E. G. Ng. A latency tolerant hybrid sparse solver using incomplete Cholesky factorization. Numer. Linear Algebra, 2003.
17. Y. Saad. ILUT: a dual threshold incomplete ILU factorization. Numerical Linear Algebra with Applications, 1:387–402, 1994.
18. Y. Saad. Iterative Methods for Sparse Linear Systems, Second Edition. SIAM, 2003.
19. J. W. Watts III. A conjugate gradient truncated direct method for the iterative solution of the reservoir simulation pressure equation. Society of Petroleum Engineers Journal, 21:345–353, 1981.
The Design of Trilinos
Michael A. Heroux and Marzio Sala
Sandia National Laboratories, {maherou,msala}@sandia.gov
Abstract. The Trilinos Project is an effort to facilitate the design, development, integration and ongoing support of mathematical software libraries within an object-oriented framework for the solution of large-scale, complex multi-physics engineering and scientific problems. Trilinos is a two-level software structure, designed around a collection of packages. Each package focuses on a particular area of research, such as linear and nonlinear solvers or algebraic preconditioners, and is usually developed by a small team of experts in this particular area of research. Packages exist underneath the Trilinos top level, which provides a common look-and-feel, including configuration, documentation, licensing, and bug-tracking. In this paper we present the Trilinos design and an overview of the Trilinos packages. We discuss package interoperability and interdependence, and the Trilinos software engineering environment for developers. We also discuss how Trilinos facilitates high-quality software engineering practices that are increasingly required of simulation software.
1 Introduction Trilinos [1,2], winner of a 2004 R&D 100 award, is a suite of parallel numerical solver libraries within an object-oriented software framework for the solution of large-scale, complex multi-physics engineering and scientific applications in a production computing environment. The goal of the project is to facilitate the design, development, integration, and ongoing support of mathematical software libraries. The emphasis of Trilinos is on developing robust and scalable algorithms in a software framework. Abstract interfaces are provided for flexible interoperability of packages while providing a full-featured set of concrete classes that implement all abstract interfaces. Furthermore, Trilinos includes an extensive set of tools and processes that support software engineering practices. Specifically, Trilinos provides the following:
– A suite of parallel object-oriented solver libraries for solving linear systems (using iterative methods and providing interfaces to third-party sequential and parallel direct solvers) and nonlinear systems, defining incomplete factorizations, domain decomposition and multi-level preconditioners, and solving eigensystem and time-dependent problems.
– Tools for rapid deployment of new packages.
– A scalable model for integrating new solver capabilities.
– An extensive set of software quality assurance (SQA) and software quality engineering (SQE) processes and tools for developers.
The fundamental concept in Trilinos is the package. Trilinos uses a two-level software structure designed around collections of packages. Each package is a numerical library (or collection of libraries) that is:
– Focused on important, state-of-the-art algorithms in its problem regime;
– Developed by a small team of domain experts;
– Self-contained, with minimal dependencies on other packages;
– Configurable, buildable, tested and documented on its own.
These packages can be distributed within Trilinos or separately. The Trilinos framework provides a common look-and-feel that includes configuration, documentation, licensing, and bug tracking. There are also guidelines for adding new packages to Trilinos. Trilinos packages are primarily written in C++, but we provide some C and Fortran user interface support. Trilinos provides an open architecture that allows easy integration of other solver packages, and we make our software available to the outside community via the GNU Lesser General Public License (LGPL) [3]. This paper is organized as follows. Section 2 gives an overview of the Trilinos packages. Section 3 describes the packages that define linear algebra components (such as serial and distributed matrices and vectors) that can be used by all packages. Section 4 introduces the NewPackage package, a jumpstart for new Trilinos packages. Section 5 outlines the difference between interoperability and interdependence of Trilinos packages. The Trilinos software engineering environment for solver developers is described in Section 6. Finally, Section 7 outlines the conclusions.
2 Overview of Trilinos Packages The word “Trilinos” is Greek for “string of pearls”. It is a metaphor for the Trilinos architecture. Packages are strung together within Trilinos like pearls on a necklace. While each package is valuable and whole in its own right, the collection represents even more than the sum of its parts. Trilinos began as only three packages but has rapidly expanded. Table 1 lists the Trilinos packages with a brief description and their availability status. At the bottom of the table is the package count in each Trilinos release. General releases are available via automatic download. Limited releases are available upon request. A full description of Trilinos packages is available at the Trilinos Home Page [1]. A partial overview of what can be accomplished using Trilinos includes:
– Advanced serial dense or sparse matrices (Epetra);
– Advanced utilities for Epetra vectors and sparse matrices (EpetraExt);
– Templated distributed vectors and sparse matrices (Tpetra);
– Distributed sparse matrices (Epetra);
– Solve a linear system with preconditioned Krylov accelerators (AztecOO, Belos);
– Incomplete factorizations (AztecOO, IFPACK);
– Multilevel preconditioners (ML);
– “Black-box” smoothed aggregation preconditioners (ML);
Table 1. Trilinos Package Descriptions and Availability (general and limited availability in Release 3.1, 9/2003, and Release 4, 5/2004)

Package      Description
Amesos       3rd Party Direct Solver Suite
Anasazi      Eigensolver package
AztecOO      Linear Iterative Methods
Belos        Block Linear Solvers
Epetra       Basic Linear Algebra
EpetraExt    Extensions to Epetra
Ifpack       Algebraic Preconditioners
Jpetra       Java Petra Implementation
Kokkos       Sparse Kernels
Komplex      Complex Linear Methods
LOCA         Bifurcation Analysis Tools
Meros        Segregated Preconditioners
ML           Multi-level Preconditioners
NewPackage   Working Package Prototype
NOX          Nonlinear solvers
Pliris       Dense direct Solvers
Teuchos      Common Utilities
TSFCore      Abstract Solver API
TSFExt       Extensions to TSFCore
Tpetra       Templated Petra

Totals: Release 3.1 (9/2003): 8 packages in general release, 11 in limited release; Release 4 (5/2004): 15 in general release, 20 in limited release.
– One-level Schwarz preconditioner (overlapping domain decomposition) (AztecOO, IFPACK);
– Two-level Schwarz preconditioner, with coarse matrix based on aggregation (AztecOO and ML);
– Systems of nonlinear equations (NOX);
– Interface with various direct solvers, such as UMFPACK, MUMPS, SuperLU DIST and ScaLAPACK (Amesos);
– Eigenvalue problems for sparse matrices (Anasazi);
– Complex linear systems (using equivalent real formulation) (Komplex);
– Segregated and block preconditioners (e.g., incompressible Navier-Stokes equations) (Meros);
– Light-weight interface to BLAS and LAPACK (Epetra);
– Templated interface to BLAS and LAPACK, arbitrary-precision arithmetic, parameters’ list, smart pointers (Teuchos);
– Definition of abstract interfaces to vectors, linear operators, and solvers (TSF, TSFCore, TSFExtended);
– Generation of test matrices (Triutils).
A detailed description of each package’s functionalities is beyond the scope of this manuscript. In the next two sections, we focus our attention on the Petra object model and on the NewPackage package.
3 The Petra Object Model Scalable parallel implementation of algorithms is extremely challenging. Therefore, one of the most important features of Trilinos is its data model for global objects, called the Petra Object Model [6]. There are three implementations of this model, but the current production implementation is called Epetra [7]. Epetra is written for real-valued double-precision scalar field data, and restricts itself to a stable core of the C++ language standard. As such, Epetra is very portable and stable, and is accessible to Fortran and C users. Epetra combines in a single package (a brief usage sketch follows the list of features below):
– Object-oriented, parallel C++ design: Epetra has facilitated rapid development of numerous applications and solvers at Sandia because Epetra handles many of the complicated issues of working on a parallel, distributed-memory machine. Furthermore, Epetra provides a lightweight set of abstract interfaces and a shallow copy mode that allows existing solvers and data models to interact with Epetra without creating redundant copies of data.
– High performance: Despite the fact that Epetra provides a very flexible data model, performance has always been the top priority. Aztec [8] won an R&D 100 award in 1997 and has also been the critical performance component in two Gordon Bell finalist entries. Despite Epetra’s much more general interfaces and data structures, for problems where Aztec and Epetra can be compared, Epetra is at least as fast and often faster than Aztec. Epetra also uses the BLAS and LAPACK wherever possible, resulting in excellent performance for dense operations.
– Block algorithm support for next-generation solution capabilities: The Petra Object Model and Epetra specifically provide support for multivectors, including distributed multivector operations. The level of support that Epetra provides for this feature is unique among available frameworks, and is essential for supporting next-generation applications that incorporate features such as sensitivity analysis and design optimization.
– Generic parallel machine interfaces: Epetra does not depend explicitly on MPI. Instead it has its own abstract interface for which MPI is one implementation. Experimental implementations for hybrid distributed/shared memory, PVM [9] and UPC [10] are also available or under development.
– Parallel data redistribution: Epetra provides a distributed object (DistObject) base class that manages global distributed objects. Not only does this base class provide the functionality for common distributed objects like matrices and vectors, it also supports distributed graphs, block matrices, and coloring objects. Furthermore, any class can derive from DistObject by implementing just a few simple methods to pack and unpack data.
– Interoperability: Application developers can use Epetra to construct and manipulate matrices and vectors, and then pass these objects to any Trilinos solver package. Furthermore, solver developers can develop many new algorithms relying on Epetra classes to handle the intricacies of parallel execution. Whether directly or via abstract interfaces, Epetra provides one of the fundamental interoperability mechanisms across Trilinos packages.
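As a concrete illustration of the kind of code this data model enables, the following sketch assembles a small tridiagonal Epetra matrix and solves it with an AztecOO Krylov iteration. It is an illustrative sketch only, not an excerpt from the Trilinos documentation: the matrix, problem size, and solver settings are invented for the example, and the Epetra and AztecOO reference documentation should be consulted for the authoritative APIs.

#include "mpi.h"
#include "Epetra_MpiComm.h"
#include "Epetra_Map.h"
#include "Epetra_CrsMatrix.h"
#include "Epetra_Vector.h"
#include "Epetra_LinearProblem.h"
#include "AztecOO.h"

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  Epetra_MpiComm Comm(MPI_COMM_WORLD);

  // Distribute 100 rows across the processors of the communicator.
  const int NumGlobalRows = 100;
  Epetra_Map Map(NumGlobalRows, 0, Comm);
  Epetra_CrsMatrix A(Copy, Map, 3);

  // Assemble a simple [-1 2 -1] tridiagonal stencil row by row.
  double Values[3];
  int Indices[3];
  for (int i = 0; i < Map.NumMyElements(); ++i) {
    int Row = Map.GID(i);
    int n = 0;
    if (Row > 0)                 { Values[n] = -1.0; Indices[n++] = Row - 1; }
    Values[n] = 2.0; Indices[n++] = Row;
    if (Row < NumGlobalRows - 1) { Values[n] = -1.0; Indices[n++] = Row + 1; }
    A.InsertGlobalValues(Row, n, Values, Indices);
  }
  A.FillComplete();

  Epetra_Vector x(Map), b(Map);
  b.PutScalar(1.0);   // right-hand side
  x.PutScalar(0.0);   // initial guess

  // Hand the same objects to the AztecOO iterative solver.
  Epetra_LinearProblem Problem(&A, &x, &b);
  AztecOO Solver(Problem);
  Solver.SetAztecOption(AZ_solver, AZ_cg);
  Solver.SetAztecOption(AZ_precond, AZ_dom_decomp);  // one-level Schwarz
  Solver.Iterate(500, 1.0e-8);

  MPI_Finalize();
  return 0;
}

Because the solver consumes Epetra objects directly, the same matrix and vectors can be handed to other Trilinos packages without copying.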
4 NewPackage Package: Fully Functional Package Prototype One of the most useful features of Trilinos is the NewPackage package. This simple “Hello World” program is a full-fledged Trilinos package that illustrates:
– Portable build processes via Autoconf and Automake;
– Automatic documentation generation using Doxygen [18];
– Configuring for interaction with another package (Epetra), including the use of M4 [4,19] macros to customize package options;
– Regression testing using special scripts in the package that will be automatically executed on regression testing platforms.
In addition, Trilinos provides a NewPackage website that can be easily customized, requiring only specific package content. This website is designed to incorporate the reference documentation that is generated by Doxygen from the NewPackage source code. Using NewPackage as a starting point, developers of mathematical libraries are able to become quickly and easily interoperable with Trilinos. It is worth noting that a package becomes Trilinos-interoperable without sacrificing its independence. This fact has made Trilinos attractive to a number of developers. As a result, Trilinos is growing not only by new development projects, but also by incorporation of important, mature projects that want to adopt modern software engineering capabilities. This is of critical importance to applications that depend on these mature packages.
5 Package Interoperability vs. Interdependence Trilinos provides configuration, compilation and installation facilities via GNU Autoconf [4] and Automake [5]. These tools make Trilinos very portable and easy to install. The Trilinos-level build tools allow the user to specify which packages should be configured and compiled via --enable-package_name arguments to the configure command. A fundamental difference between Trilinos and other mathematical software frameworks is its emphasis on interoperability. Via abstract interfaces and configure-time enabling, packages within Trilinos can interoperate with each other without being interdependent. For example, the nonlinear solver package NOX needs some linear algebra support, but it does not need to know the implementation details. Because of this, it specifies the interface it needs. At configure time, the user can specify one or more of three possible linear algebra implementations for NOX:
– --enable-nox-lapack: compile NOX lapack interface libraries
– --enable-nox-epetra: compile NOX epetra interface libraries
– --enable-nox-petsc: compile NOX PETSc interface libraries
Alternatively, because the NOX linear algebra interfaces are abstract, users can provide their own implementation and build NOX completely independently of Trilinos. Another notable example is ML [20]. ML is the multilevel preconditioning package of Trilinos, and it can be used to define multigrid/multilevel preconditioners. ML contains basic Krylov accelerators and a variety of smoothers (like Jacobi, Gauss-Seidel, and polynomial smoothers), and as such it can be configured and used as a stand-alone package. That is, ML does not depend on any other Trilinos packages. However, ML can easily interoperate with other Trilinos packages, since light-weight conversions from the ML proprietary format to the Epetra format (and vice versa) exist¹. For instance, ML can:
– use the incomplete factorizations of AztecOO and IFPACK as smoothers;
– exploit the eigensolvers of Anasazi when an eigen-computation is required;
– take advantage of the Amesos interface to third-party libraries;
– use the general-purpose tools provided by Teuchos to define better interfaces.
Other packages can take advantage of ML as well: for example, an ML object can be used to define a preconditioner for an AztecOO linear system (a minimal sketch of this coupling is given below). For ML applications, interoperability is extremely important. Multilevel preconditioners are based on a variety of components (possibly a different aggregation scheme, pre-smoother and post-smoother for each level, and the coarse solver), and it is often important to test different combinations of parameters, or to add new components. To this aim, several Trilinos packages can be used to extend ML. Even if the kernel features of ML remain the same, independently of the available Trilinos packages, ML itself is less effective as a stand-alone package than it is as part of the Trilinos project.
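The following fragment sketches that coupling: an ML multilevel preconditioner built from an Epetra matrix and handed to an AztecOO solver. It is a hedged illustration rather than code from the Trilinos documentation; the matrix and vectors are assumed to come from a setup such as the Epetra example earlier, and the parameter names follow the ML smoothed-aggregation interface to the best of our knowledge.

#include "ml_MultiLevelPreconditioner.h"
#include "Teuchos_ParameterList.hpp"
#include "AztecOO.h"
#include "Epetra_CrsMatrix.h"
#include "Epetra_Vector.h"
#include "Epetra_LinearProblem.h"

// Assumes A, x, b have been assembled as in the previous Epetra sketch.
void solve_with_ml(Epetra_CrsMatrix& A, Epetra_Vector& x, Epetra_Vector& b) {
  // Parameters for a "black-box" smoothed-aggregation preconditioner.
  Teuchos::ParameterList MLList;
  ML_Epetra::SetDefaults("SA", MLList);
  MLList.set("max levels", 5);
  MLList.set("smoother: type", "Gauss-Seidel");

  // Build the multilevel preconditioner from the Epetra matrix ...
  ML_Epetra::MultiLevelPreconditioner MLPrec(A, MLList, true);

  // ... and use it inside an AztecOO Krylov iteration.
  Epetra_LinearProblem Problem(&A, &x, &b);
  AztecOO Solver(Problem);
  Solver.SetPrecOperator(&MLPrec);
  Solver.SetAztecOption(AZ_solver, AZ_gmres);
  Solver.Iterate(500, 1.0e-8);
}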
6 Trilinos Software Engineering Environment for Solver Developers Trilinos package developers enjoy a rich set of tools and well-developed processes that support the design, development and support of their software. The following resources are available for each package team:
– Repository (CVS): Trilinos uses CVS [11] for source control. Each package exists both as an independent module and as part of the larger whole.
– Issue Tracking (Bugzilla): Trilinos uses Mozilla’s Bugzilla [12] for issue tracking. This includes the ability to search all previous bug reports and to automatically track submissions (via email notifications).
¹ If required, ML interoperates with packages outside the Trilinos project. For example, METIS and ParMETIS [21] can be used to define the aggregates in smoothed aggregation, or ParaSails [22] can be called to compute smoothers based on sparse approximate inverses.
– Communication (Mailman): Trilinos uses Mailman [13] for managing mailing lists. There are Trilinos mailing lists as well as individual package mailing lists as follows. Each package has low volume lists for announcements as well as a general users mailing list. Further, each package also has developer-focused lists for developer discussion, tracking of CVS commits, and regression test results. Finally, there is a mailing list for the leaders of the packages, and these leaders also have monthly teleconferences where any necessary Trilinos-level decisions are made.
In addition to the above package-specific infrastructure, the following resources are also available to package developers:
– Debugging (Bonsai): Mozilla Bonsai [14] is a web-based GUI for CVS, allowing for queries and easy comparison of different versions of the code. Bonsai is integrated with the Trilinos CVS and Bugzilla databases.
– Jumpstart (NewPackage package): One of the fundamental contributions of Trilinos is “NewPackage”. This is a legitimate package within Trilinos that does “Hello World”. We discuss it in more detail below.
– Developer Documentation: Trilinos provides an extensive developers guide [15] that provides new developers with detailed information on Trilinos resources and processes. All Trilinos documents are available online at the main Trilinos home page [1].
– SQA/SQE support: The Advanced Scalable Computing Initiative (ASCI) program within the Department of Energy (DOE) has strong software quality engineering and assurance requirements. At Sandia National Laboratories this has led to the development of 47 SQE practices that are expected of ASCI-funded projects [16]. Trilinos provides a special developers guide [17] for ASCI-funded Trilinos packages. For these packages, 32 of the SQE practices are directly addressed by Trilinos. The remaining 15 are the responsibility of the package development team. However, even for these 15, Trilinos provides significant support.
To summarize, the two-level design of Trilinos allows package developers to focus on only those aspects of development that are unique to their package and still enjoy the benefits of a mature software engineering environment.
7 Summary Computer modeling and simulation is a credible third approach to fundamental advances in science and engineering, along with theory and experimentation. A critical issue for continued growth is access to both new and mature mathematical software on high performance computers in a robust, production environment. Trilinos attempts to address this issue in the following ways:
– Provides a rapidly growing set of unique solver capabilities: Many new algorithms are being implemented using Trilinos. Trilinos provides an attractive development environment for algorithm specialists.
– Provides an independent, scalable linear algebra package: Epetra provides extensive tools for construction and manipulation of vectors, graphs and matrices, and provides the default implementation of abstract interfaces across all other Trilinos packages.
– Supports SQA/SQE requirements: Trilinos is the first solver framework to provide explicitly documented processes for SQA/SQE, an essential feature for production computing.
– Provides an innovative architecture for modular, scalable growth: The Trilinos package concept and the NewPackage package provide the framework for rapid development of new solvers, and the integration and support of critical mature solvers.
– Is designed for interoperability: Designed “from the ground up” to be compatible with existing code and data structures. Trilinos is the best vehicle for making solver packages, whether new or existing, interoperable with each other.
– Uses open source tools: Extensive use of third-party web tools for documentation, versioning, mailing lists, and bug tracking.
To the best of our knowledge, no competing numerical libraries offer anything nearly as comprehensive.
Acknowledgments The authors would like to acknowledge the support of the ASCI and LDRD programs that funded development of Trilinos.
References
1. The Trilinos Home Page: http://software.sandia.gov/trilinos, 16-Feb-04.
2. Michael A. Heroux, et al., An Overview of Trilinos. Technical Report SAND2003-2927, Sandia National Laboratories, 2003.
3. The GNU Lesser General Public License: http://www.gnu.org/copyleft/lesser.html, 16-Feb-04.
4. GNU Autoconf Home Page: http://www.gnu.org/software/autoconf, 16-Feb-04.
5. GNU Automake Home Page: http://www.gnu.org/software/automake, 16-Feb-04.
6. Erik Boman, Karen Devine, Robert Heaphy, Bruce Hendrickson and Michael A. Heroux. LDRD Report: Parallel Repartitioning for Optimal Solver Performance. Technical Report SAND2004-XXXX, Sandia
7. Epetra Home Page: http://software.sandia.gov/trilinos/packages/epetra, 16-Feb-04.
8. Ray S. Tuminaro, Michael A. Heroux, Scott A. Hutchinson and John N. Shadid. Official Aztec User's Guide, Version 2.1. Technical Report SAND99-8801J, Sandia National Laboratories, 1999.
9. PVM Home Page: http://www.csm.ornl.gov/pvm, 16-Feb-04.
10. UPC Home Page: http://upc.gwu.edu/, 16-Feb-04.
11. GNU CVS Home Page: http://www.gnu.org/software/cvs, 16-Feb-04.
12. Bugzilla Home Page: http://www.bugzilla.org, 16-Feb-04.
13. GNU Mailman Home Page: http://www.gnu.org/software/mailman, 16-Feb-04.
14. The Bonsai Project Home Page: http://www.mozilla.org/projects/bonsai, 16-Feb-04.
15. Michael A. Heroux, James M. Willenbring and Robert Heaphy, Trilinos Developers Guide, Technical Report SAND2003-1898, Sandia National Laboratories, 2003.
16. J. Zepper, K. Aragon, M. Ellis, K. Byle and D. Eaton, Sandia National Laboratories ASCI Applications
17. Michael A. Heroux, James M. Willenbring and Robert Heaphy, Trilinos Developers Guide Part II: ASCI Software Quality Engineering Practices, Version 1.0. Technical Report SAND2003-1899, Sandia National Laboratories, 2003.
18. Doxygen Home Page: http://www.stack.nl/~dimitri/doxygen, 16-Feb-04.
19. GNU M4 Home Page: http://www.gnu.org/software/m4, 16-Feb-04.
20. M. Sala and J. Hu and R. Tuminaro, ML 3.0 Smoothed Aggregation User's Guide, Technical Report SAND2004-2195, Sandia National Laboratories, 2004.
21. G. Karypis and V. Kumar, ParMETIS: Parallel Graph Partitioning and Sparse Matrix Ordering Library, Technical Report 97-060, Department of Computer Science, University of Minnesota, 1997.
22. E. Chow, ParaSails User's Guide, Technical Report UCRML-MA-137863, Lawrence Livermore National Laboratory, 2000.
Software Architecture Issues in Scientific Component Development
Boyana Norris
Mathematics and Computer Science Division, Argonne National Laboratory, 9700 South Cass Ave., Argonne, IL 60439, USA, [email protected]
Abstract. Commercial component-based software engineering practices, such as the CORBA component model, Enterprise JavaBeans, and COM, are well-established in the business computing community. These practices present an approach for managing the increasing complexity of scientific software development, which has motivated the Common Component Architecture (CCA), a component specification targeted at high-performance scientific application development. The CCA is an approach to component development that is minimal in terms of the complexity of component interface requirements and imposes a minimal performance penalty. While this lightweight specification has enabled the development of a number of high-performance scientific components in several domains, the software design process for developing component-based scientific codes is not yet well defined. This fact, coupled with the fact that component-based approaches are still new to the scientific community, may lead to an ad hoc design process, potentially resulting in code that is harder to maintain, extend, and test and may negatively affect performance. We explore some concepts and approaches based on widely accepted software architecture design principles and discuss their potential application in the development of high-performance scientific component applications. We particularly emphasize those principles and approaches that contribute to making CCA-based applications easier to design, implement, and maintain, as well as enabling dynamic adaptivity with the goal of maximizing performance.
1 Introduction Component-based software engineering (CBSE) practices are well established in the business computing community [1,2,3]. Component approaches often form the heart of architectural software specifications. Garlan and Shaw [4] define a software architecture as a specification of the overall system structure, such as “gross organization and global control structure; protocols for communication, synchronization, and data access; assignment of functionality to design elements, physical distribution; composition of design elements; scaling and performance; and selection among design alternatives.” This decade-old definition largely holds today. Recently, the emergence of a component specification targeted at high-performance computing has led to initial experimentation with CBSE in scientific software. The Common Component Architecture Forum [5,6] has produced a component architecture specification [7] that is semantically similar to other component models, with an
emphasis on minimizing programming requirements and not imposing a significant performance penalty as a result of using components. The CCA specification is defined by using SIDL (Scientific Interface Definition Language), which provides language interoperability without sacrificing performance. CCA employs the notion of ports, public interfaces that define points of interaction between components. There are two types of ports: provides and uses, used to specify functionality provided by a component, and to access functionality of other components that provide the matching port type. To be CCA-compliant, a component must implement the gov.cca.Component interface, which uses the method setServices to pass a reference to a framework services handle to the component. This handle subsequently is used for most framework interactions, such as registering uses and provides ports and obtaining and releasing port handles from the framework. The setServices method is normally invoked by the framework after the component has been instantiated. A CCA framework provides an implementation of the Services interface and performs component management, including dynamic library loading, component instantiation, and connection and disconnection of compatible ports. In addition to the benefits of a component system for managing software complexity, the CCA offers new challenges and opportunities. In the remainder of this paper, we present some software architecture principles and approaches that build on the current CCA specification and help better define the component design and development process, as well as enable better dynamic adaptivity in scientific component applications. Going beyond the software architecture specification, we briefly discuss other important aspects of scientific component development, including code generation, implementation, compilation, and deployment.
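To make the port mechanism described above concrete, the sketch below shows what a minimal component might look like. It is illustrative only: the tiny Component, Services, and Port classes are simplified stand-ins for the gov.cca interfaces (whose exact signatures differ across frameworks and language bindings), and the component and port names are invented for the example.

#include <string>

// Simplified stand-ins for the gov.cca interfaces described in the text;
// real frameworks supply these (with richer, SIDL-generated signatures).
struct Port { virtual ~Port() = default; };

struct Services {
  virtual ~Services() = default;
  virtual void addProvidesPort(Port* port, const std::string& name,
                               const std::string& type) = 0;
  virtual void registerUsesPort(const std::string& name,
                                const std::string& type) = 0;
  virtual Port* getPort(const std::string& name) = 0;
  virtual void releasePort(const std::string& name) = 0;
};

struct Component {
  virtual ~Component() = default;
  // Invoked by the framework right after the component is instantiated.
  virtual void setServices(Services* svc) = 0;
};

// A hypothetical solver port, and a component that provides it while
// declaring its intent to use a "matrix" port provided elsewhere.
struct SolvePort : Port { virtual void solve() = 0; };

class LinearSolverComponent : public Component, public SolvePort {
public:
  void setServices(Services* svc) override {
    svc_ = svc;
    svc_->addProvidesPort(this, "solve", "SolvePort");  // what we offer
    svc_->registerUsesPort("matrix", "MatrixPort");     // what we need
  }
  void solve() override {
    Port* matrix = svc_->getPort("matrix");  // handed out by the framework
    (void)matrix;  // placeholder: a real component would cast and use it
    svc_->releasePort("matrix");
  }
private:
  Services* svc_ = nullptr;
};

The framework instantiates the component, calls setServices, and later connects the declared uses port to a compatible provides port of another component.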
2 Software Architecture Principles In this section we examine software design approaches in the context of their applicability to high-performance scientific component development. One of the main distinctive features of the CCA is its minimal approach to component specification. This was partly motivated by the need to keep component overhead very low. Another motivation is to make the software design process more accessible to scientists who are concerned with the human overhead of adopting an approach new to the scientific computing community. The result is a working minimal-approach component model that has been used successfully in diverse applications [8,9,10,11]. The CCA specification incorporates a number of principles that Buschmann et al. [12] refer to as architecture-enabling principles, including abstraction, encapsulation, information hiding, modularization, coupling and cohesion, and separation of interface and implementation. While the minimal CCA approach does make scientific component development possible, it is not a complete design methodology, as is, for example, the Booch method [13]. Furthermore, other than the minimal requirements of the component interface and framework-component interactions, the actual component design is not dictated by any set of rules or formal specifications. This makes the CCA very general and flexible but imposes a great burden on the software developer, who in many cases has no previous component- or object-oriented design experience. We believe that a number of widely accepted software architecture principles, such as those summarized in [12,14],
can contribute greatly to all aspects of scientific application development—from interface specification to component quality-of-service support. These enabling technologies [12] can be used as guiding design principles for the scientific component developer. Moreover, they can enable the development of tools that automate many of the steps in the component development life cycle, leading to shorter development cycles and to applications that are both more robust and better performing. Although the following software architecture principles are applicable to a wide range of domains, we focus on their impact on high-performance components, with an emphasis on adaptability. We briefly examine each approach and discuss an example context of its applicability in scientific component development. Separation of Concerns. Currently the CCA provides a basic specification for components and does not deal directly with designing ports and components so that different or unrelated responsibilities are separate from each other and that different roles played by a component in different contexts are independent and separate from each other within the component. Several domain-specific interface definition efforts are under way, such as interfaces for structured and unstructured mesh access and manipulation, linear algebra solvers, optimization, data redistribution, and performance monitoring. These common interfaces ideally would be designed to ensure a clear separation of concerns. In less general situations, however, this principle is often unknown or ignored in favor of quicker and smaller implementation. Yet clearly separating different or unrelated responsibilities is essential for achieving a flexible and reliable software architecture. For example, performance monitoring functionality is often directly integrated in the implementation of components, making it difficult or impossible to change the amount, type, and frequency of performance data gathered. A better design is to provide ports or components that deal specifically with performance monitoring, such as those described in [15], enabling the use of different or multiple monitor implementations without modifying the implementation of the client components, as well as generating performance monitoring ports automatically. Similar approaches can be used for other types of component monitoring, such as that needed for debugging. Although in some cases the separation of concerns is self-evident, in others it is not straightforward to arrive at a good design while maintaining a granularity that does not deteriorate the performance of the application as a whole. For example, many scientific applications rely on a discretized representation of the problem domain, and it would seem like a good idea to extract the mesh management functionality into separate components. Depending on the needs of the application, however, accessing the mesh frequently through a fine-grained port interface may have prohibitive overhead. That is not to say that mesh interfaces are doomed to bad performance; rather, care must be taken in the design of the interface, as well as the way in which it is used, in order to avoid loss of performance. Separation of Policy and Implementation. A policy component deals with context-sensitive decisions, such as assembly of disjoint computations as a result of selecting parameter values.
An implementation component deals only with the execution of a fully specified algorithm, without being responsible for making context-dependent decisions internally. This separation enables the implementation of multimethod solution methods, such as the composite and adaptive linear solvers described in [16,17]. Initial implementations of our multimethod linear system solution methods were not in
component form (rather, they were based directly on PETSc [18]), and after implementing a few adaptive strategies, adding new ones became a very complex and error-prone task. The reason was that, in order to have nonlinear solver context information in our heuristic implementation of the adaptive linear solution, we used (or rather, “abused”) the PETSc nonlinear user-defined monitor routine, which is invoked automatically via a call-back mechanism at each nonlinear iteration. Without modifying PETSc itself, this was the best way to obtain context information about the nonlinear solution while monitoring the performance and controlling the choice of linear solvers. While one can define the implementation in a structured way, one is still limited by the fact that the monitor is a single function, under which all adaptive heuristics must be implemented. Recently we have reworked our implementation to use well-defined interfaces for performance data management and adaptive linear solution heuristics. This separation of policy (the adaptive heuristic, which selects linear solvers based on the context in the nonlinear solver) and implementation (the actual linear solution method used, such as GMRES) has not only led to a simpler and cleaner design but has also made it possible to add new adaptive heuristics with negligible effort compared to the noncomponent version. Design experiences, such as this nonlinear PDE solution example, can lead to guidelines or “best practices” that can assist scientists in achieving separation of policy and implementation in different application domains.
Reflection. The Reflection architectural pattern enables the structure and behavior of software systems to be changed dynamically [12]. Although Reflection is not directly supported by the CCA specification, similar capabilities can be provided in some cases. For example, for CCA components implemented with SIDL, interface information is available in SIDL or XML format. Frameworks or automatically generated ports (which provide access to component metainformation) can be used to provide Reflection capabilities in component software. Reflection also can be used to discover interfaces provided by components, a particularly useful capability in adaptive algorithms where a selection of a suitable implementation must be made among several similar components. For example, in an adaptive linear system solution, several solution methods implement a common interface and can thus be substituted at runtime, but knowing more implementation-specific interface details may enable each linear solver instance to be tuned better by using application-specific knowledge. We note, however, that while reflection can be useful in cases such as those cited, it may negatively affect performance if used at a very fine granularity. As with other enabling architectural patterns, one must be careful to ensure that it is applied only at a coarse-enough granularity to minimize the impact on performance. While the omission of formal specifications for these and other features makes the CCA approach general and flexible, it also increases the burden on the scientific programmer. While in theory it is possible to design and implement CCA components in a framework-independent fashion, in practice the component developer is responsible for ensuring that a component is written and built in such a way that it can be used in a single framework that is usually chosen a priori.¹
¹ The CCA Forum has defined a number of interfaces for framework interoperability, so that while components are written in a way that may not be portable across frameworks, interaction between components executing in different frameworks is possible in some cases.
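Returning to the separation of policy and implementation discussed above, the idea can be made concrete with a small sketch. All of the names below (LinearSolverPort, AdaptiveSolverPolicy, the stub solvers, and the residual-ratio heuristic) are illustrative assumptions rather than part of the CCA specification or the PETSc API; the sketch only shows how an adaptive heuristic (policy) can select among interchangeable solver implementations through a common interface.

#include <memory>
#include <string>
#include <vector>
#include <iostream>

// Hypothetical common interface implemented by every linear solver component.
struct LinearSolverPort {
    virtual ~LinearSolverPort() = default;
    // Solve A x = b; returns the number of iterations used.
    virtual int solve(const std::vector<double>& b, std::vector<double>& x) = 0;
    virtual std::string name() const = 0;
};

// Two interchangeable "implementation" components (stubs for illustration only).
struct GMRESSolver : LinearSolverPort {
    int solve(const std::vector<double>& b, std::vector<double>& x) override {
        x = b;       // placeholder for a real Krylov iteration
        return 42;   // pretend iteration count
    }
    std::string name() const override { return "GMRES"; }
};
struct BiCGStabSolver : LinearSolverPort {
    int solve(const std::vector<double>& b, std::vector<double>& x) override {
        x = b;
        return 30;
    }
    std::string name() const override { return "BiCGStab"; }
};

// The "policy" component: it looks at context from the nonlinear solver
// (e.g., recent convergence behavior) and picks an implementation.
struct AdaptiveSolverPolicy {
    std::shared_ptr<LinearSolverPort> select(double nonlinearResidualRatio) {
        // Toy heuristic: switch methods when the nonlinear residual stagnates.
        if (nonlinearResidualRatio > 0.9)
            return std::make_shared<BiCGStabSolver>();
        return std::make_shared<GMRESSolver>();
    }
};

int main() {
    AdaptiveSolverPolicy policy;
    std::vector<double> b(4, 1.0), x;
    for (double ratio : {0.5, 0.95}) {
        auto solver = policy.select(ratio);   // policy decision
        int its = solver->solve(b, x);        // implementation does the work
        std::cout << solver->name() << " took " << its << " iterations\n";
    }
}

The design benefit is exactly the one described in the text: new adaptive heuristics change only the policy class, while solver implementations remain untouched behind the shared interface.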
3 Implementation and Deployment Implementation and deployment of CCA components are potentially complex and time-consuming tasks compared to the more traditional library development approaches. Therefore, CCA Forum participants have recently begun streamlining the component creation and build process. For example, CHASM [19] is being used to automate some of the steps required to create language-independent components from legacy codes or from scratch [20]. We have also recently begun automating the build process by generating most of the scripts needed to create a build system based on GNU tools, such as automake and autoconf [21,22]. Currently we are integrating the component generation tools based on CHASM [19] and Babel [23] with the build automation capabilities. Other efforts are making the code generation, build automation, and deployment support tools easy to use, extensible, portable, and flexible. Until recently, little attention was given to the human overhead of component design and implementation. Many of the above-mentioned efforts aim to increase the software developers’ efficiency. Most ongoing work focuses on the tasks of generating ports and components from existing codes, generating code from existing interfaces, compiling, and deploying components. At this point, however, little is available on the techniques for designing ports and components. We believe that the architecture-enabling principles mentioned in this paper, as well as others, can help guide the design of scientific applications. Furthermore, we hope that in the near future, when more and more component-based scientific applications go through this design and implementation process, domain-specific patterns will emerge that will lead to the same type of improvement in the software development approaches and efficiency that design and architectural patterns have led to in the business software development community. Scientific applications may need to execute in environments with different capabilities and requirements. For example, in some cases it is possible and desirable to build components as dynamic libraries that are loaded by the framework at runtime. In other cases, the computing environment, such as a large parallel machine without a shared file system, makes the use of dynamic libraries onerous, and one statically linked executable for the assembled component application is more appropriate. Being able to support both dynamic and static linking in an application makes the build process complex and error-prone, thus necessitating automation. Furthermore, support of multiple deployment modes (e.g., source-based, RPM or other binary formats) is essential for debugging, distribution, and support of multiple hardware platforms. Many of these issues can be addressed by emerging CCA-based tools that generate the required component meta-information, configuration, build, and execution scripts.
4 QoS-Enabled Software Architecture As more functionally equivalent component implementations become available, the task of selecting components and assembling them into an application with good overall performance becomes complex and possibly intractable manually. Furthermore, as computational requirements change during the application’s execution, the initial selection of components may no longer ensure good overall performance. Some of the considerations that influence the choice of implementation are particular to parallel codes, for
example, the scalability of a given algorithm. Others deal with the robustness of the solution method, or its convergence speed (if known) for a certain class of problems. Recent work [24,25,26] on computational quality of service (QoS) for scientific components has defined a preliminary infrastructure for the support of performance-driven application assembly and adaptation. The CCA enables this infrastructure to augment the core specifications, and new tools that exploit the CCA ports mechanism for performance data gathering and manipulation are being developed [15]. Related work in the design and implementation of QoS support in component architectures includes component contracts [27], QoS aggregation models [28], and QoS-enabled distributed component computing [29,30].
5 Conclusion We have examined a number of current challenges in handling the entire life cycle of scientific component application development. We outlined some software architecture ideas that can facilitate the software design process for scientific applications. We discussed the need for, and the ongoing work on, automating different stages of component application development, including code generation from language-independent interfaces, automatic generation of components from legacy code, and automating the component build process. We noted how some of the software architecture ideas can be used to support adaptivity in applications, enabling computational quality-of-service support for scientific component applications. Many of these ideas are in the early stages of formulation and implementation and will continue to be refined and implemented as command-line or graphical tools, components, or framework services.
Acknowledgments Research at Argonne National Laboratory was supported in part by the Mathematical, Information, and Computational Sciences Division subprogram of the Office of Advanced Scientific Computing Research, Office of Science, U.S. Dept. of Energy, under Contract W-31-109-ENG-38. We thank Matt Knepley, Lois McInnes, Paul Hovland, and Kate Keahey of Argonne National Laboratory for many of the ideas that led to this work, Sanjukta Bhowmick and Padma Raghavan of the Pennsylvania State University for their ongoing contributions, and the CCA Forum for enabling this collaborative effort.
References 1. Anonymous: CORBA component model. http://www.omg.org/technology/documents/formal/components.htm (2004) 2. Anonymous: Enterprise JavaBeans downloads and specifications. http://java.sun.com/products/ejb/docs.html (2004) 3. Box, D.: Essential COM. Addison-Wesley Pub. Co. (1997) 4. Garlan, D., Shaw, M.: An Introduction to Software Architecture. In: Advances in Software Engineering and Knowledge Engineering. World Scientific Publishing Company (1993)
5. Armstrong, R., Gannon, D., Geist, A., Keahey, K., Kohn, S., McInnes, L. C., Parker, S., Smolinski, B.: Toward a common component architecture for high-performance scientific computing. In: Proceedings of High Performance Distributed Computing. (1999) 115–124 6. Common Component Architecture Forum: CCA Forum website. http://www.cca-forum.org (2004) 7. CCA Forum: CCA specification. http://cca-forum.org/specification/ (2003) 8. Norris, B., Balay, S., Benson, S., Freitag, L., Hovland, P., McInnes, L., Smith, B.: Parallel components for PDEs and optimization: Some issues and experiences. Parallel Computing 28 (12) (2002) 1811–1831 9. Lefantzi, S., Ray, J.: A component-based scientific toolkit for reacting flows. In: Proceedings of the Second MIT Conference on Computational Fluid and Solid Mechanics, Boston, Mass., Elsevier Science (2003) 10. Benson, S., Krishnan, M., McInnes, L., Nieplocha, J., Sarich, J.: Using the GA and TAO toolkits for solving large-scale optimization problems on parallel computers. Technical Report ANL/MCS-P1084-0903, Argonne National Laboratory (2003) 11. Larson, J. W., Norris, B., Ong, E. T., Bernholdt, D. E., Drake, J. B., Elwasif, W .R., Ham, M. W., Rasmussen, C. E., Kumfert, G., Katz, D. S., Zhou, S., DeLuca, C., Collins, N. S.: Components, the Common Component Architecture, and the climate/weather/ocean community. In: 84th American Meteorological Society Annual Meeting, Seattle, Washington, American Meteorological Society (2004) 12. Buschmann, F., Meunier, R., Rohnert, H., Sommerlad, P., Stal, M.: Pattern-Oriented Software Architecture: A System of Patterns. John Wiley & Sons Ltd. (1996) 13. Booch, G.: Unified method for object-oriented development Version 0.8. Rational Software Corporation (1995) 14. Medvidovic, N., Taylor, R. N.: A classification and comparison framework for software architecture description languages. IEEE Transactions on Software Engineering 26 (2000) 15. Shende, S., Malony, A. D., Rasmussen, C., Sottile, M.: A performance interface for component-based applications. In: Proceedings of International Workshop on Performance Modeling, Evaluation and Optimization, International Parallel and Distributed Processing Symposium (2003) 16. Bhowmick, S., Raghavan, P., McInnes, L., Norris, B.: Faster PDE-based simulations using robust composite linear solvers. Future Generation Computer Systems 20 (2004) 373–387 17. McInnes, L., Norris, B., Bhowmick, S., Raghavan, P.: Adaptive sparse linear solvers for implicit CFD using Newton-Krylov algorithms. In: Proceedings of the Second MIT Conference on Computational Fluid and Solid Mechanics, Massachusetts Institute of Technology, Boston, USA, June 17-20, 2003 18. Balay, S., Buschelman, K., Gropp, W., Kaushik, D., Knepley, M., McInnes, L., Smith, B. F., Zhang, H.: PETSc users manual. Technical Report ANL-95/11 - Revision 2.1.5, Argonne National Laboratory (2003). http://www.mcs.anl.gov/petsc. 19. Rasmussen, C. E., Lindlan, K. A., Mohr, B., Striegnitz, J.: Chasm: Static analysis and automatic code generation for improved Fortran 90 and C++ interoperability. In: 2001 LACSI Symposium (2001) 20. Rasmussen, C. E., Sottile, M. J., Shende, S. S., Malony, A. D.: Bridging the language gap in scientific computing: the chasm approach. Technical Report LA-UR-03-3057, Advanced Computing Laboratory, Los Alamos National Laboratory (2003) 21. Bhowmick, S.: Private communication. Los Alamos National Laboratory (2004) 22. Wilde, T.: Private communication. Oak Ridge National Laboratory (2004) 23. Anonymous: Babel homepage. 
http://www.llnl.gov/CASC/components/babel.html (2004)
24. Trebon, N., Ray, J., Shende, S., Armstrong, R. C., Malony, A.: An approximate method for optimizing HPC component applications in the presence of multiple component implementations. Technical Report SAND2003-8760C, Sandia National Laboratories (2004). Also submitted to 9th International Workshop on High-Level Parallel Programming Models and Supportive Environments, held during the 18th International Parallel and Distributed Processing Symposium, 2004, Santa Fe, NM, USA. 25. Hovland, P., Keahey, K., McInnes, L. C., Norris, B., Diachin, L. F., Raghavan, P.: A quality of service approach for high-performance numerical components. In: Proceedings of Workshop on QoS in Component-Based Software Engineering, Software Technologies Conference, Toulouse, France (2003) 26. Norris, B., Ray, J., Armstrong, R., McInnes, L. C., Bernholdt, D. E., Elwasif, W. R., Malony, A. D., Shende, S.: Computational quality of service for scientific components. In: Proceedings of the International Symposium on Component-Based Software Engineering (CBSE7), Edinburgh, Scotland (2004) 27. Beugnard, A., Jézéquel, J. M., Plouzeau, N., Watkins, D.: Making components contract aware. IEEE Computer (1999) 38–45 28. Gu, X., Nahrstedt, K.: A scalable QoS-aware service aggregation model for peer-to-peer computing grids. In: Proceedings of High Performance Distributed Computing. (2002) 29. Loyall, J. P., Schantz, R. E., Zinky, J. A., Bakken, D. E.: Specifying and measuring quality of service in distributed object systems. In: Proceedings of the First International Symposium on Object-Oriented Real-Time Distributed Computing (ISORC ’98). (1998) 30. Raje, R., Bryant, B., Olson, A., Auguston, M., Burt, C.: A quality-of-service-based framework for creating distributed heterogeneous software components. Concurrency Comput: Pract. Exper. 14 (2002) 1009–1034
Parallel Hybrid Sparse Solvers Through Flexible Incomplete Cholesky Preconditioning Keita Teranishi and Padma Raghavan Department of Computer Science and Engineering The Pennsylvania State University 111 IST Bldg., University Park, PA 16802, USA {teranish,raghavan}@cse.psu.edu
Abstract. We consider parallel preconditioning schemes to accelerate the convergence of Conjugate Gradients (CG) for sparse linear system solution. We develop methods for constructing and applying preconditioners on multiprocessors using incomplete factorizations with selective inversion for improved latency-tolerance. We provide empirical results on the efficiency, scalability and quality of our preconditioners for sparse matrices from model grids and some problems from practical applications. Our results indicate that our preconditioners enable more robust sparse linear system solution.
1 Introduction Consider the solution of a sparse linear system Ax = b on a distributed-memory multiprocessor. When A is symmetric positive definite, preconditioned Conjugate Gradients (CG) [5,15] can be used to solve the system. Although CG has scalable memory requirements, its overall performance depends to a large extent on the scalability and effectiveness of the preconditioner. A general purpose preconditioner can be based on incomplete Cholesky with drop thresholds (ICT) [19] to compute L̂ such that L̂L̂^T ≈ A. ICT preconditioning is typically the method of choice on uniprocessors, but its scalable parallel implementation poses many challenges. A major aspect concerns the latency-tolerant application of the preconditioner at every CG iteration; we solve this problem using ‘Selective Inversion’ (SI) [12,14,17] to replace inefficient distributed triangular solution by effective matrix-vector multiplication. We now report on the overall performance including parallel preconditioner construction and its application. We provide a brief overview of our incomplete Cholesky preconditioning with SI in Section 2. We provide some empirical results on the performance of our schemes in Section 3 followed by the conclusions in Section 4.
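For reference, the preconditioned CG iteration that the ICT preconditioner accelerates can be sketched as follows. This is the textbook algorithm [5,15] written with dense storage and a simple Jacobi (diagonal) preconditioner as a stand-in, not the authors' distributed implementation; an ICT preconditioner would replace applyPreconditioner with the (approximate) triangular solves discussed in Section 2.

#include <vector>
#include <cmath>
#include <cstdio>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;

static Vec matvec(const Mat& A, const Vec& x) {
    Vec y(x.size(), 0.0);
    for (size_t i = 0; i < A.size(); ++i)
        for (size_t j = 0; j < A[i].size(); ++j) y[i] += A[i][j] * x[j];
    return y;
}
static double dot(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Stand-in for M^{-1} r: a Jacobi preconditioner; ICT would instead apply
// the incomplete triangular factors (or, with SI, matrix-vector products).
static Vec applyPreconditioner(const Mat& A, const Vec& r) {
    Vec z(r.size());
    for (size_t i = 0; i < r.size(); ++i) z[i] = r[i] / A[i][i];
    return z;
}

// Preconditioned Conjugate Gradients for a symmetric positive definite A.
static int pcg(const Mat& A, const Vec& b, Vec& x, double tol, int maxIts) {
    x.assign(b.size(), 0.0);
    Vec r = b;                          // r = b - A x with x = 0
    Vec z = applyPreconditioner(A, r);
    Vec p = z;
    double rz = dot(r, z);
    for (int k = 0; k < maxIts; ++k) {
        Vec Ap = matvec(A, p);
        double alpha = rz / dot(p, Ap);
        for (size_t i = 0; i < x.size(); ++i) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
        if (std::sqrt(dot(r, r)) < tol) return k + 1;
        z = applyPreconditioner(A, r);
        double rzNew = dot(r, z);
        double beta = rzNew / rz;
        rz = rzNew;
        for (size_t i = 0; i < p.size(); ++i) p[i] = z[i] + beta * p[i];
    }
    return maxIts;
}

int main() {
    Mat A = {{4, 1, 0}, {1, 3, 1}, {0, 1, 2}};   // small SPD example
    Vec b = {1, 2, 3}, x;
    int its = pcg(A, b, x, 1e-10, 100);
    std::printf("converged in %d iterations\n", its);
}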
2 Parallel Incomplete Cholesky with Selective Inversion (ICT-SI) Our parallel incomplete Cholesky with SI uses many of the ideas from parallel sparse direct solution. We start with a good fill-reducing strategy such as minimum-degree and
This work was supported in part by the National Science Foundation through grants ACI0102537, EIA-0221916, and DMR-0205232.
nested dissection [3,7]; the latter also helps provide a natural data partitioning for the parallel implementations. We then compute an approximate factor L̂ corresponding to the true factor L for the given ordering. The parallel factorization and triangular solution of L̂ are guided by the traversal of the elimination tree [9] in conjunction with supernodes [10,11]. The elimination tree represents the data dependency between columns during factorization; a supernodal tree is a compact version of this tree. Chains in the elimination tree represent a set of consecutive columns of the factor with the same nonzero structure. In a supernodal tree these chains are coalesced into a single supernode. During factorization, dense matrix techniques can be used for columns in a supernode to improve performance through effective cache-reuse. Our ICT factorization is based on the left-looking variant of sparse factorization, which has been regarded as a memory-efficient scheme compared to the multifrontal approach [11]. During incomplete factorization some nonzero elements are dropped based on a threshold test. Consequently, the column dependencies and the structure for supernodes derived from the ordered coefficient matrix are no longer exact; the factor comprises sparse columns instead of small dense submatrices. However, that structure still makes it possible to manage a single implementation of the scheme that can (through dropping) cover the entire range of fill-in to meet a variety of preconditioning needs. Our parallel implementation takes advantage of the individual subtrees rooted at the log P level of the tree to achieve coarse-grain parallelism, as shown in Figure 1 for P = 4. Incomplete factorization associated with these subtrees proceeds independently and in parallel in a local phase of computations. At levels above, each supernode is processed by multiple processors using data-parallel dense/blocked operations. At each such supernode, with computations distributed across multiple processors, columns of the incomplete factor are computed by a fan-in scheme with asynchronous communication. This technique allows each processor to overlap the computation of its local elements with interprocessor communication to achieve greater efficiency. A major performance limitation concerns the parallel application of such ICT preconditioners. At each CG iteration a parallel triangular solution is required. However, when parallel substitution is used for such triangular solution, the performance can degrade considerably due to the relatively large latencies of interprocessor communication. More precisely, the triangular solution at a supernode a involves:
\[
\begin{bmatrix} L^{a}_{11} \\ L^{a}_{21} \end{bmatrix} x^{a}_{1} \;=\; \begin{bmatrix} b^{a}_{1} \\ b^{a}_{2} \end{bmatrix}.
\]
The submatrices L^a_11 and L^a_21 (L̂^a_11 and L̂^a_21 for the incomplete factor) are the portions of the factor associated with supernode a; L^a_11 is lower-triangular. Parallel substitution is performed to obtain x^a_1 using L^a_11 and b^a_1; next, b^a_2 is updated as b^a_2 − L^a_21 x^a_1 and used in an ancestor supernode. The SI scheme [12,14,17] overcomes the latency problem of substitution through parallel matrix inversion of L^a_11 in the distributed supernodes. Thus, parallel substitution is replaced by the sparse matrix-vector multiplication x^a_1 ← (L^a_11)^{-1} b^a_1. This incurs extra computational cost during preconditioner construction, but the improvements in applying
Fig. 1. Parallel ICT factorization on 4 processors using a supernodal tree: subtrees assigned to individual processors are factored in a local phase (tree parallelism), while the distributed supernodes above are processed by groups of processors (node parallelism); in incomplete factorization these submatrices will not be dense due to drops
the preconditioner can be substantial [14,17]. Our implementation of ICT-SI is based on an incomplete version of the parallel direct solver DSCPACK [13].
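A minimal illustration of the selective-inversion idea described in this section follows: once L^a_11 has been explicitly inverted during preconditioner construction, the latency-sensitive forward substitution at that supernode is replaced by a matrix-vector product, which is far easier to overlap with communication. This is a serial, dense sketch of the arithmetic only, not the distributed supernodal implementation in DSCPACK; the function names are illustrative.

#include <vector>
#include <cstdio>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;   // dense, row-major blocks, for illustration only

// Conventional forward substitution: solve L11 * x1 = b1.
static Vec forwardSolve(const Mat& L11, const Vec& b1) {
    size_t n = b1.size();
    Vec x(n);
    for (size_t i = 0; i < n; ++i) {
        double s = b1[i];
        for (size_t j = 0; j < i; ++j) s -= L11[i][j] * x[j];
        x[i] = s / L11[i][i];
    }
    return x;
}

// Selective inversion: apply the precomputed inverse, x1 = inv(L11) * b1.
// On a distributed supernode this is a latency-tolerant matrix-vector product.
static Vec applyInverse(const Mat& L11inv, const Vec& b1) {
    size_t n = b1.size();
    Vec x(n, 0.0);
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j) x[i] += L11inv[i][j] * b1[j];
    return x;
}

// Update passed to the ancestor supernode: b2 <- b2 - L21 * x1.
static void updateAncestor(const Mat& L21, const Vec& x1, Vec& b2) {
    for (size_t i = 0; i < L21.size(); ++i)
        for (size_t j = 0; j < x1.size(); ++j) b2[i] -= L21[i][j] * x1[j];
}

int main() {
    Mat L11 = {{2, 0}, {1, 1}};           // lower-triangular diagonal block
    Mat L11inv = {{0.5, 0}, {-0.5, 1}};   // its explicit inverse (computed once)
    Mat L21 = {{3, 2}};                   // off-diagonal block
    Vec b1 = {4, 3}, b2 = {10};

    Vec xSub = forwardSolve(L11, b1);     // substitution version
    Vec xInv = applyInverse(L11inv, b1);  // selective-inversion version
    updateAncestor(L21, xInv, b2);
    std::printf("x1 = (%g, %g) by substitution, (%g, %g) by inversion; b2 = %g\n",
                xSub[0], xSub[1], xInv[0], xInv[1], b2[0]);
}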
3 Empirical Results We now provide some preliminary results on the performance of our parallel ICT-SI. Additionally, we include, for purposes of comparison, results on the performance of level-0 incomplete Cholesky (ICF(0)) [6] and sparse approximate inverse (Parasails) [2] preconditioners. Our experiments were performed on a cluster with 2.4 GHz Intel Xeon processors and a Myrinet interconnect, using the parallel CG implementation in the PETSc package [1]. We first report scaled performance when the problem size is scaled with the processors to keep the arithmetic cost of factorization fixed per processor. We used a series of sparse matrices from model grids as shown in Table 1. Figure 2 shows the total time for a single solution and for a sequence of ten right-hand side vectors. The number of iterations, a measure of the effectiveness of the preconditioner, is shown in the left half of Figure 3. ICT-SI typically results in fewer iterations compared to the other methods, but the cost of preconditioner construction is relatively high. We conjecture that the scheme is more appropriate when the cost of preconditioner construction can be amortized over a sequence of solutions for different right-hand side vectors (as shown in Figure 2 for a sequence of 10 solutions).
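The amortization argument can be made explicit with a simple cost model; the symbols below are generic placeholders rather than measured quantities from our experiments. If a preconditioner requires time T_constr to build and T_solve per right-hand side once built, then a sequence of k right-hand sides costs roughly T_constr + k T_solve, so ICT-SI is preferable to a cheaper-to-build alternative whenever

\[
T^{\mathrm{ICT}}_{\mathrm{constr}} + k\,T^{\mathrm{ICT}}_{\mathrm{solve}} \;<\; T^{\mathrm{alt}}_{\mathrm{constr}} + k\,T^{\mathrm{alt}}_{\mathrm{solve}},
\qquad\text{i.e.}\qquad
k \;>\; \frac{T^{\mathrm{ICT}}_{\mathrm{constr}} - T^{\mathrm{alt}}_{\mathrm{constr}}}{T^{\mathrm{alt}}_{\mathrm{solve}} - T^{\mathrm{ICT}}_{\mathrm{solve}}},
\]

assuming the alternative is cheaper to construct but more expensive per solve.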
Table 1. Description of sparse test matrices from a model two-dimensional finite-difference formulation; K is the grid size, N is the matrix dimension, |A| is the number of nonzeroes in the matrix, and |L| is the number of nonzeroes in L where A = LL^T

Processors (P)     K        N        |A| (10^6)   |L| (10^6)
      1           200     40000       0.119        1.092
      2           250     62500       0.187        1.819
      4           312     97344       0.291        2.976
      8           392    153664       0.469        4.959
     16           490    240110       0.326        3.399
     32           618    386884       1.159       13.774
Table 2. Description of sparse matrices. N is the matrix dimension and |A| is the number of nonzeroes in the matrix

Matrix        N           |A|         Description
augustus5     134,144     645,028     Diffusion equation from 3D mesh
engine        143,571     2,424,822   Engine head, linear tetrahedral elements
augustus7     1,060,864   9,313,876   Diffusion equation from 3D mesh
We next report on the performance of ICT-SI on three matrices from practical applications, described in Table 2. The results for these three matrices are summarized in Table 3. These results indicate that ICT-SI constructs good quality preconditioners that can significantly reduce the number of CG iterations. However, preconditioner construction costs are still relatively large. When the preconditioner is used for a sequence of ten solutions, ICT-SI is the best performing method for all instances. For augustus7, the largest matrix in our collection, we show solution times for one and ten right-hand side vectors using 4–64 processors in Figure 3. These plots indicate that ICT-SI is an effective solver when the costs of preconditioner construction can be amortized over several solutions.
4 Conclusions We have developed a parallel incomplete Cholesky preconditioner (ICT-SI) which can accommodate the full range of fill-in to meet the preconditioning needs of a variety of applications. Preconditioner construction in ICT-SI relies largely on the techniques derived from efficient parallel sparse direct solvers. Efficient preconditioner application is achieved by using ‘selective inversion’ techniques. The performance of ICT-SI on the matrices from model two-dimensional grid problems compares well with that of sparse approximate inverse preconditioners. However, the cost of preconditioner construction dominates the overall cost of a single solution especially on larger numbers of processors. The preconditioners produced by ICT-SI are quite effective when systems from practical applications are considered. We expect ICT-SI to be useful in applications that produce
Table 3. Performance of parallel preconditioners on three large sparse systems from practical applications. Each entry reports the execution time in seconds (Time) and the number of CG iterations (Its)

augustus5 (Matrix size: 134,144; Nonzeroes: 645,028)

Single right-hand side vector
Method              P=1             P=2             P=4             P=8             P=16
                 Time    Its     Time    Its     Time    Its     Time    Its     Time    Its
IC(0)           119.3    593    125.7    846     72.5    595     36.5    648     24.7    633
Parasails        28.1    306     24.4    305     18.7    305      9.2    317      5.1    306
ICT-SI           16.2     62     12.1     63     11.0     69      8.2     74      6.2     79

10 right-hand side vectors
IC(0)           698.5   5925   1018.7   8623    474.3   5924    241.6   6485    152.9   6309
Parasails       217.4   3324    196.3   3342    145.5   3340    66.92   3335     36.8   3344
ICT-SI          65.27    624    44.90    630    35.26    692     24.0    740     20.1    740

engine (Matrix size: 143,571; Nonzeroes: 2,424,822)

Single right-hand side vector
IC(0)           239.9   1626    168.5   1707    114.4   1576     99.5   1643     73.8   1699
Parasails       125.9   1036     89.8   1034    62.89   1037     44.9   1036     33.4   1036
ICT-SI           48.1    282     35.2    252     35.1    287     45.2    306     34.2    308

10 right-hand side vectors
IC(0)          2279.1  16260   1577.3  16958   1104.9  16457    901.8  16337    659.7  17014
Parasails      1166.3  11281    837.6  11329    573.1  11247    408.0  11284    288.2  11277
ICT-SI          480.1   3321    240.4   2591    179.6   2862    165.9   3047    132.4   3095

augustus7 (Matrix size: 1,060,864; Nonzeroes: 9,313,876)

Single right-hand side vector
Method              P=4             P=8             P=16            P=32            P=64
IC(0)          1569.1   1508    633.0   1238    396.2   1367    193.8   1363    118.7   1377
Parasails       378.8    692    186.7    692    103.1    692     49.1    691     25.9    691
ICT-SI          473.2    145    539.9    149    372.9    157    275.5    167    135.4    194

10 right-hand side vectors
IC(0)         12652.3  15074   1307.5  15866    723.3  15875    359.7  15892    179.8  15819
Parasails      3478.8   6931   1703.6   6924    934.8   6919    585.7   6924    228.8   6911
ICT-SI          895.7   1449    802.9   1493    542.9   1591    406.5   1672    245.4   1937
ill-conditioned sparse matrices, especially when the preconditioner can be re-used for a sequence of solutions. We are currently working to reduce the costs of preconditioner construction [16,18] in our parallel ICT-SI scheme by using sparse approximate inversion [2,4,8] instead of explicit inversion within supernodes.
Fig. 2. Total execution time for scaled 2-dimensional grids for a single (left) and ten (right) right-hand side vectors; the curves compare ICF(0), Parasails(1), and ICT-SI(0.01)
Execution time
900 800
ICF(0) Parasails(1) ICT−SI(0.01)
7000 6000
700
5000 Seconds
600 Iterations
ICF(0) Parasails(1) ICT−SI(0.01)
500 400
4000 3000
300 2000 200 1000
100 0 1
2
4
8 Processors
16
32
0 4
8
16 Processors
32
64
Fig. 3. The number of iterations for convergence of CG for scaled 2-dimensional grids (left) and the execution time for ten solutions for augustus7 (right)
Acknowledgments We gratefully acknowledge several useful discussions with Barry Smith. We thank Michele Benzi for providing sparse matrices from his collection. We also thank the Mathematics and Computer Sciences Division at the Argonne National Laboratory for allowing us to use their network of workstations.
References 1. Balay, S., Gropp, W. D., McInnes, L. C., and Smith, B. F. PETSc users manual. Tech. Rep. ANL-95/11 - Revision 2.1.1, Argonne National Laboratory, 2002. 2. Chow, E. Parallel implementation and practical use of sparse approximate inverse preconditioners with a priori sparsity patterns. Int. J. High Perf. Comput. Apps. 15 (2001), 56–74. 3. George, A., and Liu, J. W.-H. Computer Solution of Large Sparse Positive Definite Systems. Prentice-Hall, Englewood Cliffs, NJ, USA, 1981. 4. Grote, M. J., and Huckle, T. Parallel preconditioning with sparse approximate inverses. SIAM J. Sci. Comput. 18, 3 (1997), 838–853. 5. Hestenes, M. R., and Stiefel, E. Methods of conjugate gradients for solving linear systems. National Bureau Standard J. Res. 49 (1952), 409–436. 6. Jones, M., and Plassmann, P. The efficient parallel iterative solution of large sparse linear systems. In Graph Theory and Sparse Matrix Computations, A. George, J. R. Gilbert, and J. W. H. Liu, Eds., vol. 56 of IMA. Springer-Verlag, 1994, pp. 229–245. 7. Karypis, G., and Kumar, V. A parallel algorithm for multilevel graph partitioning and sparse matrix ordering. Journal of Parallel and Distributed Computing 48, 1 (1998), 71–95. 8. Kolotilina, L. Y., and Yeremin, A. Y. Factorized sparse approximate inverse preconditionings. I. Theory. SIAM J. Matrix Anal. Appl. 14, 1 (1993), 45–58. 9. Liu, J. W.-H. The role of elimination trees in sparse factorization. SIAM J. Matrix Anal. Appl. 11, 1 (1990), 134–172. 10. Liu, J. W.-H., Ng, E., and Peyton, B. W. On finding supernodes for sparse matrix computation. SIAM J. Matrix Anal. Appl. 14, 1 (1993), 242–252. 11. Ng, E. G., and Peyton, B. W. Block sparse Cholesky algorithm on advanced uniprocessor computers. SIAM J. Sci. Comput. 14 (1993), 1034–1056. 12. Raghavan, P. Efficient parallel triangular solution using selective inversion. Parallel Processing Letters 9, 1 (1998), 29–40. 13. Raghavan, P. DSCPACK: Domain-separator codes for solving sparse linear systems. Tech. Rep. CSE-02-004, Department of Computer Science and Engineering, The Pennsylvania State University, 2002. 14. Raghavan, P., Teranishi, K., and Ng, E. G. A latency tolerant hybrid sparse solver using incomplete Cholesky factorization. Numer. Linear Algebra Appl. 10 (2003), 541–560. 15. Saad, Y. Iterative Methods for Sparse Linear Systems, second ed. SIAM, Philadelphia, PA, 2003. 16. Teranishi, K. Scalable Hybrid Sparse Linear Solvers. PhD thesis, Department of Computer Science and Engineering, The Pennsylvania State University, 2004. 17. Teranishi, K., Raghavan, P., and Ng, E. A new data-mapping scheme for latency-tolerant distributed sparse triangular solution. In SuperComputing 2002 (2002), pp. 238–247. 18. Teranishi, K., Raghavan, P., Sun, J., and Micharelis, P. A hybrid preconditioner for sparse systems from thermo-mechanical applications. Intl. J. Numerical Methods in Engineering (Submitted). 19. Zlatev, Z. Use of iterative refinement in the solution of sparse linear systems. SIAM J. Numer. Anal. 19 (1982), 381–399.
Parallel Heuristics for an On-Line Scientific Database for Efficient Function Approximation Ivana Veljkovic¹ and Paul E. Plassmann²
¹ Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA 16802, USA, [email protected]
² The Bradley Department of Electrical and Computer Engineering, Virginia Tech, Blacksburg, VA 24061, USA, [email protected]
Abstract. An effective approach for improving the efficiency of multi-scale combustion simulations is the use of on-line scientific databases. These databases allow for the approximation of computationally expensive functions by archiving previously computed exact values. A sequential software implementation of these database algorithms has proven to be extremely effective in decreasing the running time of complex reacting flow simulations. To enable the use of this approach on parallel computers, in this paper we introduce three heuristics for coordinating the distributed management of the database. We compare the performance of these heuristics on two limiting case test problems. These experiments demonstrate that a hybrid communication strategy offers the best promise for a large-scale, parallel implementation.
1 Introduction With a new generation of supercomputers, such as the Earth Simulator and Blue Gene, and the availability of commodity parallel computers, such as Beowulf clusters, the computational power available to scientists today is immense. As a result, the physical models that can be simulated on these computers can be extremely complex and sophisticated. Nonetheless, the complete multi-scale simulation of many phenomena can still be computationally prohibitive. For example, in combustion simulations, the inclusion of detailed chemistry with a Computational Fluid Dynamics (CFD) simulation that accurately models turbulence can be intractable. This difficulty appears because a detailed description of the combustion chemistry usually involves tens of chemical species and thousands of highly nonlinear chemical reactions. The numerical integration of these systems of differential equations is often stiff, with time-scales that can vary from 10⁻⁹ seconds to 1 second. Solving the conservation equations for such chemically reacting flows, even for simple, two-dimensional laminar flows, with this kind of chemistry is computationally expensive [1]. For more complex reacting flows, especially those involving turbulence effects, calculations that involve detailed chemistry are typically so
This work was supported by NSF grants CTS-0121573, ACI-9908057, DGE-9987589, and ACI-0305743.
complex and demanding that they often exceed computational resources of today’s most powerful parallel computers. One approach to address this problem involves the implementation of a specialized database to archive and retrieve accurate approximations to repetitive expensive calculations. That is, instead of solving the equations at every point required by the simulation, one can solve them for a significantly smaller number of points and interpolate these solutions to obtain approximations at nearby points. This approach was originally proposed by Pope for combustion simulations [2]. In his paper Pope proposed to tabulate previously computed function values; when a new function value is required, the set of previously computed values is searched and, if possible, the function is approximated based on these values. A mechanism based on ellipsoids of accuracy was introduced to estimate the error of the function approximations retrieved from the tabulation. In this manner, the tabulation scheme attempts to keep this error within a user-specified tolerance. This approach also has the potential to be used in other application areas. For example, it could be used in radiative heat transfer calculations involving non-gray radiation models—when the radiative properties have a strong wavelength dependence. With these models, frequent, computationally expensive evaluations of the absorption coefficient as a function of temperature, pressure, and mass fraction of chemical species and wavelength are required. These coefficients could be similarly archived and queried using this database approach. Another potential application is the modeling of biological systems, such as the interaction of viruses, antibodies and cells, via the solution of systems of differential equations. Such systems also lead to extremely stiff equations. We are investigating the potential use of this database approach to improve the efficiency of such calculations. In previous work we introduced a sequential algorithm that extended Pope’s approach in several directions [3]. First, we introduced a dynamic updating of the database based on usage statistics to accommodate non-steady state simulations. Second, we developed a geometric partitioning algorithm that guarantees the retrieval of a database entry if it satisfies the error tolerance for the function approximation. These algorithms are implemented in the software system DOLFA. Because of the importance of the use of parallel computers to solve large-scale combustion problems, it is extremely desirable to develop a parallel version of this sequential algorithm. In previous work we proposed an approach for distributing the function database and introduced three communication strategies for addressing load balancing problems in the parallel implementation [4]. In this paper we introduce a detailed formulation of these three communication heuristics and present computational results for two limiting cases that demonstrate the advantage of a proposed hybrid strategy. The remainder of our paper is structured as follows. In Section 2 we briefly review our sequential algorithm as implemented in the software package DOLFA [3]. In Section 3 we describe the initial parallel implementation of DOLFA as presented in [4]. In this section we also present a detailed formulation of three proposed communication heuristics.
Finally, in Section 4 experimental results are presented showing the parallel performance of our three proposed load balancing strategies; these results illustrate the advantage of a hybrid communication heuristic.
2 The Sequential Scientific Database Algorithm
The function values to be computed, stored, and approximated by the database are n-dimensional functions of m parameters, where both n and m can be large (in the 10’s or 100’s for typical combustion problems). We denote this function as F(θ), which can be stored as an n-dimensional vector and where the parameters θ can be represented by an m-dimensional vector. The sequential database algorithm works by storing these function values when they are directly (exactly) computed. If possible, other values are approximated using points stored in the database. In addition to the function values, the database stores other information used to estimate the error based on an Ellipsoid Of Accuracy (EOA) as described in [3]. To illustrate how the sequential database works, when F(θ) is to be calculated for some new point, we first query the database to check whether this value can be approximated with some previously calculated value. For computational combustion, the parameter space is often known as the composite space. The composite space consists of variables that are relevant for the simulation of a chemical reaction, such as temperature, pressure and mass fractions of chemical species.
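The retrieve-or-compute logic just described can be sketched as follows. The names and the simple axis-aligned ellipsoid test below are illustrative assumptions (a zeroth-order retrieval with a diagonal scaling of the parameter space, not DOLFA's actual EOA construction or approximation scheme); the sketch only conveys how a query point θ is either approximated from a stored entry or evaluated directly and archived.

#include <vector>
#include <cmath>
#include <functional>
#include <cstdio>

using Vec = std::vector<double>;

// A stored entry: the tabulation point, its function value, and the semi-axes
// of its (here axis-aligned) ellipsoid of accuracy.
struct Entry {
    Vec theta, f, eoaRadii;
};

struct Database {
    std::vector<Entry> entries;
    std::function<Vec(const Vec&)> directEval;   // the expensive exact evaluation

    // True if theta lies inside the entry's ellipsoid of accuracy.
    static bool insideEOA(const Entry& e, const Vec& theta) {
        double s = 0.0;
        for (size_t i = 0; i < theta.size(); ++i) {
            double d = (theta[i] - e.theta[i]) / e.eoaRadii[i];
            s += d * d;
        }
        return s <= 1.0;
    }

    Vec query(const Vec& theta) {
        for (const Entry& e : entries)            // linear scan; DOLFA uses a BSP tree
            if (insideEOA(e, theta)) return e.f;  // retrieve an approximation
        Vec f = directEval(theta);                // otherwise compute exactly...
        entries.push_back({theta, f, Vec(theta.size(), 0.05)});  // ...and archive it
        return f;
    }
};

int main() {
    Database db;
    db.directEval = [](const Vec& t) { return Vec{std::sin(t[0]) + t[1]}; };
    Vec f1 = db.query({0.50, 1.0});   // direct evaluation, stored
    Vec f2 = db.query({0.51, 1.0});   // retrieved from the stored entry
    std::printf("f1 = %g, f2 = %g, stored entries = %zu\n", f1[0], f2[0], db.entries.size());
}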
Fig. 1. The generation of a BSP tree and the resulting spatial decomposition of a two-dimensional domain from the insertion of three data points. Arrows illustrate the orientation of the cutting planes and the ovals represent the EOAs. Figure taken from [3]
In DOLFA, the database is organized as a Binary Space Partition (BSP) tree to enable the efficient search of the composite space. Internal nodes in the BSP tree denote planes in the m-dimensional space. Terminal nodes (leaves) represent convex regions determined by the cutting planes on the path from the given terminal node to the root of the BSP tree. This construction is illustrated in Fig. 1. For each region we associate a list of database points whose EOAs intersect the region. In addition, to avoid storing points that are unlikely to be used for approximation again, we employ techniques based on usage statistics that allow us to remove such points from the database. The software system DOLFA has demonstrated significant speedups for combustion applications. For example, the software was tested with a code that implements a detailed chemistry mechanism in an HCCI piston engine [5]. For this application, when compared to simulation results based only on direct integration, insignificant effects on the accuracy of the computed solution were observed.
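A compact sketch of this BSP organization: internal nodes hold a cutting plane (a coordinate dimension and an offset), leaves hold the indices of stored points whose EOAs intersect the leaf's convex region, and locating the leaf for a query is a walk from the root. This is a simplified illustration of the data structure with hypothetical names, not the DOLFA source.

#include <memory>
#include <vector>
#include <cstdio>

using Vec = std::vector<double>;

// A BSP node: either an internal node with a coordinate cutting plane
// (x[cutDim] < cutOffset goes left) or a leaf holding indices of database
// points whose ellipsoids of accuracy intersect the leaf's region.
struct BSPNode {
    int cutDim = -1;            // -1 marks a leaf
    double cutOffset = 0.0;
    std::unique_ptr<BSPNode> left, right;
    std::vector<int> pointIds;  // only used in leaves
};

// Walk from the root to the leaf whose region contains the query point.
const BSPNode* locateLeaf(const BSPNode* node, const Vec& theta) {
    while (node->cutDim >= 0)
        node = (theta[node->cutDim] < node->cutOffset) ? node->left.get()
                                                       : node->right.get();
    return node;
}

int main() {
    // Two-dimensional example: one cutting plane at x0 = 0.5.
    auto root = std::make_unique<BSPNode>();
    root->cutDim = 0;
    root->cutOffset = 0.5;
    root->left = std::make_unique<BSPNode>();
    root->right = std::make_unique<BSPNode>();
    root->left->pointIds = {0, 2};   // points whose EOAs touch the left region
    root->right->pointIds = {1};

    const BSPNode* leaf = locateLeaf(root.get(), {0.7, 0.3});
    std::printf("query falls in a leaf with %zu candidate points\n",
                leaf->pointIds.size());
}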
3 The Parallel Database Algorithm
In developing a parallel version of the database, perhaps the most straightforward approach would be to use the sequential version on each processor, without any interprocessor collaboration in building a distributed database. This approach would work; however, it suffers from two significant drawbacks. First, this approach cannot take advantage of direct calculations done on other processors—hence, we would expect significant redundant function calculations. Second, the memory required to store the database can be significant—without partitioning the database among processors, we cannot take full advantage of the total memory available on a parallel computer. In this section, we present parallel algorithms that address each of these issues. The importance of addressing the first issue of redundant evaluations becomes obvious if we compare typical times for function evaluation and interprocessor communication. Namely, for the HCCI combustion simulation which was considered earlier in Section 2, one function evaluation takes 0.3 seconds (an ODE system with 40 chemical species). On the other hand, on the Beowulf cluster where our experiments were conducted the time to communicate the computed results (40 double precision values) is approximately 2.7 microseconds. The relative difference in times is significant, and even if we consider all the synchronization issues, it is still important to design a parallel approach that will coordinate direct calculations between processors to minimize the number of redundant direct calculations. Thus, our goal is to design a parallel implementation of DOLFA that minimizes the total number of direct calculations performed on all processors. However, this minimization must be obtained while maintaining good load balancing and while minimizing the total interprocessor communication. The parallel algorithm consists of two main parts. First, the approach used for building the distributed (global) BSP tree and second, the heuristics developed for managing interprocessor communication and the assignment of direct function evaluations. In the next two subsections we will discuss solutions to these two problems.
3.1 Building the Global BSP Tree As with the sequential case, we typically have no a priori knowledge of the distribution of queries within the search space. The global BSP tree has to be built on-line; it will be partitioned into unique subtrees which are assigned to processors as illustrated in Fig. 2. To achieve a well-balanced partitioning of the global BSP tree, we initially accumulate a certain number of points in the database without constructing the BSP tree. The more points we accumulate, the better the chance that the distributed BSP tree will partition the space in a well-balanced manner, since the partitioning is based on more information about the accessed space. On the other hand, the longer we wait before constructing the global BSP tree, the greater the chance that we will do redundant direct calculations. We must balance these two issues and attempt to design a strategy that will yield sufficient information about the accessed space as early as possible. A formula based on the average number of points in the database (total number of points divided by the number of processors) is used to determine when to build the BSP tree. After a sufficient number of points is accumulated in this initial phase, our goal is to have enough information to construct a BSP tree for the entire search space with reasonably balanced subtrees.
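One way the accumulated points can be turned into roughly balanced subregions is sketched below: pick the coordinate direction with the largest sample variance (the coordinate-plane, maximum-variance criterion described after Fig. 2) and cut at the median so that each side receives about half of the points. The median split and the exact statistics here are illustrative choices, not necessarily the formula used in DOLFA.

#include <vector>
#include <algorithm>
#include <cstdio>

using Vec = std::vector<double>;

// Choose the coordinate dimension with maximum spread over the sample points
// (sum of squared deviations, which is sufficient for the argmax).
int maxVarianceDim(const std::vector<Vec>& pts) {
    int m = static_cast<int>(pts[0].size());
    int best = 0;
    double bestVar = -1.0;
    for (int d = 0; d < m; ++d) {
        double mean = 0.0, var = 0.0;
        for (const Vec& p : pts) mean += p[d];
        mean /= pts.size();
        for (const Vec& p : pts) var += (p[d] - mean) * (p[d] - mean);
        if (var > bestVar) { bestVar = var; best = d; }
    }
    return best;
}

// Cut at the median along that dimension so each side gets about half the points.
double medianCut(std::vector<Vec>& pts, int dim) {
    std::nth_element(pts.begin(), pts.begin() + pts.size() / 2, pts.end(),
                     [dim](const Vec& a, const Vec& b) { return a[dim] < b[dim]; });
    return pts[pts.size() / 2][dim];
}

int main() {
    std::vector<Vec> pts = {{0.1, 5.0}, {0.2, 1.0}, {0.15, 9.0}, {0.12, 3.0}};
    int dim = maxVarianceDim(pts);          // here dimension 1 varies most
    double offset = medianCut(pts, dim);
    std::printf("cut dimension %d at offset %g\n", dim, offset);
}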
Fig. 2. On the left: an illustration of how the global BSP tree is partitioned among processors—each processor owns a unique subtree of the global tree. On the right: an illustration of the “ping-pong” communication between processors. Figure taken from [4]
These subtrees are computed by recursively decomposing the global search space by introducing cutting planes that roughly equally divide the set of points in each subregion. The resulting subtrees (and their corresponding subregions) are then assigned to individual processors. The choice of cutting planes is limited to coordinate planes chosen by computing the dimension with maximum variance. As we mentioned before, each processor owns its subtree but also has information about the portion of the search space stored on every other processor in the environment. Therefore, the search for a suitable point φ to use for the approximation at a point θ is conducted in a coordinated manner between processors. More precisely, if the query point θ is submitted to processor i but this point belongs to the subtree of processor j, processor i will be able to recognize this fact. A simple model for the computational complexity of this initial phase of the parallel database algorithm was developed in [4]. This is a one-time cost which is not significant.
3.2 Managing Interprocessor Communication Before each call to the database from the main driver application, all processors must synchronize to be able to participate in global database operations. This synchronization may be very costly if it is conducted often, for example, if the database is called with a single point at a time as is done with the sequential version of DOLFA. Modifying the DOLFA interface to allow calls made with lists of query points, instead of one point at a time, minimizes the synchronization overhead and enables more efficient interprocessor communication. Fortunately, in most combustion simulations, updates to the fluid mechanics variables and chemical species mass fractions happen independently of one another on the mesh subdomain assigned to each processor. More precisely, the chemical reaction at point P at time t does not affect the chemical reaction at point Q at time t. Therefore, we can pack the points for which we want to know function values into a list and submit it to the database. The output is a list of function values at the points given in the input list. Based on our assumption that the function evaluation is relatively expensive, we want to design a searching scheme that will minimize the number of direct evaluations that must be performed. Whenever there is a possibility that another processor may be
able to approximate the function value at a given point, we would like to communicate that query to that processor for processing. This means that, for a given query point X submitted to processor p1, we have one of two possibilities. First, when point X belongs to the subtree T1 of p1 (see Fig. 2, left), we simply proceed with the sequential algorithm. Second, the more challenging case is when the point X belongs to a subtree of some other processor, say p2, X ∈ T2. As p1 stores some number of points that belong to processor p2 (for example, from the first phase of computation when the BSP tree is not yet built), p1 may still be able to approximate this query. If it cannot, the processor sends the point to processor p2. If p2 can approximate it, it sends back the result. But, if p2 also cannot satisfy the query, the open question is to determine which processor should conduct the direct integration. The biggest issue to consider in answering this question is to determine the effect on load balancing between processors. It may happen that, for a particular time step, all the points for which we want to calculate function values belong to a small subregion of the accessed space. This effect is common in many combustion simulations. If the load imbalance is large, we may lose the performance advantages gained by minimizing the number of redundant direct calculations. Thus, the answer to the question “Should p1 or p2 conduct the direct calculation?” proves to be a crucial part of the parallel implementation of DOLFA. We introduce the following notation to describe the three heuristics proposed in this paper. We denote by L^(i) the list of points submitted to processor p_i. This list is decomposed into sublists L^(i) = ∪_{j=1..n} L_j^(i), where L_j^(i) is the portion of the list that belongs to the subtree T_j of processor p_j, that is, L_j^(i) = L^(i) ∩ T_j. So every processor p_i will have three different types of list to deal with:
– B^(i) = L_i^(i) – the list of points that belong to the subtree of p_i. This is the list of the points that processor p_i has to calculate.
– N^(i) = ∪_{j≠i} L_j^(i) – the list of points that belong to the subtrees of other processors. This is the list of the points that processor p_i will send off to other processors.
– E^(i) = ∪_{j≠i} L_i^(j) – the list of points that p_i received from other processors since they belong to its subtree T_i.
With C^(i) we denote the list of points for which processor p_i does the direct calculations. The open question remains: “If X belongs to L_2^(1) – it was submitted to p1 for calculation but it belongs to the subtree T2 of processor p2 – which processor will calculate F(X)?” We compare three approaches for solving this problem as described below.
Approach 1. The owner of the subtree (p2 in the above example) does the direct integration. In this approach, processor p_i does the direct calculations for all the points in the lists B^(i) and E^(i), that is, C^(i) = B^(i) ∪ E^(i). This approach leads to maximum retrieval rates but introduces a significant load imbalance. Because problems in combustion chemistry usually include turbulence and nonhomogeneous media, this approach can result in significant load imbalance. However, as this approach should minimize the number of direct calculations, it can be used as a benchmark determining a lower bound for the overall number of direct calculations. This lower bound can be used to compare with other approaches to estimate the number of redundant function evaluations.
Approach 2. The processor to which the query was submitted (p1 in the above example) does the direct calculations. In this approach, processor p_i does the direct calculations for all the points in the list B^(i) and for some of the points in the list N^(i), where C^(i) = B^(i) ∪ N̄^(i) (the bar denotes a subset of the list). This approach may mitigate the load imbalance, but it will introduce a communication overhead as we have an additional two-way communication. This communication would consist of processor p2 letting processor p1 know that it cannot approximate F(X) and processor p1 returning to p2 the calculated F(X) together with some other items needed to calculate the ellipsoid of accuracy discussed in Section 2. We describe this communication pattern as a “ping-pong,” as illustrated on the right of Fig. 2. A problem with this approach is a significant reduction in retrieval rates. For example, lists on processors p3 and p4 may contain points that are close to X and that could be approximated with F(X). In approach 1, we would have only one direct calculation for F(X). In this approach, since p2 will send point X back to the original processors to calculate it (in this case, p1, p3 and p4), we would have these three processors redundantly calculate F(X).
Hybrid Approach. We consider a hybrid version combining the best of the two previous approaches. In this approach, processor p_i does the direct calculations for all the points in the list B^(i) and for some of the points in the lists N^(i) and E^(i), where C^(i) = B^(i) ∪ N̄^(i) ∪ Ē^(i) (the bar denotes a subset of the list). Processor p_i decides whether to calculate F(X) for some X ∈ E^(i) based on the average number of direct retrievals in the previous iteration and additional metric parameters. In particular, processor p_i can predict how many calculations it will have in the B^(i) and N^(i) lists—denote this number by avgLocal. The number of direct calculations can vary substantially in the course of the computation; therefore, our prediction has to be local and we calculate it based on the two previous iterations. Processor p_i also knows the average number of direct retrievals (over all processors) in the previous iteration—denote it by avgDirect. Then, processor p_i will do no more than avgExtra = avgDirect − avgLocal direct calculations on the E^(i) list. For the rest of the points in the E^(i) list, p_i will send them back to the original processors to calculate them, using the “ping-pong” communication model. Based on this decision, F(X) is either computed on this processor or sent to the original processor for evaluation.
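The hybrid decision rule can be expressed as a short bookkeeping sketch: a processor estimates its own direct-evaluation load (avgLocal), compares it with the previous iteration's average over all processors (avgDirect), and accepts at most avgExtra = avgDirect − avgLocal of the points in its E^(i) list, bouncing the rest back to their originating processors. The code below is a serial illustration with hypothetical names, not the parallel DOLFA implementation.

#include <vector>
#include <algorithm>
#include <cstdio>

struct Query { int pointId; int originRank; };

// Decide which of the queries in E^(i) this processor will evaluate itself and
// which it will send back ("ping-pong") to their originating processors.
void hybridSplit(const std::vector<Query>& eList,
                 int avgLocal,          // predicted direct work in B^(i) and N^(i)
                 int avgDirect,         // average direct evaluations last iteration
                 std::vector<Query>& evaluateHere,
                 std::vector<Query>& sendBack) {
    int avgExtra = std::max(0, avgDirect - avgLocal);  // budget of extra work
    for (const Query& q : eList) {
        if (static_cast<int>(evaluateHere.size()) < avgExtra)
            evaluateHere.push_back(q);   // absorb up to the budget locally
        else
            sendBack.push_back(q);       // let the originator evaluate the rest
    }
}

int main() {
    std::vector<Query> eList = {{0, 1}, {1, 3}, {2, 1}, {3, 2}, {4, 3}};
    std::vector<Query> here, back;
    hybridSplit(eList, /*avgLocal=*/2, /*avgDirect=*/4, here, back);
    std::printf("evaluate here: %zu, send back: %zu\n", here.size(), back.size());
}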
4 Experimental Results We have designed a test application that mimics typical combustion simulations to test the performance of our parallel implementation. It models a two-dimensional laminar reacting flow. The model parameters are the following: the number of time steps is 20, the mesh size is 50 × 50, and the vector size (size of a query point) is 10. The user-defined tolerance is 0.01 and it takes roughly 0.1 second to perform one direct evaluation. Our experiments were run on a parallel Beowulf cluster with 81 AMD MP2000+ dual processor nodes, 1 GB memory per node, and a high-speed Dolphin interconnection network. We ran two limiting-case experiments which we refer to as Experiments 1 and 2. For Experiment 1, each processor gets a different domain on which to calculate. Thus,
Fig. 3. Left: Difference between the maximum and minimum number of direct calculations across the parallel environment; the load imbalance is obvious for algorithm version 1. Right: Overall (summed over all processors) number of direct calculations across the parallel environment; the number of redundant calculations is very high for the case when we have no communication between processors
for this case, the number of direct calculations should be roughly a linear function of the number of processors. For Experiment 2, each processor gets the same domain. For this case the number of direct calculations should be a roughly constant function of the number of processors. The objective is to observe how the load balancing and communication patterns behave in these extreme cases. We first compared the load balancing for each of the communication heuristics described in the preceding section. Load balancing was measured by computing the difference between the maximum and minimum number of direct calculations across all processors. If this relative difference is near zero, this implies that every processor conducted roughly the same amount of work over the entire calculation (however, there may still be load imbalance at individual time-steps). On the left of Fig. 3, we present the results for Experiments 1 and 2 run on 32 processors. As can be seen, version 1 has the largest load imbalance while the hybrid approach achieves the best load balance. On the right of Fig. 3 we compare the total number of direct calculations (summed over all processors in the environment) for Experiments 1 and 2 run on 32 processors. As expected, we observe that the no-communication implementation has a significant number of redundant calculations compared to any version of the parallel algorithm. In the next series of simulations, we keep the load constant as we increase the number of processors. In Fig. 4 we compare the average overhead as a function of the number of processors used for the four strategies (version 1, hybrid approach, version 2 and no-communication) for Experiment 1 (left) and Experiment 2 (right). The overhead is calculated as the difference between the overall time spent performing database operations and the time to do the direct calculations. For Experiment 1, where the overall number of direct calculations is roughly a linear function of the number of processors, the computational overhead is unacceptable for version 1 as it increases as the number of
Fig. 4. Average overhead time, (Σ Toverhi)/p, in seconds as a function of the number of processors for our four schemes, for Experiment 1 (left) and Experiment 2 (right). Version 1 has a prohibitive overhead.
Fig. 5. Average execution time, (Σ Ti)/p, in seconds as a function of the number of processors for our four schemes, for Experiment 1 (left) and Experiment 2 (right). The hybrid version and version 2 exhibit scalability; however, version 1 does a better job of minimizing the number of direct calculations.
Therefore, even though we observe on the right of Fig. 3 that version 1 minimizes the overall number of direct calculations better than any other scheme, its load imbalance is high (Fig. 3, left), and this scheme fails to meet our goals. However, for Experiment 2 the overhead for version 1 is roughly constant, which implies that this scheme appears to be more scalable. We expect this approach to work well for large numbers of processors when we know that the number of direct calculations is a slowly growing function of the number of processors. Because we expect that real-time applications will be more like the extreme situation described in Experiment 1, version 1 is, in our opinion, not a feasible parallel scheme. But as it has some limiting properties, it can be used for comparison with other schemes that we design. We also notice that the overhead is roughly constant for the hybrid approach. In Fig. 5 we compare the average execution time as a function of the number of processors for the four strategies, with Experiment 1 shown on the left and Experiment 2 on the right.
If we compare Figs. 4 and 5, we note that the overhead time actually dominates the running time for version 1. This effect also demonstrates that version 1 is not a practical communication scheme for our on-line database. On the other hand, we notice that the average execution time for version 2 and the hybrid scheme becomes constant for Experiment 1 and almost constant for Experiment 2 as we increase the number of processors. This suggests that these schemes should be scalable to larger problems and numbers of processors. However, the hybrid version has an advantage over version 2 since it performs fewer redundant direct calculations.
5 Conclusions We showed in our previous work that the sequential database system DOLFA can significantly reduce the execution time of large, multi-scale simulations that involve frequent, expensive function evaluations, and we introduced its parallel extension. In this paper, we give a detailed description of the parallel algorithms that are crucial for the implementation and conduct experiments that demonstrate the performance of the proposed schemes. Our approach is based on partitioning the search space and maintaining a global BSP tree that can be searched on each processor in a manner analogous to the sequential database algorithm. Several heuristics have been introduced which aim to balance the computation of points that are not resolved locally. In future work, we plan to test our parallel implementation with an application that models reacting flow with complex chemistry using a high-order, Direct Numerical Simulation (DNS) fluid code. In addition, we plan to conduct experiments on larger numbers of processors and further examine the scalability of these approaches.
Software Engineering and Problem Solving Environments for Scientific Computing: An Introduction Organizers: Jose C. Cunha1 and Omer F. Rana2 1
Universidade Nova de Lisboa, Portugal [email protected] 2 Cardiff University, UK [email protected]
As computational infrastructure becomes more powerful and complex, there is a greater need to provide tools that help the scientific computing community make better use of such infrastructure. The absence of such tools is likely to lead to users not taking full advantage of newer functionality made available in computational infrastructure. It is also unlikely that a user (a scientist or a systems administrator) will know the full range of options available – or how a combination of configuration options from elements within the infrastructure (such as compute or data servers) can be usefully specified. This issue has two aspects: (1) allowing such functionality to be made available in tools without requiring the user to know about the functionality – an example is the ability to support service discovery. In this case, the user does not really need to know where a particular service instance resides, only that a particular type of service is available for use within their application at a given point in time. (2) Increasing the reliance of the user on tools – an example is the ability to automatically compose a set of services based on the requirement outlined by a user. In this case, a user must place greater trust in the infrastructure to deliver a particular result. This aspect has not been fully realised within existing infrastructure, and remains an active research area. It involves the need to better integrate functionality made available in computational infrastructure, but also requires a “culture change” in the way that users perceive and make use of computational infrastructure.

The last decade has also seen an unprecedented focus on making computational resources (parallel machines and clusters, and data repositories) sharable across national boundaries. Significantly, the emergence of Computational Grids in the last few years, and of the tools to support scientific users on such Grids (sometimes referred to as e-Science), provides new opportunities for the scientific community to undertake collaborative and multi-disciplinary research. Tools for supporting application scientists have often been developed for a particular community (Astrophysics, Biosciences, etc.); a common perspective on the use of these tools, and on making them more generic, is often missing. On the other hand, the software engineering community in computer science often aims to develop approaches which are more generic and more widely deployable.

Research into Problem-Solving Environments (PSEs) is also an active and expanding field, with the potential for a wide impact on how computational resources can be harnessed in a more effective manner. Ideally, PSEs should provide software tools and expert assistance to a user, and serve as an easy-to-use interface to support the development and deployment of scientific applications, thereby allowing the rapid prototyping of ideas and supporting the analysis of experimental data.
The aim of this minisymposium was to bring together experts who have experience of developing (implementing and deploying) software tools to support application scientists, and those who make use of these tools. A particular aim was to highlight lessons learned – and, more importantly, to identify good (and bad) approaches.

The paper by Cannataro et al. describes the different elements of a PSE and why such elements are useful to support. A key aspect is a focus on Grid-based PSEs. Schreiber describes the TENT system – a PSE for use in aerospace engineering applications. The author describes how new components may be integrated within the PSE, and the issues that arise in this process. The papers by Overeinder and Brazier and by Jost et al. describe the inclusion of artificial-intelligence-based approaches within PSEs and the computational infrastructure. Overeinder and Brazier report on the “AgentScape” project, which provides distributed operating-system-like functionality for supporting computational services. The elements of the AgentScape system are outlined, with particular reference to the problem-solving capability of such infrastructure. Jost et al. report on an expert-system-based approach for supporting a user in executing their application on an existing infrastructure, based on dynamic performance data and on the static structure of the application.
A General Architecture for Grid-Based PSE Toolkits Mario Cannataro1 , Carmela Comito2 , Antonio Congiusta2 , Gianluigi Folino3 , Carlo Mastroianni3, Andrea Pugliese2,3 , Giandomenico Spezzano3 , Domenico Talia2 , and Pierangelo Veltri1 1
Università “Magna Græcia” di Catanzaro {cannataro,veltri}@unicz.it 2 DEIS, Università della Calabria {apugliese,comito,congiusta,talia}@si.deis.unical.it 3 ICAR-CNR {folino,mastroianni,spezzano}@icar.cnr.it
Abstract. A PSE Toolkit can be defined as a group of technologies within a software architecture that can build multiple PSEs. We designed a general architecture for the implementation of PSE toolkits on Grids. The paper presents the rationale of the proposed architecture, its components and structure.
1 Introduction Problem Solving Environments (PSEs) have been investigated over the past 30 years. In the pioneering work “A system for interactive mathematics”, Culler and Fried in 1963 began to investigate automatic software systems for solving mathematical problems with computers, focusing primarily on application issues instead of programming issues. At that time, and for many years afterwards, the term “applications” indicated scientific and engineering applications that were generally solved using mathematical solvers or scientific algorithms based on vectors and matrices. More recently, PSEs for industrial, commercial, and business applications have been gaining popularity.

Nowadays, there is no precise definition of what a PSE is. The following well-known definition was given by Gallopoulos, Houstis, and Rice [7]: “A PSE is a computer system that provides all the computational features necessary to solve a target class of problems. [...] PSEs use the language of the target class of problems.” Moreover, they tried to specify the main components of a PSE by defining the equation

PSE = Natural Language + Solvers + Intelligence + Software Bus

where
– Natural Language is specific for the application domain and natural for domain experts;
– Solvers are software components that do the computational work and represent the basic elements upon which a PSE is built;
– Intelligence is useful for resource location, input assistance (with recommender interfaces), scheduling, and analysis of results;
– Software Bus is the software infrastructure that supports the PSE and its use.

Other definitions that partially agree with the above have been given in the last five years. According to Walker et al. [12], “a PSE is a complete, integrated computing environment for composing, compiling, and running applications in a specific area”, whereas Schuchardt and his co-authors defined PSEs as [11] “problem-oriented computing environments that support the entire assortment of scientific computational problem-solving activities ranging from problem formulation to algorithm selection to simulation execution to solution visualization. PSEs link a heterogeneous mix of resources including people, computers, data, and information within a seamless environment to solve a problem.” Finally, in 2003, Cunha defined a PSE as [6] “an integrated environment for solving a class of related problems in an application domain; easy to use by the end-user; based on state-of-the-art algorithms. It must provide support for problem specification, resource management, execution services.” The existence of several different definitions demonstrates that the term PSE is perhaps too generic, and has not been investigated thoroughly enough for a full consensus to be reached in the scientific community.

The main motivation for developing PSEs is that they provide software tools and expert assistance to the computational scientist in a user-friendly environment, thus allowing for more rapid prototyping of ideas and higher research productivity. A PSE provides users with an interface to high-performance computing resources, relieving them from hardware/software details and letting them concentrate on applications; collaboration, visualization, and knowledge are the three main aspects that distinguish a PSE from a mere interface. As specified in the PSE definitions given above, a PSE is typically aimed at a particular computing domain.

An advancement of the PSE concept is the PSE Toolkit concept. A PSE Toolkit can be defined as a group of technologies within a software architecture that can build multiple PSEs. A PSE Toolkit allows for building meta-applications from pre-existing modules, in order to meet the needs of specific solutions. It can support the construction of problem-solving environments in different domains, allowing designers to develop multidisciplinary PSEs. PSEs can benefit from advancements in hardware/software solutions achieved in parallel and distributed systems and tools. One of the most interesting models in the area of parallel and distributed computing is the Grid.

This paper presents a general architecture for Grid-based PSE Toolkits. Section 2 describes the issues to be faced in designing Grid-based PSE Toolkits and reviews some related work. Section 3 presents the reference architecture, and Section 4 concludes the paper.
2 Grid-Based PSE Toolkits As said before, PSEs are typically designed with a specific application domain in mind. This approach simplifies the designer's task and generally produces an environment that is well tailored to a particular application class. On the other hand, it limits the portability of solutions: the resulting PSE cannot generally be used in different application domains without re-designing and re-implementing most or all of the environment. PSE portability, extensibility, and flexibility can be provided using the PSE Toolkit model. PSE Toolkit designers provide the most important components for building up PSEs and a way to compose them when a specific PSE must be developed. Therefore, a PSE Toolkit allows for developing domain-specific PSEs, thus creating meta-applications from pre-existing modules in different domains. PSE users, who often expect to be interfaced with a tightly integrated environment, must have transparent access to dispersed and de-coupled components and resources.

Dealing with distributed environments is another main issue in PSE design and use. Distributed and parallel computing systems are used for running PSEs to obtain high performance and to use distributed data, machines, or software modules. The Grid is a high-performance infrastructure that combines parallel and distributed computing systems. Its main goal is scalable and secure resource sharing and coordinated problem solving in “dynamic, multi-institutional virtual organizations”. The role of the Grid is fundamental, since it potentially provides an enormous amount of dispersed hardware and software resources; such resources can be transparently accessed by a PSE Toolkit. In Grid environments, instruments can be connected to computing, data, and collaboration environments, and all of these can be coordinated for simultaneous operation. Moreover, vast quantities of data can be received from instruments and simulations, and catalogued, archived, and published. The Grid can thus provide a high-performance infrastructure for running PSEs and, at the same time, a valuable source of resources that can be integrated in PSEs and PSE Toolkits; therefore, Grid-aware PSEs can search for and use dispersed high-performance computing, networking, and data resources. Through the combination of PSE Toolkit issues and the exploitation of Grid features, we have the possibility to design Grid-based PSE Toolkits.

Some main issues to deal with in designing a Grid-based PSE Toolkit are:
– common and domain-specific component identification;
– understanding and evaluation of global properties;
– component integration;
– distribution of resources and components;
– component middleware, technology, and infrastructures;
– adaptivity and heterogeneity;
– standardization;
– efficiency.
Common components can be designed and implemented in a PSE Toolkit, and they can be used when a specific PSE must be developed. The design of application-bound components and their interfaces must be considered in the Toolkit, but their implementation will change depending on the PSE in which they will be included.
Most of the PSE Toolkit components are common. Specific components are:
– components that define the goals, resources, and actors of the application domain;
– the ontology of the domain components;
– domain knowledge;
– performance history;
– the interface between the application domain and the PSE Toolkit.
Grid-based PSEs may succeed in meeting more complex application requirements, both in terms of performance and of solution complexity. They integrate heterogeneous components into an environment that provides transparent access to distributed resources, collaborative modeling and simulation, and advanced interfaces.

2.1 Related Work In this section we give a short review of some existing Grid-based PSE Toolkits, so as to illustrate their main characteristics and the different approaches adopted.

The WebFlow toolkit [1] is a Web-based three-tier system where high-performance services are implemented on the Grid using the Globus toolkit [8]. WebFlow provides a job broker to Globus, while Globus takes responsibility for the actual resource allocation. The WebFlow front-end allows the user's task to be specified in the form of an Abstract Task Descriptor (ATD). The ATD is constructed recursively and can comprise an arbitrary number of subtasks. The lowest-level, or atomic, task corresponds to an atomic operation in the middle tier, such as the instantiation of an object or the establishment of interactions between two objects through event binding. A mesh of CORBA-based servers forms the WebFlow middle tier; one of these servers, the gatekeeper server, facilitates secure access to the system. The middle-tier services provide the means to control the life cycles of modules and to establish communication channels between them. Services provided by the middle tier include methods for submitting and controlling jobs, manipulating files, providing access to databases and mass storage, and querying the status of the system and of users' applications. WebFlow applications range from land management systems to quantum simulations and seamless gateway access.

GridPort [9] is an open architecture providing a collection of services, scripts, and tools that allow developers to connect Web-based interfaces to the computational Grid behind the scenes. The scripts and tools provide consistent interfaces between the underlying infrastructure and security, and are based on Grid technologies such as Globus and standard Web technologies such as CGI and Perl. GridPort is designed so that multiple application portals share the same installation of GridPort, inherit connectivity to the computational Grid – including interactive services and data, file, and account management – and share a single accounting and login environment. An interesting application of GridPort is a system called Hot-Page, which provides users with a view of distributed computing resources and allows individual machines to be examined for status, load, etc.; moreover, users can access files and perform routine computational tasks.

The XCAT Grid Science Portal (XCAT-SP) [10] is an implementation of the NCSA Grid Science Portal concept, that is, a problem-solving environment that allows scientists to program, access, and execute distributed applications using Grid resources which are launched and managed by a conventional Web browser and other desktop tools.
XCAT-SP is based on the idea of an “active document”, which can be thought of as a “notebook” containing pages of text and graphics describing the structure of a particular application and pages of parameterized, executable scripts. These scripts launch and manage an application on the Grid, and its results are dynamically added to the document in the form of data or links to output results and event traces. Notebooks can be “published” and stored in web-based archives for others to retrieve and modify. The XCAT Grid Science Portal has been tested with various applications, including the distributed simulation of chemical processes in semiconductor manufacturing and collaboratory support for X-ray crystallographers.

The Cactus Code and Computational Toolkit [3] provides application programmers with a high-level set of APIs able to hide features such as the underlying communication and data layers; these layers are implemented in modules that can be chosen at runtime, using the proper available technology for each resource. The Cactus core code and toolkits are written in ANSI C, offer parallel I/O, checkpointing, and recovery of simulations, and provide users with an execution-steering interface. Much of the Cactus architecture is influenced by the vast computing requirements of its main applications, including numerical relativity and astrophysics. These applications, which are being developed and run by large international collaborations, require Terabyte and Teraflop resources, and will provide an ideal test case for developing Grid computing technologies for simulation applications.

DataCutter [2] is an application framework providing support for developing data-intensive applications that make use of scientific datasets in remote/archival storage systems across a wide-area network. DataCutter uses distributed processes to carry out a rich set of queries and application-specific data transformations. DataCutter also provides support for sub-setting very large datasets through multi-dimensional range queries. The programming model of DataCutter is the filter-stream one, where applications are represented by a collection of filters connected by streams, and each filter performs some discrete function. A filter can also process multiple logically distinct portions of the total workload; each such portion is referred to as a unit-of-work, and provides an explicit time at which adaptation decisions may be made while an application is running.

The Knowledge Grid [5] is a software infrastructure for Parallel and Distributed Knowledge Discovery (PDKD). The Knowledge Grid uses basic Grid services such as communication, authentication, information, and resource management to build more specific PDKD tools and services. Knowledge Grid services are organized into two layers: the core K-Grid layer, which is built on top of generic Grid services, and the high-level K-Grid layer, which is implemented over the core layer. The core K-Grid layer comprises services for (i) managing metadata describing data sources, data mining software, results of computations, data and results manipulation tools, execution plans, etc.; and (ii) finding mappings between execution plans and available resources on the Grid that satisfy the application requirements. The high-level K-Grid layer instead comprises services for building and executing PDKD computations. Such services are targeted at (i) the searching, selection, extraction, transformation, and delivery of the data to be mined and of data mining tools and algorithms; (ii) the generation of possible execution plans on the basis of application requirements; and (iii) the generation and visualization of PDKD results.
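To make the filter-stream idea mentioned in the DataCutter description above more concrete, here is a toy pipeline built from Python generators; this is only an illustrative sketch and not DataCutter's actual API:

```python
# Toy filter-stream pipeline in the spirit of the filter-stream model:
# each filter performs one discrete function and passes results downstream.

def read_chunks(dataset):                 # source filter
    for chunk in dataset:
        yield chunk

def range_select(stream, lo, hi):         # filtering step (cf. range queries)
    for chunk in stream:
        yield [x for x in chunk if lo <= x <= hi]

def transform(stream):                    # application-specific transformation
    for chunk in stream:
        yield [x * x for x in chunk]

dataset = [[1, 5, 9], [2, 7, 11], [4, 6, 8]]   # one inner list per unit-of-work
pipeline = transform(range_select(read_chunks(dataset), 2, 8))
for result in pipeline:
    print(result)
```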
Fig. 1. A reference architecture of a PSE Toolkit: a Graphical User Interface with Description, Composition, and Execution/Steering interfaces; a Search/Discovery System; a Description System; a Metadata Repository; a Component Repository; and an Execution/Resource Manager, built on top of the programming environment, the (Grid) middleware, and the software/hardware.
3 A Reference Architecture for a Grid-Based PSE Toolkit As mentioned in the previous section, PSE portability, extensibility, and flexibility can be provided through the use of a PSE Toolkit. In this section, we identify the main components of a PSE Toolkit and describe how these components should interact to implement PSEs in a distributed setting. A minimal reference architecture of a PSE Toolkit should include:
– A Graphical User Interface;
– A Repository of usable components, including user-defined applications;
– A metadata-based Description System, possibly based on application-domain ontologies;
– A Metadata Repository, possibly including an ontology repository;
– A Search/Discovery System;
– An Execution/Resource Manager.
Figure 1 shows all these components and the interactions among them that compose the architecture. In particular, the Component Repository represents a library of objects used to build PSE applications, the Metadata Repository is a knowledge base describing that library, and the remaining components are services used to search, design, and execute PSE applications. Details of each component are now discussed.
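A bare-bones sketch of how these components could be expressed as programming interfaces is given below (illustrative Python only; the class and method names are our own and are not prescribed by the architecture):

```python
# Minimal interface sketch of the reference architecture's components.
from abc import ABC, abstractmethod

class ComponentRepository(ABC):
    @abstractmethod
    def publish(self, component): ...
    @abstractmethod
    def fetch(self, component_id): ...

class MetadataRepository(ABC):
    @abstractmethod
    def describe(self, component_id, metadata): ...
    @abstractmethod
    def query(self, constraints): ...        # returns matching component ids

class SearchDiscoverySystem(ABC):
    @abstractmethod
    def find(self, ontology_terms, constraints): ...

class ExecutionResourceManager(ABC):
    @abstractmethod
    def run(self, application, resources): ...

class PSEToolkit:
    """Wires the services together; a GUI would sit on top of this object."""
    def __init__(self, components, metadata, search, executor):
        self.components = components
        self.metadata = metadata
        self.search = search
        self.executor = executor

    def compose_and_run(self, ontology_terms, constraints, resources):
        candidates = self.search.find(ontology_terms, constraints)
        application = [self.components.fetch(cid) for cid in candidates]
        return self.executor.run(application, resources)
```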
Graphical User Interface. It allows for:
– The description of available components, the semi-automatic construction of the associated metadata, and their publishing. To this aim, the GUI interacts with the Description System. These actions can be driven by the underlying ontology, which helps the user in classifying components.
– The construction of applications and their debugging. For doing this, it goes through the following steps:
1. Interaction with the Search and Discovery System to let users pose queries and collect results about available components potentially usable for the PSE composition, or directly browse the metadata repository. As before, both querying and browsing can be guided by the ontology.
2. Design of an application through visual facilities.
3. Interactive validation and debugging of the designed application.
4. Invocation of the Execution Manager.
– The execution and dynamic steering of applications and the visualization of results. To this end, the GUI interacts with the Execution Manager to monitor the execution and to show the results, and it gives the user the possibility to steer the application.
– The storing of useful applications and results, annotated with domain knowledge. In this way, applications and their associated domain knowledge can be shared among PSE users.
In order to seamlessly integrate the facilities of external tools, the Graphical User Interface must also allow for the direct use of their native GUIs. This capability is fundamental, e.g., when external tools are provided through their native GUIs only, or when their integration in the Toolkit's GUI would result in excessive complexity.

Component Repository. The repository of usable components holds the basic elements/modules that are used to build a PSE. Examples of such components are software tools, data files, archives, remote data sources, libraries, etc. The repository must be able to integrate a possibly pre-existing repository provided by the programming environment used to implement the PSE Toolkit, also comprising components defined in different languages and/or tools. The Component Repository must be able to manage the dynamic connection/disconnection of Grid resources offering components. In fact, in this case computing nodes and network connections must be considered as components that can be used in the implementation of a PSE. The Component Repository can include complete user-defined applications that, when published in the repository, enhance the PSE's capability to solve specific problems.

Metadata Repository. The Metadata Repository stores information about the components in the Component Repository. Such information comprises component owners and providers, access modalities, interfaces, performance (offered or required), usage history, availability, and cost. For the description of component interfaces, the PSE Toolkit provides an interface definition language (IDL) with which every component must comply in order to be usable by the Toolkit. Thus, a component provider either makes the component directly accessible through the IDL, or describes in the associated metadata how to translate invocations written in the Toolkit IDL into actions to be taken and/or into invocations written in the native language of the component.
Finally, reusable solutions, i.e., modules of applications for solving particular sub-problems, which are typical for PSEs, can also be managed and described as resources. The Metadata Repository can be implemented by using a Domain Ontology, i.e., a conceptualization, in a standard format, of component metadata, utilization, and relationships. Such an ontology can be extended to describe user-defined applications.

Description System. The Description System must be able to offer a clear description of each element a PSE can be composed of; in the proposed architecture, it is based on the Metadata Repository. An important issue to be considered for the Description System is the use of ontologies for enriching the semantics of component descriptions. This layer extends the metadata facilities and can provide the user with a semantic-oriented view of resources. Basic services of the Description System are component classification through taxonomies; component annotations, for example indicating which problem a component is useful for; and structured metadata description, for example using standard, searchable data. Whenever new components/applications are added to the Component Repository, new knowledge is added to the Metadata Repository through the Description System.

Search/Discovery System. The Search and Discovery System accesses the Metadata Repository to search for and find all the resources in the environment where the Toolkit runs which are available to a user for composing a PSE. While the implementation of this system on a sequential machine is straightforward, in a parallel/distributed setting the Search/Discovery System should be designed as a distributed service able to search for resources on a fixed set of machines. Moreover, in a Grid computing environment such a system should be able to search over a dynamically changing, heterogeneous collection of computers. Basic services of the Search/Discovery System are the ontology-based search of components, which allows components to be searched for on the basis of the taxonomies they belong to as well as by specifying constraints on their functions, and the key-based search. The former operates on a portion of the knowledge base selected through the ontology, whereas the latter operates on the entire knowledge base.

Execution/Resource Manager. The PSE Toolkit Execution/Resource Manager is the run-time support of the generated PSEs. Its complexity depends on the components involved in the PSE composition and on the hardware/software architecture used. The Execution/Resource Manager must tackle several issues, e.g., the selection of the actual component instances to be used, the suitable assignment of processes to heterogeneous computing resources (scheduling), and the distributed coordination of their execution, possibly adapting the running applications to run-time (unpredictable) changes in the underlying hardware/software infrastructure. If a Grid computing architecture is considered, the Execution/Resource Manager, besides interacting with the software/hardware system, has a tight connection with the Grid fabric environment and with the Grid middleware. Note that the Execution/Resource Manager is also responsible for performing all the actions needed to activate the available components, driven by their metadata. These actions can comprise, e.g., compiling a piece of code or a library before using it, or launching a program onto a particular virtual machine. The execution manager
must have tight relationships with the run-time support of the programming languages used for coding the components.
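As a purely illustrative sketch of the metadata-driven activation just described, the following Python fragment shows a hypothetical component record and the activation plan an Execution/Resource Manager could derive from it (all field names and commands are invented; they are not part of the proposed architecture's specification):

```python
# Hypothetical metadata record and the activation actions an
# Execution/Resource Manager could derive from it before running a component.

component = {
    "id": "C3",
    "kind": "library",              # e.g. "library", "program", "service"
    "language": "Fortran",
    "needs_build": True,
    "build_command": ["make", "solver"],
    "launch_command": ["./solver", "--input", "mesh.dat"],
    "target_resource": "cluster-node-07",
}

def activation_plan(meta):
    """Return the ordered list of actions needed to activate a component,
    driven by its metadata (cf. compiling a library or launching a program)."""
    plan = []
    if meta.get("needs_build"):
        plan.append(("compile", meta["build_command"]))
    plan.append(("stage", meta["target_resource"]))
    plan.append(("launch", meta["launch_command"]))
    return plan

for step in activation_plan(component):
    print(step)
```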
4 Conclusions We have proposed a reference architecture for a Grid-based PSE Toolkit, whose main characteristics are: (i) the existence of a knowledge base built around pre-existing components (software and data sources); (ii) the composition of the application, conducted through the searching and browsing of the knowledge base; and (iii) the distributed and coordinated execution of the application on the selected distributed platform, which can be a traditional distributed system or a Grid. A more complete description of the architecture and of the related topics can be found in [4]. We are working to develop a prototype version of the PSE toolkit for a Grid environment based on Globus Toolkit 3. This will also allow us to investigate how to exploit Grid services for implementing the PSE Toolkit functionality.
References 1. E. Akarsu, G. Fox, W. Furmanski, T. Haupt. Webflow - High-Level Programming Environment and Visual Authoring Toolkit for High Performance Distributed Computing. Supercomputing’98, Florida, November 1998. 2. M. D. Beynon, R. Ferreira, T. Kurc, A. Sussman, and J. Saltz. DataCutter: Middleware for Filtering Very Large Scientific Datasets on Archival Storage Systems. MASS2000, pages 119133. National Aeronautics and Space Administration, Mar. 2000. NASA/CP 2000-209888. 3. The cactus code website. http://www.cactuscode.org. 4. M.Cannataro, C. Comito, A. Congiusta, G.Folino, C. Mastroianni, A.Pugliese, G. Spezzano, D. Talia, P. Veltri. Grid-based PSE Toolkits for Multidisciplinary Applications. RT-ICARCS-03-10. ICAR-CNR Technical Report, October 2003. 5. M. Cannataro, D. Talia. KNOWLEDGE GRID: An Architecture for Distributed Knowledge Discovery. Communications of the ACM, January 2003. 6. J. C. Cunha. Future Trends in Distributed Applications and PSEs. Talk held at the Euresco Conference on Advanced Environments and Tools for High Performance Computing, Albufeira (Portugal), 2003. 7. E. Gallopoulos, E. N. Houstis, J. Rice. Computer as Thinker/Doer: Problem-Solving Environments for Computational Science. IEEE Computational Science and Engineering, vol.1, n. 2, 1994. 8. The Globus Project. http://www.globus.org. 9. The Grid Portal Toolkit. http://gridport.npaci.edu. 10. S. Krishnan, R. Bramley, D. Gannon, M. Govindaraju, R. Indurkar, A. Slominski, B. Temko, R. Alkire, T. Drews, E. Webb, and J. Alameda. The XCAT Science Portal. High-Performance Computing and Networking Conference (SC), 2001. 11. K. Schuchardt, B. Didier, G. Black. Ecce - a Problem-Solving Environment’s Evolution toward Grid Services and a Web Architecture. Concurrency and computation: practice and experience, vol. 14, pages 1221-1239, 2002. 12. D. Walker, O. F. Rana, M. Li, M. S. Shields, Y. Huang. The Software Architecture of a Distributed Problem-Solving Environment. Concurrency: Practice and Experience, vol. 12, No. 15, pages 1455-1480, December 2000.
An Expert Assistant for Computer Aided Parallelization Gabriele Jost1, Robert Chun2, Haoqiang Jin1, Jesus Labarta3, and Judit Gimenez3 1
NAS Division, NASA Ames Research Center Moffett Field, CA 94035-1000, USA {gjost,hjin}@nas.nasa.gov 2 Computer Science Department, San Jose State University San Jose, CA 95192, USA [email protected] 3 European Center for Parallelism in Barcelona Technical University of Catalonia (CEPBA-UPC), cr. Jordi Girona 1-3, Modul D6, 08034 – Barcelona, Spain {jesus,judit}@cepba.upc.es
Abstract. The prototype implementation of an expert system was developed to assist the user in the computer aided parallelization process. The system interfaces to tools for automatic parallelization and performance analysis. By fusing static program structure information and dynamic performance analysis data, the expert system can help the user to filter, correlate, and interpret the data gathered by the existing tools. Sections of the code that show poor performance and require further attention are rapidly identified, and suggestions for improvement are presented to the user. In this paper we describe the components of the expert system and discuss its interface to the existing tools. We present a case study to demonstrate its successful use in full-scale scientific applications.
1 Introduction When porting an application to a parallel computer architecture, the program developer usually goes through several cycles of code transformations followed by performance analysis to check the efficiency of parallelization. A variety of software tools have been developed to aid the programmer in this challenging task. A parallelization tool will usually provide static source code analysis information to determine if and how parallelization is possible. A performance analysis tool provides dynamic runtime information in order to determine the efficiency of time consuming code fragments. The programmer must take an active role in driving the analysis tools and in interpreting and correlating their results to make the proper code transformations to improve parallelism. For large scientific applications, the static and dynamic analysis results are typically very complex and their examination can pose a daunting task for the user. In this paper we describe the coupling of two mature analysis tools by an expert system. The tools under consideration
The author is an employee of Computer Sciences Corporation.
are the CAPO [3] parallelization tool and the Paraver [12] performance analysis system. The techniques described in this paper are applicable to many parallel programming paradigms, but we will restrict our discussion to OpenMP [10] parallelization for shared-memory computer architectures. OpenMP supports loop-level parallelization in the form of compiler directives. The program developer has to insert the directives and specify the scope of the variables. The CAPO parallelization tool was developed to aid the programmer in this task. CAPO automates OpenMP directive insertion into existing Fortran codes and allows user interaction for an efficient placement of the directives. This is achieved by use of extensive interprocedural analysis from CAPTools [2] (now known as ParaWise), which was developed at the University of Greenwich. Dependence analysis results and other static program structure information are stored in an application database. CAPO provides an extensive set of browsers which allow the user to examine the information and to provide additional information, such as values of runtime parameters. While this enables a very precise analysis and the generation of highly efficient parallel code, it also introduces challenges for the program developer: a full-scale scientific application will usually contain many subroutines, loops, and variables, yielding a large amount of analysis information. Since it is impossible to examine all dependences, the programmer has to choose where to focus his efforts.

Runtime performance analysis information provides a means to help the user focus his optimization efforts. The Paraver performance analysis system is being developed and maintained at the European Center for Parallelism in Barcelona (CEPBA). Its major components are the tracing package OMPItrace [9], a graphical user interface to examine the traces based on time line views, and an analysis module for the computation of performance statistics. Profiling information can be displayed in the form of tables or histograms. The statistics can be correlated and mapped onto other objects, such as subroutines and threads of execution. Just like in the case of the static application database, the performance trace of a large application will contain a plethora of information, and the user is faced with the challenge of its meaningful interpretation. An expert analyst will usually go through repetitive inspections of application data dependences and performance statistics. The novice user will often not even know where to start and what to look for. In our approach we try to address the needs of the expert as well as the novice user.

The rest of the paper is structured as follows: In Section 2 we describe the prototype implementation of our expert system. In Section 3 we present a case study to show the usefulness of the system. In Section 4 we discuss some related work, and we draw our conclusions in Section 5.
2 The Expert System Prototype Implementation An intelligent computerized parallelization advisor was developed using an expert system implemented in CLIPS (“C” Language Integrated Production System) [1]. By performing data fusion on the static and dynamic analysis, the expert system can help the user to filter, correlate, and interpret the data. Relevant performance indices are automatically calculated and correlated with program structure information. The overall architecture of our program environment is depicted in Figure 1.
Fig. 1. Architecture of the expert system programming environment. The Parallelization Assistant Expert System fuses program structure knowledge and performance trace information to aid the user in narrowing down performance problems. The user can still interact directly with the parallelization and performance analysis tools for fine tuning of the application’s performance
The application database containing static program structure information, such as dependence analysis results, is gathered during a CAPO session. The information stored in the application database is used to generate parallelized source code. In the following we describe the steps performed by the expert system in order to evaluate the efficiency of the generated code.

Instrumenting the Parallel Source Code for Performance Analysis: The expert system prototype uses the Paraver OMPItrace package for instrumentation and tracing of the application. OMPItrace generates traces containing time-stamped events during the program's execution. Tracing of some events occurs automatically and does not require any modification of the source code or relinking of the application with special libraries. Examples of such events are:
– entry and exit of compiler-generated routines for the execution of parallelized code segments,
– entry and exit of parallelization runtime libraries, such as the OpenMP runtime library,
– hardware counters, if they are available.
User-level subroutines have to be manually instrumented in the source code. Instrumentation of all subroutines may introduce a lot of overhead. For critical code segments, on the other hand, it is often desirable to obtain information at a finer granularity than a subroutine call. The prototype expert system uses CAPO's source code transformation capabilities and the program structure information from the application database to selectively instrument the code. Only routines that are not contained within a parallel region or a parallel loop and that contain at least one DO-loop are chosen for instrumentation. In addition to this, outermost serial loops are also instrumented. More details on the automatic selective instrumentation can be found in [4].
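The selection rule just described can be paraphrased as a small filter over the application database; the sketch below is illustrative Python with a hypothetical routine descriptor, not CAPO's actual data model:

```python
# Sketch of the selective-instrumentation rule: instrument only routines that
# are not already inside a parallel region/loop and that contain a DO-loop,
# plus the outermost serial loops.

def select_instrumentation(routines):
    """routines: iterable of dicts with (hypothetical) keys
    'name', 'inside_parallel_construct', 'do_loops', 'outermost_serial_loops'."""
    points = []
    for r in routines:
        if not r["inside_parallel_construct"] and r["do_loops"] > 0:
            points.append(("routine", r["name"]))
        points.extend(("loop", loop) for loop in r["outermost_serial_loops"])
    return points

routines = [
    {"name": "solver", "inside_parallel_construct": False,
     "do_loops": 3, "outermost_serial_loops": ["solver_do10"]},
    {"name": "kernel", "inside_parallel_construct": True,
     "do_loops": 1, "outermost_serial_loops": []},
]
print(select_instrumentation(routines))
```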
Automatic Retrieval of Performance Indices: We have extended the Paraver system with Paramedir, a non-graphical command-line interface to the analysis module. The specification of Paraver trace file views and metric calculations can be saved to reusable configuration files. Paramedir accepts the same trace and configuration files as Paraver. In this way the same information can be captured in both systems. Paramedir is invoked in batch mode, taking as input a trace file and a configuration file which specifies the metric calculation. The generated output is an ASCII file containing the calculated statistics. This supports the programmability of performance analysis in the sense that complex performance metrics, determined by an expert user, can be automatically computed and processed. The detailed human-driven analysis can thus be translated into rules suitable for processing by an expert system. Details on Paramedir can be found in [5]. Examples of metrics of the instrumented code segments which are automatically calculated by the current prototype are:
– timing profiles: the percentage of time which is spent in subroutines and loops,
– sequential sections: the time that the master thread spends outside of parallel regions,
– useful time: the time that the application spends running user code and not idling, synchronizing, or incurring other parallelization-introduced overhead,
– workload balance: the coefficient of variation over the threads of the number of instructions within useful time,
– parallelization granularity: the average duration of a parallel work-sharing chunk,
– estimated parallel efficiency: the speed-up divided by the number of threads. The execution time for a run on one thread is estimated using the useful time of all threads of the parallel run.
These metrics are automatically calculated using Paramedir and are stored in a table. They are the facts for the rule-based analysis of the expert system.

Information Fusion: After calculating the performance indices, the expert system extracts program structure information from the application database. At this point the main interest is the optimal placement of OpenMP directives. The prototype implementation retrieves the type of the instrumented code segment, such as loop or subroutine, the loop identifier, and the type of the loop, such as being parallel or sequential.

Rule Based Analysis: The expert system uses observations to infer reasons for poor performance. The observations are a list of facts about the calculated performance metrics and the static program analysis for the instrumented code segments, as described in the previous paragraphs. Examples are: “the subroutine takes 50% of the execution time” or “the variation of executed instructions among the threads is high”. The metrics are compared to empirically determined threshold values in order to determine what is high and what is low. Conclusions are inferred through a set of rules that interpret certain patterns as reasons for poor performance. An example would be a rule of the form:
– If: the code segment takes a large amount of time, and the parallel efficiency is low, and there are large sequential sections, and the granularity of the parallelization is fine
– Then: try to move the parallelization to an outer level.
An illustration of the reasoning is given in Figure 2. The conclusions are presented to the user either as ASCII text or via the CAPO user interface. This will be discussed in the case study presented in Section 3. The rules are stored in a knowledge base.
Fig. 2. Example of facts and conclusions within the expert system. A set of rules is applied to the observed facts to determine patterns of causes for poor performance
The expert system employs a data-driven, forward-chaining approach where a conclusion is inferred given all asserted facts.
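The actual system encodes such rules in CLIPS; the forward-chaining idea can nevertheless be paraphrased in a few lines of Python (the fact names and threshold values below are invented for illustration and are not the prototype's):

```python
# Toy forward-chaining pass: every rule is tried against the asserted facts
# (the metrics table), and the conclusions of all firing rules are collected.

THRESHOLDS = {"time_share": 0.20, "efficiency": 0.75, "seq_fraction": 0.25,
              "imbalance_cv": 0.30, "granularity_us": 500.0}

def rule_outer_parallelism(f):
    if (f["time_share"] > THRESHOLDS["time_share"]
            and f["parallel_efficiency"] < THRESHOLDS["efficiency"]
            and f["sequential_fraction"] > THRESHOLDS["seq_fraction"]
            and f["granularity_us"] < THRESHOLDS["granularity_us"]):
        return "Try to move the parallelization to an outer level."

def rule_dynamic_schedule(f):
    if (f["time_share"] > THRESHOLDS["time_share"]
            and f["parallel_efficiency"] < THRESHOLDS["efficiency"]
            and f["instruction_cv"] > THRESHOLDS["imbalance_cv"]
            and f["granularity_us"] >= THRESHOLDS["granularity_us"]):
        return "Try dynamic scheduling to achieve a better workload balance."

RULES = [rule_outer_parallelism, rule_dynamic_schedule]

def analyze(facts):
    return [advice for rule in RULES if (advice := rule(facts))]

facts = {"time_share": 0.45, "parallel_efficiency": 0.40,
         "sequential_fraction": 0.05, "instruction_cv": 0.60,
         "granularity_us": 2000.0}
print(analyze(facts))   # -> dynamic-scheduling suggestion for this loop
```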
3 A Case Study The parallelization assistant expert system was tested on several full-scale scientific applications. In order to give a flavor of its usability, we present the analysis of the PSAS Conjugate Gradient Solver. The code is a component of the Goddard EOS (Earth Observing Systems) Data Assimilation System. It contains a nested conjugate gradient solver implemented in Fortran 90. The code showed little speed-up on an SGI Origin 3000 when increasing the number of threads from 1 to 4. A sample trace was collected on an SGI Origin 3000 for a run on 4 threads. Figure 3 shows three Paraver timeline views which were used for the calculation of performance metrics. The top image shows the time that the application spends in different subroutines. The different shadings indicate time spent in different routines. It turned out that there were two major time consuming routines:
– the conjugate gradient solver, indicated by the lighter shading in the timeline,
– the symmetric covariance matrix vector multiply, indicated by the darker shading in the timeline.
The middle image in Figure 3 shows the useful computation time spent within different parallel loops. This is a thread-level view, where the different shadings indicate different parallel loops. No shading indicates time outside of computations. Visual inspection immediately shows an imbalance between the threads, with thread number 1 spending a lot more time inside parallel computations. The reason for the imbalance can be inferred from the view displayed in the bottom image in Figure 3.
Fig. 3. Three Paraver timeline views for a run on 4 threads: The top image shows the time spent in different subroutines on application level. The horizontal axis represents the time. Different shadings indicate time in different subroutines. The middle image shows time spent in useful computations within parallel loops on thread level, displaying a time line for each of the 4 threads. Useful computation time is shaded, non-useful time is white. The view shows an obvious imbalance, with the master thread spending more time in parallel computations. The bottom image shows the number of executed instructions for each thread. Darker shading indicates a higher number of executed instructions. This view indicates an imbalance in the computational workload among the threads
It shows a time line view on thread level of the number of instructions between traced events. A darker shading indicates a higher number of instructions. Comparison of the time line views clearly shows that thread 1 executes many more instructions during the useful time within parallel loops. A second important observation is that a relatively large number of instructions are executed by threads 2, 3, and 4 during non-useful time.
Fig. 4. Paraver analysis showing the number of instructions during useful computations. The hardware counter value is mapped onto threads and parallel regions. The analysis shows a large variation across the threads, indicating an imbalance in the computational workload
These are typically instructions executed by threads when idling while waiting for work. It is important to exclude these instructions when comparing the computational workload among the threads. A Paraver analysis of the instructions only during useful time is displayed in Figure 4, where the instruction counter value is correlated with the thread numbers and the parallel loops, indicating a large variation in useful instructions among the threads. The reason for the imbalance is that the amount of calculations per iteration differs greatly. By default, the iterations of a parallel loop are distributed block-wise among the threads, which is a good strategy if the amount of work per iteration is approximately the same. In the current case, thread number 1 ended up with all of the computationally intensive iterations, leaving the other threads idle for a large amount of time. The OpenMP directives provide clauses to schedule work dynamically, so that whenever a thread finishes its chunk of work, it is assigned the next chunk. This strategy should only be used if the granularity of the work chunks is sufficiently coarse. After calculating the performance metrics described in Section 2, the expert system was able to detect the imbalance and its possible cause.
Fig. 5. The expert system analysis can be displayed in CAPO’s Dynamic Analysis Window. Selecting a particular routine or loop, the corresponding source code is highlighted in the Directives Browser window. The Why Window displays information about the status of the parallelization of the code, such as the scope of variables or dependences that prevent parallelization
The rule which fired for this particular case was:
– If: the code segment takes a large amount of time, and the parallel efficiency is low, and there is a large variation in the amount of useful computation time, and there is a large variation in the number of instructions within the parallelized loops, and the granularity of the parallelization is sufficiently large
– Then: try dynamic scheduling in order to achieve a better workload balance.
The expert system analysis output is generated as an ASCII file. The user can also display the analysis results using the CAPO directives browser. This integrates the dynamic performance analysis data with program structure information. An example is shown in Figure 5. The user is presented with a list of time consuming loops and subroutines. Selecting an item from the list displays the corresponding expert system analysis and highlights the corresponding source in the directives browser window. For the current case it helps the user to identify loops which suffer from workload imbalance. The expert system analysis provides guidance through the extensive set of CAPO browsers.
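The effect of the block-wise versus dynamic distribution discussed in this case study can be mimicked with a toy work distribution (illustrative Python with made-up iteration costs; not the application's actual workload):

```python
# Compare block-wise and dynamic assignment of unevenly weighted iterations.
import heapq

def block_schedule(costs, threads):
    chunk = (len(costs) + threads - 1) // threads
    return [sum(costs[t * chunk:(t + 1) * chunk]) for t in range(threads)]

def dynamic_schedule(costs, threads, chunk=4):
    # whichever thread becomes free first takes the next chunk of iterations
    load = [0.0] * threads
    heap = [(0.0, t) for t in range(threads)]
    for i in range(0, len(costs), chunk):
        busy, t = heapq.heappop(heap)
        busy += sum(costs[i:i + chunk])
        load[t] = busy
        heapq.heappush(heap, (busy, t))
    return load

# first quarter of the iterations is 10x more expensive than the rest
costs = [10.0] * 25 + [1.0] * 75
print("block  :", block_schedule(costs, 4))    # one thread gets nearly all work
print("dynamic:", dynamic_schedule(costs, 4))  # much more even finish times
```

With the expensive iterations concentrated at the front, the block distribution loads one thread with most of the work, whereas the dynamic assignment evens out the per-thread totals.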
4 Related Work There are several research efforts under way with the goal of automating performance analysis and of integrating performance analysis and parallelization. We can only name a few of these projects. KOJAK [6] is a collaborative project of the University of Tennessee and the Research Centre Juelich for the development of a generic automatic performance analysis environment for parallel programs, aiming at the automatic detection of performance bottlenecks. The Paradyn Performance Consultant [8] automatically searches for a set of performance bottlenecks; the system dynamically instruments the application in order to collect performance traces. The URSA MINOR project [11] at Purdue University uses program analysis information as well as performance trace data in order to guide the user through the program optimization process. The SUIF Explorer [7,13] Parallelization Guru developed at Stanford University uses profiling data to bring the user's attention to the most time consuming sections of the code. Our approach using an expert system differs from the previous work in that we are integrating two mature tools. Our rule-based approach takes advantage of the high degree of flexibility in collecting and analyzing information provided by the underlying tools.
5 Conclusions We have built the prototype of an expert assistant to integrate two mature tools for computer aided parallelization and performance analysis. The system helps the user in navigating and interpreting the static program structure information, as well as the accompanying dynamic run-time performance information so that more efficient optimization decisions can be made. It enables the users to focus their tuning efforts on the sections of code that will yield the largest performance gains with the least amount of recoding. Using the expert system approach in a case study has demonstrated how to fuse data from the static and dynamic analysis to provide automated correlation and filtering of this information before conveying it to the user. The first conclusion we draw is that a number of relatively simple rules can capture the essence of the human expert’s heuristics needed to narrow down a parallelization related performance problem. Secondly, we found it to be very important to be able to switch to the direct usage of the tools at any point during the analysis process. The expert system analysis rapidly guides the user to code segments, views, and effects that require further detailed analysis with either CAPO or Paraver. This detailed analysis will in turn often lead to the design of new rules which can then be included in the automated process.
Acknowledgments This work was supported by NASA contract DTTS59-99-D-00437/A61812D with Computer Sciences Corporation/AMTI, by the NASA Faculty Fellowship Program, by the Spanish Ministry of Science and Technology and the European Union FEDER program under contract TIC2001-0995-C02-01, and by the European Center for Parallelism of Barcelona (CEPBA). We thank the CAPTools development team for many helpful discussions and the continued support of our work.
References 1. CLIPS: A Tool for Building Expert Systems, http://www.ghg.net/clips/CLIPS.html. 2. C.S. Ierotheou, S.P. Johnson, M. Cross and P. Leggett, “Computer Aided Parallelisation Tools (CAPTools), Conceptual Overview and Performance on the Parallelisation of Structured Mesh Codes,” Parallel Computing, 22 (1996) 163-195. http://www.parallelsp.com/parawise.htm. 3. H. Jin, M. Frumkin and J. Yan, "Automatic Generation of OpenMP Directives and Its Application to Computational Fluid Dynamics Codes," Proceedings of Third International Symposium on High Performance Computing (ISHPC2000), Tokyo, Japan, October 16-18, 2000. 4. G. Jost, H. Jin, J. Labarta and J. Gimenez, “Interfacing Computer Aided Parallelization and Performance Analysis," Proceedings of the International Conference of Computational Science - ICCS03, Melbourne, Australia, June 2003. 5. G. Jost, J. Labarta and J. Gimenez, “Paramedir: A tool for programmable Performance Analysis”, Proceedings of the International Conference of Computational Science - ICCS04, Krakow, Poland, June 2004. 6. Kit for Objective Judgment and Knowledge based Detection of Performance Bottlenecks, http://www.fz-juelich.de/zam/kojak/. 7. S. Liao, A. Diwan, R.P. Bosch, A. Ghuloum and M. Lam, “SUIF Explorer: An interactive and Interprocedural Parallelizer,” 7th ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming, Atlanta, Georgia, (1999), 37-48. 8. B.P. Miller, M.D. Callaghan, J. M. Cargille, J. K. Hollingsworth, R. B. Irvin, K.L. Karavanic, K. Kunchithhapdam and T. Newhall, “The Paradyn Parallel Performance Measurement Tools,” IEEE Computer 28, 11, pp.37-47 (1995). 9. OMPItrace User’s Guide, http://www.cepba.upc.es/paraver/manual i.htm. 10. OpenMP Fortran/C Application Program Interface, http://www.openmp.org/. 11. I. Park, M. J. Voss, B. Armstrong and R. Eigenmann, “Supporting Users’ Reasoning in Performance Evaluation and Tuning of Parallel Applications,” Proceedings of PDCS’2000, Las Vegas, NV, 2000. 12. Paraver, http://www.cepba.upc.es/paraver/. 13. SUIF Compiler System. http://suif.stanford.edu/.
Scalable Middleware Environment for Agent-Based Internet Applications Benno J. Overeinder and Frances M.T. Brazier Department of Computer Science, Vrije Universiteit Amsterdam De Boelelaan 1081a, 1081 HV Amsterdam, The Netherlands {bjo,frances}@cs.vu.nl
Abstract. The AgentScape middleware is designed to support deployment of agent-based applications on Internet-scale distributed systems. With the design of AgentScape, three dimensions of scalability are considered: size of distributed system, geographical distance between resources, and number of administrative domains. This paper reports on the AgentScape design requirements and decisions, its architecture, and its components.
1 Introduction Agent-based Internet applications are by design autonomous, self-planning, and often self-coordinating distributed applications. They rely on the availability of aggregated resources and services on the Internet to perform complex tasks. However, current technology restricts developers’ options for large-scale deployment: there are problems with the heterogeneity of both computational resources and services, fault tolerance, management of the distributed applications, and geographical distance with its implied latency and bandwidth limitations. To facilitate agent-based applications, a middleware infrastructure is needed to support mobility, security, fault tolerance, distributed resource and service management, and interaction with services. Such middleware makes it possible for agents to perform their tasks, to communicate, to migrate, etc.; but it also implements security mechanisms to, for example, sandbox agents to prevent malicious code from harming the local machine or, vice versa, to protect an agent from tampering by a malicious host. The AgentScape middleware infrastructure is designed for this purpose. Its multilayered design provides minimal but sufficient support at each level. The paper presents the AgentScape design and implementation, with a discussion of decisions made, and concludes with future directions.
2 Scalable Multi-layered Design of Agent Middleware A number of high-level agent concepts are first introduced to define their role in the paper. The requirements for scalable agent middleware are then discussed, followed by the current software architecture of the AgentScape middleware [6].
This research is supported by the NLnet Foundation, http://www.nlnet.nl/
2.1 Concepts
Within AgentScape, locations exist, agents are active entities, and services are external software systems accessed by agents hosted by AgentScape middleware. Agents in AgentScape are defined according to the weak notion of agency [7]: (i) autonomy: agents control their own processes; (ii) social ability: ability to communicate and cooperate; (iii) reactiveness: ability to receive information and respond; (iv) pro-activeness: ability to take the initiative. Agents may communicate with other agents and may access services. Agents may migrate from one location to another location. Agents may create and delete agents; agents can start and stop service access. All agent operations are subject to authorization and security precautions, e.g., an agent is allowed to start a service only if it has the appropriate credentials (ownership, authorization, access to resources, etc.). Mechanisms are provided to make external services available to agents hosted by AgentScape middleware. For example, external services are wrapped and presented as Web services, using SOAP/WSDL generated dynamic interfaces. A location is a “place” in which agents and services can reside (see also Fig. 1). More precisely, agents and services are supported by agent servers and (Web) service gateways, respectively, which belong to a location. From the perspective of agents, agent servers give exclusive access to the AgentScape middleware. Similarly, service gateways provide access to services external to AgentScape.
Fig. 1. Conceptual model of the AgentScape middleware environment
2.2 Requirements Successful deployment of Internet applications has to take dimensions of scale into account [5]. Applications and systems on Internet-scale networks have to include mechanisms to deal with: (i) number of entities in a distributed system: agents (create, delete, migration), resources (add and delete resources to/from distributed platform, allocate resources), lookup service (resolve identifiers of entities to contact address, etc.); (ii) geographical size: the distance between different resources (communication latency and bandwidth); (iii) number of administrative domains: security mechanisms for authentication and authorization. The design goals set in the AgentScape project are to provide efficient support for distributed systems that scale along the dimensions outlined above. A distinction is made between agent application level support and agent middleware mechanisms. At the agent application level, functionality to develop and implement scalable applications must be made available. The application programming interface must reflect this, e.g., to allow
for latency hiding and mobility, or give access to scalable services in the middleware layer. The agent middleware’s task is to implement the functionality provided by the application programming interface and the available services. This implies, amongst others, support for asynchronous communication mechanisms for latency hiding, scalable services for name resolution (lookup service), and an effective management system that scales with the number of agents and resources. 2.3 Software Architecture The leading principle in the design of the AgentScape middleware is to develop a minimal but sufficient open agent platform that can be extended to incorporate new functionality or adopt (new) standards into the platform. The multiple code base requirement, e.g., supporting agents and services developed in various languages, means that language-specific solutions or mechanisms cannot be used. This design principle resulted in a layered agent middleware, with a small middleware kernel implementing basic mechanisms and high-level middleware services implementing agent platform specific functionality and policies (see Fig. 2). This approach simplifies the design of the AgentScape kernel and makes the kernel less vulnerable to errors or improper functioning. A minimal set of middleware services consists of agent servers, host managers, and location managers. The current, more extensive set includes a lookup service and a web service gateway.
Fig. 2. The AgentScape software architecture
Minimal Set of Services Middleware services can interact only with the local middleware kernel. That is, all interprocess communication between agent servers and web service gateways goes exclusively via their local middleware kernel. The kernel either directly handles the transaction (local operation) or forwards the request messages to the destination AgentScape kernel (remote operation). The agent server gives an agent access to the AgentScape middleware layer (see also Fig. 2). Multiple code base support in AgentScape is realized by providing different agent servers per code base. For security considerations, it is important to note that an agent is “sandboxed” by its agent server. For Java agents this is realized by the JVM, for Python (or other interpreted scripting languages like Safe-Tcl) by the interpreter, and C or C++ (binary code) agents are “jailed”.
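The local-or-forward rule described above can be pictured with a minimal sketch (in Python, since the AgentScape kernel is implemented in Python); all class and method names below are assumptions made for this illustration and are not the actual AgentScape kernel interfaces.

# Illustrative sketch of the "handle locally or forward" rule described in the
# text. All names are assumptions made for this example; they are not the real
# AgentScape kernel API.
class KernelSketch:
    def __init__(self, host, links):
        self.host = host        # name of the local host
        self.links = links      # host name -> kernel of a remote host
        self.services = {}      # service name -> local middleware service

    def register(self, name, service):
        self.services[name] = service

    def request(self, target_host, service_name, message):
        if target_host == self.host:
            # Local operation: the kernel handles the transaction directly.
            return self.services[service_name](message)
        # Remote operation: forward the request to the destination kernel.
        return self.links[target_host].request(target_host, service_name, message)

# Example wiring of two kernels and a trivial lookup service:
a = KernelSketch("hostA", {})
b = KernelSketch("hostB", {"hostA": a})
a.register("lookup", lambda agent_id: {"agent-42": "hostA"}.get(agent_id))
print(b.request("hostA", "lookup", "agent-42"))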
A location manager is the coordinating entity in an AgentScape location (thus managing one or more hosts in one location). The host manager manages and coordinates activities on a host. The host manager acts as the local representative of the location manager, but is also responsible for local (at the host) resource access and management. The policies and mechanisms of the location and host manager infrastructure are based on negotiation and service level agreements [4]. Additional Services Extensibility and interoperability with other systems are realized by additional middleware services. The web service gateway is a middleware service that provides controlled (by AgentScape middleware) access to SOAP/WSDL web services. Agents can obtain WSDL descriptions in various ways, e.g., via UDDI and can generate a stub from the WSDL document to access a web service. However, stub generation using the Axis WSDL2Java tool is slightly modified so that calls on the web service interface are directed to the AgentScape web service gateway. If access is authorized, the web service gateway performs the operation on the web service and routes the results transparently back to the agent that issued the call.
3 Implementation and Experiences A number of implementation alternatives have been tested and evaluated, and development is still evolving: the AgentScape project aims to provide both a scalable, robust agent platform and a research vehicle to test and evaluate new ideas. The AgentScape middleware has been tested on different Linux distributions (Debian 3.1, Fedora Core 1, and Mandrake 9.2.1) and Microsoft Windows 2000 (service pack 4). The current kernel is implemented in the Python programming language, allowing for rapid prototyping. The following middleware services available in the current AgentScape release have been implemented in Java: the Java agent server, web service gateway, host manager, and location manager. The current lookup service, implemented in Python, is a centralized service that does its job, but is neither scalable nor secure. A next generation lookup service is under development: a distributed peer-to-peer lookup service.
4 Related Work Over the last few years, a number of agent platforms have been designed and implemented. Each design has its own set of design objectives and implementation decisions. JADE is a FIPA-compliant agent platform [1]. In JADE, platform and containers are concepts similar to location and hosts in AgentScape. A JADE platform is a distributed agent platform (location) with one or more containers (hosts). A special front end container listens for incoming messages from other platforms. The AgentScape location manager fulfills a similar function. The front end container is also involved in locating agents on other JADE platforms; in AgentScape this is a separate (distributed) service. Mobility in JADE is limited to one platform (intra-platform), whereas in AgentScape, agents can migrate to any location.
Cougaar is a Java-based agent architecture that provides a survivable base on which to deploy large-scale, robust distributed multi-agent systems [2]. The design goals are scalability, robustness, security, and modularity. Cougaar is not standards-compliant, and messages are encoded using Java object serialization. The lack of standards compliance was in part due to Cougaar’s complex planning language, which does not easily fit into the ACL format, and to Cougaar’s research focus on a highly scalable and robust system as opposed to interoperability. DIET is an agent platform that addresses present limitations in terms of adaptability and scalability [3]. It provides an environment for an open, robust, adaptive and scalable agent ecosystem. The minimalistic “less is more” design approach of DIET is similar to that of AgentScape. The platform is implemented in Java. Mobility is implemented by state transfer only, so no code is migrated: agents can only migrate if their classes are already in the local classpath of the target machine. Security is not specifically addressed in DIET.
5 Summary and Future Directions The design rationale for scalability in AgentScape lies in the three dimensions defined in Section 2.2. The integrated approach to solving the scalability problem is a unique signature of the AgentScape middleware. The small but extensible core of AgentScape allows for interoperability with other open systems. Future directions in the AgentScape project are a new middleware kernel developed in Java and completion of the implementation of the security model. Agent servers for binary agents (C and C++) and Python agents are being considered. A new decentralized lookup service based on distributed hash tables is also being developed.
References 1. F. Bellifemine, A. Poggi, and G. Rimassa. Developing multi-agent systems with a FIPAcompliant agent framework. Software: Practice and Experience, 31(2):103–128, 2001. 2. A. Helsinger, M. Thome, and T. Wright. Cougaar: A scalable, distributed multi-agent architecture. In Proceedings of the International Conference on Systems, Man and Cybernetics (IEEE SMC 2004), The Hague, The Netherlands, October 2004. 3. C. Hoile, F. Wang, E. Bonsma, and P. Marrow. Core specification and experiments in DIET: A decentralised ecosystem-inspired mobile agent system. In Proceedings of the 1st International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS2002), pages 623–630, Bologna, Italy, July 2002. 4. D.G.A. Mobach, B.J. Overeinder, O. Marin, and F.M.T. Brazier. Lease-based decentralized resource management in open multi-agent systems. In Proceedings of the 18th International FLAIRS Conference, Clearwater Beach, FL, May 2005. 5. B.C. Neuman. Scale in distributed systems. In T. Casavant and M. Singhal, editors, Readings in Distributed Computing Systems, pages 463–489. IEEE Computer Society Press, Los Alamitos, CA, 1994. 6. N.J.E. Wijngaards, B.J. Overeinder, M. van Steen, and F.M.T. Brazier. Supporting Internet-scale multi-agent systems. Data Knowledge Engineering, 41(2–3):229–245, 2002. 7. M. J. Wooldridge and N. R. Jennings. Intelligent agents: Theory and practice. The Knowledge Engineering Review, 10(2):115–152, 1995.
Automatic Generation of Wrapper Code and Test Scripts for Problem Solving Environments Andreas Schreiber Deutsches Zentrum für Luft- und Raumfahrt e.V., Simulation and Software Technology Linder Höhe, 51147 Cologne, Germany
Abstract. Problem Solving Environments are useful tools to support complex simulation tasks in the field of engineering and research (e.g. aerospace, automotive, etc.). The TENT environment represents this type of system, into which external applications are being integrated. In this paper two software engineering issues are addressed which deal with the automation of specific tasks while developing and extending the TENT system. The first is the automatic generation of code for integrating new applications into the system, which is important for reducing the effort of manually creating new application components. Code generation is implemented as a plug-in for the open source development platform Eclipse. The second is the automatic generation and execution of test cases for system tests. This test generation feature is embedded within the TENT graphical user interface. It records all user actions as Python scripts, which are then used to automatically replay the recorded operations using the embedded Python interpreter.
1 Introduction In complex scientific problem solving environments (PSEs), many modules are usually integrated into the system to extend its functionality and to cover broader problem classes [1,2]. In many PSEs this integration of new modules (e.g., new numerical methods) is a software development task. For example, to extend a PSE with an external application one has to develop some kind of wrapper or interface code that connects the application to the programming and extension interfaces of the PSE. Usually the software engineering tasks for developing software systems consist of several phases: analysis, design, coding, unit testing, integration testing, and system testing. Depending on the software development process, the order and frequency of these phases may vary, but almost all development processes have two primary objectives in common [3]: to avoid errors and to find undiscovered errors as early as possible. When extending PSEs, the two major software engineering phases are the actual development of the code for wrapping the new modules, and the testing of the extended PSE. Usually the analysis and design phases are less important, because the overall design of the PSE should be finished and the (programming) interfaces for the PSE extension should already be specified. For a PSE the necessity to extend it with new modules occurs relatively often; therefore some kind of wrapper code has to be developed frequently. These codes are
usually quite similar, but one still has to test the extended PSE and check for correct results. To support these tasks, tools or methods can be used to automate parts of the process. The major objectives for these supporting tools or methods are to reduce the effort and to minimize coding errors. A feasible technique to accomplish these objectives is the automated generation of code. Especially in our case, the code for wrapping new modules and the scripts for running test cases can be generated quite easily. In the following we give a very short overview of the integration system TENT [4] for creating domain specific problem solving environments. After that we first present how new applications are integrated into the system using code generation, and how this is achieved with the help of a plug-in for the development environment Eclipse [5]. The second topic is the automatic generation of test scripts in Python [6] which are used to test the TENT system using an embedded Python interpreter.
2 Integration System TENT TENT is a software integration and workflow management system that helps to build and manage process chains for complex simulations in distributed environments. All tools typical for scientific simulation workflows can be integrated into TENT, and controlled from the user’s desktop using a graphical user interface (GUI, see Fig. 1). The key features of TENT are:
– Flexible configuration, online steering, and visualization of simulations.
– Utilization of distributed computing resources such as PCs, workstations, clusters, supercomputers, and computational grids.
– Integration of a project based data management with support for cooperative working.
Additional features are:
– Visual composition of process chains (workflows).
– Monitoring and steering of simulations from any computer on the internet.
– Easy integration of existing applications.
– Automatic and efficient data transfer between the stages in the process chain.
The TENT system has been developed as a component system using several software technologies such as Java, CORBA [7], the Globus Toolkit [8], XML, Python, and WebDAV [9]. Fig. 2 shows the general architecture of the TENT system. See [4] for more details. 2.1 TENT Components Components in TENT are defined by interfaces specified in CORBA IDL (Interface Definition Language). All TENT components are inherited from a common base interface which is extended with component specific methods and event interfaces. Events are the primary means of communication between TENT components; they are defined as IDL
Fig. 1. TENT Graphical User Interface
data structures in combination with corresponding listener interfaces. Components can be connected within workflows by connecting one component’s suitable event sender to another component’s corresponding event listener. To integrate new applications into TENT, one has to provide a TENT component (wrapper) for the particular application. The implementation of a component can be divided into the following steps:
1. Definition of the component’s interfaces by defining its IDL interfaces (inherited from a base interface, inclusion of event listener interfaces, and definition of methods for steps 3 and 4).
2. Implementation of the component’s interfaces (inherited from the generated CORBA skeleton base classes) and initialization of properties and data transfer.
3. Development of application-related communication code (depends on desired functionality of the component).
4. Development of necessary GUI plug-ins (depends on controllability of the component).
5. Creation of a proper default configuration file for the component’s set of properties.
2.2 TENT-Based Problem Solving Environments The integration framework TENT is being used to build specific problem solving environments for solving high-level domain tasks. To get the PSE, all necessary applications need to be integrated into TENT. The number of applications varies for each PSE, it
Fig. 2. Architecture of the TENT system
ranges from just a few numerical codes to many dozens of codes together with analysis, pre-/post-processing, and visualization tools. PSEs based on the TENT framework have been created for aerodynamic studies of civil aircraft [10], numerical analysis of combat aircraft maneuvering [11,12], numerical simulation of space re-entry vehicles [13,14], and constructional optimization and crash simulation of car bodies [15].
3 Code Generation for Extending PSEs Our goal was to obtain effective tools for extending TENT-based PSEs with new components according to the steps described in section 2.1. The chosen approach was the development of tools based on Eclipse in the form of plug-ins to enhance the efficiency of code creation. The most important one is the plug-in for the automatic generation of TENT component code. 3.1 What Is the Eclipse Platform? “The Eclipse Platform is designed for anything, but nothing in particular” [16]. It is a universal open platform for tool integration. The basic framework of Eclipse is very extensible via plug-ins (like wizards, editors, or views). For example, the standard Eclipse distribution ships with a set of plug-ins for Java development which makes Eclipse a very competitive Java IDE (Integrated Development Environment). 3.2 Generating TENT Components Using Eclipse To generate TENT components (wrapper codes), we have developed a set of Eclipse plug-ins consisting of wizards and editor extensions. In the wizard all details for the
component are specified. The editor extension’s purpose is to add or change certain code features in the generated (or already existing) code. In addition to components, the code generation wizard can create code for TENT GUI plug-ins. The creation of a TENT component includes the definition of the component IDL interface and its Java implementation according to steps one and two in section 2.1. The user defines metadata (such as the name and location of the generated code) interactively using wizard dialogs. A dialog (see Fig. 4) provides the means to select the listener interfaces for the events to receive. This information is added to the component’s IDL interface and the Java implementation code. The base for the code to generate is specified in template files (see Fig. 3). After merging all metadata and additional code with the templates, the result is functional basic TENT component code for wrapping applications. Currently, this code still has to be edited manually to add non-standard functionality, but this remaining work is well supported by the standard Eclipse Tasks view, which presents a list of all places within the code that need manual change or completion (see Fig. 4). The tool for component code generation described here is an Eclipse feature. An Eclipse feature consists of more than one plug-in and can be used to extend the basic Eclipse platform easily. Our Eclipse feature consists of two plug-ins: the first provides the component code generation functionality, the second contains an implementation of the Velocity template engine [17].
Fig. 3. Generation of wrapper code using templates
3.3 Current Implementation Velocity is used for the generation of the TENT component source code. It is a template engine implemented in Java and can be used to generate web pages, Java code, or any plain text based output from templates. Templates may contain declarations, conditional statements, and loops. After receiving all necessary variables, the engine interprets the template and replaces the variables inside. The wizard for creating new components, and the dialogs that are opened from within the Java source code editor (Fig. 4), are implemented using the Standard Widget Toolkit (SWT) [18]. The component code generation plug-in described here uses the org.eclipse.ui.newWizards extension point. After installation the wizard can
Fig. 4. Eclipse with generated Java code, popup menu with extensions, and dialog for adding events
be found using New → TENT → Wrapper Component (see Fig. 5). All available wizards in Eclipse are accessible at this point. So creating a new TENT component with Eclipse is similar to starting any (general) new Java project.
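The template-based generation of section 3.3 can be illustrated with a minimal sketch. The real plug-in uses Velocity templates to produce IDL and Java code; the Python example below, based on string.Template, only mirrors the general idea, and the template text and metadata fields are invented for this illustration.

# Minimal sketch of template-based wrapper generation: user-supplied metadata
# is merged into a code template. The template text and field names are
# invented; the actual TENT plug-in uses Velocity templates for IDL and Java.
from string import Template

IDL_TEMPLATE = Template("""\
// generated component interface (illustrative only)
interface ${component} : BaseComponent {
${listeners}
};
""")

def generate_idl(metadata):
    listeners = "\n".join("    // listens to %s events" % event
                          for event in metadata["events"])
    return IDL_TEMPLATE.substitute(component=metadata["name"],
                                   listeners=listeners)

print(generate_idl({"name": "FlowSolverWrapper",
                    "events": ["ActionEvent", "Simulation"]}))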
4 Script Generation for System Tests 4.1 System Testing The system test is a major test phase. It tests the system as a whole. The main goal of the system test is the verification of the overall functionality (see [19] or [20] for more details on software tests). For a PSE that means overall testing of the base framework together with all integrated modules. During the system test many things need testing. Some very common tests are:
– Functional tests (including GUI tests),
– performance and stress tests,
– client/server tests,
– usability tests, and
– installation tests.
Fig. 5. Wizard for new TENT specific classes
To be effective, the system test has to be performed frequently. On the other hand, performing these tests by humans (developers or quality assurance staff members) is relatively expensive and slow. A more appropriate way of performing system tests is the creation of proper test scripts, especially for the functional, performance, stress, and client/server tests. These tests can then be automated and started manually or automatically. In the TENT development the system tests were formerly done by a quality assurance staff member who performed test procedures specified in written documents. The test results were added to the document and – in case of discovered errors – to a bug tracking system. Some GUI tests have been performed using a commercial capture/replay tool (TestComplete [21]). 4.2 Script Based System Testing in TENT Currently all manual tests are being replaced by automated scripted tests for performing complex system tests. In general there are two different methods for creating test scripts. The first method is the manual coding of these scripts; the major disadvantage of coding them by hand is that this approach is slow and error-prone. The second method is the automatic generation or recording of these scripts by code generators or the application itself; this method is much faster and the scripts are error-free. Our technique for generating test scripts is Command Journaling. Originally, Command Journaling was added to TENT for tracing user actions (an audit trail feature to answer the question “What did I do?”). In Command Journaling all user actions are logged to a journal or session file in a format that can be replayed or executed afterwards. In TENT the actions the user performs in the GUI are logged in Python syntax.
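The recording side of Command Journaling can be pictured with a minimal sketch. In TENT the recording is actually done by the Java GUI classes, so the Python class below is only an illustration of the idea; its names are assumptions made for this example.

# Illustrative sketch of command journaling: every user action is written to
# the journal file as an executable Python statement. The class and method
# names are assumptions; the real recording code lives in the TENT Java GUI.
class JournalRecorder:
    def __init__(self, path):
        self.journal = open(path, "w")
        self.journal.write("# Command journaling, generated script (Python syntax)\n")

    def record(self, statement):
        # Log a replayable statement and keep the journal file up to date.
        self.journal.write(statement + "\n")
        self.journal.flush()

    def add_component(self, component_id):
        # ... the GUI would perform the real user action here ...
        self.record('cf.addComponent("%s")' % component_id)

    def add_wire(self, source, target, event):
        self.record('cf.addWire("%s", "%s", "%s")' % (source, target, event))

recorder = JournalRecorder("session_journal.py")
recorder.add_component("IDL:de.dlr.tent/Script:1.0.Type*Script.Subtype*script.Instance")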
For this purpose all important Java classes of the TENT GUI are recording all actions as Python code to a journal file, which can be used to replay the same operations by a Python interpreter. In TENT, the generated journal script can be run in the GUI embedded Python interpreter (Jython [22,23]), or non-interactively in a stand-alone Python interpreter without the need of a GUI (which is useful for batch processing jobs). The following Python code shows an example of a generated journal script. In this example, the user added four components to the workflow editor, connected them by events, set some parameters, and started a simulation run.

# Command journaling, generated example script (Python syntax)
from de.dlr.tent.gui import GUI
cf = GUI.getControlFacade()
cf.addComponent("IDL:de.dlr.tent/ActionEventSender:1.0.Type*ActionEvent.Subtype*action.Instance")
cf.addComponent("IDL:de.dlr.tent/Script:1.0.Type*Script.Subtype*script.Instance")
cf.addComponent("IDL:de.dlr.tent/Polar:1.0.Type*PolarControl.Subtype*PolarControl.Instance")
cf.addComponent("IDL:de.dlr.tent/Simulation:1.0.Type*FLOWer.Subtype*AGARD.Instance")
cf.addWire("IDL:de.dlr.tent/ActionEventSender:1.0.Type*ActionEvent.Subtype*action.Instance",
           "IDL:de.dlr.tent/Script:1.0.Type*Script.Subtype*script.Instance", "ActionEvent")
cf.addWire("IDL:de.dlr.tent/Script:1.0.Type*Script.Subtype*script.Instance",
           "IDL:de.dlr.tent/Polar:1.0.Type*PolarControl.Subtype*PolarControl.AGARD.Instance", "ActionEvent")
cf.addWire("IDL:de.dlr.tent/Script:1.0.Type*Script.Subtype*script.Instance",
           "IDL:de.dlr.tent/Simulation:1.0.Type*FLOWer.Subtype*AGARD.Instance", "Simulation")
cf.addWire("IDL:de.dlr.tent/Script:1.0.Type*Script.Subtype*script.Instance",
           "IDL:de.dlr.tent/Simulation:1.0.Type*FLOWer.Subtype*AGARD.Instance", "Simulation")
cf.activateAll()
cf.getComponent("IDL:de.dlr.tent/Simulation:1.0.Type*FLOWer.Subtype*AGARD.Instance").setProperty("Executable", "/work/flower116")
cf.saveWorkflowAs("polar_with_FLOWer.wfl")
cf.getComponent("IDL:de.dlr.tent/ActionEventSender:1.0.Type*ActionEvent.Subtype*action.Instance").fireStart(1)
cf.getComponent("IDL:de.dlr.tent/Polar:1.0.Type*PolarControl.Subtype*PolarControl.Instance").setDoubleProperty("EndValue", 40)
cf.getComponent("IDL:de.dlr.tent/Polar:1.0.Type*PolarControl.Subtype*PolarControl.Instance").setDoubleProperty("Increment", 2)
In practice, the TENT system tester has to perform the following steps to create and run system test cases:
1. The tester performs all steps to test a certain functionality using the GUI. The result is a generated journal script.
2. If necessary, the tester edits the generated script. This can be necessary for removing unneeded commands, for adding evaluation statements to check the internal state of TENT for errors (see the sketch below), for generalizing file path names, or for extending the script with further and more complex actions, such as loops.
3. Optionally, the new test script may be added to a test suite.
4. The tester starts the test run by invoking the test script using either the embedded Python interpreter of the TENT GUI or a stand-alone Python interpreter.
5. All discovered errors are reported to the bug tracking system. This is currently still a manual task, but the future goal is to automate this step as well, so that the generated test script submits a new bug tracking entry in case of certain errors.
Most of these steps can also be performed using (commercial) capture/replay tools for GUI tests. An advantage of creating tests with the described technique, however, is the ability to get information about the running system from within the TENT framework. This is possible because the Python interpreter can directly access all public information within the same process. Using this internal information, one can add very sophisticated error detection functionality to the test script.
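The sketch below illustrates step 2: a fragment of a recorded journal script extended with a simple evaluation statement before the run continues. Only calls that already appear in the generated example above are used; the file-based check is merely one assumed way a tester might verify that the replayed actions took effect, and the script is meant to run in the TENT-embedded (or a stand-alone) Python interpreter.

# Illustrative extension of a recorded journal fragment with an evaluation
# statement (step 2). The file-based check is an assumption about how a tester
# might verify the replayed actions; it is not prescribed by TENT.
import os
from de.dlr.tent.gui import GUI

cf = GUI.getControlFacade()
workflow = "polar_with_FLOWer.wfl"

cf.addComponent("IDL:de.dlr.tent/Script:1.0.Type*Script.Subtype*script.Instance")
cf.saveWorkflowAs(workflow)

# Evaluation statement added by the tester: fail fast if the workflow file was
# not written, instead of discovering the problem much later in the test run.
if not os.path.exists(workflow):
    raise RuntimeError("system test failed: workflow file was not saved")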
5 Conclusions Code generation is more and more important for repetitive and error-prone tasks in all kinds of software development. As we have described, code generation can be a very useful technique in scientific computing and related fields. It can streamline the process of creating and debugging code. In our opinion, developers will more often be writing code generators instead of writing “regular” code in the future. In the presented solution for generating wrapper code, there is still some manual work necessary for specific and unforeseen functionalities, but the goal is the complete generation of wrapper code. It should be possible to integrate new methods into PSEs simply by entering some meta information about the code or application to be added. As a base for developing a tool for extending our integration framework TENT, the open source platform Eclipse has been used. We think that Eclipse is a well suited development framework for all types of GUI-based applications and tools (see also [24]).
Acknowledgements The author would like to acknowledge the work of Thijs Metsch (BA Mannheim and DLR) for his implementation of the first Eclipse code generation plug-in.
References 1. J. R. Rice and R. F. Boisvert, From scientific software libraries to problem-solving environments, IEEE Computational Science and Engineering, Fall, pp. 44–53, 1996. 2. S. Gallopoulos, E. Houstis, and J. R. Rice, Problem-solving environments for computational Science, IEEE Computational Science and Engineering, Summer, pp. 11–23, 1994. 3. I. Sommerville, Software Engineering, Addison-Wesley, Harlow, UK, 5th edition, 1995. 4. A. Schreiber, The Integrated Simulation Environment TENT, Concurrency and Computation: Practice and Experience, Volume 14, Issue 13–15, pp. 1553–1568, 2002. 5. Eclipse home page. http://www.eclipse.org 6. Python Language Website. http://www.python.org 7. CORBA home page. http://www.omg.org/corba 8. Globus Project Website. http://www.globus.org 9. J. Whitehead and M. Wiggins, WebDAV: IETF Standard for Collaborative Authoring on the Web, In IEEE Internet Computing, Vol 2, No. 5, Sept/Oct, 1998. 10. R. Heinrich, R. Ahrem, G. Guenther, H.-P. Kersken, W. Krueger, J. Neumann, Aeroelastic Computation Using the AMANDA Simulation Environment. In Proc. of CEAS Conference on Multidisciplinary Design and Optimization (DGLR-Bericht 2001–05), June 25–26, 2001, Cologne, pp. 19–30 11. Project SikMa home page. http://www.dlr.de/as/forschung/projekte/sikma 12. A. Sch¨utte, G. Einarsson, A. Madrane, B. Sch¨oning, W. M¨onnich, and W.-R. Kr¨uger, Numerical Simulation of Manoeuvring Aircraft by Aerodynamic and Flight-Mechanic Coupling, RTA/AVT Symposium on Reduction of Military Vehicle Aquisition Time and Cost through Advanced Modeling and Virtual Product Simulation, Paris, 2002, RTO-MP-089, RTO/NATO 2003. 13. A. Schreiber, T. Metsch, and H.-P. Kersken, A Problem Solving Environment for Multidisciplinary Coupled Simulations in Computational Grids, In Future Generation Computer Systems, in press, 2005.
14. R. Sch¨afer, A. Mack, B. Esser, A. G¨ulhan, Fluid-Structure Interaction on a generic Model of a Reentry Vehicle Nosecap. In Proc. of the 5th International Congress on Thermal Stresses, Blacksburg, Virginia 2003. 15. Project AUTO-OPT home page. http://www.auto-opt.de 16. Eclipse Platform Technical Overview, Object Technology International, Inc., February 2003. 17. Velocity Template Engine. http://jakarta.apache.org/velocity 18. Standard Widget Toolkit. http://www.eclipse.org/platform/index.html 19. R. J. Patton, Software Testing, Sams Publishing, 2000. 20. D. Graham and M. Fewster, Software Test Automation. Effective Use of Test Execution Tools, Addison–Wesley, 1999. 21. TestComplete product home page. http://www.automatedqa.com/products/tc.asp 22. Jython home page. http://www.jython.org 23. S. Pedroni, N. Rappin, Jython Essentials, O’Reilly and Associates, 2002. 24. T. E. Williams and M. R. Erickson, Eclipse & General-Purpose Applications, Dr. Dobbs Journal, September 2004, pp. 66–69.
Runtime Software Techniques for Enhancing High-Performance Applications: An introduction Masha Sosonkina Ames Laboratory and Iowa State University Ames IA 50010, USA [email protected]
Preface Parallel computing platforms advance rapidly, both in speed and in size. However, often only a fraction of the peak hardware performance is achieved by high-performance scientific applications. The main reason is the mismatch between the way parallel computation and communication are built into applications and the processor, memory, and interconnection architectures, which vary greatly. One way to cope with the changeability of hardware is to start creating applications able to adapt themselves “on-the-fly”. In accomplishing this goal, the following general questions need to be answered: How does an application perceive the system changes? How are the adaptations initiated? What is the nature of application adaptations? The papers in this minisymposium attempt to answer these questions by providing either an application-centric or a system-centric viewpoint. Given changing computational resources or a variety of computing platforms, application performance may be enhanced by, for example, modifying the underlying computational algorithm or by using “external” runtime tools which may aid in better load balancing or mapping of scientific applications to parallel processors. A new dynamic load balancing algorithm is considered by Dixon for particle simulations. Cai shows a way to decrease parallel overhead due to compute node duplication in parallel linear system solution. For a parallel adaptive finite element method implemented on clusters of SMPs, Hippold et al. investigate the interaction of cache effects, communication costs and load balancing. Argollo et al. take a system-centric approach and propose a novel system architecture for geographically distributed clusters. This architecture is tested for a matrix multiplication application. Sosonkina integrates an external middleware tool with a parallel linear system solver to monitor its communications and to inform about possible times to invoke adaptive mechanisms of the application.
Efficient Execution of Scientific Computation on Geographically Distributed Clusters
Eduardo Argollo¹, Dolores Rexachs¹, Fernando G. Tinetti², and Emilio Luque¹
¹ Computer Science Department, Universitat Autònoma de Barcelona, Spain
[email protected], {dolores.rexachs,emilio.luque}@uab.es
² Fac. de Informática, Inv. Asistente CICPBA, Universidad Nacional de La Plata, Argentina
[email protected]
Abstract. To achieve data intensive computation, joining geographically distributed heterogeneous clusters of workstations through the Internet can be an inexpensive approach. To obtain effective collaboration in such a collection of clusters, overcoming processor and network heterogeneity, a system architecture was defined. This architecture is described, together with a model able to predict application performance and to help in application design. The matrix multiplication algorithm is used as a benchmark and experiments are conducted over two geographically distributed heterogeneous clusters, one in Brazil and the other in Spain. The model obtained over 90% prediction accuracy in the experiments.
1 Introduction An increasingly widespread and economical approach to data intensive computation is the use of dedicated heterogeneous networks of workstations (HNOW) with standard software and libraries. These clusters have become popular and are used to solve scientific computing problems in many universities around the world. However, the users’ needs usually go beyond the performance of these clusters. The Internet, through its evolution over recent decades, has made it possible to interconnect geographically distributed HNOWs in such a way that they jointly behave as a single entity: the collection of HNOWs (CoHNOW). Obtaining effective collaboration in this kind of system is not a trivial matter [1]. In a CoHNOW there are two levels of communication: an intra-cluster network level to communicate machines locally, and an inter-cluster network level responsible for interconnecting the clusters. To achieve an efficient level of collaboration between the clusters, the use of efficient workload distribution policies is crucial. This need increases when the Internet is used, due to its unpredictable latency and throughput, and its performance limitations. A key concept is to evaluate the possibilities of using the CoHNOW as a single cluster with some specific characteristics. To do this evaluation, this paper proposes a system architecture, a system model and an application tuning methodology. The system architecture evolved from the initial proposal of [2] in such a way that the CoHNOW has been organized as a hierarchical Master/Worker-based collection of clusters. In each communication level, pipeline strategies are implemented in such a way that
communication and computation can be overlapped. For intra-cluster communication the MPI library MPICH [3] is used; to optimize the inter-cluster communication performance, guarantee transparency and add reliability to Internet connections, a specific service was introduced into the architecture: the Communication Manager. Some of the characteristics of the multiple-cluster system (worker performance, intra- and inter-cluster network throughput) and of the algorithm (workload size and its distribution management) were examined to obtain an analytical system model that permits the prediction of the execution performance and explains its behavior over time. The proposed methodology describes how to assess whether the collaboration makes sense and, if it does, how to tune the application, adjusting some of the parameters and thereby identifying the best strategies for obtaining the best performance. In order to demonstrate and evaluate the proposed architecture, methodology and data distribution policies in the field of scientific computation, it is necessary to select a representative application as the benchmark program. The matrix multiplication (MM) problem, using a blocked algorithm [4], is the selected benchmark application because it is a highly scalable problem with an easily manageable workload, including the possibility of granularity changes. In addition, the MM operation is a key component of the Linear Algebra Kernel [5,6] used by a wide range of scientific applications. Its execution over different-speed processors turns out to be surprisingly difficult; indeed, finding an optimal distribution has been proved NP-complete [7]. Experiments were carried out to validate our work using two geographically distributed clusters: one located in Brazil and the other in Spain. Each cluster is a dedicated HNOW and they are interconnected by the Internet. The following sections present our study in further detail. Section 2 briefly describes the system architecture. The analytical model for the system and benchmark application is explained in section 3. Section 4 describes the experiments and presents the obtained results. Finally, conclusions and further work are presented in section 5.
2 System Architecture The necessity of constantly controlling network usability and ensuring a dynamic rearrangement of the workload of each HNOW led us to use an extended and hierarchical version of the Master/Worker programming paradigm (M/W) as the run-time system model. The proposed hierarchical organization (see Fig. 1) implies that there is a “main cluster”, from which the application execution starts, connected to the “remote clusters” by the communication manager (CM). These “remote clusters”, or sub-clusters, are organized around their sub-masters and sub-workers. Two different strategies were developed for the two different communication levels. Inside the local network, the low latency and high throughput permit the utilization of standard MPI library functions. To manage the inter-cluster communication task, carried out over a long-distance network (LDN), a specific service has been developed to be included in each cluster: the Communication Manager (CM). CMs isolate the local network from the public one, manage the public network disconnections, guarantee the inter-cluster communication and maintain it continuously,
Fig. 1. System architecture as a hierarchical Master/Worker CoHNOW
exploiting as much as possible the unpredictable communication link throughput and bandwidth peaks by means of an adaptive multithreading communication scheme [8].
3 System Model In such a heterogeneous system, the CoHNOW, it is important to determine whether the collaboration is profitable for a certain execution. It is also desirable to have, both in advance and during the application execution, a performance estimation, including the expected execution time. To attain these estimation goals an analytic model of the system was developed. The purpose of this model is to evaluate the possibilities of an efficient collaboration on a CoHNOW, taking into consideration the characteristics and parameters of the system’s clusters, the intra- and inter-cluster communication links, and the application. When the collaboration makes sense, the model equations predict the application execution behavior and its performance. The model can also be used in the development of a methodology to design and tune the application in order to obtain the best possible performance over the available computational resources. The system model is based on a simple computation-communication analysis that is applied to each communication level: inside a cluster (intra-cluster) and between a pair of clusters (inter-cluster). It is important to remark that this approach can be applied to any number of interconnected clusters, with any possible hierarchical M/W topology, since in this scheme one cluster communicates with just one other cluster. In this section we explain the generic computation-communication analysis at both communication levels, intra- and inter-cluster, and apply this analysis to the selected benchmark program, the matrix multiplication (MM) algorithm. 3.1 Computation-Communication Analysis The computation-communication analysis is oriented to the evaluation of the possible performance that can be obtained as a function of the communication network throughput. This evaluation is done by comparing the amount of data that needs to be communicated for a certain workload with the amount of operations involved in processing this workload.
For a certain workload, the “computation time” (CptTime) can be defined as the ratio between the number of operations (Oper) and the system performance (Perf). The communication time (CommTime) is the ratio between the volume of data communication (Comm) and the interconnection network throughput (TPut). Using the pipeline strategy, there is no worker idle time whenever the communication time equals the computation time; this means that maximum system efficiency is achieved. We can then conclude that the attainable system performance depends on the network throughput as shown in Eq. 3.1.

\mathrm{CommTime} = \mathrm{CptTime} \;\Rightarrow\; \mathrm{Perf} = \frac{\mathrm{Oper} \cdot \mathrm{TPut}}{\mathrm{Comm}} \qquad (3.1)
The performance that can be obtained is then the minimum between the Available System Performance (ASP) and the performance limited by the network (Perf). The Available System Performance (ASP) can be obtained experimentally and represents the sum of the performance of each worker executing the algorithm locally (without communication). This simple analysis should be applied to the target algorithm, for each workload distribution level (intra- and inter-cluster), providing the prediction of the overall execution performance. 3.2 Intra-cluster Performance Contribution Evaluation At the intra-cluster level, the MM blocked algorithm is used, so the master divides the M x M elements matrix into blocks of B x B elements. The master sends a pair of matrix blocks to be multiplied, receiving one result block. The amount of communication is then the total number of bytes of the three blocks mentioned above, Comm = 3αB², where α is the size of a floating point element in bytes. For this workload, the number of operations (floating point multiplications and additions) is Oper = 2B³ − B². Applying the same computation-communication approach of section 3.1, now considering the local area network throughput (LanTPut), the local Cluster Performance Limit (CPL) in flops is provided by Eq. 3.2. This equation determines the performance limits for the intra-cluster execution and can be used to determine the best local block granularity with which the best local execution performance can be obtained.

\mathrm{CPL} = \frac{(2B^3 - B^2) \cdot \mathrm{LanTPut}}{3\alpha B^2} = \frac{(2B - 1) \cdot \mathrm{LanTPut}}{3\alpha} \qquad (3.2)
The Expected Cluster Performance (ECP) is then the minimum between its CPL and the experimentally obtained Available System Performance (ASP) of the cluster. 3.3 Inter-cluster Performance Contribution Evaluation Considering a cluster as a single element, we can then analyze the inter-cluster communication and the workload distribution and management between clusters in order to determine the share of the remote clusters’ performance that can be contributed to the CoHNOW.
The maximum CoHNOW performance is obtained when idle time is avoided on its clusters. To achieve this for the remote clusters, despite the low LDN throughput, a different workload granularity should be used. For some algorithms this distribution can be improved by exploiting data locality, so that the granularity grows dynamically: previously received data can be reused when combined with the newly received data. For the blocked MM algorithm this improvement can be obtained through a pipeline-based distribution of complete first-operand rows of blocks (R) and second-operand columns of blocks (C). Each time a new row/column (R/C) pair reaches the sub-master, it is added to the previously received rows and columns, exploiting the O(n³) computation complexity against the O(n²) data communication complexity of the MM algorithm, therefore increasing the computation-communication ratio. Using the whole row by column multiplication operations, Oper = 2MB² − B², and the result communication, Comm = αB², in the computation-communication analysis, it is possible to derive the equation that describes the Potential Contribution Performance Limit (CoPeL) the remote cluster can provide as a function of the LDN average throughput (LdnAvgTPut) and the matrix order M (Eq. 3.3).

\mathrm{CoPeL} = \frac{(2MB^2 - B^2) \cdot \mathrm{LdnAvgTPut}}{\alpha B^2} = \frac{(2M - 1) \cdot \mathrm{LdnAvgTPut}}{\alpha} \qquad (3.3)
Then, for a remote cluster, the Expected Performance Contribution (EPC) is the minimum value between the CoPeL and the cluster CPL. In the inter-cluster pipeline, when a new pair P of R/C is received, Comm = 2αMB, and 2P − 1 new row-by-column block multiplications become possible, i.e., Oper = (2P − 1)(2MB² − B²). This application property implies a constant growth in the remote cluster performance contribution, until the Expected Performance Contribution (EPC) value is reached. This moment is called the stabilization point because, from this time on, the contributed performance value stabilizes. Substituting Comm and Oper for the inter-cluster pipeline into Eq. 3.1 and solving for P, Eq. 3.4 is obtained. The value of P represents the number of block rows and columns to be sent to reach the stabilization point. Consequently the StabilizationTime, the elapsed execution time until this stabilization point is reached, in other words the time spent to send P rows and columns of blocks, is provided by Eq. 3.5. It is important to note that although the inter-cluster throughput is variable, its average value can easily be obtained experimentally. Based on this average value the availability and performance of the collaboration can be established, and the equations can be recalculated dynamically as those parameters vary, so that workload and flow actions can be taken in order to maintain or achieve new levels of collaboration.

P = \frac{\alpha M \cdot \mathrm{ContribPerf}}{(2MB - B) \cdot \mathrm{LdnAvgTPut}} + \frac{1}{2} \qquad (3.4)

\mathrm{StabilizationTime} = \frac{P \cdot \mathrm{Comm}}{\mathrm{LdnAvgTPut}} = \frac{2 P \alpha M B}{\mathrm{LdnAvgTPut}} \qquad (3.5)
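As a quick illustration of how the model of Eqs. 3.1–3.5 can be evaluated, the sketch below implements the equations directly. The numeric figures in the usage example are assumptions chosen only to show the calling convention; they are not the measured parameters of the test-bed described in section 4.

# Minimal sketch of the performance model (Eqs. 3.2-3.5). All numeric figures
# in the usage example are illustrative assumptions, not measured test-bed
# values; alpha is the floating point element size in bytes (8 assumed).

def cluster_performance_limit(B, lan_tput, alpha=8.0):
    # Eq. 3.2: intra-cluster performance limit (flops/s) for B x B blocks.
    return (2.0 * B - 1.0) * lan_tput / (3.0 * alpha)

def contribution_performance_limit(M, ldn_avg_tput, alpha=8.0):
    # Eq. 3.3: limit of the remote cluster contribution (flops/s).
    return (2.0 * M - 1.0) * ldn_avg_tput / alpha

def stabilization(M, B, contrib_perf, ldn_avg_tput, alpha=8.0):
    # Eqs. 3.4 and 3.5: R/C pairs (P) and time needed to fill the pipeline.
    P = alpha * M * contrib_perf / ((2.0 * M * B - B) * ldn_avg_tput) + 0.5
    stab_time = 2.0 * P * alpha * M * B / ldn_avg_tput
    return P, stab_time

# Illustrative usage (assumed figures):
asp_remote = 30e6                          # remote-cluster ASP in flops/s (assumed)
cpl = cluster_performance_limit(B=400, lan_tput=10e6)
ecp = min(cpl, asp_remote)                 # expected cluster performance
copel = contribution_performance_limit(M=10000, ldn_avg_tput=50e3)
epc = min(copel, ecp)                      # expected performance contribution
P, t_stab = stabilization(M=10000, B=400, contrib_perf=epc, ldn_avg_tput=50e3)
print(P, t_stab / 60.0)                    # pairs to send and minutes to stabilize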
4 Experiments In order to check the accuracy of the developed model, experiments were executed over the testbed CoHNOW. Some experiment results are shown and one experiment is analyzed in more detail. Table 1 presents the obtained results for four significant experiments. The experiment durations range from 400 to 4,000 minutes in order to cover the variable behavior of the Internet over time. The predictions for all experiments show over 90% accuracy. The experiments were chosen with two different granularities: the 100 x 100 block size was chosen to illustrate the capability of prediction with the local area networks saturated, while the 400 x 400 block size was obtained with the use of the model equations because it represents the smallest block with which the best execution performance is possible. To evaluate the Internet throughput contribution, two different CM usage levels were selected to obtain average throughputs of 10KB/sec and 50KB/sec. These average values were evaluated empirically after exhaustive monitoring of the network behavior.

Table 1. Prediction and real execution comparison for different experiments

                                   Single CM Thread        Multiple CM Thread
Matrix Order (M)                   10,000     20,000       10,000     10,000
Block Order (B)                    100        400          100        400
Expected LdnAvgTput (KB/sec)       10         10           50         50
Predicted Values
  Brazil Cluster Perf (MFlops)     16.78      28.97        16.78      28.97
  Spain Cluster Perf (MFlops)      16.78      58.10        16.78      58.10
  Stabilization R/C (P)            34         29           7          6
  Stabilization Time (min)         433        3,007        18         64
  Execution Time (min)             1,106      4,117        998        409
Experiments Results
  CoHNOW Performance (MFlops)      31.90      78.80        32.72      79.02
  Stabilization R/C (P)            34         29           7          6
  Execution Time (min)             1,145      4,482        1,038      448
Prediction Accuracy                96.4%      91.8%        96.2%      91.3%
The performance behavior over time for the experiment with the 10,000 x 10,000 matrix, 400 x 400 blocks and 50 KB/sec of expected Internet bandwidth is shown in Fig. 2. For the same experiment, Fig. 3 shows the real LDN link throughput over time. The performance attained by the remote cluster ("remote" line in Fig. 2) increases until it reaches the stabilization point at minute 64. From this time on the cluster contributes its best performance. At minute 130, however, there was an abrupt drop in performance (a in Fig. 2) because of a 5-minute Internet disconnection (a in Fig. 3). This disconnection caused no penalty for the overall system performance, since the pipe was full at that time and, after the reconnection, all results of the disconnected period were sent, causing a sudden peak of performance (b in Fig. 2).
The model equations were also used to determine when to stop sending data to the remote cluster (b in Fig. 3), because at that point it already has enough work data for the remainder of the predicted execution time.
Fig. 2. Experiment M=10000 B=400 LdnAvgTput = 50KB/s: execution performance behavior (performance in MFlops of the local and remote clusters over time, with the disconnection drop marked a and the post-reconnection peak marked b)
Fig. 3. Experiment M=10000 B=400 LdnAvgTput = 50KB/s: communication throughput along the time (outgoing and incoming throughput in Kbytes/sec over time, with the disconnection marked a and the point where data sending stops marked b)
5 Conclusions and Future Work In order to evaluate the possibilities of using a CoHNOW as a "single cluster", a system architecture was defined, including a new service, the CM, which is in charge of managing the long-distance communication, and a system model was developed to predict and tune the application execution. Experiments were made to validate the results predicted by the model. For the selected benchmark application, experiments with different sizes, granularities and inter-cluster network throughputs were made, and the obtained results had been predicted with an error of less than 9%. The performance behavior for a specific experiment was analyzed, and this analysis demonstrates that the architecture supports effective inter-cluster collaboration and that the model can be used to predict the stabilization point and the performance behavior before and after it. Some important future lines consist of generalizing the model to include other scientific computation applications, and the selection of the optimal CoHNOW virtual configuration for maximizing the global system performance.
Acknowledgment The authors want to thank the anonymous reviewers whose comments led to a substantial clarification of the contribution of this paper.
Improving the Performance of Large-Scale Unstructured PDE Applications Xing Cai1,2 1
Simula Research Laboratory, P.O. Box 134, N-1325 Lysaker, Norway [email protected] 2 Department of Informatics, University of Oslo P.O. Box 1080 Blindern, N-0316 Oslo, Norway
Abstract. This paper investigates two types of overhead due to duplicated local computations, which are frequently encountered in the parallel software of overlapping domain decomposition methods. To remove the duplication-induced overhead, we propose a parallel scheme that disjointly re-distributes the overlapping mesh points among irregularly shaped subdomains. The essence is to replace the duplicated local computations by an increased volume of the inter-processor communication. Since the number of inter-processor messages remains the same, the bandwidth consumption by an increased number of data values can often be justified by the removal of a considerably larger number of floating-point operations and irregular memory accesses in unstructured applications. Obtainable gain in the resulting parallel performance is demonstrated by numerical experiments.
1 Introduction An important class of domain decomposition (DD) algorithms is based on the Schwarz iterations that use overlapping subdomains; see e.g. [3,8]. These algorithms have become popular choices for solving partial differential equations (PDEs) mainly due to three reasons: (1) a simple algorithmic structure, (2) rapid numerical convergence, and (3) straightforward applicability for parallel computers. The series of the international DD conferences has shown, among other things, the wide spread of the overlapping DD algorithms; see e.g. [1,6,5]. Moreover, the overlapping DD approach is well suited for implementing parallel PDE software, especially with respect to code re-use. This is because the involved subdomain problems are of the same type as the target PDE in the global domain, and no special interface solvers are required. Therefore, an existing serial PDE solver can, in principle, be re-used as the subdomain solver inside a generic DD software framework; see [2]. Besides, the required collaboration between the subdomains, in the form of communication and synchronization, can be implemented as generic libraries independent of the target PDE. The starting point of an overlapping DD method is a set of overlapping subdomains that decomposes the global solution domain as Ω = ∪_{s=1}^{P} Ω_s. Figure 1 shows such an example of partitioning an unstructured finite element mesh, where a zone of overlapping points is shared between each pair of neighboring subdomains. Roughly speaking, each subdomain is an independent “working unit” with its own discretization task and a subdomain solver. Collaboration between the subdomains is through the exchange
Fig. 1. An example of an unstructured computational mesh and one possible partitioning into several overlapping and irregularly shaped subdomains
of local solutions inside the overlapping zones. A light-weight “global administrator” is required to iteratively improve the “patched” global solution and to synchronize the working pace of all the subdomains. We remark that sufficient overlap between neighboring subdomains is important for obtaining the rapid convergence of an overlapping DD method; see e.g. [3,8]. The main computational tasks involved in an overlapping DD method are in the form of linear algebra operations, most importantly, matrix-vector product, inner-product between two vectors, and addition of vectors. These linear algebra operations are carried out on two levels: the global level and the subdomain level. The global-level operations are typically needed to check the global convergence of the DD method. In particular, when the overlapping DD method is used as a preconditioner (see [3,8]), such as to obtain more robust convergence of a global Krylov solver, global-level operations also constitute the computational kernel of the Krylov solver. In such a setting, the original global system Ax = b is replaced by an equivalent system BAx = Bb. For example, an additive Schwarz preconditioner is defined as
B_add = ∑_{s=1}^{P} R_s^T A_s^{-1} R_s.    (1.1)
In the above definition, A−1 s refers to a local solver within Ωs (see Section 2), while matrices RTs and Rs denote, respectively, the associated interpolation and restriction operators. For more mathematical details, we refer to [3,8]. The computational kernel of each Schwarz iteration, such as (1.1), is constituted by the subdomain-level operations which typically find an approximate local solution on each subdomain, when a suitable subdomain right-hand side is given. For each overlapping mesh point that is shared between multiple host subdomains, a global-level operation should give rise to the same value in all its host subdomains, whereas a subdomain-level operation may compute different values in the different host subdomains. Therefore, duplicated computations arise in the global-level operations, provided that the parallel software is based on a straightforward re-use of existing serial
linear algebra software over all the local points on each subdomain. Removing the duplication-induced overhead associated with parallel matrix-vector products and inner-products thus constitutes the main theme of the remaining text. In Section 2, a distributed data structure is explained and the associated duplicated computations are examined in detail in Section 3. The main idea of a disjoint re-distribution of all the mesh points is presented in Section 4, and we propose in Section 5 a parallel scheme for carrying out such a disjoint re-distribution. The issue of re-ordering the mesh points on each subdomain is also explained. Finally, Section 6 presents numerical experiments demonstrating the obtainable improvement of parallel performance, after the duplicated local computations are removed from global-level parallel matrix-vector products and inner-products.
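As a purely illustrative picture of how the additive Schwarz preconditioner (1.1) is typically applied, the sketch below restricts a global vector to each subdomain, performs the local solve standing in for A_s^{-1}, and prolongs the result back. The Subdomain type, its index map and the solveLocally callback are our own assumptions, not part of any specific DD framework; in a parallel code each subdomain would live on its own process and the accumulation would involve communication.

    #include <cstddef>
    #include <functional>
    #include <vector>

    // Illustrative only: y = B_add r = sum_s R_s^T A_s^{-1} R_s r, with R_s
    // realized as an index map into the global vector.
    struct Subdomain {
        std::vector<int> localToGlobal;   // indices selected by R_s
        std::function<std::vector<double>(const std::vector<double>&)>
            solveLocally;                 // stands in for A_s^{-1} (an existing serial solver)
    };

    std::vector<double> applyAdditiveSchwarz(const std::vector<Subdomain>& subs,
                                             const std::vector<double>& r) {
        std::vector<double> y(r.size(), 0.0);
        for (const Subdomain& s : subs) {
            std::vector<double> rs(s.localToGlobal.size());
            for (std::size_t i = 0; i < rs.size(); ++i)
                rs[i] = r[s.localToGlobal[i]];                // R_s r
            const std::vector<double> xs = s.solveLocally(rs); // A_s^{-1} R_s r
            for (std::size_t i = 0; i < xs.size(); ++i)
                y[s.localToGlobal[i]] += xs[i];               // accumulate R_s^T x_s
        }
        return y;
    }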
2 Distributed Data for Linear Algebra Operations In the context of parallel overlapping DD methods, the involved linear algebra operations use a distributed data structure. That is, all the mesh points (including the overlapping points) in a subdomain participate in constructing the local matrix and vectors of the subdomain. In addition to being used in the subdomain-level operations, such a distributed set of subdomain matrices and vectors is also used in global-level operations. This is because a global matrix or vector can be collectively represented by the set of subdomain local matrices or vectors. A global matrix A is thus represented by a set of subdomain matrices As , 1 ≤ s ≤ P , where P denotes the number of subdomains and As arises from discretizing the target PDE restricted inside Ωs . We note that some of the rows in A are duplicated among neighboring subdomain matrices, due to the overlap between subdomains. Here, it is important to note the so-called internal boundary of Ωs , which refers to ∂Ωs \∂Ω, i.e., the part of ∂Ωs that borders into a neighboring subdomain (thus not on the global boundary ∂Ω). Since a Dirichlet type boundary condition is normally applied on the internal boundary of Ωs , the corresponding rows in As are different from those in A, whereas the other rows in As are identical with their corresponding rows in A. A global solution vector, say x, is distributed as the subdomain vectors xs , 1 ≤ s ≤ P . The “overlapping entries” of x are assumed to be always correctly duplicated on all the neighboring subdomains, including on the internal boundaries. During each Schwarz iteration, the neighboring subdomains may compute different local solutions in the overlapping regions. The subdomains thus need to exchange values between the neighbors and agree upon an averaged value for each overlapping point.
3 Overhead due to Duplicated Computations Parallel execution of global-level operations can often be realized through serial local operations plus appropriate inter-subdomain communication. To parallelize a global-level matrix-vector product y = Ax, we can first individually compute y_s = A_s x_s on each subdomain. Then, the entries of y_s that correspond to the internal boundary points on Ω_s must rely on neighboring subdomains to “pass over” the correct values. The needed communication is of the form that each pair of neighboring subdomains exchanges an array of values.
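A minimal sketch of this neighbor exchange is given below; it is our own illustration (the Neighbor structure and index lists are assumptions, not part of the paper's software), assuming each subdomain has already computed y_s = A_s x_s with its existing serial kernel and now overwrites its internal-boundary entries with the values owned by the neighbors.

    #include <mpi.h>
    #include <vector>

    // Hypothetical description of one neighbor relation of a subdomain.
    struct Neighbor {
        int rank;                   // MPI rank of the neighboring subdomain
        std::vector<int> sendIdx;   // local indices whose values the neighbor needs
        std::vector<int> recvIdx;   // local internal-boundary indices we receive
    };

    // One message per neighbor pair, as described in the text.  For simplicity the
    // sketch assumes messages are small enough to be buffered (or neighbor lists are
    // ordered consistently), so the blocking MPI_Sendrecv calls do not deadlock.
    void exchangeInternalBoundary(std::vector<double>& y,
                                  const std::vector<Neighbor>& neighbors,
                                  MPI_Comm comm) {
        for (const Neighbor& nb : neighbors) {
            std::vector<double> sendBuf(nb.sendIdx.size()), recvBuf(nb.recvIdx.size());
            for (std::size_t i = 0; i < sendBuf.size(); ++i)
                sendBuf[i] = y[nb.sendIdx[i]];
            MPI_Sendrecv(sendBuf.data(), (int)sendBuf.size(), MPI_DOUBLE, nb.rank, 0,
                         recvBuf.data(), (int)recvBuf.size(), MPI_DOUBLE, nb.rank, 0,
                         comm, MPI_STATUS_IGNORE);
            for (std::size_t i = 0; i < recvBuf.size(); ++i)
                y[nb.recvIdx[i]] = recvBuf[i];
        }
    }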
Regarding the parallelization of a global-level inner-product w = x · y, the strategy is to let each subdomain first compute its contribution before being added up globally. A naive approach is to first individually compute w_s = x_s · y_s, by applying existing serial inner-product software to all the points on Ω_s. However, the local result w_s must be adjusted before we can add them up to produce w. More specifically, we find
w̃_s = w_s − ∑_{k∈K_s} ((K_{s,k} − 1)/K_{s,k}) · x_{s,k} × y_{s,k},    (3.2)
where K_s denotes an index set containing the local indices of all the overlapping mesh points on Ω_s. Moreover, K_{s,k} is an integer containing the degree of overlap for an overlapping point, i.e., the total number of subdomains in which this point is duplicated. Finally, the desired global result can be found as w = ∑_{s=1}^{P} w̃_s, through an all-to-all collective communication. The above strategies for parallelizing global-level operations are particularly attractive with respect to code re-use. This is because the involved subdomain-level operations y_s = A_s x_s and w_s = x_s · y_s can directly re-use existing serial software, involving all the mesh points on Ω_s. However, the disadvantage of these simple strategies is the resulting overhead due to duplicated computations. More specifically, some entries of y_s = A_s x_s are also computed on other subdomains. Similarly, the local computation of w_s = x_s · y_s is “over-sized” and the later adjustment of form (3.2) introduces additional overhead. For three-dimensional finite element computations, in particular, each point in an unstructured mesh normally has direct couplings to about 15–20 other points, meaning that each row in A has approximately 15–20 non-zero entries. (The use of higher-order basis functions will give rise to more non-zero entries.) The sparse matrix A is usually stored in a compressed row format, i.e., a one-dimensional floating-point array a_entries containing row-wise all the non-zero entries, together with two one-dimensional integer arrays i_row and j_col, which register the positions of the non-zero entries. The computation of y = Ax on row i is typically implemented as
    y[i] = 0.0;
    for (j=i_row[i]; j<i_row[i+1]; j++)
      y[i] += a_entries[j]*x[j_col[j]];
It can be observed that the memory access pattern for x is irregular and indirect, of form x[j_col[j]]. In other words, the combination of a large number of floating-point operations and irregular memory accesses can make the computation per row quite expensive. This motivates a strategy of removing the duplicated computation per overlapping point at the price of one extra data value in communication, which may be performance-friendly for many unstructured PDE applications. We remark that the number of inter-subdomain messages remains the same, so the increased communication volume does not result in more latency, but only consumes more bandwidth.
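The adjusted local inner product of (3.2) is simple to realize in code; the following sketch (our own, with assumed variable names and data layout) subtracts the over-counted contribution of each overlapping point from the naive local sum.

    #include <vector>

    // overlapIdx holds the local indices of the overlapping points on the
    // subdomain (the set K_s) and overlapDeg the corresponding degrees of
    // overlap K_{s,k}.
    double adjustedLocalInnerProduct(const std::vector<double>& xs,
                                     const std::vector<double>& ys,
                                     const std::vector<int>& overlapIdx,
                                     const std::vector<int>& overlapDeg) {
        // Naive local contribution over all points of the subdomain.
        double ws = 0.0;
        for (std::size_t i = 0; i < xs.size(); ++i)
            ws += xs[i] * ys[i];
        // Subtract the over-counted part, as in (3.2).
        for (std::size_t k = 0; k < overlapIdx.size(); ++k) {
            const int i = overlapIdx[k];
            ws -= (overlapDeg[k] - 1.0) / overlapDeg[k] * xs[i] * ys[i];
        }
        return ws;  // the global w is then obtained by an all-to-all summation
    }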
4 Disjoint Re-distribution of Computational Points The first step towards removing the duplication-induced overhead, which is described in Section 3, is to find a disjoint re-distribution of the entries of x and y (and also rows
of A) among the subdomains. The result of enforcing such a disjoint partitioning is that, during y_s = A_s x_s, only a subset of the rows in A_s carries out the computation, so that each entry of the global vector y is computed in exactly one subdomain. Similarly, when computing x_s · y_s on Ω_s, we only go through a subset of the entries in x_s and y_s to obtain a “correctly-sized” w_s value, which can then be used to compute w = ∑_{s=1}^{P} w_s without any extra adjustments of form (3.2).
Fig. 2. A simple example of reducing the number of computational points (black-colored) in two overlapping subdomains. The white-colored internal boundary points and the grey-colored overlapping points do not participate in the distributed global-level computations; these points receive computational results from the neighbor. (The figure shows an overlapping partitioning of a global domain into Subdomain No. 1 and Subdomain No. 2, before and after the disjoint re-distribution.)
To illustrate the main idea of the disjoint re-distribution, we show a very simple example in Figure 2. In this example, a global domain is decomposed into two overlapping subdomains, with three columns of points in the overlap region. The internal boundary points on each subdomain are marked with the white color. On the middle
row of Figure 2, the two columns of overlapping points (those lying immediately beside the internal boundary points) are initially treated on each subdomain as computational points, marked with the same black color as for the interior non-overlapping points. After enforcing a disjoint re-distribution, which is particularly simple for this example, some of the overlapping points are marked with the grey color (see the two subdomains on the bottom row of Figure 2). These grey-colored points thus become “non-computational”, meaning that their computational values, together with values of the internal boundary points, are to be provided by the neighboring subdomain. In other words, removal of the duplicated computations comes at the cost of an increased communication volume. We remark that this overhead removal technique only applies to global-level matrixvector products and inner-products. During each Schwarz iteration, all the points on a subdomain have to participate in the subdomain solver. For unstructured computational meshes, such as that depicted in the left picture of Figure 1, devising a disjoint re-distribution scheme is non-trivial. Moreover, the redistribution scheme needs to fulfill at least two requirements: (1) it should be based on the existing overlapping partitioning of the points, such that each subdomain chooses a subset of its local points to work as computational points, and (2) the number of computational points per subdomain should be distributed evenly. An optional requirement may be that the re-distribution scheme can be executed in parallel, instead of having to use a “master” processor for doing the re-distribution.
5 A Parallel Re-distribution Scheme Before presenting a parallel re-distribution scheme, we assume that the global domain Ω has N points and there already exists a set of P overlapping subdomains, where N_s denotes the number of points on Ω_s. Moreover, we assume that N_s = N_s^I + N_s^O, where N_s^I denotes the number of interior points on Ω_s, while N_s^O denotes the number of overlapping points on Ω_s, including N_s^B internal boundary points.
A Parallel Re-Distribution Scheme
Phase 1: Mark all the N_s^I interior points on Ω_s as computational points. In addition, some of the non-internal-boundary overlapping points that lie closest to the interior points are also marked, so that the number of initially marked computational points on Ω_s becomes approximately N/P.
Phase 2: Use inter-subdomain communication to find out, for each overlapping point, the number of subdomains (denoted by C) that have initially marked it as a computational point. If C ≠ 1, choose one host subdomain out of possibly several candidates (in which the considered point does not lie on the internal boundary). Mark the overlapping point as a computational point on the host subdomain, and when necessary (if C > 1), mark the overlapping point as non-computational on the other subdomains.
We remark that in the above parallel re-distribution scheme, the choice of the host subdomain in Phase 2 has to be made unanimous among all the candidate host subdomains. For this purpose, we use a unique global number of every overlapping point, say g, which is agreed on all the subdomains beforehand. Let us denote by S ≥ 1 the number of candidate host subdomains for g. In case none or more than one candidate have initially marked point g, we can select the jth candidate host subdomain, where e.g. j = mod(g, S) + 1. This unanimous selection algorithm has a random nature and preserves the load balance quite satisfactorily, provided that the overlapping subdomains are well balanced originally. After a re-distribution, all the points on Ω_s receive a “label”, either as a computational point or as a non-computational point. It is clear that all the N_s^I interior points on Ω_s are computational points by default. In addition, some of the overlapping points will also be labelled as computational points. Let us denote by N_s^{Oc} the number of these computational overlapping points on Ω_s (note we always have N_s^{Oc} ≤ N_s^O − N_s^B). The re-distribution scheme aims at a disjoint partitioning of the N global points among the subdomains, i.e.,
N = ∑_{s=1}^{P} (N_s^I + N_s^{Oc}).
This labeling of the subdomain points is, in principle, sufficient for avoiding overhead due to duplicated computations. However, if the chosen computational points are scattered in the list of all the subdomain points on Ωs , we have to resort to some kind of iftest when traversing the rows of As to compute only the desired subset of y s = As xs . The same if-test is also needed for traversing the desired subsets of xs and y s in association with a parallel inner-product. In practice, this additional if-test is a serious obstacle for achieving good computational performance. A remedy to this problem is to re-order the points on each subdomain. More specifically, the first segment contains the NsI interior points, and the second segment contains the NsOc computational overlapping points. The third segment contains the remaining overlapping points that do not lie on the internal boundary (such as the grey-colored points in Figure 2). Finally, the fourth segment contains the NsB internal boundary points. It is clear that the disjoint re-distribution is unconditionally beneficial to the parallel inner-products, because the number of used entries in xs and y s is reduced, while the additional adjustment of form (3.2) is avoided. For the parallel matrix-vector products, the “benefit” on Ωs is a reduced number of the computational rows from Ns to NsI +NsOc , whereas the “loss” is an increase in the communication volume of order NsO −2NsB . The latter is because the number of incoming data values is raised from NsB to NsO − NsOc , meanwhile the number of outgoing data values is raised from approximately NsB to NsOc . In other words, if the additional bandwidth cost of moving NsO − 2NsB data values is lower than the computational cost of NsO −NsOc rows in ys = As xs , we should consider replacing the duplicated computations by an increased communication volume.
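To make the unanimous host selection of Phase 2 concrete, here is a tiny sketch (ours, not from the paper's implementation): every candidate subdomain evaluates the same deterministic rule j = mod(g, S) + 1 on the agreed global point number g, over a candidate list that is ordered identically on every process, so no extra negotiation is needed before the points are labelled and re-ordered into the four segments described above.

    #include <cstddef>
    #include <vector>

    // Deterministic host choice for an overlapping point with global number g.
    // candidates must be sorted identically on every process (e.g. by subdomain id).
    int chooseHost(long g, const std::vector<int>& candidates) {
        const long S = static_cast<long>(candidates.size());   // S >= 1
        return candidates[static_cast<std::size_t>(g % S)];    // j = mod(g, S) + 1, 0-based here
    }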
6 Numerical Experiments As a set of concrete numerical experiments, we measure the wall-time consumptions of parallel global-level matrix-vector products and inner-products, which are associated with the discretization of a Poisson equation on a three-dimensional unstructured mesh using 10,375,680 tetrahedral elements and N = 1,738,607 mesh points. The average number of non-zeros per row in the resulting sparse matrix A is approximately 15, and two layers of elements are shared between neighboring subdomains. (The global mesh is partitioned into P overlapping irregularly shaped subdomains by using the Metis package [4] in combination with the parallel software toolbox in Diffpack [7].) We have tested two quite different parallel platforms, where the first platform is a Linux cluster consisting of 1 GHz Pentium III processors interconnected through a 100 Mbit/s ethernet. The second platform is an SGI Origin 3800 system using 600 MHz R14000 processors. The Linux cluster is particularly interesting because the single-processor computational cost per row in A is approximately the same as the bandwidth cost of moving one data value (8 bytes), i.e., 6.4 × 10^-7 s. (We note that the R14000 processors are approximately 60% faster than the Pentium III processors for our unstructured serial computations, whereas the bandwidth advantage of the Origin system over the Linux cluster is more than ten-fold.)

Table 1. The effect of replacing duplicated computations by an increased communication volume; measurements on a Linux cluster and an SGI Origin 3800 system

                            Linux cluster                              Origin system
                            w/ duplications      w/o duplications      w/ duplications      w/o duplications
P    max Ns   max NsC       Tmatvec   Tinner     Tmatvec   Tinner      Tmatvec   Tinner     Tmatvec   Tinner
2    894605   869385        5.81e-01  4.60e-02   5.77e-01  3.28e-02    4.75e-01  3.50e-02   4.07e-01  2.87e-02
4    465695   435364        3.26e-01  2.63e-02   3.25e-01  2.05e-02    2.69e-01  1.18e-02   2.53e-01  7.06e-03
8    244205   218313        1.79e-01  1.41e-02   1.94e-01  1.06e-02    1.45e-01  6.62e-03   1.32e-01  2.09e-03
16   130156   109550        1.02e-01  1.01e-02   9.75e-02  5.79e-03    8.15e-02  3.07e-03   6.51e-02  1.11e-03
32    69713    55175        N/A       N/A        N/A       N/A         4.09e-02  1.91e-03   3.06e-02  6.58e-04
In Table 1, we denote the wall-time consumption for each matrix-vector product c and inner-product by Tmatvec and Tinner, respectively. Moreover, NsC = NsI + NsO denotes the number of computational points on Ωs . (The difference between max Ns and max NsC indicates the amount of removable duplicated computations.) The parallel re-distribution scheme from Section 5 has been applied to duplication removal. The computational cost of the re-distribution itself is quite small, no more than a couple of parallel matrix-vector products. It can be observed from Table 1 that there is always gain in Tinner after removing the duplicated local computations. For Tmatvec , however, the gain on the Linux cluster is smaller than that obtained on the Origin system. This is because the Linux cluster has a much higher bandwidth cost. In one extreme case, i.e., P = 8 on the Linux cluster, Tmatvec has increased. The reason for not obtaining perfect
speed-ups, even after removing the duplicated computations, is mainly due to (1) the increased bandwidth consumption and (2) a somewhat imperfect distribution of both N_s^I + N_s^{Oc} and the communication overhead among the subdomains. These factors can be illustrated by Figure 3, where the mesh points on each subdomain are categorized into four types as described in Sections 4 and 5.
Fig. 3. The categorization of mesh points into four types (internal boundary points, non-computational overlapping points, computational overlapping points, interior non-overlapping points) for sixteen overlapping subdomains
7 Concluding Remarks In this paper, we have explained the origin of duplicated local computations that frequently arise in parallel overlapping DD software. A disjoint re-distribution of the overlapping mesh points among the subdomains removes the duplications associated with global-level parallel matrix-vector products and inner-products. Although numerical experiments have already shown considerable gain in the parallel performance, more effort is needed to improve the parallel re-distribution scheme from Section 5. It needs first and foremost to produce an even better load balance. Secondly, we may have to “relax” the objective of a complete removal of the duplications in parallel matrix-vector products. That is, a certain amount of duplicated computations may be allowed, such as to limit an acceleration of the communication volume. This means that a “compromise” may be required between reducing the duplicated computations and increasing the communication volume.
Acknowledgement The work presented in this paper has been partially supported by the Norwegian Research Council (NFR) through the Programme for Supercomputing in the form of a grant of computing time.
References 1. P. E. Bjørstad, M. Espedal, and D. Keyes, editors. Domain Decomposition Methods in Sciences and Engineering. Domain Decomposition Press, 1998. Proceedings of the 9th International Conference on Domain Decomposition Methods. 2. X. Cai. Overlapping domain decomposition methods. In H. P. Langtangen and A. Tveito, editors, Advanced Topics in Computational Partial Differential Equations – Numerical Methods and Diffpack Programming, pages 57–95. Springer, 2003. 3. T. F. Chan and T. P. Mathew. Domain decomposition algorithms. In Acta Numerica 1994, pages 61–143. Cambridge University Press, 1994. 4. G. Karypis and V. Kumar. Metis: Unstructured graph partitioning and sparse matrix ordering system. Technical report, Department of Computer Science, University of Minnesota, Minneapolis/St. Paul, MN, 1995. 5. R. Kornhuber, R. Hoppe, J. P´eriaux, O. Pironneau, O. Widlund, and J. Xu, editors. Domain Decomposition Methods in Sciences and Engineering, volume 40 of Lecture Notes in Computational Science and Engineering. Springer, 2004. Proceedings of the 15th International Conference on Domain Decomposition Methods. 6. C.-H. Lai, P. E. Bjørstad, M. Cross, and O. Widlund, editors. Domain Decomposition Methods in Sciences and Engineering. Domain Decomposition Press, 1999. Proceedings of the 11th International Conference on Domain Decomposition Methods. 7. H. P. Langtangen. Computational Partial Differential Equations - Numerical Methods and Diffpack Programming. Texts in Computational Science and Engineering. Springer, 2nd edition, 2003. 8. B. F. Smith, P. E. Bjørstad, and W. Gropp. Domain Decomposition: Parallel Multilevel Methods for Elliptic Partial Differential Equations. Cambridge University Press, 1996.
A Runtime Adaptive Load Balancing Algorithm for Particle Simulations Matthew F. Dixon Department of Mathematics, Imperial College, London [email protected]
Abstract. A new dynamic load balancing algorithm is proposed for particle simulations on clusters of workstations which operates at two levels; it both dynamically defines groups of particles in which the force calculations are localised and at the same time adaptively assigns groups to processors based on the group sizes and runtime CPU usages. It therefore adaptively balances using both the dynamics of the simulation and the load usage of the cluster at runtime. The algorithm is implemented in the POOMA framework and applied to a particle-in-cell approximation of a three-dimensional elastic particle collision model. Load balancing metrics and parallel scalability are determined on an 8 quad-processor 833 MHz Compaq Alpha cluster connected by a gigabit ethernet.
1 Introduction Particle simulations are widespread in scientific computing and provide a natural computational approach to many equations derived from particle mechanics. Particle-in-cell methods remain one of the most common techniques used for particle simulations as they are usually simple. However, like many other particle simulation methods, particle-in-cell methods suffer from poor parallel scalability [8][13]. This is because the force F_i on a particle i with position X_i is the gradient of the potential energy V(X_1, ..., X_N) of the particle system (which is model specific),
F_i = −∇_{X_i} V(X_1, ..., X_N),   U̇_i = F_i / m_i   and   Ẋ_i = U_i,
and depends on the position of the other particles in the system. The velocity U_i and hence the new position of each particle with mass m_i are determined by the above equations of motion. Parallelisation of the model by assignment of either particles or physical subdomains to different processes leads to either high interprocessor communication requirements or irregular load balance. That is, partitioning of label space into subdomains of equal numbers of particles, allocated to each processor, does not localise the data required for the force calculations, if particles in close proximity of each other in physical space reside on different processors. See, for example, [9][12]. Conversely, partitioning of the physical domain into subdomains, allocated to each processor, is not necessarily compatible with
The author thanks Alvin Chua and Steven Capper for their careful proof reading and useful comments on this article.
the distribution of the work load, namely, calculation of the particle forces. For example, if particles cluster together inside a subdomain, no work load is distributed to the other subdomains. An extensive number of dynamic load balancing algorithms for parallel particle simulations have been developed which are able to distribute particles across processors based on their proximities to others or by their position in the domain, see for example [2][7][8]. Each consider the main factors in designing an effective load balancing algorithm (i) data locality (ii) granularity of balancing and (iii) the additional computation and interprocessor communication required by the algorithm, referred to hereon, as the overhead. However, none of these approaches use runtime information such as the processor load, performance and latency between other processors to enhance calibration of the algorithm. In particular, the taskfarm approach described in [8], assembles particles into tasks and balances them by instructing processors, which have finished processing their own tasks, to fetch tasks from the others. In theory, this approach could be improved for use on a cluster, by allocating tasks to processors based on processor usages, thus taking a predictive rather than reactive approach. The potential for performance improvement using prediction is dependent on both the level of fluctuation of the usages and the simulated particle dynamics. Many of the dynamic load balancing schemes that have been developed can be classified as either Scratch-and-Remap or Diffusion-based schemes [6]. The former approach partitions workload from scratch at each stage whilst the latter achieves balance by successive migration. The choice of scheme is dependent on the dynamics of the simulation. A runtime adaptive algorithm would be most appropriate for a smoothly varying but inhomogeneous particle simulation. Predictive information on the particle clustering can then be used by a diffusion-based load balancing algorithm, to group particles. It would also suit a shared cluster of workstations running multiple applications at any one instance, provided that load usages remain consistent between load balancing cycles and the overhead of rebalancing is small. A simple and efficient approach for load assignment and rebalance event detection based on measuring efficiency ratios of load to capacity is described in [3]. Since this approach was not developed for particle simulations, there is no concept of balancing at the granularity of dynamically defined groups of entities. This approach is extended, here, to recalibrate particle to processor assignment based on both the particle positions and runtime information. This combined approach, to the best knowledge of the author, has not been previously designed. Different metrics for detecting load balancing events based on the homogeneity of the distribution of the particle are considered. Finally details of the implementation in POOMA [4] and numerical results are presented for a parallel 3D elastic particle collision model [1] running on a Compaq Alpha cluster.
2 Algorithm Consider a single-threaded message-based Single-Processor Multiple-Data parallel model where each processor is represented as a physical node to which numerous virtual nodes (groups) are assigned based on its runtime capacity. Particles are then allocated to
each virtual node based on their proximity to one another. Consider first, for simplicity, dynamic balancing of individual particles. Single Particle Balancing Let the set of all particles in the system be given by P := {p_1, p_2, ..., p_n}, where n is the total number of particles in the system. The subset of n_j particles on processor j ∈ {1, 2, ..., Np} is P_j(t) := {p_{β1}, p_{β2}, ..., p_{βnj}}, using the convention that roman numerals are local indices, such that P = ∪_j P_j. Assume that all particles remain equi-distributed and fixed in number; then there is one particle per virtual node and the particles can be considered as virtual nodes. The number of particles assigned to processor j at time t_i, n_j(t_i), depends on n and the weights w_j(t_i). Each w_j(t_i) is a function of the processor peak FLOP rate r_j and its usage u_j(t_i), normalised by the total capacity C(t_i) = ∑_j C_j(t_i). Particle assignment: The number of particles initially assigned to each processor is based on its availability and performance:
n_j(t_0) = w_j(t_0) n,   where   w_j(t_i) = C_j(t_i)/C(t_i)   and   C_j(t_i) = u_j(t_i) r_j.
After setting up the system as prescribed, it will be necessary to reassign the particles if the usage of the processor changes. Particle reassignment: A processor event causes reassignment if w_j(t_i) − w_j(t_{i-1}) > w_tol, where w_tol is a tolerance to reduce the sensitivity of the load balancing to small changes in relative processor capacity and t_i − t_{i-1} > ∆τ, the time step size used in the simulation. The number of particles at time t_i is related to the number of particles at time t_{i-1} by the formula
n_j(t_i) = [w_j(t_i)/w_j(t_{i-1})] n_j(t_{i-1}).
As such, n_j(t_i) − n_j(t_{i-1}) particles are relocated to or from physical node j at time step t_i. Particle relocation, however, reduces parallel scalability of the potential calculations. This motivates the derivation of an algorithm for a coarser level of granularity using groups. Group Load Balancing Multiple particles p ∈ P_j can be uniquely assigned to g_j groups G_j^{αl}(t) := {p_{γ1}, p_{γ2}, ..., p_{γ_{S_j^{αl}}}}. Define local subindices l, m ∈ {1, 2, ..., g_j} of α on processor j, so that
groups are uniquely labelled with the superscript over the computational domain. Let
X(p_n, t_i) be the position of particle p_n at time t_i and R(·, ·) the distance between any two particles. If the proximity of a particle to a fictitious reference particle p_ref^{αl} is less than a tolerance r_tol = ∆X(t_0), such as the initial equi-interparticle distance, for example, then the particle is assigned to group αl:
G_j^{αl}(t_i) := { p_s ∈ P_j, p_s ∉ ∪_{m<l} G_j^{αm}(t_i) | R(X(p_s, t_i), X(p_ref^{αl}, t_i)) < r_tol }.
The reference particles {p_ref^{αl}} are initially chosen as equidistant points in the Eulerian reference frame and, once groups have been assembled, are recomputed as the mean position of all the particles in the group. In this way all S_j^{αl} particles in a group are categorised by proximity and are defined as interacting. Non-interacting particles are also assigned to single-element groups, and the load balancing algorithm then calibrates with groups and not particles. For simplicity, it is assumed that the number of particles remains fixed. In addition to defining a metric for load balancing based upon the particle distances and the processor capacities, classification of the particle dynamics can also be used to calibrate the algorithm. For ease of explanation, two categories are considered: (A) weakly and (B) strongly non-equidistributed particle simulations.
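The proximity-based grouping just described can be sketched in a few lines; the code below is our own illustration (not POOMA code): a particle joins the first group whose reference point lies within r_tol, otherwise it starts a new single-element group, and the reference points would afterwards be recomputed as the mean position of each group's members.

    #include <array>
    #include <cmath>
    #include <vector>

    using Vec3 = std::array<double, 3>;

    static double dist(const Vec3& a, const Vec3& b) {
        return std::sqrt((a[0]-b[0])*(a[0]-b[0]) + (a[1]-b[1])*(a[1]-b[1]) +
                         (a[2]-b[2])*(a[2]-b[2]));
    }

    struct Group {
        Vec3 ref;                  // reference point p_ref of the group
        std::vector<int> members;  // local particle indices
    };

    void assignToGroups(const std::vector<Vec3>& positions, double rtol,
                        std::vector<Group>& groups) {
        for (int p = 0; p < (int)positions.size(); ++p) {
            bool placed = false;
            for (Group& g : groups) {
                if (dist(positions[p], g.ref) < rtol) {
                    g.members.push_back(p);
                    placed = true;
                    break;
                }
            }
            if (!placed)                               // non-interacting particle:
                groups.push_back({positions[p], {p}}); // its own single-element group
        }
    }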
Particle Group Assignment The number of groups initially assigned to each processor j is based on its availability and performance, and on the mean group size. Consider the two cases A and B of particle classification:
A: g_j(t_0) = w_j(t_0) n / µ(S^m(t_0))   or   B: g_j(t_0) = w_j(t_0) n / µ(S_j^{αl}(t_0)),
where the mean group size is µ(S^m(t)), m ∈ {1, 2, ..., M} and M = ∑_j g_j, and µ(S_j^{αl}) is the mean group size on physical node j.
Particle Group Reassignment The number of groups assigned to processor j at time t_i is related to the number of groups at time t_{i-1} by the formula below. Consider, again, the two cases A and B, where µ̂ := µ(S^m) and µ* := µ(S_j^{αl}):
A: g_j(t_i) = [µ̂(t_i) w_j(t_i) / (µ̂(t_{i-1}) w_j(t_{i-1}))] g_j(t_{i-1})   or   B: g_j(t_i) = [µ*(t_i) w_j(t_i) / (µ*(t_{i-1}) w_j(t_{i-1}))] g_j(t_{i-1}).
For case A reassignment thus depends on the mean size of all groups or, for case B, the mean size of the groups on processor j, depending on the distribution of the particles at time t_i. The particle distribution is analysed every k-th simulation time step and the hardware evaluation for context mapping every m-th simulation time step, where k and m are suitably chosen parameters based on the dynamics of the particles, the available processor capacity and the time taken to measure the particle positions and processor state.
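A small sketch of the case-A bookkeeping follows; it is our own illustration (not the author's implementation) of how the weights are obtained from the runtime capacities and how the reassignment rule above is evaluated.

    #include <vector>

    struct NodeState {
        double peakFlops;   // r_j
        double usage;       // u_j(t_i), fraction of the CPU available to us
    };

    // C_j = u_j * r_j and w_j = C_j / sum_k C_k
    std::vector<double> weights(const std::vector<NodeState>& nodes) {
        double total = 0.0;
        for (const NodeState& n : nodes) total += n.usage * n.peakFlops;
        std::vector<double> w;
        for (const NodeState& n : nodes) w.push_back(n.usage * n.peakFlops / total);
        return w;
    }

    // Case A: g_j(t_i) = [mu(t_i) w_j(t_i)] / [mu(t_{i-1}) w_j(t_{i-1})] * g_j(t_{i-1})
    double reassignedGroups(double gPrev, double wPrev, double wNow,
                            double meanGroupSizePrev, double meanGroupSizeNow) {
        return meanGroupSizeNow * wNow / (meanGroupSizePrev * wPrev) * gPrev;
    }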
3 Implementation In the POOMA framework, particle indices can be represented as a domain in label space which is partitioned into patches (virtual nodes) by layouts, which can be easily dynamically reconfigured. One or more patches are then allocated to a context associated with the hardware by a context mapper. See [5] for further details. A special-purpose context mapper class has been developed which is initialised by the master node with a set of processor capacities C_i based on a pre-defined map for peak processor capacities whose key is each machine name in the cluster. At each iteration of the load balancing algorithm, each client is instructed to determine the runtime processor statistics by calling the API of the profiling tool on a separate thread. The client then sends this data to the context mapper class on the master node, which then determines the number of patches to reallocate to each context sequentially. Finally each client exchanges group data through the master node. This approach is liable to cause a bottleneck at the master processor. A parallel task queue, as used in [10], would help to alleviate this problem. The clients also exchange particle data peer-to-peer. This is required (i) for a force calculation or (ii) after a repartitioning event. Cross-context dependencies can be further minimised by using Guard Layers to store particles of adjacent patches, but this was not used in testing. Load Balancing Algorithm: Simulation Event At every k-th simulation time step τ_i^k = t_i each client j performs the algorithm defined in Figure 1. Particles on client j are regrouped every k-th time step if the total particle displacements, for all local particles, is greater than a threshold tol_sim. This approach to associating a particle with a group could be improved using, for example, Voronoi cells as described in [9]. Each client then requests a set of reference nodes {p_ref^{αl}} for all g_sum groups and the corresponding processors where the groups are located, {pr_{βl}}, defined with a global index β. For each local group, each member particle is checked to determine whether it still belongs to the group. If it doesn't, then it is compared against all reference nodes. If it is associated with a remote reference node, then the particle and the target client are added to the arrays relocateParticles and targetProcessors for dispatch. Otherwise, the particle is reassigned to the target local group. The client maintains an array of all local particles and the groups to which they belong. The master maintains an array associating groups to the clients and arrays for the size and reference particles for each group. In this way, the master is not overloaded with particle information, but is dedicated to group balancing. Load Balancing Algorithm: Processor Event At every m-th simulation time step τ_i^m = t_i, each processor j performs the algorithm defined in Figure 2. The master processor first computes the weights for each client. Processing each client sequentially, it determines whether the change in a client weight ∆w_j(t_i) exceeds
1:  if ||{∆X(p_n)}|| < tol_sim, p_n ∈ ∪_l G_j^αl(t_i) then
2:    ReceiveFrom(pr_master, {p_ref^αl}, {pr_βl}, g_sum)
3:    for l := 1 → g_j(t_i) do
4:      for n := 1 → S_j^αl(t_i) do
5:        if R(X(p_γn), X(p_ref^αl)) > r_tol then
6:          for k := 1 → g_sum do
7:            if R(X(p_γn), X(p_ref^αk)) < r_tol then
8:              if pr_βk != pr_j then
9:                relocationParticles.add(p_γn)
10:               targetProcessors.add(pr_βk)
11:             else
12:               Update group(G_j^αk(t_i), p_γn)
13:               Update group(G_j^αl(t_i), p_γn)
14:             end if
15:           end if
16:         end for
17:       end if
18:     end for
19:   end for
20:   SendTo(targetProcessors, relocationParticles)
21:   Receive(relocationParticles)
22:   for r := 1 → N{relocationParticles} do
23:     for l := 1 → g_j(t_i) do
24:       Update group(G_j^αl(t_i), relocationParticles[r])
25:     end for
26:   end for
27: end if
Fig. 1. Load balancing algorithm for the simulation event, executed by each client j
a threshold tol_cpu and, if so, calculates the group imbalance, that is, the number of groups above or below its capacity, assuming uniform group sizes. It then searches over all unbalanced clients with a greater index to find the client whose imbalance best matches. It then instructs the client with an excess number of groups to send them to the target client, and the group numbers are appropriately modified. Once a client has been rebalanced, group numbers remain fixed until the next load balancing iteration and the remaining unbalanced clients must exchange groups amongst themselves. Although this approach is simple, the last clients to be processed may not be as well balanced as those first processed. Alternating the direction of processing of the client lists may potentially improve performance. This approach also assumes, for simplicity, that group sizes remain near uniform, either globally or per client depending on the particle dynamics classification. Many approaches, however, also prioritise by load unit size, see for example [8][12]; this approach could be extended to prioritise larger groups. A further extension is to regroup in a way that keeps messages between neighbouring clients, as suggested in [11].
1:  if pr_j != pr_master then
2:    C_j(t_i) = GetCapacity()
3:    SendTo(pr_master, C_j)
4:    ReceiveFrom(pr_master, ∆g_j(t_i), pr_target)
5:    for l := 1 → ∆g_j(t_i) do
6:      relocationGroups.add(G_j^αl(t_i))
7:    end for
8:    SendTo(pr_target, relocationGroups)
9:    Receive(relocationGroups)
10:   for r := 1 → N{relocationGroups} do
11:     for l := 1 → g_j(t_i) do
12:       Update groups(G_j^αl(t_i), relocationGroups[r])
13:       Calculate(µ(G_j^αl(t_i)))
14:     end for
15:   end for
16: else
17:   Receive({C_k(t_i)})
18:   for k := 1 → Np do
19:     w_k(t_i) = C_k(t_i) / ∑_n C_n(t_i)
20:   end for
21:   for k := 1 → Np do
22:     if |∆w_k(t_i)| > tol_cpu then
23:       ∆g_k(t_i) = ( µ(S^m(t_i)) w_k(t_i) / (µ(S^m(t_{i-1})) w_k(t_{i-1})) − 1 ) g_k(t_{i-1})
24:       sign = ∆g_k(t) / |∆g_k(t)|
25:       min_j({∆g_j(t_i)}_{j:=k+1→Np} + ∆g_k(t_i))
26:       if sign < 0 then
27:         for l := 1 → ∆g_j(t_i) do
28:           relocationGroups.add(G_j^αl(t_i))
29:         end for
30:         pr_target = k
31:       else
32:         for l := 1 → ∆g_j(t_i) do
33:           relocationGroups.add(G_i^αl(t_i))
34:         end for
35:         pr_target = j
36:       end if
37:       SendTo(pr_target, relocationGroups)
38:       g_k(t_i) −= sign ∗ ∆g_j(t_i)
39:       g_j(t_i) += sign ∗ ∆g_j(t_i)
40:     end if
41:   end for
42: end if
Fig. 2. Load balancing algorithm for the processor event
Fig. 3. The effect of k on (top left) remote data dependencies and (top right) regrouping costs. The effect of m on (bottom left) load imbalance and (bottom right) group rebalancing costs (curves for k = 5, 10 and m = 1, 5 over 50 time steps)
4 Results The weakly non-equidistributed particle group load balancing algorithm was tested in a particle-in-cell method for a 3D elastic particle collision model (see [1]) using POOMA v2.4.0 and Cheetah v1.1.4, an MPI-based library built on MPICH, with GNU C++ v3.1 on a gigabit ethernet cluster of 8 quad-processor (4 × 833 MHz) Compaq Alpha ES40s with Quadrics interconnects running Linux and several applications at any instance. Two sets of results are presented. The first shows the effect of the frequency of particle grouping and rebalancing on the data locality, imbalance and overhead. Parallel scalability is then assessed with and without the dynamic load balancing algorithm. The group threshold is set to be greater than the radius of influence of the particles. Clearly, this approach requires that the forces are short range, otherwise it becomes impossible to construct groups which preserve data locality. The number of particles is chosen to be significantly larger than the number of processors, see [13]. The parameter tol_cpu determines the sensitivity of the load balancing algorithm to load variation. A lower value will ensure more frequent calibration but incur more overhead. Analysis of CPU load histories could be used to determine this parameter. Additionally, m determines the maximum frequency at which the load balancing algorithm is performed. It is set to a
Fig. 4. A 3D elastic particle collision model without (left) and with (right) runtime adaptive load balancing (simulation time in seconds versus number of processors, for 10, 20, 40, 60 and 80 thousand particles)
lower value when the particles are regrouped more frequently, load variations are high and the relative effect of load balancing overhead is small. Figure 3 shows the effect of m on the load imbalance and the corresponding overhead of group balancing for tolcpu = 0.15. The parameter, tolsim sets the sensitivity of the particle regrouping. The value chosen depends on the range of forces, the dynamics of the particles and the effect of remote dependencies on parallel scalability. Additionally, k sets the maximum frequency of particle regrouping and is determined by the particle dynamics. Figure 3 shows the effect of increasing k on the ratio of remotely dependent particles and the corresponding overhead of regrouping for tolsim = 0.1. k and m should be compatible with each other; a low value of k results in a potentially high number of particle regrouping and there is thus more potential for a higher frequency of group imbalances. Parallel scalability tests were performed using 1000 time steps over a range of 10−80 thousand particles and 1 − 32 processors. The load balancing algorithm was performed every k = m = 100 time steps. The graphs in Figure 4 demonstrate that the load balancing is much more effective as the number of particles and processors is increased. However, with smaller numbers of particles, the opposite effect is observed because of the overhead of the algorithm. The greater the number of groups, the greater the granularity of the load balancing but the greater the overhead of rebalancing. In highly latent environments, preliminary results suggest that it is preferable to allocate fewer groups by allocating all non-interacting particles into one group on each processor.
References 1. R.W. Hockney and J.W. Eastwood, Computer Simulation using Particles, Taylor and Francis Inc., 1988 2. G.A. Kohring, Dynamic Load Balancing for Parallelized Particle Simulations on MIMD Computers, Parallel Computing, Volume 21, Issue 4,pp. 683–694, 1995. 3. J. Watts and S. Taylor, A Practical Approach to Dynamic Load Balancing, IEEE Transactions on Parallel and Distributed Systems, Volume 9, issue 3, 1998, pp. 235–248.
4. J.C. Cummings and W. Humphrey, Parallel Particle Simulations Using the POOMA Framework, Proceedings of the 8th SIAM Conference on Parallel Processing for Scientific Computing, M. Heath et al. eds., SIAM 1997. 5. J.D. Oldham, POOMA Manual, A C++ Toolkit for High-Performance Parallel Scientific Computation, http://www.codesourcery.com, 2002. 6. Z. Lan, V.E. Taylor and G. Bryan, A Novel Dynamic Load Balancing Scheme for Parallel Systems, J. Parallel Distrib. Computing, Volume 62, 2002, pp. 1763–1781. 7. F.R. Pearce and H.M.P. Couchman, Hydra: a parallel adaptive grid code, New Astronomy, Volume 2, 1997, pp. 411–427 8. C. Othmer and J. Schüle, Dynamic load balancing of plasma particle-in-cell simulations: The taskfarm alternative, Computer Physics Communications, Volume 147, pp. 741–744 9. M. A. Stijnman, R. H. Bisseling and G. T. Barkema, Partitioning 3D space for parallel many-particle simulations, Computer Physics Communications, Volume 149, Issue 3, January 2003, pp. 121–134 10. T. Rauber, G. Rünger and C. Scholtes, Execution behavior analysis and performance prediction for a shared-memory implementation of an irregular particle simulation method, Simulation Practice and Theory, Volume 6, Issue 7, 15 November 1998, pp. 665–687 11. B. Hendrickson and K. Devine, Dynamic load balancing in computational mechanics, Computer Methods in Applied Mechanics and Engineering, Volume 184, Issues 2-4, 14 April 2000, pp. 485–500 12. L. Kalé et al., NAMD: Greater Scalability for Parallel Molecular Dynamics, Journal of Computational Physics, Volume 151, 1999, pp. 283–312 13. B. Di Martino et al., Parallel PIC plasma simulation through particle decomposition techniques, Parallel Computing, Volume 27, 2001, pp. 295–314
Evaluating Parallel Algorithms for Solving Sylvester-Type Matrix Equations: Direct Transformation-Based Versus Iterative Matrix-Sign-Function-Based Methods Robert Granat and Bo Kågström Department of Computing Science and HPC2N, Umeå University SE-901 87 Umeå, Sweden {granat,bokg}@cs.umu.se
Abstract. Recent ScaLAPACK-style implementations of the Bartels-Stewart method and the iterative matrix-sign-function-based method for solving continuous-time Sylvester matrix equations are evaluated with respect to generality of use, execution time and accuracy of computed results. The test problems include well-conditioned as well as ill-conditioned Sylvester equations. A method is considered more general if it can effectively solve a larger set of problems. Ill-conditioning is measured with respect to the separation of the two matrices in the Sylvester operator. Experiments carried out on two different distributed memory machines show that the parallel explicitly blocked Bartels-Stewart algorithm can solve more general problems and delivers far more accuracy for ill-conditioned problems. It is also up to four times faster for large enough problems on the most balanced parallel platform (IBM SP), while the parallel iterative algorithm is almost always the fastest of the two on the less balanced platform (HPC2N Linux Super Cluster). Keywords: Sylvester matrix equation, continuous-time, Bartels–Stewart method, explicit blocking, GEMM-based, level 3 BLAS, matrix sign function, c-stable matrices, Newton iteration, ScaLAPACK, PSLICOT.
1 Introduction We consider two different methods for solving the continuous-time Sylvester (SYCT) equation AX − XB = C, (1) where A of size m × m, B of size n × n and C of size m × n are arbitrary matrices with real entries. SYCT has a unique solution X of size m × n if and only if A and B have disjoint spectra, i.e., they have no eigenvalues in common, or equivalently the separation sep(A, B) ≠ 0, where sep(A, B) = inf_{||X||_F = 1} ||AX − XB||_F. The Sylvester equation appears naturally in several applications. Examples include block-diagonalization of a matrix in Schur form and condition estimation of eigenvalue problems (e.g., see [13,11,15]).
In this contribution, we experimentally compare parallel implementations of two methods for solving SYCT on distributed memory systems. The first is based on the Bartels-Stewart method [1,15,7,8] and is reviewed in Section 2. Several papers, starting with [16], have considered iterative methods for solving SYCT. The second parallel implementation is an iterative method based on a Newton iteration of the matrix sign function [3,4], and is reviewed in Section 3. In Section 4, we display and compare some computational results of implementations of the two parallel algorithms. A discussion of the measured execution times on two different parallel distributed memory platforms is presented in Section 5. Finally, in Section 6, we summarize our findings and outline ongoing and future work.
2 Explicitly Blocked Algorithms for Solving SYCT

The explicitly blocked method, implemented as the routine PGESYCTD, is based on the Bartels-Stewart method [1]:

1. Transform A and B to upper (quasi-)triangular Schur form TA and TB, respectively, using orthogonal similarity transformations: Q^T A Q = TA, P^T B P = TB.
2. Update the right hand side C of (1) with respect to the transformations done on A and B: C̃ = Q^T C P.
3. Solve the reduced (quasi-)triangular matrix equation: TA X̃ − X̃ TB = C̃.
4. Transform the solution X̃ back to the original coordinate system: X = Q X̃ P^T.

To carry out Step 1 we use a Hessenberg reduction directly followed by the QR-algorithm. The updates in Step 2 and the back-transformation in Step 4 are carried out using general matrix multiply and add (GEMM) operations C ← βC + α·op(A)op(B), where α and β are scalars and op(A) denotes A or its transpose A^T [2,12]. We now focus on Step 3. If the matrices A and B are in Schur form, we partition the matrices in SYCT using the blocking factors mb and nb, respectively. This implies that mb is the row-block size and nb is the column-block size of the matrices C and X (which overwrites C). By defining Da = m/mb and Db = n/nb, (1) can be rewritten as

Aii Xij − Xij Bjj = Cij − ( Σ_{k=i+1}^{Da} Aik Xkj − Σ_{k=1}^{j−1} Xik Bkj ),   (2)
where i = 1, 2, . . . , Da and j = 1, 2, . . . , Db . From (2), a serial blocked algorithm can be formulated, see Figure 1.
for j = 1, Db
  for i = Da, 1, -1
    {Solve the (i, j)th subsystem using a kernel solver}
    Aii Xij − Xij Bjj = Cij
    for k = 1, i − 1
      {Update block column j of C}
      Ckj = Ckj − Aki Xij
    end
    for k = j + 1, Db
      {Update block row i of C}
      Cik = Cik + Xij Bjk
    end
  end
end

Fig. 1. Block algorithm for solving AX − XB = C, A and B in upper real Schur form
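To make the serial block algorithm of Fig. 1 concrete, the following NumPy/SciPy sketch mirrors its loop structure. It is an illustration only, not the authors' Fortran/ScaLAPACK implementation: scipy.linalg.solve_sylvester stands in for the LAPACK kernel DTRSYL, and A and B are assumed upper triangular so that no 2 × 2 diagonal bump of a real Schur form is split by the chosen block boundaries.

import numpy as np
from scipy.linalg import solve_sylvester

def blocked_triangular_syct(A, B, C, mb, nb):
    """Solve A X - X B = C with A (m x m) and B (n x n) upper triangular,
    following the block algorithm of Fig. 1; X overwrites a copy of C."""
    m, n = C.shape
    X = C.copy()
    row_blocks = [slice(i, min(i + mb, m)) for i in range(0, m, mb)]
    col_blocks = [slice(j, min(j + nb, n)) for j in range(0, n, nb)]
    Da, Db = len(row_blocks), len(col_blocks)
    for j in range(Db):                       # j = 1, ..., Db
        for i in reversed(range(Da)):         # i = Da, ..., 1
            ri, cj = row_blocks[i], col_blocks[j]
            # kernel solve of Aii Xij - Xij Bjj = Cij (solve_sylvester solves AX + XB = Q)
            X[ri, cj] = solve_sylvester(A[ri, ri], -B[cj, cj], X[ri, cj])
            for k in range(i):                # update block column j of C
                rk = row_blocks[k]
                X[rk, cj] -= A[rk, ri] @ X[ri, cj]
            for k in range(j + 1, Db):        # update block row i of C
                ck = col_blocks[k]
                X[ri, ck] += X[ri, cj] @ B[cj, ck]
    return X

# toy check with well separated spectra (diagonal entries near +5 and -5)
rng = np.random.default_rng(0)
m, n = 60, 40
TA = np.triu(rng.standard_normal((m, m))) + 5.0 * np.eye(m)
TB = np.triu(rng.standard_normal((n, n))) - 5.0 * np.eye(n)
C = rng.standard_normal((m, n))
X = blocked_triangular_syct(TA, TB, C, mb=16, nb=16)
print(np.linalg.norm(TA @ X - X @ TB - C))    # small residual expected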
Assume that the matrices A, B and C are distributed using 2D block-cyclic mapping across a Pr × Pc processor grid. For Steps 1, 2 and 4, we use the ScaLAPACK library routines PDGEHRD, PDLAHQR and PDGEMM [6]. The first two routines are used in Step 1 to compute the Schur decompositions of A and B (reduction to upper Hessenberg form followed by the parallel QR algorithm [10,9]). PDGEMM is the parallel implementation of the level 3 BLAS GEMM operation and is used in Steps 2 and 4 for doing the two-sided matrix multiply updates. To carry out Step 3 in parallel, we traverse the matrix C/X along its block diagonals from South-West to North-East, starting in the South-West corner. To be able to compute Xij for different values of i and j, we need Aii and Bjj to be owned by the same process that owns Cij . We also need to have the blocks used in the updates of Cij in the right place at the right time. This means that in general we have to communicate for some blocks during the solves and updates. This is done “on demand”: whenever a processor misses any block that it needs for solving a node subsystem or doing a GEMM update, it is received from the owner in a single point-to-point communication [8]. A brief outline of a parallel algorithm PTRSYCTD is presented in Figure 2. The (quasi)triangular subsystems Aii Xij − Xij Bjj = Cij in Figure 2 are solved on the grid nodes using the LAPACK-solver DTRSYL [2]. The explicitly blocked method is general since (in theory) it can be used to solve any instance of SYCT as long as the spectra of A and B are disjoint. Notice that the algorithm contains an iterative part, the reduction of A and B in upper Hessenberg form to upper Schur form, which is the most expensive in terms of execution time [7,8]. For further details we refer to [7,8,15].
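Two ingredients of this parallel scheme can be sketched compactly (an illustration under our own naming, not code from PGESYCTD or PTRSYCTD): the 2D block-cyclic owner of a block on a Pr × Pc grid, and the order in which the block anti-diagonals of C/X are visited from the South-West corner towards the North-East corner.

def owner(i, j, Pr, Pc):
    # process coordinates holding block C_ij under 2D block-cyclic mapping
    return (i % Pr, j % Pc)

def block_diagonals(Da, Db):
    """Yield, per outer step k, the (i, j) subsystems (0-indexed) that lie on the
    current block anti-diagonal and can therefore be solved in parallel."""
    for k in range(Da + Db - 1):
        yield [(Da - 1 - k + j, j) for j in range(Db) if 0 <= Da - 1 - k + j < Da]

# example: a 3 x 2 block partition is swept in 4 anti-diagonal steps
for step, blocks in enumerate(block_diagonals(3, 2)):
    print(step, blocks)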
for k = 1, Da + Db − 1
  {Solve subsystems on current block diagonal in parallel}
  if (mynode holds Cij)
    if (mynode does not hold Aii and/or Bjj)
      Communicate for Aii and/or Bjj
    Solve for Xij in Aii Xij − Xij Bjj = Cij
    Broadcast Xij to processors that need Xij for updates
  elseif (mynode needs Xij)
    Receive Xij
  if (mynode does not hold block in A needed for updating block column j)
    Communicate for requested block in A
  Update block column j of C in parallel
  if (mynode does not hold block in B needed for updating block row i)
    Communicate for requested block in B
  Update block row i of C in parallel
  endif
end

Fig. 2. Parallel block algorithm for AX − XB = C, A and B in upper real Schur form

3 Solving SYCT Using Newton Iteration for the Matrix Sign Function

The sign function method can be used to solve SYCT if the spectra of A and −B are contained in the open left half complex plane, i.e., A and −B are so called Hurwitz- or c-stable [3].
Let Z be a real p × p matrix with real eigenvalues and let

Z = S [ J⁻  0 ; 0  J⁺ ] S^{-1}   (3)

denote its Jordan decomposition with J⁻ ∈ C^{k×k}, J⁺ ∈ C^{(p−k)×(p−k)} containing the Jordan blocks corresponding to the eigenvalues in the open left and right half planes, respectively. Then the matrix sign function of Z is defined as

sign(Z) = S [ −I_k  0 ; 0  I_{p−k} ] S^{-1}.   (4)

The sign function can be computed via the Newton iteration for the equation Z² = I where the starting point is chosen as Z, i.e.,

Z_0 = Z,   Z_{k+1} = (Z_k + Z_k^{-1})/2,   k = 0, 1, 2, . . .   (5)

It can be proved [16] that sign(Z) = lim_{k→∞} Z_k and moreover that

sign( [ A  −C ; 0  B ] ) + I_{m+n} = 2 [ 0  X ; 0  I ],   (6)

which means that under the given assumptions, SYCT can be solved by applying the Newton iteration (5) to

Z_0 = [ A  −C ; 0  B ].   (7)
This iterative process only requires basic numerical linear algebra operations such as matrix-matrix multiplication, inversion and/or solving linear systems of equations. The method has been parallelized and implemented as the PSLICOT [14,17] routine psb04md. The parallel implementation uses the ScaLAPACK routines PDGETRF (LU factorization), PDGETRS (solving linear systems of equations), PDGETRI (inversion based on LU factorization), PDTRSM (solution of triangular systems with multiple right-hand sides) and PDLAPIV (pivoting of a distributed matrix). We expect this iterative method to be fast and scalable since it consists of computationally well-known and highly parallel operations. The obvious drawback of the method is the lower degree of generality, i.e., the fact that we cannot solve all instances of SYCT due to the restrictions on the spectra of A and B. The matrix sign function method can also be applied to other related problems, e.g., the stable generalized Lyapunov equation A^T XE + E^T XA = C, where A, E, X, C ∈ R^{n×n} and C = C^T [3], and it can also be combined with iterative refinement for higher accuracy. However, this has not been incorporated in psb04md [4]. For further details we refer to [16,4,3,5].
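As a serial illustration of this approach (a sketch in NumPy under the stated assumptions, not the parallel routine psb04md), the Newton iteration (5) can be applied to Z_0 from (7) and X read off via relation (6):

import numpy as np

def syct_via_sign(A, B, C, tol=1e-12, maxit=100):
    """Solve A X - X B = C, assuming A and -B are c-stable."""
    m, n = C.shape
    Z = np.block([[A, -C], [np.zeros((n, m)), B]])       # Z0 of (7)
    for _ in range(maxit):
        Z_new = 0.5 * (Z + np.linalg.inv(Z))             # Newton step (5)
        if np.linalg.norm(Z_new - Z, 1) <= tol * np.linalg.norm(Z_new, 1):
            Z = Z_new
            break
        Z = Z_new
    # (sign(Z0) + I)/2 = [[0, X], [0, I]], cf. (6); return the (1,2) block
    return 0.5 * (Z + np.eye(m + n))[:m, m:]

# toy check: shift random matrices so that spec(A) lies in C^- and spec(B) in C^+
rng = np.random.default_rng(1)
m, n = 50, 30
A = rng.standard_normal((m, m)); A -= (np.abs(np.linalg.eigvals(A)).max() + 1) * np.eye(m)
B = rng.standard_normal((n, n)); B += (np.abs(np.linalg.eigvals(B)).max() + 1) * np.eye(n)
C = rng.standard_normal((m, n))
X = syct_via_sign(A, B, C)
print(np.linalg.norm(A @ X - X @ B - C))                 # small residual expected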
4 Computational Results

In this section, we present and compare measured speed and accuracy results of PGESYCTD and psb04md using two different parallel platforms. We solve a number of differently conditioned (regarding the separation of A and B) problems of various sizes using different processor mesh configurations. Our target machines are the IBM Scalable POWERparallel (SP) system and the Super Linux Cluster at High Performance Computing Center North (HPC2N), where we utilize up to 64 processors of each machine (see Table 4). The test problems are constructed as follows. Consider the matrix A in the form A = Q(αDA + βMA)Q^T, where DA is (quasi-)diagonal, MA is strictly upper triangular, Q is orthogonal and α and β are real scalars. We choose MA as a random matrix with uniformly distributed elements in [-1,1] and prescribe the eigenvalues of A by specifying the elements of DA. If the matrix B is constructed similarly, we can construct differently conditioned problems by varying the eigenvalues in DA and DB and choosing appropriate values of the scaling factors. For example, the factor β is used to control the distance from normality, βMA, of the matrix A. A representative selection of our results for problems with m = n are shown in Tables 1, 2, and 3. Results for problems with m ≠ n are not presented here but will not lead to any different conclusions. We have chosen close to optimal block sizes for the different parallel algorithms and parallel architectures. The upper parts of Tables 1 and 2 correspond to well-conditioned problems, and the lower parts represent moderately to very ill-conditioned problems. We display the performance ratios qT, qX and qR, which correspond to the execution time ratio and two accuracy ratios, the Frobenius norms of the absolute (forward) error ‖X − X̃‖_F and the absolute residual R_F = ‖C − AX̃ + X̃B‖_F of the two implementations, where X and X̃ denote the exact and computed solutions, respectively. If a ratio is larger than 1.0, psb04md shows better results, otherwise PGESYCTD is better.
Table 1. Speed and accuracy results on IBM SP system for the routines PGESYCTD and psb04md solving the general equation AX − XB = C for well- and ill-conditioned problems. All problems use the blocking factors mb = nb = 64. In the upper part of the table, A and −B have the eigenvalues λ_i = −i, i = 1, 2, . . . , m = n, and α = β = 1.0. For the lower part A and −B have the eigenvalues: a) λ_{A,i} = −i, λ_{B,i} = −1000.0 · i^{−1}, α = β = 1.0, b) λ_{A,i} = −i, λ_{B,i} = −2000.0 · i^{−1}, α = β = 1.0, c) A and −B have the eigenvalues λ_i = −i, α_A = α_B = β_A = 1.0, β_B = 50.0.
[Table 1 lists, for each problem size m = n (512 up to 4096) and processor grid Pr × Pc, the sep^{-1} estimate and, for PGESYCTD and psb04md, the execution time Tp, the forward error ‖X − X̃‖_F and the residual R_F, together with the performance ratios qT, qX and qR.]
When the algorithm in psb04md converged, it did so in less than 20 iterations. We used an upper threshold of 100 iterations. To signal that psb04md does not converge, we use the notation dnc. Recall that SYCT has a unique solution if and only if A and B have disjoint spectra, or equivalently sep(A, B) ≠ 0, where

sep(A, B) = inf_{‖X‖_F = 1} ‖AX − XB‖_F = σ_min(Z_SYCT) = ‖Z_SYCT^{-1}‖_2^{-1}.   (8)
Table 2. Speed and accuracy results on Super Linux Cluster seth for the routines PGESYCTD and psb04md solving the general equation AX − XB = C for well- and ill-conditioned problems. All problems use the blocking factors mb = nb = 32. In the upper part of the table, A and −B have the eigenvalues λ_i = −i, i = 1, 2, . . . , m = n, and α = β = 1.0. For the lower part A and −B have the eigenvalues: a) λ_{A,i} = −i, λ_{B,i} = −1000.0 · i^{−1}, α = β = 1.0, b) λ_{A,i} = −i, λ_{B,i} = −2000.0 · i^{−1}, α = β = 1.0, c) A and −B have the eigenvalues λ_i = −i, α_A = α_B = β_A = 1.0, β_B = 50.0.
[Table 2 lists, for each problem size m = n (1024 up to 8192) and processor grid Pr × Pc, the sep^{-1} estimate and, for PGESYCTD and psb04md, the execution time Tp, the forward error ‖X − X̃‖_F and the residual R_F, together with the performance ratios qT, qX and qR.]
Here Z_SYCT is the mn × mn matrix representation of the Sylvester operator, defined by Z_SYCT = I_n ⊗ A − B^T ⊗ I_m. Moreover, sep(A, B) is a condition number for the SYCT equation, but it is expensive to compute in practice (O(m³n³) operations). However, we can compute a lower bound of sep(A, B)^{-1} = ‖Z_SYCT^{-1}‖_2 in parallel, which is based on the same technique as described in [13,11]. Since a tiny value of sep(A, B) signals ill-conditioning for SYCT, the sep^{-1}-estimate signals ill-conditioning when it gets large.
Table 3. Speed and accuracy results on Super Linux Cluster seth for the routines PGESYCTD and psb04md solving the triangular (Q_A = Q_B = I) equation AX − XB = C, A and B in real Schur form, for well-conditioned problems. All problems use the blocking factors mb = nb = 128. A and −B have the eigenvalues λ_i = −i, i = 1, 2, . . . , m = n, and α = β = 1.0.
[Table 3 lists, for each problem size m = n (1024 up to 8192) and processor grid Pr × Pc, the sep^{-1} estimate and, for PGESYCTD and psb04md, the execution time Tp, the forward error ‖X − X̃‖_F and the residual R_F, together with the performance ratios qT, qX and qR.]
We expect the explicitly blocked method to handle ill-conditioned problems better than the fully iterative, since it relies on orthogonal transformations and it has a direct (backward stable) solution method for the reduced triangular problem. To signal when there is not enough memory for both solving the problem and doing condition estimation, we use the notation mem.
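For small problem sizes, the separation in (8) can be computed explicitly from the Kronecker-product matrix of the Sylvester operator, which is a convenient way to verify such estimates. The sketch below is an illustration only; it does not reflect the parallel sep^{-1} estimator of [13,11] and is far too expensive for large m and n.

import numpy as np

def sep_sylvester(A, B):
    """sep(A, B) = sigma_min(Z_SYCT) for the operator X -> A X - X B, cf. (8)."""
    m, n = A.shape[0], B.shape[0]
    Z = np.kron(np.eye(n), A) - np.kron(B.T, np.eye(m))   # vec(A X - X B) = Z vec(X)
    return np.linalg.svd(Z, compute_uv=False)[-1]

# close eigenvalues of A and B give a small sep(A, B), i.e. an ill-conditioned SYCT
A = np.triu(np.ones((4, 4))) + np.diag([1.0, 2.0, 3.0, 4.0])   # eigenvalues 2, 3, 4, 5
B = np.diag([4.001, 6.0, 7.0])
print(sep_sylvester(A, B))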
5 Discussion of Computational Results

The experimental results from the last section reveal the following: For triangular problems (Q_A = Q_B = I; see Table 3, which displays results from the Linux Cluster), the parallel Bartels-Stewart based method is very much faster than the fully iterative one, since it always computes the solution directly using a fixed number of arithmetic operations. For general problems (Q_A ≠ I, Q_B ≠ I) the results differ depending on the target machine. For the IBM SP system (see Table 1), PGESYCTD is able to compete with the fully iterative method regarding both speed, accuracy and residual errors, for both well- and ill-conditioned problems. For example, PGESYCTD uses only 33% (well-conditioned case) and 25% (ill-conditioned case) of the execution time of psb04md for m = n = 4096, Pr = 4, Pc = 4.
Table 4. Hardware characteristics for Super Linux Cluster and IBM SP System. The parameter ta denotes the time for performing an arithmetic operation, ts denotes the experimentally measured startup time for message passing, tn denotes the time to transfer one byte over a single link in the network and tm denotes the peak time to transfer one byte through the memory of a node. The SP system has 3 times better flop/network bandwidth ratio and over 12 times better flop/memory bandwidth ratio than the Super Linux Cluster.

Hardware   Super Linux Cluster                           IBM SP System
CPU        120 x 2 Athlon MP2k+ 1.667 GHz nodes          64 thin P2SC 120 MHz nodes
Memory     1-4 Gb/node, peak 800 Gflops/sec.             128 Mb/node, peak 33.6 Gflops/sec.
Network    Wolfkit3 SCI, 3-dim. torus, peak 667 Mb/sec.  Multistage network, peak 150 Mb/sec.

Parameter  Super Linux Cluster   IBM SP System
ta         3.0 x 10^-10 sec.     2.1 x 10^-9 sec.
ts         3.7 x 10^-6 sec.      4.0 x 10^-5 sec.
tn         3.0 x 10^-9 sec.      6.7 x 10^-9 sec.
tm         9.6 x 10^-10 sec.     5.6 x 10^-10 sec.
ta/ts      8.1 x 10^-5           5.3 x 10^-5
ta/tn      0.10                  0.31
ta/tm      0.31                  3.8
For the ill-conditioned problems, the difference in accuracy on the IBM SP system is remarkable: as expected, the explicitly blocked method gives far more accuracy than the fully iterative, which sometimes did not even converge. For the Linux Cluster (see Table 2), psb04md is about two or three times faster than PGESYCTD for both types of problems. This difference in speed can be explained by the different characteristics of the machines, see Table 4. For uniprocessor runs psb04md is faster on both machines, but when we go in parallel the superior memory and network bandwidth of the SP system makes it possible for PGESYCTD to scale much better than psb04md. On the less balanced Super Linux Cluster the heavy communication in PGESYCTD becomes a bottleneck. For the ill-conditioned problems PGESYCTD gives the best accuracy, both regarding the forward error ‖X − X̃‖_F and the residual error R_F, by up to 6 and 12 orders of magnitude, respectively. For the well-conditioned problems, the routines in general have forward and residual errors of the same order, even if psb04md shows slightly better forward error (up to 63% for the largest problems on the Linux Cluster (m = n = 4096, 8192)), and PGESYCTD shows slightly better residual error (up to 35% for the largest problems on the Linux Cluster). We remark that the sep^{-1} columns in Tables 1–3 contain lower bounds on the exact values.
6 Summary and Future Work

We have presented a comparison of parallel ScaLAPACK-style implementations of two different methods for solving the continuous-time Sylvester equation (1), the Bartels-Stewart method and the iterative matrix-sign-function-based method. The comparison has dealt with generality, speed and accuracy. Experiments carried out on two different distributed memory machines show that the parallel explicitly blocked Bartels-Stewart algorithm can solve more general problems and delivers far more accuracy for ill-conditioned problems. A method that imposes more restrictions on the spectra of A and B in AX − XB = C is considered less general. Ill-conditioning is measured with respect to the separation of A and B, sep(A, B), as defined in equation (8). We remark that sep(A, B) can be much smaller than the minimum distance between the eigenvalues of A and B. This means that we can have
an ill-conditioned problem even if the spectra of A and B are well-separated, which for some examples could favour the iterative matrix-sign-function-based method. The Bartels-Stewart method is also up to four times faster for large enough problems on the most balanced parallel platform (IBM SP), while the parallel iterative algorithm is almost always the fastest of the two on the less balanced platform (HPC2N Linux Super Cluster). Ongoing work includes implementing general Bartels–Stewart solvers for related matrix equations, e.g., the continuous-time Lyapunov equation AX + XA^T = C, C = C^T, and the discrete-time Sylvester equation AXB^T − X = C. Our objective is to construct a software package SCASY of ScaLAPACK-style algorithms for the most common matrix equations, including generalized forms of the Sylvester/Lyapunov equations.
Acknowledgements This research was conducted using the resources of the High Performance Computing Center North (HPC2N). Financial support has been provided by the Swedish Research Council under grant VR 621-2001-3284 and by the Swedish Foundation for Strategic Research under grant A3 02:128.
References

1. R.H. Bartels and G.W. Stewart. Algorithm 432: Solution of the Equation AX + XB = C, Comm. ACM, 15(9):820–826, 1972.
2. E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. DuCroz, A. Greenbaum, S. Hammarling, A. McKenney, S. Ostrouchov and D. Sorensen. LAPACK Users' Guide. Third Edition. SIAM Publications, 1999.
3. P. Benner, E.S. Quintana-Orti. Solving Stable Generalized Lyapunov Equations with the Matrix Sign Function, Numerical Algorithms, 20(1), pp. 75–100, 1999.
4. P. Benner, E.S. Quintana-Orti, G. Quintana-Orti. Numerical Solution of Discrete Stable Linear Matrix Equations on Multicomputers, Parallel Alg. Appl., Vol. 17, No. 1, pp. 127–146, 2002.
5. P. Benner, E.S. Quintana-Orti, G. Quintana-Orti. Solving Stable Sylvester Equations via Rational Iterative Schemes, Preprint SFB393/04-08, TU Chemnitz, 2004.
6. S. Blackford, J. Choi, A. Cleary, E. D'Azevedo, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R.C. Whaley. ScaLAPACK Users' Guide. SIAM Publications, Philadelphia, 1997.
7. R. Granat. A Parallel ScaLAPACK-style Sylvester Solver, Master Thesis, UMNAD 435/03, Dept. Computing Science, Umeå University, Sweden, January 2003.
8. R. Granat, B. Kågström, P. Poromaa. Parallel ScaLAPACK-style Algorithms for Solving Continuous-Time Sylvester Equations. In H. Kosch et al. (eds.), Euro-Par 2003 Parallel Processing, Lecture Notes in Computer Science, Vol. 2790, pp. 800–809, 2003.
9. G. Henry and R. Van de Geijn. Parallelizing the QR Algorithm for the Unsymmetric Algebraic Eigenvalue Problem: Myths and Reality. SIAM J. Sci. Comput., 17:870–883, 1997.
10. G. Henry, D. Watkins, and J. Dongarra. A Parallel Implementation of the Nonsymmetric QR Algorithm for Distributed Memory Architectures. Technical Report CS-97-352 and LAPACK Working Note 121, University of Tennessee, 1997.
11. N.J. Higham. Perturbation Theory and Backward Error for AX − XB = C, BIT, 33:124–136, 1993.
12. B. Kågström, P. Ling, and C. Van Loan. GEMM-based level 3 BLAS: High-performance model implementations and performance evaluation benchmark. ACM Trans. Math. Software, 24(3):268–302, 1998.
13. B. Kågström and P. Poromaa. Distributed and shared memory block algorithms for the triangular Sylvester equation with sep^{-1} estimators, SIAM J. Matrix Anal. Appl., 13 (1992), pp. 99–101.
14. Niconet Task II: Model Reduction, website: www.win.tue.nl/niconet/NIC2/NICtask2.html
15. P. Poromaa. Parallel Algorithms for Triangular Sylvester Equations: Design, Scheduling and Scalability Issues. In Kågström et al. (eds.), Applied Parallel Computing. Large Scale Scientific and Industrial Problems, Lecture Notes in Computer Science, Vol. 1541, pp. 438–446, Springer-Verlag, 1998.
16. J.D. Roberts. Linear model reduction and solution of the algebraic Riccati equation by use of the sign function, Intern. J. Control, 32:677–687, 1980. (Reprint of Technical Report No. TR-13, CUED/B-Control, Cambridge University, Engineering Department, 1971).
17. SLICOT library in the Numerics in Control Network (NICONET), website: www.win.tue.nl/niconet/index.html
Performance Analysis for Parallel Adaptive FEM on SMP Clusters

Judith Hippold and Gudula Rünger*

Chemnitz University of Technology, Department of Computer Science, 09107 Chemnitz, Germany
{juh,ruenger}@informatik.tu-chemnitz.de

* Supported by DFG, SFB393 Numerical Simulation on Massively Parallel Computers
Abstract. The parallel implementation of 3-dimensional adaptive finite element methods (FEM) on recent clusters of SMPs has to deal with dynamic load increase due to adaptive refinement which might result in load imbalances. Additionally, the irregular behavior of adaptive FEM causes further impacts like varying cache effects or communication costs which make a redistribution strategy more complex. In this paper we investigate the interaction of these effects and the performance behavior of parallel adaptive FEM and present several performance results on recent SMP clusters.
1 Introduction Finite element (FE) methods are frequently used to solve problems in natural sciences and engineering modeled by partial differential equations (PDE). FE methods discretize the physical domain into a mesh of finite elements and approximate the unknown solution function by a set of shape functions on those elements. Adaptive mesh refinement (AMR) has been designed to reduce computational effort and memory needs. But, especially for simulations of real-life problems with 3-dimensional, hexahedral meshes parallel execution is still necessary to provide a satisfying solution in reasonable time. A parallel execution of adaptive FEM on distributed memory machines, however, usually exhibits load imbalances resulting from adaptive refinements concentrated on a few processors. Thus load redistribution is necessary but causes additional overhead and a tradeoff between redistribution costs and load balance has to be found in order to allow an efficient dynamic redistribution strategy. The use of a specific parallel platform further influences the parallel execution time. In this paper we consider clusters of SMPs. Clusters of SMPs (symmetric multiprocessors) gain more and more importance in high performance computing because they offer large computing power for a reasonable price. We have realized a parallel program for adaptive FEM [2] working on adaptively refined, hexahedral meshes and have investigated the execution on SMP cluster architectures. First runtime experiments with parallel adaptive hexahedral FEM on SMP clusters have shown a quite irregular behavior of parallel runtime and speedup values. This includes constant speedup behavior despite load increase due to refinement, superlinear speedups also for unbalanced load, or saturation of speedups for quite small numbers of processors. These effects vary for different application problems and are
further influenced when using SMP clusters and exploiting them differently for communication. Compared to single-node clusters the two-level architecture of SMP clusters offers different communication realizations, where SMP internal communication might be implemented more efficiently. Well-known redistribution strategies cannot capture this irregular behavior. The goal of this paper is to investigate the irregular effects of performance behavior caused by adaptive parallel FEM on SMP clusters and to analyze reasons for these irregular effects, which is a necessary prerequisite for dynamic redistribution. Possible reasons for irregular parallel execution times or speedups of adaptive methods are cache effects, non-uniform communication time on SMP clusters, and varying load increase due to mesh refinement. In this paper we present runtime experiments for the parallel adaptive FEM implementation on SMP clusters for different aspects of load increase, load imbalances, and architecture-specific properties. Especially we consider cache effects for basic matrix operations intensively used in the program. Also the communication behavior on SMP clusters is investigated. Results are presented for three different cluster platforms, which show a strong influence of the specific architecture. The runtime experiments exhibit quite diverse effects. The contribution of this paper is to gather important influential reasons for an irregular runtime behavior. To understand these reasons is important for a potential adaptive load balancing strategy on SMP clusters. The paper is structured as follows: Section 2 introduces the parallel adaptive FEM implementation. Section 3 illustrates irregular runtime and speedup effects on SMP clusters. Section 4 presents several specific runtime experiments and Section 5 concludes.
2 Parallel Adaptive FEM Implementation

The parallel adaptive FEM implementation [2] is based on a sequential version [1] and is used to solve systems of 2nd order partial differential equations, such as the Poisson equation (1) or the Lamé system of linear elasticity (2) (see [1]).

Lu := −∇ · (A(x)∇u) + cu = f in Ω ⊂ R³,  u = u_0 on ∂Ω_1,  n^t A(x)∇u = g on ∂Ω_2,  A(x) = diag(a_i)_{i=1}^3   (1)

−µ∆u − (λ + µ) grad div u = f,  u = (u^(1), u^(2), u^(3))^t,  u^(i) = u_0^(i) on ∂Ω_1^(i),  t^(i) = g^(i) on ∂Ω_2^(i),  i = 1, 2, 3   (2)
In the parallel version the finite elements are spread over the available processors and the global stiffness matrix is represented in local memories by triangular element stiffness matrices. Nodes of FEs shared between processors are duplicated and exist in the address space of each owning processor. The different owners of duplicates calculate only subtotals which have to be accumulated using communication operations to yield the totals. For the parallel implementation a specific numerical parallelization approach is adopted from [6,7,8] so that the processors mainly work on unaccumulated vectors. The design of the parallel adaptive FEM implementation reduces the number of data exchanges by separating computation phases from communication phases [3]
(see also Section 4.2). The following pseudo-code illustrates the coarse structure of the implementation:

create initial mesh;
while (convergence is not achieved) {
  *** Refinement phase (REF) ***
  while ((there are FEs labeled by the error estimator) OR
         (neighboring FEs have refinement level difference larger than 1))
    refine these FEs;
  *** Repartitioning phase ***
  calculate a new partitioning and redistribute;
  *** Assembling phase (ASS) ***
  foreach (new FE)
    assemble the element stiffness matrix;
  *** Solution phase (SOL) ***
  while (convergence is not achieved)
    solve the system of equations with the preconditioned conjugate gradient method;
  *** Error estimation phase (EST) ***
  estimate the error per FE;
}

The program executes the five phases within the outer while-loop until the maximum error drops below a predefined threshold value. The outer loop includes two iterative phases: adaptive mesh refinement (REF) and solving the system of equations (SOL). REF is performed until a valid mesh is created which means all FEs labeled for refinement are subdivided and the difference of the refinement level between all neighboring elements is less than two. SOL uses the iterative preconditioned conjugate gradient method to solve the system of equations [6].
3 Parallel Runtime Behavior This section investigates the parallel runtime behavior of the adaptive FEM implementation on SMP clusters. In order to capture the different effects, we perform several runtime experiments with varying load, varying load imbalances, and different machine utilization. As hardware platforms the three following machines are used: XEON:
a Xeon cluster with 16 nodes and two 2.0 GHz Intel Xeon processors per node (SCI, ScaMPI [4], 512 KB L2 cache),
SB1000:
a dual cluster of four Sun Blade 1000 with 750 MHz Ultra Sparc3 processors (SCI, ScaMPI [4], 8 MB L2 cache), and
JUMP:
an IBM Regatta system with 41 nodes and 32 1.7 GHz Power4+ processors per node (High Performance Switch, Parallel Environment, 1.5 MB L2 cache per 2 processors, 512 MB L3 cache per 32 processors).
Figure 1 shows speedups on XEON for ct01, solving the Lamé equation (Ω = (0, 2) × (0, 1) × (0, 4); ∂Ω_1^(i) = (0, 1) × (0, 1) × {0}, i = 1, 2, 3; ∂Ω_2^(3) = (0, 2) × (0, 1) × {4}; g^(3) = 1000 (see [1])), and layer3, a boundary layer for the convection-diffusion equation (−∆u + u = 1 in Ω = (0, 1)³; u = 0 for x, y, z = 0; ∂u/∂n = 0 for x, y, z = 1 (see [1])).
Fig. 1. Speedups for examples ct01 (left) and layer3 (right) on XEON. L denotes the different refinement levels measured. CLU and SMP distinguish the assignment of MPI processes to processors. The CLU version runs only one process per cluster node and the SMP version assigns MPI processes to processors of the same cluster node rather than to different nodes
The execution times are measured for different numbers
of adaptive refinement steps and program runs without repartitioning. We distinguish the CLU and SMP process assignment strategies. CLU assigns MPI processes only to different cluster nodes and SMP assigns MPI processes to processors of the same cluster node rather than to different nodes. For the example application ct01 the load concentrates on two processors. Thus increasing the number of processors does not have effects on the parallel runtime. With growing refinement level (L7, L8) the load for the two mainly employed processors is increasing and slightly superlinear speedups can be observed. For these levels also a significant difference between the different process assignment strategies emerges: the CLU strategy achieves better results. For the example application layer3 the imbalance of load is not as heavy as for ct01. Even for small refinement levels (L4) good speedups can be measured up to four processors. Larger meshes (L6, L7) create strongly superlinear speedups which remain constant or drop only slightly when the number of processors is increasing and is larger than four although the load for the processors develops unevenly. The CLU process assignment strategy achieves in most cases higher speedups. These example applications illustrate the irregular behavior of parallel runtime and speedup values for adaptive FEM on SMP clusters.
4 Determining and Quantifying the Dependences

This section examines the impact of different hardware characteristics and algorithmic properties on the execution time of the adaptive FEM implementation in order to determine reasons for the runtime behavior illustrated in Section 3. We present measurements for a program solving the Lamé equation on the input meshes called M8, M64, and M512 which mainly differ in the number of initial volumes.
Fig. 2. Left: Average execution times for the operation axmebe on XEON for different finite elements (8, 20, 27 nodes per FE) and different refinement levels. Right: Average execution times for distinct solver iterations (first, second, third iteration) on JUMP (27 nodes per FE)
4.1 Cache Behavior

When the number of finite elements grows the number of different memory accesses is increasing and cache misses can influence performance disadvantageously. Thus superlinear speedups are possible for parallel program runs on reduced data sets. To investigate the cache behavior the sequential execution of operations working on large data structures is regarded on different refinement levels. These operations are axmebe, axpy, and scalar of phase SOL.

Axmebe: The operation axmebe performs the matrix-vector multiplication y = y + A ∗ u where A is the triangular element stiffness matrix and y and u are global vectors. During program execution the size of A is constant and is between 2.3 KB and 26 KB depending on the finite element type. The length of the global vectors y and u varies according to the current refinement level of the mesh and additionally depends on the FE type. The memory requirement for the vectors in the program run used for Figure 2 is about 6.3 KB at refinement level three and 70.7 KB at refinement level eight for FEs with 8 nodes. For elements with 27 nodes the requirement is 36.6 KB and 453.4 KB at the corresponding levels. The global vectors are accessed irregularly because the vector entries for the nodes of a finite element are not stored consecutively. This is due to the shared nodes between the different elements. Figure 2 (left) illustrates the average execution times for the operation axmebe running on finite elements with 8, 20, and 27 nodes on one processor of XEON. The execution times vary for finite elements creating larger vectors and matrices. Although the average execution time for one vector entry over the entire number of iterations necessary for phase SOL is calculated, drops in the average execution time emerge for small numbers of solver iterations. This behavior can be observed for all platforms investigated. Thus we have additionally measured distinct solver iterations to reduce the effects of, e.g., context switches: On the right of Figure 2 the average execution times for the first, second, and third solver iteration for a program run on JUMP with 27 nodes per FE are shown.

Scalar: The operation scalar calculates the scalar product prod = x ∗ y of the global vectors x and y. The average execution times to calculate one vector entry on SB1000
are shown on the left of Figure 3 for different refinement levels. Scalar is performed two times during one solver iteration which is denoted by scalar 1 and scalar 2. The execution behavior of both scalar calls is similar but is influenced by the previously performed operation, especially for large vectors: If an input vector of the operation scalar has already been used as a parameter in the previous operation, scalar is faster. Analogous to the operation axmebe the number of iterations of phase SOL has influence on the average execution times.

Axpy: The average execution times to compute one vector entry for the operation axpy on XEON are shown on the right of Figure 3. Axpy calculates y = alpha ∗ x + y where y and x are global vectors and alpha is a scalar. As scalar, axpy is performed several times during one iteration of SOL and the average execution times of the different calls depend on the data used by the previous operation and the number of solver iterations.

Fig. 3. Left: Average execution times for the operation scalar on different refinement levels (SB1000, 8 nodes per FE). Right: Average execution times for the operation axpy on different refinement levels (XEON, 27 nodes per FE)
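The access pattern behind these three operations can be sketched as follows; this is a hedged Python illustration with our own function and array names, not the actual Fortran data structures of the FEM package.

import numpy as np

def axmebe(y, u, elem_matrices, elem_nodes):
    """y = y + A*u assembled element-by-element: gather u at the element's global
    node indices, multiply by the element stiffness matrix, scatter-add into y.
    The indirect, non-contiguous indexing is the cache-unfriendly part."""
    for A_e, nodes in zip(elem_matrices, elem_nodes):
        y[nodes] += A_e @ u[nodes]
    return y

def scalar(x, y):
    return float(x @ y)        # prod = x*y, a streaming reduction over long global vectors

def axpy(alpha, x, y):
    y += alpha * x             # y = alpha*x + y, likewise limited by memory bandwidth
    return y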
            SB1000                  XEON                    JUMP
Level   axmebe scalar axpy    axmebe scalar axpy    axmebe scalar axpy
  1      30.05   0.36  0.63    30.71   1.18  1.90    76.25   6.60  9.09
  3      78.49   0.79  1.23    34.11   0.78  0.47    77.27   2.43  2.83
  5      80.61   0.69  1.00    37.73   0.30  0.52    80.18   1.16  1.48
  7      81.33   0.67  1.00    37.84   0.42  0.27    74.51   0.87  1.28

Fig. 4. Percentage of time for the operations axmebe, scalar, and axpy on the total, sequential time of phase SOL using finite elements with 8 nodes
The three operations investigated show an increase in their average execution times for large finite element meshes and high numbers of FE nodes. Especially the operation axmebe constitutes a high proportion of the total solver runtime (see Table 4), so increases in execution time may have influence on the total runtime of the program.
Fig. 5. Comparison of the process assignment strategies for phase SOL on the input meshes M8 and M64 on two processors of XEON, JUMP, and SB1000. Positive scale: efficiency gain of the SMP version (two MPI processes per cluster node); negative scale: efficiency gain of the CLU version (one MPI process per cluster node)
However, the increase on high refinement levels is small or the values even decrease on some platforms as illustrated in Table 4. Thus the impact is not strong enough to cause the heavy superlinear speedups shown in Section 3.

4.2 Communication Overhead

We consider the SMP and CLU process assignment strategies. If one MPI process is assigned to one processor, MPI processes situated at the same cluster node may take benefit from the internal SMP communication which is more efficient than network communication in many cases [5]. However, as shown in Section 3 the CLU assignment strategy can achieve better results than the SMP strategy, although the load is increasing. This subsection investigates the reasons by parallel program runs on two processors of different platforms. We use a maximum parallelism of two because there are only two CPUs per SMP on XEON and SB1000. Thus the independent consideration of both strategies is possible. The parallel adaptive FEM implementation uses a special communication scheme. This scheme gathers the data to exchange and performs a global exchange (GE) after all local calculations are done. For the different algorithmic phases we have the following numbers of global exchanges:

Adaptive refinement (REF):                  refinement iterations * (2 GE)
Assemble element stiffness matrices (ASS):  (1 GE)
Solve the system of equations (SOL):        solver iterations * (6 GE) + (2 GE)
Error estimation (EST):                     (2 GE)
The algorithmic design decouples the number of FEs and the number of necessary communication operations. Thus fast internal communication does not necessarily dominate network communication when the load is increasing. Figure 5 compares the two strategies for the most communication-intensive phase SOL. On all platforms the SMP version is more efficient for low refinement levels. With increasing data (increased refinement level or increased size of the initial mesh) the benefit of the SMP strategy is lowered or the CLU version even dominates.
Fig. 6. Comparison of the process assignment strategies for assembling the element stiffness matrices on the input meshes M8 and M64 on two processors of SB1000 and XEON. Positive scale: efficiency gain of the SMP version (two MPI processes per cluster node), negative scale: efficiency gain of the CLU version (one MPI process per cluster node)
Possible reasons are process scheduling or shared hardware components which cause additional overhead for the SMP version. Figure 6 illustrates the impact of the process assignment strategy for the phase assembling the element stiffness matrices. During this phase only one global exchange is performed. Compared to Figure 5 the efficiency for the SMP version on SB1000 is lower. On XEON only for small numbers of finite elements the SMP version achieves better runtimes than the CLU assignment strategy. Due to the varying numbers of new finite elements on the different refinement levels there are strong fluctuations in efficiency. The total execution time comprises runtimes of algorithmic phases with high and low communication needs and gathers the values of different refinement levels. So the benefits and drawbacks of the different strategies are combined. On XEON and SB1000 the differences in total runtime caused by the different process assignment strategies are approximately two percent for high refinement levels (levels 8 and 9), where on XEON the CLU version and on SB1000 the SMP version is slightly faster. On JUMP the SMP process assignment strategy is about 10 percent faster after 9 refinement steps for the investigated example. Thus a platform-adapted assignment of MPI processes to processors can improve performance.

4.3 Algorithmic Properties

Subsection 4.2 has investigated the dependence between load increase and process assignment strategy and Subsection 4.1 has examined the dependence between load increase and cache behavior. However, the runtime behavior shown in Section 3, especially the superlinear speedup, cannot be completely explained with these examinations. This subsection analyzes the influence of algorithmic properties. Figure 7 shows execution times for the different algorithmic phases per different refinement level. On the left measurements for a sequential program run with mesh M8 on JUMP and on the right measurements for a parallel run on two processors with mesh
M64 on SB1000 are shown. In general the diagrams show similar shapes for the different platforms. The most time consuming phases for parallel and sequential execution are the solution of the system of equations (SOL) and the adaptive mesh refinement (REF). The execution times increase with growing mesh size except for the phases ASS and SOL: For solving the system of equations this results from the different convergence behavior of the preconditioned conjugate gradient method and for assembling the element stiffness matrices this is caused by the unevenly growing number of new FEs and their element stiffness matrices which have to be created. Especially on XEON and JUMP the phase REF shows very high sequential execution times compared to the parallel version. These originate from operations on lists of data structures with non-linear complexity. The length of these lists shrinks with growing parallelism. Thus superlinear speedups are possible if the load imbalances are not too strong. On SB1000 this effect has not been observed for the investigated example.

Fig. 7. Execution times per refinement level for the algorithmic phases REF, ASS, SOL, EST (left: sequential execution time on JUMP with mesh M8; right: parallel execution time on two processors of SB1000 with mesh M64)
5 Conclusion and Future Work In this paper we have presented several runtime experiments to analyze the irregular behavior of parallel execution and speedups of adaptive 3-dimensional FEM on recent SMP clusters. The investigations have shown that superlinear speedup behavior can occur for unbalanced load or saturation of speedup can occur for balanced load. Existing dynamic redistributions based on the mesh structure and load imbalances cannot capture these effects and are too expensive since a cost intensive redistribution might be applied although the efficiency of the program is already good. The reasons for this behavior are a conglomeration of only slight cache effects, of varying communication efficiency on SMP nodes, and efficiency changes in the algorithm, like a faster search in list structures. Thus the irregular effects cannot be predicted and for an efficient use of redistribution a decision based on explicit performance data seems reasonable. The advantage is that redistribution is only applied when necessary, and thus, redistributing is cheaper since
the dynamic performance analysis is less expensive than redistribution operations. Our analysis also suggests exploiting SMP cluster nodes in a varying mode and adapting the number of processors. Future work will realize the redistribution strategy based on a decision component that guides the redistribution according to the performance analysis.
References

1. S. Beuchler and A. Meyer. SPC-PM3AdH v1.0, Programmer's Manual. Technical Report SFB393/01-08, Chemnitz University of Technology, 2001.
2. J. Hippold, A. Meyer, and G. Rünger. An Adaptive, 3-Dimensional, Hexahedral Finite Element Implementation for Distributed Memory. In J. J. Dongarra, M. Bubak, G. D. van Albada (eds.), Proc. of Int. Conf. on Computational Science (ICCS04), LNCS 3037, pages 149–157. Springer, Kraków, Poland, 2004.
3. J. Hippold and G. Rünger. A Data Management and Communication Layer for Adaptive, Hexahedral FEM. In M. Danelutto, D. Laforenza, M. Vanneschi (eds.), Proc. of the 10th Int. Euro-Par Conf., LNCS 3149, pages 718–725. Springer, Pisa, Italy, 2004.
4. http://www.scali.com
5. L. P. Huse, K. Omanga, H. Bugge, H. Ry, A. T. Haugsdal, and E. Rustad. ScaMPI - Design and Implementation.
6. A. Meyer. A parallel preconditioned conjugate gradient method using domain decomposition and inexact solvers on each subdomain. Computing, 45:217–234, 1990.
7. A. Meyer. Parallel Large Scale Finite Element Computations. In G. Cooperman, G. Michler, and H. Vinck (eds.), LNCIS 226, pages 91–100. Springer, 1997.
8. A. Meyer and D. Michael. A modern approach to the solution of problems of classic elasto-plasticity on parallel computers. Numerical Linear Algebra and Application, 4(3):205–221, 1997.
Performance Tuning of Matrix Triple Products Based on Matrix Structure

Eun-Jin Im 1, Ismail Bustany 2, Cleve Ashcraft 3, James W. Demmel 4, and Katherine A. Yelick 4

1 Kookmin University, Seoul, Korea
  [email protected]
2 Barcelona Design Inc., USA
  [email protected]
3 Livermore Software Technology Corporation, USA
  [email protected]
4 U.C. Berkeley, USA
  {demmel,yelick}@eecs.berkeley.edu
Abstract. Sparse matrix computations arise in many scientific and engineering applications, but their performance is limited by the growing gap between processor and memory speed. In this paper, we present a case study of an important sparse matrix triple product problem that commonly arises in primal-dual optimization methods. Instead of a generic two-phase algorithm, we devise and implement a single pass algorithm that exploits the block diagonal structure of the matrix. Our algorithm uses fewer floating point operations and roughly half the memory of the two-phase algorithm. The speed-up of the one-phase scheme over the two-phase scheme is 2.04 on a 900 MHz Intel Itanium-2, 1.63 on a 1 GHz Power-4, and 1.99 on a 900 MHz Sun Ultra-3.
1 Introduction
This paper contains a case study of an important matrix triple product problem that commonly arises in primal-dual optimization methods [1]. The method is used in many practical applications such as circuit design, optimal control, portfolio optimization, etc., and the matrix triple product is often a bottleneck in solving the large-scale symmetric definite sparse matrix problems associated with primal-dual methods. The sparse matrix patterns are specific to the particular primal-dual problem class (e.g. linear, quadratic, etc.) and to the application domain. More formally, we address the following matrix problem: P = A × H × A^T, where A is a sparse matrix and H is a block diagonal matrix with each diagonal block of H, H_i, having a rank-1 structure, i.e. H_i = D_i + r_i r_i^T for a diagonal matrix D_i and a column vector r_i. The matrix triple product can be treated as two generic sparse matrix products by storing both A and its transpose and performing a full sweep through both instances. However, this generic approach places unnecessary demand on the memory systems and ignores the specialized structure of H. Instead, we introduce a single pass algorithm that exploits the block diagonal structure of the matrix H. While the one-phase scheme has
an advantage over the two-phase scheme through exploiting the structure of the matrix, the summation of sparse matrices becomes a bottleneck. Therefore, we propose a row-based one-phase scheme, where the summation of sparse matrices is replaced by the summation of sparse vectors, which can be computed efficiently using a sparse accumulator. We further improve the performance of the row-based one-phase scheme through the use of additional helper data structures.
2 Two Implementation Schemes: One-Phase vs. Two-Phase

2.1 Two-Phase Scheme
In this generic implementation scheme, the matrix triple product is treated as two generic sparse matrix products, that is, Q = A × H, followed by P = Q × A^T. In computing the product of two sparse matrices, the non-zero elements are stored in a column-compressed format. The multiplication is then conducted in an outer and an inner loop. The outer loop marches through the columns of the right matrix. The inner loop sweeps the columns of the left matrix and accumulates the products of their non-zero elements with the non-zero elements in the corresponding column of the right matrix. The summation of sparse vectors that forms a column of the product matrix can be performed efficiently using a sparse accumulator [2]. This generic approach places unnecessary demand on the memory systems since the intermediate product matrix H × A^T is explicitly computed and stored. Furthermore, it ignores the specialized structure inherent in H.

2.2 One-Phase Scheme
In this scheme, the product matrix is computed in one pass. First, the triple product is decomposed according to the blocks of matrix H as follows:

  P = \sum_{i}^{\#\,\text{blocks in}\,H} P_i = \sum_{i} A_i H_i A_i^T    (2.1)
where H_i is the i-th block of H and A_i is the corresponding column block of A. In equation (2.1), the summation of the sparse matrices with different non-zero distributions is not trivial and can be time consuming. To enhance the efficiency, we propose a row-based one-phase approach where the summation of sparse matrices is replaced by a sparse vector accumulator that composes each row (or column) of the product matrix P in order. By exploiting the structure of H, equation (2.1) may be computed as follows:

  P = \sum_{i}^{\#\,\text{blocks in}\,H} \left( A_i D_i A_i^T + (A_i r_i)(A_i r_i)^T \right)    (2.2)
From equation (2.2), the k-th row (or equivalently, column) of P, P_{k*}, is computed by the following equation:

  P_{k*} = \sum_{j} \left( A_{*j}\, d_j\, A_{*j}^T \right)_{k*} + \sum_{i} \left( B_{*i} B_{*i}^T \right)_{k*}
         = \sum_{j}^{\#\,\text{columns in}\,A} a_{kj}\, d_j\, A_{*j}^T + \sum_{i}^{\#\,\text{non-unit blocks in}\,H} b_{ki}\, B_{*i}^T
         = \sum_{j:\, a_{kj} \neq 0} a_{kj}\, d_j\, A_{*j}^T + \sum_{i:\, b_{ki} \neq 0} b_{ki}\, B_{*i}^T    (2.3)
where X_{i*} is the i-th row of matrix X, X_{*i} is the i-th column of matrix X, and B_{*i} = A_i r_i. Furthermore, the computation of equation (2.3) is accelerated using additional compressed sparse row index structures of matrices A and B, which are stored in a compressed sparse column format. It is sufficient to construct a transpose of matrix A without its values since only the index structure is needed for this computation. Finally, since the product matrix P is symmetric, we only compute the upper triangular part of the matrix by computing a_{kj} d_j A_{k:m,j}^T and b_{ki} B_{k:m,j}^T, instead of a_{kj} d_j A_{*j}^T and b_{ki} B_{*j}^T, where X_{k:m,j} denotes the k-th through m-th elements of the j-th column of matrix X. Accesses to elements A_{kj} are expedited by keeping an array of indices pointing to the next non-zero element in each column of A.
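To make the row-based computation in (2.3) concrete, the following is a minimal C++ sketch (not the authors' code) of forming one row P_{k*} with a sparse accumulator. The container and field names are hypothetical, the matrix B = [A_1 r_1, ..., A_m r_m] is assumed to be precomputed, and, for clarity, the row-wise copies carry values here, whereas the paper only needs the row-wise index structure.

#include <vector>

struct Csc { int nrows = 0, ncols = 0; std::vector<int> colptr, rowind; std::vector<double> val; };
struct Csr { int nrows = 0, ncols = 0; std::vector<int> rowptr, colind; std::vector<double> val; };

// Form the k-th row of P = A*D*A^T + B*B^T (upper triangular part only), cf. equation (2.3).
// 'spa' is a dense sparse accumulator of length nrows(A); 'mark'/'pattern' track its nonzeros.
void triple_product_row(int k,
                        const Csr& Arow, const Csc& Acol, const std::vector<double>& d,
                        const Csr& Brow, const Csc& Bcol,
                        std::vector<double>& spa, std::vector<char>& mark,
                        std::vector<int>& pattern)
{
  pattern.clear();
  auto scatter = [&](double alpha, const Csc& M, int j) {
    for (int p = M.colptr[j]; p < M.colptr[j + 1]; ++p) {
      int i = M.rowind[p];
      if (i < k) continue;                  // keep only rows >= k (upper triangle of P)
      if (!mark[i]) { mark[i] = 1; spa[i] = 0.0; pattern.push_back(i); }
      spa[i] += alpha * M.val[p];
    }
  };
  for (int p = Arow.rowptr[k]; p < Arow.rowptr[k + 1]; ++p)   // sum over j with a_kj != 0
    scatter(Arow.val[p] * d[Arow.colind[p]], Acol, Arow.colind[p]);
  for (int p = Brow.rowptr[k]; p < Brow.rowptr[k + 1]; ++p)   // sum over i with b_ki != 0
    scatter(Brow.val[p], Bcol, Brow.colind[p]);
  // 'pattern' now lists the nonzero positions of P(k, k:m) and 'spa' holds their values.
  for (int i : pattern) mark[i] = 0;        // reset the marker array for the next row
}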
3 Performance and Modeling
We measure and compare the performance of these two implementations using the five matrix sets summarized in Figure 1. This data set is obtained from real circuit design problems with progressively increasing sizes. The matrices grow sparser as their size increases. In the two rightmost columns, the table compares the number of floating-point operations and the memory requirements (dynamically allocated data size) of both schemes. These results show that both measures are almost halved in the one-phase scheme.

  Set  # of rows in A  # of columns in A  # of NZs in A  # of NZs in H  scheme    # of flops  memory requirement
   1        8648            42750             361K           195K       1-phase      11M           11M
                                                                        2-phase      24M           24M
   2       14872            77406             667K           361K       1-phase      21M           20M
                                                                        2-phase      45M           41M
   3       21096           112150             977K           528K       1-phase      31M           29M
                                                                        2-phase      66M           60M
   4       39768           217030            1913K          1028K       1-phase      60M           57M
                                                                        2-phase     129M          118M
   5       41392           244501            1633K           963K       1-phase      31M           50M
                                                                        2-phase      66M          113M

Fig. 1. Matrix sets used in the performance measurement
Fig. 2. Achieved Mflop rate and speedup. The left graph shows the Mflop rate of the one-phase (1p) and two-phase (2p) schemes for the five matrix sets on an Itanium-2, a Power-4 and an Ultrasparc-3. The right graph shows the speedup of the one-phase scheme over the two-phase scheme for the same matrix sets on the 3 different platforms
The graphs in Figure 2 show the Mflop rates (left) and speedups (right) of the one-phase scheme over the two-phase scheme for this data set on a 900 MHz Itanium-2, a 1 GHz Power-4, and a 900 MHz Ultrasparc-3. The speedup on the Itanium-2 is almost 2, and overall the speedup is over 1.5. The Mflop rate is lower in the one-phase scheme in all cases. The reduction in the execution time for the one-phase scheme is attributed to the reduced number of floating-point operations and memory accesses.
Figure 3 shows the pre-processing time on these three platforms. The pre-processing time is shown relative to the multiplication time of the two-phase scheme for each matrix. In the one-phase scheme, the pre-processing time is spent on
– counting the number of non-zeros in B and P to determine the amount of memory allocation,
– computing the structure of matrix B, and
– constructing the row-major structure of A and B (a sketch of this step is given below).
In the two-phase scheme, it is spent on
– generating A^T, and
– counting the number of non-zeros in Q and P.
Figure 3 shows that there is an added overhead of 1.25-1.75 in the one-phase scheme relative to the two-phase scheme. Typically the triple matrix product is repeatedly computed 100-120 times, while the pre-processing time is spent only once. Hence, this added overhead is considered to be negligible.
We also modeled the execution of the one-phase and two-phase schemes. Figure 4 shows the measured and modeled execution time on the Itanium-2. The dominant factor in the one-phase scheme is due to accessing A's elements. The number of accesses is computed as follows:

  \sum_{i}^{\#\,\text{columns in}\,A} \sum_{k=1}^{nnz(A_{*i})} k \;+\; \sum_{i}^{\#\,\text{columns in}\,B} \sum_{k=1}^{nnz(B_{*i})} k    (3.4)

Fig. 3. Overhead of pre-processing. The overheads of the one-time pre-processing in the one-phase and two-phase schemes are shown relative to the multiplication time of the two-phase scheme for each matrix on three platforms
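The pre-processing step of constructing the row-major structure of A and B (and the transpose structure of A without values) is a standard counting pass over the column-compressed pattern. The sketch below is illustrative, with hypothetical container names, and is not the authors' implementation.

#include <vector>

// Given the pattern of an m-by-n matrix in compressed sparse column form
// (colptr of size n+1, rowind of size nnz), build the row-wise index
// structure (rowptr of size m+1, colind of size nnz) without the values.
void csc_pattern_to_csr(int m, int n,
                        const std::vector<int>& colptr,
                        const std::vector<int>& rowind,
                        std::vector<int>& rowptr,
                        std::vector<int>& colind)
{
  int nnz = colptr[n];
  rowptr.assign(m + 1, 0);
  colind.resize(nnz);
  for (int p = 0; p < nnz; ++p) rowptr[rowind[p] + 1]++;   // count entries per row
  for (int i = 0; i < m; ++i) rowptr[i + 1] += rowptr[i];  // prefix sums -> row start offsets
  std::vector<int> next(rowptr.begin(), rowptr.end() - 1); // next free slot in each row
  for (int j = 0; j < n; ++j)
    for (int p = colptr[j]; p < colptr[j + 1]; ++p)
      colind[next[rowind[p]]++] = j;                       // place column index j into row rowind[p]
}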
The dominant factor in the two-phase scheme results from the access of the elements of X in computing X × Y:

  \sum_{i}^{\#\,\text{columns in}\,X} nnz(X_{*i})\, nnz(Y_{i*})    (3.5)

This is computed for both products: Q = H × A^T and P = A × Q. The measured execution times fall between the computed execution times obtained with the modeled lower and upper bounds of the memory latency.
4 Conclusion
In our prior work [3,4,5] we have demonstrated the effectiveness of using data structure and algorithm transformations to significantly improve the performance of sparse matrix-vector multiplication.
Fig. 4. Modeled and Measured Execution Time. The execution time of the triple-product in the one-phase scheme is measured on a 900 MHz Itanium, and is shown with upper and lower bounds of computed execution time based on a memory latency model. The time is measured and modeled for the one-phase (1p) and two-phase (2p) schemes for the matrix data set in Figure 1
In this work we have examined an important sparse matrix kernel that commonly arises in solving primal-dual optimization problems. For the given data set, the speed-up of the one-phase scheme over the two-phase scheme is 2.04 on a 900 MHz Intel Itanium-2, 1.63 on a 1 GHz Power-4, and 1.99 on a 900 MHz Sun Ultra-3 platform. The sparse matrix triple product is a higher-level kernel than those hitherto considered for automatic tuning in sparse matrix operations. The use of algebraic transformations to exploit the problem structure represents a higher level of tuning for domain-specific optimizations that can significantly improve performance for a wide class of problems.
Acknowledgment We thank Richard Vuduc for his insightful discussion and for collecting measurement data on Power 4 and Ultra 3.
References 1. M.X. Goemans and D.P. Williamson. The primal-dual method for approximation algorithms and its application to network design problems. Approximation Algorithms for NP-hard Problems, pages 144–191, PWS Publishing Co., Boston, MA, 1996.
2. J.R. Gilbert, C. Moler and R. Schreiber. Sparse matrices in Matlab: Design and implementation. SIAM J. Matrix Analysis and Applications, 13:333–356, 1992.
3. E. Im and K.A. Yelick. Optimizing Sparse Matrix Computations for Register Reuse in SPARSITY. In Proceedings of the International Conference on Computational Science, volume 2073 of LNCS, pages 127–136, San Francisco, CA, May 2001, Springer.
4. E. Im, K.A. Yelick and R. Vuduc. SPARSITY: Framework for Optimizing Sparse Matrix-Vector Multiply. International Journal of High Performance Computing Applications, 18(1):135–158, February 2004.
5. R. Vuduc, A. Gyulassy, J.W. Demmel and K.A. Yelick. Memory Hierarchy Optimizations and Bounds for Sparse A^T Ax. In Proceedings of the ICCS Workshop on Parallel Linear Algebra, volume 2660 of LNCS, pages 705–714, Melbourne, Australia, June 2003, Springer.
Adapting Distributed Scientific Applications to Run-Time Network Conditions
Masha Sosonkina
Ames Laboratory and Iowa State University, Ames, IA 50010, USA
Abstract. High-performance applications place great demands on computation and communication resources of distributed computing platforms. If the availability of resources changes dynamically, the application performance may suffer, which is especially true for clusters. Thus, it is desirable to make an application aware of system run-time changes and to adapt it dynamically to the new conditions. We show how this may be done using a helper tool (middleware NICAN). In our experiments, NICAN implements a packet probing technique to detect contention on cluster nodes while a distributed iterative linear system solver from the pARMS package is executing. Adapting the solver to the discovered network conditions may result in faster iterative convergence.
1 Introduction
A typical high-performance scientific application places high demands on the computational power and communication subsystem of a distributed computing platform. It has been recorded [12] that cluster computing environments can successfully meet these demands and attain performance comparable to supercomputers. However, because of the possibly heterogeneous nature of clusters, such performance-enhancing techniques as load balancing or non-trivial processor mapping become of vital importance. In addition, the resources available to the application may vary dynamically at run time, creating imbalanced computation and communication. This imbalance introduces idle time on the "fastest" processors at the communication points after each computational phase. Taking into consideration an iterative pattern of computation/communication interchange, it could be beneficial for a distributed application to be aware of the dynamic system conditions present on the processors it is mapped to. Dynamic system conditions could include the load on the CPU, the amount of memory available, or the overhead associated with network communication. There have been many tools (e.g., [5,1]) created to help learn about the conditions present. One of the requirements for such tools is to provide an easy-to-use interface to scientific applications, which also translates and filters the low-level detailed system information into categories meaningful to applications. For a scientific application, in which the goal is to model and solve a physical problem rather than to investigate the computer system performance, such an interface is very important. It also constitutes a major reason why inserting "performance recording hooks" directly into the application's code may not be viable for a wide range of applications and for a multitude of computing system parameters. Thus the usage of a helper middleware is justified. When used at the application's run time, the middleware must be lightweight, in contrast to typical throughput benchmarks or operating system calls, which may heavily compete with the application for network or system resources.
In this paper (Section 3) we outline a network probing technique that, by means of sending carefully chosen trains of small packets, attempts to discover network contention without exhibiting the overhead inherent to network throughput benchmarks [9]. A lightweight technique may be used simultaneously with the computational stage of a scientific application to examine the communication subsystem without hindering the application performance. Section 2 presents a brief description of a framework, proposed earlier, that enables adaptive capabilities of scientific applications at run time. In Section 4, we consider a case study of determining dynamic network conditions while the parallel linear system solver pARMS is executing. A short summary is provided in Section 5.
(This work was supported in part by NSF under grants NSF/ACI-0000443, NSF/ACI-0305120, and NSF/INT-0003274, in part by the Minnesota Supercomputing Institute, and in part by the U.S. Department of Energy under Contract W-7405-ENG-82.)
2 Enabling Runtime Adaptivity of Applications Network Information Conveyer and Application Notification (NICAN) is a framework which enables adaptation functionality of distributed applications [10]. The main idea is to decouple the process of analyzing network information from the execution of the parallel application, while providing the application with critical network knowledge in a timely manner. This enables non-intrusive interaction with the application and low overhead of the communication middleware. NICAN and the application interact according to a register and notify paradigm: the application issues a request to NICAN specifying the parameters it is interested in, and NICAN informs the application of the critical changes in these parameters. The application adaptation may be triggered by NICAN when certain resource conditions are present in the system. When the distributed application starts executing, each process starts a unique copy of NICAN as a child thread. This implementation is particularly useful in a heterogeneous environment because there is no requirement that the NICAN processes be homogeneous or even be running on the same type of machine. The process of monitoring a requested network parameter is separated from the other functions, such as notification, and is encapsulated into a module that can be chosen depending on the network type, network software configuration, and the type of network information requested by the application. Figure 1 depicts NICAN’s functionality as the interaction of four major entities. NICAN reads the resource monitoring requests using an XML interface, which enables diverse specifications and types of the application requests. Another functionality of NICAN is to call the adaptation functions provided when appropriate. If a particular resource condition is never met, then the adaptation is never triggered and the application will proceed as if NICAN had never been started. To make NICAN versatile and to provide a wide variety of monitoring capabilities, it is driven by dynamically loaded modules.
Fig. 1. Interaction of NICAN's components (Application Process, NICAN Interface, Module Manager, Adaptation Handler, and Modules 1..n)

/* Include the NICAN header file */
#include
/* The handlers are declared in the global scope */
void HandlerOne(const char* data) {/* empty */};
void HandlerTwo(const char* data) {/* empty */};
/* The application's main function */
void main() {
  const char xmlFile[] = "/path/to/xmlFile";
  /* Start NICAN's monitoring */
  Nican_Initialize(xmlFile, 2, "HandlerOne", &HandlerOne,
                   "HandlerTwo", &HandlerTwo);
  /* Application code runs while NICAN operates */
  /* ... */
  /* Terminate NICAN's monitoring */
  Nican_Finalize();
}

Fig. 2. A trivial example of how to use NICAN
Figure 2 demonstrates how easy it is for the application to use NICAN. First the application must include a header file with the declarations for the NICAN interface functions. There are two adaptation handlers specified, HandlerOne and HandlerTwo, which for this trivial example are empty functions. The path to the XML file is specified and passed as the first parameter to Nican_Initialize. The application is informing NICAN of two adaptation handlers. Nican_Initialize will start the threads required for NICAN to operate and return immediately, allowing it to run simultaneously with the application. When the application is finished, or does not require the use of NICAN any longer, it calls the Nican_Finalize function, which returns after all the threads related to NICAN have been safely terminated.
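As a hypothetical variation on the listing above, the sketch below shows one way an application might use a non-empty handler: the handler merely records that NICAN reported a condition, and the compute loop checks the flag between iterations. Only Nican_Initialize, Nican_Finalize, and the handler signature are taken from Fig. 2; the flag, the loop, the XML path, and the assumption that the second argument counts the handler name/pointer pairs that follow are illustrative.

#include <atomic>
#include <cstdio>
/* include the NICAN header here, as in Fig. 2; it declares Nican_Initialize/Nican_Finalize */

static std::atomic<bool> network_contention(false);   // set by NICAN, read by the solver loop

void ContentionHandler(const char* data) {
  (void)data;                 // module-specific data is ignored in this sketch
  network_contention = true;  // just record that the condition occurred
}

int main() {
  const char xmlFile[] = "/path/to/xmlFile";           // monitoring request, as in Fig. 2
  Nican_Initialize(xmlFile, 1, "ContentionHandler", &ContentionHandler);
  for (int iter = 0; iter < 1000; ++iter) {
    /* ... one outer iteration of the application ... */
    if (network_contention.exchange(false)) {
      // e.g., switch to the application's own adaptive mechanism (Section 4)
      std::printf("adapting at iteration %d\n", iter);
    }
  }
  Nican_Finalize();
  return 0;
}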
3 Packet Probing to Examine Communication Overhead In a cluster, we can consider two types of network transmissions following the terminology given in [2]. One type is latency bound transmission, and the other is a bandwidth bound transmission. A latency bound transmission is one where the transmission time required is dependent only on a one-time cost for processing any message. By using very
small messages, on the order of a few hundred bytes, the cost of processing the message is reduced to a minimum. A bandwidth bound transmission is one where the time required is dependent on not just a one-time cost, but also the bandwidth available (see e.g., [9]). Typically bandwidth bound transmissions are comprised of large messages, which cause additional overhead while those messages are queued on network interfaces and processed. Latency bound transmissions have the attractive property that they do not cause a significant software overhead on the protocol stack of the machine processing the message. Thus, latency bound transmissions may be used as a means of communication subsystem performance analysis at application’s runtime. This can be done while a distributed application performs its computational phase so as to not interfere with the application’s communications. To better capture what conditions are present on a neighboring node, we use a train of packets rather than only two. This allows more time for the conditions to influence our train and is a trade-off between only two packets and flooding the network. We can use the notion of initial gap similar to the LogP model [3] to describe what happens to the probing packets. By introducing different types of load on the source and sink nodes we can affect the gaps recorded at the sink in a way that can be used to determine what type of load is present. The two metrics we will be using are the average gaps recorded at the sink and the standard deviation of those gaps. 3.1 Example of Probing: Fast Ethernet Networks The cluster for our experiments is a collection of nine dual-processor nodes connected via 100Mbps Ethernet links by means of a shared hub. We send a train of packets directly from the source to the sink and perform timing on the sink. We vary the size of the packets across trials, but keep the number of packets constant. For each packet size we sent 100 trains of 64 packets and computed the average arrival time and the standard deviation for the distribution of arrival times. Note that 100 trains would be too many to be used in actual probing, as it is too close to “flooding” the network, but was used in our derivation of the technique described in this section. The actual amount of probing will depend on the time warranted by the computational phase of the application. We have conducted experiments that indicate using a single train of 32 packets may be sufficient. To add an additional load on the source and sink, we generated 30Mbps UDP network transmissions and external 98% CPU and memory loads on a node. Although we have performed a series of experiments to detect possible (external load, competing flow) combinations, we present here only a few illustrative cases. For more details see [11]. In particular, Figure 3 shows the case when no adverse conditions are present on the nodes. As packets grow larger they introduce a larger load on the network stacks. This is clearly the case with Figure 3 depicting a distinct bend formed at the point where the probes are 400B in size. We can approximate this point using the rate of increase for the arrival gaps as the probes get larger. A different situation is shown in Figure 4 where there is no pronounced bend. The “strength” of this bend can be used to help classify what conditions may be present. 
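The two metrics above can be computed with very little code at the sink. The following is a minimal, self-contained C++ sketch (not the NICAN module itself) of receiving one train of UDP probe packets and reporting the average inter-arrival gap and its standard deviation; the port number and train length are illustrative choices.

#include <sys/socket.h>
#include <netinet/in.h>
#include <sys/time.h>
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
  const int kTrain = 32;                       // packets per probe train, as suggested in the text
  int s = socket(AF_INET, SOCK_DGRAM, 0);
  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_ANY);
  addr.sin_port = htons(9000);                 // illustrative probe port
  bind(s, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));

  char buf[2048];
  std::vector<double> arrival(kTrain);
  for (int i = 0; i < kTrain; ++i) {           // timestamp each probe packet on arrival
    recvfrom(s, buf, sizeof(buf), 0, nullptr, nullptr);
    timeval tv;
    gettimeofday(&tv, nullptr);
    arrival[i] = tv.tv_sec * 1e6 + tv.tv_usec; // microseconds
  }
  double mean = 0.0, var = 0.0;
  for (int i = 1; i < kTrain; ++i) mean += arrival[i] - arrival[i - 1];
  mean /= (kTrain - 1);
  for (int i = 1; i < kTrain; ++i) {
    double g = arrival[i] - arrival[i - 1] - mean;
    var += g * g;
  }
  std::printf("avg gap %.1f us, std dev %.1f us\n", mean, std::sqrt(var / (kTrain - 1)));
  return 0;
}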
The location of the bend can be determined during the first set of probes, or by executing the proposed algorithm on a given computing system a priori, before the actual application is run. Once this point is known we can narrow the scope of the probe sizes to reduce the impact and time required for the dynamic analysis. In each experiment the gap introduced on the source between the packets, at the user level, was constant with very little deviation. Therefore any deviation measured at the sink was due to an external influence on our probes. Depending on what type of influence is placed on the source and sink we observe different amounts of deviation, which we use as the second factor to classify the network conditions present. In particular, for a given network type, we have constructed a decision tree (see [11]); by tracing it one may detect seven possible cases of network or CPU related load on the source and sink nodes based on the recorded deviations and average gap measurements.

Fig. 3. No competing CPU load or network flows on either node (average arrival gap and standard deviation of the gaps vs. probe size)

Fig. 4. No CPU load and 30 Mbps flow leaving the source node (average arrival gap and standard deviation of the gaps vs. probe size)
4 Case Study: Runtime Changes in Communication Overhead of pARMS
pARMS is a parallel version of the Algebraic Recursive Multilevel Solver (ARMS) [8] to solve a general large-scale sparse linear system Ax = b, where A is a constant coefficient matrix, x is the vector of unknowns, and b is the right-hand side vector. To solve such a system iteratively, one preconditions the system of equations into a form that is easier to solve. A commonly used (parallel) preconditioning technique, due to its simplicity, is the Additive Schwarz procedure (see, e.g., [7]). In iteration i, i = 1, ..., m, given the current solution x_i, Additive Schwarz computes the residual error r_i = b − Ax_i. Once r_i is known, δ_i is found by solving Aδ_i = r_i. To obtain the next iterate x_{i+1}, we simply compute x_{i+1} = x_i + δ_i and repeat the process until |x_{i+1} − x_i| < ε, where ε is a user-defined quantity. The Additive Schwarz procedure was used for all the experiments discussed here. To solve linear systems on a cluster of computers it is common to partition the problem using a graph partitioner and assign a subdomain to each processor. Each processor then assembles only the local equations associated with the elements assigned to it. Our motivation for using packet probing is to find congested links in the underlying network of a cluster and to alert pARMS whenever its own adaptive mechanisms need to be invoked. To achieve this goal we have developed a module for NICAN that performs packet probing using the techniques described in Section 3. The design of MPI [4] allows pARMS to start a unique instance of NICAN in each task, each of which sends probes independently. Discovering network overhead on neighboring nodes has proven
useful [6] for the overall performance of pARMS. Thus, ideally, we wish to have each node learn the conditions of the entire system. We will demonstrate this NICAN module using a four-node pARMS computation, with each node probing a fifth node free of any network overhead. By using the various options provided by the module this situation is easily created using an XML file. The 4pack cluster with 32 dual Macintosh G4 nodes, located at the Scalable Computing Laboratory in Iowa State University, has been used for the experiments. Sixteen 4pack nodes have a single 400MHz CPU and the remaining have dual 700MHz CPUs. We used the faster nodes with Fast Ethernet interconnection for both our probes and MPI traffic because an implementation of NICAN packet probing module on Ethernet networks is already available. Figure 5 illustrates how the experiment is laid out. The squares represent the nodes used, with the label indicating the hostname of the machine. pARMS is mapped to node0,. . .,node3; the competing flows (called Iperf traffic) are entering node iperf dest; and the probes sent from the pARMS nodes, are sinking into probe sink.
Fig. 5. Cluster node interaction used for the experiments
Consider the elliptic partial differential equation (PDE)

  -\Delta u + 100\,\frac{\partial}{\partial x}\!\left(e^{xy} u\right) + 100\,\frac{\partial}{\partial y}\!\left(e^{-xy} u\right) - 1000\,u = f    (4.1)
solved on a two-dimensional rectangular (regular) grid with Dirichlet boundary conditions. It is discretized with a five-point centered finite-difference scheme on an n_x × n_y grid, excluding boundary points. The mesh is mapped to a virtual p_x × p_y grid of processors, such that a subrectangle of r_x = n_x/p_x points in the x direction and r_y = n_y/p_y points in the y direction is mapped to a processor. In the following experiments, the mesh size in each processor is kept constant at r_x = r_y = 40. In our experiments, four processors (p_x = p_y = 2) have been used, thus resulting in a problem of total size 6,400. This problem is solved by FGMRES(20) using Additive Schwarz pARMS preconditioning with one level of overlap and four inner iterations needed to solve the local subproblem with GMRES preconditioned with ILUT (see [7]).
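For reference, a centered five-point discretization of (4.1) at an interior grid point (i, j) reads as follows; this worked form is not given in the paper and assumes a uniform mesh spacing h in both directions:

\frac{4u_{i,j} - u_{i+1,j} - u_{i-1,j} - u_{i,j+1} - u_{i,j-1}}{h^2}
+ \frac{100}{2h}\left( e^{x_{i+1} y_j}\, u_{i+1,j} - e^{x_{i-1} y_j}\, u_{i-1,j} \right)
+ \frac{100}{2h}\left( e^{-x_i y_{j+1}}\, u_{i,j+1} - e^{-x_i y_{j-1}}\, u_{i,j-1} \right)
- 1000\, u_{i,j} = f_{i,j}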
Each MPI node will use NICAN's packet probing module to send probing packets to a fifth node known to be free of any network load. Then, at pARMS runtime, any network flow detected is indicative of a load on a node involved in the computation. The experimental procedure is outlined in Table 1. The "Phase" column represents the different phases of the experiment and the "Conditions present" column details what conditions were present in addition to the pARMS-related traffic. Table 1 also lists the average times required to complete the distributed sparse matrix-vector multiplication (called SpMxV) during each phase of this experiment (see columns node0, ..., node3). The impact of competing network traffic is evident in how the unaffected nodes spend more time completing SpMxV. The extra time is actually accrued while they wait for the node affected by the network flow to transmit the required interface variables. The affected node does not perceive as long a waiting time because when it finally requests the interface unknowns from a neighbor, that neighbor can immediately send them. When we introduce the two network flows at phase p2, both node0 and node2 experience less waiting time, but we can also see how much impact competing network traffic can exert on pARMS. The average waiting time for unaffected nodes in this case has nearly tripled compared with p0, increasing the overall solution time as a result. We only show the times for SpMxV because that is the specific source of increased execution time when additional network traffic is exerted upon a node in the computation. Once we are able to detect and monitor how much waiting time is incurred by each node, a suitable adaptation can be developed to help balance the computation and improve the performance in spite of competing network traffic [6]. The gist of the pARMS adaptation procedure is to increase the number of inner Additive Schwarz iterations performed locally in the fast processors, i.e., in those processors that incur idle time the most. For example, node1 and node3 are such processors in phase p2. The amount of increase is determined experimentally and may be adjusted on subsequent outer iterations, such that the pARMS execution is kept balanced. With more inner iterations, the accuracy of the local solution becomes higher and will eventually propagate to the computation of the overall solution in an outer (global) iteration, thus reducing the total number of outer iterations. Figure 6, taken from [6] for the measurements on an IBM SP, shows the validity of the suggested pARMS adaptation. In particular, with the increase of the number of inner iterations, the waiting time on all the processors becomes more balanced, while the total solution time and the number of outer iterations decrease. Figures 7 and 8 demonstrate how the gaps recorded on the nodes are used to determine the conditions present. In Figure 7 the bend discussed in Section 3 is clearly visible when the probes transition from 350B to 400B. Also, because we are using only 32 probing packets the deviation is larger than that observed in Figure 3. In Figure 8 we see the effect that the competing flow has on the probing packets. The distinct bend is no longer present, and by tracing through the decision tree corresponding to the given network type, we can determine that there is a competing flow leaving the source. We only illustrate the results for these two cases but similar plots can be constructed to show how the other conditions are detected.
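A minimal sketch of the adaptation decision described above follows; it is not the pARMS implementation. It assumes each process has measured its waiting (idle) time for the last outer iteration and shares it via MPI_Allgather, after which the relatively fast processes increase their local inner-iteration count; the threshold and increment are illustrative.

#include <mpi.h>
#include <algorithm>
#include <vector>

// Returns the number of inner Additive Schwarz iterations to use next,
// increasing it on processes that waited noticeably longer than average.
int adapt_inner_iterations(double my_wait_time, int current_inner, MPI_Comm comm) {
  int nprocs;
  MPI_Comm_size(comm, &nprocs);
  std::vector<double> wait(nprocs);
  MPI_Allgather(&my_wait_time, 1, MPI_DOUBLE, wait.data(), 1, MPI_DOUBLE, comm);
  double avg = 0.0;
  for (double w : wait) avg += w;
  avg /= nprocs;
  const double threshold = 1.5;   // illustrative: "well above average" idle time
  const int    max_inner = 16;    // illustrative cap on local work
  // Processes that idle much more than average are the "fast" ones: give them more local work.
  if (avg > 0.0 && my_wait_time > threshold * avg)
    return std::min(current_inner + 1, max_inner);
  return current_inner;
}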
Table 1. Phases of network conditions and average times (s) for SpMxV at each phase during pARMS execution

  Phase  Conditions present                      node0   node1   node2   node3
  p0     No adverse conditions present           .0313   .0330   .0331   .0398
  p1     40 Mbps flow from node0                 .0293   .0698   .0899   .0778
  p2     40 Mbps flows from node0 and node2      .0823   .1089   .0780   .1157
  p3     40 Mbps flow from node2                 .0668   .0847   .0291   .0794
  p4     No adverse conditions present           .0293   .0319   .0474   .0379
Fig. 6. Adaptation of pARMS on a regular-grid problem. The plot shows the waiting time per processor for the PDE problem with and without adaptation; without adaptation the solver required 5,000 outer iterations and 4,887.78 s of solution time, with adaptation 432 outer iterations and 398.67 s
Fig. 7. The probe gaps observed by node0 at phase p0 in Table 1 (average arrival gap and standard deviation of the gaps vs. probe size)

Fig. 8. The probe gaps observed by node0 at phase p1 in Table 1 (average arrival gap and standard deviation of the gaps vs. probe size)
5 Conclusions We have described a way to make distributed scientific applications network- and system-aware by interacting with an easy-to-use external tool rather than by obtaining and processing the low-level system information directly in the scientific code. This approach is rather general, suiting a variety of applications and computing platforms,
and causes no excessive overhead. A case study has been presented in which the NICAN middleware serves as an interface between the parallel Algebraic Recursive Multilevel Solver (pARMS) and the underlying network. We show how a lightweight packet probing technique is used by NICAN to dynamically detect network contention. In particular, NICAN is able to detect and classify the presence of competing flows in the nodes to which pARMS is mapped. Upon this discovery, pARMS is prompted to engage its own adaptive mechanisms, leading to better parallel performance.
References 1. D. Andersen, D. Bansal, D. Curtis, S. Seshan, and H. Balakrishnan. System support for bandwidth management and content adaptation in Internet applications. In Proceedings of 4th Symposium on Operating Systems Design and Implementation, pages 213–226, 2000. 2. C. Bell, D. Bonachea, Y. Cote, J. Duell, P. Hargrove, P. Husbands, C. Iancu, M. Welcome, and K. Yelick. An evaluation of current high-performance networks. In Proceedings of International Parallel and Distributed Processing Symposium (IPDPS’03), 2003. 3. D. Culler, R. Karp, D. Patterson, A. Sahay, K. Schauser, E. Santos, R. Subramonian, and T. von Eicken. LogP: Towards a realistic model of parallel computation. In Principles Practice of Parallel Programming, pages 1–12, 1993. 4. Message Passing Interface Forum. MPI: A message-passing interface standard. Technical Report Computer Science Department Technical Report CS-94-230, University of Tennessee, Knoxville, TN, May 5 1994. 5. J. Hollingsworth and P. Keleher. Prediction and adaptation in Active Harmony. Cluster Computing, 2(3):195–205, 1999. 6. D. Kulkarni and M. Sosonkina. A framework for integrating network information into distributed iterative solution of sparse linear systems. In Jos´e M. L. M. Palma, et al. editors, High Performance Computing for Computational Science - VECPAR 2002, 5th International Conference, Porto, Portugal, June 26-28, 2002, Selected Papers and Invited Talks, volume 2565 of Lecture Notes in Computer Science, pages 436–450. Springer, 2003. 7. Y. Saad. Iterative Methods for Sparse Linear Systems, 2nd edition. SIAM, Philadelpha, PA, 2003. 8. Y. Saad and B. Suchomel. ARMS: An algebraic recursive multilevel solver for general sparse linear systems. Technical Report Minnesota Supercomputing Institute Technical Report umsi99-107, University of Minnesota, 1999. 9. Q. Snell, A. Mikler, and J. Gustafson. NetPIPE: A network protocol independent performance evaluator. In IASTED International Conference on Intelligent Information Management and Systems, June 1996. 10. M. Sosonkina and G. Chen. Design of a tool for providing network information to distributed applications. In Parallel Computing Technologies PACT2001, volume 2127 of Lecture Notes in Computer Science, pages 350–358. Springer-Verlag, 2001. 11. S. Storie and M. Sosonkina. Packet probing as network load detection for scientific applications at run-time. In IPDPS 2004 proceedings, 2004. 10 pages. 12. Top 500 supercomputer sites. http://www.top500.org/.
Sparse Direct Linear Solvers: An Introduction
Organizer: Sivan Toledo
Tel-Aviv University, Israel
Introduction
The minisymposium on sparse direct solvers included 11 talks on the state of the art in this area. The talks covered a wide spectrum of research activities in this area. The papers in this part of the proceedings are expanded, revised, and corrected versions of some of the papers that appeared in the CD-ROM proceedings that were distributed at the conference. Not all the talks in the minisymposium have corresponding papers in these proceedings. This introduction explains the significance of the area itself. The introduction also briefly presents, from the personal and subjective viewpoint of the organizer, the significance and contribution of the talks. Sparse direct linear solvers solve linear systems of equations by factoring the sparse coefficient matrix into a product of permutation, triangular, and diagonal (or block diagonal) matrices. It is also possible to factor sparse matrices into products that include orthogonal matrices, such as sparse QR factorizations, but such factorizations were not discussed in the minisymposium. Sparse direct solvers lie at the heart of many software applications, such as finite-element analysis software, optimization software, and interactive computational engines like Matlab and Mathematica. For most classes of matrices, sparse direct linear solvers scale super-linearly. That is, the cost of solving a linear system with n unknowns grows faster than n. This has led many to search for alternative solvers with better scaling, mostly in the form of iterative solvers. For many classes of problems, there are now iterative solvers that scale better than direct solvers, and iterative solvers are now also widely deployed in software applications. But iterative solvers have not completely displaced direct solvers, at least not yet. For some classes of problems, fast and reliable iterative solvers are difficult to construct. In other cases, the size and structure of the linear systems that applications currently solve are such that direct solvers are simply faster. When applications must solve many linear systems with the same coefficient matrix, the amortized cost of the factorization is low. As a result of these factors, sparse direct solvers remain widely used, and research on them remains active. The talk that I think was the most important in the minisymposium was delivered by Jennifer Scott, and was based on joint work with Nick Gould and Yifan Hu. Jennifer's talk, on the evaluation of sparse direct solvers for symmetric linear systems, described a large-scale study in which she and her colleagues carefully compared several solvers. A comprehensive, objective, and detailed comparison of existing techniques (concrete software packages in this case) is an essential tool for both researchers working within the field and users of the technology. But such studies are difficult to carry out, and are
not as exciting to conduct as research on new techniques. I am, therefore, particularly grateful to Scott, Gould, and Hu for carrying out and publishing this important research. The significance of this research reaches beyond direct solvers, because iterative solvers can be included in similar studies in the future. Two talks discussed pre-orderings for sparse direct solvers. Pre-ordering the rows and/or columns of a sparse matrix prior to its factorization can reduce the fill in the factors and the work required to factor them. Pre-ordering also affects the parallelism and locality of reference in the factorization. Yngve Villanger presented algorithms that can improve a given symmetric ordering, developed jointly with Pinar Heggernes. Yifan Hu presented work that he has done with Jennifer Scott on column pre-ordering for partial-pivoting unsymmetric solvers, with emphasis on generating orderings that lead to effective parallel factorizations. Xiaoye Li talked about modeling the performance of the parallel symbolic factorization phase of SuperLU_DIST, a distributed-memory partial-pivoting unsymmetric solver. This work was conducted together with Laura Grigori. Two talks focused on solvers for symmetric but indefinite linear systems. Stefan Röllin described his work with Olaf Schenk on using maximum-weight matching to improve the performance of such solvers. Sivan Toledo gave a talk on an out-of-core implementation of a more conventional symmetric indefinite solver; this work was done with Omer Meshar. Jose Herrero presented surprising results on a sparse hypermatrix Cholesky factorization, work that he has done together with Juan Navarro. The hypermatrix representation is a data structure for storing and operating on sparse matrices. Herrero and Navarro showed that at least on some classes of matrices, a solver that uses a hypermatrix representation outperforms a more conventional sparse direct solver that is based on a supernodal partitioning of the matrix. The last three talks in the minisymposium form some sort of a group, in that all three gave a fairly high-level overview of one specific sparse direct solver. Anshul Gupta talked about WSMP, a solver that can now solve both symmetric and unsymmetric linear systems and that can exploit both shared-memory and distributed-memory parallelism (in the same run). Klaus Gaertner talked about PARDISO, which has been developed together with Olaf Schenk. In his talk, Klaus emphasized the need to exploit application-specific properties of matrices and gave examples from matrices arising in circuit simulations. Florin Dobrian talked about Oblio, an object-oriented sparse direct solver that he developed with Alex Pothen. Florin's talk focused mainly on the software-engineering aspects of sparse direct solvers.
Oblio: Design and Performance
Florin Dobrian and Alex Pothen
Department of Computer Science and Center for Computational Sciences, Old Dominion University, Norfolk, VA 23529, USA
{dobrian,pothen}@cs.odu.edu
Abstract. We discuss Oblio, our library for solving sparse symmetric linear systems of equations by direct methods. The code was implemented with two goals in mind: efficiency and flexibility. These were achieved through careful design, combining good algorithmic techniques with modern software engineering. Here we describe the major design issues and we illustrate the performance of the library.
1 Introduction We describe the design of Oblio, a library for solving sparse symmetric linear systems of equations by direct methods, and we illustrate its performance through results obtained on a set of test problems. Oblio was designed as a framework for the quick prototyping and testing of algorithmic ideas [3] (see also [1,5,10,11] for recent developments in sparse direct solvers). We needed a code that is easy to understand, maintain and modify, features that are not traditionally characteristic of scientific computing software. We also wanted to achieve good performance. As a consequence, we set two goals in our design: efficiency and flexibility. Oblio offers several options that allow the user to run various experiments. This way one can choose the combination of algorithms, data structures and strategies that yield the best performance for a given problem on a given computational platform. Such experimentation is facilitated by decoupled software components with clean interfaces. The flexibility determined by the software design is Oblio’s major strength compared to other similar packages. Out of the three major computational phases of a sparse direct solver, preprocessing, factorization and triangular solve, only the last two are currently implemented in Oblio. For preprocessing we rely on external packages.
(This research was partially supported by U.S. NSF grant ACI-0203722, U.S. DOE grant DE-FC02-01ER25476 and LLNL subcontract B542604.)

2 Design
Factorization Types. The first major choice in Oblio is the factorization type. Generally, the direct solution of a linear system Ax = b requires factoring a permutation of the
coefficient matrix A into a product of lower and upper triangular matrices (factors) L and U (L unit triangular). If A is symmetric then U can be further factored as DL^T, where D is generally block diagonal, and if A is positive definite as well then we can write A = LU = LDL^T = L̃L̃^T (D is purely diagonal in this case), where L̃ is lower triangular. The latter decomposition of A is the Cholesky factorization and L̃ is the Cholesky factor. In Oblio we offer three types of factorizations. The first one is the Cholesky factorization L̃L̃^T, which does not require pivoting and therefore has a static nature. The other two are more general LDL^T factorizations. One of them is static, the other one is dynamic. The static LDL^T factorization does not usually allow row/column swaps (dynamic pivoting). Instead, it relies on small perturbations of the numerical entries (static pivoting) [7]. Although this changes the problem, it is possible to recover solution accuracy through iterative refinement. The static LDL^T factorization can, however, be enhanced by allowing dynamic pivoting as long as this does not require modifications of the data structures. In Oblio we provide the framework for both approaches. Numerical preprocessing is generally required before a static LDL^T factorization in order to reduce the number of perturbations as well as the number of steps of iterative refinement [4]. Since the numerical preprocessing of sparse symmetric problems is currently an area of active research, this type of preprocessing is not yet present in Oblio. The dynamic LDL^T factorization performs dynamic pivoting using 1×1 and 2×2 pivots and can modify the data structures. Two pivot search strategies are currently available, one biased toward 1×1 pivots and one biased toward 2×2 pivots [2]. Note that the dynamic LDL^T factorization can also benefit from numerical preprocessing: a better pivot order at the beginning of the factorization can reduce the number of row/column swaps. For positive definite matrices the best choice is the L̃L̃^T factorization. For matrices that are numerically more difficult one must choose between the two LDL^T factorizations. The dynamic LDL^T factorization is more expensive but also more accurate. It should be used for those matrices that are numerically the most difficult. For less difficult matrices the static LDL^T factorization might be a better choice.
Factorization Algorithms. The second major choice in Oblio is the factorization algorithm. We offer three major column-based supernodal algorithms: left-looking, right-looking and multifrontal. These three factorization algorithms perform the same basic operations. The differences come from the way these operations are scheduled. In addition to coefficient matrix and factor data these algorithms also manage temporary data. For left-looking and right-looking factorization this is required only for better data reuse, but for multifrontal factorization temporary data are also imposed by the way the operations are scheduled. The multifrontal algorithm generally requires more storage than the other two algorithms and the difference comes from the amount of temporary data. If this is relatively small compared to the amount of factor data then the multifrontal algorithm is a good choice. The right-looking and multifrontal algorithms perform the basic operations at
a coarser granularity than the left-looking algorithm, which increases the potential for data reuse. The right-looking algorithm is not expected to perform well in an out-of-core context [9].
Factor Data Structures. The third major choice in Oblio is the factor data structure (for the L̃L̃^T factorization this stores L̃; for the LDL^T factorization this stores L and D together). A static factorization can be implemented with a static data structure while a dynamic factorization requires a dynamic data structure. We offer both alternatives, although a static factorization can be implemented with a dynamic data structure as well. The difference between the two data structures comes from the way storage is allocated for supernodes. In a static context a supernode does not change its dimensions and therefore a large chunk of storage can be allocated for all supernodes. In a dynamic context a supernode can change its dimensions (due to delayed columns during pivoting) and in this case storage needs to be allocated separately for each supernode. In order to access the data we use the same pointer variables in both cases. This makes the data access uniform between implementations. The two data structures manage in-core data and therefore we refer to them as in-core static and in-core dynamic. In addition, we offer a third, out-of-core, data structure that extends the in-core dynamic data structure. The out-of-core data structure can store the factor entries both internally and externally. Of course, most of the factor entries are stored externally. Data transfers between the two storage layers are performed through I/O operations at the granularity of a supernode. A supernode is stored in-core only when the factorization needs to access it.
Other Features. A few other features are of interest in Oblio. By using a dynamic data structure for the elimination forest [8] it is easy to reorder for storage optimization and to amalgamate supernodes for faster computations. Iterative refinement is available in order to recover solution accuracy after perturbing diagonal entries during a static LDL^T factorization or after a dynamic LDL^T factorization that uses a relaxed pivoting threshold. Oblio can also factor singular matrices and solve systems with multiple right-hand sides.
Implementation. Oblio is written in C++ and uses techniques such as dynamic memory allocation, encapsulation, polymorphism, and templates. Dynamic memory allocation is especially useful for temporary data. In order to minimize the coding effort we combine it with encapsulation most of the time, performing memory management within constructors and destructors. We use polymorphism in order to simplify the interaction between different factorization algorithms and different factor data structures, and we rely on templates in order to instantiate real and complex valued versions of the code. Oblio is organized as a collection of data and algorithm classes. Data classes describe passive objects such as coefficient matrices, vectors, permutations and factors. Algorithm classes describe active objects such as factorizations and triangular solves. The major classes and the interactions between them are depicted in Fig. 1 (real valued only). We now discuss the interaction between the three factorization algorithms and the three factor data structures in more detail. Generally we would require a different implementation of each algorithm for each data structure, and thus a total of nine implementations.
That would entail a significant programming effort, especially for maintaining the code. We rely
Fig. 1. The major classes from Oblio and the interactions between them (only the real valued numerical classes are shown but their complex valued counterparts are present in Oblio as well): OblioEngineReal, MatrixReal, OrderEngine, Permutation, FactorEngineReal, SolveEngineReal, VectorReal, and the factors hierarchy. The factorization algorithms interact with the abstract class FactorsReal. The actual class that describes the factors can be FactorsStaticReal, FactorsDynamicReal or FactorsOutOfCoreReal (all three derived from FactorsReal), the selection being made at run time
on polymorphism instead and use only three implementations, one for each algorithm. Remember that the difference between the two in-core data structures comes from the supernode storage allocation. But as long as we can use a common set of pointers in order to access data, a factorization algorithm does not need to be aware of the actual storage allocation. Therefore the factorization algorithms interact with an abstract factor class that allows them to access factor data without being aware of the particular factor implementation. The whole interaction takes place through abstract methods.
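The following C++ sketch illustrates this design; the class and method names follow the labels in Fig. 1, but the bodies and the exact signatures are assumptions rather than Oblio's source, and the storage management is omitted.

#include <typeinfo>

class FactorsReal {                        // abstract base used by all factorization algorithms
public:
  virtual ~FactorsReal() {}
  virtual void allocateEntry(int supernode) = 0;   // reserve storage for a supernode
  virtual void discardEntry(int supernode) = 0;    // release it
};

class FactorsStaticReal : public FactorsReal {
public:
  void allocateEntry(int supernode) override {}    // slices of one pre-allocated block (omitted)
  void discardEntry(int supernode) override {}
};

class FactorsDynamicReal : public FactorsReal {
public:
  void allocateEntry(int supernode) override {}    // per-supernode allocation (omitted)
  void discardEntry(int supernode) override {}
};

class FactorsOutOfCoreReal : public FactorsDynamicReal {
public:
  void readEntry(int supernode)  {}        // fetch a supernode from external storage (omitted)
  void writeEntry(int supernode) {}        // write it back (omitted)
};

// The algorithm works through the abstract interface; the concrete type is identified
// once, at the highest level, and I/O is performed only in the out-of-core case.
void process_supernode(FactorsReal& factors, int s) {
  FactorsOutOfCoreReal* ooc = dynamic_cast<FactorsOutOfCoreReal*>(&factors);
  if (ooc) ooc->readEntry(s);              // bring the supernode in core
  factors.allocateEntry(s);
  // ... numerical update of supernode s through the abstract interface ...
  if (ooc) ooc->writeEntry(s);             // release it back to external storage
}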
Fig. 2. The ratio between the storage required by the multifrontal algorithm and the storage required by the left-looking and right-looking algorithms, L̃L̃^T factorization
For out-of-core factorization we make a trade-off. In addition to the basic operations that are required by an in-core factorization, an out-of-core factorization requires I/O operations. These can be made abstract as well but that would not necessarily be an elegant software design solution. Another solution, requiring additional code but more
Oblio: Design and Performance
763
3
10
LDLT 1e−10 / LLT LDLT 1e−1 / LLT
2
storage ratio
10
1
10
0
10
0
5
10
15
20
25 problem
30
35
40
45
50
Fig. 4. The ratio between the storage required by the dynamic LDLT factorization and the storage ˜L ˜ T factorization, using the multifrontal algorithm, for pivoting thresholds of 1e-1 required by the L and 1e-10, respectively 2
10
LDLT 1e−10 / LLT LDLT 1e−1 / LLT
time ratio
1
10
0
10
0
5
10
15
20
25 problem
30
35
40
45
50
Fig. 5. The ratio between the execution time of the dynamic LDLT factorization and the execution ˜L ˜ T factorization, using the multifrontal algorithm, for pivoting thresholds of 1e-1 time of the L and 1e-10, respectively (IBM Power 3 platform: 375 MHz, 16 GB)
764
Florin Dobrian and Alex Pothen
[Plot omitted; y-axis: error (log scale), x-axis: problem; curves: 1e-10 and 1e-1.]
Fig. 6. The relative residual for the dynamic LDL^T factorization, using the multifrontal algorithm, for pivoting thresholds of 1e-1 and 1e-10, respectively
[Plot omitted; y-axis: time (s), x-axis: problem size; curves: in-core left-looking, in-core right-looking, in-core multifrontal, out-of-core multifrontal.]
Fig. 7. The execution time of the L̃L̃^T factorization, in-core (left-looking, right-looking and multifrontal) and out-of-core (multifrontal) (Sun UltraSparc III platform: 900 MHz, 2 GB). There is only a slight increase of the out-of-core execution time because a slight increase of the problem size determines a slight increase of the amount of explicit I/O. The non-monotonicity of the plots is caused by the lack of smoothness in the ordering algorithm
This is the solution that we adopted in Oblio. Oblio identifies the data structure at run time and performs I/O only in the out-of-core context. The identification of the actual data structure is done at the highest level and thus comes at virtually no cost. In Oblio, most of the basic arithmetic operations are efficiently implemented through level 3 BLAS/LAPACK calls.
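A minimal sketch of how such a run-time check might look, assuming a C++ implementation: the class is named after Fig. 1, and readEntry()/writeEntry() appear there, but their signatures and the driver loop below are illustrative only.

#include <iostream>

// Minimal stand-ins for the Fig. 1 classes, with hypothetical signatures,
// to illustrate run-time type identification (RTTI) via dynamic_cast.
class FactorsReal {
public:
    virtual ~FactorsReal() = default;
};

class FactorsOutOfCoreReal : public FactorsReal {
public:
    void readEntry(int supernode)  { std::cout << "read entry "  << supernode << "\n"; }
    void writeEntry(int supernode) { std::cout << "write entry " << supernode << "\n"; }
};

// The type check is performed once, at the top level of the driver, so its
// cost is negligible; explicit I/O is issued only in the out-of-core case.
void factorAll(FactorsReal& factors, int numSupernodes) {
    FactorsOutOfCoreReal* ooc = dynamic_cast<FactorsOutOfCoreReal*>(&factors);
    for (int s = 0; s < numSupernodes; ++s) {
        if (ooc) ooc->readEntry(s);
        // ... partial factorization and updates via level 3 BLAS/LAPACK ...
        if (ooc) ooc->writeEntry(s);
    }
}

int main() {
    FactorsOutOfCoreReal oocFactors;
    factorAll(oocFactors, 3);    // prints read/write for supernodes 0, 1, 2
    return 0;
}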
3 Performance
We present a reduced set of results here. More details will be available in a future paper. We also refer the reader to [12] for a recent comparative study of sparse symmetric direct solvers (including Oblio). In all the experiments below we use the node nested dissection algorithm from METIS [6] as a fill-reducing ordering. As reported in [12], Oblio performs as well as other similar solvers for positive definite problems and correctly solves indefinite problems. For the latter, Oblio is not as fast as other indefinite solvers, since we have not yet optimized our indefinite kernels. The best factorization rate observed so far is 42% of the peak rate on an IBM Power 4 platform (1.3 GHz, 5200 Mflop/s peak rate).

In-Core Results. Figures 2 through 6 illustrate in-core experiments performed with Oblio. All problems are indefinite and represent a subset of the collection used in [12]. We used dynamic LDL^T factorization and, for comparison, we used L̃L̃^T factorization as well, after replacing the original numerical entries and transforming the indefinite matrices into positive definite matrices. We used the in-core dynamic factor data structure for the former and the in-core static factor data structure for the latter, and we employed all three factorization algorithms. The results are obtained on an IBM Power 3 platform (375 MHz, 16 GB), which we chose because we were particularly interested in a large amount of internal memory.

Figure 2 shows the difference in storage requirement between the three algorithms in the positive definite case. The plot represents the ratio between the storage required by the multifrontal algorithm and the storage required by the left-looking and right-looking algorithms. This indicates how much additional storage is required by the multifrontal algorithm. For a significant number of problems the amount of additional storage is not large, and the multifrontal algorithm is expected to perform well. Execution time results for positive definite problems are shown in Fig. 3. The plots represent the ratios between the times required to factor using the left-looking and multifrontal algorithms, and between the times required to factor using the right-looking and multifrontal algorithms, respectively. Note that the multifrontal algorithm outperforms the other two for many problems.

Figures 4 and 5 illustrate results for the indefinite case, using only the multifrontal algorithm and two different pivoting thresholds, 1e-1 and 1e-10. The plots show the increase in storage and execution time (same IBM Power 3 platform). The increase is smaller when the pivoting threshold is smaller, but a larger pivoting threshold may be required for accurate results. Similar results can be obtained with the indefinite left-looking and right-looking algorithms. For indefinite problems we also report accuracy results. These are provided in Fig. 6. The plots represent the relative residual for the two sets of experiments. For most of the
problems from this set the computation is accurate enough with the relaxed pivoting threshold (1e-10), but some problems require a larger pivoting threshold, at the cost of a more expensive computation.

Out-of-Core Results. For the out-of-core experiments we chose a platform with a smaller amount of internal memory: an UltraSparc III Sun Fire 280R (900 MHz, 2 GB). Since this platform has a smaller core, we can easily determine differences in performance between the in-core and the out-of-core computations. Unfortunately, matrix collections usually provide problems of a given size, even if some of these problems may be large. In order to run out-of-core experiments one should be able to tune the problem size given a particular core size. Here, in order to match the 2 GB core size, we tune the order of the problem from roughly 2,000,000 to roughly 5,000,000 (this way we cross the 2 GB threshold). We use model 2D finite difference discretization grids with 5-point stencils, the grid size ranging from 1,500 to 2,200.

Figure 7 plots the execution time for the three in-core L̃L̃^T factorizations (static data structure) as well as the execution time for the out-of-core multifrontal L̃L̃^T factorization. As expected, the in-core factorizations start to perform poorly close to the 2 GB threshold. At that point the operating system begins to rely on virtual memory in order to provide the required storage. This translates into implicit I/O (swapping), which slows down the computation. On the other hand, the out-of-core factorization performs explicit I/O and therefore the computation continues to be performed efficiently.
4 Conclusion
Oblio is an active project and the package continues to be developed. Our current effort is targeted toward software design improvements and code optimizations. We also plan several extensions of the package, mostly in order to provide more options for numerically difficult problems. A future, longer paper will address the design of Oblio in more detail and will provide more results.
References
1. P. R. Amestoy, I. S. Duff, J. Y. L'Excellent, and J. Koster. A fully asynchronous multifrontal solver using distributed dynamic scheduling. SIAM Journal on Matrix Analysis and Applications, 23(1):15–41, 2001.
2. C. Ashcraft, R. G. Grimes, and J. G. Lewis. Accurate symmetric indefinite linear equation solvers. SIAM Journal on Matrix Analysis and Applications, 20(2):513–561, 1998.
3. F. Dobrian, G. K. Kumfert, and A. Pothen. The design of sparse direct solvers using object-oriented techniques. In H. P. Langtangen, A. M. Bruaset, and E. Quak, editors, Advances in Software Tools in Scientific Computing, volume 50 of Lecture Notes in Computational Science and Engineering, pages 89–131. Springer-Verlag, 2000.
4. I. S. Duff and J. Koster. On algorithms for permuting large entries to the diagonal of a sparse matrix. SIAM Journal on Matrix Analysis and Applications, 22(4):973–996, 2001.
5. A. Gupta, M. Joshi, and V. Kumar. WSMP: A high-performance shared- and distributed-memory parallel sparse linear equation solver. Technical Report RC 22038, IBM T.J. Watson Research Center, 2001.
6. G. Karypis and V. Kumar. METIS: Family of multilevel partitioning algorithms. http://www-users.cs.umn.edu/~karypis/metis.
7. X. S. Li and J. W. Demmel. SuperLU_DIST: A scalable distributed-memory sparse direct solver for unsymmetric linear systems. ACM Transactions on Mathematical Software, 29(2):110–140, 2003.
8. A. Pothen and S. Toledo. Elimination structures in scientific computing. In D. Mehta and S. Sahni, editors, Handbook on Data Structures and Applications, pages 59.1–59.29. CRC Press, 2004.
9. E. Rothberg and R. Schreiber. Efficient methods for out-of-core sparse Cholesky factorization. SIAM Journal on Scientific Computing, 21(1):129–144, 1999.
10. V. Rotkin and S. Toledo. The design and implementation of a new out-of-core sparse Cholesky factorization method. ACM Transactions on Mathematical Software, 30(1):19–46, 2004.
11. O. Schenk and K. Gärtner. Solving unsymmetric sparse systems of linear equations with PARDISO. Future Generation Computer Systems, 20(3):475–487, 2004.
12. J. A. Scott, Y. Hu, and N. I. M. Gould. An evaluation of sparse direct symmetric solvers: an introduction and preliminary findings. Numerical Analysis Internal Report 2004-1, Rutherford Appleton Laboratory, 2004.
Performance Analysis of Parallel Right-Looking Sparse LU Factorization on Two Dimensional Grids of Processors
Laura Grigori¹ and Xiaoye S. Li²
¹ INRIA Rennes, Campus Universitaire de Beaulieu, 35042 Rennes, France, [email protected]
² Lawrence Berkeley National Laboratory, MS 50F-1650, One Cyclotron Road, Berkeley, CA 94720, USA, [email protected]
Abstract. We investigate performance characteristics for the LU factorization of large matrices with various sparsity patterns. We consider supernodal right-looking parallel factorization on a two dimensional grid of processors, making use of static pivoting. We develop a performance model and we validate it using the implementation in SuperLU_DIST, real-world matrices, and the IBM Power3 machine at NERSC. We use this model to obtain performance bounds on parallel computers, to perform scalability analysis and to identify performance bottlenecks. We also discuss the role of load balance and data distribution in this approach.
1 Introduction
A valuable tool in designing a parallel algorithm is to analyze its performance characteristics for various classes of applications and machine configurations. Very often, good performance models reveal communication inefficiency and memory access contention that limit the overall performance. Modeling these aspects in detail can give insights into the performance bottlenecks and help improve the algorithm.

The goal of this paper is to analyze performance characteristics and scalability for the LU factorization of large matrices with various sparsity patterns. For dense matrices, the factorization algorithms have been shown to exhibit good scalability, where the efficiency can be approximately maintained as the number of processors increases when the memory requirements per processor are held constant [2]. For sparse matrices, however, the efficiency is much harder to predict, since the sparsity patterns vary with different applications. Several results exist in the literature [1,4,6], which were obtained for particular classes of matrices arising from the discretization of a physical domain. They show that factorization is not always scalable with respect to memory use. For sparse matrices resulting from two-dimensional domains, the best parallel algorithm leads to an increase of the memory at a rate of O(P log P) with increasing P [4]. It is worth mentioning that for matrices resulting from three-dimensional domains, the best algorithm is scalable with respect to memory size.

In this work, we develop a performance model for a sparse factorization algorithm that is suitable for analyzing performance with an arbitrary input matrix. We use a classical
model to describe an ideal machine architecture in terms of processor speed, network latency and bandwidth. Using this machine model and several input characteristics (order of the matrix, number of nonzeros, etc.), we analyze a supernodal right-looking parallel factorization on two dimensional grids of processors, making use of static pivoting. This analysis allows us to obtain performance upper bounds on parallel computers, to perform scalability analysis, to identify performance bottlenecks and to discuss the role of load balance and data distribution. More importantly, our performance model reveals the relationship between parallel runtime and matrix sparsity, where the sparsity is measured with respect to the underlying hardware's characteristics. Given any combination of application and architecture, we can obtain this sparsity measure. Then our model can quantitatively predict not only the performance on this machine, but also which hardware parameters are the most critical to improve in order to enhance performance for this type of application.

We validate our analytical model using the actual factorization algorithm implemented in the SuperLU_DIST [5] solver, real-world matrices and the IBM Power3 machine at NERSC. We also show that the runtime predicted by our model is more accurate than that predicted by simply examining the workload on the critical path, because our model takes into account both task dependency and communication overhead.

The rest of the paper is organized as follows: Section 2 introduces a performance analysis model for the right-looking factorization, with its scalability analysis. The experimental results validating the performance model are presented in Section 3, and Section 4 draws the conclusions.
2 Parallel Right-Looking Sparse LU Factorization on Two Dimensional Grids of Processors
Consider factorizing a sparse unsymmetric n × n matrix A into the product of a unit lower triangular matrix L and an upper triangular matrix U. We discuss a parallel execution of this factorization on a two dimensional grid of processors. The matrix is partitioned into N × N blocks of submatrices using unsymmetric supernodes (columns of L with the same nonzero structure). These blocks of submatrices are further distributed among a two dimensional grid Pr × Pc of P processors (Pr × Pc ≤ P) using a block cyclic distribution. With this distribution, a block at position (I, J) of the matrix (0 ≤ I, J < N) is mapped onto the process at position (I mod Pr, J mod Pc) of the grid. U(K, J) (respectively L(K, J)) denotes the submatrix of U (respectively L) at row block index K and column block index J.

The algorithm below describes a right-looking factorization, and Figure 1 illustrates the corresponding execution on a rectangular grid of processors. The algorithm loops over the N supernodes. In the K-th iteration, the first K−1 block columns of L and block rows of U are already computed. At this iteration, first the column of processors owning block column K of L factors this block column L(K : N, K); second, the row of processors owning block row K of U performs the triangular solve to compute U(K, K+1 : N); and third, all the processors update the trailing matrix using L(K+1 : N, K) and U(K, K+1 : N). This third step requires most of the work and also exhibits most of the parallelism in the right-looking approach.
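The block cyclic mapping above is easy to state in code. The sketch below prints the owner of every block of a small matrix; the (I mod Pr, J mod Pc) rule comes from the text, while the row-major linearization of the process grid is an assumption made for the example.

#include <cstdio>

// Owner of block (I, J) under the 2-D block cyclic distribution:
// process row I mod Pr, process column J mod Pc.
int ownerOfBlock(int I, int J, int Pr, int Pc) {
    return (I % Pr) * Pc + (J % Pc);   // row-major numbering of the grid (assumed)
}

int main() {
    const int N = 7, Pr = 2, Pc = 3;   // 7 x 7 blocks on a 2 x 3 process grid
    for (int I = 0; I < N; ++I) {
        for (int J = 0; J < N; ++J)
            std::printf("%d ", ownerOfBlock(I, J, Pr, Pc));
        std::printf("\n");
    }
    return 0;
}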
[Figure omitted: a distributed matrix whose blocks are labeled with the ranks 0–5 of a 2 × 3 grid of processors, following the block cyclic distribution described above.]
Fig. 1. Illustration of parallel right-looking factorization
for K := 1 to N do
    Factorize block column L(K : N, K)
    Perform triangular solves: U(K, K+1 : N) := L(K, K)^{-1} × A(K, K+1 : N)
    for J := K+1 to N with U(K, J) ≠ 0 do
        for I := K+1 to N with L(I, K) ≠ 0 do
            Update trailing submatrix: A(I, J) := A(I, J) − L(I, K) × U(K, J)
        end for
    end for
end for
The performance model we develop for the sparse LU factorization is close to the performance model developed for the dense factorization algorithms in ScaLAPACK [2]. Processors have local memory and are connected by a network that provides each processor direct links with any of its 4 direct neighbors (mesh-like). To simplify analysis and to make the model easier to understand, we make the following assumptions:
– We use one parameter to describe the processor flop rate, denoted γ, and we ignore communication collisions. We estimate the time for sending a message of m items between two processors as α + mβ, where α denotes the latency and β the inverse of the bandwidth.
– We approximate the cost of a broadcast to p processors by log p [2]. Furthermore, the LU factorization uses a pipelined execution to overlap some of the communication with computation, and in this case the cost of a broadcast is estimated by 2 [2].
– We assume that the computation of each supernode lies on the critical path of execution, that is, the length of the critical path is N. We also assume that the load and the data are evenly distributed among processors. Later, in Section 3, we will provide the experimental data verifying these assumptions.

Runtime Estimation. We use the following notations to estimate the runtime to factorize an n × n matrix. We use c_K to denote the number of off-diagonal elements in each column of block column K of L, r_K to denote the number of off-diagonal elements in each row of block row K of U, nnz(L) to denote the number of nonzeros in the off-diagonal blocks of L, and nnz(U) to denote the number of nonzeros in the off-diagonal blocks of U.
M = 2 ∑_{K=1}^{N} c_K r_K is the total number of flops in the trailing matrix update, counting both multiplications and additions. F = nnz(L) + M is the total number of flops in the factorization. With the above notations, the sequential runtime can be estimated as T_s = nnz(L)γ + Mγ = Fγ. We assume each processor in the column of processors owning block column K gets s · c_K / P_r elements, where s · c_K is the number of nonzeros in the block column K and P_r is the number of processors in the column. Block row K of U is distributed in a similar manner. The parallel runtime using a square grid of processors can be expressed as:

    T(N, √P × √P) ≈ (F/P)·γ + (2N + (1/2)·N·log P)·α + ((2·nnz(L) + (1/2)·nnz(U)·log P)/√P)·β
The first term represents the parallelization of the computation. The second term represents the number of broadcast messages. The third term represents the volume of communication overhead.

Scalability Analysis. We now examine the scalability using a square grid of processors of size P, where the efficiency of the right-looking algorithm is given by the following formula:

    E(N, √P × √P) = T_s(N) / (P · T(N, √P × √P))                                              (2.1)
                  ≈ [1 + (N·P·log P / F)·(α/γ) + ((2·nnz(L) + nnz(U)·log P)·√P / F)·(β/γ)]^{-1}   (2.2)
One interesting question is which of the three terms dominates efficiency (depending on the size and the sparsity of the matrix). The preliminary remark is that, for very dense problems (F large), the first term significantly affects the parallel efficiency. For the other cases, we can compare the last two terms to determine which one is dominant. That is, if we ignore the factors 2 and log P in the third term, we need to compare √P · α/β with nnz(L+U)/N. Assuming that the network's latency-bandwidth product α/β is given, we can determine whether the ratio of the latency to the flop rate (the α/γ term) or the ratio of the inverse of the bandwidth to the flop rate (the β/γ term) dominates efficiency. Overall, the following observations hold:

Case 1. For sparser problems (√P · α/β > nnz(L+U)/N), the α/γ term dominates efficiency.
Case 2. For denser problems (√P · α/β < nnz(L+U)/N), the β/γ term dominates efficiency.
Case 3. For problems for which nnz(L+U)/N is close to √P · α/β, the β/γ term can be dominant on smaller numbers of processors, and with an increasing number of processors the α/γ term can become dominant.

Note that even for Case 2, the algorithm's behaviour varies during the factorization: at the beginning of the factorization, where the matrix is generally sparser and the messages
are shorter, the α/γ term dominates the efficiency, while at the end of the factorization, where the matrix becomes denser, the β/γ term dominates the efficiency.

Let us now consider matrices in Case 2. For these matrices, in order to maintain a constant efficiency, F/nnz(L+U) must grow proportionally with √P. On the other hand, for a scalable algorithm in terms of memory requirements, the memory requirement nnz(L+U) should not grow faster than P. Thus, when we allow nnz(L+U) ∝ P, the condition F ∝ nnz(L+U)^{3/2} must be satisfied, and the efficiency can be approximately maintained constant. (In reality, even for these matrices, the α/γ term as well as the log P factor will still contribute to efficiency degradation.) We note that matrices with N unknowns arising from the discretization of the Laplacian operator on three-dimensional finite element grids fall into this category. Using nested dissection, the number of fill-ins in such a matrix is on the order of O(N^{4/3}), while the amount of work is on the order of O(N^2). Maintaining a fixed efficiency requires that the number of processors P grows proportionally with N^{4/3}, the size of the factored matrix. In essence, the efficiency for these problems can be approximately maintained if the memory requirement per processor is constant. Note that the α/γ term grows with N^{1/3}, which also contributes to efficiency loss.
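The model above is simple enough to be evaluated directly. The sketch below computes the sequential time, the predicted parallel time, and the resulting efficiency for a given machine and problem; the base-2 logarithm and the numeric parameter values in main() are assumptions made only to exercise the formulas, not measured data.

#include <cmath>
#include <cstdio>

struct Machine { double alpha, beta, gamma; };  // latency, inverse bandwidth, time per flop
struct Problem { double N, nnzL, nnzU, F; };    // #supernodes, off-diagonal nonzeros, total flops

double seqTime(const Problem& p, const Machine& m) { return p.F * m.gamma; }

// T(N, sqrt(P) x sqrt(P)) from the model; log P is taken base 2 here (an assumption).
double parTime(const Problem& p, const Machine& m, double P) {
    double lg = std::log2(P);
    return (p.F / P) * m.gamma
         + (2.0 * p.N + 0.5 * p.N * lg) * m.alpha
         + ((2.0 * p.nnzL + 0.5 * p.nnzU * lg) / std::sqrt(P)) * m.beta;
}

int main() {
    // Illustrative placeholder values only (not measured machine or matrix data).
    Machine machine{8.0e-6, 1.6e-8, 1.0e-9};
    Problem problem{11212, 17.5e6, 17.5e6, 25.24e9};
    for (double P : {4.0, 16.0, 64.0, 128.0}) {
        double E = seqTime(problem, machine) / (P * parTime(problem, machine, P));
        std::printf("P = %4.0f   T = %7.2f s   efficiency = %.2f\n",
                    P, parTime(problem, machine, P), E);
    }
    return 0;
}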
3 Experimental Results
In this section, we compare the analytical results against experimental results obtained on an IBM Power3 machine at NERSC, with real-world matrices. The test matrices and their characteristics are presented in Table 1. They are ordered according to the last column, nnz(L+U)/N, which we use as a measure of the sparsity of the matrix.

Table 1. Benchmark matrices

Matrix       Order   N      nnz(A)    nnz(L+U) ×10^6   Flops(F) ×10^9   nnz(L+U)/N ×10^3
af23560      23560   10440  484256    11.8             5.41             1.13
rma10        46835   6427   2374001   9.36             1.61             1.45
ecl32        51993   25827  380415    41.4             60.45            1.60
bbmat        38744   11212  1771722   35.0             25.24            3.12
inv-extr1    30412   6987   1793881   28.1             27.26            4.02
ex11         16614   2033   1096948   11.5             5.60             5.65
fidapm11     22294   3873   623554    26.5             26.80            6.84
mixingtank   29957   4609   1995041   44.7             79.57            9.69
The first goal of our experiments is to analyze the different assumptions we have made during the development of the performance model. Consider again the efficiency Equation (2.2) in Section 2. For the first term, we assume that the load is evenly distributed among processors, while for the third term we assume that the data is evenly distributed
among processors. Note that we also assume that the computation of each supernode lies on the critical path of execution. Our experiments show that a two dimensional distribution of the data on a two dimensional grid of processors leads to a balanced distribution of the data. They also show that for almost all the matrices, the critical path assumption is realistic [3]. However, the load balance assumption is not always valid.

To assess load balance, we consider the load F to be the number of floating point operations to factorize the matrix. We then compute the load lying on the critical path, F_CP, by adding at each iteration the load of the most loaded processor in this iteration. More precisely, let f_{p,i} be the load of processor p at iteration i (the number of flops performed by this processor at iteration i). Then

    F_CP = ∑_{i=1}^{N} max_{1≤p≤P} f_{p,i}.

The load balance factor is computed as LB = F_CP / (F/P) = P·F_CP/F. In other words, LB is the load of the heaviest processors lying on the critical path divided by the average load per processor. The closer this factor is to 1, the better the load balance. Contrary to the "usual" way, in which the load balance factor is computed as the average load divided by the maximum load among all the processors, our computation of load balance is more precise. Note that we can also use F/F_CP to compute a crude upper bound on the parallel speedup, which takes into account the workload on the critical path but ignores communication cost and task dependency.

The results are presented in Table 2, in the rows denoted by LB. We observe that the workload distribution is good for large matrices on a small number of processors. But it can degrade quickly for some matrices such as rma10, for which load balance can degrade by a factor of 2 when increasing the number of processors by a factor of 2. Consequently, efficiency will suffer a significant degradation.

The second goal of the experiments is to show how the experimental results support our analytical performance model developed in Section 2. For this, we use the analytical performance model to predict the speedup that each matrix should attain with an increasing number of processors. Then we compare the predicted speedup against the speedup obtained by SuperLU_DIST. The plots in Figure 2 display these results, where the predicted speedup for each matrix M is denoted by M p, and the actually obtained (measured) speedup is denoted by M m. We also display in these plots the upper bound obtained from the workload on the critical path, given by F/F_CP, and we denote it as M LB. As the plots show, the analytical performance model predicts the performance well on a small number of processors (up to 30–40 processors), while the predicted speedup starts to deviate above the measured speedup with an increase in the number of processors. This is because on a smaller number of processors our model assumptions are rather realistic, but on a larger number of processors the assumptions deviate from reality. That is why we see the degraded scalability. More detailed data were reported in [3].

The upper bound based only on the workload on the critical path can be loose, since it ignores communication and task dependency. However, it often corroborates the general trend of the speedup predicted by our analytical model.
For several matrices, such as bbmat, inv-extr1, fidapm11 and mixingtank, this upper bound is very close to the prediction given by our analytical model (see Figure 2), implying that for those matrices, load imbalance is a more severe problem than communication, and improving load balance alone can greatly improve the overall performance.
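The load balance factor defined above is straightforward to compute from per-iteration, per-process flop counts; a small sketch with made-up numbers is shown below.

#include <algorithm>
#include <cstdio>
#include <vector>

// LB = F_CP / (F / P): the critical-path load divided by the average load per
// process. flops[i][p] holds the flops performed by process p at iteration i.
double loadBalanceFactor(const std::vector<std::vector<double>>& flops) {
    if (flops.empty() || flops[0].empty()) return 1.0;
    double F = 0.0, Fcp = 0.0;
    const double P = static_cast<double>(flops[0].size());
    for (const std::vector<double>& iter : flops) {
        Fcp += *std::max_element(iter.begin(), iter.end());
        for (double f : iter) F += f;
    }
    return Fcp / (F / P);   // 1.0 means perfect balance; larger values are worse
}

int main() {
    // Toy data: 3 iterations on 4 processes (illustrative values only).
    std::vector<std::vector<double>> flops = {
        {10, 12, 9, 11}, {20, 5, 18, 15}, {7, 7, 30, 6}};
    std::printf("LB = %.2f\n", loadBalanceFactor(flops));
    // The crude speedup upper bound from the critical path is F / F_CP.
    return 0;
}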
Table 2. Runtimes (in seconds) and load distribution (LB) for right-looking factorization on two dimensional grids of processors

Matrix              P=1     P=4    P=16   P=32   P=64   P=128
af23560     time    9.95    3.95   2.36   2.30   3.17   3.38
            LB      1.0     1.28   2.05   2.92   4.21   6.67
rma10       time    3.41    2.12   1.90   1.99   3.03   3.17
            LB      1.0     1.66   2.75   5.62   9.13   16.07
ecl32       time    104.27  29.34  9.49   7.25   7.31   7.22
            LB      1.0     1.09   1.28   1.52   1.80   2.37
bbmat       time    67.37   19.50  7.61   5.64   6.05   6.50
            LB      1.0     1.21   1.75   2.35   3.17   4.88
inv-extr1   time    73.13   19.08  6.46   4.72   4.95   5.29
            LB      1.0     1.14   1.42   1.87   2.45   3.51
ex11        time    9.49    3.27   1.54   1.33   1.65   2.07
            LB      1.0     1.27   1.90   2.58   3.53   5.32
fidapm11    time    51.99   14.21  4.84   3.57   3.47   3.81
            LB      1.0     1.15   1.46   1.83   2.31   3.13
mixingtank  time    119.45  33.87  9.52   6.47   5.53   5.27
            LB      1.0     1.08   1.25   1.43   1.63   2.04
The third goal of the experiments is to study the actual efficiency case by case for all the test matrices. One approach is to observe how the factorization runtime degrades as the number of processors increases for different matrices. For this we report in Table 2 the runtime in seconds of the SuperLU_DIST factorization. These results illustrate that good speedups can be obtained on a small number of processors and show how efficiency degrades on a larger number of processors. As one would expect, the efficiency degrades faster for problems of smaller size (smaller number of flops), and slower for larger problems.

We now examine how the actual model parameters (sparsity, α, β and γ) affect the performance. On the IBM Power3, the measured latency is 8.0 microseconds and the bandwidth (1/β) for our medium size of messages is 494 MB/s [7]. The latency-bandwidth product is α/β = 4 × 10^3. Table 1 shows that the algorithm's efficiency for some matrices is clearly dominated by the α/γ term, such as af23560, rma10 and ecl32 (Case 1 matrices). For the other matrices, mixingtank, fidapm11, ex11 and inv-extr1, the efficiency is significantly affected by the β/γ term (Case 2 matrices).

Matrices af23560 and ex11 have an approximately equal number of flops and almost similar runtimes on one processor, but efficiency degrades faster for af23560 than for ex11. This is because the efficiency of af23560 is mostly affected by the α/γ term (Case 1), while the efficiency of ex11 is mainly affected by the β/γ term (Case 2), and thus its performance degrades more slowly than for af23560. For denser matrices which fall into Case 2, such as mixingtank, the algorithm achieves much better efficiency even on a large number of processors.
[Eight speedup plots omitted, one per matrix (af23560, rma10, ecl32, bbmat, ex11, inv-extr1, fidapm11, mixingtank); each panel shows the predicted ("p"), measured ("m") and load-balance-bound ("LB") speedups as a function of the number of processors.]
Fig. 2. Speedups predicted by our performance model (labeled "p") and by the load balance constraint (labeled "LB"), versus the measured speedups (labeled "m")
Therefore, the algorithm is more sensitive to latency than bandwidth.
4 Conclusions
We developed a performance model for a sparse right-looking LU factorization algorithm and validated this model using the SuperLU_DIST solver, real-world matrices and the IBM Power3 machine at NERSC. Using this model, we first analyzed the efficiency of this algorithm with increasing number of processors and problem size. We concluded that for matrices satisfying a certain relation (namely F ∝ nnz(L+U)^{3/2}) between their problem size and their memory requirements, the algorithm is scalable with respect to memory use. This relation is satisfied by matrices arising from 3D model problems. For these matrices the efficiency can be roughly maintained constant when the number of processors increases and the memory requirement per processor is constant. But for matrices arising from 2D model problems, the algorithm is not scalable with respect to memory use, the same as sparse Cholesky factorization [4].

Secondly, we analyzed the efficiency of this algorithm for fixed problem size and increasing number of processors. We observed that good speedups can be obtained on smaller numbers of processors. On larger numbers of processors, the efficiency degrades faster for sparser problems, which are more sensitive to the latency of the network. A two dimensional distribution of the data on a two dimensional grid of processors leads to a balanced distribution of the data. It also leads to a balanced distribution of the load on smaller numbers of processors, but the load balance is usually poor on larger numbers of processors. We believe that load imbalance and an insufficient amount of work (F) relative to communication overhead are the main sources of worse efficiency on large numbers of processors.

One practical use of our theoretical efficiency bound is as follows. For a certain application domain, the matrices usually exhibit a similar sparsity pattern. We can measure the sparsity with respect to the underlying machine parameters, i.e., floating-point speed, network latency and bandwidth. Depending on whether they belong to Case 1 or Case 2, we can determine the most critical hardware parameters which need to be improved in order to enhance the performance for this class of application. In addition, given several choices of machines, we can predict which hardware combination is best for this application.
References
1. Cleve Ashcraft. The fan-both family of column-based distributed Cholesky factorization algorithms. In Alan George, John R. Gilbert, and Joseph W. H. Liu, editors, Graph Theory and Sparse Matrix Computation, pages 159–191. Springer Verlag, 1994.
2. Jack K. Dongarra, Robert A. van de Geijn, and David W. Walker. Scalability Issues Affecting the Design of a Dense Linear Algebra Library. Journal of Parallel and Distributed Computing, 22(3):523–537, 1994.
3. Laura Grigori and Xiaoye S. Li. Performance analysis of parallel supernodal sparse LU factorization. Technical Report LBNL-54497, Lawrence Berkeley National Laboratory, February 2004.
4. Anshul Gupta, George Karypis, and Vipin Kumar. Highly Scalable Parallel Algorithms for Sparse Matrix Factorization. IEEE Transactions on Parallel and Distributed Systems, 8(5):502–520, 1997.
5. Xiaoye S. Li and James W. Demmel. SuperLU_DIST: A scalable distributed-memory sparse direct solver for unsymmetric linear systems. ACM Trans. Mathematical Software, 29(2):110–140, June 2003.
6. Robert Schreiber. Scalability of sparse direct solvers. In Alan George, John R. Gilbert, and Joseph W. H. Liu, editors, Graph Theory and Sparse Matrix Computation, pages 191–211. Springer Verlag, 1994.
7. Adrian Wong. Private communication. Lawrence Berkeley National Laboratory, 2002.
A Shared- and Distributed-Memory Parallel Sparse Direct Solver
Anshul Gupta
IBM T.J. Watson Research Center, 1101 Kitchawan Road, Yorktown Heights, NY 10598, USA
1 Introduction
In this paper, we describe a parallel direct solver for general sparse systems of linear equations that has recently been included in the Watson Sparse Matrix Package (WSMP) [7]. This solver utilizes both shared- and distributed-memory parallelism in the same program and is designed for a hierarchical parallel computer with network-interconnected SMP nodes. We compare the WSMP solver with two similar well-known solvers: MUMPS [2] and SuperLU_DIST [10]. We show that the WSMP solver achieves significantly better performance than both these solvers based on traditional algorithms and is more numerically robust than SuperLU_DIST. We had earlier shown [8] that MUMPS and SuperLU_DIST are amongst the fastest distributed-memory general sparse solvers available.

The parallel sparse LU factorization algorithm in WSMP is based on the unsymmetric pattern multifrontal method [3]. The task- and data-dependency graph for symmetric multifrontal factorization is a tree. A task-dependency graph (henceforth, task-DAG) is a directed acyclic graph in which the vertices correspond to the tasks of factoring rows or columns or groups of rows and columns of the sparse matrix, and the edges correspond to the dependencies between the tasks. A task is ready for execution if and only if all tasks with incoming edges to it have completed. The vertex set of a data-dependency graph (henceforth, data-DAG) is the same as that of the task-DAG, and an edge from vertex i to vertex j denotes that at least some of the output data of task i is required as input by task j. In unsymmetric pattern multifrontal factorization, the task- and data-DAGs are general directed acyclic graphs in which nodes can have multiple parents.

Hadfield [9] introduced a parallel sparse LU factorization code based on the unsymmetric pattern multifrontal method. A significant drawback of this implementation was that partial pivoting during factorization would change the row/column order of the sparse matrix. Therefore, the data-DAG needed to be generated during numerical factorization, thus introducing considerable symbolic processing overhead. We recently introduced improved symbolic and numerical factorization algorithms for general sparse matrices [6]. The symbolic algorithms are capable of inexpensively computing a minimal task-dependency graph and a near-minimal data-dependency graph for factoring a general sparse matrix. These graphs, computed solely from the nonzero pattern of the sparse matrix, are valid for any amount of pivoting induced by the numerical values during LU factorization. With the introduction of these algorithms, the symbolic processing phase
can be completely separated from numerical factorization and needs to be performed only once for matrices with the same initial structure but different numerical values, and hence, potentially different pivoting sequences during numerical factorization. A description of the serial unsymmetric pattern multifrontal algorithm used in WSMP can be found in [6].
2 Overview of the Parallel Sparse LU Algorithm
WSMP is designed to make the best use of the computational resources of most modern parallel computers. These machines, typically, are either shared-memory multiprocessors or are clusters of nodes consisting of shared-memory multiprocessors. WSMP can run on multiple nodes using MPI processes, and each process uses threads to utilize all the CPUs on its node.

The symbolic preprocessing step, among other things, generates a static data-DAG that defines the communication and computation pattern of the multifrontal LU factorization. This static information is used to generate a static mapping of the data-DAG onto the processes. The work required to process certain nodes could change during execution due to pivoting. However, such changes are usually relatively small and are randomly distributed among the processes. Therefore, they rarely pose a serious load-imbalance problem. Dynamic load-balancing would have introduced extra communication overhead and other complexities in the code. With these factors in mind, we chose static load-balancing at the process level. However, the multiple threads running on each process keep their loads balanced dynamically. Each process maintains a dynamic heap of tasks that are ready for execution. The threads dynamically pick tasks from the heap and add more tasks as they become available (a sketch of such a ready-task pool is given below). Further discussion of the SMP-parallel component of the factorization can be found in [5].

With the exception of the root node, a typical task in WSMP's multifrontal factorization involves partial factorization and update of a dense rectangular frontal matrix and distributing the updated part of the matrix among the parents of the node corresponding to the factored frontal matrix. The root node involves full factorization of a dense square matrix. Depending on its size and location relative to the root, a task may be mapped onto one or multiple nodes. When mapped on a single node, a task may be performed by a single thread or by multiple threads. If required, appropriate shared-memory parallel dense linear algebra routines are employed for partial factorization and updates. When a task is mapped onto multiple processes, the group of processes on which the task is mapped is viewed as a virtual grid. The virtual grid can be one-dimensional or two-dimensional, depending on the number of processes. In the current implementation, the grids are chosen such that they are one-dimensional with multiple columns for fewer than 6 processes and are two-dimensional for 6 or more processes. Moreover, the number of process rows in all grids is always a power of 2. Not all grid sizes are allowed. For example, a 5-process grid is 1 × 5, a 6-process grid is 2 × 3, a 7-process grid is not allowed, and an 8-process grid is 2 × 4. The root node is always mapped onto the largest permissible grid with a number of processes less than or equal to the total number of MPI processes that the program is running on. As we move away from the root node in the data-DAG and as more tasks can be executed in parallel, the process grids onto which these tasks are mapped become progressively smaller.
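As an illustration of the per-process task pool mentioned above, here is a small C++ sketch of a thread-safe heap of ready tasks. The Task fields and the priority rule (largest estimated work first) are assumptions made for the example; WSMP's actual data structures and scheduling policy are not described at this level of detail.

#include <condition_variable>
#include <mutex>
#include <queue>
#include <vector>

struct Task { int node; double estimatedWork; };          // fields are illustrative
struct LessWork {
    bool operator()(const Task& a, const Task& b) const {
        return a.estimatedWork < b.estimatedWork;         // pick heavier tasks first
    }
};

// Shared by all threads of one MPI process: threads pop ready tasks and push
// new tasks as their dependencies complete.
class ReadyTaskHeap {
    std::priority_queue<Task, std::vector<Task>, LessWork> heap_;
    std::mutex mutex_;
    std::condition_variable cv_;
    bool done_ = false;
public:
    void push(const Task& t) {
        { std::lock_guard<std::mutex> lock(mutex_); heap_.push(t); }
        cv_.notify_one();
    }
    // Blocks until a task is available; returns false after shutdown() when empty.
    bool pop(Task& out) {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return done_ || !heap_.empty(); });
        if (heap_.empty()) return false;
        out = heap_.top();
        heap_.pop();
        return true;
    }
    void shutdown() {
        { std::lock_guard<std::mutex> lock(mutex_); done_ = true; }
        cv_.notify_all();
    }
};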
[Figure omitted. Its four panels show frontal matrices (factor block, L and U panels, and the update part) laid out over process grids:
(a) a partial factor and update task mapped onto a 3-process one-dimensional grid;
(b) a partial factor and update task mapped onto a 6-process two-dimensional grid;
(c) a factorization task corresponding to the root node mapped onto a 3-process one-dimensional grid;
(d) a factorization task corresponding to the root node mapped onto a 6-process two-dimensional grid.]
Fig. 1. The four scenarios for mapping frontal matrices onto process grids
Eventually, the tasks are mapped onto single processes. In addition to a serial/multithreaded partial factorization kernel, four types of message-passing parallel factorization kernels are employed for the four scenarios for mapping frontal matrices onto process grids, as shown in Figure 1. Efficient implementation of pivoting for numerical stability is a key requirement of these kernels. With a non-root frontal matrix mapped onto a one-dimensional grid (Figure 1(a)), if a process can find a pivot among its columns, then no communication is required for pivoting. Otherwise, a column interchange involving communication with another
process is required. When a non-root frontal matrix is mapped onto a two-dimensional grid (Figure 1(b)), then finding a pivot may require all processes in a column of the grid to communicate to find the pivot. That is the reason why the process grids have fewer rows than columns and the number of rows is a power of two so that fast logarithmic time communication patterns can be used for pivot searches along columns. Furthermore, the WSMP algorithm avoids this communication at each pivot step by exchanging data corresponding to more than one column in each communication step. The frontal matrix corresponding to the root supernode never requires column interchanges because this matrix is factored completely. Therefore, pivoting is always free of communication for this matrix on a one-dimensional grid (Figure 1(c)). On a two-dimensional grid (Figure 1(d)), pivoting on the root supernode involves communication among the processes along process columns. Once a pivot block has been factored, it is communicated along the pivot row and the pivot column, which are updated and are then communicated along the columns and the rows of the grid, respectively, to update the remainder of the matrix. The computation then moves to the next pivot block in a pipelined manner and continues until the supernode has been factored completely or no more pivots can be found. Finally, the update matrix is distributed among the parent supernodes for inclusion into their frontal matrices.
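A generic sketch of such a logarithmic-time pivot search along one column of the process grid is shown below, using MPI's MAXLOC reduction. This is not WSMP's actual kernel, and the candidate-selection details are assumptions made for the example.

#include <mpi.h>
#include <cmath>
#include <cstdio>

struct Candidate { double magnitude; int rank; };   // layout matches MPI_DOUBLE_INT

// All processes in columnComm agree on which of them holds the best candidate
// pivot; MPI_Allreduce over the column takes logarithmic time in its height.
int pivotOwner(double localCandidate, MPI_Comm columnComm) {
    int rank;
    MPI_Comm_rank(columnComm, &rank);
    Candidate in  = { std::fabs(localCandidate), rank };
    Candidate out = { 0.0, 0 };
    MPI_Allreduce(&in, &out, 1, MPI_DOUBLE_INT, MPI_MAXLOC, columnComm);
    return out.rank;            // every process learns the owner of the pivot
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    // Toy example: the whole communicator plays the role of one grid column.
    double candidate = 1.0 + rank;
    int owner = pivotOwner(candidate, MPI_COMM_WORLD);
    if (rank == owner) std::printf("rank %d owns the pivot\n", rank);
    MPI_Finalize();
    return 0;
}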
3 Performance Comparison
In this section, we compare the performance of WSMP with that of MUMPS [2] (version 4.3.2) and SuperLU_DIST [10] (version 2.0) on a suite of 24 test matrices on one to eight 1.45 GHz Power4+ CPUs of an IBM p650 SMP computer. All codes were compiled in 32-bit mode and each process had access to 2 GB of memory. The MP_SHARED_MEMORY environment variable was set to yes to enable MPI to take advantage of shared memory for passing messages. All codes were used with their default options, including their default fill-reducing ordering. WSMP uses its own nested-dissection based ordering, MUMPS uses a nested-dissection based ordering software called PORD [11], and SuperLU_DIST uses the approximate minimum degree algorithm [1]. All three packages use row pre-permutation to permute large entries of the matrix to the diagonal [4]. A detailed description of the various algorithms and features of these packages can be found in [8].

Figures 2 and 3 show a comparison of the times taken by WSMP and MUMPS to solve a single system of equations on 1, 2, 4, and 8 CPUs with medium to large general sparse coefficient matrices. In each case, the total solution time of WSMP is considered to be one unit and all other times are normalized with respect to this. Each bar is further divided into a solid and an empty portion depicting the portions of the total time spent in factorization and triangular solves. The solution phase depicted by the hollow portion of the bars includes the iterative refinement time to reduce the residual of the backward error to 10^{-15}. Some bars are missing for some of the largest matrices because the program could not complete due to insufficient memory. Figure 4 shows a similar comparison between WSMP and SuperLU_DIST, but contains bars corresponding to the latter only. Since MUMPS and SuperLU_DIST are designed to work only in distributed-memory
parallel mode, all three codes are run with a single thread on each process for this comparison.

Figures 2 and 3 reveal that on a single CPU, MUMPS typically takes between 1 and 2 times the time that WSMP takes to solve a system of equations. However, WSMP appears to be more scalable. With very few exceptions, such as lhr71c and twotone, the bars corresponding to MUMPS tend to grow taller as the number of CPUs increases. Both codes use similar pre-ordering algorithms, and although individually different, the fill-in and operation count are comparable on average on a single CPU. As the number of CPUs increases, the operation count in WSMP tends to increase gradually due to added load-balancing constraints on the ordering. MUMPS' ordering, however, is identical on 1 to 4 CPUs, but shows a marked degradation on 8 (or more) CPUs.

The comparison of WSMP with SuperLU_DIST in Figure 4 reveals that SuperLU_DIST is slower by a factor ranging from 2 to 30 on a single CPU. However, it tends to achieve higher speedups, and the bars in Figure 4 tend to grow shorter as the number of CPUs increases. This is not surprising because SuperLU_DIST does not incur the communication overhead of dynamic pivoting, and it is always easier to obtain good speedups with a slow serial program than with a fast one. Despite the better speedups, with the exception of onetone1, WSMP was still significantly faster than SuperLU_DIST on up to 8 CPUs.

Figure 5 shows the speedup of WSMP on two categories of matrices: those that are more than 90% symmetric and those that are less than 33% symmetric. The speedup patterns for the nearly symmetric matrices appear quite similar, with all achieving a speedup between 4 and 5 on 8 CPUs. However, the speedup is more erratic for the highly unsymmetric matrices. In particular, very unstructured matrices arising in optimization problems achieve poor speedups because their task- and data-DAGs tend to have a large number of edges, with many of them connecting distant portions of the DAGs. This increases the overheads due to communication and load imbalance. There does not appear to be any identifiably best ratio of threads to processes in an 8 CPU configuration. The small variations in run times appear to be the result of differences in fill-in and DAG structures due to different orderings produced by a parallel nested-dissection code whose output is sensitive to the processor configuration.
4 Concluding Remarks and Future Work
In this paper, we introduce WSMP's shared- and distributed-memory parallel sparse direct solver for general matrices and compare its performance with that of MUMPS and SuperLU_DIST for 24 test matrices on 1, 2, 4, and 8 CPUs. Out of the 96 comparisons with MUMPS shown in Figures 2 and 3, there are only six cases where MUMPS is faster. Similarly, SuperLU_DIST is faster than WSMP in only two of the 96 cases. On average, WSMP appears to be significantly faster than both other distributed-memory parallel solvers, which are themselves among the best such solvers available. However, as discussed in Section 3, the parallel unsymmetric solver still needs improvement and tuning, particularly for very unsymmetric and unstructured matrices, such as those arising in linear programming problems.
[Bar chart omitted; matrices: bbmat, ecl32, eth-3dm, invextr1, mil094, mixtank, nasasrb, opti_andi, pre2, torso3, twotone, xenon2; bars for MUMPS (M) and WSMP (W) on 1, 2, 4, and 8 CPUs.]
Fig. 2. A comparison of the factor and solve times of WSMP and MUMPS for the 12 largest systems of the test suite on 1, 2, 4, and 8 CPUs. All times are normalized with respect to the time taken by WSMP. Furthermore, the times spent by both packages in the factorization and solve (including iterative refinement) phases are denoted by filled and empty bars, respectively
[Bar chart omitted; matrices: af23560, av41092, comp2c, eth-2dp, fidap011, fidapm11, lhr71c, onetone1, raefsky4, scircuit, venkat50, wang4; bars for MUMPS (M) and WSMP (W) on 1, 2, 4, and 8 CPUs.]
Fig. 3. A comparison of the factor and solve times of WSMP and MUMPS for the 12 smaller systems of the test suite on 1, 2, 4, and 8 CPUs. All times are normalized with respect to the time taken by WSMP. Furthermore, the times spent by both packages in the factorization and solve (including iterative refinement) phases are denoted by filled and empty bars, respectively
[Bar chart omitted, covering all 24 test matrices; some cases are marked "NF".]
Fig. 4. A comparison of the time taken by SuperLU_DIST to factor and solve 24 systems of the test suite on 1, 2, 4, and 8 CPUs with respect to the time taken by WSMP, which is assumed to be one unit in each case. The cases marked "NF" are those in which SuperLU_DIST generated an incorrect solution due to the absence of partial pivoting
[Bar chart omitted; matrices with nearly symmetric structure (eth-3dm, mil094, mixtank, para-8, poiss3Db, torso3, xenon2) and with very unsymmetric structure (av41092, bbmat, opti-andi, pre2, twotone), with their serial times listed alongside.]
Fig. 5. WSMP factor and solve times from 1 to 8 CPUs (normalized w.r.t. single CPU time). The 8-CPU results are presented for four different configurations t × p, where t is the number of threads and p is the number of MPI processes
References
1. Patrick R. Amestoy, Timothy A. Davis, and Iain S. Duff. An approximate minimum degree ordering algorithm. SIAM Journal on Matrix Analysis and Applications, 17(4):886–905, 1996.
2. Patrick R. Amestoy, Iain S. Duff, Jacko Koster, and J. Y. L'Excellent. A fully asynchronous multifrontal solver using distributed dynamic scheduling. SIAM Journal on Matrix Analysis and Applications, 23(1):15–41, 2001.
3. Timothy A. Davis and Iain S. Duff. An unsymmetric-pattern multifrontal method for sparse LU factorization. SIAM Journal on Matrix Analysis and Applications, 18(1):140–158, 1997.
4. Iain S. Duff and Jacko Koster. On algorithms for permuting large entries to the diagonal of a sparse matrix. SIAM Journal on Matrix Analysis and Applications, 22(4):973–996, 2001.
5. Anshul Gupta. A high-performance GEPP-based sparse solver. In Proceedings of PARCO, 2001. http://www.cs.umn.edu/~agupta/doc/parco-01.ps.
6. Anshul Gupta. Improved symbolic and numerical factorization algorithms for unsymmetric sparse matrices. SIAM Journal on Matrix Analysis and Applications, 24(2):529–552, 2002.
7. Anshul Gupta. WSMP: Watson sparse matrix package (Part II: direct solution of general sparse systems). Technical Report RC 21888 (98472), IBM T. J. Watson Research Center, Yorktown Heights, NY, November 20, 2000. http://www.cs.umn.edu/~agupta/wsmp.
8. Anshul Gupta. Recent advances in direct methods for solving unsymmetric sparse systems of linear equations. ACM Transactions on Mathematical Software, 28(3):301–324, September 2002.
9. Steven M. Hadfield. On the LU Factorization of Sequences of Identically Structured Sparse Matrices within a Distributed Memory Environment. PhD thesis, University of Florida, Gainesville, FL, 1994.
10. Xiaoye S. Li and James W. Demmel. SuperLU_DIST: A scalable distributed-memory sparse direct solver for unsymmetric linear systems. ACM Transactions on Mathematical Software, 29(2):110–140, 2003.
11. Jürgen Schulze. Towards a tighter coupling of bottom-up and top-down sparse matrix ordering methods. BIT Numerical Mathematics, 41(4):800–841, 2001.
Simple and Efficient Modifications of Elimination Orderings
Pinar Heggernes and Yngve Villanger
Department of Informatics, University of Bergen, N-5020 Bergen, Norway
{pinar,yngvev}@ii.uib.no
Abstract. We study the problem of modifying a given elimination ordering through local reorderings. We present new theoretical results on equivalent orderings, including a new characterization of such orderings. Based on these results, we define the notion of k-optimality for an elimination ordering, and we describe how to use this in a practical context to modify a given elimination ordering to obtain less fill. We experiment with different values of k, and report on the percentage of fill that is actually reduced from an already good initial ordering, like Minimum Degree.
1 Introduction
One of the most important and well studied problems related to sparse Cholesky factorization is to compute elimination orderings that give as few nonzero entries as possible in the resulting factors. In graph terminology, the problem can be equivalently modeled through the well known Elimination Game [10,13], where the input graph G corresponds to the nonzero structure of a given symmetric positive definite matrix A, and the output graph G+ corresponds to the Cholesky factor of A. The set of edges that is added to G to obtain G+ is called fill, and different output graphs with varying fill are produced depending on the elimination ordering in which the vertices are processed. Computing an ordering that minimizes the number of fill edges is an NP-hard problem [15], and hence various heuristics are proposed and widely used, like Minimum Degree, Nested Dissection, and combinations of these.

Sometimes, after a good elimination ordering with respect to fill is computed by some heuristic, it is desirable to do some modifications on this ordering in order to achieve better properties for other performance measures, like storage or parallelism, without increasing the size of the fill. Elimination orderings that result in the same filled graph are called equivalent orderings, and they can be used for this purpose [8]. In this paper, we prove some properties of equivalent orderings, and we give a new characterization of such orderings based on local permutations. These results are useful in practice when one seeks to modify a given ordering in order to group subsets of vertices close to or far from each other without changing the fill.

Based on the mentioned results, we then move on to orderings that are not equivalent to the given ordering, and we study how local permutations of consecutive vertices in a given elimination ordering can affect the resulting fill. We define the notion of k-optimality for an elimination ordering, based on vertex subsets of size k. We give
theoretical results on k-optimality, and we conclude with an algorithm that starts with any given ordering and makes it k-optimal for a chosen k. Finally, we implement a relaxed variant of this algorithm to exhibit its practical potential. For an efficient implementation, we use a data structure suitable for chordal graphs, called clique trees. Our tests show that even when we start with a good ordering, like Minimum Degree, a substantial amount of fill can be reduced by trying to make the given ordering 3-optimal.
2 Preliminaries
A graph is denoted by G = (V, E), where V is the set of vertices with |V| = n, and E, the set of edges with |E| = m, consists of unordered pairs of vertices. When a graph G is given, we will use V(G) and E(G) to denote the vertices and edges of G, respectively. The neighbors of a vertex v in G are denoted by the set N_G(v) = {u | uv ∈ E}, and the closed neighborhood of v is N_G[v] = N_G(v) ∪ {v}. For a given vertex subset A ⊆ V, G(A) denotes the graph induced by the vertices of A in G. A is a clique if G(A) is a complete graph. A vertex v is called simplicial if N_G(v) is a clique. A chord of a cycle is an edge connecting two non-consecutive vertices of the cycle. A graph is chordal if it contains no induced chordless cycle of length ≥ 4. A graph H = (V, E ∪ F) is called a triangulation of G = (V, E) if H is chordal.

An elimination ordering of G is a function α : V ↔ {1, 2, ..., n}. We will say that α(v) is the position or number of v in α, and also use the equivalent notation α = {v_1, v_2, ..., v_n}, meaning that α(v_i) = i. For the opposite direction we use α^{-1}(i) = v_i to find a vertex, if its number in the ordering is given and α = {v_1, v_2, ..., v_n}. The following algorithm simulates the production of fill in Gaussian elimination [10,13]:

Algorithm Elimination Game
Input: A graph G = (V, E) and an ordering α = {v_1, ..., v_n} of V.
Output: The filled graph G_α^+.
G_α^0 = G;
for i = 1 to n do
    Let F^i be the set of edges necessary to make N_{G_α^{i-1}}(v_i) a clique;
    Obtain G_α^i by adding the edges in F^i to G_α^{i-1} and removing v_i;
G_α^+ = (V, E ∪ (∪_{i=1}^{n} F^i));
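A direct, if naive, implementation of Elimination Game is sketched below (in C++, which is an assumption; the paper itself gives only pseudocode). It returns the edge set of the filled graph G_α^+.

#include <algorithm>
#include <cstdio>
#include <set>
#include <utility>
#include <vector>

// Vertices are 0..n-1; order lists v_1, ..., v_n. Returns E(G_alpha^+),
// i.e. the original edges together with all fill edges.
std::set<std::pair<int,int>> eliminationGame(int n,
        const std::vector<std::pair<int,int>>& edges,
        const std::vector<int>& order) {
    std::vector<std::set<int>> adj(n);
    std::set<std::pair<int,int>> filled;
    auto addEdge = [&](int a, int b) {
        adj[a].insert(b);
        adj[b].insert(a);
        filled.insert({std::min(a, b), std::max(a, b)});
    };
    for (const auto& e : edges) addEdge(e.first, e.second);
    for (int v : order) {
        std::vector<int> nbrs(adj[v].begin(), adj[v].end());
        for (std::size_t i = 0; i < nbrs.size(); ++i)       // make N(v) a clique
            for (std::size_t j = i + 1; j < nbrs.size(); ++j)
                addEdge(nbrs[i], nbrs[j]);
        for (int u : nbrs) adj[u].erase(v);                 // remove v from the graph
        adj[v].clear();
    }
    return filled;
}

int main() {
    // The 4-cycle 0-1-2-3-0 eliminated in the order 0,1,2,3 gets one fill edge {1,3}.
    auto filled = eliminationGame(4, {{0,1},{1,2},{2,3},{3,0}}, {0,1,2,3});
    std::printf("|E(G_alpha^+)| = %zu\n", filled.size());
    return 0;
}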
We will call G^i_α the i-th elimination graph. During Elimination Game, two vertices u and v that are not yet processed become indistinguishable after step i if N_{G^i_α}[u] = N_{G^i_α}[v], and they remain indistinguishable until one of them is eliminated [7]. Given G and α, we will say that u and v are indistinguishable if u and v are indistinguishable at step min{α(u), α(v)}, and otherwise we will say that they are distinguishable. The edges that are added to G during this algorithm are called fill edges, and uv is an edge in G^+_α if and only if uv ∈ E or there exists a path u, x_1, x_2, ..., x_k, v in G where α(x_i) < min{α(u), α(v)} for 1 ≤ i ≤ k [12]. If no fill edges are added during Elimination Game, then G^+_α = G and α is called a perfect elimination ordering (peo). This corresponds to repeatedly removing a simplicial vertex until the graph becomes empty. A graph is chordal if and only if it has a peo [5]; thus filled graphs are exactly the class of chordal graphs, and Elimination Game is a way of
producing triangulations of a given graph. Every chordal graph has at least two nonadjacent simplicial vertices [4], and hence any clique of size k in a chordal graph can be ordered consecutively with numbers n − k + 1, n − k + 2, ..., n in a peo. An elimination ordering is minimum if no other ordering can produce fewer fill edges. Ordering α is a minimal elimination ordering if no proper subgraph of G^+_α is a triangulation of G, and G^+_α is then called a minimal triangulation of G. While it is NP-hard to compute minimum fill [15], minimal fill can be computed in polynomial time [12]. Minimal fill can in general be far from minimum; however, algorithms exist to make a given elimination ordering minimal by removing edges from the filled graph [1,3,11]. For our purposes, when modifying a given elimination ordering to reduce fill, we will always start by making the given ordering minimal by removing the redundant edges in the filled graph using an appropriate existing algorithm. Two elimination orderings α and β are called equivalent if G^+_α = G^+_β [8]. If α is a minimal elimination ordering then every peo β of G^+_α is equivalent to α [1].

For an efficient implementation of the ideas presented in this paper, we need the notion of clique trees defined for chordal graphs [6]. A clique tree of a chordal graph G is a tree T, whose vertex set is the set of maximal cliques of G, that satisfies the following property: for every vertex v in G, the set of maximal cliques containing v induces a connected subtree of T. We will be working on a clique tree of G^+_α to identify groups of vertices that can be locally repermuted to give less fill. Clique trees can be computed in linear time [2]. A chordal graph has at most n maximal cliques [4], and thus a clique tree has O(n) vertices and edges. We refer to the vertices of T as tree nodes to distinguish them from the vertices of G. Each tree node of T is thus a maximal clique of G.
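The clique-tree machinery can be prototyped from two standard facts: the maximal cliques of a connected chordal graph can be read off a peo, and a maximum-weight spanning tree of the clique-intersection graph (weights equal to intersection sizes) is a clique tree. The Python sketch below is our own illustration of this route, not code from the paper, and the function names are ours.

def maximal_cliques_from_peo(adj, peo):
    pos = {v: i for i, v in enumerate(peo)}
    cand = []
    for v in peo:
        # v together with its higher-numbered neighbors forms a clique
        cand.append(frozenset({v} | {u for u in adj[v] if pos[u] > pos[v]}))
    # keep only the candidate cliques that are maximal
    return [C for C in cand if not any(C < D for D in cand)]

def clique_tree(cliques):
    """Maximum-weight spanning tree of the clique-intersection graph
    (weights = intersection sizes), built with Kruskal plus union-find."""
    parent = list(range(len(cliques)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    edges = sorted(((len(cliques[i] & cliques[j]), i, j)
                    for i in range(len(cliques))
                    for j in range(i + 1, len(cliques))), reverse=True)
    tree = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree.append((i, j))
    return tree

# Example: two triangles {a,b,c} and {b,c,d} sharing the edge bc.
adj = {"a": {"b", "c"}, "b": {"a", "c", "d"}, "c": {"a", "b", "d"}, "d": {"b", "c"}}
cliques = maximal_cliques_from_peo(adj, ["a", "d", "b", "c"])
print(cliques)                # the two maximal cliques
print(clique_tree(cliques))   # one tree edge joining them

For this small example the sketch finds the maximal cliques {a, b, c} and {b, c, d} and joins them by a single tree edge.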
3 Equivalent Orderings

Equivalent orderings of a given ordering were studied by Liu to achieve better properties for storage without increasing the amount of fill [8]. He used elimination tree rotations to compute appropriate equivalent orderings. Here, we will use repeated swapping of consecutive vertices.

Lemma 1. Given G and α, let u and v be two vertices in G satisfying α(u) = α(v) − 1 = i. Let β be the ordering that is obtained from α by swapping u and v, so that β(u) = α(v), β(v) = α(u), and β(x) = α(x) for all vertices x ∉ {u, v}. Then G^+_α = G^+_β if and only if u and v are non-adjacent in G^+_α or indistinguishable in α.

Proof. (If) If u and v are non-adjacent then it follows from Lemma 3.1 of [14] that G^+_α = G^+_β. Let us consider the case of indistinguishable vertices. The graph G^+_α is obtained by picking the next vertex v_j in α, making N_{G^{j-1}_α}(v_j) into a clique, and removing v_j from G^{j-1}_α to obtain G^j_α, as described in Elimination Game. This can be considered as making n different sets of vertices into cliques. For every vertex z ∈ V \ {u, v}, its neighborhood which is made into a clique in G^+_α is the same as its neighborhood which is made into a clique in G^+_β. This follows from Lemma 4 of [12], since the same set of vertices is eliminated prior to z in both orderings. Now we have to show that eliminating u and then v results in the exact same set of fill
edges as eliminating v and then u in the graph G^{i-1}_α = G^{i-1}_β. Vertices u and v are indistinguishable, so N_{G^{i-1}_α}[u] = N_{G^{i-1}_β}[v], which is made into a clique in both cases. Observe that N_{G^i_α}[v] ⊆ N_{G^{i-1}_α}[u] and that N_{G^i_β}[u] ⊆ N_{G^{i-1}_β}[v], thus making these sets into cliques will not introduce any new edges. Thus G^+_α = G^+_β if u and v are indistinguishable.

(Only if) We show that the same filled graph is not obtained by swapping u and v when u and v are adjacent and distinguishable. Since u and v are adjacent and distinguishable in α, we know that N_{G^{i-1}_α}[u] ≠ N_{G^{i-1}_α}[v]. Thus there exists a vertex x ∈ N_{G^{i-1}_α}[u] \ N_{G^{i-1}_α}[v] or there exists a vertex y ∈ N_{G^{i-1}_α}[v] \ N_{G^{i-1}_α}[u]. Observe also that N_{G^{i-1}_α}[u] = N_{G^{i-1}_β}[u] and that N_{G^{i-1}_α}[v] = N_{G^{i-1}_β}[v], since the exact same set of vertices has been eliminated before u and v in both α and β, and thus G^{i-1}_α = G^{i-1}_β. If vertex x exists, then it follows from Lemma 4 of [12] that edge xv ∈ G^+_α and xv ∉ G^+_β. If vertex y exists, then it also follows from Lemma 4 of [12] that edge yu ∈ G^+_β and yu ∉ G^+_α. Thus G^+_α ≠ G^+_β if u and v are adjacent and distinguishable.

Observe that if u and v are indistinguishable in α they are also indistinguishable in β. Note that similar results have been shown for the degrees of such consecutive vertices u and v [7]. Based on the above result, we give an interesting characterization of equivalent orderings in Theorem 1, which can be used as an algorithm to modify a given ordering and generate any desired equivalent ordering. For the proof of Theorem 1 we first need the following lemma.

Lemma 2. Let k < n be a fixed integer and let α and β be two equivalent orderings on G such that α^{-1}(i) = β^{-1}(i) for 1 ≤ i ≤ k. Let v be the vertex that satisfies β(v) = k + 1 and α(v) = k + d, d > 1, and let u be the predecessor of v in α, so that α(u) = α(v) − 1. Obtain a new ordering σ from α by only swapping u and v. Then G^+_σ = G^+_β.

Proof. Note first that since α and β coincide in the first k vertices, and v is the vertex numbered k + 1 in β, the position of v must be larger than k + 1 in α, as stated in the premise of the lemma. Note also for the rest of the proof that α(u) = k + d − 1 > k. We need to show that u and v are either non-adjacent in G^+_α = G^+_β or indistinguishable in α, as the result then follows immediately from Lemma 1. Assume on the contrary that uv ∈ E(G^+_α) = E(G^+_β) and that u and v are distinguishable in α. Thus there exists at least a vertex x ∈ N_{G^{k+d-1}_α}(u) \ N_{G^{k+d-1}_α}[v], or a vertex y ∈ N_{G^{k+d-1}_α}(v) \ N_{G^{k+d-1}_α}[u]. If vertex x exists, then x ∉ N_{G^{k+d-1}_α}(v), and thus x ∉ N_{G^k_α}(v) = N_{G^k_β}(v). It follows that vx ∉ E(G^+_β), since v is eliminated at step k + 1 and before u and x in β. This contradicts our assumption that G^+_α = G^+_β, because vx ∈ G^+_α since ux, uv ∈ E(G^+_α), and u is eliminated before v and x. If vertex y exists, then uy ∉ E(G^+_α), since y ∉ N_{G^{k+d-1}_α}(u), and u is eliminated before v and y in α. Vertex v is eliminated before y and u in β, and this gives the same contradiction as above, since uv and vy are contained in E(G^+_β), and thus uy ∈ E(G^+_β).

Regarding orderings β and σ in Lemma 2, note in particular that σ^{-1}(i) = β^{-1}(i) for 1 ≤ i ≤ k and σ(v) = k + d − 1. Thus β and σ coincide in the first k positions and
v has moved one position closer to its position in β. We are now ready to present our new characterization of equivalent orderings.

Theorem 1. Two orderings α and β on G are equivalent, i.e., G^+_α = G^+_β, if and only if β can be obtained by starting from α and repeatedly swapping pairs of consecutive non-adjacent or indistinguishable vertices.
Proof. The if part of the theorem follows directly from Lemma 1. We prove the only if part by showing that it is always possible to make two equivalent orderings identical by repeatedly swapping pairs of consecutive non-adjacent or indistinguishable vertices in one of them. For this we use the following inductive approach.

Induction hypothesis: Let α and β be two equivalent orderings on G. Modify α so that the k first vertices of α coincide with the k first vertices of β, for a k ∈ [0, n], and the remaining n − k vertices have the same local relative order as before the change. Then G^+_α = G^+_β.

Base case: When k = 0, α is unchanged and thus G^+_α = G^+_β.

Induction step: Given two equivalent orderings α and β on G such that the k first vertices in α coincide with the k first vertices in β, we have to show that, in α, one of the vertices in positions k + 1, k + 2, ..., n can be moved to position k + 1 so that the vertex in position k + 1 is the same in α and in β. We will show that this can be achieved by repeatedly swapping pairs of consecutive non-adjacent or indistinguishable vertices in positions k + 1, k + 2, ..., n of α. We use induction also for this purpose.

Induction hypothesis: Given two equivalent orderings α and β on G such that the k first vertices in α coincide with the k first vertices in β, let v be vertex number k + 1 in β, and let α(v) = k + d, for a d ∈ [1, n − k]. Move v so that α(v) = k + d − i, for an i ∈ [0, d − 1]. Then G^+_α = G^+_β.

Base case: Let i = 0; then α is unchanged and thus G^+_α = G^+_β.

Induction step: Given two equivalent orderings α and β on G such that the k first vertices in α coincide with the k first vertices in β, and vertex v which is numbered k + 1 in β has number k + d − i in α, it follows from Lemma 2 that v can be swapped with its predecessor in α and thus moved to position k + d − (i + 1) in α while G^+_α remains unchanged.

We have thus proved that the first vertex v of β which does not coincide with the vertex in the same position of α can be repeatedly swapped with its predecessor in α until it reaches the same position in α as in β without changing the filled graph, and that this can again be repeated for all vertices. By Lemma 1 all swapped pairs are non-adjacent or indistinguishable, which completes the proof.
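To make Lemma 1 and Theorem 1 concrete, the sketch below (our own illustration, reusing elimination_game from the earlier sketch) tests the swap condition at a given position and verifies on a small example that every allowed swap leaves the filled graph unchanged; elim_prefix and swap_allowed are hypothetical helper names.

def elim_prefix(vertices, edges, prefix):
    """Adjacency sets of the elimination graph after eliminating the vertices in prefix."""
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v); adj[v].add(u)
    for v in prefix:
        nbrs = list(adj[v])
        for i, x in enumerate(nbrs):
            for y in nbrs[i + 1:]:
                adj[x].add(y); adj[y].add(x)
        for x in nbrs:
            adj[x].discard(v)
        del adj[v]
    return adj

def swap_allowed(vertices, edges, alpha, i):
    """Lemma 1: swapping alpha[i] and alpha[i+1] keeps the filled graph exactly when
    they are non-adjacent in G^+_alpha or indistinguishable at step i+1."""
    u, v = alpha[i], alpha[i + 1]
    if frozenset((u, v)) not in elimination_game(vertices, edges, alpha):
        return True
    g = elim_prefix(vertices, edges, alpha[:i])
    return g[u] | {u} == g[v] | {v}

vertices, edges, alpha = "abcd", [("a","b"), ("b","c"), ("c","d"), ("d","a")], list("abcd")
for i in range(len(alpha) - 1):
    if swap_allowed(vertices, edges, alpha, i):
        beta = alpha[:]; beta[i], beta[i + 1] = beta[i + 1], beta[i]
        assert elimination_game(vertices, edges, beta) == elimination_game(vertices, edges, alpha)

For the 4-cycle example, only the swaps of (b, c) and (c, d) are allowed, and both indeed preserve the single fill edge bd.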
4 k-Optimal Elimination Orderings

In this section, we put the results of the previous section in a practical context with the purpose of modifying a given ordering α to reduce the resulting fill. As we have seen, swapping two consecutive non-adjacent or indistinguishable vertices of α does not change the fill. Consequently, if we want to swap consecutive vertices to reduce fill, these should be distinguishable and neighbors in G^+_α. The reason why we only consider vertices that appear consecutively in an elimination ordering is that it is then easy to do
a local computation to find the change in the number of fill edges, without having to recompute the whole filled graph. This saves time for practical implementations. The following lemma gives the change in the number of fill edges when two adjacent vertices are swapped.

Lemma 3. Given G and α, let uv be an edge of G^+_α such that α(u) = α(v) − 1 = i. Let β be the ordering that is obtained from α by swapping u and v. Then |E(G^+_α)| − |E(G^+_β)| = |N_{G^{i-1}_α}(u)| − |N_{G^{i-1}_α}(v)|.

Proof. Consider the elimination graph G^{i-1}_α. Since uv is an edge of G^+_α, it is also an edge of G^{i-1}_α. If we at step i eliminate u (which corresponds to ordering α), the set of fill edges added incident to v by u is N_{G^{i-1}_α}[u] \ N_{G^{i-1}_α}[v]. If, on the other hand, v is eliminated at step i (which corresponds to ordering β), the set of fill edges added incident to u by v is N_{G^{i-1}_α}[v] \ N_{G^{i-1}_α}[u]. It follows from Lemma 4 of [12] that no other fill is changed, since the set of vertices eliminated previous to every vertex z ∈ V \ {u, v} is the same for α and β. Thus the difference is |N_{G^{i-1}_α}[u]| − |N_{G^{i-1}_α}[v]| = |N_{G^{i-1}_α}(u)| − |N_{G^{i-1}_α}(v)|.
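In code, the gain of Lemma 3 is just a degree difference in the elimination graph G^{i-1}_α. The sketch below (ours, reusing elimination_game and elim_prefix from the earlier sketches) computes it and double-checks the formula by recomputing both filled graphs; the example graph is our own.

def swap_gain(vertices, edges, alpha, i):
    """|E(G^+_alpha)| - |E(G^+_beta)| for beta obtained by swapping positions i and i+1,
    assuming alpha[i] and alpha[i+1] are adjacent in G^+_alpha (Lemma 3)."""
    u, v = alpha[i], alpha[i + 1]
    g = elim_prefix(vertices, edges, alpha[:i])     # the elimination graph G^{i-1}
    return len(g[u]) - len(g[v])

vertices = "abcdef"
edges = [("a","b"), ("a","c"), ("a","d"), ("a","e"), ("b","f")]
alpha = list("abcdef")
gain = swap_gain(vertices, edges, alpha, 0)         # swap a (degree 4) and b (degree 2)
beta = ["b", "a"] + alpha[2:]
assert gain == len(elimination_game(vertices, edges, alpha)) - \
               len(elimination_game(vertices, edges, beta))
print(gain)    # 2: eliminating b before a saves two fill edges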
Although uv is an edge of G^+_α, it might be that u and v do not appear consecutively in α but appear consecutively in an equivalent ordering β. Since both α and β result in the same filled graph G^+_α, we want to be able to check all edges uv that appear consecutively in α or in an equivalent ordering β. Indeed, we will generalize this idea from 2 to k. Instead of only considering pairs of adjacent vertices, we will consider cliques K of size k whose vertices appear consecutively in the given ordering α or in an equivalent ordering β, and examine whether a local permutation of the vertices of K can reduce the resulting fill. Note that the vertices of any k-clique in a chordal graph can be ordered with numbers n − k + 1, n − k + 2, ..., n in a peo [13], and thus we can always find an equivalent ordering β in which the vertices of a given k-clique are ordered consecutively. As a consequence of the results of the previous section, β can be computed from α by swapping pairs of consecutive non-adjacent or indistinguishable vertices. However, since K is a clique in G^+_α, we will show that we are able to compute β more efficiently using a clique tree of G^+_α, rather than through this more general characterization.

Before that, we summarize the above ideas with a definition and an algorithm. We will always start by making the given ordering α minimal, removing unnecessary edges from G^+_α with an appropriate algorithm for this purpose. When α is minimal, we know that no ordering can result in a filled graph that is a proper subgraph of G^+_α. Thus reducing the number of fill edges further can only be achieved if some fill edges are removed and some new fill edges are introduced.

Definition 1. An elimination ordering α is k-optimal on G if obtaining an equivalent ordering β where the vertices of K appear consecutively, and then locally permuting the vertices of K, cannot result in less fill for any k-clique K in G^+_α.

Algorithm Minimal k-optimal ordering
Input: A graph G and an ordering α.
Output: A minimal elimination ordering β of G where |E(G^+_β)| ≤ |E(G^+_α)|.

β = α;
repeat
    Compute a minimal ordering σ such that E(G^+_σ) ⊆ E(G^+_β);
    β = Compute-k-optimal(G, σ);
until β is minimal and k-optimal
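As a deliberately simplified stand-in for Compute-k-optimal, the sketch below (ours, reusing elimination_game and swap_gain from the earlier sketches) greedily applies the fill-reducing swaps of Lemma 3 over consecutive positions until no swap helps, i.e., a relaxed 2-opt pass; the minimalization step of the real algorithm [1,3,11] is omitted.

def greedy_2opt(vertices, edges, alpha):
    """Greedy relaxation: repeatedly swap consecutive adjacent vertices when
    the swap strictly reduces the fill (Lemma 3)."""
    alpha = list(alpha)
    improved = True
    while improved:
        improved = False
        filled = elimination_game(vertices, edges, alpha)
        for i in range(len(alpha) - 1):
            u, v = alpha[i], alpha[i + 1]
            if frozenset((u, v)) in filled and swap_gain(vertices, edges, alpha, i) > 0:
                alpha[i], alpha[i + 1] = v, u       # eliminate the sparser vertex first
                filled = elimination_game(vertices, edges, alpha)
                improved = True
    return alpha

vertices = "abcdef"
edges = [("a","b"), ("a","c"), ("a","d"), ("a","e"), ("b","f")]
alpha = greedy_2opt(vertices, edges, list("abcdef"))
print(alpha, len(elimination_game(vertices, edges, alpha)) - len(edges))  # ordering and fill size

Each accepted swap strictly reduces the fill, so the pass terminates; on the small example it moves the high-degree vertex a towards the end of the ordering.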
The call Compute-k-optimal(G, σ) returns a k-optimal ordering based on σ. Note that computing a minimal ordering removes the unnecessary fill edges from the filled graph, and never adds new fill edges. Computing a k-optimal ordering, however, will both remove and add fill edges, and thus the resulting ordering after this step is not necessarily minimal. By the definition of k-optimality, this algorithm always produces an ordering β with less fill than that of α unless α is already minimal and k-optimal. Simple examples exist which show that a k-optimal ordering is not necessarily (k − 1)-optimal or (k + 1)-optimal. However, we suspect that when checking k-optimality, if we also check all maximal cliques in G^+_α of size less than k, then we can also guarantee (k − 1)-optimality. Note that the number of maximal cliques in G^+_α is O(n), and thus this extra work is substantially less than the work of checking all (both maximal and non-maximal) k-cliques that the definition requires. We leave this as an open question.

We now describe a practical implementation of a variant of the above algorithm to exhibit the potential of the proposed ideas. Given G and σ, in order to implement the function Compute-k-optimal(G, σ) correctly, we need to be able to do the following subtasks for each k-clique K of G^+_σ:

1. Compute every equivalent ordering where the vertices of K are ordered consecutively.
2. Try all possible local permutations of the vertices of K in each of these equivalent orderings.

Recall, however, that we need not consider k-cliques in which all vertices are pairwise indistinguishable, and we need not consider local permutations that result in equivalent orderings. Still, the remaining work is too much for a practical algorithm, and in the practical version we generate several but not all equivalent orderings in which the vertices of K are ordered consecutively. Furthermore, we only test a single local permutation of K; one that we suspect might give the highest fill reduction. For both of these subtasks, we use clique trees. The vertices of every k-clique K must appear together in at least one tree node of any clique tree of G^+_σ, and subsets of K that appear in other tree nodes will give us various local permutations that we will try. Then, using the clique tree, it is easy to compute orderings where the vertices of K appear together and follow the chosen permutations. Observe that every leaf in any clique tree of G^+_α contains a simplicial vertex, which can be eliminated as the first vertex in an equivalent ordering.

Let us consider how we construct equivalent orderings for a clique K. Let T be a clique tree of G^+_α. A subtree or a forest of subtrees remains when we remove every tree node in T containing K. One equivalent ordering is now created for each remaining subtree. Let T' be one such subtree, and let T'' be the rest of T when this subtree is removed. Let V' be the set of vertices of G contained in the tree nodes of T' and not contained in the tree nodes of T''. An equivalent ordering is now obtained by eliminating the vertices of V \ (V' ∪ K), then the vertices of K, and finally the vertices of V'. Within each of these three sets we eliminate the vertices in a simplicial order. There exists at least one vertex in K which is not adjacent to any vertex in V'. If this was not the case, then K would be contained in the tree node of T' which is adjacent in T to a tree node containing K, which is a contradiction since none of the tree nodes of T' contains K.
Eliminating such a vertex will make the remaining vertices of K into a clique. We want to avoid this for two reasons. The first is that we want to obtain a new graph, and the second is that we do not want to destroy the possibility of removing edges in K. A simple solution to this problem is to eliminate these vertices last when K is eliminated.
Fig. 1. The solid edges are the original edges of the given graph, while the dashed edges are fill edges resulting from the elimination process. In figure A we use the elimination ordering given by the alphabetic ordering of the vertices, which is minimal and also a Minimum Degree ordering. Consider the equivalent ordering {c, d, e, f, g, a, b, h, i, ..., q} and the clique {a, g}. Swapping a and g gives us a new ordering {c, d, e, f, a, g, b, h, i, ..., q} which contains one less fill edge, as shown in figure B. Thus the Minimum Degree ordering of A is minimal but not 2-optimal.

Fig. 2. Reduction in fill (in percent, compared to the Minimum Degree ordering) for the 2-opt, 3-opt, and 4-opt variants; the graphs are ordered by number as in Table 1.
So the new order is created as follows: eliminate the vertices of V \ (V' ∪ K) in a simplicial order, then eliminate the vertices of K that are adjacent to a vertex in V' in a random order, then eliminate the rest of the vertices of K in any order. Finally, eliminate the vertices of V' in a simplicial order. This will give us an ordering that introduces new fill edges, and possibly removes some.

In order to achieve good orderings, it is important to start the process with an ordering that already produces low fill. We have tested our implementation with Minimum Degree as the starting ordering. Minimum Degree orderings are in general neither minimal nor k-optimal for any k > 1, although they often tend to be minimal in practice [1,11]. However, even Minimum Degree orderings that are minimal need not be 2-optimal, as shown in Figure 1. Thus Minimum Degree is a good starting ordering for our purposes.

Table 1 and Figure 2 show the reductions in fill achieved by our implementation when trying to make an initial Minimum Degree ordering k-optimal for k ∈ {2, 3, 4} on a collection of sparse matrices downloaded from Matrix Market [9]. Columns 6, 8, and 10 of the table show the achieved reduction in the number of fill edges, whereas columns 7, 9, and 11 show the reduction in percentage. Figure 2 shows the reductions in percentage only. As can be seen, the reduction is substantial in several cases.
Table 1. We would like to remark that in order to avoid "lucky" reductions, we ran Minimum Degree on 5 random permutations of each graph, and chose the ordering that gave the lowest fill as the starting ordering

No  Name       # vertices  # edges  MD fill  2-opt  2-opt %  3-opt  3-opt %  4-opt  4-opt %
 1  494 bus           494      586      314      3     0.96      3     0.96      3     0.96
 2  ash292            292      958     1392      1     0.07     51     3.66    103     7.40
 3  bcspwr03          118      179       85      1     1.18      1     1.18      1     1.18
 4  bcspwr04          274      669      403     34     8.44     48    11.91     48    11.91
 5  bcspwr05          443      590      375      3     0.80      4     1.07      4     1.07
 6  bcsstk03          112      265       10      2    20.00      2    20.00      2    20.00
 7  bcsstk05          153     1135     1197      0     0        57     4.76     57     4.76
 8  bcsstk20          485     1328      412      2     0.49      2     0.49      2     0.49
 9  bcsstk22          138      282      263      2     0.76      6     2.28      6     2.28
10  can 144           144      576      225      0     0        10     4.44     10     4.44
11  can 161           161      608     1551      0     0        48     3.09     48     3.09
12  can 187           187      652     1277      0     0        15     1.17     49     3.84
13  can 229           229      774     2008      0     0         8     0.40     20     1.00
14  can 256           256     1330     1913      0     0        50     2.61     51     2.67
15  can 268           268     1407     2096      0     0        42     2.00     48     2.29
16  can 292           292     1124     1143      8     0.70     24     2.10     24     2.10
17  dwt 162           162      510      638      0     0        46     7.21     98    15.36
18  dwt 193           193     1650     2367      0     0        15     0.63     15     0.63
19  dwt 198           198      602      513     62    12.09     79    15.40     79    15.40
20  dwt 209           209      767     1102     44     4.00     72     6.53     79     7.17
21  dwt 221           221      704      849      0     0        42     4.95     51     6.01
22  dwt 234           234      306      204     19     9.31     19     9.31     19     9.31
23  dwt 245           245      608      610      4     0.66     22     3.61     24     3.93
24  dwt 307           307     1108     3421      1     0.03     22     0.64     35     1.02
25  dwt 310           310     1069     1920     22     1.15    110     5.73    119     6.20
26  dwt 346           346     1443     3516     15     0.43    317     9.02    338     9.61
27  dwt 361           361     1296     3366      0     0        55     1.63    108     3.21
28  dwt 419           419     1572     3342      0     0       176     5.27    267     8.00
29  lshp 265          265      744     2185      0     0        21     0.96     21     0.96
30  lshp 406          406     1155     4074      0     0        63     1.55     70     1.72
For 4 other graphs that we tested, bcsstk04, lund a, lund b, and nos1, no reduction was achieved for these k-values, and to save space we omit these graphs from the table and the figure. We would like to stress that the goal of our numerical experiments has been to exhibit the potential of the proposed algorithm to find good quality orderings with respect to fill.
We have not optimized the running time of our code, and thus we do not report the time for computing a 3-optimal or a 4-optimal ordering in comparison with the Minimum Degree algorithm, whose code has been thoroughly optimized over the last decades. Although computing 3-optimal and 4-optimal orderings is computationally heavy and requires more time than Minimum Degree, we believe that practical implementations using approximations are possible. This is an interesting topic for further research.
References

1. J. R. S. Blair, P. Heggernes, and J. A. Telle. A practical algorithm for making filled graphs minimal. Theor. Comput. Sci., 250:125–141, 2001.
2. J. R. S. Blair and B. W. Peyton. An introduction to chordal graphs and clique trees. In Sparse Matrix Computations: Graph Theory Issues and Algorithms, pages 1–30. Springer Verlag, 1993.
3. E. Dahlhaus. Minimal elimination ordering inside a given chordal graph. In Proceedings of WG 1997, pages 132–143. Springer Verlag, 1997. LNCS 1335.
4. G. A. Dirac. On rigid circuit graphs. Abh. Math. Sem. Univ. Hamburg, 25:71–76, 1961.
5. D. R. Fulkerson and O. A. Gross. Incidence matrices and interval graphs. Pacific Journal of Math., 15:835–855, 1965.
6. F. Gavril. The intersection graphs of subtrees in trees are exactly the chordal graphs. J. Combin. Theory Ser. B, 16:47–56, 1974.
7. J. A. George and J. W. H. Liu. Computer Solution of Large Sparse Positive Definite Systems. Prentice-Hall Inc., Englewood Cliffs, New Jersey, 1981.
8. J. W. H. Liu. Equivalent sparse matrix reorderings by elimination tree rotations. SIAM J. Sci. Stat. Comput., 9(3):424–444, 1988.
9. R. Boisvert, R. Pozo, K. Remington, B. Miller, and R. Lipman. NIST Matrix Market. http://math.nist.gov/MatrixMarket/.
10. S. Parter. The use of linear graphs in Gauss elimination. SIAM Review, 3:119–130, 1961.
11. B. W. Peyton. Minimal orderings revisited. SIAM J. Matrix Anal. Appl., 23(1):271–294, 2001.
12. D. Rose, R. E. Tarjan, and G. Lueker. Algorithmic aspects of vertex elimination on graphs. SIAM J. Comput., 5:146–160, 1976.
13. D. J. Rose. Triangulated graphs and the elimination process. J. Math. Anal. Appl., 32:597–609, 1970.
14. Y. Villanger. Lex M versus MCS-M. Reports in Informatics 261, University of Bergen, Norway, 2004.
15. M. Yannakakis. Computing the minimum fill-in is NP-complete. SIAM J. Alg. Disc. Meth., 2:77–79, 1981.
Optimization of a Statically Partitioned Hypermatrix Sparse Cholesky Factorization

José R. Herrero and Juan J. Navarro

Computer Architecture Department, Universitat Politècnica de Catalunya,
Jordi Girona 1-3, Mòdul D6, E-08034 Barcelona, Spain
{josepr,juanjo}@ac.upc.edu
Abstract. The sparse Cholesky factorization of some large matrices can require a two dimensional partitioning of the matrix. The sparse hypermatrix storage scheme produces a recursive 2D partitioning of a sparse matrix. The subblocks are stored as dense matrices so BLAS3 routines can be used. However, since we are dealing with sparse matrices some zeros may be stored in those dense blocks. The overhead introduced by the operations on zeros can become large and considerably degrade performance. In this paper we present an improvement to our sequential in-core implementation of a sparse Cholesky factorization based on a hypermatrix storage structure. We compare its performance with several codes and analyze the results.
1 Introduction

Sparse Cholesky factorization is heavily used in several application domains, including finite-element and linear programming algorithms. It forms a substantial proportion of the overall computation time incurred by those methods. Consequently, there has been great interest in improving its performance [8,20,22]. Methods have moved from column-oriented approaches into panel or block-oriented approaches. The former use level 1 BLAS while the latter have level 3 BLAS as computational kernels [22]. Operations are thus performed on blocks (submatrices). In this paper we address the optimization of the sparse Cholesky factorization of large matrices. For this purpose, we use a Hypermatrix [10] block data structure with static block sizes.

1.1 Background

Block sizes can be chosen either statically (fixed) or dynamically. In the former case, the matrix partition does not take into account the structure of the sparse matrix. In the latter case, information from the elimination tree [17] is used. Columns having similar structure are taken as a group. These column groups are called supernodes [18]. Some supernodes may be too large to fit in cache and it is advisable to split them into panels [20,22]. In other cases, supernodes can be too small to yield good performance. This is the case of supernodes with just a few columns. Level 1 BLAS routines are used in this case and the performance obtained is therefore poor.
This work was supported by the Ministerio de Ciencia y Tecnología of Spain and the EU FEDER funds (TIC2001-0995-C02-01).
This problem can be reduced by amalgamating several supernodes into a single larger one [1]. Although some null elements are then both stored and used for computation, the resulting use of level 3 BLAS routines often leads to some performance improvement.

1.2 Hypermatrix Representation of a Sparse Matrix

Sparse matrices are mostly composed of zeros but often have small dense blocks which have traditionally been exploited in order to improve performance [8]. Our approach uses a data structure based on a hypermatrix (HM) scheme [10,21]. The matrix is partitioned recursively into blocks of different sizes. The HM structure consists of N levels of submatrices. The top N−1 levels hold pointer matrices which point to the next lower level submatrices. Only the last (bottom) level holds data matrices. Data matrices are stored as dense matrices and operated on as such. Null pointers in pointer matrices indicate that the corresponding submatrix does not have any non-zero elements and is therefore unnecessary. Figure 1 shows a sparse matrix and a simple example of a corresponding hypermatrix with 2 levels of pointers.
Fig. 1. A sparse matrix and a corresponding hypermatrix
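A minimal two-level version of this layout can be sketched as follows (our own NumPy illustration, not the authors' implementation); the class name Hypermatrix2L and its methods are hypothetical.

import numpy as np

class Hypermatrix2L:
    """Minimal two-level hypermatrix sketch: a pointer matrix whose entries are
    either None (structurally zero block) or a dense rb x cb NumPy block."""
    def __init__(self, n_rows, n_cols, rb, cb):
        self.rb, self.cb = rb, cb
        self.nbr = (n_rows + rb - 1) // rb
        self.nbc = (n_cols + cb - 1) // cb
        self.blocks = [[None] * self.nbc for _ in range(self.nbr)]

    def insert(self, i, j, value):
        bi, bj = i // self.rb, j // self.cb
        if self.blocks[bi][bj] is None:            # allocate the dense block lazily
            self.blocks[bi][bj] = np.zeros((self.rb, self.cb))
        self.blocks[bi][bj][i % self.rb, j % self.cb] = value

    def to_dense(self):
        out = np.zeros((self.nbr * self.rb, self.nbc * self.cb))
        for bi, row in enumerate(self.blocks):
            for bj, blk in enumerate(row):
                if blk is not None:
                    out[bi*self.rb:(bi+1)*self.rb, bj*self.cb:(bj+1)*self.cb] = blk
        return out

H = Hypermatrix2L(8, 8, rb=4, cb=4)
H.insert(0, 0, 2.0)
H.insert(5, 6, -1.0)
print(sum(blk is not None for row in H.blocks for blk in row))  # 2 allocated blocks

Only the blocks that receive at least one entry are allocated; every other pointer stays None, which is what lets structurally zero blocks be skipped during the factorization.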
The main potential advantages of a HM structure over 1D data structures, such as the Compact Row Wise structure, are: the ease of use of multilevel blocks to adapt the computation to the underlying memory hierarchy; the operation on dense matrices; and greater opportunities for exploiting parallelism. A commercial package known as PERMAS uses the hypermatrix structure [3]. It can solve very large systems out-of-core and can work in parallel. However, the disadvantages of the hypermatrix structure, namely the storage of and computation on zeros, introduce a large overhead. This problem can arise either when a fixed partitioning is used or when supernodes are amalgamated. In [2] the authors reported that a variable size blocking was introduced to save storage and to speed up the parallel execution. In this way the HM was adapted to the sparse matrix being factored. The results presented in this paper, however, correspond to a static partitioning of the matrix into blocks of fixed sizes.

1.3 Machine and Matrix Characteristics

Execution took place on a 250 MHz MIPS R10000 processor. The first level instruction and data caches have size 32 Kbytes. There is a secondary unified instruction/data cache with size 4 Mbytes. This processor's theoretical peak performance is 500 Mflops.
Table 1. Matrix characteristics

Matrix      Dimension      NZs    NZs in L   Density  Flops to factor
GRIDGEN1       330430  3162757   130586943     0.002     278891960665
QAP8              912    14864      193228     0.463         63764878
QAP12            3192    77784     2091706     0.410       2228094940
QAP15            6330   192405     8755465     0.436      20454297601
RMFGEN1         28077   151557     6469394     0.016       6323333044
TRIPART1         4238    80846     1147857     0.127        511884159
TRIPART2        19781   400229     5917820     0.030       2926231696
TRIPART3        38881   973881    17806642     0.023      14058214618
TRIPART4        56869  2407504    76805463     0.047     187168204525
pds1             1561    12165       37339     0.030          1850179
pds10           18612   148038     3384640     0.019       2519913926
pds20           38726   319041    10739539     0.014      13128744819
pds30           57193   463732    18216426     0.011      26262856180
pds40           76771   629851    27672127     0.009      43807548523
pds50           95936   791087    36321636     0.007      61180807800
pds60          115312   956906    46377926     0.006      81447389930
pds70          133326  1100254    54795729     0.006     100023696013
pds80          149558  1216223    64148298     0.005     125002360050
pds90          164944  1320298    70140993     0.005     138765323993
We have used several test matrices. All of them are sparse matrices corresponding to linear programming problems. QAP matrices come from Netlib [19] while others come from a variety of linear multicommodity network flow generators: A Patient Distribution System (PDS) [5], with instances taken from [9]; RMFGEN [4]; GRIDGEN [16]; TRIPARTITE [11]. Table 1 shows the characteristics of several matrices obtained from such linear programming problems. Matrices were ordered with METIS [14] and renumbered by an elimination tree postorder.
2 Performance Issues

In this section we present two aspects of the work we have done to improve the performance of our sparse hypermatrix Cholesky factorization. Both of them are based on the fact that a matrix is divided into submatrices and operations are thus performed on blocks (submatrices). A matrix M is divided into submatrices of arbitrary size. We call M_{br_i, bc_j} the data submatrix in block-row br_i and block-column bc_j. Figure 2 shows 3 submatrices within a matrix. The highest cost within the Cholesky factorization process comes from the multiplication of data submatrices. In order to ease the explanation we will refer to the three matrices involved in a product as A, B and C. For block-rows br_1 and br_2 (with br_1 < br_2) and block-column bc_j, each of these blocks is A ≡ M_{br_2, bc_j}, B ≡ M_{br_1, bc_j} and C ≡ M_{br_2, br_1}. Thus, the operation performed is C = C − A × B^T, which means that submatrices A and B are used to produce an update on submatrix C.
Fig. 2. Blocks within a matrix (block-rows br_1 and br_2, block-columns bc_1 and bc_2, with the submatrices A, B and C marked).
2.1 Efficient Operation on Small Matrices

Choosing a block size for data submatrices is rather difficult. When operating on dense matrices, it is better to use large block sizes. On the other hand, the larger the block is, the more likely it is to contain zeros. Since computation with zeros is non-productive, performance can be degraded. Thus, a trade-off between performance on dense matrices and operations on non-zeros must be reached. In a previous paper [12], we explained how we could reduce the block size while improving performance. This was achieved by the use of a specialized set of routines which operate on small matrices of fixed size. By small matrices we mean matrices which fit in the first level cache. The basic idea used in producing this set of routines, which we call the Small Matrix Library (SML), is that of having dimensions and loop limits fixed at compilation time. For example, our matrix multiplication routines mxmts fix outperform the vendor's BLAS (version 7.3) dgemm routine (dgemm nts) for small matrices (fig. 3a) on an R10000 processor. We obtained similar results when comparing against the ATLAS [23] dgemm routine.
Fig. 3. a) Performance of different M × M^T routines for several matrix sizes. b) Factorization of matrix pds40: Mflops obtained by different M × M^T codes within HM Cholesky. Effective Mflops are reckoned excluding any operations on zeros.
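The fixed-dimension idea can be mimicked in a scripting prototype by generating a routine whose loop limits are constants before it is compiled; the generator below is our own illustration of that idea for the C = C − A × B^T update, not the actual SML or mxmts fix code, and the sizes in the example are only for demonstration.

def make_fixed_mxmts(m, n, k):
    """Generate and compile a specialized routine for C = C - A * B^T with
    all dimensions and loop limits fixed (A is m x k, B is n x k, C is m x n)."""
    lines = [f"def mxmts_{m}x{n}x{k}(A, B, C):"]
    lines.append(f"    for i in range({m}):")
    lines.append(f"        for j in range({n}):")
    lines.append("            s = 0.0")
    lines.append(f"            for p in range({k}):")
    lines.append("                s += A[i][p] * B[j][p]")
    lines.append("            C[i][j] -= s")
    namespace = {}
    exec("\n".join(lines), namespace)        # "compile" the specialized routine
    return namespace[f"mxmts_{m}x{n}x{k}"]

mxmts_4x4x32 = make_fixed_mxmts(4, 4, 32)    # operands are 4 x 32 blocks, result is 4 x 4

A = [[1.0] * 32 for _ in range(4)]
B = [[2.0] * 32 for _ in range(4)]
C = [[0.0] * 4 for _ in range(4)]
mxmts_4x4x32(A, B, C)
print(C[0][0])    # -64.0, i.e. C = C - A * B^T with fixed 4 x 4 x 32 limits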
The matrix multiplication routine used affects the performance of hypermatrix Cholesky factorization. This operation takes up most of the factorization time. We found that when using mxmts fix, a block size of 4 × 32 usually produced the best performance.
In order to illustrate this, figure 3b shows results of the HM Cholesky factorization on an R10000 for matrix pds40 [5]. The use of a fixed dimension matrix multiplication routine sped up our Cholesky factorization by an average of 12% for our test matrix set (table 1).

2.2 Reducing Overhead: Windows Within Data Submatrices
Let A and B be two off-diagonal submatrices in the same block-column of a matrix. At first glance, these matrices should always be multiplied since they belong to the same block-column. However, there are cases where it is not necessary. We are storing data submatrices as dense while the actual contents of the submatrix are not necessarily dense. Thus, the result of the product A × B^T can be zero. Such an operation will produce an update into some matrix C whenever there exists at least one column for which both matrices A and B have a non-zero element. Otherwise, if there are no such columns, the result will be zero. Consequently, that multiplication can be bypassed.
Fig. 4. a) A rectangular data submatrix and a window within it. b) Windows can reduce the number of operations.
In order to reduce the storage and computation of zero values, we define windows of non-zeros within data submatrices. Figure 4a shows a window of non-zero elements within a larger block. The window of non-zero elements is defined by its top-left and bottom-right corners. All zeros outside those limits are not used in the computations. Null elements within the window are still stored and computed. Storage of columns to the left of the window's leftmost column is avoided since all their elements are null. Similarly, we do not store columns to the right of the window's rightmost column. However, we do store zeros above the window's upper row and/or underneath its bottom row whenever the window's boundaries are different from the data submatrix boundaries, i.e., whole data submatrix columns are stored from the leftmost to the rightmost column in a window. We do this to have the same leading dimension for all data submatrices used in the hypermatrix. Thus, we can use our specialized SML routines which work on matrices with fixed leading dimensions. Actually, we extended our SML library with routines which have the leading dimensions of matrices fixed, while the loop limits may be given as parameters. Some of the routines have all loop limits fixed, while others have
only one or two of them fixed. Other routines have all the loop limits given as parameters. The appropriate routine is chosen at execution time depending on the windows involved in the operation. Thus, although zeros can be stored above or underneath a window, they are not used for computation. Zeros can still exist within the window but, in general, the overhead is greatly reduced. The use of windows of non-zero elements within blocks allows for a larger default hypermatrix block size. When blocks are quite full, operations performed on them can be rather efficient. However, in those cases where only a few non-zero elements are present in a block, or the intersection of windows results in a small area, it is necessary to compute only a subset of the total block (dark areas within figure 4b). When the column-wise intersection of the windows in matrices A and B is null, we can avoid the multiplication of these two matrices.
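The window bookkeeping can be prototyped with NumPy: compute each block's bounding window of nonzeros, and skip the A × B^T update whenever the column ranges of the two windows do not overlap. The sketch below is our own illustration of this test, not the implementation used in the paper.

import numpy as np

def window(block):
    """Return (top, bottom, left, right) of the nonzero window, or None if empty."""
    rows, cols = np.nonzero(block)
    if rows.size == 0:
        return None
    return rows.min(), rows.max(), cols.min(), cols.max()

def update(C, A, B):
    """C -= A @ B.T restricted to the window intersection; skipped entirely if the
    column-wise intersection of the two windows is empty."""
    wa, wb = window(A), window(B)
    if wa is None or wb is None:
        return False
    lo, hi = max(wa[2], wb[2]), min(wa[3], wb[3])
    if lo > hi:                     # no common nonzero column: the product is zero
        return False
    C[wa[0]:wa[1]+1, wb[0]:wb[1]+1] -= A[wa[0]:wa[1]+1, lo:hi+1] @ B[wb[0]:wb[1]+1, lo:hi+1].T
    return True

A = np.zeros((4, 32)); A[1, 3] = 2.0
B = np.zeros((4, 32)); B[2, 3] = 5.0
C = np.zeros((4, 4))
print(update(C, A, B), C[1, 2])     # True -10.0 (the column windows overlap at index 3)
B2 = np.zeros((4, 32)); B2[2, 20] = 5.0
print(update(C, A, B2))             # False: disjoint column windows, update skipped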
3 Results

Figure 5 shows the results obtained with several variants of our HM Cholesky code. In all cases the code follows a right looking scheduling with a fixed partitioning of the hypermatrix (the elimination tree is consequently not used at all). We use a 4 × 32 data matrix size as stated in section 2.1. The first and second bars allow for the evaluation of the usage of windows within data submatrices (in a 2D layout of such submatrices). The usage of windows clearly improves the performance of our sparse hypermatrix Cholesky algorithm. The second and third bars show the results of 2D and 1D data layouts and scheduling of the data submatrices (windows are used in both cases). The former uses upper levels in the HM with sizes 32 × 32 and 512 × 512. For the latter, we define only an upper level with size 32 × 32. A 2D data layout and scheduling of the computations is beneficial to the efficient computation of the Cholesky factorization of large sparse matrices.
Fig. 5. HM performance for several input matrices
Next, we present results obtained by four different sparse Cholesky factorization codes. Figure 6 shows the results obtained with each of them for the set of matrices introduced above. Matrix families are separated by dashed lines.
Fig. 6. Performance of several sparse Cholesky factorization codes
The first bar corresponds to a supernodal left-looking block Cholesky factorization (SN-LL (Ng-Peyton)) [20]. It takes as input parameters the cache size and unroll factor desired. This algorithm performs a 1D partitioning of the matrix. A supernode can be split into panels so that each panel fits in cache. This code has been widely used in several packages such as LIPSOL [24], PCx [7], IPM [6] or SparseM [15]. Although the test matrices we have used are in most cases very sparse, the number of elements per column is in some cases large enough that a few columns fill the first level cache. Thus, a one-dimensional partition of the input matrix produces poor results. As the problem size gets larger, performance degrades heavily. We noticed that, in some cases, we could improve results by specifying cache sizes that are a multiple of the actual first level cache size. However, performance degrades in all cases for large matrices.

The second and third bars correspond to sequential versions of the supernodal left-looking (SN-LL) and supernodal multifrontal (SN-MF) codes in the TAUCS package (version 2.2) [13]. In these codes the matrix is represented as a set of supernodes. The dense blocks within the supernodes are stored in a recursive data layout matching the dense block operations. The performance obtained by these codes is quite uniform.

Finally, the fourth bar shows the performance obtained by our right looking sparse hypermatrix Cholesky code (HM). We have used windows within data submatrices and SML [12] routines to improve our sparse matrix application based on hypermatrices. A fixed partitioning of the matrix has been used. We present results obtained for data submatrix sizes 4 × 32 and upper hypermatrix levels with sizes 32 × 32 and 512 × 512. We have included matrix pds1 to show that for small matrices the hypermatrix approach is usually very inefficient. This is due to the large overhead introduced by blocks
Fig. 7. Increase in number of operations in sparse HM Cholesky (4x32 + windows)
Fig. 8. HM flops per MxMt subroutine type
which have many zeros. Figure 7 shows the percentage increase in the number of flops in sparse HM Cholesky w.r.t. the minimum (using a Compact Sparse Row storage). For large matrices, however, blocks are quite dense and the overhead is much lower. Performance of HM Cholesky is then better than that obtained by the other algorithms tested. Note that the hypermatrix Cholesky factorization usually improves its performance as the problem size gets larger. There are two reasons for this. One is the better usage of the memory hierarchy: locality is properly exploited with the two-dimensional partitioning of the matrix, which is done in a recursive way using the HM structure. The other is that since blocks tend to be larger, more operations are performed by efficient M × M^T routines. Figure 8 shows the percentage of multiplication operations performed by each of the four M × M^T routines we use. We focus on matrix multiplication because it is, by far, the most expensive operation within the Cholesky factorization. FULL refers to the routine where all matrix dimensions are fixed. This is the most efficient of the four routines. WIN 1DC names the routine where windows are used for columns while WIN 1DR indicates the equivalent applied to rows. Finally, WIN 2D
denotes the case where windows are used for both columns and rows. Thus, for the latter, no dimensions are fixed at compilation time and it becomes the least efficient of all four routines. WIN 1DC computes matrix multiplications more efficiently than WIN 1DR because of the constant leading dimension used for all three matrices. This is the reason why the performance obtained for matrices in the TRIPARTITE set is better than that obtained for matrices of similar size belonging to the PDS family. The performance obtained for the largest matrix in our test suite, namely matrix GRIDGEN1, is less than half of the theoretical peak for the machine used. This is basically due to the dispersion of data in this matrix, which leads to the usage of the FULL M × M^T routine less than 50% of the time. In addition, the least efficient routine WIN 2D is used to compute about 10% of the operations.
4 Future Work

The results presented here have been obtained by splitting the matrix into blocks of fixed size for each hypermatrix level. However, we can also create the hypermatrix structure using the information from the supernodal elimination tree. The results for low amalgamation values are very poor and have not been presented. However, we expect an improvement in the performance of our sparse hypermatrix Cholesky code with different (larger) amalgamation values. In the near future we will experiment with different amalgamation values and algorithms. We would also like to improve our code by avoiding the operation on very small matrices. We are currently studying the way to extend our code so that it can work with supernodes in addition to dense matrices in the last level of the hypermatrix (data submatrix level). Since our approach seems promising for large matrices we want to extend it to work out-of-core. We think that the hypermatrix scheme also provides a good platform for exploiting the higher levels of the memory hierarchy. Finally, we plan to use our code as the basis for a parallel implementation of the sparse Cholesky factorization.
5 Conclusions

A two dimensional partitioning of the matrix is necessary for large sparse matrices. The overhead introduced by storing zeros within dense data blocks in the hypermatrix scheme can be reduced by keeping information about a dense subset (window) within each data submatrix. Although some overhead still remains, the performance of our sparse hypermatrix Cholesky can be up to an order of magnitude better than that of a 1D supernodal block Cholesky (which tries to use the cache memory properly by splitting supernodes into panels). It also outperforms sequential supernodal left-looking and supernodal multifrontal routines based on a recursive data layout of blocks within the supernodes. Using windows and SML routines, our HM Cholesky often gets over half of the processor's peak performance for medium and large sparse matrices factored sequentially in-core.
References

1. C. Ashcraft and R. G. Grimes. The influence of relaxed supernode partitions on the multifrontal method. ACM Trans. Math. Software, 15:291–309, 1989.
2. M. Ast, C. Barrado, J. M. Cela, R. Fischer, O. Laborda, H. Manz, and U. Schulz. Sparse matrix structure for dynamic parallelisation efficiency. In Euro-Par 2000, LNCS 1900, pages 519–526, September 2000.
3. M. Ast, R. Fischer, H. Manz, and U. Schulz. PERMAS: User's reference manual, INTES publication no. 450, rev. d, 1997.
4. T. Badics. RMFGEN generator, 1991. Code available from ftp://dimacs.rutgers.edu/pub/netflow/generators/network/genrmf.
5. W. J. Carolan, J. E. Hill, J. L. Kennington, S. Niemi, and S. J. Wichmann. An empirical evaluation of the KORBX algorithms for military airlift applications. Oper. Res., 38:240–248, 1990.
6. J. Castro. A specialized interior-point algorithm for multicommodity network flows. SIAM Journal on Optimization, 10(3):852–877, September 2000.
7. J. Czyzyk, S. Mehrotra, M. Wagner, and S. J. Wright. PCx User's Guide (Version 1.1). Technical Report OTC 96/01, Evanston, IL 60208-3119, USA, 1997.
8. I. S. Duff. Full matrix techniques in sparse Gaussian elimination. In Numerical Analysis (Dundee, 1981), Lect. Notes in Math., 912:71–84. Springer, 1982.
9. A. Frangioni. Multicommodity Min Cost Flow problems. Data available from http://www.di.unipi.it/di/groups/optimize/Data/.
10. G. Von Fuchs, J. R. Roy, and E. Schrem. Hypermatrix solution of large sets of symmetric positive-definite linear equations. Comp. Meth. Appl. Mech. Eng., 1:197–216, 1972.
11. A. Goldberg, J. Oldham, S. Plotkin, and C. Stein. An implementation of a combinatorial approximation algorithm for minimum-cost multicommodity flow. In IPCO, 1998.
12. J. R. Herrero and J. J. Navarro. Improving Performance of Hypermatrix Cholesky Factorization. In Euro-Par 2003, LNCS 2790, pages 461–469. Springer, August 2003.
13. D. Irony, G. Shklarski, and S. Toledo. Parallel and fully recursive multifrontal sparse Cholesky. In ICCS 2002, LNCS 2330, pages 335–344. Springer, April 2002.
14. G. Karypis and V. Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. TR 95-035, Dep. Comp. Sci., U. of Minnesota, Oct. 1995.
15. R. Koenker and P. Ng. SparseM: A Sparse Matrix Package for R, 2003. http://cran.r-project.org/src/contrib/PACKAGES.html#SparseM.
16. Y. Lee and J. Orlin. GRIDGEN generator, 1991. Code available from ftp://dimacs.rutgers.edu/pub/netflow/generators/network/gridgen.
17. J. H. W. Liu. The role of elimination trees in sparse factorization. SIAM Journal on Matrix Analysis and Applications, 11(1):134–172, 1990.
18. J. W. Liu, E. G. Ng, and B. W. Peyton. On finding supernodes for sparse matrix computations. SIAM J. Matrix Anal. Appl., 14(1):242–252, January 1993.
19. NetLib. Linear programming problems. http://www.netlib.org/lp/.
20. E. G. Ng and B. W. Peyton. Block sparse Cholesky algorithms on advanced uniprocessor computers. SIAM J. Sci. Comput., 14(5):1034–1056, 1993.
21. A. Noor and S. Voigt. Hypermatrix scheme for the STAR-100 computer. Computers and Structures, 5:287–296, 1975.
22. E. Rothberg. Performance of panel and block approaches to sparse Cholesky factorization on the iPSC/860 and Paragon multicomputers. SIAM J. Sci. Comput., 17(3):699–713, 1996.
23. R. C. Whaley and J. J. Dongarra. Automatically tuned linear algebra software. In Supercomputing '98, pages 211–217. IEEE Computer Society, Nov. 1998.
24. Y. Zhang. Solving large-scale linear programs by interior-point methods under the MATLAB environment. TR 96-01, Baltimore, MD 21228-5398, USA, 1996.
Maximum-Weighted Matching Strategies and the Application to Symmetric Indefinite Systems

Stefan Röllin (1) and Olaf Schenk (2)

(1) Integrated Systems Laboratory, Swiss Federal Institute of Technology Zurich, Gloriastrasse 35, CH-8092 Zurich, Switzerland
[email protected], http://www.iis.ee.ethz.ch
(2) Department of Computer Science, University of Basel, Klingelbergstrasse 50, CH-4056 Basel, Switzerland
[email protected], http://www.informatik.unibas.ch/personen/schenk o.html
Abstract. The problem of finding good numerical preprocessing methods for the solution of symmetric indefinite systems is considered. Special emphasis is put on symmetric maximum-weighted matching strategies. The aim is to permute large elements of the matrix to diagonal blocks. Several variants for the block sizes are examined and the accuracies of the solutions are compared. It is shown that maximum-weighted matchings improve the accuracy of sparse direct linear solvers. The use of a strategy called FULL CYCLES results in an accurate and reliable factorization. Numerical experiments validate these conclusions.
1 Introduction

It is now commonly known that maximum-weighted bipartite matching algorithms [5] – developed to place large entries on the diagonal using nonsymmetric permutations and scalings – greatly enhance the reliability of linear solvers [1,5] for nonsymmetric linear systems. The goal is to transform the coefficient matrix A with diagonal scaling matrices D_r and D_c and a permutation matrix P_r so as to obtain an equivalent system with a matrix P_r D_r A D_c that has nonzeros on the diagonal and is better scaled and more diagonally dominant. This preprocessing has a beneficial impact on the accuracy of the solver and it also reduces the need for partial pivoting, thereby speeding up the factorization process. These maximum-weighted matchings are also applicable to symmetric indefinite systems, but the symmetry of the indefinite systems is lost and therefore the factorization would be more expensive in both time and required memory. Nevertheless, modified weighted matchings can be used for these systems. The goal is to preserve symmetry and to define a permutation such that the large elements of the permuted matrix lie on diagonal blocks (not necessarily on the diagonal as for nonsymmetric matrices). This technique based on a numerical criterion was first presented by Gilbert and Duff [4] and further extended using a structural metric by Duff and Pralet [6].
This work was supported by the Swiss Commission of Technology and Innovation under contract number 7036.1 ENS-ES, and the Strategic Excellence Positions on Computational Science and Engineering of the Swiss Federal Institute of Technology, Zurich.
We will use these matchings as an additional preprocessing method for the direct solver PARDISO [10]. This solver employs a combination of left- and right-looking Level 3 BLAS supernode techniques including a Bunch and Kaufman pivoting strategy restricted to a static structure of diagonal blocks corresponding to single supernodes.
2 Matching Strategies for Symmetric Indefinite Systems

We are looking for a symmetric permutation matrix P such that the large elements of A lie in p × p diagonal blocks of the matrix P A P^T. The large elements do not necessarily lie on the diagonal as in the nonsymmetric case. As a first step, similar algorithms [5] as in the nonsymmetric case are used to compute the permutation matrix P_r as well as the diagonal scaling matrices D_r and D_c, which are used later to define a symmetric scaling matrix. The permutation corresponding to P_r can be broken up into disjoint cycles c_1, ..., c_k. The length of the cycles can be arbitrary, but they sum up to the dimension of A. These cycles are used to define a symmetric permutation σ = (c_1, ..., c_k), which already yields the permutation matrix P we are looking for. The matrix P A P^T will have k blocks of size |c_1|, ..., |c_k| on the diagonal, if the adjacency structure of the nodes corresponding to a cycle is extended to a clique. In the factorization, the rows and columns corresponding to a diagonal block are treated together. Large diagonal blocks will result in large fill-in.

A first possibility is to keep the k blocks and to put up with the fill-in. Another strategy is to break up long cycles of the permutation P_r to avoid large diagonal blocks. One has to distinguish between cycles of even and odd length. It is possible to break up even cycles into cycles of length two without changing the weight of the matching, due to the symmetry of the matrix. For each cycle, there are two possibilities to break it up. We use a structural metric [6] to decide which one to take. The same metric is also used for cycles of odd length, but the situation is slightly different. Cycles of length 2k + 1 can be broken up into k cycles of length two and one cycle of length one. There are 2k + 1 different possibilities to do this. The resulting 2 × 2 blocks will contain the matched elements. However, there is no guarantee that the diagonal element corresponding to the cycle of length one will be nonzero. Another possibility is therefore to have k − 1 cycles of length two and one cycle of length three, since there will be more freedom to choose an appropriate pivoting sequence in this case. Again, there are 2k + 1 different ways to do the division.

As in the nonsymmetric case, it is possible to scale the matrix such that the absolute value of the matched elements is one and the other elements are equal to or less than one in modulus. However, the original scalings cannot be used, since they are not symmetric, but they are used to define a symmetric scaling. The matrix P D_s A D_s P^T will have the desired properties if we define a symmetric diagonal scaling matrix D_s as follows: D_s = (D_r D_c)^{1/2}, where D_r and D_c are the diagonal matrices mentioned above. In order to keep the fill-in small during the factorization, the matrix is reordered further with MeTiS [9] before carrying out the factorization. The block structure we have described before is kept in mind during this reordering step, such that the large elements still lie in diagonal blocks. We refer to the first strategy in [10, section 4] for more information on this topic.
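The cycle manipulation described above can be prototyped in a few lines: extract the disjoint cycles of the matching permutation, break every cycle longer than two into 2 × 2 blocks plus one 1 × 1 (or 3 × 3) block, and form the symmetric scaling D_s = (D_r D_c)^{1/2}. The Python sketch below is our own illustration; in particular, the structural metric of [6] used to choose among the possible splits is replaced here by simply taking the first admissible one.

import numpy as np

def cycles_of(perm):
    """Disjoint cycles of the permutation perm (perm[i] = image of i)."""
    seen, cycles = set(), []
    for start in range(len(perm)):
        if start in seen:
            continue
        c, j = [], start
        while j not in seen:
            seen.add(j); c.append(j); j = perm[j]
        cycles.append(c)
    return cycles

def break_cycles(cycles, odd="1x1"):
    """Split long cycles into 2x2 blocks plus one 1x1 (or 3x3) block for odd length.
    The first admissible split is taken; a structural metric could choose among
    the alternatives instead."""
    blocks = []
    for c in cycles:
        if len(c) <= 2:
            blocks.append(c)
            continue
        rest = c
        if len(c) % 2 == 1 and odd == "3x3":
            blocks.append(c[:3]); rest = c[3:]
        elif len(c) % 2 == 1:
            blocks.append(c[:1]); rest = c[1:]
        blocks.extend(rest[i:i+2] for i in range(0, len(rest), 2))
    return blocks

def symmetric_scaling(dr, dc):
    return np.sqrt(np.asarray(dr) * np.asarray(dc))    # D_s = (D_r D_c)^(1/2)

perm = [2, 0, 1, 4, 3, 5]                   # cycles (0 2 1), (3 4), (5)
print(break_cycles(cycles_of(perm)))        # [[0], [2, 1], [3, 4], [5]]
print(symmetric_scaling([4.0, 1.0], [1.0, 9.0]))   # [2. 3.]

For the permutation with cycles (0 2 1), (3 4), (5), the 2 x 2/1 x 1 strategy yields the blocks [0], [2, 1], [3, 4], [5].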
Fig. 1. Distribution of the cycle lengths (cycles of length 1, 2, 3, and ≥ 4; rows covered by blocks, in %) for the tested matrices. Upper chart: matrices 1–30, lower chart: matrices 31–61.
In the numerical experiments in section 3, we have used 61 sparse symmetric indefinite matrices. These matrices are described by Gould and Scott [7], and we list our results in the same order as they do. Figure 1 shows how long the cycles are for these matrices if the cycles are not broken up. In theory, they could be arbitrarily long. In practice, however, most of the cycles c_i are of length one or two, and only a small fraction of all cycles are longer than two; 40% of all matrices have only cycles of length one and two. In [4,6], the size of the diagonal blocks has been limited to one or two for the factorization process with MA57 [3]. Here, we extend these strategies and choose the following block sizes:

1. Do not break up cycles of length larger than two. This will probably result in more fill-in, but allows more choices to perform the pivoting within the diagonal blocks; we therefore expect better accuracy. This strategy will be called FULL CYCLES.
2. Divide blocks larger than two into blocks of size one and two, as described above. It is possible that a block of size one contains a zero element, which can lead to a zero pivot. However, contrary to [6], we do not permute this block to the end, but treat it in the same way as all other blocks. We use the name 2x2/1x1 for this choice.
3. Divide an odd cycle into blocks of size two and one block of size three. We name this strategy 2x2/3x3.
4. Do not apply any kind of additional preprocessing based on symmetric matchings. This strategy will be called DEFAULT.
In addition, all methods can be combined with a symmetric diagonal scaling based on D_s = (D_r D_c)^{1/2}. Therefore, we can basically test eight strategies (all of them are combined with a fill-in reduction method based on MeTiS that honors the special cycle structure). The difference between the eight methods is how the diagonal blocks are chosen and whether the matrix is scaled or not. In total, we have tested PARDISO with these eight different preprocessing strategies in combination with the 61 symmetric indefinite matrices. All eight strategies are applied before performing the numerical factorization, for which we use the algorithm LDL^T-SBK described in [10]. This means that we use Bunch–Kaufman pivoting restricted to static data structures of diagonal supernode blocks in PARDISO. The coefficient matrix of the LDL^T factorization is perturbed whenever numerically acceptable pivots cannot be found within a diagonal supernode block.

As stated above, it would be possible to apply the nonsymmetric permutations and scalings directly to the symmetric matrices, but one would lose the symmetry for the factorization. Therefore, we have not tested a nonsymmetric permutation strategy. The strategy DEFAULT, SCALED deserves a special explanation: for this strategy, the preprocessing step is carried out only to compute the symmetric scaling matrix D_s, which is then used to scale the symmetric matrix. As a result, the largest absolute value in the matrix is one. This has an influence on the factorization, as we will see in the next section.
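For illustration, the following sketch (Python/SciPy; not part of PARDISO) shows how the symmetric permutation and the scaling D_s = (D_r D_c)^{1/2} could be applied to a sparse symmetric matrix once the nonsymmetric matching has delivered P_r, D_r and D_c. The names p, dr and dc are placeholders for those quantities.

import numpy as np
import scipy.sparse as sp

def symmetric_scale_and_permute(A, p, dr, dc):
    """Return P * Ds * A * Ds * P^T for a sparse symmetric matrix A.

    p      : symmetric permutation (concatenated cycles of the matching),
             given as an index array.
    dr, dc : diagonal entries of the row and column scalings from the
             nonsymmetric matching; Ds = sqrt(Dr*Dc) keeps the result symmetric.
    """
    ds = np.sqrt(dr * dc)                    # symmetric scaling D_s
    As = sp.diags(ds) @ A @ sp.diags(ds)     # Ds A Ds
    return As.tocsr()[p, :][:, p]            # P (Ds A Ds) P^T

# Usage sketch with random data of matching size n:
n = 5
A = sp.random(n, n, density=0.4, format='csr')
A = A + A.T                                  # make the test matrix symmetric
p = np.random.permutation(n)
dr, dc = np.random.rand(n) + 1.0, np.random.rand(n) + 1.0
B = symmetric_scale_and_permute(A, p, dr, dc)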
3 Numerical Experiments

For the computation of the symmetric permutation P and the scaling D_s, we need a nonsymmetric permutation P_r and two diagonal scaling matrices D_r and D_c. For the numerical experiments we have used our own algorithm, based on [8], to compute these matrices. We have also tested the algorithm MC64 from HSL (formerly known as the Harwell Subroutine Library); the results are qualitatively the same, and we therefore only comment on the results of our algorithm. The numerical experiments were conducted in sequential mode on a COMPAQ AlphaServer ES40 with 8 GByte of main memory and four CPUs running at 667 MHz.

A successful factorization does not mean that the computed solution is satisfactory; e.g., the triangular solves could be unstable due to the numerical perturbations and thus give an inaccurate solution. We therefore use two different accuracy levels to decide whether the factorization and solution process is considered successful:

1. Standard accuracy: the factorization is considered successful if it did not fail and the second scaled residual ‖Ax − b‖/(‖A‖ ‖x‖ + ‖b‖), obtained after one step of iterative refinement, is smaller than 10^-8.
2. High accuracy: the factorization is considered successful if the scaled residual ‖Ax − b‖/(‖A‖ ‖x‖ + ‖b‖) is smaller than 10^-14. Furthermore, a factorization is only considered successful if the residual does not grow during the iterative refinement step.

The aim of the high accuracy criterion is to see which strategy yields an accurate solution without the need for iterative refinement.
Table 1. Number of solved matrices out of 61 for the eight different preprocessing strategies. Columns two to four show the number of systems that are successfully solved with standard accuracy or high accuracy.

                          Standard accuracy 10^-8   High accuracy 10^-14
  Iterative refinement            no                   no        yes
  DEFAULT                         59                   41        53
  DEFAULT, SCALED                 59                   40        49
  FULL CYCLES                     59                   52        54
  FULL CYCLES, SCALED             61                   49        51
  2x2/1x1                         57                   47        52
  2x2/1x1, SCALED                 61                   48        53
  2x2/3x3                         57                   48        53
  2x2/3x3, SCALED                 61                   47        52
Furthermore, we check whether the solution process is stable by testing the convergence of the iterative refinement step.

We compare the different strategies using the performance profiling system presented in [2], which was also used in [7] to compare sparse direct solvers. Performance profiles are a convenient tool to compare a set of algorithms A on a set of matrices M. Each algorithm i ∈ A is run on each matrix j ∈ M, which gives a statistic s_ij. The statistic s_ij might be the time for the factorization, the required memory, or the norm of the scaled residuals; we assume that the smaller s_ij is, the better an algorithm performs. In case of a failure of the algorithm, we set s_ij = ∞. The performance profile p_i(α) of algorithm i ∈ A is defined as

  p_i(α) := |{ j ∈ M : s_ij ≤ α ŝ_j }| / |M| ,

where ŝ_j := min_{i ∈ A} s_ij denotes the best statistic for matrix j. In other words, in a performance profile the values on the y-axis indicate the fraction p(α) of all examples that can be solved within α times the best algorithm.

3.1 Standard Accuracy of the Scaled Residuals

From Table 1 we see that the symmetric strategies (FULL CYCLES, 2x2/1x1, 2x2/3x3) with scaling successfully solve all matrices with an accuracy smaller than 10^-8. Without a scaling, slightly fewer systems can be solved. The memory requirements, the time to compute the factorization, and the accuracy vary between the eight methods; we describe the differences in the following.

Figure 2 shows the CPU time profiles for the tested strategies, i.e. the time for the numerical factorization as well as the time for the overall solution. For the sake of clarity, we do not show the results for the two 2x2/3x3 strategies, since there are only minor differences between them and the results of the 2x2/1x1 strategies.
Fig. 2. Standard accuracy performance profile (p(α) versus α) for the CPU time of the numerical factorization, and of the complete solution including matching, analysis, factorization and solve.

Fig. 3. Standard accuracy performance profile of the solution before and after one step of iterative refinement. The profile shows the results for the first and second scaled residual up to α = 10^5.
The reason is that these strategies only treat those matrices differently that have odd cycles, and only seven matrices have such cycles. For these matrices, the results of 2x2/1x1 and 2x2/3x3 differ only very slightly, except for one matrix (matrix 11: BLOCKQP1); the difference for this matrix is expected, since it has almost only cycles of size three.

The default symmetric version of PARDISO appears to be the most effective solver: it solves all but two systems, and it has the smallest factorization time and the smallest total time for the complete solution. Comparing the profiles, it can also be seen that the factorization times of the FULL CYCLES methods are the slowest of all strategies. This configuration requires the most memory, since columns with different row structures are coupled together, resulting in larger fill-in during the factorization. Although 2x2/1x1 uses less memory and is faster than FULL CYCLES, it is still slower than both default strategies.

Figure 3 compares the first and the second scaled residual, the latter obtained after one step of iterative refinement. Here, the benefit of the preprocessing strategies becomes obvious. The most accurate solution before iterative refinement is provided by the strategy FULL CYCLES without scaling, followed closely by the other strategies.
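The two quantities compared in Figure 3, the scaled residual before and after one step of iterative refinement, can be reproduced in principle with a few lines around any factorization. The sketch below (Python/SciPy, with a generic sparse LU standing in for the LDL^T factorization of PARDISO, and the infinity norm as an assumed norm choice) shows the computation.

import numpy as np
import scipy.sparse.linalg as spla

def scaled_residual(A, x, b):
    """Scaled residual ||Ax - b|| / (||A|| ||x|| + ||b||), infinity norm assumed."""
    num = np.linalg.norm(A @ x - b, np.inf)
    den = spla.norm(A, np.inf) * np.linalg.norm(x, np.inf) + np.linalg.norm(b, np.inf)
    return num / den

def solve_with_one_refinement_step(A, b):
    """Solve Ax = b and perform one step of iterative refinement,
    returning the first and second scaled residuals."""
    lu = spla.splu(A.tocsc())           # stand-in for the LDL^T factorization
    x = lu.solve(b)
    res1 = scaled_residual(A, x, b)     # first scaled residual
    x = x + lu.solve(b - A @ x)         # one refinement step with the same factors
    res2 = scaled_residual(A, x, b)     # second scaled residual
    return x, res1, res2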
Fig. 4. Number of perturbed pivots (in %) during factorization for the tested matrices, for the strategies DEFAULT, FULL CYCLES and 2x2/1x1. Note the logarithmic scaling of the y-axis. If there is no symbol for a matrix, no perturbed pivots occurred during the factorization.
Interestingly, the accuracy of the strategies with a scaling is slightly worse than without a scaling, even though the former can solve more problems. The DEFAULT strategy gives the least accurate results. However, after one step of iterative refinement, the accuracy of all methods is of the same order. It can be concluded that, if the factorization time is of primary concern and one step of iterative refinement can be applied, the strategy DEFAULT is superior, since the accuracy of the scaled residuals is then of the same order.

3.2 High Accuracy of the Scaled Residuals

Up to now, the DEFAULT strategy of PARDISO seems to be the best choice for most of the matrices if the accuracy of the solution is not the first criterion. However, the situation changes if the requirements on the solution are tightened considerably and high accuracy without iterative refinement is demanded, as can be seen from column 3 in Table 1. No strategy is able to solve all matrices. The strategy FULL CYCLES is the most successful one: for 52 matrices, the first scaled residual is smaller than 10^-14. The other preprocessing strategies are only slightly inferior. The results for both default strategies are much worse: PARDISO with DEFAULT does not meet the accuracy requirements for 20 matrices, and DEFAULT, SCALED fails for 21 matrices.

That the use of blocks is beneficial is not surprising if the number of perturbed pivots during the factorization is considered, as depicted in Figure 4. In general, this number decreases when a strategy with blocks on the diagonal is used. Especially for those matrices that are not successfully solved by the DEFAULT strategy, the number of perturbed pivots decreases significantly with a FULL CYCLES or 2x2/1x1 strategy, resulting in a high accuracy of the solution without iterative refinement.
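Operationally, a perturbed pivot means that a pivot of too small magnitude inside the static diagonal block is replaced by a small threshold value and counted. The following schematic sketch (Python; an illustration of the idea only, not PARDISO's internal logic, and the threshold choice is an assumption) makes this explicit.

import numpy as np

def perturb_small_pivots(D, threshold):
    """Replace pivots of too small magnitude by +/- threshold and count them.

    D is the array of diagonal pivots encountered inside a supernode block;
    threshold would typically be eps * ||A|| for a machine-precision based eps.
    Schematic illustration of pivot perturbation, not any solver's actual code.
    """
    n_perturbed = 0
    for i, d in enumerate(D):
        if abs(d) < threshold:
            D[i] = threshold if d >= 0.0 else -threshold
            n_perturbed += 1
    return D, n_perturbed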
Fig. 5. Accuracy of the scaled residuals (from 10^-6 down to 10^-18) for all 61 matrices, using the DEFAULT and the FULL CYCLES strategy, each before and after one step of iterative refinement.
The situation changes if we look at the second scaled residuals, i.e. the residuals after one step of iterative refinement. The corresponding results are shown in the last column of Table 1. Although the FULL CYCLES strategy is still the best choice, there are only minor differences in the number of successful solves between the preprocessing strategies and the DEFAULT strategy. It can thus be concluded that the PARDISO default strategy with one step of iterative refinement generally provides highly accurate solutions; it solves 53 out of 61 matrices with an accuracy of 10^-14.

In Figure 5, a comparison of the accuracy of the first and second scaled residual for all 61 matrices is given. The figure shows the accuracy for the DEFAULT and the FULL CYCLES strategy before and after one step of iterative refinement. It is clearly visible that the DEFAULT strategy with one step of iterative refinement produces highly accurate solutions for almost all matrices.

In Figures 6 and 7, the performance profiles are given for the total CPU time without and with one step of iterative refinement. Since a highly accurate solution is sought, there are now more failures for all strategies. As stated before, we set the statistic for the performance profile (e.g. the factorization time) to infinity in the case of a failure; therefore, the performance profiles in Figures 2, 6 and 7 differ. The best choice is FULL CYCLES without scaling if accuracy is important and no iterative refinement can be performed. The strategy 2x2/1x1 is only slightly worse, but on average requires less memory for the factorization and solves the matrices faster. The fastest method that produces residuals with a similar accuracy is again the DEFAULT strategy using one step of iterative refinement.
Fig. 6. High accuracy performance profile for the CPU time of the numerical factorization, and for the time of the complete solution including matching, analysis, factorization and solve, without iterative refinement.

Fig. 7. High accuracy performance profile for the CPU time of the numerical factorization, and for the time of the complete solution including matching, analysis, factorization and solve, with one step of iterative refinement.
4 Conclusion

We have investigated maximum-weighted matching strategies to improve the accuracy of the solution of symmetric indefinite systems. We believe that the current default option in PARDISO is a good general-purpose solver [10]. It is based on restricting pivoting interchanges to diagonal blocks corresponding to single supernodes; the factor LDL^T is perturbed whenever numerically acceptable pivots cannot be found within a diagonal block, and the solution step is therefore performed with one step of iterative refinement. However, iterative refinement can become a serious cost in cases that involve a large number of solution steps. In such cases, matchings are advantageous, as they improve the accuracy of the solution without iterative refinement. We have demonstrated that the matching strategies are very effective, at the cost of a higher factorization time. Among the proposed strategies, FULL CYCLES is the most accurate one. Furthermore, our results also show that matchings are more important than scalings, and that these matching strategies improve the accuracy of the solution.
References

1. M. Benzi, J. C. Haws, and M. Tuma. Preconditioning highly indefinite and nonsymmetric matrices. SIAM J. Scientific Computing, 22(4):1333–1353, 2000.
2. E. D. Dolan and J. J. Moré. Benchmarking optimization software with performance profiles. Mathematical Programming, 91(2):201–213, 2002.
3. I. S. Duff. MA57 – a code for the solution of sparse symmetric definite and indefinite systems. ACM Transactions on Mathematical Software, 30(2):118–144, June 2004.
4. I. S. Duff and J. R. Gilbert. Maximum-weighted matching and block pivoting for symmetric indefinite systems. In Abstract book of Householder Symposium XV, June 17–21, 2002.
5. I. S. Duff and J. Koster. On algorithms for permuting large entries to the diagonal of a sparse matrix. SIAM J. Matrix Analysis and Applications, 22(4):973–996, 2001.
6. I. S. Duff and S. Pralet. Strategies for scaling and pivoting for sparse symmetric indefinite problems. Technical Report TR/PA/04/59, CERFACS, Toulouse, France, 2004.
7. N. I. M. Gould and J. A. Scott. A numerical evaluation of HSL packages for the direct solution of large sparse, symmetric linear systems of equations. ACM Transactions on Mathematical Software, 30(3):300–325, September 2004.
8. A. Gupta and L. Ying. On algorithms for finding maximum matchings in bipartite graphs. Technical Report RC 21576 (97320), IBM T. J. Watson Research Center, Yorktown Heights, NY, October 25, 1999.
9. G. Karypis and V. Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J. Scientific Computing, 20(1):359–392, 1998.
10. O. Schenk and K. Gärtner. On fast factorization pivoting methods for sparse symmetric indefinite systems. Technical Report CS-2004-004, Department of Computer Science, University of Basel, 2004. Submitted.
An Evaluation of Sparse Direct Symmetric Solvers: An Introduction and Preliminary Findings

Jennifer A. Scott1, Yifan Hu2, and Nicholas I.M. Gould1

1 Computational Science and Engineering Department, Rutherford Appleton Laboratory, Chilton, Oxfordshire, OX11 0QX, England, UK, {n.i.m.gould,j.a.scott}@rl.ac.uk
2 Wolfram Research, Inc., 100 Trade Center Drive, Champaign, IL 61820, USA, [email protected]
Abstract. In recent years a number of solvers for the direct solution of large sparse, symmetric linear systems of equations have been developed. These include solvers that are designed for the solution of positive-definite systems as well as those that are principally intended for solving indefinite problems. The available choice can make it difficult for users to know which solver is the most appropriate for their applications. In this study, we use performance profiles as a tool for evaluating and comparing the performance of serial sparse direct solvers on an extensive set of symmetric test problems taken from a range of practical applications.
1 Introduction

Solving linear systems of equations lies at the heart of numerous problems in computational science and engineering. In many cases, particularly when discretizing continuous problems, the system is large and the associated matrix A is sparse. Furthermore, for many applications the matrix is symmetric; sometimes, such as in finite-element applications, A is positive definite, while in other cases, including constrained optimization and problems involving conservation laws, it is indefinite. A direct method for solving a sparse linear system Ax = b involves the explicit factorization of the system matrix A (or, more usually, a permutation of A) into the product of lower and upper triangular matrices L and U. In the symmetric case, for positive definite problems U = L^T (Cholesky factorization) or, more generally, U = DL^T, where D is a block diagonal matrix with 1 × 1 and 2 × 2 blocks. Forward elimination followed by backward substitution completes the solution process for each given right-hand side b.

Direct methods are important because of their generality and robustness. Indeed, for the 'tough' linear systems arising from some applications, they are currently the only feasible solution methods. In many other cases, direct methods are often the method of choice because the difficulties involved in finding and computing a good preconditioner for an iterative method can outweigh the cost of using a direct method. Furthermore, direct methods provide an effective means of solving multiple
systems with the same A but different right-hand sides b, because the factorization needs only to be performed once.

Since the early 1990s, many new algorithms and a number of new software packages for solving sparse symmetric systems have been developed. Because a potential user may be bewildered by such choice, our intention is to compare the serial solvers (including serial versions of parallel solvers) on a significant set of large test examples from many different application areas. This study is an extension of a recent comparison by Gould and Scott [3] of sparse symmetric direct solvers in the mathematical software library HSL [7]. This earlier study concluded that the best general-purpose HSL package for solving sparse symmetric systems is currently MA57. Thus the only HSL direct solver included here is MA57, but for some classes of problems other HSL codes may be more appropriate; full details and results for the HSL symmetric solvers are given in [4].

The sparse solvers used in this study are listed in Table 1. The codes are discussed in detail in the forthcoming report [5]. Some of the solvers are freely available to academics, while to use others it is necessary to purchase a licence. This information is provided in Table 2; for each code a webpage address is also given (or, if no webpage is currently available, an email contact is provided). Note that for non-academic users, the conditions for obtaining and using a solver vary between the different packages.

Table 1. Solvers used in our numerical experiments. A '&' indicates both languages are used in the source code; 'F77/F90' indicates there is a F77 version and a F90 version.

  Code         Date/version      Language   Authors
  BCSLIB-EXT   11.2001, v4.1     F77        The Boeing Company
  MA57         01.2002, v1.0.0   F77/F90    I.S. Duff, HSL
  MUMPS        11.2003, v4.3.2   F90        P.R. Amestoy, I.S. Duff, J.-Y. L'Excellent, and J. Koster
  Oblio        12.2003, v0.7     C++        F. Dobrian and A. Pothen
  PARDISO      02.2004           F77 & C    O. Schenk and K. Gärtner
  SPOOLES      1999, v2.2        C          C. Ashcraft and R. Grimes
  SPRSBLKLLT   1997, v0.5        F77        E.G. Ng and B.W. Peyton
  TAUCS        08.2003, v2.2     C          S. Toledo
  UMFPACK      04.2003, v4.1     C          T. Davis
  WSMP         2003, v1.9.8      F90 & C    A. Gupta and M. Joshi, IBM
2 Test Environment

Our aim is to test the solvers on a wide range of problems from as many different application areas as possible. In collecting test data we imposed only two conditions:

– The matrix must be square and of order greater than 10,000.
– The data must be available to other users.
Table 2. Academic availability of and contact details for the solvers used in our numerical experiments. * denotes source code is provided in the distribution.

  Code          Free to academics   Webpage / email contact
  BCSLIB-EXT    ×                   www.boeing.com/phantom/BCSLIB-EXT/index.html
  MA57*         √                   www.cse.clrc.ac.uk/nag/hsl
  MUMPS*        √                   www.enseeiht.fr/lima/apo/MUMPS/
  Oblio*        √                   [email protected] or [email protected]
  PARDISO       √                   www.computational.unibas.ch/cs/scicomp/software/pardiso
  SPOOLES*      √                   www.netlib.org/linalg/spooles/spooles.2.2.html
  SPRSBLKLLT*   √                   [email protected]
  TAUCS*        √                   www.cs.tau.ac.il/~stoledo/taucs/
  UMFPACK*      √                   www.cise.ufl.edu/research/sparse/umfpack/
  WSMP          ×                   www-users.cs.umn.edu/~agupta/wsmp.html
The first condition was imposed because our interest is in large problems. The second condition ensures that our tests can be repeated by other users and, furthermore, enables other software developers to test their codes on the same set of examples and thus to make comparisons with other solvers. Our test set comprises 88 positive-definite problems and 61 numerically indefinite problems. Numerical experimentation found 5 of the indefinite problems to be structurally singular, and a number of others are highly ill-conditioned. Any matrix for which we only have the sparsity pattern available is included in the positive-definite set and appropriate numerical values are generated. Application areas represented by our test set include linear programming, structural engineering, computational fluid dynamics, acoustics, and financial modelling. A full list of the test problems together with a brief description of each is given in [6]. The problems are all available from

  ftp://ftp.numerical.rl.ac.uk/pub/matrices/symmetric

In this study, performance profiles are used as a means to evaluate and compare the performance of the solvers on our set T of test problems. Let S represent the set of solvers that we wish to compare. Suppose that a given solver i ∈ S reports a statistic s_ij ≥ 0 when run on example j from the test set T, and that the smaller this statistic, the better the solver is considered to be. For example, s_ij might be the CPU time required to solve problem j using solver i. For all problems j ∈ T, we want to compare the performance of solver i with the performance of the best solver in the set S. For j ∈ T, let ŝ_j = min{ s_ij ; i ∈ S }. Then for α ≥ 1 and each i ∈ S we define

  k(s_ij, ŝ_j, α) = 1 if s_ij ≤ α ŝ_j, and 0 otherwise.
The performance profile (see [1]) of solver i is then given by the function

  p_i(α) = ( Σ_{j ∈ T} k(s_ij, ŝ_j, α) ) / |T| ,   α ≥ 1.

Thus p_i(1) gives the fraction of the examples for which solver i is the most effective (according to the statistic s_ij), p_i(2) gives the fraction for which it is within a factor of 2 of the best, and lim_{α→∞} p_i(α) gives the fraction for which the algorithm succeeded. In this study, the statistics used are the CPU times required to perform the different phases of the solver, the number of nonzero entries in the matrix factor, and the total memory used by the solver (but in this paper, limitations on space allow us only to present CPU timings). Since some of the solvers we are examining are specifically designed for positive-definite problems (and may be unreliable, or even fail, on indefinite ones), we present our findings for the positive-definite and indefinite cases separately. In our tests, default values are used for all control parameters.
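Given a table of statistics s_ij with failures marked by infinity, the profiles p_i(α) can be computed directly from the definition. The short sketch below (Python/NumPy; a generic illustration rather than the profiling code of [1]) assumes that at least one solver succeeds on every problem.

import numpy as np

def performance_profile(stats, alphas):
    """stats[i, j]: statistic of solver i on problem j (np.inf for a failure).
    Returns p[i, k] = fraction of problems solved by solver i within
    alphas[k] times the best statistic for that problem."""
    best = np.min(stats, axis=0)                  # s_hat_j, best per problem
    ratios = stats / best                         # performance ratios (inf stays inf)
    p = np.array([[np.mean(ratios[i] <= a) for a in alphas]
                  for i in range(stats.shape[0])])
    return p

# Example: 3 solvers, 4 problems; solver 2 fails on the last problem.
stats = np.array([[1.0, 2.0, 3.0, 4.0],
                  [1.5, 1.0, 6.0, 4.0],
                  [2.0, 3.0, 1.0, np.inf]])
p = performance_profile(stats, alphas=np.linspace(1.0, 5.0, 9))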
3 Preliminary Results

The numerical results were obtained on a Compaq DS20 Alpha server with a pair of EV6 CPUs; in our experiments only a single processor with 3.6 GBytes of RAM was used. The codes were compiled with full optimisation; the vendor-supplied BLAS were used. All reported CPU times are in seconds. A CPU limit of 2 hours was imposed for each code on each problem; any code that had not completed after this time was recorded as having failed. The scaled residual ‖b − Ax‖/(‖A‖ ‖x‖ + ‖b‖) of each computed solution was checked before and after one step of iterative refinement; a residual after iterative refinement greater than 0.0001 causes an error to be flagged. In the reported tests, the input matrix A was not scaled.

3.1 Positive Definite Problems

The reliability of all the solvers in the positive-definite case was generally high. Only problem audikw was not solved by any code, this example being one of the two largest: it is of order roughly 950 thousand and involves some 39 million nonzeros; the solvers with no out-of-core facilities were not able to allocate sufficient memory, while the CPU time limit was exceeded for the remaining solvers.

We present the performance profile for the CPU time for the complete solution (that is, the CPU time for analysing, factorising and solving for a single right-hand side) in Figure 1, with the profiles for the separate analyse, factorise, and solve times in Figures 2 to 4. We see that, with the exception of SPOOLES and UMFPACK (which is primarily designed for unsymmetric problems), there is little to choose between the solvers when comparing the complete solution time. Many of the solvers use the nested dissection ordering from the METIS package [8] to obtain the pivot sequence and so they record similar analyse times. The multiple minimum degree algorithm used by SPRSBLKLLT is notably faster.
Fig. 1. Performance profile, p(α): CPU time (seconds) for the complete solution (positive-definite problems). Failure counts: BCSLIB-EXT 1, MA57 1, MUMPS 1, Oblio 2, PARDISO 1, SPOOLES 2, SPRSBLKLLT 1, TAUCS 2, UMFPACK 4, WSMP 1.

Fig. 2. Performance profile, p(α): CPU time (seconds) for the analyse phase (positive-definite problems).
Fig. 3. Performance profile, p(α): CPU time (seconds) for the factorization (positive-definite problems).

Fig. 4. Performance profile, p(α): CPU time (seconds) for the solve phase (positive-definite problems).
WSMP computes both a minimum local fill ordering and an ordering based on recursive bisection and selects the one that will result in the least fill-in. This extra investment pays dividends, with WSMP having the fastest factorise times. In some applications, many solves may be required following the factorisation; the codes BCSLIB-EXT, MA57, and PARDISO have the fastest solve times.

3.2 Indefinite Problems

We now turn to the indefinite test suite. For these problems, pivoting is needed to maintain stability. The pivoting strategies offered by the codes are summarised in Table 3; more details are given in [5] and the references therein. Since SPRSBLKLLT and the tested version of TAUCS are designed only for solving definite problems, they are omitted. We have experienced problems when using SPOOLES for some indefinite systems, and thus results for SPOOLES are not currently included. MUMPS uses 1 × 1 pivots chosen from the diagonal and the factorization terminates if all the remaining (uneliminated) diagonal entries are zero. Because this may mean some of our test problems are not solved, at the authors' suggestion we run both the symmetric and unsymmetric versions of MUMPS when testing indefinite examples.

Table 3. Default pivoting strategies

  BCSLIB-EXT   Numerical pivoting with 1 × 1 and 2 × 2 pivots.
  MA57         Numerical pivoting with 1 × 1 and 2 × 2 pivots.
  MUMPS        Numerical pivoting with 1 × 1 pivots.
  Oblio        Numerical pivoting with 1 × 1 and 2 × 2 pivots.
  PARDISO      Supernode Bunch–Kaufman within diagonal blocks.
  SPOOLES      Fast Bunch–Parlett.
  UMFPACK      Partial pivoting with preference for diagonal pivots.
  WSMP         No pivoting.
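For the codes that combine 1 × 1 and 2 × 2 pivots, the choice at each elimination step typically follows a variant of the classical Bunch–Kaufman test. The sketch below (Python/NumPy) shows that classical test for one step on a dense symmetric matrix; it illustrates the criterion only and is not the implementation used by any of the packages in Table 3.

import numpy as np

ALPHA = (1.0 + np.sqrt(17.0)) / 8.0        # ~0.6404, the Bunch-Kaufman constant

def bunch_kaufman_pivot(A, k):
    """Return ('1x1', k), ('1x1', r) or ('2x2', (k, r)) for elimination step k
    of the dense symmetric matrix A (classical Bunch-Kaufman partial pivoting)."""
    n = A.shape[0]
    if k == n - 1:
        return '1x1', k
    col = np.abs(A[k + 1:, k])
    lam = col.max()                        # largest off-diagonal in column k
    r = k + 1 + int(col.argmax())
    if lam == 0.0 or abs(A[k, k]) >= ALPHA * lam:
        return '1x1', k                    # A[k, k] is an acceptable 1x1 pivot
    # largest off-diagonal magnitude in column r, rows k..n-1, excluding A[r, r]
    sigma = np.abs(np.delete(A[k:, r], r - k)).max()
    if abs(A[k, k]) * sigma >= ALPHA * lam * lam:
        return '1x1', k
    if abs(A[r, r]) >= ALPHA * sigma:
        return '1x1', r                    # symmetric swap of k and r, then 1x1
    return '2x2', (k, r)                   # use the 2x2 pivot on rows/columns k, r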
The profiles for the indefinite results are given in Figures 5 to 8. The overall reliability of the solvers in the indefinite case was not as high as in the positive-definite one, with all the codes failing on some of the test problems. Because it does not include pivoting, WSMP had the highest number of failures. The majority of failures for the other codes were due to insufficient memory or the CPU time limit being exceeded or, in the case of symmetric MUMPS, to the absence of suitable diagonal pivots (singular matrices). The main exception was PARDISO, which had the fastest factorization times but for two problems produced solutions that were found to be inaccurate. We note that v1.0.0 of MA57 uses an approximate minimum degree ordering, and computing this is significantly faster than computing the METIS ordering; this accounts for the fast MA57 analyse times. MA57 also has the fastest solve times; the solve times for PARDISO are slower, even though it produces the sparsest factors, since by default its solve phase includes one step of iterative refinement. Overall, PARDISO was the fastest code on our set of indefinite problems.
Fig. 5. Performance profile, p(α): CPU time (seconds) for the complete solution (indefinite problems). Failure counts: BCSLIB-EXT 15, MA57 10, MUMPS 14, MUMPS-unsym 7, Oblio 5, PARDISO 2, UMFPACK 2, WSMP 30.
Fig. 6. Performance profile, p(α): CPU time (seconds) for the analyse phase (indefinite problems).
Fig. 7. Performance profile, p(α): CPU time (seconds) for the factorization (indefinite problems).
Fig. 8. Performance profile, p(α): CPU time (seconds) for the solve phase (indefinite problems).
4 General Remarks

In this paper, we have introduced our study of sparse direct solvers for symmetric linear systems and presented preliminary results. Full details of the study together with further
information on all the solvers used will be given in the forthcoming report [5]. This report will also contain more detailed numerical results and an analysis of the results. Furthermore, since performance profiles can only provide a global view of the solvers, the full results for all the solvers on each of the test problems will be made available in a further report [6]. We note that our preliminary findings have already led to modifications of a number of the solvers (notably MA57, PARDISO, and Oblio), and to further investigations and research into ordering and pivoting strategies.
Acknowledgements

We are grateful to all the authors of the solvers who supplied us with copies of their codes and documentation, helped us to use their software and answered our many queries. Our thanks also to all those who supplied test problems.
References

1. E.D. Dolan and J.J. Moré. Benchmarking optimization software with performance profiles. Mathematical Programming, 91(2), 201–213, 2002.
2. I.S. Duff, R.G. Grimes, and J.G. Lewis. Sparse matrix test problems. ACM Trans. Mathematical Software, 15, 1–14, 1989.
3. N.I.M. Gould and J.A. Scott. A numerical evaluation of HSL packages for the direct solution of large sparse, symmetric linear systems of equations. ACM Trans. Mathematical Software, 30, 300–325, 2004.
4. N.I.M. Gould and J.A. Scott. Complete results for a numerical evaluation of HSL packages for the direct solution of large sparse, symmetric linear systems of equations. Numerical Analysis Internal Report 2003-2, Rutherford Appleton Laboratory, 2003. Available from www.numerical.rl.ac.uk/reports/reports.shtml.
5. N.I.M. Gould, Y. Hu, and J.A. Scott. A numerical evaluation of sparse direct solvers for the solution of large, sparse, symmetric linear systems of equations. Technical Report, Rutherford Appleton Laboratory, 2005. To appear.
6. N.I.M. Gould, Y. Hu, and J.A. Scott. Complete results for a numerical evaluation of sparse direct solvers for the solution of large, sparse, symmetric linear systems of equations. Numerical Analysis Internal Report, Rutherford Appleton Laboratory, 2005. To appear.
7. HSL. A collection of Fortran codes for large-scale scientific computation, 2002. See http://hsl.rl.ac.uk/.
8. G. Karypis and V. Kumar. METIS: A software package for partitioning unstructured graphs, partitioning meshes and computing fill-reducing orderings of sparse matrices – version 4.0, 1998. See http://www-users.cs.umn.edu/~karypis/metis/
Treatment of Large Scientific Problems: An Introduction

Organizers: Zahari Zlatev1 and Krassimir Georgiev2

1 National Environmental Research Institute, Frederiksborgvej 399, P. O. Box 358, DK-4000 Roskilde, Denmark
2 Institute for Parallel Processing, Bulgarian Academy of Sciences, Acad. G. Bonchev Str., Bl. 25-A, 1113 Sofia, Bulgaria
The computers are becoming faster and faster, and their capabilities to deal with very large data sets are steadily increasing. Problems that require a lot of computing time and rely on the use of huge files of input data can now be treated on powerful workstations and PCs (such problems had to be treated on powerful mainframes only 5–6 years ago). Therefore, it is necessary to answer the following two important questions:

– Are the computers that are available at present large enough?
– Do we need bigger (here and in the remaining part of this introduction, "a bigger computer" means a computer with larger memory and discs, not a physically bigger computer) and faster computers for the treatment of the large-scale scientific problems which appear in different fields of science and engineering and have to be resolved either now or in the near future?

These two questions can best be answered by a quotation from the paper "Ordering the universe: The role of mathematics" written by Arthur Jaffe [1]: "Although the fastest computers can execute millions of operations in one second they are always too slow. This may seem a paradox, but the heart of the matter is: the bigger and better computers become, the larger are the problems scientists and engineers want to solve."

The paper of Jaffe was published in 1984. The relevance of this statement can be illustrated by the following example. The difference between the requirements of the air pollution problems studied in 1984 and those studied at present can be expressed by the following figures about the sizes of the problems from this field that were solved 20 years ago and the problems treated now. At that time, i.e. in 1984, it was difficult to handle models containing, after the discretization, more than 2048 equations. One of the biggest problems solved at present, which can successfully be treated on large parallel computers, contains, again after the discretization, more than 160 000 000 ordinary differential equations. This system of ordinary differential equations had to be treated during about 250 000 time-steps. Moreover, many scenarios with this problem had to be handled in a comprehensive study related to future climate changes and high pollution levels. One of the major reasons for being able to handle such huge problems, about 80 000 times larger than the problems that were solved in 1984, is the availability of much bigger and much faster computers. There is no reason to assume that the development of
bigger and faster computers will not continue. Therefore, scientists can continue to plan the formulation and the treatment of bigger and bigger tasks, tasks which impose strong requirements for handling bigger and bigger data sets, because the solutions obtained in this way

– will be closer to reality,
– will contain more details and more useful details, and
– will give reliable answers to questions which are at present open.

The main conclusion from the above short discussion can be formulated as follows. Although there are more and more problems which can successfully be handled (without any attempt to optimize the computational process) on workstations and PCs, it is still necessary, when the problems are very large, to run highly optimized codes on big modern supercomputers. The set of problems which can be treated on workstations and PCs will become bigger and bigger in the future, because the computing power of workstations and PCs will continue to increase. However, the requirements

– for more accuracy (i.e. the need to calculate more accurate and, thus, more reliable results) and
– for obtaining more detailed information

are also steadily increasing. This means that the models are continuously becoming bigger and bigger. Of course, this is a fact well known to the specialists. In his paper [1], A. Jaffe stated (about 20 years ago) that the computers will always be too slow for the scientists dealing with very large applications. A. Jaffe

– was right,
– is still right, and
– will certainly continue to be right in the future.

In other words, there will always exist scientific problems, very advanced and very important for modern society, which cannot be solved on workstations or PCs without imposing some non-physical assumptions during the development of the model. Thus, the fastest available computers will always be needed for the successful solution of the tasks most important for modern society. The exploitation of the new fast computers in the effort to avoid non-physical assumptions and, thus, to develop and run more reliable and more robust large scientific models was the major topic of the special session on the "Treatment of Large Scientific Models". In the papers which were presented at this special session, the authors considered
– numerical methods which are both faster and more accurate,
– the use of parallel algorithms, and
– the organization of the computational process for efficient exploitation of the cache memory of modern computer architectures.

The special session on the "Treatment of Large Scientific Models" was organized by Krassimir Georgiev ([email protected]) and Zahari Zlatev ([email protected]).
Reference

1. Jaffe, A. (1984). Ordering the universe: The role of mathematics. SIAM Rev., 26: 475–488.
Towards a Parallel Multilevel Preconditioned Maxwell Eigensolver

Peter Arbenz1, Martin Bečka1, Roman Geus2, and Ulrich Hetmaniuk3

1 Institute of Computational Science, ETH Zürich, CH-8092 Zürich
2 Paul Scherrer Institut, CH-5232 Villigen PSI
3 Sandia National Laboratories, Albuquerque, NM 87185-1110, USA
Abstract. We report on a parallel implementation of the Jacobi–Davidson (JD) algorithm to compute a few eigenpairs of a large real symmetric generalized matrix eigenvalue problem Ax = λMx, C^T x = 0. The eigenvalue problem stems from the design of cavities of particle accelerators. It is obtained by the finite element discretization of the time-harmonic Maxwell equation in weak form by a combination of Nédélec (edge) and Lagrange (node) elements. We found the Jacobi–Davidson method to be a very effective solver provided that a good preconditioner is available for the correction equations that have to be solved in each step of the JD iteration. The preconditioner of our choice is a combination of a hierarchical basis preconditioner and a smoothed aggregation AMG preconditioner. It is close to optimal regarding iteration count and scales with regard to memory consumption. The parallel code makes extensive use of the Trilinos software framework.
1 Introduction

Many applications in electromagnetics require the computation of some of the eigenpairs of the curl-curl operator,

  curl μ_r^{-1} curl e(x) − k_0^2 ε_r e(x) = 0,   div e(x) = 0,   x ∈ Ω,   (1.1)
in a bounded simply-connected, three-dimensional domain Ω with homogeneous boundary conditions e × n = 0 posed on the connected boundary ∂Ω. εr and µr are the relative permittivity and permeability. Equations (1.1) are obtained from the Maxwell equations after separation of the time and space variables and after elimination of the magnetic field intensity. While εr and µr are complex numbers in problems from waveguide or laser design, in simulations of accelerator cavities the materials can be
The work of these authors has been supported by the ETH research grant TH-1/02-4 on "Large Scale Eigenvalue Problems in Opto-Electronic Semiconductor Lasers and Accelerator Cavities". Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.
assumed to be loss-free, thus admitting real ε_r and μ_r, whence all eigenvalues are real. Here, we will assume ε_r = μ_r = 1. Thus, the discretization of (1.1) by finite elements leads to a real symmetric generalized matrix eigenvalue problem

  Ax = λMx,   C^T x = 0,   (1.2)
where A is positive semidefinite and M is positive definite. In order to avoid spurious modes we approximate the electric field e by Nédélec (or edge) elements [12]. The Lagrange multiplier (a function) introduced to treat the divergence-free condition properly is approximated by Lagrange (or nodal) finite elements [2].

In this paper we consider a parallel eigensolver for computing a few, i.e. five to ten, of the smallest eigenvalues and corresponding eigenvectors of (1.2) as efficiently as possible with regard to execution time and consumption of memory space. In earlier studies [2] we found the Jacobi–Davidson algorithm [13,8] a very effective solver for this task. We have parallelized this solver in the framework of the Trilinos parallel solver environment [9]. In section 2 we review the symmetric Jacobi–Davidson eigensolver and the preconditioner that is needed for its efficient application. In section 3 we discuss data distribution and issues involving the use of Trilinos. In section 4 we report on experiments that we conducted by means of problems originating in the design of the RF cavity of the 590 MeV ring cyclotron installed at the Paul Scherrer Institute (PSI) in Villigen, Switzerland. These experiments indicate that the implemented solution procedure is almost optimal in that the number of iteration steps until convergence depends only slightly on the problem size.
2 The Eigensolver

In this paper we focus on the symmetric Jacobi–Davidson algorithm (JDSYM) for solving (1.2). This algorithm is well-suited since it does not require the factorization of the matrices A or M. In [1,3,2] we found JDSYM to be the method of choice for this problem.

2.1 The Symmetric Jacobi–Davidson Algorithm

The Jacobi–Davidson algorithm has been introduced by Sleijpen and van der Vorst [13]. There are variants for all types of eigenvalue problems [4]. Here, we use a variant adapted to the generalized symmetric eigenvalue problem (1.2) as described in detail in [1,8]; we just sketch the algorithm. Let us assume that we have already computed q eigenvalues of (1.2) and have the corresponding eigenvectors available in the n × q matrix Q; of course, C^T Q = 0. Let us further assume that we have available a search space R(V_k) with V_k = [v_1, ..., v_k], where C^T V_k = 0 and Q^T M V_k = 0. JDSYM proceeds in three steps to expand the search space by one dimension.

1. Extraction. In the extraction phase, a Ritz pair of (1.2) restricted to R(V_k) is computed. This amounts to computing the spectral decomposition of V_k^T A V_k and selecting a particular eigenpair (ρ̃, q̃), where ρ̃ = ρ(q̃) is the Ritz value in R(V_k)
that best approximates the searched eigenvalue. Here, ρ(q̃) denotes the Rayleigh quotient of q̃.

2. Correction. In order to improve the current best approximation (ρ̃, q̃), a correction t is determined that satisfies the correction equation

  (I − M Q̃ Q̃^T)(A − ρ̃ M)(I − Q̃ Q̃^T M) t = −r,   Q̃^T M t = 0,   C^T t = 0,   Q̃ = [Q, q̃],   (2.3)
where r = A q̃ − ρ̃ M q̃ is called the residual at q̃. t can be interpreted as a Newton correction at q̃ for solving Ax − ρ(x)Mx = 0. For efficiency reasons the correction equation is solved only approximately by a Krylov subspace method.

3. Extension. The solution t of (2.3) is made M-orthogonal to V_k and orthogonal to C,

  t̂ = (I − V_k V_k^T M)(I − Y H^{-1} C^T) t.   (2.4)

After M-normalization, t̂ is appended to V_k to yield V_{k+1}. Notice that Y = M^{-1} C is a (very sparse) basis of the null space of A and that H = Y^T C is the discretization of the Laplace operator in the nodal element space [2]. In order to limit the memory space the algorithm consumes, the dimension of V_k is limited: if dim(V_k) = j_max, the iteration is restarted, meaning that the vectors v_1, ..., v_{j_max} are replaced by the j_min best Ritz vectors in V_k.

2.2 Solving the Correction Equation

For the Krylov subspace method to be efficient, a preconditioner is a prerequisite. Following Fokkema et al. [6], for solving (2.3) we use preconditioners of the form

  (I − M Q̃ Q̃^T) K (I − Q̃ Q̃^T M),   (2.5)
where K is a symmetric preconditioner of A − ρ̃M. For efficiency reasons we compute K only once, for a fixed shift σ, such that K ≈ A − σM. We experienced the best results when σ lies in the middle of the set of desired eigenvalues; in the experiments of this paper, however, we choose σ close to but below the smallest desired eigenvalue. This makes it possible to use the same K for all equations we are solving. In each preconditioning step an equation of the form

  (I − M Q̃ Q̃^T) K c = b   with   Q̃^T M c = 0

has to be solved. The solution c is [8, p. 92]

  c = (I − K^{-1} M Q̃ (Q̃^T M K^{-1} M Q̃)^{-1} Q̃^T M) K^{-1} b.

Briefly, the Krylov subspace method is invoked with the following arguments:

  system matrix:   (I − M Q̃ Q̃^T)(A − ρ̃ M)
  preconditioner:  (I − K^{-1} M Q̃ (Q̃^T M K^{-T} M Q̃)^{-1} Q̃^T M) K^{-1}
  right-hand side: −(I − M Q̃ Q̃^T) r
  initial vector:  0   (2.6)
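The preconditioning step in (2.6) amounts to one application of K^{-1} followed by a projection onto the M-orthogonal complement of R(Q̃). A small dense sketch of this operation is given below (Python/NumPy; not the Trilinos implementation, with solve_K standing for whatever realizes K^{-1}). Since K is symmetric, K^{-T} = K^{-1}.

import numpy as np

def apply_projected_preconditioner(solve_K, M, Qt, b):
    """Compute c = (I - K^{-1} M Qt (Qt^T M K^{-1} M Qt)^{-1} Qt^T M) K^{-1} b.

    solve_K(x) applies K^{-1} (e.g. the multilevel preconditioner),
    M is the mass matrix and Qt the n x (q+1) matrix [Q, q_tilde].
    The small (q+1) x (q+1) system is factorized densely.
    """
    MQ = M @ Qt                                   # M Q~
    KinvMQ = np.column_stack([solve_K(MQ[:, j]) for j in range(MQ.shape[1])])
    S = MQ.T @ KinvMQ                             # Q~^T M K^{-1} M Q~  (small)
    c = solve_K(b)                                # K^{-1} b
    return c - KinvMQ @ np.linalg.solve(S, MQ.T @ c)

# Usage sketch: with a dense K = A - sigma*M available, one could simply take
#   solve_K = lambda x: np.linalg.solve(K, x)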
Both the system matrix and the preconditioner are symmetric. However, because of the dynamic shift ρ̃, they can become indefinite. For this reason, the QMRS iterative solver [7] is particularly well suited.

2.3 The Preconditioner

Our preconditioner K, cf. (2.5), is a combination of a hierarchical basis preconditioner and an algebraic multigrid (AMG) preconditioner. Since our finite element spaces consist of Nédélec and Lagrange finite elements of degree 2, and since we are using hierarchical bases, we employ the hierarchical basis preconditioner that we used successfully in [2]. Numbering the linear before the quadratic degrees of freedom, the matrices A and M in (1.2) get a 2-by-2 block structure,

  A = [ A11  A12 ]        M = [ M11  M12 ]
      [ A21  A22 ],           [ M21  M22 ].

Here, the (1,1)-blocks correspond to the bilinear forms involving linear basis functions. The hierarchical basis preconditioners as discussed by Bank [5] are stationary iteration methods for solving

  [ K11  K12 ] [ x1 ]   [ b1 ]
  [ K21  K22 ] [ x2 ] = [ b2 ],      K_ij = A_ij − σ M_ij,

that respect the 2-by-2 block structure of A and M. If the underlying stationary method is the symmetric block Gauss–Seidel iteration, then

  K = [ K11   0  ] [ K11   0  ]^{-1} [ K11  K12 ]
      [ K21  K̃22 ] [  0   K̃22 ]      [  0   K̃22 ].

The approximation K̃22 of K22 again represents a stationary iteration method. In a parallel environment, the Jacobi iteration

  K̃22 = diag(K22)   (2.7)
is quite efficient and easy to implement. For very large problems the direct solve with K11 becomes inefficient and, in particular, consumes far too much memory due to fill-in. In order to reduce the memory requirements of the two-level hierarchical basis preconditioner without losing its optimality with respect to the iteration count, we replaced the direct solves by a single V-cycle of an AMG preconditioner. This makes our preconditioner a true multilevel preconditioner. We found ML [11] to be the AMG solver of choice, as it can handle the unstructured systems that originate from the Maxwell equation discretized by linear Nédélec finite elements. ML implements a smoothed aggregation AMG method [15] that extends the non-smoothed aggregation approach of Reitzinger and Schöberl [10]. ML is part of Trilinos, which is discussed in the next section.
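One application of the resulting two-level preconditioner, i.e. one symmetric block Gauss–Seidel sweep with K̃22 = diag(K22), can be sketched as follows (a schematic Python version; solve_K11 is a placeholder for either the sparse direct solve or the single ML V-cycle).

import numpy as np

def apply_hierarchical_preconditioner(solve_K11, K12, K21, K22_diag, b1, b2):
    """One symmetric block Gauss-Seidel sweep for K = [[K11, K12], [K21, K22]],
    with the (2,2) block replaced by its diagonal (Jacobi), cf. (2.7).
    solve_K11(x) applies K11^{-1}: a sparse direct solve or one AMG V-cycle."""
    x1 = solve_K11(b1)                       # forward step: K11 x1 = b1
    x2 = (b2 - K21 @ x1) / K22_diag          # diagonal solve with K~22
    x1 = solve_K11(b1 - K12 @ x2)            # backward step: K11 x1 = b1 - K12 x2
    return x1, x2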
3 Parallelization Issues

If the problems become very large, only the distribution of the data over a number of processors (and their memories) will make their solution feasible. Therefore, an efficient parallel implementation is a prerequisite for being able to solve really large problems. The parallelization of the algorithm requires proper data structures and data layout, parallel direct and iterative solvers, and various preconditioners. We found the Trilinos Project [14] to be an efficient environment to develop such a complex parallel application.

3.1 Trilinos

The Trilinos Project is a large effort to develop parallel solver algorithms and libraries within an object-oriented software framework for the solution of large-scale, complex multi-physics engineering and scientific applications [14,9], still leveraging the value of established numerical libraries such as Aztec, the BLAS and LAPACK. Trilinos is a collection of compatible software packages that support parallel linear algebra computations, the solution of linear and non-linear systems of equations, eigenvalue problems, and related capabilities. Trilinos is primarily written in C++, but part of the underlying code is implemented in C and Fortran. Each Trilinos package is a self-contained, independent piece of software. The Trilinos fundamental layer, Epetra, provides a common look-and-feel and infrastructure. It implements basic objects like distributed vectors, matrices, graphs, linear operators and problems. It also implements abstract interfaces through which Trilinos packages interact with each other. In our project we employ the package AztecOO, an object-oriented interface to the Aztec library of iterative solvers and preconditioners, and the package ML, which implements a smoothed aggregation AMG method capable of handling Maxwell equations. Direct solvers were managed by Amesos, the Trilinos wrapper for various direct linear solvers. To improve the original distribution of data, the external Zoltan/ParMetis libraries were accessed through the EpetraExt interface. Some utilities from the Teuchos and TriUtils packages were also used. Trilinos offers effective facilities for parallel mathematical software developers. As it is still under development, the level of integration of the various packages into Trilinos and the quality of their documentation varies; e.g., at this time, ML for Maxwell equations is available only in the developer version of Trilinos.

3.2 Data Distribution

It is clear that a suitable data distribution can reduce communication cost and balance the computational load, and the gain from such a redistribution can outweigh the cost of this preprocessing step. In the Trilinos context, matrices are distributed by rows; a map defines which row goes on which processor. For the redistribution with Zoltan/ParMetis, an artificial graph G is constructed from portions of the matrices M, H, and C that contains a vertex for each node, edge, and face of the finite element mesh that participates in the computation.
Notice that neither M nor H works with all degrees of freedom. The Poisson matrix H has degrees of freedom located at vertices and on edges. The matrix M has degrees of freedom located on edges and element faces. The matrix C relates both. ParMetis tries to distribute the matrix such that (1) the number of nonzero elements per processor and thus the work load is balanced and (2) the number of edge cuts is minimal. The latter should minimize the communication overhead. In our experiments we observed significant decreases (up to 40%) of the execution time with the redistributed data.
4 Numerical Experiments
In this section we present experiments that we conducted by means of problems originating in the design of the RF cavity of the 590 MeV ring cyclotron installed at the Paul Scherrer Institute (PSI) in Villigen, Switzerland. The characteristics of this problem, which we denote by cop40k, are given in Table 1, where the orders n and the numbers of nonzeros nnz of both the shifted operator A − σM and the discrete Laplacian H are listed. The second, artificial test example is a rectangular box denoted box170k. The experiments have been conducted on up to 16 processors of a 32 dual-node PC cluster in dedicated mode. Each node has 2 AMD Athlon 1.4 GHz processors, 2 GB main memory, and 160 GB local disk. The nodes are connected by a Myrinet providing a communication bandwidth of 2000 Mbit/s. The system operates with Linux 2.4.20. For our experiments we used the developer version of Trilinos on top of MPICH 1.2.5.

Table 1. Matrix characteristics

  grid      n_{A−σM}   nnz_{A−σM}   n_H      nnz_H
  cop40k     231668      4811786    46288    1163834
  box170k   1030518     20767052   209741    5447883
In Tables 2 and 3 timings are given for computing the 5 smallest positive eigenvalues and corresponding eigenvectors using JDSYM with the multilevel preconditioner introduced earlier. We set j_min = 6 and j_max = 15. An approximate eigenpair (ρ, q) is considered converged if the norm of the residual r = A q − ρ M q satisfies

    ‖r‖_2 ≤ ε c(M)^{1/2} ‖q‖_M,

where c(M) approximates λ_min(M) [3]. Notice that ‖r‖_{M^{-1}} ≤ ‖r‖_2 / λ_min(M)^{1/2}. In Tables 2 and 3 the execution times t are given for various processor numbers p. Both problems were too large to be solved on a single processor. Efficiencies E(p) are given with respect to the performance of the runs with two and six processors, respectively. Notice that the speedup from four to six processors in Table 3 is superlinear due to memory effects. t_prec and t_proj give the percentage of the time the solver spent applying the preconditioner and the projector, respectively. The preconditioner in (2.6) is described in detail in Section 2.3. The projector (2.4) is applied only once per outer iteration. Solving with H amounts to solving a Poisson equation [2]. This is done iteratively with the preconditioned conjugate gradient method.
Table 2. Results for matrix cop40k

  p    t [sec]   E(p)   t_prec   t_proj   n_outer   n_inner^avg
  2     1241     1.00    38%      16%       55        19.38
  4      637     0.97    37%      17%       54        19.24
  6      458     0.90    39%      18%       54        19.69
  8      330     0.94    39%      17%       53        19.53
  10     266     0.93    39%      19%       52        19.17
  12     240     0.86    41%      20%       54        19.61
  14     211     0.84    42%      20%       55        19.36
  16     186     0.83    44%      20%       54        19.17

Table 3. Results for matrix box170k

  p    t [sec]   E(p)   t_prec   t_proj   n_outer   n_inner^avg
  4     7720      —      28%      22%       54        22.39
  6     2237     1.00    39%      23%       55        22.47
  8     1744     0.96    38%      23%       55        23.51
  10    1505     0.89    38%      25%       56        22.54
  12    1224     0.91    38%      25%       54        22.02
  14    1118     0.86    39%      24%       55        23.76
  16     932     0.90    38%      25%       54        22.30
A multilevel preconditioner, provided by ML, is applied to H. The projector consumes a high fraction of the execution time. This can be explained by the high accuracy (residual norm reduction by a factor of 10^10) we required of the solution vector in order to guarantee that it satisfies the constraints C^T x = 0. The fractions of the overall solver time spent in the preconditioner and the projector are practically constant or increase very slowly. n_inner^avg is the average number of inner iterations per outer iteration. The total number of matrix-vector multiplications and applications of the preconditioner is approximately n_outer × n_inner^avg. The average number of inner iterations n_inner^avg is almost constant, indicating that the preconditioner does not deteriorate as p increases. In fact, the only difference among the preconditioners for different p is the number of aggregates that are formed by ML. To reduce the number of messages, aggregates are not permitted to cross processor boundaries.
5 Conclusions
In conclusion, the parallel algorithm shows a very satisfactory behavior. The efficiency of the parallelization does not drop below 83 percent for 16 processors. It slowly decreases as the number of processors increases, but this is natural due to the growing communication-to-computation ratio. The accuracy of the results was satisfactory. The computed eigenvectors were M-orthogonal and orthogonal to C to machine precision. The 2-norms of the residuals of the computed eigenpairs were below 10^{-9}.
In the near future we will incorporate the hierarchical basis preconditioner for the solves with the Poisson matrix H in the projector (2.4); presently, we apply ML to the whole of H. We will also experiment with more powerful approximations K̂22 of K22 than just its diagonal, cf. (2.7). In [2] we used symmetric Gauss-Seidel instead of Jacobi preconditioning. The Gauss-Seidel preconditioner does not parallelize well, though.
References 1. P. Arbenz and R. Geus. A comparison of solvers for large eigenvalue problems originating from Maxwell’s equations. Numer. Linear Algebra Appl., 6(1):3–16, 1999. 2. P. Arbenz and R. Geus. Multilevel preconditioners for solving eigenvalue problems occuring in the design of resonant cavities. To appear in “Applied Numerical Mathematics”. Paper available from doi:10.1016/j.apnum.2004.09.026. 3. P. Arbenz, R. Geus, and S. Adam. Solving Maxwell eigenvalue problems for accelerating cavities. Phys. Rev. ST Accel. Beams, 4:022001, 2001. Paper available from doi:10.1103/PhysRevSTAB.4.022001. 4. Z. Bai, J. Demmel, J. Dongarra, A. Ruhe, and H. van der Vorst. Templates for the Solution of Algebraic Eigenvalue Problems: A Practical Guide. SIAM, Philadelphia, PA, 2000. 5. R. E. Bank. Hierarchical bases and the finite element method. Acta Numerica, 5:1–43, 1996. 6. D. R. Fokkema, G. L. G. Sleijpen, and H. A. van der Vorst. Jacobi–Davidson style QR and QZ algorithms for the partial reduction of matrix pencils. SIAM J. Sci. Comput., 20(1):94–125, 1998. 7. R. Freund and N. M. Nachtigal. Software for simplified Lanczos and QMR algorithms. Appl. Numer. Math., 19:319–341, 1995. 8. R. Geus. The Jacobi–Davidson algorithm for solving large sparse symmetric eigenvalue problems. PhD Thesis No. 14734, ETH Z¨urich, 2002. (Available at URL http://e-collection.ethbib.ethz.ch/show?type=diss&nr=14734). 9. M. Heroux, R. Bartlett, V. Howle, R. Hoekstra, J. Hu, T. Kolda, R. Lehoucq, K. Long, R. Pawlowski, E. Phipps, A. Salinger, H. Thornquist, R. Tuminaro, J. Willenbring, and A. Williams. An overview of Trilinos. Technical Report SAND2003-2927, Sandia National Laboratories, August 2003. To appear in the ACM Trans. Math. Softw. 10. S. Reitzinger and J. Sch¨oberl. An algebraic multigrid method for finite element discretizations with edge elements. Numer. Linear Algebra Appl., 9(3):223–238, 2002. 11. M. Sala, J. Hu, and R. S. Tuminaro. ML 3.1 Smoothed Aggregation User’s Guide. Tech. Report SAND2004-4819, Sandia National Laboratories, September 2004. 12. P. P. Silvester and R. L. Ferrari. Finite Elements for Electrical Engineers. Cambridge University Press, Cambridge, 3rd edition, 1996. 13. G. L. G. Sleijpen and H. A. van der Vorst. A Jacobi–Davidson iteration method for linear eigenvalue problems. SIAM J. Matrix Anal. Appl., 17(2):401–425, 1996. 14. The Trilinos Project Home Page. http://software.sandia.gov/trilinos/. 15. P. Vanˇek, J. Mandel, and M. Brezina. Algebraic multigrid based on smoothed aggregation for second and fourth order problems. Computing, 56(3):179–196, 1996.
On Improvement of the Volcano Search and Optimization Strategy Venansius Baryamureeba and John Ngubiri Institute of Computer Science, Makerere University P.O.Box 7062, Kampala, Uganda {barya,ngubiri}@ics.mak.ac.ug
Abstract. The ever-increasing load on databases dictates that queries should not be processed one by one. Multi-query optimization seeks to optimize queries grouped in batches instead of one by one. Multi-query optimizers aim at identifying inter- and intra-query similarities in order to bring up sharing of common sub-expressions and hence save computer resources like time, processor cycles and memory. Of course, the searching itself takes some resources, but as long as the saved resources exceed those used, there is a global benefit. Since queries are random and come from different sources, similarities are not guaranteed, but since they are addressed to the same schema, similarities are likely. The search strategy needs to be intelligent, so that it continues only when there is a high probability of a sharing (hence resource-saving) opportunity. We present a search strategy that assembles the queries in an order such that the benefits are high, that detects null sharing cases and therefore terminates the search for similar sub-expressions, and that removes sub-expressions which already exist elsewhere, so as to reduce subsequent searching for a global advantage. AMS Subject Classification: 68M20, 68P20, 68Q85 Keywords: Query processing, Multi-query optimization, Volcano optimizer.
1 Introduction
Databases are becoming central to the running of organizations, and the load put on them is therefore increasing, mostly from simultaneous, complex and close to real-time requests. Single-query optimizers such as the System R optimizer [7] express query plans as trees and process one query at a time. They can work quite well for databases receiving a low traffic of simple queries. As queries become more complex and more numerous, they put a bigger load on computer resources, and the need to optimize them in batches while exploiting similarities among them becomes irresistible; in fact, it is the only feasible way out. Multi-query optimizers [2],[3],[8],[10] represent queries as Directed Acyclic Graphs (DAGs). Others, like [9], use AND-OR DAGs so as to ease extensibility. An AND-OR DAG is a DAG with two types of nodes: AND nodes and OR nodes. AND nodes have only OR nodes as children and OR nodes have only AND nodes as children. An AND node represents a relational algebraic operation such as join (⋈), select (σ) and project (π). AND nodes are also referred to as operational nodes.
Fig. 1. Directed Acyclic Graph
OR nodes represent an equivalent logical expression of the child AND node and its children; they are also referred to as equivalence nodes. Multi-query optimizers aim at traversing the different query plans (AND-OR DAGs) and identifying sub-expressions (equivalence nodes) in which sharing is possible. Sharing of node outcomes saves memory, disc access and computation costs, hence lowering the global resource requirements for the execution of the queries. Searching for sub-expressions to be shared does take some resources (time and processor cycles), but if a good search strategy is used the savings exceed the searching cost, hence a global advantage. Generally, there are three cases with a possibility of materialization and sharing. If W is an equivalence node of a plan, then sharing is possible when:
(i) Nodes produce exactly similar results, e.g. s1 = σ_{x=a}(W) and s2 = σ_{x=a}(W). In this case, only one is executed and the other one just uses the output of the executed node.
(ii) The output of one node is an exact subset of the other(s). For example, if we have nodes like s3 = σ_{x≤6}(W) and s4 = σ_{x≤10}(W), s4 is executed and s3 is derived from it, i.e. s3 = σ_{x≤6}(s4).
(iii) Nodes retrieve data from the same (intermediate) relation or view but on different instances of the outermost constraint. For example, if we have s5 = σ_{x=10}(W) and s6 = σ_{x=15}(W), then a different node s7 = σ_{x=10 ∨ x=15}(W) is created so that the two are derived from it, i.e. s5 = σ_{x=10}(s7) and s6 = σ_{x=15}(s7).
The efficiency of the multi-query optimizer does not depend on how aggressively common sub-expressions are looked for, but rather on the search strategy [9]. Given that multi-query optimization takes place on many complex queries with many relations, comparing sub-expressions exhaustively leads to too many comparisons, hence high comparison cost and time. The work of Kyuseok et al. [6], Kyuseok [5], Sellis and Gosh
[10], Cosar et al. [1], as well as Park and Seger [3], is exhaustive in nature and hence costly. The searching cost may exceed the pre- and post-optimization resource trade-off, leading to no global benefit. Graefe and McKenna [2] proposed the Volcano approach to multi-query optimization. Roy et al. [8] improved the search strategy proposed by Graefe and McKenna [2], while Roy et al. [9] proposed improvements to the algorithm in [2]. Kroger et al. [4] observe that the main aim of a multi-query optimizer is to come up with a plan as cost effective as possible in a time as short as possible. This research relies on the search strategy's ability to quickly identify shareable nodes and detect future null sharing cases so as to optimize the search effort and time for a global advantage. We discuss related research in Section 2, the common sub-expressions search strategy in Section 3, the improved algorithm in Section 4, and give concluding remarks in Section 5.
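As an illustration only (this is not code from the paper), the AND-OR DAG described above can be represented with two mutually referencing node types; the order and marked fields anticipate the notions used in Sections 3 and 4 and are otherwise assumptions of this sketch.

    # Minimal AND-OR DAG node types: OR (equivalence) nodes group alternative
    # AND (operation) nodes, and AND nodes point back to OR nodes as inputs.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class AndNode:                      # operation node: join, select, project, ...
        op: str
        inputs: List["OrNode"] = field(default_factory=list)

    @dataclass
    class OrNode:                       # equivalence node: one logical expression
        alternatives: List[AndNode] = field(default_factory=list)
        order: int = 0                  # number of base relations it is built from
        marked: bool = False            # set once a sharing partner has been found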
2 Related Work
2.1 The Basic Volcano Algorithm
The basic Volcano algorithm was proposed by Graefe and McKenna [2]. It uses transformation rules to generate a DAG representation of the search space. For efficiency reasons, the basic Volcano algorithm starts from the root node and expands towards the leaves. It uses upper bounds of the node costs to decide whether to continue with a plan, because the exact costs can only be obtained from the leaves. Any optimal sub-plan found is materialized so that, if it is found again, it is reused without re-optimizing and re-computing it. This saves costs.

2.2 The Volcano SH
The Volcano SH algorithm was proposed by Roy et al. [9] as an improvement to the basic Volcano algorithm. It observes that not all sub-expressions are worth materializing: sub-expressions are only materialized if doing so causes global savings. It traverses the DAG bottom up and computes the cost of each equivalence node. If a node is found to exist more than once, a decision whether to materialize it is made. Materialization is only done if it creates a saving in the resources required to execute the query. If, for example, an equivalence node e has execution cost cost(e), materialization cost matcost(e), reuse cost reusecost(e), and the node is used numuses(e) times at run time, it is materialized if

    cost(e) + matcost(e) + reusecost(e) × (numuses(e) − 1) < cost(e) × numuses(e),

which simplifies to the materialization condition [9]

    reusecost(e) + matcost(e) / (numuses(e) − 1) < cost(e).

Unless an equivalence node satisfies this condition, it is not cost effective to materialize it and it is therefore not chosen for materialization (a small sketch of this test is given at the end of this section). The Volcano SH starts from the leaves and advances towards the root. This makes it unable to obtain the actual value of numuses(e) for an equivalence node e, since the number of times an equivalence node appears depends on the materialization status of its parents, which have not yet been met at the moment the node is reached. The algorithm therefore uses an underestimate numuses⁻(e) of numuses(e), obtained by simply counting the number of ancestors of the node in question [9]. The inputs are the basic Volcano optimal plans. A strength of Volcano SH is that it avoids blind materialization: materialization is only done when the algorithm is sure it will produce cost benefits, so it is more efficient than the basic Volcano. It does, however, have weaknesses, which include:
(i) Poor estimation of numuses⁻(e): Estimating numuses⁻(e) by counting the ancestors of a node is inaccurate and can lead to wrong decisions. If, for example, node a is used twice to produce node b, which is used three times to produce node c (c has two parents, and c together with its parents are all used once), then the number of times a is used is 6, while the underestimate of [9] gives 4. Since the underestimate is less than the actual value, this is acceptable. However, if all nodes were used only once, the number of times a is used would be 1, yet the underestimate would remain 4. This is dangerous, since the underestimate may permit materialization while the actual value forbids it; in this example, since a appears only once, the decision to materialize it should not come up at all.
(ii) Inter-query sharing and search optimization order: The order of optimization in Volcano SH is immaterial [9], because nodes in a plan are examined independently; even the estimate of numuses(e) is confined to the same plan [9]. This leads to excessive work being done. For example, in Figure 2 the roots of the two sub-DAGs are exactly similar, yet Volcano SH reaches each of them via the leaves, which is logically repetitive. Searching for the nodes top down and then moving horizontally across the plans would eliminate duplicate operations leading to similar outputs. Likewise, starting with a plan which shares a lot with the rest would eliminate more operations and hence give a smaller search space.

2.3 The Volcano RU
The Volcano RU was also proposed by Roy et al. in [9] and seeks to exploit sharing beyond the optimal plans. If, for example, we have two optimal plans Q1 = (A ⋈ B) ⋈ C and Q2 = (A ⋈ C) ⋈ D, the Volcano SH algorithm would not permit sharing between Q1 and Q2, since no sub-expression is common between them. The Volcano RU, however, can adjust Q1 to a locally sub-optimal plan (A ⋈ C) ⋈ B such that a common sub-expression exists and sharing is possible. In Volcano RU, queries are optimized one by one. For any sub-expression, the algorithm establishes whether it would cause savings if reused one more time; if it would, it is chosen for materialization and reuse in subsequent operations. Operations that follow are done taking into consideration that previously selected nodes are materialized. A strength of Volcano RU is that it materializes beyond the optimal plans, so more benefits are expected. It has, however, the following shortfalls:
(i) Over-materialization: Not all sub-expressions that would cause a benefit when reused once actually exist in other queries of the batch. The criterion therefore leads to over-materialization and hence resource wasting.
Fig. 2. Typical Sub-plans
(ii) Order of optimization: Though the order of optimization matters [9], the algorithm makes no attempt to establish it.
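For illustration, the Volcano SH materialization condition derived in Section 2.2 can be coded directly; the function below is a sketch under the paper's cost model and is not part of any of the algorithms discussed here.

    # Materialization test: materializing pays off when computing the node once,
    # storing it, and reusing it is cheaper than recomputing it every time.
    def worth_materializing(cost, matcost, reusecost, numuses):
        if numuses <= 1:
            return False                  # a node used only once never pays off
        return reusecost + matcost / (numuses - 1) < cost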
3 Popularity and Inter-query Sharing Establishment
First, we establish the extent of sharing and summarize it in a query sharing matrix. The popularity of a query is the total number of equivalence nodes in its plan that have sharing partners in the batch, while the order of a node is the number of relations, with at least one relational operator, that make up the node. We choose an optimal processing order, eliminate duplicate operations, and detect null sharing cases. First we create the sharing matrix, and then we base the optimization of the batch on it.

3.1 The Greedy Search Algorithm
We get the nodes from the Volcano best plans of the queries in the batch, produced as in [2]. The greedy algorithm compares nodes pairwise to establish inter- and intra-query sharing, and M is updated accordingly. Using this approach, for a batch of n queries where the i-th query has k_i nodes, we need Σ_i k_i × (Σ_i k_i − 1) comparisons, which is too much. We propose the following improvements:
1. Eliminate duplicate searches: The output of the search between Q_i and Q_j is equivalent to that between Q_j and Q_i. We therefore need to compare Q_i and Q_j only if i ≤ j.
2. Search by node order: Sharing can only take place between nodes of the same order. We therefore group nodes by order and, for a given node, search within its order.
3. Null sharing prediction: Since moving up the order makes the nodes more specific, if there is no sharing for nodes of order m, then there is no sharing for nodes of order n where n > m. We therefore terminate the higher-order search for that query pair.
4. Zero order tip: Relations (zero order nodes) are too broad to be considered for sharing. We only use them to find whether any two queries are disjoint. If we get a sharing opportunity at this order, we leave M unchanged and go to the next order. If, however, no nodes are sharable, then the queries are disjoint.

3.2 The Improved Greedy Search Algorithm
We now present the greedy algorithm, enhanced so that the observations above are taken into account in order to have a more optimal search. It outputs the sharing matrix.

    for i = 1; i ≤ n; i++
      for j = i; j ≤ n; j++
        ORDER = 0
        repeat
          nextOrderPermision = false
          Si = set of nodes of order ORDER in query i
          node1 = first node in Si
          Sj = set of nodes of order ORDER in query j
          node2 = first node in Sj
          while (Si still has nodes)
            while (Sj still has nodes)
              if (node1 and node2 are sharable)
                nextOrderPermision = true
                if (ORDER = 0)
                  break out of the two while loops
                else
                  increment M[i, j] and M[j, i]
                  mark node1 and node2
                endif
              endif
              node2 = next(node2)
            endwhile
            node1 = next(node1)
          endwhile
          ORDER = next(ORDER)
        until (nextOrderPermision = false or orders are exhausted)
      endfor
    endfor
4 The Optimizing Algorithm
The new algorithm takes as input the DAG made up of the basic Volcano plans for each query. The shareability information and the popularities are obtained from M. In this algorithm plans are
assigned focal roles in the order of their decreasing popularity; sharing for any focal plan is done in the stable-marriage preference order of less popular plans; searching starts from the highest order; and plans of zero popularity are not assigned focal roles.

4.1 The Algorithm

    S = set of plans that make up the virtual DAG, in order of their popularity
    focalPlan = first plan in S
    repeat
      establish node cost and numuses for each node of the focal plan
      S* = subset of S whose plans share at least one node with the focal plan
      candidateOrder = highest order of focalPlan
      repeat
        e = node in candidateOrder
        if (e is marked)
          repeat
            traverse S* searching for marked equivalent nodes of the same order as, and sharable with, e
            increment numuses(e) whenever a sharable node is met
          until (S* is exhausted)
          if (sharablecondition(e))
            choose which node to materialize and add it to the materializable set
            remove the rest
            update the DAG for the materialized node to cater for the removed nodes' parents
            unmark the chosen node
          endif
        else
          if (sharablecondition(e))
            add e to the materialization set
          endif
        endif
      until (nodes in candidateOrder are exhausted)
      focalPlan = next(focalPlan)
    until (plans of non-zero popularity are exhausted)
4.2 Benefits of the New Algorithm
1. Better estimation of numuses(e): The algorithm uses the exact number of times the node occurs in the focal plan and increments it by one whenever a sharable node is found outside the focal plan. The actual number of uses may be even greater if the non-focal-plan nodes are used multiple times; therefore numuses⁻(e) ≤ numuses(e) for all e.
2. Elimination of null searches: The sharing matrix holds the extent of sharing for any pair of plans. If the entry for a pair is zero, then we need not search for shareability between them. If the popularity of a certain query is zero, then its plan is not involved in the search.
3. Elimination of catered-for optimizations: If we have, say, three sharable nodes of order five, and it is decided that one of them is to be materialized and the rest of the plans use it, then it is not worthwhile to process the other nodes, since the ultimate goal is to get the root of the tree. This algorithm removes the sub-DAGs
whose output can be obtained from materialized nodes, so that such children do not enter the optimization process even though their output is catered for.
4. Optimal order of optimization: Since the strategy eliminates catered-for sub-DAGs, it is better to do this as early as possible, so that the subsequent search space is reduced without affecting the outcome. Starting with the most popular query achieves this, saving time, memory and processor cycles.
5 Conclusion
We have shown that, by using a bottom-up approach for the search and a top-down approach for the optimization, we continue searching for sharable nodes only when they exist. Moreover, the identification of a sharable node leads to the removal of all children whose output is already catered for by the sharing. This reduces the search space and therefore the subsequent search cost and time.
References 1. A. Cosar, E. Lim and J. Srivasta. Multiple query Optimization with depth-first branch and bond and dynamic query ordering. International Conference on Information and Knowledge Management, 2001. 2. G. Graefe and W.J. McKenna. Extensibility and search efficiency in the Volcano Optimizer generator. Technical report CU-CS-91-563. University of Colorado, 1991. 3. J. Park and A. Seger. Using common sub-expressions to optimize multiple queries. Proceedings of the IEEE International Conference on Data Engineering,1988. 4. J. Kroger, P. Stefan, and A.Heuer. Query optimization: On the ordering of Rules. Research paper, Cost - and Rule based Optimization of object - oriented queries (CROQUE) Project University of Restock and University of Hamburg - Germany, 2001. 5. Kyuseok Shim. Advanced query optimization techniques for relational database systems. PhD dissertation, University of Maryland, 1993. 6. Kyuseok Shim, Timos.K. Sellis and Dana Nau. Improvements on a heuristic algorithm for multiple-query Optimization. Technical report, University of Maryland, Department of Computer science,1994. 7. P.G.Selinger, M.M Astrahan, D.D Chamberlin, R. A. Lorie and T.G Price: Access path selection in relational database management systems. In preceeding of the ACM-SIGMOD Conference for the Management of Data, 1979. pp 23-34. 8. Prasan Roy, S. Seshadri, S. Sudarshan, and Siddhesh. Bhobe. Practical Algorithms for multiquery Optimization. Technical Report, Indian Institute of Technology, Bombay,1998. 9. Prasan Roy, S. Seshadri, S. Sudarshan, and Siddhesh Bhobe. Efficient and Extensible algorithms for Multi query optimization. Research Paper, SIGMOD International Conference on management of data,2001. 10. Timos. K. Sellis and S. Gosh. On Multi-query Optimization Problem. IEEE Transactions on Knowledge and Data Engineering 1990 pp. 262-266.
Aggregation-Based Multilevel Preconditioning of Non-conforming FEM Elasticity Problems
Radim Blaheta (1), Svetozar Margenov (2), and Maya Neytcheva (3)
(1) Institute of Geonics, Czech Academy of Sciences, Studentska 1768, 70800 Ostrava-Poruba, The Czech Republic, [email protected]
(2) Institute for Parallel Processing, Bulgarian Academy of Sciences, Acad. G. Bonchev Str., Bl. 25A, 1113 Sofia, Bulgaria, [email protected]
(3) Department of Information Technology, Uppsala University, Box 337, SE-75105 Uppsala, Sweden, [email protected]
Abstract. Preconditioning techniques based on various multilevel extensions of two-level splittings of finite element (FE) spaces lead to iterative methods which have an optimal rate of convergence and computational complexity with respect to the number of degrees of freedom. This article deals with the construction of algebraic two-level and multilevel preconditioning algorithms for the Lam´e equations of elasticity, which are discretized by Crouzeix-Raviart non-conforming linear finite elements on triangles. An important point to note is that in the nonconforming case the FE spaces corresponding to two successive levels of mesh refinements are not nested. To handle this, a proper aggregation-based two-level basis is considered, which enables us to fit the general framework of the two-level preconditioners and to generalize the method to the multilevel case. The derived estimate of the constant in the strengthened Cauchy-Bunyakowski-Schwarz (CBS) inequality is uniform with respect to both, mesh anisotropy and Poisson ratio, including the almost incompressible case.
1 Introduction
The target problem in this paper is the Lamé system of elasticity:

    Σ_{j=1}^{2} ∂σ_ij/∂x_j + f_i = 0,   x ∈ Ω,  i = 1, 2,
    u = 0,                              x ∈ ∂Ω,

where Ω is a polygonal domain in IR² and ∂Ω is the boundary of Ω. The stresses σ_ij and the strains ε_ij are defined by the classical Hooke's law, i.e.

    σ_ij(u) = λ Σ_{k=1}^{2} ε_kk(u) δ_ij + 2µ ε_ij(u),    ε_ij(u) = (1/2) ( ∂u_i/∂x_j + ∂u_j/∂x_i ).
We assume that the Lamé coefficients are piecewise constant in the domain Ω. The unknowns of the problem are the displacements u^t = (u_1, u_2). A generalization to nonhomogeneous displacement boundary conditions is straightforward. The Lamé coefficients are given by λ = νE / ((1 + ν)(1 − 2ν)), µ = E / (2(1 + ν)), where E stands for the elasticity modulus, and ν ∈ (0, 1/2) is the Poisson ratio. We use the notion almost incompressible for the case ν = 1/2 − δ (δ > 0 is a small parameter). Note that the boundary value problem becomes ill-posed when ν = 1/2 (the material is incompressible). We assume that the domain Ω is discretized using triangular elements. The partitioning is denoted by T and is assumed to be obtained by regular refinement steps of a given coarse triangulation T_0. For f ∈ (L²(Ω))², the weak formulation of the boundary value problem reads: find u ∈ V = {v ∈ (H¹(Ω))², v|_∂Ω = 0} such that

    a(u, v) = − ∫_Ω f^T v dx + ∫_{Γ_N} g^T v ds,   ∀v ∈ V.   (1)
Here Γ_N is that part of ∂Ω where there are active surface forces. The bilinear form a(u, v) is of the form

    a(u, v) = ∫_Ω [ λ div(u) div(v) + 2µ Σ_{i,j=1}^{2} ε_ij(u) ε_ij(v) ] dx   (2)
            = ∫_Ω ⟨C d(u), d(v)⟩ dx,                                          (3)
            = ∫_Ω ⟨C^s d(u), d(v)⟩ dx = a^s(u, v),                            (4)

where

    C   = [ λ+2µ   0   0   λ               C^s = [ λ+2µ   0   0   λ+µ
            0      µ   µ   0                       0      µ   0   0
            0      µ   µ   0                       0      0   µ   0
            λ      0   0   λ+2µ ],                 λ+µ    0   0   λ+2µ ],   (5)

    d(u) = ( ∂u_1/∂x_1, ∂u_1/∂x_2, ∂u_2/∂x_1, ∂u_2/∂x_2 )^t.   (6)
Note that (4) holds due to the homogeneous pure displacement boundary conditions, as for u, v ∈ V we have ∫_Ω (∂u_i/∂x_j)(∂v_j/∂x_i) dx = ∫_Ω (∂u_i/∂x_i)(∂v_j/∂x_j) dx. The matrix C^s is positive definite. More details can be found e.g. in [3,10,11]. The variational formulation with the modified bilinear form a^s is next discretized using Crouzeix-Raviart non-conforming linear finite elements, i.e., the continuous space V is replaced by a finite dimensional subspace V^(ℓ). For more details see [10,11]. Here, and in what follows, {V^(k)}_{k=0}^{ℓ} and {A^(k)}_{k=0}^{ℓ} stand for the FE spaces and for the stiffness matrices corresponding to the triangulations {T_k}_{k=0}^{ℓ}. Let us recall that non-conforming FE approximations provide some attractive stability properties for parameter-dependent problems.

Fig. 1. Regular refinement (a), and macroelement local node numbering (b)
2 Hierarchical Decomposition of Crouzeix-Raviart Systems
Let us consider two consecutive mesh refinements T_k and T_{k+1}. As already mentioned, for Crouzeix-Raviart non-conforming linear elements, the FE spaces associated with two consecutive mesh refinements are not nested. To enable the use of the general multilevel scheme, we consider the so-called differences and aggregates (DA) approach to construct a hierarchical two-level decomposition of the Crouzeix-Raviart systems. Such a technique was originally introduced and analysed in [8] for scalar elliptic problems. The algorithm is easily described on the macroelement level, see Figure 1(b). Let φ_1, …, φ_9 be the standard nodal non-conforming linear finite element basis functions on the macroelement E. Then for the 2D elasticity we use the basis functions

    φ_i^(1) = (φ_i, 0)^t   and   φ_i^(2) = (0, φ_i)^t,   i = 1, …, 9.
The vector of the macroelement basis functions

    ϕ_E = {Φ_i}_{i=1}^{18} = {φ_1^(1), φ_1^(2), φ_2^(1), φ_2^(2), …, φ_9^(1), φ_9^(2)}

can be transformed into a vector of new hierarchical basis functions ϕ̃_E = {Φ̃_i}_{i=1}^{18}, where Φ̃_i = Σ_j J_ij Φ_j, by using the transformation matrix J_DA = (J_ij). The 18 × 18 matrix J_DA acts as the identity on the six degrees of freedom attached to the interior nodes 1, 2, 3; its rows 7–12 contain the entries +1 and −1 forming, componentwise, the differences φ_4 − φ_5, φ_6 − φ_7, φ_8 − φ_9; and its rows 13–18 contain three unit entries each, forming, componentwise, the aggregates φ_1 + φ_4 + φ_5, φ_2 + φ_6 + φ_7, φ_3 + φ_8 + φ_9 (cf. (7)–(8) below).
Now we are in a position to define the splitting

    V(E) = span{Φ̃_i}_{i=1}^{18} = Ṽ_1(E) ⊕ Ṽ_2(E),

    Ṽ_1(E) = span{Φ̃_i}_{i=1}^{12}
           = span{ φ_1^(k+1), φ_2^(k+1), φ_3^(k+1), φ_4^(k+1) − φ_5^(k+1), φ_6^(k+1) − φ_7^(k+1), φ_8^(k+1) − φ_9^(k+1) }_{k=0}^{1},   (7)

    Ṽ_2(E) = span{Φ̃_i}_{i=13}^{18}
           = span{ φ_1^(k+1) + φ_4^(k+1) + φ_5^(k+1), φ_2^(k+1) + φ_6^(k+1) + φ_7^(k+1), φ_3^(k+1) + φ_8^(k+1) + φ_9^(k+1) }_{k=0}^{1}.   (8)
Accordingly, J_DA transforms the macroelement stiffness matrix A_E into a hierarchical form Ã_E = J_DA A_E J_DA^T,

    Ã_E = [ Ã_{E,11}  Ã_{E,12}     Φ̃_i ∈ Ṽ_1(E)
            Ã_{E,21}  Ã_{E,22} ]   Φ̃_i ∈ Ṽ_2(E).

The corresponding global stiffness matrix

    Ã^(k+1) = Σ_{E ∈ T_{k+1}} Ã_E

can again be decomposed into 2 × 2 blocks

    Ã^(k+1) = [ Ã_11^(k+1)  Ã_12^(k+1)
                Ã_21^(k+1)  Ã_22^(k+1) ],   (9)
which are induced by the decomposition on the macroelement level. The block Ã_11^(k+1) corresponds to interior nodal unknowns with respect to the macroelements E ∈ T_{k+1}, plus differences of the nodal unknowns along the sides of E ∈ T_{k+1}. The block Ã_22^(k+1) corresponds to certain aggregates of nodal unknowns. The related splitting of the current FE space reads as V^(k+1) = Ṽ_1^(k+1) ⊕ Ṽ_2^(k+1). The introduced decomposition can be used to construct two-level preconditioners. The efficiency of these preconditioners depends (see, e.g., [12]) on the corresponding strengthened CBS constant γ_DA:

    γ_DA = sup { ⟨Ã_12^(k+1) y_2, y_1⟩ / ( ⟨Ã_11^(k+1) y_1, y_1⟩^{1/2} ⟨Ã_22^(k+1) y_2, y_2⟩^{1/2} ) : y_1 ≠ 0, y_2 ≠ 0 }.

The following result holds.

Theorem 1. Consider the problem where the Lamé coefficients are constant on the coarse triangles E ∈ T_k, discretization by the Crouzeix-Raviart FE and the DA decomposition of the stiffness matrix. Then for any element size and shape and for any Poisson ratio ν ∈ (0, 1/2), there holds

    γ_DA² ≤ 3/4.

Proof. The constant γ_DA can be estimated by the maximum of the constants over the macroelements. Let us first consider the case of a right-angled reference macroelement, see Figure 2.
Fig. 2. The reference coarse grid triangle and the macroelement E
Let Ṽ_1(E), Ṽ_2(E) be the two-level splitting for the reference macroelement E. For u ∈ Ṽ_1(E) and v ∈ Ṽ_2(E) denote d^(r) = d(u)|_{T_r} and δ^(r) = d(v)|_{T_r}, r = 1, …, 4; see the definition (6). Then it is easy to show (cf. [8]) that

    d^(1) + d^(2) + d^(3) + d^(4) = 0,          (10)
    δ^(1) = δ^(2) = δ^(3) = −δ^(4) = δ.         (11)
Hence,

    a_E(u, v) = Σ_{r=1}^{4} ∫_{T_r} ⟨C^s d(u), d(v)⟩ dx = ∆ Σ_{r=1}^{4} ⟨C^s d^(r), δ^(r)⟩
              = ∆ ⟨C^s δ, d^(1) + d^(2) + d^(3) − d^(4)⟩                                   (12)
              = −2∆ ⟨C^s δ, d^(4)⟩ ≤ 2∆ |δ|_{C^s} |d^(4)|_{C^s},

where ∆ = area(T_r) (the same for all four triangles), ⟨x, y⟩ = x^T y denotes the inner product in R⁴, and |d|_{C^s} = ⟨C^s d, d⟩^{1/2} is the norm induced by the coefficient matrix C^s. Further,

    |d^(4)|²_{C^s} = |d^(1) + d^(2) + d^(3)|²_{C^s} ≤ 3 Σ_{k=1}^{3} |d^(k)|²_{C^s}

leads to

    a_E(u, u) = ∆ Σ_{k=1}^{4} |d^(k)|²_{C^s} ≥ ∆ (1 + 1/3) |d^(4)|²_{C^s}                  (13)

and

    a_E(v, v) = 4∆ |δ|²_{C^s}.                                                             (14)

Thus,

    a_E(u, v) ≤ √(3/4) √( a_E(u, u) a_E(v, v) ).                                           (15)
In the case of an arbitrarily shaped macroelement E we can use the affine mapping F : Ê → E to transform the problem to the reference macroelement; for more details see e.g. [3]. This transformation changes the coefficient matrix C^s, but the estimate γ_E² ≤ 3/4 still holds, since the result (15) for the reference macroelement does not depend on the coefficient matrix C^s.
Remark 1. The obtained new uniform estimate of the CBS constant γ is a generalization of the earlier estimate from [14], namely

    γ ≤ √(8 + √8) / 4 ≈ 0.822,

which was derived in the case of a regular triangulation T_0 obtained by a diagonal subdivision of a square mesh.
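As a side note (not part of the paper), the CBS constant of a concrete two-level splitting can also be checked numerically: it equals the largest singular value of L1⁻¹ Ã12 L2⁻ᵀ, where Ã11 = L1 L1ᵀ and Ã22 = L2 L2ᵀ are Cholesky factorizations. The sketch below assumes dense, symmetric positive definite blocks (e.g., macroelement blocks after eliminating any kernel).

    # Numerical evaluation of the strengthened CBS constant of a 2x2 block splitting.
    import numpy as np
    from scipy.linalg import cholesky, solve_triangular

    def cbs_constant(A11, A12, A22):
        L1 = cholesky(A11, lower=True)
        L2 = cholesky(A22, lower=True)
        B = solve_triangular(L1, A12, lower=True)        # L1^{-1} A12
        B = solve_triangular(L2, B.T, lower=True).T      # L1^{-1} A12 L2^{-T}
        return np.linalg.svd(B, compute_uv=False)[0]     # gamma = largest singular value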
3 Multilevel Preconditioning
The standard multiplicative two-level preconditioner, based on (9), can be written in the form

    C̃^(k+1) = [ Ã_11^(k+1)       0             [ I_1   (Ã_11^(k+1))^{-1} Ã_12^(k+1)
                Ã_21^(k+1)   Ã_22^(k+1) ]         0                I_2              ].   (16)
The convergence rate of the related two-level iterative method is based on the spectral condition number estimate

    κ( (C̃^(k+1))^{-1} Ã^(k+1) ) < 1 / (1 − γ_DA²) < 4.
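For illustration, applying the multiplicative two-level preconditioner (16) amounts to a block forward substitution followed by one correction. The sketch below is an assumed, generic realization (not the authors' code); it presumes that solvers for the two diagonal blocks are available, the 22-block being treated recursively in the multilevel case, cf. Theorem 2 below.

    # y = C^{-1} r for C = [[A11, 0], [A21, A22]] [[I, A11^{-1} A12], [0, I]].
    import numpy as np

    def apply_two_level(r, A12, A21, solve_A11, solve_A22):
        n1 = A21.shape[1]                # size of the "differences" block
        r1, r2 = r[:n1], r[n1:]
        z1 = solve_A11(r1)               # forward solve with the 11-block
        z2 = solve_A22(r2 - A21 @ z1)    # forward solve with the 22-block
        y1 = z1 - solve_A11(A12 @ z2)    # backward correction through A11^{-1} A12
        return np.concatenate([y1, z2])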
The following theorem is useful for extending the two-level preconditioner to multilevel preconditioners.

Theorem 2. Consider again the elasticity problem with constant Lamé coefficients on the triangles E ∈ T_k, discretization by the Crouzeix-Raviart FE and the DA decomposition of the stiffness matrix. Let Ã_22^(k+1) be the stiffness matrix corresponding to the space Ṽ_2^(k+1) from the introduced DA splitting, and let A^(k) be the stiffness matrix corresponding to the coarser triangulation T_k, equipped with the standard nodal finite element basis. Then

    Ã_22^(k+1) = 4 A^(k).   (17)

The proof follows almost directly from the definitions of the hierarchical basis functions Φ̃_i with value equal to one in two nodes of one of the macroelement sides and one opposite inner node, and of the corresponding coarse grid basis function with value equal to one in one node on the same side. This result enables the recursive multilevel extension of the considered two-level multiplicative preconditioner preserving the same estimate of the CBS constant. In particular, the general scheme of the algebraic multilevel iteration (AMLI) algorithm becomes straightforwardly applicable (see [6,7]).

We consider now the construction of optimal preconditioners for the coarse grid complement systems, i.e., preconditioners C̃_11^(k+1) for the blocks Ã_11^(k+1), see decomposition (9). We search for optimal preconditioners in the sense that they are spectrally equivalent to the top-left matrix block independently of mesh size, element shape and Poisson ratio. Moreover, the cost of applying the preconditioner is aimed to be proportional to the number of degrees of freedom. Similarly to [4,5,9], we construct preconditioners on the macroelement level and assemble the local contributions to obtain C̃_11^(k+1).

Remark: One possible approach is first to impose a displacement decomposition ordering, then use a block-diagonal approximation of Ã_11^(k+1), and then precondition the diagonal blocks, which are elliptic. Let us assume that the multiplicative preconditioner from [9] is applied to the diagonal blocks of Ã_11^(k+1). Then, for homogeneous isotropic materials, the following simplified estimate holds:

    κ( (B̃_11^(k+1))^{-1} Ã_11^(k+1) ) ≤ (15/8) (1 − ν) / (1 − 2ν).

The presented construction is optimal with respect to mesh size and mesh anisotropy, but is applicable only for moderate values of ν ∈ (0, 1/2). When the material is almost incompressible, it is better to apply first a macroelement-level static condensation of Ã_11^(k+1), which is equivalent to the elimination of all unknowns corresponding to the interior nodes of the macroelements, see Figure 1(b). Let us assume that the triangulation T_0 is constructed by diagonal subdivision of a square mesh. Let the corresponding Schur
complement be approximated by its diagonal. Then the resulting preconditioner satisfies the estimate

    κ( (B̃_11^(k+1))^{-1} Ã_11^(k+1) ) ≤ const ≈ 8.301…,

which is uniform with respect to the Poisson ratio ν (see [14] for more details). The robustness of the latter approach is demonstrated by the numerical tests presented in the next section.
4 Numerical Tests
It is well known that if low-order conforming finite elements are used in the construction of the approximation space, the so-called locking phenomenon appears when the Poisson ratio ν tends to 0.5. Following [10,11], we use the Crouzeix-Raviart linear finite elements to get a locking-free FEM solution of the problem. Note that the straightforward FEM discretization works well for the pure displacement problem only. The presented numerical tests illustrate the behavior of the FEM error as well as the optimal convergence rate of the AMLI algorithm when the size of the discrete problem is varied and ν ∈ (0, 1/2) tends to the incompressible limit. We consider the simplest test problem in the unit square Ω = (0, 1)² with E = 1. The right-hand side corresponds to the following exact solution

    u(x, y) = [ sin(πx) sin(πy), y(y − 1) x(x − 1) ].

The relative stopping criterion for the PCG iterations is

    (C_AMLI^{-1} r_{N_it}, r_{N_it}) / (C_AMLI^{-1} r_0, r_0) < ε²,
where r_i stands for the residual at the i-th iteration step. The relative FEM errors, given in Table 1, illustrate very well the locking-free approximation. Here the number of refinement steps is ℓ = 4, N = 1472, and ε = 10^{-9}. In Table 2, the numbers of iterations are presented as a measure of the robustness of the multilevel preconditioner. The optimal-order, locking-free convergence rate of the AMLI algorithm is clearly visible. Here β stands for the degree of the stabilization polynomial. The value β = 2 corresponds to the derived uniform estimate of the CBS constant, providing the total computational cost optimality of the related PCG algorithm.
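A minimal sketch of this relative stopping test (illustrative only, assuming a routine apply_prec(r) that returns C_AMLI^{-1} r, e.g. the two-level application sketched in Section 3):

    import numpy as np

    def converged(r, r0, apply_prec, eps):
        # (C^{-1} r, r) / (C^{-1} r0, r0) < eps^2
        return np.dot(apply_prec(r), r) < eps**2 * np.dot(apply_prec(r0), r0)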
Table 1. Relative error stability for ν → 1/2

  ν          ‖u − u_h‖_{[L²]²} / ‖f‖_{[L²]²}
  0.4        .3108249106503572
  0.49       .3695943747405575
  0.499      .3764879643773666
  0.4999     .3771889077038727
  0.49999    .3772591195613628
  0.499999   .3772661419401481
Table 2. Iterations: ε = 10⁻³, β = 2

  ℓ   N        ν=0.3   0.4   0.49   0.499   0.4999   0.49999   0.499999
  4   1472      13     13    12     13      13       13        13
  5   6016      12     12    12     14      13       13        13
  6   24320     12     12    12     12      13       13        13
  7   97792     11     11    11     12      13       13        13
  8   196096    11     11    11     12      12       13        13
5 Concluding Remarks
This study is strongly motivated by the expanding interest in non-conforming finite elements, which are very useful for solving problems where the standard conforming elements may suffer from the so-called locking effects. The success of the Crouzeix-Raviart and other non-conforming finite elements can also be explained by the fact that they produce algebraic systems that are equivalent to the Schur complement system for the Lagrange multipliers arising from the mixed finite element method for Raviart-Thomas elements. There are also other advantages of the non-conforming Crouzeix-Raviart finite elements, such as a lower density of the stiffness matrix, etc. The developed robust PCG algorithms are further applicable as efficient preconditioning operators in the context of nonlinear elliptic problems, see [13]. The presented multilevel algorithm has some pronounced inherently parallel features. The key point here is that the considered approximations of the coarse grid complement blocks Ã_11 are of either diagonal or generalized tridiagonal form. The observed potential for parallel implementation is of further importance for the efficient solution of large-scale real-life problems, including coupled long-time models, which are as a rule nonlinear.
Acknowledgments
This work is supported in part by the Centre of Excellence BIS-21 grant ICA1-2000-70016 and the Bulgarian NSF grant IO-01/2003.
References 1. B. Achchab, O. Axelsson, L. Laayouni, A. Souissi. Strengthened Cauchy-BunyakowskiSchwarz inequality for a three dimensional elasticity system, Numerical Linear Algebra with Applications, Vol. 8(3): 191-205, 2001. 2. B. Achchab, J.F. Maitre. Estimate of the constant in two strengthened CBS inequalities for FEM systems of 2D elasticity: Application to Multilevel methods and a posteriori error estimators, Numerical Linear Algebra with Applications, Vol. 3 (2): 147-160, 1996. 3. O. Axelsson, R. Blaheta. Two simple derivations of universal bounds for the C.B.S. inequality constant, Applications of Mathematics, Vol.49 (1): 57-72, 2004.
4. O. Axelsson, S. Margenov. On multilevel preconditioners which are optimal with respect to both problem and discretization parameters, Computational Methods in Applied Mathematics, Vol. 3, No. 1: 6-22, 2003. 5. O. Axelsson, A. Padiy. On the additive version of the algebraic multilevel iteration method for anisotropic elliptic problems, SIAM Journal on Scientific Computing 20, No.5: 1807-1830, 1999. 6. O. Axelsson and P.S. Vassilevski. Algebraic Multilevel Preconditioning Methods I, Numerische Mathematik, 56: 157-177, 1989. 7. O. Axelsson and P.S. Vassilevski. Algebraic Multilevel Preconditioning Methods II, SIAM Journal on Numerical Analysis, 27: 1569-1590, 1990. 8. R. Blaheta, S. Margenov, M. Neytcheva. Uniform estimate of the constant in the strengthened CBS inequality for anisotropic non-conforming FEM systems, Numerical Linear Algebra with Applications, 11: 309-326,2004. 9. R. Blaheta, S. Margenov, M. Neytcheva. Robust optimal multilevel preconditioners for nonconforming finite element systems, to appear in Numerical Linear Algebra with Applications. 10. S. Brenner and L. Scott. The mathematical theory of finite element methods, Texts in applied mathematics, vol. 15, Springer-Verlag, 1994. 11. S. Brenner and L. Sung. Linear finite element methods for planar linear elasticity, Math. Comp., 59: 321-338, 1992. 12. V. Eijkhout and P.S. Vassilevski. The role of the strengthened Cauchy-Bunyakowski-Schwarz inequality in multilevel methods, SIAM Review, 33: 405-419, 1991. 13. I. Farago, J. Karatson. Numerical solution of nonlinear elliptic problems via preconditionionig operators. Theory and applications, NOVA Science, 2002. 14. Tz. Kolev, S. Margenov. Two-level preconditioning of pure displacement non-conforming FEM systems, Numerical Linear Algebra with Applications, 6: 533-555, 1999. 15. Tz. Kolev, S. Margenov. AMLI preconditioning of pure displacement non-conforming elasticity FEM systems, Springer LNCS, 1988: 482-489, 2001. 16. S. Margenov. Upper bound of the constant in the strengthened C.B.S. inequality for FEM 2D elasticity equations, Numer. Linear Algebra Appl., 1: 65-74, 1994. 17. S. Margenov, P.S. Vassilevski. Algebraic multilevel preconditioning of anisotropic elliptic problems, SIAM Journal on Scientific Computing, V.15(5): 1026-1037, 1994.
Efficient Solvers for 3-D Homogenized Elasticity Model Ronald H.W. Hoppe and Svetozara I. Petrova Institute of Mathematics, University of Augsburg D-86159 Augsburg, Germany {hoppe,petrova}@math.uni-augsburg.de
Abstract. The optimization of the macroscopic behavior of microstructured materials using microscopic quantities as design variables is a well established discipline in materials science. The paper deals with recently produced microcellular biomorphic ceramics. The mechanical macromodel corresponding to these composite materials is obtained by homogenization. The homogenized elasticity tensor and its dependence on the design variables are computed numerically involving adaptive finite element approximations of elasticity problems in the 3D periodicity cell. Efficient iterative solvers based on incomplete Cholesky (IC) decomposition and algebraic multigrid method (AMG) as preconditioners of the stiffness matrix are proposed in the application of PCG method.
1 Introduction
The production of microcellular biomorphic ceramics by biotemplating processes is a particular area within biomimetics which has emerged as a perspective new technology in materials science during the past decade (cf., e.g., [11]). The biological object under consideration in this paper is naturally grown wood, which is known to be highly porous and to possess excellent mechanical properties. The wood morphologies are characterized by an open porous system of tracheidal cells which provide the transportation path for water and minerals in the living plants. The biotemplating process uses wooden specimen to produce graphite-like carbon preforms by high temperature pyrolysis followed by an infiltration by liquid-phase or gaseous-phase materials such as silicon (Si) or titanium (Ti) to come up with SiC- or TiC-ceramics (see, e.g., [6] for details). An important feature of the biotemplating process is that it preserves the high porosity of the wooden specimen and results in a final ceramics with excellent structural-mechanical and thermomechanical properties which can be used as heat insulators, particle filters, catalyst carriers, automotive tools, and medical implants. The macroscopic mechanical behavior of the microcellular biomorphic ceramics depends on microscopic geometrical quantities such as the size of the voids and the lengths and widths of the different layers forming the cell walls. While the size of the voids is determined by the growth of the wood itself (early/late wood), the other quantities can be influenced by tuning the parameters of the biotemplating process. Therefore, an optimal structural design of the ceramics can be performed where the state equation is given by linear elasticity and the design variables are chosen as the microstructural geometrical quantities (cf., e.g., [8]). The objective functional depends on the mode
of loading. Since the resolution of the microstructures is cost prohibitive with respect to computational work, the idea is to derive a homogenized macromodel featuring the dependence on the microstructural design variables and to apply the optimization process to the homogenized model.
2 Computation of Homogenized Elasticity Tensor For the structural optimization of the microcellular biomorphic SiC ceramics modern optimization techniques (see, [7]) are applied to the mechanical macromodel obtained by the homogenization approach (cf., e.g., [3,9]). We assume the workpiece of macroscopic length L to consist of periodically distributed constituents with a cubic periodicity cell Y of microscopic characteristic length consisting of an interior void channel (V) surrounded by layers of silicon carbide (SiC) and carbon (C) (cf. Fig.1).
Fig. 1. a) Periodicity cell Y = [0, ℓ]³, b) Cross section of Y = V ∪ SiC ∪ C
Assuming linear elasticity and denoting by u the displacements vector, the stress tensor σ is related to the linearized strain tensor e = (1/2)(∇u + (∇u)^T) by Hooke's law σ = E e, where E = E(X) = (E_ijkl(X)) stands for the elasticity tensor whose components attain different values in the regions V, SiC, and C:

    E = [ E1111  E1122  E1133  E1112  E1123  E1113
                 E2222  E2233  E2212  E2223  E2213
                        E3333  E3312  E3323  E3313
                               E1212  E1223  E1213
                                      E2323  E2313
          SYM                                E1313 ]   (2.1)

Introducing x := X/L and y := X/ℓ as the macroscopic and microscopic variables and ε := ℓ/L as the scale parameter, homogenization based on the standard double scale asymptotic expansion results in the homogenized elasticity tensor E^H = (E^H_ijkl) whose components are given by
    E^H_ijkl = (1/|Y|) ∫_Y ( E_ijkl(y) − E_ijpq(y) ∂ξ_p^kl/∂y_q ) dy.   (2.2)

The tensor ξ = (ξ_p^kl), k, l, p = 1, 2, 3, with periodic components ξ_p^kl ∈ H¹_per(Y), has to be computed via the solution of the elasticity problems

    ∫_Y E_ijpq(y) (∂ξ_p^kl/∂y_q) (∂φ_i/∂y_j) dy = ∫_Y E_ijkl(y) (∂φ_i/∂y_j) dy   (2.3)
for an arbitrary Y −periodic variational function φ ∈ H1 (Y ). We note that explicit formulas for the homogenized elasticity tensor are only available in case of laminated or checkerboard structures (cf., e.g., [2,9]). Therefore, (2.3) has to be solved numerically which has been done by using continuous, piecewise linear finite elements with respect to adaptively generated locally quasi-uniform and shape regular simplicial tetrahedral partitionings of the periodicity cell Y .
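As an illustrative post-processing sketch (not the authors' code), once the six cell problems (2.3) have been solved, formula (2.2) reduces to a quadrature loop over the periodicity cell; the array names and data layout below are assumptions of this sketch.

    # Evaluate the homogenized tensor (2.2) by quadrature.
    # E_loc[q,i,j,k,l]   : elasticity tensor at quadrature point q
    # grad_xi[q,k,l,p,m] : derivative d(xi^{kl}_p)/dy_m at quadrature point q
    # w[q]               : quadrature weight (including the Jacobian)
    import numpy as np

    def homogenized_tensor(E_loc, grad_xi, w, cell_volume):
        EH = np.einsum('q,qijkl->ijkl', w, E_loc)
        EH -= np.einsum('q,qijpm,qklpm->ijkl', w, E_loc, grad_xi)
        return EH / cell_volume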
3 Mesh Adaptivity by a Posteriori Error Estimation The computation of the homogenized elasticity coefficients requires the solution of linear elastic boundary value problems with the periodicity cell Y as the computational domain. Due to the composite character of the cell there are material interfaces where the solution changes significantly. Hence, local refinement of the underlying finite element mesh is strongly advised. In contrast to previous work in structural optimization (cf., e.g., [1,2]) where local refinement is done by manual remeshing, we have used an automatic grid refinement based on a posteriori error estimator of Zienkiewicz-Zhu type [13] obtained by local averaging of the computed stress tensor. Using an approximation of the components of the displacements vector by continuous, piecewise linear finite elements with respect to a simplicial tetrahedrization Th of the periodicity cell Y , we denote by σ ˆ the discontinuous finite element stress. A continuous recovered stress σ ∗ is obtained at each nodal point p by local averaging: Denoting by Yp ⊂ Y the union of all elements K ∈ Th sharing p as a vertex, we compute
    σ*(p) = Σ_{K ∈ Y_p} ω_K σ̂|_K,    ω_K := |K| / |Y_p|,  K ∈ Y_p.   (3.4)

Based on (3.4), we have chosen

    η := ( Σ_{K ∈ T_h} η_K² )^{1/2},    η_K := ‖σ* − σ̂‖_{0,K},  K ∈ T_h,   (3.5)
as a global estimator whose local contributions ηK are cheaply computable. Note that such an estimator has been studied and analyzed in [10] for linear second order elliptic boundary value problems where it was shown that η is asymptotically exact. Moreover, general averaging techniques for low order finite element approximations of linear elasticity problems have been considered in [5] and efficiency and reliability of Zienkiewicz-Zhu type estimators have been established.
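A compact sketch of the estimator (3.4)-(3.5) follows (illustrative only; the elementwise L²-norm is approximated crudely here, whereas an actual implementation would use quadrature, and all input arrays are assumptions of this sketch).

    import numpy as np

    def zz_indicators(sigma_hat, vol, patches, elem_nodes):
        # sigma_hat[K] : elementwise constant stress components, vol[K] : element volumes,
        # patches[p]   : elements sharing node p, elem_nodes[K] : vertex indices of tet K.
        sigma_star = np.zeros((len(patches), sigma_hat.shape[1]))
        for p, elems in enumerate(patches):                     # recovered stress, (3.4)
            w = vol[elems] / vol[elems].sum()
            sigma_star[p] = w @ sigma_hat[elems]
        eta_K = np.empty(len(vol))
        for K, nodes in enumerate(elem_nodes):                  # local contributions, (3.5)
            diff = sigma_star[nodes].mean(axis=0) - sigma_hat[K]
            eta_K[K] = np.sqrt(vol[K]) * np.linalg.norm(diff)   # rough stand-in for the L2(K) norm
        return np.sqrt(np.sum(eta_K**2)), eta_K                 # global eta and local eta_K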
4 Iterative Solution Techniques
After finite element discretization of the domain Y, the elasticity equation (2.3) used to compute the effective coefficients results in the following system of linear equations

    A u = f,   (4.6)
where u is the vector of unknown displacements and f is the discrete right–hand side. The stiffness matrix A is symmetric and positive definite but not an M -matrix. Two typical orderings of the unknowns are often used in practice, namely
    ( u_1^(x), u_1^(y), u_1^(z), u_2^(x), u_2^(y), u_2^(z), …, u_n^(x), u_n^(y), u_n^(z) ),   (4.7)

referred to as a pointwise displacements ordering, and

    ( u_1^(x), u_2^(x), …, u_n^(x), u_1^(y), u_2^(y), …, u_n^(y), u_1^(z), u_2^(z), …, u_n^(z) ),   (4.8)
called the separate displacements ordering. Here, u_k^(x), u_k^(y), and u_k^(z) are the corresponding x-, y-, and z-displacement components. Using (4.8), for instance, the matrix A admits the following 3 × 3 block decomposition

    A = [ A11  A12  A13
          A21  A22  A23
          A31  A32  A33 ].   (4.9)

In case of isotropic materials, the diagonal blocks A_jj, j = 1, 2, 3, in (4.9) are discrete analogs of the following anisotropic Laplacian operators

    D̃_1 = a ∂²/∂x² + b ∂²/∂y² + b ∂²/∂z²,   D̃_2 = b ∂²/∂x² + a ∂²/∂y² + b ∂²/∂z²,   D̃_3 = b ∂²/∂x² + b ∂²/∂y² + a ∂²/∂z²,
with coefficients a = E(1 − ν)/((1 + ν)(1 − 2ν)) and b = 0.5 E/(1 + ν), where E is the Young modulus and ν is the Poisson ratio of the corresponding material. This anisotropy requires special care in constructing an efficient preconditioner for the iterative solution method. Based on Korn's inequality, it can be shown that A and its block diagonal part are spectrally equivalent. The condition number of the preconditioned system depends on the Poisson ratio ν of the materials and the constant in the Korn inequality. For the background of the spectral equivalence approach using block diagonal displacement decomposition preconditioners in linear elasticity problems we refer to [4]. Note that the spectral equivalence estimate will deteriorate for ν close to 0.5, which is not the case in our particular applications. The PCG method is applied to solve the linear system (4.6). We propose two approaches to construct a preconditioner for A:
(i) construct a preconditioner for the global matrix A;
(ii) construct a preconditioner for A of the type

    M = [ M11   0    0
           0   M22   0
           0    0   M33 ],   (4.10)
where M_jj ∼ A_jj, j = 1, 2, 3, are "good" approximations to the diagonal blocks of A. In case (i) we have chosen the incomplete Cholesky (IC) factorization of A with an appropriate stopping criterion. An efficient preconditioner for A_jj in case (ii) turns out to be a matrix M_jj corresponding to a Laplacian operator (−div(c grad u)) with a fixed scale factor c. In our case we use, for instance, c = b/2 for all three diagonal blocks. The algebraic multigrid (AMG) method is applied as a "plug-in" solver for A (see [12] for details). This method is a purely matrix-based version of the algebraic multilevel approach and has in the last decade seen numerous efficient implementations for solving large sparse unstructured linear systems of equations without any geometric background.
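For illustration, the block-diagonal variant (ii) combined with PCG might be set up as follows; PyAMG and SciPy stand in here for the AMG and PCG components actually used, so the sketch shows the structure rather than the authors' implementation.

    # PCG for A u = f with M = diag(M11, M22, M33) in the separate displacements
    # ordering (4.8); each block is handled by one smoothed-aggregation AMG V-cycle.
    import numpy as np
    import scipy.sparse.linalg as spla
    import pyamg

    def block_diag_pcg(A, f, blocks):
        # blocks = [A11, A22, A33] as SciPy sparse matrices (assumed input layout)
        amg_ops = [pyamg.smoothed_aggregation_solver(B.tocsr()).aspreconditioner(cycle='V')
                   for B in blocks]
        offsets = np.cumsum([0] + [B.shape[0] for B in blocks])

        def apply_M(r):
            y = np.empty_like(r)
            for k, op in enumerate(amg_ops):
                y[offsets[k]:offsets[k+1]] = op * r[offsets[k]:offsets[k+1]]
            return y

        M = spla.LinearOperator(A.shape, matvec=apply_M)
        u, info = spla.cg(A, f, M=M)
        return u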
5 Numerical Experiments

In this section, we present some computational results concerning the microscopic problem to find the homogenized elasticity coefficients. The elasticity equation (2.3) is solved numerically using an initial decomposition of the periodic microcell Y into hexahedra and then continuous, piecewise linear finite elements on tetrahedral shape regular meshes. Due to the equal solutions ξ^12 = ξ^21, ξ^23 = ξ^32, and ξ^13 = ξ^31, one has to solve six problems in the period Y to find ξ^11 (Problem 1), ξ^22 (Problem 2), ξ^33 (Problem 3), ξ^12 (Problem 4), ξ^23 (Problem 5), and ξ^13 (Problem 6). The discretized problems have been solved by the iterative solvers discussed in Section 4, and the mesh adaptivity has been realized by means of a Zienkiewicz-Zhu type a posteriori error estimator [13].
Fig. 2. Cross section of Y , density = 96%, Problem 3, nt = 12395, nn = 2692
The Young modulus E (in GPa) and the Poisson ratio ν of our two materials are, respectively, E = 10, ν = 0.22 for carbon and E = 410, ν = 0.14 for SiC. We denote by nt the number of tetrahedra and by nn the number of nodes on the corresponding refinement level. In Fig. 2 the adaptive mesh refinement is visualized on the cross section of the period Y. Tables 1 and 2 contain the computed homogenized coefficients according to the refinement level. Table 3 presents some convergence results for the proposed preconditioners within the PCG method. For various values of the density µ of the periodic microstructure we report the number of degrees of freedom (d.o.f.), the number of iterations (iter), and the CPU time in seconds for the first 11 adaptive refinement levels. The numerical results show a better convergence of the AMG preconditioner compared to the IC factorization; the advantage of AMG becomes substantial as the number of unknowns grows.
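As a small worked example (illustrative only, not taken from the paper's code), the coefficients a and b of the anisotropic operators D̃_j from Section 4, together with the scale factor c = b/2 used for the Laplacian-based blocks M_jj, can be evaluated directly from these material data:

```python
# Worked evaluation of a = E(1 - nu)/((1 + nu)(1 - 2 nu)) and b = 0.5 E/(1 + nu),
# plus the scale factor c = b/2, for the two materials given above.
def anisotropy_coefficients(E, nu):
    a = E * (1.0 - nu) / ((1.0 + nu) * (1.0 - 2.0 * nu))
    b = 0.5 * E / (1.0 + nu)
    return a, b

for name, E, nu in [("carbon", 10.0, 0.22), ("SiC", 410.0, 0.14)]:
    a, b = anisotropy_coefficients(E, nu)
    print(f"{name}: a = {a:.1f} GPa, b = {b:.1f} GPa, c = b/2 = {b / 2:.1f} GPa")
```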
Table 1. Homogenized coefficients w.r.t. adaptive refinement level, µ = 19%

level   E^H_1111   E^H_2222   E^H_3333   nt/nn (Prob. 1)   nt/nn (Prob. 2)   nt/nn (Prob. 3)
  1      160.82     174.34     204.39        288/126           288/126           288/126
  2      175.60     207.65     214.79        334/137           332/136           332/136
  3      159.18     170.78     206.73        443/166           457/169           441/165
  4      174.07     166.79     213.77        637/222           593/208           595/211
  5      168.97     163.26     214.10        982/297           971/292           948/287
  6      146.50     147.22     208.64       1641/433          1609/431          1684/443
  7      160.25     147.90     211.11       2516/624          2422/604          2427/601
  8      146.80     138.09     211.82       3761/896          3915/927          3881/920
  9      137.10     134.55     210.36       5134/1171         7743/1722         5092/1160
 10      133.22     131.84     210.91      10839/2259        13698/2869        11078/2289
Table 2. Homogenized coefficients for late wood, density µ = 91%

level   E^H_1111   E^H_2222   E^H_3333   E^H_1212   E^H_2323   E^H_1313
  1      148.35     152.57     153.96      60.22      62.46      59.50
  2      154.34     162.64     162.77      69.71      71.31      65.79
  3      142.66     148.42     162.79      60.51      65.26      63.23
  4      145.84     137.61     161.70      53.91      59.04      62.92
  5      127.99     134.32     161.43      49.41      56.19      56.49
  6       98.29     111.65     160.71      40.44      46.14      48.45
  7       91.79      90.23     158.29      35.70      43.69      46.03
  8       82.42      83.00     160.57      30.59      41.03      43.70
  9       75.05      75.11     160.22      26.93      39.75      40.97
 10       69.66      70.30     159.82      25.47      37.16      39.30
Table 3. Convergence results with IC and AMG preconditioners, density µ, Problem 1

                       µ = 51%                                     µ = 84%
level   d.o.f.   IC iter   IC CPU   AMG iter   AMG CPU   d.o.f.   IC iter   IC CPU   AMG iter   AMG CPU
  1        78        9      e-16       11       e-16        78       10      e-16       12       e-16
  2        90        8      e-16       13       e-16        93       11      e-16       14       e-16
  3       126       14      e-16       13       e-16       150       16      0.1        14       e-16
  4       225       23      0.1        15       0.2        261       21      0.1        14       0.2
  5       336       40      0.2        18       0.3        510       44      0.1        18       0.4
  6       579       66      0.2        23       0.5       1047       78      0.6        31       1.1
  7      1185      105      0.9        38       1.5       2103      117      2.4        43       3
  8      1908      150      2.4        57       3         3843      171      8.4        73       7.5
  9      3360      235      8.2        89       7.6       6537      226     24.3        69      15.5
 10      5598      269     20.9        94      14.8      10485      273     63.7        74      25.6
 11      9987      299     59.1        99      23.5      18459      301    187.1        75      33.8
Acknowledgments This work has been partially supported by the German National Science Foundation (DFG) under Grant No.HO877/5-3. The second author has also been supported in part by the Bulgarian NSF under Grant I1402/2004.
References

1. M.P. Bendsøe and N. Kikuchi. Generating optimal topologies in structural design using a homogenization method. Comput. Methods Appl. Mech. Eng., 71:197–224, 1988.
2. M.P. Bendsøe and O. Sigmund. Topology Optimization: Theory, Methods and Applications. Springer, Berlin-Heidelberg-New York, 2003.
3. A. Bensoussan, J.L. Lions, and G. Papanicolaou. Asymptotic Analysis for Periodic Structures. North-Holland, Elsevier Science Publishers, Amsterdam, 1978.
4. R. Blaheta. Displacement decomposition – incomplete factorization preconditioning techniques for linear elasticity problems. Numer. Linear Algebra Appl., 1(2):107–128, 1994.
5. C. Carstensen and St. Funken. Averaging technique for FE-a posteriori error control in elasticity. Part II: λ-independent estimates. Comput. Methods Appl. Mech. Engrg., 190:4663–4675, 2001.
6. P. Greil, T. Lifka, and A. Kaindl. Biomorphic cellular silicon carbide ceramics from wood: I. Processing and microstructure, and II. Mechanical properties. J. Europ. Cer. Soc., 18:1961–1973 and 1975–1983, 1998.
7. R.H.W. Hoppe and S.I. Petrova. Applications of primal-dual interior methods in structural optimization. Comput. Methods Appl. Math., 3(1):159–176, 2003.
8. R.H.W. Hoppe and S.I. Petrova. Optimal shape design in biomimetics based on homogenization and adaptivity. Math. Comput. Simul., 65(3):257–272, 2004.
9. V.V. Jikov, S.M. Kozlov, and O.A. Oleinik. Homogenization of Differential Operators and Integral Functionals. Springer, 1994.
10. R. Rodriguez. Some remarks on the Zienkiewicz-Zhu estimator. Numer. Meth. PDEs, 10:625–635, 1994.
11. M. Sarikaya and I.A. Aksay. Biomimetics: Design and Processing of Materials. AIP Series in Polymer and Complex Materials, Woodbury (New York), 1995.
12. K. Stüben. A review of algebraic multigrid. J. Comput. Appl. Math., 128:281–309, 2001.
13. O.C. Zienkiewicz and J.Z. Zhu. A simple error estimator and adaptive procedure for practical engineering analysis. Intern. J. Numer. Methods Eng., 24:337–357, 1987.
Performance Evaluation of a Parallel Algorithm for a Radiative Transfer Problem

Paulo B. Vasconcelos and Filomena d'Almeida

1 Faculdade de Economia da Universidade do Porto, rua Dr. Roberto Frias s/n, 4200-464 Porto, Portugal
  [email protected]
2 Faculdade de Engenharia da Universidade do Porto, rua Dr. Roberto Frias s/n, 4200-465 Porto, Portugal
  [email protected]
Abstract. The numerical approximation and parallelization of an algorithm for the solution of a radiative transfer equation modeling the emission of photons in stellar atmospheres will be described. This is formulated in the integral form yielding a weakly singular Fredholm integral equation defined on a Banach space. The main objective of this work is to report on the performance of the parallel code. Keywords: High performance computing, Fredholm integral equation, weakly singular kernel, projection approximation, iterative refinement, numerical methods. AMS Subject Classification: 32A55, 45B05, 65D20, 65R20, 68W10.
1 The Integral Problem

We will consider the integral problem Tϕ = zϕ + f defined on the Banach space X = L¹([0, τ*]), where T is the integral operator

(T x)(\tau) = \int_0^{\tau^*} g(|\tau - \tau'|)\, x(\tau')\, d\tau', \qquad 0 \le \tau \le \tau^*,        (1.1)

the kernel g being a weakly singular decreasing function on ]0, τ*] and z belonging to the resolvent set of T. This integral equation represents a radiative transfer problem. The emission of photons in stellar atmospheres is modeled by a strongly coupled system of nonlinear equations, and equation (1.1) is a restriction of it obtained by considering the temperature and pressure as given (see [2] and [11] for details). For this astrophysics problem the free term f is taken to be f(τ) = −1 if 0 ≤ τ ≤ τ*/2, f(τ) = 0 if τ*/2 < τ ≤ τ*, and the kernel g is defined by

g(\tau) = \frac{\varpi}{2}\, E_1(\tau),        (1.2)
This work was supported by CMUP. CMUP is financed by FCT under programs POCTI and POSI from QCA III with FEDER and National funds.
where ϖ denotes the albedo and

E_1(\tau) = \int_1^{\infty} \frac{\exp(-\tau\mu)}{\mu}\, d\mu, \qquad \tau > 0.        (1.3)

This is the first of the sequence of functions

E_\nu(\tau) = \int_1^{\infty} \frac{\exp(-\tau\mu)}{\mu^{\nu}}\, d\mu, \qquad \tau \ge 0,\ \nu > 1,        (1.4)

which have the following properties: E'_{\nu+1} = -E_\nu and E_\nu(0) = \frac{1}{\nu - 1}, \nu > 1 [1]. E_1 has a logarithmic singularity at τ = 0.
The numerical solution is based on the projection of the integral operator onto a finite dimensional subspace. To obtain good accuracy it is necessary to use a large dimension and, in consequence, a large algebraic linear system. The approach we use is to refine iteratively an approximate initial solution obtained by the projection onto a subspace of moderate (small) size. This iterative refinement is a Newton-type method where the resolvent operator is replaced with three different approximations. In contrast to the usual integral problems, this one leads to a banded sparse discretization matrix. This fact is taken into account by using sparse data handling and by a particular distribution of the matrix blocks among the processors.
2 Projection and Iterative Refinement

Integral equations of this type are usually solved by discretization, for instance by projection methods, onto a finite dimensional subspace. The operator T is thus approximated by T_n, its projection onto the finite dimensional subspace X_n spanned by n linearly independent functions in X. In this case we will consider in X_n the basis of piecewise constant functions on each subinterval of [0, τ*] determined by a grid of n + 1 points 0 = τ_{n,0} < τ_{n,1} < ... < τ_{n,n} = τ*. We get

T_n(x) = \sum_{j=1}^{n} \langle T x, e^*_{n,j} \rangle\, e_{n,j},        (2.5)

for all x ∈ X, where e^*_{n,j} ∈ X* (the adjoint space of X) and

\langle x, e^*_{n,j} \rangle = \frac{1}{h_{n,j}} \int_{\tau_{n,j-1}}^{\tau_{n,j}} x(\tau)\, d\tau,        (2.6)

with h_{n,j} = τ_{n,j} − τ_{n,j−1}. The approximate problem

(T_n - zI)\,\varphi_n = f        (2.7)

is then solved by means of an algebraic linear system of equations

(A_n - zI)\,x_n = b_n,        (2.8)
where A_n is a nonsingular matrix of dimension n. The relation between x_n and ϕ_n is given by

\varphi_n = \frac{1}{z} \Big( \sum_{j=1}^{n} x_n(j)\, e_{n,j} - f \Big).

In order to obtain an approximate solution ϕ_n with very good accuracy by this method it may be necessary to use a very large dimensional linear system. Alternatively we may compute an initial approximation using a linear system of small dimension corresponding to a coarse grid on [0, τ*] and refine it by a refinement formula based on a Newton-type method applied to the problem formulated under the form Tx − zx − f = 0. The Fréchet derivative of the left hand side operator is approximated by an operator F_n built on the coarse grid set on the domain:

x^{(k+1)} = x^{(k)} - F_n^{-1}\,(T x^{(k)} - z x^{(k)} - f).

The first choice for F_n is (T_n − zI), F_n^{-1} being R_n(z), the resolvent operator of T_n, which approximates R(z) if n is large enough. F_n^{-1} can also be replaced with \frac{1}{z}(R_n(z)T - I) or \frac{1}{z}(T R_n(z) - I). After some algebraic manipulations we can set the corresponding three refinement schemes under the form [5,7,9]:

Scheme 1:   x^{(0)} = R_n(z)\, f,
            x^{(k+1)} = x^{(0)} + R_n(z)\,(T_n - T)\, x^{(k)};

Scheme 2:   x^{(0)} = \frac{1}{z}\,(R_n(z)\,T - I)\, f,
            x^{(k+1)} = x^{(0)} + \frac{1}{z}\, R_n(z)\,(T_n - T)\, T x^{(k)};

Scheme 3:   x^{(0)} = \frac{1}{z}\,(T\, R_n(z) - I)\, f,
            x^{(k+1)} = x^{(0)} + \frac{1}{z}\, T\, R_n(z)\,(T_n - T)\, x^{(k)}.
A finer grid of m + 1 points 0 = τ_{m,0} < τ_{m,1} < ... < τ_{m,m} = τ* is set to obtain a projection operator T_m which is only used to replace the operator T in the schemes above and not to solve the corresponding approximate equation (2.7) with dimension m. The accuracy of the refined solution obtained by the three refinement formulae is the same as would be obtained by applying the projection method directly to this fine grid operator (see [3]). The convergence theorems and error bounds can be found in [4].
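To illustrate how Scheme 1 operates, the sketch below implements the refinement iteration with all operators represented as plain matrices acting on the fine grid. The matrices Tm and Tn are hypothetical stand-ins (a generic symmetric kernel matrix and a banded surrogate of it), not the radiative transfer operators or their projections; the sketch only shows the fixed-point structure of the scheme.

```python
# Minimal dense-matrix sketch of refinement Scheme 1 (assumed toy operators).
import numpy as np

def scheme1(Tn_mat, Tm_mat, f, z, tol=1e-12, max_iter=500):
    """x^(k+1) = x^(0) + R_n(z) (T_n - T) x^(k), with x^(0) = R_n(z) f."""
    I = np.eye(Tn_mat.shape[0])
    Rn = np.linalg.inv(Tn_mat - z * I)          # resolvent R_n(z) = (T_n - zI)^{-1}
    x0 = Rn @ f
    x = x0.copy()
    for k in range(1, max_iter + 1):
        x = x0 + Rn @ ((Tn_mat - Tm_mat) @ x)
        if np.linalg.norm(Tm_mat @ x - z * x - f) < tol:
            return x, k
    return x, max_iter

# Hypothetical toy operators: a symmetric "integral-like" matrix and a banded
# surrogate playing the role of the coarse approximation T_n.
m = 50
tau = np.linspace(0.0, 1.0, m)
Tm = np.exp(-np.abs(tau[:, None] - tau[None, :])) / m
Tn = np.triu(np.tril(Tm, 5), -5)                # keep only the band |i - j| <= 5
f = np.ones(m)
x, iters = scheme1(Tn, Tm, f, z=1.0)
print(iters, np.linalg.norm(Tm @ x - x - f))    # residual of (T - z)x = f
```

The iteration converges because the toy surrogate is close enough to the full matrix; in the paper the same role is played by the coarse-grid projection T_n of the fine-grid operator T_m.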
3 Matrix Computations

The evaluation of T cannot be done in closed form, so T will be replaced with the operator T_m corresponding to the projection of T onto X_m with m ≫ n. The functional basis in X_n is {e_{n,j}}_{j=1,...,n} and in X_m it is {e_{m,j}}_{j=1,...,m}, and the matrices representing the operators T_n and T_m restricted to X_n and X_m are respectively A_n and A_m. By observing the schemes 1, 2 and 3 we see that the restrictions of T_n to X_m and of T_m to X_n are also needed; they will be denoted by C and D, respectively.
To simplify the description of the elements of the matrices we will assume that the coarse grid is a subset of the fine grid and we denote by q the ratio m/n. For matrix D we have

D(i,j) = \frac{\varpi}{2 h_{m,i}} \int_{\tau_{m,i-1}}^{\tau_{m,i}} \int_{0}^{\tau^*} E_1(|\tau - \tau'|)\, e_{n,j}(\tau')\, d\tau'\, d\tau        (3.9)
       = \frac{\varpi}{2 h_{m,i}} \int_{\tau_{m,i-1}}^{\tau_{m,i}} \int_{\tau_{n,j-1}}^{\tau_{n,j}} E_1(|\tau - \tau'|)\, d\tau'\, d\tau

for i = 1, ..., m and j = 1, ..., n. Carrying out the inner integration,

D(i,j) = \frac{\varpi}{2 h_{m,i}} \int_{\tau_{m,i-1}}^{\tau_{m,i}} \big( -E_2(\tau_{n,j} - \tau) + E_2(\tau_{n,j-1} - \tau) \big)\, d\tau
             \quad \text{if } \tau_{m,i-1} \le \tau_{m,i} \le \tau_{n,j-1} \le \tau_{n,j},
       = \frac{\varpi}{2 h_{m,i}} \int_{\tau_{m,i-1}}^{\tau_{m,i}} \big( E_2(\tau - \tau_{n,j}) - E_2(\tau - \tau_{n,j-1}) \big)\, d\tau
             \quad \text{if } \tau_{n,j-1} \le \tau_{n,j} \le \tau_{m,i-1} \le \tau_{m,i},
       = \frac{\varpi}{2 h_{m,i}} \int_{\tau_{m,i-1}}^{\tau_{m,i}} \big( 2 - E_2(\tau - \tau_{n,j-1}) - E_2(\tau_{n,j} - \tau) \big)\, d\tau
             \quad \text{if } \tau_{n,j-1} \le \tau_{m,i-1} \le \tau_{m,i} \le \tau_{n,j};

and, after the outer integration,

D(i,j) = \frac{\varpi}{2 h_{m,i}} \big( -E_3(\tau_{n,j} - \tau_{m,i}) + E_3(\tau_{n,j} - \tau_{m,i-1}) + E_3(\tau_{n,j-1} - \tau_{m,i}) - E_3(\tau_{n,j-1} - \tau_{m,i-1}) \big)        (3.10)
             \quad \text{if } \tau_{m,i-1} \le \tau_{m,i} \le \tau_{n,j-1} \le \tau_{n,j},
       = \frac{\varpi}{2 h_{m,i}} \big( -E_3(\tau_{m,i} - \tau_{n,j}) + E_3(\tau_{m,i-1} - \tau_{n,j}) + E_3(\tau_{m,i} - \tau_{n,j-1}) - E_3(\tau_{m,i-1} - \tau_{n,j-1}) \big)
             \quad \text{if } \tau_{n,j-1} \le \tau_{n,j} \le \tau_{m,i-1} \le \tau_{m,i},
       = \varpi \Big( 1 + \frac{1}{2 h_{m,i}} \big( -E_3(\tau_{n,j} - \tau_{m,i}) + E_3(\tau_{n,j} - \tau_{m,i-1}) + E_3(\tau_{m,i} - \tau_{n,j-1}) - E_3(\tau_{m,i-1} - \tau_{n,j-1}) \big) \Big)
             \quad \text{if } \tau_{n,j-1} \le \tau_{m,i-1} \le \tau_{m,i} \le \tau_{n,j}.
Matrix C is obtained in a similar way. To obtain the elements of matrices A_m or A_n we use formulae (3.9) by replacing n with m or vice-versa and use the fact that E_3(0) = 1/2:

A(i,j) = \frac{\varpi}{2 h_{m,i}} \big( -E_3(\tau_{m,i} - \tau_{m,j}) + E_3(\tau_{m,i-1} - \tau_{m,j}) + E_3(\tau_{m,i} - \tau_{m,j-1}) + E_3(\tau_{m,i-1} - \tau_{m,j-1}) \big)
             \quad \text{if } i \ne j,        (3.11)
       = \varpi \Big( 1 + \frac{1}{h_{m,i}} \big( E_3(h_{m,i}) - \tfrac{1}{2} \big) \Big)
             \quad \text{if } i = j.

Remark (see also [2]) that

D = A_m P,        (3.12)

where

P(k,j) = \begin{cases} 1, & q(j-1) + 1 \le k \le q\,j, \\ 0, & \text{otherwise,} \end{cases}

and, similarly,

C = R A_m,        (3.13)

where

R(j,k) = \begin{cases} h_{m,k}/h_{n,j}, & q(j-1) + 1 \le k \le q\,j, \\ 0, & \text{otherwise.} \end{cases}
4 Numerical Results

The parallelization of the code corresponding to the construction of matrices A_n, A_m, C, D and of the codes corresponding to schemes 1, 2 and 3 was done on a Beowulf machine consisting of 22 processing nodes connected by a fast Ethernet switch (100 Mbps). Each node is a Pentium III - 550 MHz with 128 MB of RAM. There is also a front-end dual Pentium III - 550 MHz with 512 MB of RAM. The operating system is Linux Slackware 7.0 and the software used for the communications between the nodes is MPI (Message Passing Interface) [13]. A cyclic distribution of A_m by columns allows scalability since the matrix is banded. Matrix C, of size n × m, is distributed by block columns and D, of size m × n, by block rows because n ≪ m. The coarse grid will have enough points to ensure convergence of the iterative refinement but it will be as small as possible so that the solution of the small linear system (2.8) can be done in all processors at a small cost in computing time. In [6] the building of matrices C and D was done by formulae C = R A_m and D = A_m P to avoid some approximate integral computations of the function E_3. For parallel processing, however, matrix D must be computed by (3.10) because that can be done in
Table 1. Number of iterations and elapsed times for the three iterative schemes with different numbers of processors (m = 10000, n = 1000)

nb. processors p   Scheme 1 (43 iterations)   Scheme 2 (21 iterations)   Scheme 3 (21 iterations)
        1                  3472.7                     3472.4                     3472.2
        2                  1749.4                     1749.0                     1748.9
        4                   885.0                      884.4                      884.4
        5                   714.1                      713.6                      713.6
       10                   375.3                      375.1                      375.1
Table 2. Speedup and efficiency for the three iterative schemes up to 10 processors (m = 10000, n = 1000)

nb. processors p    Sp     Ep
        1           1      1.00
        2           1.99   0.99
        4           3.93   0.98
        5           4.86   0.97
       10           9.26   0.93
Table 3. Number of iterations and elapsed times for the three iterative schemes with 10 and 20 processors (m = 100000, n = 5000)

nb. processors p   Scheme 1 (48 iterations)   Scheme 2 (24 iterations)   Scheme 3 (24 iterations)
       10                 32590.6                    32585.8                    32585.1
       20                 14859.7                    14855.0                    14854.2
parallel and, on the other hand, the communications involved in the product D = A_m P are avoided. The numerical tests performed were designed to answer the question of whether it is better to compute D and C by (3.12) and (3.13) or by (3.10). The question arises because the computation of D by (3.12) requires too many communications since A_m is distributed. This becomes more and more important as the number of processors grows. The matrix blocks needed for C = R A_m are all in the same processor, so it is better to use (3.13) instead of the integral formulation. The iterative refinement is not well suited for parallelization because it requires too many communications, but that is not important since the most time consuming phase of the solution of the problem is the computation of the matrix elements, and that is "embarrassingly parallel". The linear system (2.8) can be solved either by block band LU factorization [10] or by a Krylov iterative method (usually preconditioned GMRES) [12]. Usually we are interested in very large dimensions and matrix A_m does not fit in one processor, so it has to be computed in a distributed way. This fact forces the matrix-vector
Fig. 1. Performance of the parallel code up to 20 processors: elapsed time versus problem size on a log-log scale, for p = 1, 5, 10, and 20 processors
products in the refinement formulae to be done locally by blocks, and the partial results have to be communicated to the other processors to be added. The numerical tests presented correspond to m = 10000, n = 1000, z = 1 and the albedo ϖ = 0.75. A nonuniform grid was considered in [0, τ*] by taking 4 different zones and a higher number of grid points at the beginning and at the middle of the interval. The required accuracy was a residual less than 10^{-12}. We remark that for the chosen m and n values, the approximate solution for the 10000 × 10000 linear system is obtained without solving the corresponding system but, instead, by solving a 1000 × 1000 linear system and iteratively refining its solution. Table 1 shows the times of the total solution, with the computation of D done by (3.10), for the three different refinement schemes. We can see that schemes 2 and 3 take approximately one half of the number of iterations of scheme 1, but the elapsed times for the three schemes are similar. Most of the time spent by the three schemes is in the computation of the matrices, which is similar for all of them. That explains why, as we can see in Table 2, the speedups and efficiencies are the same. As for the iterative part, schemes 2 and 3 take a little less time than scheme 1 since they require fewer iterations,
although they perform two products by A_m instead of one. For all methods, as the number of processors grows the elapsed computation time decreases. Figure 1 shows the elapsed time per node of the solution of the problem. When the number of processors is scaled by a constant factor, the same efficiency is achieved for equidistant problem sizes on a logarithmic scale. This is denoted as isoefficiency or isogranularity [8]. The times for the computation of D by formulae (3.10), for this example, are approximately 31 seconds on 10 processors, while with formulae (3.12) they would be approximately 70 seconds. These results show that on a cluster of processors such as this one, it is useful to compute D by the integral formulae because, although it requires much more arithmetic, it does not involve communications. This will become more important when the number of processors and their processing speed grow significantly. With this approach we were able to solve on this machine a 100000 × 100000 problem based on the solution of a 5000 × 5000 linear system of equations (Table 3).
References

1. M. Abramowitz and I.A. Stegun. Handbook of Mathematical Functions. Dover, New York, 1960.
2. M. Ahues, F.D. d'Almeida, A. Largillier, O. Titaud and P. Vasconcelos. An L1 Refined Projection Approximate Solution of the Radiation Transfer Equation in Stellar Atmospheres. J. Comput. Appl. Math., 140:13–26, 2002.
3. M. Ahues, A. Largillier and B.V. Limaye. Spectral Computations with Bounded Operators. Chapman and Hall, Boca Raton, 2001.
4. M. Ahues, A. Largillier and O. Titaud. The Roles of a Weak Singularity and the Grid Uniformity in the Relative Error Bounds. Numer. Funct. Anal. and Optimiz., 22(7&8):789–814, 2001.
5. M. Ahues, F.D. d'Almeida, A. Largillier, O. Titaud and P. Vasconcelos. Iterative Refinement Schemes for an ill-Conditioned Transfer Equation in Astrophysics. Algorithms for Approximation IV, Univ. of Huddersfield, 70–77, 2002.
6. F.D. d'Almeida and P.B. Vasconcelos. A Parallel Implementation of the Atkinson Algorithm for Solving a Fredholm Equation. Lecture Notes in Computer Science, 2565:368–376, 2003.
7. K.E. Atkinson. A Survey of Numerical Methods for the Solution of Fredholm Integral Equations of the Second Kind. Society for Industrial and Applied Mathematics, Philadelphia, PA, 1976.
8. L.S. Blackford, J. Choi, A. Cleary, E. D'Azevedo, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R.C. Whaley. ScaLAPACK Users' Guide. Society for Industrial and Applied Mathematics, Philadelphia, PA, 1997.
9. H. Brakhage. Über die numerische Behandlung von Integralgleichungen nach der Quadraturformelmethode. Numer. Math., 2:183–196, 1960.
10. J.J. Dongarra, I.S. Duff, D.C. Sorensen and H.A. van der Vorst. Numerical Linear Algebra for High-Performance Computers. Society for Industrial and Applied Mathematics, Philadelphia, 1998.
11. B. Rutily. Multiple Scattering Theory and Integral Equations. Integral Methods in Science and Engineering: Analytic and Numerical Techniques, Birkhäuser, 211–231, 2004.
12. Y. Saad. Iterative Methods for Sparse Linear Systems. PWS Publishing Company, 1996.
13. M. Snir, S. Otto, S. Huss-Lederman, D. Walker, and J.J. Dongarra. MPI: The Complete Reference. The MIT Press, 1996.
Performance Evaluation and Design of Hardware-Aware PDE Solvers: An Introduction

Organizers: Frank Hülsemann and Markus Kowarschik

System Simulation Group, Computer Science Department,
Friedrich-Alexander-University Erlangen-Nuremberg, Germany
{Frank.Huelsemann,Markus.Kowarschik}@cs.fau.de
1 Scope

In an ideal situation, all performance optimization of computationally intensive software would take place automatically, allowing the researchers to concentrate on the development of more efficient algorithms (in terms of computational complexity) rather than having to worry about performance. However, for the time being, optimizing compilers are unable to synthesize long chains of complicated code transformations to optimize program execution. As a consequence, the need to identify and to remove the performance bottlenecks of computationally intensive codes remains. As an example of a class of computationally intensive problems, this minisymposium concentrated on the numerical solution of partial differential equations (PDEs). As with every computer program, the run times of PDE solvers depend both on the algorithms and on the data structures used in the implementations. In the context of numerical PDEs, algorithms with optimal asymptotic complexity are known for certain types of problems; e.g., multigrid methods for elliptic problems. In those cases where the optimal algorithms are applicable, only the data structures and the implementation details offer scope for improvement. The objective of this minisymposium was to bring together researchers who are developing efficient PDE algorithms and implementations as well as researchers focusing on performance evaluation tools to predict, guide, and quantify the development of efficient numerical software. Hardware-efficient codes in high performance simulation must follow two fundamental design goals: parallelism and locality. The idea must be to use all parallel processing units as efficiently as possible and, in addition, to optimize the performance on each individual node in the parallel environment. In particular, this requires extensive tuning for memory hierarchies, i.e., caches.
2 Outline

As was intended by the organizers, the contributions to the minisymposium cover a wide spectrum of topics. Günther et al. discuss the application of space-filling curves in order to design cache-efficient multilevel algorithms. Hülsemann et al. present an approach that combines the geometric flexibility of unstructured meshes with the relatively high processing speed of regular grid structures. The paper by Johansson et al. focuses on the
analysis of the memory behavior of various PDE codes through simulation. Kowarschik et al. introduce a novel patch-adaptive and cache-aware grid processing strategy to be employed in the context of multigrid algorithms. Quinlan’s contribution concentrates on a software framework for the generation of optimizing preprocessors for C++ codes. Teresco et al. discuss load balancing issues for parallel PDE solvers. Weidendorfer et al. introduce techniques for the optimization of the cache performance of iterative solvers that are based on hardware prefetching capabilities of the underlying architecture.
A Cache-Aware Algorithm for PDEs on Hierarchical Data Structures

Frank Günther, Miriam Mehl, Markus Pögl, and Christoph Zenger

Institut für Informatik, TU München,
Boltzmannstraße 3, 85748 Garching, Germany
{guenthef,mehl,poegl,zenger}@in.tum.de
Abstract. A big challenge in implementing up-to-date simulation software for various applications is to bring together highly efficient mathematical methods on the one hand and an efficient usage of modern computer architectures on the other. We concentrate on the solution of PDEs and demonstrate how to overcome the resulting quandary between cache-efficiency and modern multilevel methods on adaptive grids. Our algorithm is based on stacks, the simplest possible and thus very cache-efficient data structures.
1 Introduction

In most implementations, competitive numerical algorithms for solving partial differential equations cause a non-negligible overhead in data access and, thus, can not exploit the high performance of processors in a satisfying way. This is mainly caused by tree structures used to store hierarchical data needed for methods like multi-grid and adaptive grid refinement. We use space-filling curves as an ordering mechanism for our grid cells and – based on this order – replace the tree structure by data structures which are processed linearly. For this, we restrict ourselves to grids associated with space-trees (allowing local refinement) and (in a certain sense) local difference stencils. In fact, the only kind of data structure used in our implementation is a fixed number of stacks. As stacks can be considered the most simple data structures used in Computer Science, allowing only the two basic operations push and pop1, data access becomes very fast – even faster than the common access of non-hierarchical data stored in matrices – and, in particular, cache misses are reduced considerably. Even the implementation of multi-grid cycles and/or higher order discretizations as well as the parallelization of the whole algorithm becomes very easy and straightforward on these data structures and doesn't worsen the cache efficiency. In the literature, space-filling curves are a well-known device to construct efficient grid partitionings for data parallel implementations of the numerical solution of partial differential equations [23,24,13,14,15,16,19]. It is also known that – due to locality properties of the curves – reordering grid cells according to the numbering induced by a space-filling curve improves cache-efficiency (see e.g. [1]). Similar benefits of reordering data along space-filling curves can also be observed for other applications like e.g. matrix transposition [5] or matrix multiplication [6]. We enhance this effect by constructing
push puts data on top of a pile and pop takes data from the top of a pile.
Fig. 1. Generating templates, first iterates and discrete curve on an adaptively refined grid for two-dimensional Hilbert- and Peano-curves
stacks for which we can even completely avoid 'jumps' in the addresses instead of only reducing their probability or frequency.
2 Space-Filling Curves and Stacks

As we look for an ordering mechanism for the cells of a multilevel adaptive rectangular grid based on a space tree, we restrict our attention to a certain class of space-filling curves2, namely recursively defined, self-similar space-filling curves with rectangular recursive decomposition of the domain. These curves are given by a simple generating template and a recursive refinement procedure which describes the (rotated or mirrored) application of the generating template in sub-cells of the domain. See figure 1 for some iterates of the two-dimensional Hilbert- and Peano-curves, which are prominent representatives of this class of curves. As can be seen, the iterate we use in a particular part of our domain depends on the local resolution. Now, these iterates of the space-filling curves associated to the given grid – also called discrete space-filling curves – and not the space-filling curve itself3 define the processing order of grid cells. To apply an operator-matrix to the vector of unknowns, we process the cells strictly in this order and perform all computations in a cell-oriented way, which means that in each cell, we evaluate only those parts of the corresponding operator that can be computed solely with the help of the cell's own information and, in particular, without addressing any information of neighbouring cells. This method is standard for finite element methods (see e.g. [4]), but can be generalized for 'local'
For general information on space-filling curves see [20]. The space-filling curve itself can be interpreted as the limit of the recursive refinement procedure and is – in contrast to the iterates – not injective.
discrete operators, which means that for the evaluation in one grid point, only direct neighbours are needed4. To get more concrete, we look in the following at a space-tree with function values located at the vertices of cells. To apply the common five-point stencil for the Laplacian operator, for example, we have to decompose the operator into four parts associated to the four neighbouring cells of a grid point5:

\Delta_h u_h\big|_{i,j} = \frac{u_{i-1,j} + u_{i,j-1} - 2u_{i,j}}{2h^2}
                        + \frac{u_{i+1,j} + u_{i,j-1} - 2u_{i,j}}{2h^2}
                        + \frac{u_{i+1,j} + u_{i,j+1} - 2u_{i,j}}{2h^2}
                        + \frac{u_{i-1,j} + u_{i,j+1} - 2u_{i,j}}{2h^2}.

Each of the four summands is the partial stencil contributed by one of the four cells adjacent to the grid point (i,j): within the respective cell it has the entries 1/2 at the two edge neighbours of (i,j) and −1 at (i,j) itself, scaled by 1/h².
Of course, in each cell, we evaluate the cell-parts of the operator values for all four vertices at once. Thus, we have to construct stacks such that all data associated to the respective cell vertices lie on top of the stacks when we enter the cell during our run along the discrete space-filling curve. We will start with a simple regular two-dimensional grid, illustrated in figure 2, and nodal data. If we follow the Peano-curve, we see that during our run through the lower part of the domain, data points on the middle line marked by 1 to 9 are processed linearly from the left to the right, and during the subsequent run through the upper part vice versa from the right to the left. Analogously, all other grid points can be organized on lines which are processed linearly forward and, afterwards, backward. For the Peano-curve and a regular grid, this can be shown to hold for arbitrarily fine grids as well. As these linearly processed lines work with only two possibilities of data access, namely push (write data at the next position in the line) and pop (read data from the actual position of the line), they correspond directly to our data stacks. We can even integrate several lines in one stack, such that it is sufficient to have two stacks: 1Dblack for the points on the right-hand side of the curve and 1Dred for the points on the left-hand side of the curve. The behavior of a point concerning read and write operations on stacks can be predicted in a deterministic6 way. We only have to classify the points as 'inner points' or 'boundary points' and respect the (local) progression of the curve.
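The cell-oriented evaluation just described can be checked on a small regular grid. The sketch below accumulates, cell by cell, the per-cell partial stencils of the decomposition above and compares the result with the usual pointwise five-point formula; it is a plain NumPy illustration of the decomposition only and does not implement the stack-based traversal.

```python
# Cell-wise accumulation of the five-point Laplacian vs. the pointwise formula.
import numpy as np

def laplacian_cellwise(u, h):
    n = u.shape[0]                      # (n x n) nodal values, (n-1)^2 cells
    out = np.zeros_like(u)
    for i in range(n - 1):              # loop over cells, lower-left vertex (i, j)
        for j in range(n - 1):
            # each cell contributes (u_nb1/2 + u_nb2/2 - u_v)/h^2 to each vertex v
            for (a, b), (c, d), (e, f) in [
                ((i, j),         (i + 1, j),     (i, j + 1)),
                ((i + 1, j),     (i, j),         (i + 1, j + 1)),
                ((i, j + 1),     (i + 1, j + 1), (i, j)),
                ((i + 1, j + 1), (i, j + 1),     (i + 1, j)),
            ]:
                out[a, b] += (0.5 * u[c, d] + 0.5 * u[e, f] - u[a, b]) / h**2
    return out

def laplacian_pointwise(u, h):
    out = np.zeros_like(u)
    out[1:-1, 1:-1] = (u[:-2, 1:-1] + u[2:, 1:-1] + u[1:-1, :-2] + u[1:-1, 2:]
                       - 4.0 * u[1:-1, 1:-1]) / h**2
    return out

n, h = 8, 1.0 / 7
u = np.random.rand(n, n)
assert np.allclose(laplacian_cellwise(u, h)[1:-1, 1:-1],
                   laplacian_pointwise(u, h)[1:-1, 1:-1])
```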
5
In this context, deterministic means depending only on locally available information, such as the local direction of the Peano-curve.
Fig. 2. Example in 2D using the Peano-curve
Fig. 3. Example for the order of hierarchical cells defined by the discrete Peano-curve
We use the Peano-curve instead of the Hilbert-curve since in 3D the Peano-curve guarantees the inversion of the processing order of grid points along a plane in the domain if we switch from the domain on one side of the plane to the domain on the other side. This property is essential for our stack concept and we could not find any Hilbert-curve fulfilling it. There are good reasons to assume that this is not possible at all. Up to this point, we have restricted our considerations to regular grids to explain the general idea of our algorithm, but the whole potential of our approach shows only when we look at adaptively refined grids and – in the general case – hierarchical data in connection with generating systems [11]. This leads to more than one degree of freedom per grid point and function. Even in this case, the space-filling curve defines a linear order of cells respecting all grid levels: the cells are visited in a top-down depth-first process reflecting the recursive definition of the curve itself (see figure 3 for an example). In our stack-context, we have to assure that even then predictable and linear data access to and from stacks is possible. In particular, we have to assure that grid points of different levels lie on the stacks in the correct order and do not "block" each other. We could show that it is sufficient to use four colors representing four different stacks of the same type and a second type of stack, called 0D stack. Thus, we end up with a still fixed – and in particular independent of the grid resolution – number of eight stacks (for details see [10]). To be able to process data on a domain several times – as needed for each iterative solver – without loss of efficiency, we write all points to a so-called 2D- or plain-stack as soon as they are 'ready'. It can easily be seen that, if we process the grid cells in the opposite direction in the next iteration, the order of data within this 2D-stack enables us to pop all grid points from this 2D-stack as soon as they are 'touched' for the first time.
In this way, we can use the 2D-stack as input stack very efficiently. We apply this repeatedly and, thus, change the processing direction of grid cells after each iteration. An efficient algorithm derived from the concepts described above passes the grid cell-by-cell, pushes/pops data to/from stacks deterministically and automatically, and will be cache-aware by concept (not by optimization!)7 because linear stacks fit well with the prefetching techniques of modern processors. An additional gain of using our concept is the minimal storage cost for administrative data. We can do completely without any pointers to neighbours and/or father and son cells. Instead, we only have to store one bit for refinement information and one bit for geometrical information (inside or outside the computational domain) per cell.
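The input/output role of the plain stack can be illustrated with a deliberately tiny toy: one sweep pushes every value it has finished with onto an output stack, and the following sweep, which traverses the points in the opposite direction, pops exactly the value it needs next. The one-dimensional example below only shows this reversal mechanism; it is not the eight-stack scheme of the paper.

```python
# Toy illustration of the 2D-/plain-stack idea: push in one sweep, pop in the
# reversed sweep, so no random access into the previous iterate is needed.
def sweep(values_in_traversal_order, update):
    output_stack = []
    for v in values_in_traversal_order:
        output_stack.append(update(v))          # "push" when the point is ready
    return output_stack

points = [1.0, 2.0, 3.0, 4.0]                   # processed left to right in sweep 1
stack = sweep(points, lambda v: v + 10.0)

# Sweep 2 traverses right to left: popping yields the values in exactly that order.
second_input = [stack.pop() for _ in range(len(stack))]
assert second_input == [14.0, 13.0, 12.0, 11.0]
```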
3 Results

To point out the potential of our algorithm, we show some first results for simple examples with the focus on the efficiency concerning storage requirements, processing time and cache behavior. Note that up to now we work with an experimental code which is not optimized at all yet. Thus, absolute values like computing time will be improved further, and our focus here is only on the cache behavior and the qualitative dependencies between the number of unknowns and the performance values like computing times.

3.1 Two-Dimensional Poisson Equation

As a first test, we solve the two-dimensional Poisson equation on the unit square and on a two-dimensional disc with homogeneous Dirichlet boundary conditions:

\Delta u(x) = -2\pi^2 \sin(\pi x)\,\sin(\pi y), \quad \forall\, x = (x, y)^T \in \Omega,        (3.1)
        u(x) = \sin(\pi x) \cdot \sin(\pi y), \quad \forall\, x \in \partial\Omega.        (3.2)
The exact solution of this problem is given by u(x) = sin(πx) · sin(πy). To discretize the Laplacian operator, we use the common Finite Element stencil arising from the use of bilinear basis functions. The resulting system of linear equations was solved by an additive multi-grid method with bilinear interpolation and full-weighting as restriction operator. As criterion for termination of the iteration loop we took a value of r_max = 10^{-5}, where r_max is the maximum (in magnitude) of all corrections over all levels. On the unit square, we used regular grids with growing resolution. On the disc, we used different adaptive grids gained by local coarsening strategies starting from an initial grid with 729 × 729 cells. To get sufficient accuracy near the boundary, we did not allow any coarsening of boundary cells. Tables 1 and 2 show performance values obtained on a dual Intel XEON 2.4 GHz with 4 GB of RAM. Note that the number of iterations until convergence is in both cases independent of the resolution, as one would expect for a multi-grid method. The analysis of cache misses and cache hits on the level 2 cache gives a good hint of the high efficiency of our method. But for a more realistic judgement of the efficiency, the very good relation
Algorithms which are cache-aware by concept, without detailed knowledge of the cache parameters, are also called cache-oblivious [18,9,8].
Table 1. Performance values for the two-dimensional Poisson equation on the unit square solved on a dual Intel XEON 2.4 GHz with 4 GB of RAM

resolution      iterations   L2 cache misses per it.   rate
 243 × 243          39               134,432           1.09
 729 × 729          39             1,246,949           1.12
2187 × 2187         39            11,278,058           1.12
Table 2. Performance values for the two-dimensional Poisson equation on a disk solved on a dual Intel XEON 2.4 GHz with 4 GB of RAM

variables   % of full grid   L2-hit-rate   iter   cost per variable
  66220        15.094%         99.28%       287       0.002088
 101146        23.055%         99.26%       287       0.002031
 156316        35.630%         99.24%       286       0.001967
 351486        80.116%         99.11%       280       0.001822
 438719       100.00%          99.16%       281       0.002030
between the number of actual cache misses and the minimal number of necessary cache misses caused by the unavoidable first loading of data to the cache is more significant: with the size s of a stack element, the number n of degrees of freedom and the size cl of an L2 cache line on the used architecture, we can estimate a minimum cm_min = n·s/cl of cache misses per iteration, which has to occur if we read each grid point once per iteration, producing s/cl cache misses per point. In fact, grid points are typically used four times in our algorithm as well as in most FEM algorithms. Thus, this minimum estimate can be assumed to be even too low. 'Rate' in table 1 is defined as cm_real/cm_min, where cm_real are the L2 cache misses per iteration simulated with calltree [25]. As this rate is nearly one in our algorithm, we can conclude that we produce hardly any needless cache misses at all. In addition, we get a measured L2-hit-rate of at least 99.13%, using hardware performance counters [26] on an Intel XEON, which is a very high value for 'real' algorithms beyond simple programs for testing the performance of a given architecture. In table 2, the column 'cost per variable' shows the amount of time needed per variable and per iteration. These values are nearly constant. Thus, we see that we have a linear dependency between the number of variables and the computing time, no matter if we have a regular full grid or an adaptively refined grid. This is a remarkable result, as we can conclude from this correlation that, in our method, adaptivity doesn't cause any additional costs.

3.2 Three-Dimensional Poisson Equation

In the previous sections, we only considered the
Fig. 4. Representation of the star-shaped domain with the help of adaptive grids gained by a geometry oriented coarsening of a 243 × 243 × 243 grid
two-dimensional case, but to achieve an efficient implementation, we introduced some changes and/or enhancements in the concrete realization [17]. The first – and obvious – change is that we need 3D in- and output stacks and 2D, 1D and 0D stacks during our run through the space tree instead of 1D and 0D stacks only. In addition, we use 8 colors for the 0D stacks, 12 colors for the 1D stacks, and 6 colors for the 2D stacks. Another interesting aspect in 3D is that we replace the direct refinement of a cell by the introduction of 27 sub-cells by a dimension-recursive refinement: a cell is cut into three "plates" in a first step, each of these plates is cut into three "bars", and, finally, each bar into three "cubes". This reduces the number of different cases dramatically and can be generalized to arbitrary dimensions, too. We solve the three-dimensional Poisson equation

\Delta u(x) = 1        (3.3)
on a star-shaped domain (see figure 4) with homogeneous boundary conditions. The Laplace operator is discretized by the common Finite Difference stencil and – analogously to the 2D case – we use an additive multi-grid method with trilinear interpolation and full-weighting as restriction operator. The termination criterion was |r_max| ≤ 10^{-5}. Table 3 shows performance values obtained on a dual Intel XEON 2.4 GHz with 4 GB of RAM for adaptively refined grids. As can be seen, all important performance properties like multi-grid performance and cache efficiency carry over from the 2D case. The rate between the number of real cache misses and the minimal number of expected cache misses was measured for a different example (Poisson equation in a sphere) and also stays in the range of 1.10. Thus, even in 3D, the essential part of the cache misses is really necessary and cannot be avoided by any algorithm.
Table 3. Performance values for the three-dimensional Poisson equation on a star-shaped domain solved on a dual Intel XEON 2.4 GHz with 4 GB of RAM

max. resolution     variables    storage req.   L2-hit-rate   iter   cost per variable
 81 ×  81 ×  81        18,752       0.7 MB         99.99%       96        0.0998
243 × 243 × 243       508,528         9 MB         99.99%      103        0.0459
729 × 729 × 729    13,775,328       230 MB         99.98%      103        0.0448
4 Conclusion

In this paper we presented a method combining important mathematical features for the numerical solution of partial differential equations – adaptivity, multi-grid, geometrical flexibility – with a very cache-efficient way of data organization tailored to modern computer architectures and suitable for general discretized systems of partial differential equations (yet to be implemented). The parallelization of the method follows step by step the approach of Zumbusch [23], who has already successfully used space-filling curves for the parallelization of PDE solvers. The generalization to systems of PDEs is also straightforward. With an experimental version of our programs we have already solved the non-stationary Navier-Stokes equations in two dimensions, but the work is still in progress and shall be published in the near future. The same holds for detailed performance results for various computer architectures.
References

1. M.J. Aftosmis, M.J. Berger, and G. Adomavivius. A Parallel Multilevel Method for Adaptively Refined Cartesian Grids with Embedded Boundaries. AIAA Paper, 2000.
2. R.A. Brualdi and B.L. Shader. On sign-nonsingular matrices and the conversion of the permanent into the determinant. In P. Gritzmann and B. Sturmfels, editors, Applied Geometry and Discrete Mathematics, The Victor Klee Festschrift, pages 117–134, Providence, RI, 1991. American Mathematical Society.
3. F.A. Bornemann. An adaptive multilevel approach to parabolic equations III: 2D error estimation and multilevel preconditioning. IMPACT Computational Science and Engineering, 4:1–45, 1992.
4. Braess. Finite Elements. Theory, Fast Solvers and Applications in Solid Mechanics. Cambridge University Press, 2001.
5. S. Chatterjee and S. Sen. Cache-Efficient Matrix Transposition. In Proceedings of HPCA-6, pages 195–205, Toulouse, France, January 2000.
6. S. Chatterjee, A.R. Lebeck, P.K. Patnala, and M. Thottethodi. Recursive array layouts and fast parallel matrix multiplication. In Proceedings of the Eleventh Annual ACM Symposium on Parallel Algorithms and Architectures, pages 222–231, Saint-Malo, France, 1999.
7. W. Clarke. Key-based parallel adaptive refinement for FEM. Bachelor thesis, Australian National Univ., Dept. of Engineering, 1996.
8. E.D. Demaine. Cache-Oblivious Algorithms and Data Structures. In Lecture Notes from the EEF Summer School on Massive Data Sets, Lecture Notes in Computer Science, BRICS, University of Aarhus, Denmark, June 27–July 1, 2002, to appear.
9. M. Frigo, C.E. Leiserson, H. Prokop, and S. Ramachandran. Cache-oblivious algorithms. In Proceedings of the 40th Annual Symposium on Foundations of Computer Science, pages 285–297, New York, October 1999.
10. F. Günther. Eine cache-optimale Implementierung der Finite-Elemente-Methode. Doctoral thesis, Institut für Informatik, TU München, 2004.
11. M. Griebel. Multilevelverfahren als Iterationsmethoden über Erzeugendensystemen. Habilitationsschrift, TU München, 1993.
12. M. Griebel, S. Knapek, G. Zumbusch, and A. Caglar. Numerische Simulation in der Moleküldynamik. Numerik, Algorithmen, Parallelisierung, Anwendungen. Springer, Berlin, Heidelberg, 2004.
13. M. Griebel and G.W. Zumbusch. Parallel multigrid in an adaptive PDE solver based on hashing and space-filling curves. Parallel Computing, 25:827–843, 1999.
14. M. Griebel and G. Zumbusch. Hash based adaptive parallel multilevel methods with space-filling curves. In H. Rollnik and D. Wolf, editors, NIC Series, 9:479–492, Germany, 2002. Forschungszentrum Jülich.
15. J.T. Oden, A. Patra, and Y. Feng. Domain decomposition for adaptive hp finite element methods. In D.E. Keyes and J. Xu, editors, Domain decomposition methods in scientific and engineering computing, proceedings of the 7th int. conf. on domain decomposition, vol. 180 of Contemp. Math., pages 203–214, 1994, Pennsylvania State University.
16. A.K. Patra, J. Long, and A. Laszloff. Efficient Parallel Adaptive Finite Element Methods Using Self-Scheduling Data and Computations. HiPC, pages 359–363, 1999.
17. M. Pögl. Entwicklung eines cache-optimalen 3D Finite-Element-Verfahrens für große Probleme. Doctoral thesis, Institut für Informatik, TU München, 2004.
18. H. Prokop. Cache-Oblivious Algorithms. Master thesis, Massachusetts Institute of Technology, 1999.
19. S. Roberts, S. Kalyanasundaram, M. Cardew-Hall, and W. Clarke. A key based parallel adaptive refinement technique for finite element methods. In B.J. Noye, M.D. Teubner, and A.W. Gill, editors, Proc. Computational Techniques and Applications: CTAC '97, pages 577–584, World Scientific, Singapore, 1998.
20. H. Sagan. Space-Filling Curves. Springer-Verlag, New York, 1994.
21. R.J. Stevens, A.F. Lehar, and F.H. Preston. Manipulation and Presentation of Multidimensional Image Data Using the Peano Scan. IEEE Trans. Pattern Anal. and Machine Intelligence, Vol. PAMI-5, pages 520–526, 1983.
22. L. Velho and J. de Miranda Gomes. Digital Halftoning with Space-Filling Curves. Computer Graphics, 25:81–90, 1991.
23. G.W. Zumbusch. Adaptive Parallel Multilevel Methods for Partial Differential Equations. Habilitationsschrift, Universität Bonn, 2001.
24. G.W. Zumbusch. On the quality of space-filling curve induced partitions. Z. Angew. Math. Mech., 81:25–28, 2001. Suppl. 1, also as report SFB 256, University Bonn, no. 674, 2000.
25. J. Weidendorfer, M. Kowarschik, and C. Trinitis. A Tool Suite for Simulation Based Analysis of Memory Access Behavior. In Proceedings of the 2004 International Conference on Computational Science, Krakow, Poland, June 2004, vol. 3038 of Lecture Notes in Computer Science (LNCS), Springer.
26. http://user.it.uu.se/~mikpe/linux/perfctr/
Constructing Flexible, Yet Run Time Efficient PDE Solvers

Frank Hülsemann and Benjamin Bergen

System Simulation Group, Computer Science Department,
Friedrich-Alexander-University Erlangen-Nuremberg, Germany
{frank.huelsemann,ben.bergen}@cs.fau.de
Abstract. Amongst other properties, PDE solvers for large scale problems should be flexible, as they are time consuming to write, and obviously run time efficient. We report on the experiences with a regularity centred approach for grid based PDE software that aims to combine geometric flexibility with run time efficiency. An unstructured coarse grid that describes the problem geometry is repeatedly subdivided in a regular fashion to yield a hierarchy of grids on which the approximation is sought. By construction, the grid hierarchy is well suited for multilevel methods. The gain in run time performance that results from the exploitation of the patch wise regularity of the refined grids over standard implementations will be illustrated.
1 Introduction

Two desirable properties of PDE solvers, flexibility and run time efficiency, are often seen as difficult to combine. In this article we discuss the optimisation potential for the implementation that is inherent to sparse linear system solvers on regularly refined, unstructured grids which commonly arise in geometric multigrid methods. Although the developed techniques apply to any two or higher dimensional problem, we concentrate on two and three spatial dimensions. The run time behaviour of a PDE solver for a given problem size and a given hardware platform depends on the algorithmic complexity of the solution method employed and on its implementation. Therefore, in order to be fast, an efficient PDE solver will have to combine an optimal algorithm with an appropriate implementation. For our discussion, we concentrate on geometric multigrid methods as solvers, which are known to be of optimal linear algorithmic complexity for certain second order, elliptic boundary value problems [4], [7]. The type of flexibility that we are interested in concerns the use of unstructured grids. In this work, we do not consider other, undoubtedly desirable, aspects of solver flexibility such as the availability of different approximation schemes, for example higher order basis functions or staggered grids to name but two, nor do we cover adaptive methods. Our main concern is to improve the floating point performance of geometric multigrid methods on unstructured grids. For this purpose, it suffices to consider cases for which (standard) geometric multigrid methods are applicable. Hence we do not consider the
This work has been supported by a KONWIHR grant of the Bavarian High Performance Computing Initiative.
question of how to construct robust multigrid schemes on unstructured grids with high aspect ratios or for problems with strongly varying coefficients. The article is structured as follows. First, we give reasons why repeated refinement is a common approach in the implementation of geometric multigrid methods on unstructured grids. Then, in Sect. 3, we identify the different types of regularity in the discretisation that can be exploited. After that, we present the data structures that capture the regular properties in Sect. 4. Section 5 assesses the cost of the data synchronisation that is inherent to our approach. Following that, we present performance data from numerical experiments before we conclude the paper.
2 Geometric Multigrid on Unstructured Grids

The common approach to generating the hierarchy of nested approximation spaces needed for standard geometric multigrid on unstructured grids consists in carrying out several steps of repeated regular refinement. The reason for the popularity of this approach lies in the complexity and difficulty of any alternative. Coarsening a given unstructured grid so that the resulting approximation spaces are nested is often simply impossible. Hence, one would have to work on non-nested spaces. Multigrid methods for such a setting are much more complicated than those for their nested counterparts. In summary, although it is possible to construct geometric multigrid methods on hierarchies of repeatedly coarsened unstructured grids, this option is rarely chosen in practice. Hence we turn our attention to the standard approach of repeated refinement. The basic assumption underlying this strategy is that the spatial resolution of the unstructured input grid is not high enough in any section of the grid (anywhere) to yield a sufficiently accurate solution. In such a case, some global refinement is indeed necessary. While this assumption excludes the cases in which the geometry is so complex that the resolution of the domain geometry is already sufficient for the accuracy requirements, it still includes various challenging applications that require large scale computations. Applications with high accuracy requirements on comparatively simple domains arise for example in the simulation of plasma in particle accelerators, density functional computations in molecular dynamics and the simulation of wave propagation in high-energy ultrasound.
3 Identifying Regularity on Globally Unstructured Grids

We illustrate the main concepts of our approach using a simplified problem domain that arose in an electromagnetic field problem. Consider the setting in Fig. 1. For ease of presentation, we assume a Poisson problem

-\Delta u = f  \quad \text{on } \Omega,
        u = 0  \quad \text{on } \partial\Omega,

on the problem domain Ω. The extension of the implementation techniques to other settings is straightforward.
Fig. 1. Non-destructive testing example, from left to right: I) the problem domain with a) the coil that generates a magnetic field, b) air surrounding the coil and c) the material to be tested. The material parameters (magnetic permeability) of the three different components differ from each other but are assumed to be constant within each component (homogeneous materials). II) the coarse, unstructured grid that represents the problem geometry and III) the grid after two global regular subdivision steps
For the representation of the discrete operator on the grid in Fig. 1 II), it is appropriate to employ sparse matrix storage schemes. For a description of various sparse matrix formats and their implementation see [1], for example. Returning to the example, we note that the grid consists of different cell geometries (triangles and quadrilaterals), the number of neighbours is not constant for all vertices and the material parameters vary. However, the situation on the grid in Fig. 1 III) is different. Within each cell of the unstructured input grid the repeated refinement has resulted in regular patches. In the interior of each coarse grid cell the grid is regular and the material parameter is constant. This implies that one stencil suffices to represent the discrete operator inside such a regular region1. Put differently, if the stiffness matrix was set up for the refined grid, then, assuming an appropriate numbering of the unknowns, the row entries belonging to each patch would display a banded structure with constant matrix coefficients. From this observation it is clear that once the patches contain sufficiently many points it will not be desirable to ever set up a sparse matrix representation of the discrete operator. The unstructured and hybrid nature of the coarse input grid shows itself in the fine grid representation of the discrete operator at the vertices and at the points on the edges of the coarse grid. The vertices of the input grid have to be treated in a standard, unstructured manner, because there is no regularity to be exploited at these points. In two spatial dimensions, the fine grid points that resulted from the refinement of an edge of the input grid share a common stencil. The stencil shape and the stencil entries can vary from edge to edge, depending on the geometric shapes of the cells that share the edge, but along one edge, both shape and entries are constant. In three spatial dimensions, this 1
In the description of the grid structure, we use the terms (regular) patch and region interchangeably.
observation holds for the faces whereas the (points on the) edges have to be treated in an unstructured way. In summary, within each patch in Fig. 1 III) we have

– a constant stencil shape (banded matrix structure) for the discrete operator,
– constant material parameters and therefore
– constant stencil entries.

Next we will describe the data structures that we propose to exploit the regular structure of the operator within each patch to improve the floating point performance of the geometric multigrid implementation on unstructured input grids and give parallel performance results on different architectures. Further descriptions of the approach can be found in [2], [3], [5].
4 Data Structures for Patch Wise Regularity As pointed out before, a general sparse matrix storage format is not suitable for the adequate representation of the patch wise structure of the discrete operator on a highly refined grid. The main idea of our approach is to turn operations on the refined, globally unstructured grid into a collection of operations on block structured parts where possible and resort to unstructured operations only where necessary. As described above, in two spatial dimensions only the vertices of the input grid require inherently unstructured operations, while each edge and each computational cell of the input grid constitute a block structured region. For the remainder of the paper, we concentrate on the case where in each block structured part, the stencil shape and the stencil entries are constant, but the shape and the entries can of course differ between parts. The storage scheme is illustrated in Fig. 2. Each block structured region provides storage for the unknowns and the discrete operators in its interior. In addition to the memory for the degrees of freedom of one problem variable, the regular region also provides buffers for information from neighbouring objects, which can be other regular regions or unstructured regions and points. When applying a discrete operator to a variable, this rim of buffers eliminates the need to treat the unknowns close to the boundary, where the stencil extends into neighbouring objects, any different from those in the inside. In other words, the number of points to which a constant stencil can be applied in the same manner is maximised. We mentioned before that in the regular regions, the discrete operator can be expressed by stencils with constant shape. In some cases, the entries of the stencil are also constant over the object. Inside such a region, the representation of the operator can be reduced to one single stencil. Both properties, constant stencil shape and constant stencil entries, help to improve the performance of operations involving the discrete operator. The scale of the improvement depends on the computer architecture. First, we turn to the constant shape property. On general unstructured grids, the discrete operator is usually stored in one of the many sparse matrix formats. If one uses such general formats in operations such as the matrix-vector product or a Gauß-Seidel iteration, one usually has to resort to indirect indexing to access the required entries in the vector. Being able to represent the discrete operator in the regular regions by
Fig. 2. Data layout for stencil based operations: Assuming a vertex based discretisation with Dirichlet boundary conditions on the top grid, the unknowns are stored in three different memory segments to allow stencil operations. Full circles indicate unknowns, hollow circles indicate buffered values from neighbouring objects (ghost cells)
stencils with fixed shapes, we can express the memory access pattern for the operation explicitly through index arithmetic and thus enable the compiler, or indeed hardware mechanisms like prefetch units, to analyse and optimise the memory accesses better. In the sparse matrix case, the memory access pattern is only known at run time; in our setting it is known at compile time, at least for a significant subset of all points. The size of this subset will be discussed in the section on data synchronisation costs. On the Hitachi SR8000 supercomputer at the Leibniz computing centre in Munich, this change from indirect indexing to index arithmetic improves the MFLOP/s performance of a Gauß-Seidel iteration on a single processor from around 50 MFLOP/s for a CRS (compressed row storage) implementation to 300 MFLOP/s for the stencil based computation. These values were obtained with a 27 point finite element discretisation inside a refined hexahedron on a processor with a theoretical peak performance of 1500 MFLOP/s. On this high memory bandwidth architecture, the explicit knowledge about the structure of the operations results in a six-fold performance improvement. Many cache based architectures do not offer such a high bandwidth to main memory. Given that a new stencil has to be fetched for each unknown, the memory traffic to access the stencil values slows down the computations on these machines. As an example for such machines, we consider an Intel Pentium 4 processor with 2.4 GHz clock speed, 533 MHz (effective) front side bus and dual channel memory access. Our CRS based Gauß-Seidel iteration runs at 190 MFLOP/s on a machine with a theoretical peak performance of 4800 MFLOP/s². With the fixed stencil shape implementation, we observe a performance between 470 and 490 MFLOP/s, depending on the problem size, which is 2.5 times more than for the
² The results on the PC architecture were obtained with version 3.3.3 of the GNU compiler collection.
standard unstructured one. In all cases, the problem size did not fit into the caches on the machines. In the case of constant stencil entries for an element, the advantage of the structured implementation is even clearer. Instead of fetching 27 stencil values from memory for each unknown, the same stencil is applied to all unknowns in the element. This obviously reduces the amount of memory traffic significantly. On the Hitachi SR8000, such a constant coefficient Gauß-Seidel iteration achieves 884 (out of 1500) MFLOP/s on a hexahedron with 199³ unknowns. Compared to the standard unstructured implementation, this amounts to a speed up factor of 17.5. Again, on the commodity architecture Pentium 4, the improvement is less impressive. For a problem of the same size, the constant coefficient iteration runs at 1045 MFLOP/s, which is 5.5 times faster than its unstructured counterpart. These numbers make it clear that exploiting the substructure does improve the run time performance inside the regular regions. However, in general, the global grid consists of more than one regular block, which introduces the need for data synchronisation between these blocks. In the next section, we consider the computational cost of the communication between blocks and indicate how many levels of refinement are necessary for the proposed data structures to pay off.
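To make the contrast discussed above concrete, the following sketch juxtaposes one Gauss-Seidel sweep in CRS form with one sweep driven by a single constant 27-point stencil on a regular patch. It is an illustrative reconstruction, not the authors' code; all function and parameter names are hypothetical, and the stencil ordering (dk, dj, di lexicographic) is an assumption.

#include <stddef.h>

/* CRS variant: every matrix entry is accessed through the index array col[],
 * so the access pattern into x[] is only known at run time. */
void gs_crs(size_t n, const size_t *row_ptr, const size_t *col,
            const double *val, const double *b, double *x)
{
    for (size_t i = 0; i < n; ++i) {
        double sum = b[i], diag = 0.0;
        for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; ++k) {
            if (col[k] == i) diag = val[k];
            else             sum -= val[k] * x[col[k]];  /* indirect indexing */
        }
        x[i] = sum / diag;
    }
}

/* Constant-stencil variant: the 27 coefficients are fixed for the whole patch
 * and neighbour addresses follow from index arithmetic alone, which the
 * compiler (and prefetch hardware) can analyse at compile time. */
void gs_stencil(int nx, int ny, int nz, const double stencil[27],
                const double *b, double *x)
{
    for (int k = 1; k < nz - 1; ++k)
        for (int j = 1; j < ny - 1; ++j)
            for (int i = 1; i < nx - 1; ++i) {
                int centre = (k * ny + j) * nx + i;
                double sum = b[centre], diag = 0.0;
                int s = 0;
                for (int dk = -1; dk <= 1; ++dk)
                    for (int dj = -1; dj <= 1; ++dj)
                        for (int di = -1; di <= 1; ++di, ++s) {
                            int idx = ((k + dk) * ny + (j + dj)) * nx + (i + di);
                            if (idx == centre) { diag = stencil[s]; continue; }
                            sum -= stencil[s] * x[idx];  /* index arithmetic only */
                        }
                x[centre] = sum / diag;
            }
}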
5 Cost of Data Synchronisation

In the previous section it has been shown that the run time performance inside the regular regions improves when the structure of the computations is taken into account. However, when comparing the performance of a structured implementation with its unstructured counterpart, one also has to take the computational cost for the data synchronisation into account. The computations inside a block can take place once the current boundary values have been stored in the rim buffers. These boundary values have to be collected from the neighbouring grid objects. For a hexahedron, this means that it has to receive the current values from its six neighbouring faces, its 12 edges and eight corners. Furthermore, after the values of the unknowns inside the hexahedron have been updated, the neighbouring objects have to be informed about the changes, which results again in data exchange operations. It is reasonable to assume that the time requirements for the data synchronisation within one process are linearly proportional to the number of values that have to be copied back and forth. Consequently, we consider the ratio between the number of points inside a hexahedron, for which the computations can be carried out in the optimised manner described above, and the number of points on the hexahedron boundary, which have to be exchanged. Only when the computations inside the regular region take sufficiently longer than the communication of the boundary values can the performance on the proposed data structures be higher than on the standard unstructured ones. In Table 1 we give the numbers for the different point classes for a repeatedly refined hexahedron and the ratio of the regular points (Nr) over all points (N). The refinement rule consists in subdividing a cube into eight sub-cubes of equal volume. The refinement level 0 corresponds to the initial cube; level l, l ≥ 1, is the result of subdividing all cubes in level l − 1 according to the rule above.
Table 1. Point numbers in a refined hexahedron (cube): l denotes the refinement level, Nr the number of points in the interior of the refined cube, N the total number of points in the cube including the boundary, and r = Nr/N the ratio between regular points and all points

    l        Nr           N          r
    0         0           8          0
    1         1          27          0.037
    2        27         125          0.216
    3       343         729          0.471
    4      3375        4913          0.687
    5     29791       35937          0.829
    6    250047      274625          0.911
    7   2048383     2146689          0.954
    8  16581375    16974593          0.977
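The entries of Table 1 follow directly from the refinement rule: after l subdivisions a cube edge carries 2^l + 1 points, so N = (2^l + 1)³ and Nr = (2^l − 1)³. A small self-contained program (names chosen for illustration only) reproduces the table:

#include <stdio.h>

int main(void)
{
    printf("%2s %12s %12s %8s\n", "l", "Nr", "N", "r");
    for (int l = 0; l <= 8; ++l) {
        long long e  = 1LL << l;                      /* sub-edges per direction */
        long long N  = (e + 1) * (e + 1) * (e + 1);   /* all points              */
        long long Nr = (e - 1) * (e - 1) * (e - 1);   /* interior (regular) pts  */
        printf("%2d %12lld %12lld %8.3f\n", l, Nr, N, (double)Nr / (double)N);
    }
    return 0;
}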
We introduce some notation in order to estimate the overall performance on one cube with the communication taken into account. We assume that all N points can be updated with the same number of floating point operations, denoted by f. If the standard, unstructured implementation is employed to update the boundary points, we observe a floating point performance of Pcrs MFLOP/s on these N − Nr = (1 − r)N points. In the regular region, comprising Nr points, the computations are carried out with Preg MFLOP/s. The value s = Preg/Pcrs is the speed up factor in the observed floating point performance. It turns out to be useful to express the communication time in terms of floating point operations that could have been performed in the same time. By m we denote the number of floating point operations that can be carried out at an execution speed of Preg MFLOP/s in the same period of time that it takes to copy one point value to or away from the cube. In total, there are (1 − r)N points on the boundary and for simplicity we assume that the number of values to be copied from the hexahedron to its neighbours is also (1 − r)N, which is an upper bound on the actual number. Thus we arrive at 2(1 − r)N copy operations, each one taking m/Preg seconds, resulting in a communication time tsyn of

    tsyn = 2(1 − r)N · m / Preg   seconds.

The overall floating point performance is now computed by dividing the number of floating point operations N·f by the time it takes to perform the computations. The overall time ttotal is given by

    ttotal = treg + tcrs + tsyn
           = Nr·f/Preg + (1 − r)N·f/Pcrs + 2(1 − r)N·m/Preg
           = r·N·f/Preg + s(1 − r)N·f/Preg + 2(1 − r)N·(m/f)·f/Preg
           = [ r + s(1 − r) + 2(m/f)(1 − r) ] · N·f/Preg
Hence we obtain for the overall performance Ptotal:

    Ptotal = N·f / ttotal
           = Preg / [ r + s(1 − r) + 2(m/f)(1 − r) ]
           = Preg / [ 1 + ( (s − 1) + 2(m/f) ) (1 − r) ]
Even when ignoring the communication (m=0), the expression for Ptotal shows that with r → 1, the denominator tends to one and the overall performance approaches the performance of the structured implementation. However, for a given level of refinement, it depends on the observed values for s and m whether the structured implementation is faster than the unstructured one. In the next section, we illustrate that we do achieve an acceleration.
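The break-even behaviour of the model can be explored numerically. The sketch below evaluates Ptotal over the refinement levels of Table 1; the values chosen for Preg, s and m/f are placeholders picked only to illustrate the trend, not measured data.

#include <stdio.h>

int main(void)
{
    const double P_reg = 300.0; /* MFLOP/s inside regular regions (assumed)       */
    const double s     = 6.0;   /* P_reg / P_crs (assumed)                        */
    const double m_f   = 0.5;   /* copy cost per point in units of f (assumed)    */

    for (int l = 1; l <= 8; ++l) {
        long long e = 1LL << l;
        double N    = (double)(e + 1) * (e + 1) * (e + 1);
        double Nr   = (double)(e - 1) * (e - 1) * (e - 1);
        double r    = Nr / N;
        double P    = P_reg / (1.0 + ((s - 1.0) + 2.0 * m_f) * (1.0 - r));
        printf("l=%d  r=%.3f  P_total=%.1f MFLOP/s\n", l, r, P);
    }
    return 0;
}

As r approaches 1 with increasing refinement level, Ptotal approaches Preg, which is the point the analysis above makes analytically.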
6 Numerical Results The stencil based representation of the discrete operator together with the block wise storage scheme for the unknown variables on the grid can be employed in all iterative solution algorithms for the equivalent linear system, that involve the application of an operator to a variable, more commonly known as matrix-vector product. We are particularly interested in geometric multigrid methods due to their optimal asymptotic complexity. Therefore we have chosen the refinement rule for our implementation in such a way that the resulting grids are nested. Our main concern in this section is the floating point performance of the resulting program. As a consequence, we choose a test problem for which a standard multigrid algorithm is appropriate. Given that multigrid methods require computations on different levels, not only the finest one, they present a harder test case than single level Krylov methods for instance, that only involve operator applications on the finest and for our data structures most favourable level. Our test case is a three dimensional Poisson problem with Dirichlet boundary conditions on L-shaped domains. The input grid is refined seven times for the scale up experiment and six times for the speed up computations. The meshsize in the different grids for the scale up experiment is constant. The domain is extended in one direction by the necessary number of equally refined unit cubes in order to fill the available processors. The algorithmic components of the full multigrid method (FMG) are essentially standard: We employ a row-wise red-black Gauss-Seidel smoother, which updates the unknowns in a row in a red-black ordering thus breaking the data dependencies between two neighbouring points in a lexicographic numbering, and combine it with full weighting and trilinear interpolation. On each level in the multigrid hierarchy we perform two V(2,2) cycles before prolongating the obtained approximation to the next refinement level. Trilinear finite elements result in a 27 point stencil in the interior of the domain. The results in Fig. 3, see also [6], underline that the proposed regularity oriented data structures result in a program that shows both good scale up and good speed up behaviour. Please note that the solution time for the largest problem involving more
(a)
    CPU    Dof ×10^6    Time (s)
     64      1179.48       44
    128      2359.74       44
    256      4719.47       44
    512      9438.94       45
    550     10139.49       48

(b) [speedup plot: observed versus linear speedup over the number of processes]
Fig. 3. Parallel performance results in 3D: (a) Scalability experiments using a Poisson problem with Dirichlet boundary conditions. Each partition in the scalability experiments consists of nine cubes, each of which is regularly subdivided seven times. The timing results given refer to the wall clock time for the solution of the linear system using a full multigrid solver. (b) Speedup results for the same Poisson problem. The problem domain consisted of 128 cubes, each of which was subdivided six times. The problem domain was distributed to 2, 4, 8, 16, 32 and 64 processes
than 10^10 unknowns is less than 50 seconds on 550 processors. This demonstrates that efficient hierarchical algorithms, in this case full multigrid, in combination with "hardware-aware" (or: architecture-friendly) data structures are capable of dealing with large scale problems in an acceptable amount of time. Due to the memory requirements of the matrix storage, it would not have been possible to treat a problem of that size even on the whole machine with a sparse matrix based scheme. At the end of this section, we come back to the question of whether this implementation is actually faster than an unstructured one. Profiling results indicate that the processes in the scale up experiment spend slightly more than 30% of their time in the Gauß-Seidel routine for the regular regions inside the cubes. The speed up factor for this routine is roughly 14 (more than 700 MFLOP/s vs. less than 50 MFLOP/s). In other words, if only this one routine was implemented in a standard, unstructured manner, the run time of the single routine would be more than four times the run time for the whole, structured program. On this architecture, our proposed approach is successful in reducing the overall solution time.
7 Conclusion In this article, we have shown that repeatedly refined, unstructured grids, as they commonly arise in geometric multigrid methods, result in different regular features in the discretisation. We have presented data structures that capture these regularities and we have illustrated that operations on these data structures achieve a higher floating point performance than their commonly used unstructured counterparts. The impact of necessary data synchronisation on the overall performance was considered. By providing performance data from numerical experiments we have demonstrated that the proposed
data structures are well suited for large scale, parallel computations with efficient individual processes.
References

1. R. Barrett, M. Berry, T. F. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout, R. Pozo, C. Romine, and H. Van der Vorst: Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, 2nd ed. SIAM, Philadelphia, 1994.
2. B. Bergen and F. Hülsemann: Hierarchical hybrid grids: A framework for efficient multigrid on high performance architectures. Technical Report 03-5, Lehrstuhl für Informatik 10, Universität Erlangen-Nürnberg, 2003.
3. B. Bergen and F. Hülsemann: Hierarchical hybrid grids: data structures and core algorithms for multigrid. Numerical Linear Algebra with Applications, 11 (2004), pp. 279-291.
4. A. Brandt: Multi-level adaptive solutions to boundary-value problems. Math. Comp., 31 (1977), pp. 333-390.
5. F. Hülsemann, B. Bergen, and U. Rüde: Hierarchical hybrid grids as basis for parallel numerical solution of PDE. In H. Kosch, L. Böszörményi and H. Hellwagner (eds.): Euro-Par 2003 Parallel Processing, Lecture Notes in Computer Science Vol. 2790, Springer, Berlin, 2003, pp. 840-843.
6. F. Hülsemann, S. Meinlschmidt, B. Bergen, G. Greiner, and U. Rüde: gridlib - A parallel, object-oriented framework for hierarchical-hybrid grid structures in technical simulation and scientific visualization. In High Performance Computing in Science and Engineering, Munich 2004. Transactions of the Second Joint HLRB and KONWIHR Result and Reviewing Workshop, Springer, Berlin, 2004, pp. 37-50.
7. U. Trottenberg, C. W. Oosterlee, and A. Schüller: Multigrid. Academic Press, London, 2000.
Analyzing Advanced PDE Solvers Through Simulation Henrik Johansson, Dan Wallin, and Sverker Holmgren Department of Information Technology, Uppsala University Box 337, S-751 05 Uppsala, Sweden {henrik.johansson,dan.wallin,sverker.holmgren}@it.uu.se
Abstract. By simulating a real computer it is possible to gain a detailed knowledge of the cache memory utilization of an application, e.g., a partial differential equation (PDE) solver. Using this knowledge, we can discover regions with intricate cache memory performance. Furthermore, this information makes it possible to identify performance bottlenecks. In this paper, we employ full system simulation of a shared memory computer to perform a case study of three different PDE solver kernels with respect to cache memory performance. The kernels implement state-of-the-art solution algorithms for complex application problems and the simulations are performed for data sets of realistic size. We discovered interesting properties in the solvers, which can help us to improve their performance in the future.
1 Introduction

The performance of PDE solvers depends on many computer architecture properties. The ability to efficiently use cache memory is one such property. To analyze cache behavior, hardware counters can be used to gather rudimentary data like the cache hit rate. However, such experiments cannot identify the reason for a cache miss. By software simulation of a computer, it is possible to derive, e.g., the type of a cache miss and the amount of address and data traffic. A further advantage is that the simulated computer can be a fictitious system. In this paper, we present a case study of the cache memory behavior of three PDE solvers. The solvers are based on modern, efficient algorithms, and the settings and working sets are carefully chosen to represent realistic application problems. The study is based on full-system simulations of a shared memory multiprocessor, where the baseline computer model is set up to correspond to a commercially available system, the SunFire 6800 server. However, using a simulator, the model can easily be modified to resemble alternative design choices. We perform simulations where both the cache size and the cache line size are varied, as two simulation case studies. Many other parameters can be varied as well, but the chosen parameters show some typical properties of the studied PDE solvers. We begin this paper with a review of related work (Section 2), followed by a general discussion of simulation coupled with a description of our simulation platform (Section 3). Next, we present the PDE solvers used for the simulations (Section 4). Section 5 holds the results of the simulations, divided into two case studies. Finally, our conclusions are presented in Section 6.
2 Related Work

Early work in full system simulation was performed at Stanford University and resulted in the SimOS simulator [2]. SimOS was, among others, used to study the memory usage of commercial workloads [3], data locality on CC-NUMA systems [4] and to characterize and optimize an auto-parallelizing compiler [5]. However, the work on SimOS was discontinued in 1998. Most recent work in the field is based on the Simics full system simulator [1]. Simics has been used in a large number of publications. These include an examination of the memory system behavior for Java-based middleware [6] and a comparison of memory system behavior in Java and non-Java commercial workloads [7]. The work presented above is primarily concerned with hardware. In those papers, real applications are simulated with the goal of improving the underlying hardware. Our paper takes the opposite approach: it uses simulation in an effort to improve the software.
3 Simulation Techniques

There are several ways to analyze algorithms on complex high performance computer systems, e.g., a high-end server. One alternative is to perform measurements on the actual physical machine, either using hardware or software. Another alternative is to develop analytical or statistical models of the system. Finally, there is simulation. Simulation has several advantages over the other methods. First, the characteristics of the simulated computer can be set almost arbitrarily. Second, the simulated machine is observable and controllable at all times. Third, a simulation is deterministic; it is always possible to replicate the results obtained from the simulation. Fourth, a high degree of automation is possible. The ability to change parameters in the area of interest gives a researcher great flexibility. Looking at cache memory, we can easily change the cache line size, cache size, replacement policy and associativity, and use sub-blocked or unblocked caches. We can also determine which information to collect. In this case it can, e.g., be the number of memory accesses, cache miss rates, cache miss types and the amount of data or address traffic. There are also a few drawbacks inherent in simulation. It is in practice impossible to build a simulator that mimics the exact behavior of the target computer in every possible situation. This might lead to some differences between the collected information and the performance of the real computer. However, comparisons between the individual simulations are valid since they all encounter the same discrepancies. Estimating execution time based on simulation is difficult, since it depends heavily on the bandwidth assumptions for the memory hierarchy and on the bandwidth of the coherence broadcast and data switches. The complexity of an "exact" simulation is thus overwhelming, and performance and implementation issues make approximations necessary. The experiments are carried out using execution-driven simulation in the Simics full-system simulator [1]. Simics can simulate computer architectures at the level of individual instructions. The target computer runs unmodified programs and even a complete operating system. Usually an operating system is booted inside Simics and programs, benchmarks and tests are run on this operating system in a normal fashion. Simics
collects detailed statistics about the target computer and the software it runs. The user determines what information to collect, and how to do it. For example, the user can initially collect a wide range of data about the cache memory system as a whole. Later, in order to isolate a certain behavior, it is possible to restrict the collection to a small region of the code. The modeled system implements a snoop-based invalidation MOSI cache coherence protocol [8]. We set up the baseline cache hierarchy to resemble a SunFire 6800 server with 16 processors. The server uses UltraSPARC III processors, each equipped with two levels of caches. The processors have two separate first level caches, a 4-way associative 32 KB instruction cache and a 4-way associative 64 KB data cache. The second level cache is a shared 2-way associative cache of 8 MB and the cache line size is 64 B.
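To give a flavour of the kind of memory-hierarchy model that such experiments sweep over, the sketch below implements a minimal LRU set-associative cache that classifies accesses as hits, cold misses or capacity/conflict misses. It is purely illustrative: the names and the classification heuristic (a miss that fills a previously empty way is counted as cold) are our own simplification, not Simics' cache modules, and coherence-related miss types would require a multi-cache protocol model on top of it.

#include <stdint.h>
#include <stdlib.h>

typedef struct {
    int sets, ways, line;           /* cache geometry                         */
    uint64_t *tags; int *valid;     /* sets*ways entries, kept in LRU order   */
    long hits, cold, capacity;      /* per-cache statistics                   */
} cache_t;

cache_t *cache_new(int size_bytes, int ways, int line)
{
    cache_t *c = calloc(1, sizeof *c);
    c->sets = size_bytes / (ways * line); c->ways = ways; c->line = line;
    c->tags  = calloc((size_t)c->sets * ways, sizeof *c->tags);
    c->valid = calloc((size_t)c->sets * ways, sizeof *c->valid);
    return c;
}

void cache_access(cache_t *c, uint64_t addr)
{
    uint64_t blk = addr / (uint64_t)c->line;
    int set = (int)(blk % (uint64_t)c->sets);
    uint64_t tag = blk / (uint64_t)c->sets;
    uint64_t *t = c->tags + (size_t)set * c->ways;
    int *v = c->valid + (size_t)set * c->ways;

    for (int w = 0; w < c->ways; ++w)
        if (v[w] && t[w] == tag) {          /* hit: move entry to MRU slot 0 */
            for (; w > 0; --w) { t[w] = t[w - 1]; v[w] = v[w - 1]; }
            t[0] = tag; v[0] = 1; c->hits++;
            return;
        }

    /* miss: approximate classification, then insert at MRU, evict LRU way */
    if (!v[c->ways - 1]) c->cold++; else c->capacity++;
    for (int w = c->ways - 1; w > 0; --w) { t[w] = t[w - 1]; v[w] = v[w - 1]; }
    t[0] = tag; v[0] = 1;
}

The two case studies of Section 5 then amount to replaying the same memory-reference trace while sweeping size_bytes (case study 1) or line (case study 2).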
4 The PDE Solvers The kernels studied represent three types of PDE solvers, used for compressible flow computations in computational fluid dynamics (CFD), radar cross-section computations in computational electro-magnetics (CEM) and chemical reaction rate computations in quantum dynamics (QD). The properties of the kernels differ with respect to the amount of computations performed per memory access, memory access patterns and the amount of communication and its patterns. The CFD kernel implements the advanced algorithm described by Jameson and Caughey for solving the compressible Euler equations on a 3D structured grid using a Gauss-Seidel-Newton technique combined with multi-grid acceleration [9]. The implementation is described in detail by Nord´en [10]. The data is highly structured. The total work is dominated by local computations needed to update each cell. These computations are quite involved, but their parallelization is trivial. Each of the threads computes the updates for a slice of each grid in the multi-grid scheme. The amount of communication is small and the threads only communicate pair-wise. The CEM kernel is part of an industrial solver for determining the radar cross section of an object [11]. The solver utilizes an unstructured grid in three dimensions in combination with a finite element discretization. The resulting large system of equations is solved with a version of the conjugate gradient method. In each conjugate gradient iteration, the dominating operation is a matrix-vector multiplication with the very sparse and unstructured coefficient matrix. Here, the parallelization is performed such that each thread computes a block of entries in the result vector. The effect is that the data vector is accessed in a seemingly random way. However, the memory access pattern does not change between the iterations. The QD kernel is derived from an application where the dynamics of chemical reactions is studied using a quantum mechanical model with three degrees of freedom [12]. The solver utilizes a pseudo-spectral discretization in the two first dimensions and a finite difference scheme in the third direction. In time, an explicit ODE-solver is used. For computing the derivatives in the pseudo-spectral discretization, a standard convolution technique involving 2D fast Fourier transforms (FFTs) is applied. The parallelization is performed such that the FFTs in the x-y dimensions are performed in parallel and locally [13]. The communication within the kernel is concentrated to a transpose opera-
tion, which involves heavy all-to-all communication between the threads. However, the communication pattern is constant between the iterations in the time loop.
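The dominating operation of the CEM kernel described above is a sparse matrix-vector product. A compressed-row-storage sketch (illustrative only, not the production solver) shows why its accesses into the data vector appear random yet repeat identically in every conjugate gradient iteration:

/* y = A*x for a CRS matrix: the reads x[col[k]] jump through the vector in an
 * order dictated by the unstructured grid, but col[] never changes, so the
 * access pattern is the same in every iteration.  In the parallel version,
 * each thread computes a contiguous block of rows i. */
void spmv_crs(int n, const int *row_ptr, const int *col,
              const double *val, const double *x, double *y)
{
    for (int i = 0; i < n; ++i) {
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            sum += val[k] * x[col[k]];   /* irregular but iteration-invariant */
        y[i] = sum;
    }
}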
5 Results

We will now present two case studies performed on the PDE solvers. In the first study we simulate the solvers with different sizes of the cache memory, varied between 512 KB and 16 MB. The cache line size is held constant at 64 B. This gives a good overview of the different characteristics exhibited by the solvers. In the second study, we fix the cache memory size and vary the cache line size between 32 B and 1024 B. These simulations provide us with detailed knowledge of the cache memory utilization for each of the three solvers. The case studies should be seen as possible studies to perform in a simulated environment. Other interesting studies could for example be to vary the cache associativity or the replacement policy [14,15]. We only perform a small number of iterative steps for each PDE kernel since the access pattern is regular within each iteration. Before starting the measurements, each solver completes a full iteration to warm up the caches. The CFD problem has a grid size of 32×32×32 elements using four multi-grid levels. The CEM problem represents a modeled generic aircraft with a problem coefficient matrix of about 175,000×175,000 elements; a little more than 300,000 of these are non-zero. The QD problem size is 256×256×20, i.e. a 256×256 2D FFT in the x-y plane followed by an FDM with 20 grid points in the z-direction. The wall clock time needed to perform the simulations ranged from six up to twelve hours. The cache misses are categorized into five different cache miss types according to Eggers and Jeremiassen [16]. Cold cache misses occur when data is accessed for the first time. A capacity cache miss is encountered when the cache memory is full. We also treat conflict misses, when data is mapped to an already occupied location in the cache, as capacity misses. When shared data is modified, it can result in a true sharing miss when another processor later needs the modified data. A false sharing cache miss results when a cache line is invalidated in order to maintain coherency, but no data was actually shared. Updates are not cache misses, but the messages generated in order to invalidate a shared cache line. An update can result in a sharing miss. The data traffic is a measurement of the number of bytes transferred on the interconnect, while the term address traffic represents the number of snoops the caches have to perform. All results presented are from the second level cache.

5.1 Case Study 1: Varying the Cache Size

The CFD problem has the lowest miss rate of all the kernels. There is a relatively large amount of capacity misses when the cache size is small. This is expected, as the problem sizes for all of the applications are chosen to represent realistic workloads. There is a drop in the miss rate at a cache size of 1 MB. The drop corresponds to the point at which all of the data used to interpolate the solution to the finest grid fits in the cache. The CEM solver has a higher miss rate than the other applications. The miss rate decreases rapidly with larger cache sizes, especially between 2 MB and 4 MB. This
indicates that the vector used in the matrix-vector multiplication now fits into the cache. The large drop in miss rate between 8 MB and 16 MB tells us that cache now is large enough to hold all the data. Some of the capacity misses become true sharing misses when the size is increased. The QD-solver has a miss rate in-between the two other solvers. The miss rate is stable and constant until the whole problem fits into the cache. Most of the capacity misses have then been transformed into true sharing misses because of the classification scheme; the classification of capacity misses take precedence over the true sharing misses. The distribution of the different miss types varies between the three solvers. All of the solvers have a lot of capacity misses when the cache size is small. The capacity misses tend to disappear when the cache size is large enough to hold the full problem. The CFD-kernel exhibits large percentage of false sharing misses. The CEM-solver is dominated by true sharing misses, while the QD-solver has equal quantities of upgrades and true sharing misses for large cache sizes. For all solvers, both data and address traffic follow the general trends seen in the miss rate. 5.2 Case Study 2: Varying the Cache Line Size The CFD kernel performs most of the computations locally within each cell in the grid, leading to a low overall miss ratio. The data causing the true sharing misses and the upgrades exhibit good spatial locality. However, the true and false sharing misses decrease more slowly since they cannot be reduced below a certain level. The reduction in address traffic is proportional to the decrease in miss ratio. The decrease in cache miss ratio is influenced by a remarkable property: the false sharing misses are reduced when the line size is increased. The behavior of false sharing is normally the opposite; false sharing misses increase with larger line size. Data shared over processor boundaries are always, but incorrectly, invalidated. If a shorter cache line is used, less invalidated data will be brought into the local cache after a cache miss and accordingly, a larger number of accesses is required to bring all the requested data to the cache [14]. It is probable that this property can be exploited to improve the cache memory utilization. The CEM kernel has a large problem size, causing a high miss ratio for small cache line sizes. Capacity misses are common and can be avoided using a large cache line size. The true sharing misses and the upgrades are also reduced with a larger cache line size, but at a slower rate since the data vector is being irregular accessed. False sharing is never a problem in this application. Both the data and address traffic exhibit nice properties for large cache lines because of the rapid decrease in the miss ratio with increased line size. The QD kernel is heavily dominated by two miss types, capacity misses and upgrades, which both decrease with enlarged cache line size. The large number of upgrades is caused by the all-to-all communication pattern during the transpose operation where every element is modified after being read. This should result in an equal number of true sharing misses and upgrades, but the kernel problem size is very large and replacements take place before the true sharing misses occur. Therefore, a roughly equal amount of capacity misses and upgrades are recorded. The data traffic increases rather slowly for cache lines below 512 B. 
The address traffic shows the opposite behavior and decreases rapidly until the 256 B cache line size is reached, where it levels out.
Fig. 1. Influence of cache size on cache misses, address and data traffic (panels: (a) CFDm, (b) CFDt, (c) CEMm, (d) CEMt, (e) QDm, (f) QDt). The miss ratio in percent is indicated in the cache miss figures. The address and data traffic are normalized relative to the 8 MB configuration
6 Conclusion

We have shown that full system simulation of PDE solvers with realistic problem sizes is possible. By using simulation, we can obtain important information about the run time behavior of the solvers. The information is difficult to collect using traditional methods such as hardware counters. Simulation also allows us to run the solvers on non-existing computer systems and configurations. Our three PDE solvers were found to have vastly different cache memory characteristics. The collected data in combination with a good knowledge of the algorithms made it possible to understand the behavior of each solver in great detail. We are certain that this understanding will make it easier to find ways to better utilize the cache memory.
Fig. 2. Influence of cache line size on cache misses, address and data traffic (panels: (a) CFDm, (b) CFDt, (c) CEMm, (d) CEMt, (e) QDm, (f) QDt). The miss ratio in percent is indicated in the cache miss figures. The address and data traffic are normalized relative to the 64 B configuration
References

1. P. S. Magnusson et al.: Simics: A Full System Simulation Platform. IEEE Computer, Vol. 35, No. 2, pp. 50-58, 2002.
2. M. Rosenblum, E. Bugnion, S. Devine, and S. Herrod: Using the SimOS Machine Simulator to Study Complex Computer Systems. ACM TOMACS Special Issue on Computer Simulation, 1997.
3. L. A. Barroso, K. Gharachorloo and E. Bugnion: Memory System Characterization of Commercial Workloads. In Proceedings of the 25th International Symposium on Computer Architecture, June 1998.
4. B. Verghese, S. Devine, A. Gupta, and M. Rosenblum: Operating System Support for Improving Data Locality on CC-NUMA Compute Servers. In ASPLOS VII, Cambridge, MA, 1996.
5. E. Bugnion, J. M. Anderson and M. Rosenblum: Using SimOS to Characterize and Optimize Auto-Parallelized SUIF Applications. First SUIF Compiler Workshop, Stanford University, Jan. 11-13, 1996.
6. M. Karlsson, K. Moore, E. Hagersten, and D. Wood: Memory System Behavior of Java-Based Middleware. In Proceedings of the Ninth International Symposium on High Performance Computer Architecture (HPCA-9), Anaheim, California, USA, February 2003.
7. M. Marden, S. Lu, K. Lai, and M. Lipasti: Comparison of Memory System Behavior in Java and Non-Java Commercial Workloads. In Proceedings of the Computer Architecture Evaluation using Commercial Workloads workshop (CAECW-02), February 2, 2002.
8. J. Hennessy and D. Patterson: Computer Architecture: A Quantitative Approach. Morgan Kaufmann.
9. A. Jameson and D. A. Caughey: How Many Steps are Required to Solve the Euler Equations of Steady Compressible Flow: In Search of a Fast Solution Algorithm. AIAA 2001-2673, 15th AIAA Computational Fluid Dynamics Conference, June 11-14, 2001, Anaheim, CA.
10. M. Nordén, M. Silva, S. Holmgren, M. Thuné and R. Wait: Implementation Issues for High Performance CFD. Proceedings of the International Information Technology Conference, Colombo, 2002.
11. F. Edelvik: Hybrid Solvers for the Maxwell Equations in Time-Domain. PhD thesis, Dep. of Information Technology, Uppsala University, 2002.
12. Å. Petersson, H. Karlsson and S. Holmgren: Predissociation of the Ar-I2 van der Waals Molecule, a 3D Study Performed Using Parallel Computers. Submitted to Journal of Physical Chemistry, 2002.
13. D. Wallin: Performance of a High-Accuracy PDE Solver on a Self-Optimizing NUMA Architecture. Master's thesis, Dep. of Information Technology, Uppsala University, 2001.
14. H. Johansson: An Analysis of Three Different PDE-Solvers With Respect to Cache Performance. Master's thesis, Dep. of Information Technology, Uppsala University, 2003.
15. D. Wallin, H. Johansson and S. Holmgren: Cache Memory Behavior of Advanced PDE Solvers. Parallel Computing: Software Technology, Algorithms, Architectures and Applications 13, Proceedings of the International Conference ParCo2003, pp. 475-482, Dresden, Germany, 2003.
16. S. J. Eggers and T. E. Jeremiassen: Eliminating False Sharing. Proceedings of the International Conference on Parallel Processing, pp. 377-381, 1991.
Towards Cache-Optimized Multigrid Using Patch-Adaptive Relaxation

Markus Kowarschik, Iris Christadler, and Ulrich Rüde

System Simulation Group, Computer Science Department, Friedrich-Alexander-University Erlangen-Nuremberg, Germany
{Markus.Kowarschik,Iris.Christadler,Ulrich.Ruede}@cs.fau.de
Abstract. Most of today's computer architectures employ fast, yet relatively small cache memories in order to mitigate the effects of the constantly widening gap between CPU speed and main memory performance. Efficient execution of numerically intensive programs can only be expected if these hierarchical memory designs are respected. Our work targets the optimization of the cache performance of multigrid codes. The research efforts we will present in this paper first cover transformations that may be automated and then focus on fundamental algorithmic modifications which require careful mathematical analysis. We will present experimental results for the latter.
1 Introduction

In order to mitigate the effects of the constantly widening gap between CPU speed and main memory performance, today's computer architectures typically employ hierarchical memory designs involving several layers of fast, yet rather small cache memories (caches) [6]. Caches contain copies of frequently used data. However, efficient code execution can only be expected if the codes respect the structure of the underlying memory subsystem. Unfortunately, current optimizing compilers are not able to synthesize chains of complicated cache-based code transformations. Hence, they rarely deliver the performance expected by the users and much of the tedious and error-prone work concerning the tuning of the memory efficiency is thus left to the software developer [3,8]. The research we will present in this paper focuses on the optimization of the data cache performance of multigrid methods which have been shown to be among the fastest numerical solution methods for large sparse linear systems arising from the discretization of elliptic partial differential equations (PDEs) [14]. In particular, we will concentrate on the optimization of the smoother which is typically the most time-consuming part of a multigrid implementation [7,15]. The smoother is employed in order to eliminate the highly oscillating components of the error on the corresponding grid levels. Conversely, the coarse-grid correction is responsible for the reduction of the slowly oscillating error frequencies. Our techniques can be divided into two categories. On the one hand, we will briefly address "compiler-oriented" cache optimization approaches which do not modify the results of the numerical computations and may therefore be fully automated using a
compiler or a source code restructuring tool; e.g., ROSE [11]. On the other hand, we will describe a more fundamental approach which is derived from the theory of the fully adaptive multigrid method [13] and based on a patch-adaptive processing strategy. This novel algorithm requires careful mathematical analysis and is therefore a long way from being automatically introduced by a compiler [7]. See [3,8,9] for comprehensive surveys on memory hierarchy optimizations. In particular, cache performance optimizations for computations on structured grids have further been presented in [7,12,15]. For the case of unstructured grids, we particularly refer to [2]. Our paper is structured as follows. Compiler-oriented cache-based transformations will be reviewed briefly in Section 2. Section 3 contains the description of the adaptive relaxation schemes. Afterwards, experimental results will be demonstrated in Section 4. In Section 5, we will draw some final conclusions.
2 Compiler-Oriented Data Locality Optimizations 2.1 Data Layout Optimizations Data layout optimizations aim at enhancing code performance by improving the arrangement of the data in address space [7]. On the one hand, such techniques are applied to change the mapping of array data to the cache frames, thereby reducing the number of cache conflict misses and avoiding cache thrashing [3]. This is commonly achieved by a layout transformation called array padding which introduces additional array elements that are never referenced in order to modify the relative distances of array entries in address space [12]. On the other hand, data layout optimizations can be applied to increase spatial locality. In particular, they can be introduced in order to enhance the reuse of cache blocks once they have been loaded into cache. Since cache blocks contain several data items that are arranged next to each other in address space it is reasonable to aggregate those data items which are likely to be referenced within a short period of time. This technique is called array merging [6]. For a detailed discussion of data layout optimizations for multigrid codes, we again refer to [7,15]. 2.2 Data Access Optimizations Data access optimizations are code transformations which change the order in which iterations in a loop nest are executed. These transformations strive to improve both spatial and temporal locality. Moreover, they can also expose parallelism and make loop iterations vectorizable. Classical loop transformations cover loop fusion as well as loop blocking (tiling), amongst others [1]. Loop fusion refers to a transformation which takes two adjacent loops that have the same iteration space traversal and combines their bodies into a single loop, thereby enhancing temporal locality. Furthermore, fusing two loops results in a single loop which contains more instructions in its body and thus offers increased instruction-level parallelism. Finally, only one loop is executed, thus reducing the total loop overhead by a factor of 0.5 [8].
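A compact C sketch of the transformations named so far — array padding, array merging and loop fusion — is given below. It is illustrative only; the array names, the padding constant and the kernels are hypothetical, and suitable values are machine dependent.

#define N 1024

/* Array padding: a few unused columns change how rows map to cache sets and
 * can eliminate systematic conflict misses between arrays whose leading
 * dimension would otherwise be a power of two.  PAD is a tunable constant. */
#define PAD 8
double u[N][N + PAD], f[N][N + PAD];

/* Array merging: values that are always used together (here a residual and a
 * diagonal entry) are interleaved so that both land in the same cache block. */
typedef struct { double res; double diag; } point_t;
point_t p[N * N];

/* Loop fusion: both sweeps traverse the same iteration space; fusing them
 * reuses b[i] and a[i] while they are still in registers or in cache and
 * halves the loop overhead.  Fusion is legal here because the second
 * statement at iteration i only reads a[i], which was just computed. */
void unfused(int n, double *a, double *b, const double *c)
{
    for (int i = 0; i < n; ++i) a[i] = b[i] + c[i];
    for (int i = 0; i < n; ++i) b[i] = 0.5 * (a[i] + c[i]);
}

void fused(int n, double *a, double *b, const double *c)
{
    for (int i = 0; i < n; ++i) {
        a[i] = b[i] + c[i];
        b[i] = 0.5 * (a[i] + c[i]);
    }
}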
Loop blocking refers to a loop transformation which increases the depth of a loop nest with depth d by adding additional loops. The depth of the resulting loop nest will be anything from d + 1 to 2d. Loop blocking is primarily used to improve temporal locality by enhancing the reuse of data in cache and reducing the number of cache capacity misses. Tiling a single loop replaces it by a pair of loops. The inner loop of the new loop nest traverses a block (tile) of the original iteration space with the same increment as the original loop. The outer loop traverses the original iteration space with an increment equal to the size of the block which is traversed by the inner loop. Thus, the outer loop feeds blocks of the whole iteration space to the inner loop which then executes them step by step [8]. As is the case for data layout transformations, a comprehensive overview of data access optimizations for multigrid codes can be found in [7,15]. See [2] for blocking approaches for unstructured grids.
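As an illustration of the tiling transformation just described, the sketch below blocks a 2D 5-point sweep so that one B×B tile (plus its halo) stays in cache while it is traversed. It is a generic example under our own assumptions, not code from the multigrid solver; B is a tunable blocking factor, and the min-style loop bounds handle remainder tiles.

#define B 64

void sweep_tiled(int n, const double (*u)[n], double (*unew)[n])
{
    for (int jj = 1; jj < n - 1; jj += B)             /* outer loops feed tiles */
        for (int ii = 1; ii < n - 1; ii += B)
            for (int j = jj; j < (jj + B < n - 1 ? jj + B : n - 1); ++j)
                for (int i = ii; i < (ii + B < n - 1 ? ii + B : n - 1); ++i)
                    unew[j][i] = 0.25 * (u[j][i - 1] + u[j][i + 1] +
                                         u[j - 1][i] + u[j + 1][i]);
}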
3 Enhancing Data Locality Through Adaptive Relaxation 3.1 Overview: Fully Adaptive Multigrid Our novel patch-adaptive multigrid scheme is derived from the theory of the fully adaptive multigrid method. The latter combines two different kinds of adaptivity into a single algorithmic framework [13]. Firstly, adaptive mesh refinement is taken into account in order to reduce the number of degrees of freedom and, as a consequence, the computational work as well as the memory consumption of the implementation. Secondly, the fully adaptive multigrid method employs the principle of adaptive relaxation [13]. With this update strategy, computational work can be focused on where it can efficiently improve the quality of the numerical approximation. In this paper, we will concentrate on the principle of adaptive relaxation only and demonstrate how it can be exploited to enhance cache efficiency. For the sake of simplicity, we will restrict the following presentation of the algorithms to a single grid level and thus omit all level indices. However, due to the uniform reduction of the algebraic error, this scheme reveals its full potential when used as a smoother in a multilevel structure [13]. 3.2 Preliminaries We consider the system Ax = b , A = (ai,j )1≤i,j≤n ,
(3.1)
of linear equations and assume that the matrix A is sparse and symmetric positive definite. As a common example, standard second-order finite difference discretizations of the negative Laplacian yield such matrices. The exact solution of (3.1) shall be denoted as x∗. We assume that x stands for an approximation to x∗. The scaled residual r̄ corresponding to x is then defined as r̄ := D^{-1}(b − Ax), where D denotes the diagonal part of A. With e_i representing the i-th unit vector, the components θ_i of the scaled residual r̄ are given by θ_i := e_i^T r̄, 1 ≤ i ≤ n.
An elementary relaxation step for equation i of the linear system (3.1), which updates the approximation x and yields the new approximation x′, is given by x′ = x + θ_i e_i, and the Gauss-Seidel update scheme [5] can be written as follows:

    x_i^{(k+1)} = a_{i,i}^{-1} ( b_i − Σ_{j<i} a_{i,j} x_j^{(k+1)} − Σ_{j>i} a_{i,j} x_j^{(k)} ) = x_i^{(k)} + θ_i .
Here, θ_i represents the current scaled residual of equation i; i.e., the scaled residual of equation i immediately before this elementary update operation. As a consequence of this elementary relaxation step, the i-th component of r̄ vanishes. The motivation of the development of the adaptive relaxation method is based on the following relation, which states how the algebraic error x∗ − x is reduced by performing a single elementary relaxation step:

    ||x∗ − x||_A^2 − ||x∗ − x′||_A^2 = a_{i,i} θ_i^2 ,    (3.2)
see [13]. As usual, ||v||_A stands for the energy norm of v. Equation (3.2) implies that, with every elementary relaxation step, the approximation is either improved (if θ_i ≠ 0) or remains unchanged (if θ_i = 0). Note that, since A is positive definite, all diagonal entries a_{i,i} of A are positive. Hence, it is the positive definiteness of A which guarantees that the solution x will never become "worse" by applying elementary relaxation operations. Equation (3.2) further represents the mathematical justification for the selection of the (numerically) most efficient equation update order. It states that the fastest error reduction is obtained by relaxing those equations i whose scaled residuals θ_i have relatively large absolute values. This is particularly exploited in the method of Gauss-Southwell where always the equation i with the largest value |a_{i,i} θ_i| is selected for the next elementary relaxation step. Yet, if no suitable update strategy is used, this algorithm will be rather inefficient since determining the equation whose residual has the largest absolute value is as expensive as relaxing each equation once [13]. This observation motivates the principle of the adaptive relaxation method which is based on an active set of equation indices and can be interpreted as an algorithmically efficient approximation to the method of Gauss-Southwell. For a concise description of the operations on the active set and the method of adaptive relaxation, it is convenient to introduce the set of connections of grid node i, 1 ≤ i ≤ n, as

    Conn(i) := {j; 1 ≤ j ≤ n, j ≠ i, a_{j,i} ≠ 0} .

Intuitively, the set Conn(i) contains the indices of those unknowns u_j that directly depend on u_i, except for u_i itself. Note that, due to the symmetry of A, these are exactly the unknowns u_j, j ≠ i, that u_i directly depends on. Before we can provide an algorithmic formulation of the adaptive relaxation method, it is further necessary to introduce the strictly active set S(θ, x) of equation indices:

    S(θ, x) := {i; 1 ≤ i ≤ n, |θ_i| > θ} .

Hence, the set S(θ, x) contains the indices i of all equations whose current scaled residuals θ_i have absolute values greater than the prescribed threshold θ > 0. Possible alternatives of this definition are discussed in [13].
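For a matrix stored in CRS format, the quantities just defined are straightforward to compute. The following sketch (illustrative only; names are not taken from the paper's implementation) evaluates the scaled residual θ_i of one equation and collects the strictly active set; the off-diagonal column indices of row i are exactly Conn(i), since A is symmetric.

#include <math.h>

double scaled_residual(int i, const int *row_ptr, const int *col,
                       const double *val, const double *b, const double *x)
{
    double r = b[i], diag = 0.0;
    for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k) {
        if (col[k] == i) diag = val[k];
        r -= val[k] * x[col[k]];
    }
    return r / diag;                /* theta_i = (b_i - (Ax)_i) / a_ii */
}

/* Fills set[] with all indices i such that |theta_i| > theta and returns
 * their number; set[] must provide room for n entries. */
int strictly_active(int n, const int *row_ptr, const int *col,
                    const double *val, const double *b, const double *x,
                    double theta, int *set)
{
    int count = 0;
    for (int i = 0; i < n; ++i)
        if (fabs(scaled_residual(i, row_ptr, col, val, b, x)) > theta)
            set[count++] = i;
    return count;
}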
Finally, an active set S̃(θ, x) is a superset of the strictly active set S(θ, x):

    S(θ, x) ⊆ S̃(θ, x) ⊆ {i; 1 ≤ i ≤ n} .

Note that S̃(θ, x) may also contain indices of equations whose scaled residuals have relatively small absolute values. The idea behind the introduction of such an extended active set is that it can be maintained efficiently. This is exploited by the adaptive relaxation algorithms that we will present next.

3.3 Point-Based Adaptive Relaxation

The core of the sequential (Gauss-Seidel-type) adaptive relaxation method¹ is presented in Algorithm 1. Note that it is essential to initialize the active set S̃ appropriately. Unless problem-specific information is known a priori, S̃ is initialized with all indices i, 1 ≤ i ≤ n.

Algorithm 1 Sequential adaptive relaxation
Require: tolerance θ > 0, initial guess x, initial active set S̃ ⊆ {i; 1 ≤ i ≤ n}
1: while S̃ ≠ ∅ do
2:   Pick i ∈ S̃   {Nondeterministic choice}
3:   S̃ ← S̃ \ {i}
4:   if |θ_i| > θ then
5:     x ← x + θ_i e_i   {Elementary relaxation step}
6:     S̃ ← S̃ ∪ Conn(i)
7:   end if
8: end while
Ensure: S = ∅   {The strictly active set S is empty}

The algorithm proceeds as follows. Elementary relaxation steps are performed until the active set S̃ is empty, which implies that none of the absolute values of the scaled residuals θ_i exceeds the given tolerance θ anymore. In Steps 2 and 3, an equation index i ∈ S̃ is selected and removed from S̃. Only if the absolute value of the corresponding scaled residual θ_i is larger than θ (Step 4), equation i will be relaxed. The relaxation of equation i implies that all scaled residuals θ_j, j ∈ Conn(i), are changed, and it must therefore be checked subsequently whether their absolute values |θ_j| now exceed the tolerance θ. Hence, if equation i is relaxed (Step 5), the indices j ∈ Conn(i) will be put into the active set S̃ (Step 6), unless they are already contained in S̃.
¹ A simultaneous (Jacobi-type) adaptive relaxation scheme is also discussed in [13].
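A possible realisation of Algorithm 1 on a CRS matrix is sketched below, using a FIFO worklist plus a flag array as the active set and the scaled_residual() helper shown above. This is our own illustrative rendering under those assumptions, not the paper's implementation; in particular, θ_i is recomputed on demand instead of being maintained incrementally.

#include <math.h>
#include <stdlib.h>

void adaptive_relaxation(int n, const int *row_ptr, const int *col,
                         const double *val, const double *b,
                         double *x, double theta)
{
    int  *queue  = malloc((size_t)n * sizeof *queue);  /* circular buffer     */
    char *in_set = calloc((size_t)n, 1);               /* membership in S~    */
    int head = 0, count = 0;

    for (int i = 0; i < n; ++i) {                      /* initialise S~ = all */
        queue[(head + count++) % n] = i; in_set[i] = 1;
    }
    while (count > 0) {                                /* Step 1: S~ != empty */
        int i = queue[head]; head = (head + 1) % n; count--;
        in_set[i] = 0;                                 /* Steps 2-3           */
        double th = scaled_residual(i, row_ptr, col, val, b, x);
        if (fabs(th) > theta) {                        /* Step 4              */
            x[i] += th;                                /* Step 5              */
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k) {
                int j = col[k];                        /* Step 6: Conn(i)     */
                if (j != i && !in_set[j]) {
                    queue[(head + count++) % n] = j; in_set[j] = 1;
                }
            }
        }
    }
    free(queue); free(in_set);
}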
The application of the adaptive relaxation method in the multigrid context corresponds to the approach of local relaxation. The principle of local relaxation states that the robustness of multigrid algorithms can be improved significantly by performing additional relaxation steps in the vicinity of singularities or perturbations from boundary conditions, for example. The purpose of these additional elementary relaxation steps is to enforce a sufficiently smooth error such that the subsequent coarse-grid correction performs well. We refer to [13] for further details and especially to the references provided therein. At this point, we eventually have everything in place to introduce our novel patch-adaptive relaxation scheme which can be considered cache-aware by construction.

3.4 Cache-Optimized Patch-Based Adaptive Relaxation

A patch can be interpreted as a frame (window) that specifies a region of a computational grid. For ease of presentation, we only consider non-overlapping patches. Our sequential patch-based adaptive relaxation scheme parallels the point-oriented version from Section 3.3. This implies that, when introduced into a multigrid structure, it also follows the aforementioned principle of local relaxation and, in general, behaves in a more robust fashion than standard multigrid algorithms. In the case of patch-adaptive relaxation, however, the active set S̃ does not contain the indices of individual equations (i.e., the indices of individual grid nodes). Instead, S̃ contains the indices of patches. The set of all patches is denoted as P. Since the number of patches is usually much smaller than the number of unknowns, the granularity of the adaptive relaxation scheme and, in addition, the overhead of maintaining the active set during the iteration are both reduced significantly. Furthermore, the patch-adaptive relaxation is characterized by its inherent data locality. Consequently, it leads to a high utilization of the cache and therefore to fast code execution. This is accomplished through the use of appropriately sized patches as well as through the repeated relaxation of a patch once it has been loaded into cache. It has been demonstrated that, once an appropriately sized patch has been loaded into the cache and updated for the first time, the costs of subsequently relaxing it a few more times are relatively low [10]. See [7] for details. The core of the sequential patch-adaptive relaxation scheme is presented in Algorithm 2². Upon termination, the strictly active set S is empty and, therefore, each scaled residual θ_i is less than or equal to the prescribed threshold θ, 1 ≤ i ≤ n. In Step 4, r̄_P is used to represent the scaled residual corresponding to patch P (i.e., to the nodes covered by P) and ||r̄_P||_∞ denotes its maximum norm. Relaxing a patch (Step 5) means performing Gauss-Seidel steps on its nodes until each of its own scaled residuals θ_i has fallen below the (reduced) tolerance θ − δ. As a consequence, after a patch P has been relaxed, any of its neighbor patches P′ will only be inserted into the active set if the absolute values of the scaled residuals of P′ have been increased "significantly". This allows for the application of sophisticated activation strategies in Step 6, which aim at avoiding unnecessary patch relaxations. A
² As is the case for the point-based method, a simultaneous version of the patch-based scheme is also conceivable.
Algorithm 2 Sequential patch-adaptive relaxation
Require: tolerance θ > 0, tolerance decrement δ = δ(θ), 0 < δ < θ, initial guess x, initial active set S̃ ⊆ P
1: while S̃ ≠ ∅ do
2:   Pick P ∈ S̃   {Nondeterministic choice}
3:   S̃ ← S̃ \ {P}
4:   if ‖r̄P‖∞ > θ − δ then
5:     relax(P)
6:     activateNeighbors(P)
7:   end if
8: end while
Ensure: S = ∅   {The strictly active set S is empty}
A comprehensive discussion of such strategies is beyond the scope of this paper. Instead, we again point to [7].
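As a complement to the pseudocode, the outer loop of Algorithm 2 can be sketched in C as follows. The patch interface (residual norm, relaxation kernel, neighbor list) is an assumption of this sketch, and the activation strategy shown, which simply re-activates every neighbor of a relaxed patch, is only the most basic of the strategies mentioned above.

#include <stdlib.h>

/* The grid, the patch layout and the relaxation kernel are supplied by the
 * application; this sketch only captures the control flow of Algorithm 2. */
typedef struct {
    int num_patches;
    /* Largest scaled residual |theta_i| over the nodes of patch p.        */
    double (*patch_residual_norm)(void *grid, int p);
    /* Gauss-Seidel sweeps over patch p until all |theta_i| <= tol.        */
    void (*relax_patch)(void *grid, int p, double tol);
    /* Fill neigh[] with indices of patches adjacent to p, return count.   */
    int (*neighbors)(void *grid, int p, int *neigh);
} PatchOps;

void patch_adaptive_relaxation(void *grid, const PatchOps *ops,
                               double theta, double delta)
{
    int np = ops->num_patches;
    int *stack   = malloc(np * sizeof *stack);
    char *active = calloc(np, 1);
    int *neigh   = malloc(np * sizeof *neigh);
    int top = 0;

    for (int p = 0; p < np; ++p) { stack[top++] = p; active[p] = 1; }

    while (top > 0) {
        int p = stack[--top];                       /* pick P              */
        active[p] = 0;                              /* remove P            */
        if (ops->patch_residual_norm(grid, p) > theta - delta) {
            ops->relax_patch(grid, p, theta - delta);   /* relax(P)        */
            /* activateNeighbors(P): the simplest strategy re-activates
             * every adjacent patch; smarter strategies would test how much
             * the neighbors' residuals actually grew.                     */
            int nn = ops->neighbors(grid, p, neigh);
            for (int k = 0; k < nn; ++k) {
                int q = neigh[k];
                if (!active[q]) { stack[top++] = q; active[q] = 1; }
            }
        }
    }
    free(stack); free(active); free(neigh);
}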
4 Experimental Results

Performance results that demonstrate the effectiveness of our data layout transformations as well as our data access transformations have extensively been studied in [15] for the 2D case and in [7] for the 3D case. Therefore, this section will concentrate on the efficiency of the patch-adaptive multigrid scheme.

4.1 Model Problem

We define an L-shaped domain Ω := {(x, y) ∈ (−0.5, 0.5)²; x < 0 or y > 0} ⊂ R² and introduce 2D polar coordinates (r, ϕ), r ≥ 0, 0 ≤ ϕ ≤ 2π, as usual; i.e., we have x = r cos ϕ and y = r sin ϕ. We consider the Dirichlet boundary value problem

−∆u = 0 in Ω,                          (4.3)
u(r, ϕ) = sin(2ϕ/3) r^(2/3) on ∂Ω.     (4.4)
It is straightforward to verify that the analytical solution of (4.3), (4.4) is given by u(r, ϕ) := sin(2ϕ/3) r^(2/3) and that we have u ∈ C²(Ω) ∩ C⁰(Ω̄), but u ∉ C¹(Ω̄). This singular behavior of u results from the fact that the first derivatives of u are not bounded in r = 0, cf. [4]. Figure 1 shows a plot of u using a grid with a mesh width of 64⁻¹. In the following, we will demonstrate how this model problem can be solved efficiently using our patch-adaptive multigrid scheme.

Fig. 1. Analytical solution of the model problem

4.2 Numerical Tests

We have employed a hierarchy of nine regular grids with a finest grid of mesh width 1024⁻¹. Standard coarsening has been used. The Laplacian has been discretized anew
on each level using bilinear basis functions on the quadrilateral finite elements. This discretization approach leads to constant 9-point stencils on each grid level. We have determined the total numbers of elementary relaxation steps per grid level for a multigrid V(0,4)-cycle involving a red-black Gauss-Seidel (RBGS) smoother. Additionally, we have determined the corresponding numbers for the patch-adaptive multigrid scheme. The patch-adaptive relaxation scheme from Section 3.4 has been applied to smooth the errors on the five finest levels, using a patch size of approximately 32 × 32 nodes. On the four coarser levels, a standard red-black Gauss-Seidel smoother has been employed. This is because it is only reasonable to apply the patch-adaptive scheme if the number of unknowns on the corresponding level is large enough such that it pays off to implement a smoother based on an active set strategy. The results of the comparison are summarized in Table 2. They clearly emphasize the numerical efficiency of the multigrid method involving our patch-adaptive smoother. It is obvious that the patch-adaptive multigrid algorithm requires far less computational work than the standard one, even when counting all necessary residual calculations in addition to the actual number of elementary relaxation steps, cf. again Algorithm 2 from Section 3.4. This is due to the fact that numerical schemes based on adaptive relaxation generally represent good candidates for solving such problems that exhibit singularities [13]. Their applicability to "smoother" problems, however, is subject to future research. Moreover, the RBGS-based multigrid method exhibits an average residual reduction factor of 0.22 per V-cycle³. Therefore, we have prescribed the tolerance θ := 0.22 for the patch-adaptive smoother on each respective grid level, including the finest one, which is typically considered solely for convergence rate measurements. The tolerance decrement δ has been chosen empirically as δ := 0.2θ. The patch-adaptive multigrid scheme has actually exhibited a significantly better residual reduction factor of 0.063. This observation again emphasizes the enhanced numerical efficiency of the patch-adaptive scheme, cf. Section 3.3.
³ This means that we have divided the maximum norms of the scaled residuals on the finest grid level before and after each V-cycle.
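For orientation, one sweep of a red-black Gauss-Seidel smoother for a constant 9-point stencil, as used in the standard multigrid variant, could look as follows; the lexicographic storage and the stencil ordering are assumptions of this sketch rather than details taken from the paper.

/* One red-black Gauss-Seidel sweep for a constant 9-point stencil on an
 * (n+1) x (n+1) grid with Dirichlet boundary and lexicographic storage
 * u[j*(n+1)+i].  s[0..8] holds the stencil weights ordered
 * SW, S, SE, W, C, E, NW, N, NE; s[4] is the (nonzero) central weight. */
static void rbgs_sweep(double *u, const double *f, const double *s, int n)
{
    for (int color = 0; color < 2; ++color)       /* two checkerboard sets */
        for (int j = 1; j < n; ++j)
            for (int i = 1 + ((j + color) & 1); i < n; i += 2) {
                int idx = j * (n + 1) + i;
                double sum = f[idx]
                    - s[0] * u[idx - (n + 1) - 1] - s[1] * u[idx - (n + 1)]
                    - s[2] * u[idx - (n + 1) + 1]
                    - s[3] * u[idx - 1]           - s[5] * u[idx + 1]
                    - s[6] * u[idx + (n + 1) - 1] - s[7] * u[idx + (n + 1)]
                    - s[8] * u[idx + (n + 1) + 1];
                u[idx] = sum / s[4];
            }
}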
Level   unknowns    RBGS-based MG       Patch-adapt. MG     Patch-adapt. MG residual comp.
                    elem. rel. steps    elem. rel. steps    + elem. rel. steps
0             5            20                  20                  20
1            33           132                 132                 132
2           161           644                 644                 644
3           705         2 820               2 820               2 820
4         2 945        11 780              54 932              57 877
5        12 033        48 132              44 672              56 705
6        48 641       194 564              44 672              93 313
7       195 585       782 340              48 640             244 225
8       784 385     3 137 540              56 832             841 217
Total 1 044 493     4 177 972             253 364           1 296 953

Fig. 2. Comparison of standard and patch-adaptive multigrid methods
A detailed analysis of the behavior of the patch-adaptive multigrid scheme reveals that, on the coarsest level on which the patch-adaptive smoother is used, more elementary relaxation steps are performed than in the case of the standard smoother. However, this additional work pays off on the finer levels, where only those patches in the vicinity of the singularity at r = 0 are relaxed more intensively. We again refer to [7] for a comprehensive discussion. Preliminary performance experiments have indicated the improved cache efficiency of the patch-adaptive relaxation scheme. However, the development of highly tuned implementations which exhibit high floating-point performance (in terms of MFLOPS rates) is still in progress.
5 Conclusions The continued improvements in processor performance are generally placing increasing pressure on the memory hierarchy. Therefore, we have developed and examined optimization techniques in order to enhance the data locality exhibited by grid-based numerical computations, particularly the time-consuming smoothers in multigrid algorithms. We have presented a promising approach towards the design of novel, inherently cache-aware algorithms and, so far, we have demonstrated its enhanced numerical robustness. The development of highly tuned hardware-aware implementations, however, is the focus of ongoing and future work.
References 1. R. Allen and K. Kennedy. Optimizing Compilers for Modern Architectures. Morgan Kaufmann, 2001.
2. C.C. Douglas, J. Hu, M. Kowarschik, U. Rüde, and C. Weiß. Cache Optimization for Structured and Unstructured Grid Multigrid. Electronic Transactions on Numerical Analysis, 10:21–40, 2000.
3. S. Goedecker and A. Hoisie. Performance Optimization of Numerically Intensive Codes. SIAM, 2001.
4. W. Hackbusch. Elliptic Differential Equations — Theory and Numerical Treatment, volume 18 of Springer Series in Computational Mathematics. Springer, 1992.
5. W. Hackbusch. Iterative Solution of Large Sparse Systems of Equations, volume 95 of Applied Mathematical Sciences. Springer, 1993.
6. J.L. Hennessy and D.A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann, 3rd edition, 2003.
7. M. Kowarschik. Data Locality Optimizations for Iterative Numerical Algorithms and Cellular Automata on Hierarchical Memory Architectures. PhD thesis, Lehrstuhl für Informatik 10 (Systemsimulation), Institut für Informatik, Universität Erlangen-Nürnberg, Erlangen, Germany, July 2004. SCS Publishing House.
8. M. Kowarschik and C. Weiß. An Overview of Cache Optimization Techniques and Cache-Aware Numerical Algorithms. In U. Meyer, P. Sanders, and J. Sibeyn, editors, Algorithms for Memory Hierarchies — Advanced Lectures, volume 2625 of Lecture Notes in Computer Science, pages 213–232. Springer, 2003.
9. D. Loshin. Efficient Memory Programming. McGraw-Hill, 1998.
10. H. Lötzbeyer and U. Rüde. Patch-Adaptive Multilevel Iteration. BIT, 37(3):739–758, 1997.
11. D. Quinlan, M. Schordan, B. Miller, and M. Kowarschik. Parallel Object-Oriented Framework Optimization. Concurrency and Computation: Practice and Experience, 16(2–3):293–302, 2004. Special Issue: Compilers for Parallel Computers.
12. G. Rivera and C.-W. Tseng. Tiling Optimizations for 3D Scientific Computations. In Proc. of the ACM/IEEE Supercomputing Conf., Dallas, Texas, USA, 2000.
13. U. Rüde. Mathematical and Computational Techniques for Multilevel Adaptive Methods, volume 13 of Frontiers in Applied Mathematics. SIAM, 1993.
14. U. Trottenberg, C. Oosterlee, and A. Schüller. Multigrid. Academic Press, 2001.
15. C. Weiß. Data Locality Optimizations for Multigrid Methods on Structured Grids. PhD thesis, Lehrstuhl für Rechnertechnik und Rechnerorganisation, Institut für Informatik, Technische Universität München, Munich, Germany, December 2001.
Hierarchical Partitioning and Dynamic Load Balancing for Scientific Computation

James D. Teresco¹, Jamal Faik², and Joseph E. Flaherty²

¹ Department of Computer Science, Williams College, Williamstown, MA 01267, USA, [email protected]
² Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY 12180, USA, {faikj,flaherje}@cs.rpi.edu
Abstract. Cluster and grid computing has made hierarchical and heterogeneous computing systems increasingly common as target environments for large-scale scientific computation. A cluster may consist of a network of multiprocessors. A grid computation may involve communication across slow interfaces. Modern supercomputers are often large clusters with hierarchical network structures. For maximum efficiency, software must adapt to the computing environment. We focus on partitioning and dynamic load balancing, in particular on hierarchical procedures implemented within the Zoltan Toolkit, guided by DRUM, the Dynamic Resource Utilization Model. Here, different balancing procedures are used in different parts of the domain. Preliminary results show that hierarchical partitionings are competitive with the best traditional methods on a small hierarchical cluster.
Introduction

Modern three-dimensional scientific computations must execute in parallel to achieve acceptable performance. Target parallel environments range from clusters of workstations to the largest tightly-coupled supercomputers. Hierarchical and heterogeneous systems are increasingly common as symmetric multiprocessing (SMP) nodes are combined to form the relatively small clusters found in many institutions as well as many of today's most powerful supercomputers. Network hierarchies arise as grid technologies make Internet execution more likely and modern supercomputers are built using hierarchical interconnection networks. MPI implementations may exhibit very different performance characteristics depending on the underlying network and message passing implementation (e.g., [32]). Software efficiency may be improved using optimizations based on system characteristics and domain knowledge. Some have accounted for clusters of SMPs by using a hybrid programming model, with message passing for inter-node communication and multithreading for intra-node communication (e.g., [1,27]), with varying degrees of success, but always with an increased burden on programmers to program both levels of parallelization. Our focus has been on resource-aware partitioning and dynamic load balancing, achieved by adjusting target partition sizes or the choice of a dynamic load-balancing
procedure or its parameters, or by using a combination of load-balancing procedures. We retain the flat message passing programming model. For hierarchical and heterogeneous systems, different choices of load balancing procedures may be appropriate in different parts of the parallel environment. There are tradeoffs in execution time and partition quality (e.g., surface indices, interprocess connectivity, strictness of load balance) [35] and some may be more important than others in some circumstances. For example, consider a cluster of SMP nodes connected by Ethernet. A more costly graph partitioning can be done to partition among the nodes, to minimize communication across the slow network interface, possibly at the expense of some computational imbalance. Then, a fast geometric algorithm can be used to partition independently within each node.
1 Partitioning and Dynamic Load Balancing An effective partitioning or dynamic load balancing procedure maximizes efficiency by minimizing processor idle time and interprocessor communication. While some applications can use a static partitioning throughout a computation, others, such as adaptive finite element methods, have dynamic workloads that necessitate dynamic load balancing during the computation. Partitioning and dynamic load balancing can be performed using recursive bisection methods [2,29,31,38], space-filling curve (SFC) partitioning [7,23,24,25,37] and graph partitioning (including spectral [26,29], multilevel [6,18,20,36], and diffusive methods [8,19,22]). Each algorithm has characteristics and requirements that make it appropriate for certain applications; see [4,35] for examples and [33] for an overview of available methods.
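To illustrate the geometric family of methods listed above, the following serial sketch performs recursive coordinate bisection by sorting along one coordinate axis and cutting so that the part counts stay balanced; it is a simplified illustration, not the implementation used in Zoltan or in the cited references.

#include <stdlib.h>

/* Object with coordinates and a partition label to be assigned. */
typedef struct { double coord[3]; int part; } Obj;

static int cut_dim;                 /* axis used by the qsort comparator;
                                     * a global for brevity (not reentrant) */
static int cmp_coord(const void *a, const void *b)
{
    double d = ((const Obj *)a)->coord[cut_dim]
             - ((const Obj *)b)->coord[cut_dim];
    return (d > 0) - (d < 0);
}

/* Recursive coordinate bisection: split objs[0..n) into nparts partitions,
 * labelling them part0 .. part0+nparts-1.  Each step sorts along one axis
 * (cycled with the recursion depth) and cuts at a balanced position. */
static void rcb(Obj *objs, int n, int nparts, int part0, int dim)
{
    if (nparts == 1) {
        for (int i = 0; i < n; ++i) objs[i].part = part0;
        return;
    }
    cut_dim = dim;
    qsort(objs, n, sizeof *objs, cmp_coord);
    int left_parts = nparts / 2;
    int cut = (int)((long long)n * left_parts / nparts);
    rcb(objs, cut, left_parts, part0, (dim + 1) % 3);
    rcb(objs + cut, n - cut, nparts - left_parts,
        part0 + left_parts, (dim + 1) % 3);
}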
[Figure 1 diagram: the application calls Zoltan_Create(), Zoltan_Set_Param(), and Zoltan_LB_Partition(); the Zoltan balancer invokes application callbacks, computes the partition, and returns migration arrays; the application then migrates its data and continues the computation]
Fig. 1. Interaction between Zoltan and applications
The Zoltan Parallel Data Services Toolkit [9,11] provides dynamic load balancing and related capabilities to a wide range of dynamic, unstructured and/or adaptive applications. Using Zoltan, application developers can switch partitioners simply by changing
a run-time parameter, facilitating comparisons of the partitioners’ effect on the applications. Zoltan has a simple interface, and its design is “data-structure neutral.” That is, Zoltan does not require applications to construct or use specific data structures. It operates on generic “objects” that are specified by calls to application-provided callback functions. These callbacks are simple functions that return to Zoltan information such as the lists of objects to be partitioned, coordinates of objects, and topological connectivity of objects. Figure 1 illustrates the interaction between Zoltan and an application. We focus here on the hierarchical balancing procedures we have implemented within Zoltan, where different procedures are used in different parts of the computing environment.
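Assuming Zoltan 3.x, the calling sequence of Figure 1 might look roughly as follows in C. The mesh structure and the callback bodies are placeholders; only the two most basic query functions are registered, and geometric or graph partitioners (RCB, RIB, ParMetis, ...) would additionally require geometry or graph callbacks, which are omitted here.

#include <mpi.h>
#include "zoltan.h"

/* Placeholder application data: a list of locally owned objects. */
typedef struct { int num_local; unsigned int *gid; } Mesh;

static int get_num_obj(void *data, int *ierr)
{
    *ierr = ZOLTAN_OK;
    return ((Mesh *)data)->num_local;
}

static void get_obj_list(void *data, int num_gid, int num_lid,
                         ZOLTAN_ID_PTR gids, ZOLTAN_ID_PTR lids,
                         int wgt_dim, float *wgts, int *ierr)
{
    Mesh *m = data;
    (void)wgt_dim; (void)wgts;               /* no object weights here     */
    for (int i = 0; i < m->num_local; ++i) {
        gids[i * num_gid] = m->gid[i];       /* assume one entry per ID    */
        lids[i * num_lid] = (unsigned int)i;
    }
    *ierr = ZOLTAN_OK;
}

int main(int argc, char **argv)
{
    Mesh mesh = { 0, NULL };                 /* filled by the application  */
    float ver;
    MPI_Init(&argc, &argv);
    Zoltan_Initialize(argc, argv, &ver);

    struct Zoltan_Struct *zz = Zoltan_Create(MPI_COMM_WORLD);
    /* "BLOCK" is a simple method that needs only the object list; other
     * methods are selected by changing this run-time parameter.           */
    Zoltan_Set_Param(zz, "LB_METHOD", "BLOCK");
    Zoltan_Set_Num_Obj_Fn(zz, get_num_obj, &mesh);
    Zoltan_Set_Obj_List_Fn(zz, get_obj_list, &mesh);

    int changes, ngid, nlid, nimp, nexp;
    ZOLTAN_ID_PTR imp_gid, imp_lid, exp_gid, exp_lid;
    int *imp_proc, *imp_part, *exp_proc, *exp_part;
    Zoltan_LB_Partition(zz, &changes, &ngid, &nlid,
                        &nimp, &imp_gid, &imp_lid, &imp_proc, &imp_part,
                        &nexp, &exp_gid, &exp_lid, &exp_proc, &exp_part);

    /* ...migrate the listed objects, then continue the computation...     */

    Zoltan_LB_Free_Part(&imp_gid, &imp_lid, &imp_proc, &imp_part);
    Zoltan_LB_Free_Part(&exp_gid, &exp_lid, &exp_proc, &exp_part);
    Zoltan_Destroy(&zz);
    MPI_Finalize();
    return 0;
}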
2 Hierarchical Balancing

Consider four different 16-way partitionings of a 1,103,018-element mesh used in a simulation of blood flow in a human aorta [30]. Here, the geometric method recursive inertial bisection (RIB) [28] and/or the multilevel graph partitioner in ParMetis [20] are used for partitioning. Only two partition quality factors are considered: computational balance, and two surface index measures, although other factors [5] should be considered. The global surface index (GSI) measures the overall percentage of mesh faces on inter-partition boundaries, and the maximum local surface index (MLSI) measures the maximum percentage of any one partition's faces that are on a partition boundary. The surface indices are essentially normalized edge cuts when considering mesh connectivity in terms of a graph. Partitioning using RIB achieves an excellent computational balance, with partitions differing by no more than one element, and medium-quality surface indices with
Fig. 2. Examples of parallel computing environments with 16 processors: (a) a 16-way SMP workstation; (b) a 16-node computer with all uniprocessor nodes, connected by a network; (c) two 8-way SMP workstations connected by a network; and (d) four 4-way SMP workstations connected by a network
MLSI = 1.77 and GSI = 0.61. This would be useful when the primary concern is a good computational balance, as in a shared-memory environment (Figure 2a). Using only ParMetis achieves excellent surface index values, MLSI = 0.40 and GSI = 0.20, but at the expense of a large computational imbalance, where partition sizes range from 49,389 to 89,302 regions. For a computation running on a network of workstations (NOW) (Figure 2b), it may be worth accepting the significant load imbalance to achieve the smaller communication volume. For hierarchical systems, a hybrid partitioning may be desirable. Consider the two SMP configurations connected by a slow network as shown in Figure 2c,d. In the two 8-way node configuration (Figure 2c), ParMetis is used to divide the computation between the two SMP nodes, resulting in partitions of 532,063 and 570,955 regions, with MLSI = 0.06 and GSI = 0.03. Within each SMP, the mesh is partitioned eight ways using RIB, producing partitions within each SMP balanced to within one element, and with overall MLSI = 1.88 and GSI = 0.56. Since communication across the slow network is minimized, this is an appropriate partitioning for this environment. A similar strategy for the four 4-way SMP configuration (Figure 2d) results in a ParMetis partitioning across the four SMP nodes with 265,897, 272,976, 291,207 and 272,938 elements and MLSI = 0.23 and GSI = 0.07. Within each SMP, partitions are again balanced to within one element, with overall MLSI = 1.32 and GSI = 0.32. We first presented an example of this type of hybrid partition in [32], although the partitions presented therein were not generated by a fully automatic procedure. Zoltan's hierarchical balancing automates the creation of such partitions. It can be used directly by an application or be guided by the tree representation of the computational environment created and maintained by the Dynamic Resource Utilization Model (DRUM) [10,13,34]. DRUM is a software system that supports automatic resource-aware partitioning and dynamic load balancing for heterogeneous, non-dedicated, and hierarchical computing environments. DRUM dynamically models the computing environment using a tree structure that encapsulates the capabilities and performance of communication and processing resources. The tree is populated with performance data obtained from a priori benchmarks and dynamic monitoring agents that run concurrently with the application. It is then used to guide partition-weighted and hierarchical partitioning and dynamic load balancing. Partition-weighted balancing is discussed further in [13]. DRUM's graphical configuration program (Figure 3) may be used to facilitate the specification of hierarchical balancing parameters at each network and multiprocessing node. The hierarchical balancing implementation utilizes a lightweight "intermediate hierarchical balancing structure" (IHBS) and a set of callback functions. This permits an automated and efficient hierarchical balancing which can use any of the procedures available within Zoltan without modification and in any combination. Hierarchical balancing is invoked by an application in the same way as other Zoltan procedures. A hierarchical balancing step begins by building an IHBS using these callbacks. The IHBS is an augmented version of the distributed graph structure that Zoltan builds to make use of the ParMetis [21] and Jostle [36] libraries.
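Referring back to the surface indices used above to assess these partitions, they can be computed directly from the face connectivity of the mesh. The following serial sketch uses one plausible reading of the definitions (an inter-partition face is counted for both partitions it touches); the face list and the per-partition face counts are assumed inputs, not the authors' data structures.

#include <stdlib.h>

/* Compute global (GSI) and maximum local (MLSI) surface indices.
 * face_elem[2*f], face_elem[2*f+1] are the two elements sharing face f
 * (the second is -1 for a domain-boundary face); elem_part[e] is the
 * partition of element e; faces_per_part[p] is the total number of faces
 * of the elements assigned to partition p. */
void surface_indices(int num_faces, const int *face_elem,
                     const int *elem_part, int num_parts,
                     const long *faces_per_part,
                     double *gsi, double *mlsi)
{
    long *cut = calloc(num_parts, sizeof *cut);
    long total_faces = 0, total_cut = 0;

    for (int p = 0; p < num_parts; ++p) total_faces += faces_per_part[p];

    for (int f = 0; f < num_faces; ++f) {
        int e0 = face_elem[2 * f], e1 = face_elem[2 * f + 1];
        if (e1 < 0) continue;                       /* domain boundary      */
        int p0 = elem_part[e0], p1 = elem_part[e1];
        if (p0 != p1) {                             /* inter-partition face */
            cut[p0]++; cut[p1]++;                   /* count for both sides */
            total_cut += 2;
        }
    }
    *gsi = 100.0 * (double)total_cut / (double)total_faces;
    *mlsi = 0.0;
    for (int p = 0; p < num_parts; ++p) {
        double local = 100.0 * (double)cut[p] / (double)faces_per_part[p];
        if (local > *mlsi) *mlsi = local;
    }
    free(cut);
}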
The hierarchical balancing procedure then provides its own callback functions to allow existing Zoltan procedures to be used to query and update the IHBS at each level of a hierarchical balancing. After all levels
Fig. 3. DRUM’s graphical configuration program being used to edit a description of the “Bullpen Cluster” at Williams College. This program may be used to specify hierarchical balancing parameters for Zoltan
of the hierarchical balancing have been completed, Zoltan’s usual migration arrays are constructed and returned to the application. Thus, only lightweight objects are migrated internally between levels, not the (larger and more costly) application data. Figure 4 shows the interaction between Zoltan and an application when hierarchical balancing is used.
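Conceptually, such a two-level hierarchical step amounts to a coarse partitioning across the SMP nodes followed by independent partitionings inside each node over a split communicator. The sketch below only illustrates this idea; it is not Zoltan's hierarchical balancing implementation, and the two partitioner calls are empty placeholders standing in for, e.g., ParMetis and RIB.

#include <mpi.h>

/* Placeholders standing in for the real partitioners. */
static void graph_partition_across_nodes(MPI_Comm comm)   { (void)comm; }
static void geometric_partition_within_node(MPI_Comm comm) { (void)comm; }

/* my_node: index of the SMP node this MPI process runs on. */
void two_level_partition(int my_node)
{
    /* Level 1: one part per SMP node (e.g. multilevel graph partitioning),
     * so that little communication crosses the slow inter-node network.   */
    graph_partition_across_nodes(MPI_COMM_WORLD);
    /* ...lightweight objects move so each node owns its coarse part...    */

    /* Level 2: balance independently inside each node (e.g. with RIB)
     * using a communicator restricted to the processes of that node.      */
    MPI_Comm node_comm;
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_split(MPI_COMM_WORLD, my_node, world_rank, &node_comm);
    geometric_partition_within_node(node_comm);
    MPI_Comm_free(&node_comm);
}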
3 Examples

We have tested our procedures using a software package called LOCO [16], which implements a parallel adaptive discontinuous Galerkin [3] solution of the compressible Euler equations. We consider the "perforated shock tube" problem, which models the three-dimensional unsteady compressible flow in a cylinder containing a cylindrical vent [14]. This problem was motivated by flow studies in perforated muzzle brakes for large calibre guns [12]. The initial mesh contains 69,572 tetrahedral elements. For these experiments, we stop the computation after 4 adaptive steps, when the mesh contains 254,510 elements. Preliminary results show that hierarchical partitioning can be competitive with the best results from a graph partitioner alone. We use two eight-processor computing environments: one with four Sun Enterprise 220R servers, each with two 450MHz Sparc UltraII processors, the other with two Sun Enterprise 420R servers, each with four
Fig. 4. Interaction between Zoltan and applications when hierarchical balancing is used

[Figure 5 diagram: eight processes compute one 2-way ParMetis partitioning across the network; each SMP then independently computes a 4-way RIB partitioning within its node]
Fig. 5. Hierarchical balancing algorithm selection for two 4-way SMP nodes connected by a network
450MHz Sparc UltraII processors. The nodes are dedicated to computation and do not perform file service. In both cases, inter-node communication is across fast (100 Mbit) Ethernet. A comparison of running times for the perforated shock tube in these computing environments for all combinations of traditional and hierarchical procedures shows that while ParMetis multilevel graph partitioning alone often achieves the fastest computation times, there is some benefit to using hierarchical load balancing where ParMetis is used for inter-node partitioning and inertial recursive bisection is used within each node. For example, in the four-node environment (Figure 5), the computation time following the fourth adaptive step is 571.7 seconds for the hierarchical procedure with ParMetis and RIB, compared with 574.9 seconds for ParMetis alone, 702.7 seconds for Hilbert SFC partitioning alone, 1508.2 seconds for recursive coordinate bisection alone, and 822.9 seconds for RIB alone. It is higher for other hierarchical combinations of methods.
4 Discussion We have demonstrated the ability to use the hierarchical balancing implemented within Zoltan as both an initial partitioning and a dynamic load balancing procedure for a realistic adaptive computation. While a traditional multilevel graph partitioning is often the most effective for this application and this computing environment, the results to
date demonstrate the potential benefits of hierarchical procedures. In cases where the top-level multilevel graph partitioner achieves a decomposition without introducing load imbalance, it provides the fastest time-to-solution in our studies to this point (though only slightly faster than when the multilevel graph partitioner is used alone). When imbalance is introduced by the multilevel graph partitioners, using hierarchical balancing to achieve strict balance within an SMP is beneficial. Studies are underway that utilize hierarchical balancing on larger clusters, on other architectures, and with a wider variety of applications. We expect that hierarchical balancing will be most beneficial when the extreme hierarchies found in grid environments are considered. Significant benefits will depend on an MPI implementation that can provide very efficient intra-node communication. This is not the case in the MPICH [17] implementation used on the cluster in our experiments. Here, all communication needs to go through a network layer (MPICH's p4 device), even if two processes are executing on the same SMP node. We are eager to run experiments on clusters built from other types of SMP nodes and on computational grids, and specifically those with MPI implementations that do provide appropriate intra-node communication optimizations. Enhancements to the hierarchical balancing procedures will focus on usability and efficiency. Further enhancements to DRUM's machine model and graphical configuration tool will facilitate and automate the selection of hierarchical procedures. Efficiency may be improved by avoiding unnecessary updates to the intermediate structure, particularly at the lowest level partitioning step. Maintaining the intermediate structure across subsequent rebalancing steps would reduce startup costs, but is complex for adaptive problems. It would also be beneficial to avoid building redundant structures, such as when ParMetis is used at the highest level of a hierarchical balancing; however, this would require some modification of individual Zoltan methods, which we have been careful to avoid thus far. The IHBS itself has potential benefits outside of hierarchical balancing. It could be used to allow incremental enhancements and post-processing "smoothing" [15] on a decomposition before Zoltan returns its migration arrays to the application. The IHBS could also be used to compute multiple "candidate" decompositions with various algorithms and parameters, allowing Zoltan or the application to compute statistics about each and only accept and use the one deemed best. Partitioning is only one factor that may be considered for an effective resource-aware computation. Ordering of computation and of communication, data replication to avoid communication across slow interfaces, and use of multithreading are other resource-aware enhancements that may be used. DRUM's machine model currently includes some information that may be useful for these other types of optimizations, and it will be augmented to include information to support others.
Acknowledgments The authors were supported in part by Sandia contract PO15162 and the Computer Science Research Institute at Sandia National Laboratories. Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy under Contract DE-AC04-94AL85000.
The authors would like to thank Erik Boman, Karen Devine, and Bruce Hendrickson (Sandia National Laboratories, Albuquerque, USA) for their valuable input on the design of Zoltan’s hierarchical balancing procedures. Students Jin Chang (Rensselaer), Laura Effinger-Dean (Williams College), Luis Gervasio (Rensselaer), and Arjun Sharma (Williams College) have contributed to the development of DRUM. We would like to thank the reviewers of this article for their helpful comments and suggestions.
References

1. S. B. Baden and S. J. Fink. A programming methodology for dual-tier multicomputers. IEEE Transactions on Software Engineering, 26(3):212–216, 2000.
2. M. J. Berger and S. H. Bokhari. A partitioning strategy for nonuniform problems on multiprocessors. IEEE Trans. Computers, 36:570–580, 1987.
3. R. Biswas, K. D. Devine, and J. E. Flaherty. Parallel, adaptive finite element methods for conservation laws. Appl. Numer. Math., 14:255–283, 1994.
4. E. Boman, K. Devine, R. Heaphy, B. Hendrickson, M. Heroux, and R. Preis. LDRD report: Parallel repartitioning for optimal solver performance. Technical Report SAND2004–0365, Sandia National Laboratories, Albuquerque, NM, February 2004.
5. C. L. Bottasso, J. E. Flaherty, C. Özturan, M. S. Shephard, B. K. Szymanski, J. D. Teresco, and L. H. Ziantz. The quality of partitions produced by an iterative load balancer. In B. K. Szymanski and B. Sinharoy, editors, Proc. Third Workshop on Languages, Compilers, and Runtime Systems, pages 265–277, Troy, 1996.
6. T. Bui and C. Jones. A heuristic for reducing fill in sparse matrix factorization. In Proc. 6th SIAM Conf. Parallel Processing for Scientific Computing, pages 445–452. SIAM, 1993.
7. P. M. Campbell, K. D. Devine, J. E. Flaherty, L. G. Gervasio, and J. D. Teresco. Dynamic octree load balancing using space-filling curves. Technical Report CS-03-01, Williams College Department of Computer Science, 2003.
8. G. Cybenko. Dynamic load balancing for distributed memory multiprocessors. J. Parallel Distrib. Comput., 7:279–301, 1989.
9. K. Devine, E. Boman, R. Heaphy, B. Hendrickson, and C. Vaughan. Zoltan data management services for parallel dynamic applications. Computing in Science and Engineering, 4(2):90–97, 2002.
10. K. D. Devine, E. G. Boman, R. T. Heaphy, B. A. Hendrickson, J. D. Teresco, J. Faik, J. E. Flaherty, and L. G. Gervasio. New challenges in dynamic load balancing. Appl. Numer. Math., 52(2–3):133–152, 2005.
11. K. D. Devine, B. A. Hendrickson, E. Boman, M. St. John, and C. Vaughan. Zoltan: A Dynamic Load Balancing Library for Parallel Applications; User's Guide. Sandia National Laboratories, Albuquerque, NM, 1999. Tech. Report SAND99-1377. Open-source software distributed at http://www.cs.sandia.gov/Zoltan.
12. R. E. Dillon Jr. A parametric study of perforated muzzle brakes. ARDC Tech. Report ARLCB-TR-84015, Benét Weapons Laboratory, Watervliet, 1984.
13. J. Faik, J. D. Teresco, K. D. Devine, J. E. Flaherty, and L. G. Gervasio. A model for resource-aware load balancing on heterogeneous clusters. Technical Report CS-05-01, Williams College Department of Computer Science, 2005. Submitted to Transactions on Parallel and Distributed Systems.
14. J. E. Flaherty, R. M. Loy, M. S. Shephard, M. L. Simone, B. K. Szymanski, J. D. Teresco, and L. H. Ziantz. Distributed octree data structures and local refinement method for the parallel solution of three-dimensional conservation laws. In M. Bern, J. Flaherty, and M. Luskin, editors, Grid Generation and Adaptive Algorithms, volume 113 of The IMA Volumes in Mathematics and its Applications, pages 113–134, Minneapolis, 1999. Institute for Mathematics and its Applications, Springer.
15. J. E. Flaherty, R. M. Loy, M. S. Shephard, B. K. Szymanski, J. D. Teresco, and L. H. Ziantz. Adaptive local refinement with octree load-balancing for the parallel solution of three-dimensional conservation laws. J. Parallel Distrib. Comput., 47:139–152, 1997.
16. J. E. Flaherty, R. M. Loy, M. S. Shephard, and J. D. Teresco. Software for the parallel adaptive solution of conservation laws by discontinuous Galerkin methods. In B. Cockburn, G. Karniadakis, and S.-W. Shu, editors, Discontinuous Galerkin Methods: Theory, Computation and Applications, volume 11 of Lecture Notes in Computational Science and Engineering, pages 113–124, Berlin, 2000. Springer.
17. W. Gropp, E. Lusk, N. Doss, and A. Skjellum. A high-performance, portable implementation of the MPI message passing interface standard. Parallel Computing, 22(6):789–828, Sept. 1996.
18. B. Hendrickson and R. Leland. A multilevel algorithm for partitioning graphs. In Proc. Supercomputing '95, 1995.
19. Y. F. Hu and R. J. Blake. An optimal dynamic load balancing algorithm. Preprint DL-P-95-011, Daresbury Laboratory, Warrington, WA4 4AD, UK, 1995.
20. G. Karypis and V. Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J. Scien. Comput., 20(1), 1999.
21. G. Karypis and V. Kumar. Parallel multilevel k-way partitioning scheme for irregular graphs. SIAM Review, 41(2):278–300, 1999.
22. E. Leiss and H. Reddy. Distributed load balancing: design and performance analysis. W. M. Kuck Research Computation Laboratory, 5:205–270, 1989.
23. W. F. Mitchell. Refinement tree based partitioning for adaptive grids. In Proc. Seventh SIAM Conf. on Parallel Processing for Scientific Computing, pages 587–592. SIAM, 1995.
24. A. Patra and J. T. Oden. Problem decomposition for adaptive hp finite element methods. Comp. Sys. Engng., 6(2):97–109, 1995.
25. J. R. Pilkington and S. B. Baden. Dynamic partitioning of non-uniform structured workloads with spacefilling curves. IEEE Trans. on Parallel and Distributed Systems, 7(3):288–300, 1996.
26. A. Pothen, H. Simon, and K.-P. Liou. Partitioning sparse matrices with eigenvectors of graphs. SIAM J. Mat. Anal. Appl., 11(3):430–452, 1990.
27. R. Rabenseifner and G. Wellein. Comparison of parallel programming models on clusters of SMP nodes. In H. Bock, E. Kostina, H. Phu, and R. Rannacher, editors, Proc. Intl. Conf. on High Performance Scientific Computing, pages 409–426, Hanoi, 2004. Springer.
28. M. S. Shephard, J. E. Flaherty, H. L. de Cougny, C. Özturan, C. L. Bottasso, and M. W. Beall. Parallel automated adaptive procedures for unstructured meshes. In Parallel Comput. in CFD, number R-807, pages 6.1–6.49. Agard, Neuilly-Sur-Seine, 1995.
29. H. D. Simon. Partitioning of unstructured problems for parallel processing. Comp. Sys. Engng., 2:135–148, 1991.
30. C. A. Taylor, T. J. R. Hughes, and C. K. Zarins. Finite element modeling of blood flow in arteries. Comput. Methods Appl. Mech. Engrg., 158(1–2):155–196, 1998.
31. V. E. Taylor and B. Nour-Omid. A study of the factorization fill-in for a parallel implementation of the finite element method. Int. J. Numer. Meth. Engng., 37:3809–3823, 1994.
32. J. D. Teresco, M. W. Beall, J. E. Flaherty, and M. S. Shephard. A hierarchical partition model for adaptive finite element computation. Comput. Methods Appl. Mech. Engrg., 184:269–285, 2000.
33. J. D. Teresco, K. D. Devine, and J. E. Flaherty. Numerical Solution of Partial Differential Equations on Parallel Computers, chapter Partitioning and Dynamic Load Balancing for the Numerical Solution of Partial Differential Equations. Springer-Verlag, 2005. 34. J. D. Teresco, J. Faik, and J. E. Flaherty. Resource-aware scientific computation on a heterogeneous cluster. Computing in Science & Engineering, 7(2):40–50, 2005. 35. J. D. Teresco and L. P. Ungar. A comparison of Zoltan dynamic load balancers for adaptive computation. Technical Report CS-03-02, Williams College Department of Computer Science, 2003. Presented at COMPLAS ’03. 36. C. Walshaw and M. Cross. Parallel Optimisation Algorithms for Multilevel Mesh Partitioning. Parallel Comput., 26(12):1635–1660, 2000. 37. M. S. Warren and J. K. Salmon. A parallel hashed oct-tree n-body algorithm. In Proc. Supercomputing ’93, pages 12–21. IEEE Computer Society, 1993. 38. R. Williams. Performance of dynamic load balancing algorithms for unstructured mesh calculations. Concurrency, 3:457–481, October 1991.
Cache Optimizations for Iterative Numerical Codes Aware of Hardware Prefetching

Josef Weidendorfer and Carsten Trinitis

Technische Universität München, Germany
{weidendo,trinitic}@cs.tum.edu
Abstract. Cache optimizations use code transformations to increase the locality of memory accesses and use prefetching techniques to hide latency. For best performance, hardware prefetching units of processors should be complemented with software prefetch instructions. A cache simulation enhanced with a hardware prefetcher is presented and used to run code for a 3D multigrid solver. Thus, cache misses not predicted by the hardware prefetcher can be handled via insertion of prefetch instructions. Additionally, Interleaved Block Prefetching (IBPF) is presented. Measurements show its potential.
1 Introduction

Cache optimizations typically include code transformations to increase the locality of memory accesses: standard strategies are loop blocking and relayouting of data structures, e.g. splitting or joining of arrays and padding [12]. An orthogonal approach is to enable latency hiding by introducing prefetching techniques, i.e., by ensuring that any data is loaded early enough before it is actually used. Software prefetching enables this by inserting cache load instructions into the program code. However, the use of such instructions consumes both decoding bandwidth and hardware resources for the handling of outstanding loads. Because of this and the added complexity of manually inserting prefetch instructions, modern processors like the Intel Pentium 4, Pentium-M [10] or AMD Athlon are equipped with hardware prefetch units which predict future memory accesses in order to load data into cache in advance. Both prefetch approaches can be combined. Usually, there will be parts of code where hardware prefetching is doing a fine job, and parts where manual insertion of prefetch instructions is needed. The two cases can be distinguished by doing measurements with hardware performance counters. To this end, we use the Intel Pentium-M processor hardware, which has performance counters measuring the effect of its hardware prefetch unit [10]. This approach has some drawbacks: the actual prefetch algorithm implemented inside the processor is unknown, and thus, manual insertions of prefetch instructions will be specific to one processor. Therefore, additional measurements are done by simulating a known hardware prefetch algorithm. By choosing a simple stream detection, we are sure that program code where this prefetch algorithm is working fine is also running fine with any existing hardware prefetcher. Additionally, we can compare results with and without hardware prefetching. There is, however, a case where prefetching alone cannot help: when performance of a code section is limited by available bandwidth to main memory, prefetching can
not do any better. The first step to improve this situation is to use loop blocking [12]: instead of going multiple times over a large block of data not fitting into the cache, we split up this block into smaller ones fitting into the cache, and go multiple times over each small block before handling the next one. Thus, only the first run over a small block has to fetch data from main memory. Still, this first run exhibits the bandwidth limitation. But as further runs on the small block do not use main memory bandwidth at all, this time can be used to prefetch the next block. Of course, the block size needs to be corrected in a way that two blocks fit into the cache. We call this technique Interleaved Block Prefetching. For this study, a 3D multigrid solver is used as application. The code is well known, and various standard cache optimizations were applied [11]. The main motivation for this work is the fact that standard cache optimizations do not work as well as expected on newer processors [17]. On the one hand, hardware prefetchers seem to help in the non-optimized case; on the other hand, they sometimes seem to work against manual cache optimizations: e.g. blocking with small loop intervals, especially in 3D, leads to a lot of unneeded activity from a hardware prefetcher with stream detection. In the next section, some related work is presented. Afterwards, we give a survey of the cache simulator used and the hardware prefetch extension we added. We show measurements for our multigrid solver, using statistical sampling with hardware performance counters, and the simulator without and with hardware prefetching. Finally, for a simple example we introduce interleaved block prefetching and give some real measurements. Since adding interleaved block prefetching to the multigrid solver is currently work in progress, results will be given in a subsequent publication.
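The blocking transformation described above can be written down in a few lines of C; the array size, the block size and the per-element update below are placeholders chosen for illustration.

#define N_TOTAL (1 << 20)           /* array far larger than the L2 cache  */
#define BLOCK   (1 << 14)           /* block chosen to fit into the cache  */

/* Unblocked: each of the two sweeps streams the whole array from memory. */
void sweep_twice(double *a)
{
    for (int sweep = 0; sweep < 2; ++sweep)
        for (int i = 0; i < N_TOTAL; ++i)
            a[i] = 0.5 * (a[i] + 1.0);          /* placeholder element work */
}

/* Blocked: the second sweep over each block finds its data in the cache.  */
void sweep_twice_blocked(double *a)
{
    for (int base = 0; base < N_TOTAL; base += BLOCK)
        for (int sweep = 0; sweep < 2; ++sweep)
            for (int i = base; i < base + BLOCK; ++i)
                a[i] = 0.5 * (a[i] + 1.0);
}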
2 Related Work

Regarding hardware prefetchers, [3] gives a good overview of well known algorithms. Advanced algorithms are presented in [1]. Still, in hardware manuals of processors like the Pentium-4 [10], there is only a vague description of the prefetch algorithm used. Simulation is often used to get more details on the runtime behavior of applications or for research and development of processors [16],[9]. For memory access behavior and cache optimization, it is usually enough to simulate the cache hierarchy only, as in MemSpy [14] or SIGMA [7]. The advantage of simulation is the possibility to get more useful metrics than plain hit/miss counts, like reuse distances [4] or temporal/spatial locality [2]. For real measurements, we choose statistical sampling under Linux with OProfile [13]. With instrumentation, we would get exact values, e.g. using Adaptor [5], a Fortran source instrumentation tool, or DynaProf [8]. The latter uses DynInst [6] to instrument the running executable with low overhead. All these tools allow the use of the hardware performance counters available on a processor.
3 Simulation of a Simple Hardware Prefetcher Our cache simulator is derived from the cache simulator Cachegrind, part of the Valgrind runtime instrumentation framework [15], and thus, it can run unmodified executables.
We describe the simulator with its on-the-fly call-graph generation together with our visualization tool KCachegrind in more detail in [18]. With a conversion script, the latter is also able to visualize the sampling data from OProfile. The implemented hardware prefetch algorithm gets active on L1 misses: for every 4KB memory page, it checks for sequential loads of cache lines, either upwards or downwards. When three consecutive lines are loaded, it assumes a streaming behavior and triggers prefetching of the cache line which is 5 lines in advance. As the simulator has no way to do timings, we simply assume this line to be loaded immediately. Although this design is arguable, it fulfills our requirement that every sensible hardware prefetcher should at least detect the same streaming as our prefetcher. The simulation will show code regions where simple hardware prefetchers probably have problems, i.e. where prefetching techniques are worthwhile. The following results have been measured for a cache-optimized Fortran implementation of multigrid V(2,2) cycles, involving variable 7-point stencils on a regular 3D grid with 129³ nodes. Compared to the standard version, the simulated as well as the real measurements show little more than half the number of L2 misses / L2 lines read in. As only 2 smoothing iterations can be blocked with the given multigrid cycle, this shows that the blocking works quite well. In search of further improvement using prefetching, Table 1 shows the number of simulated L2 misses, in the first column with hardware prefetching switched off, in the second column with hardware prefetching switched on (i.e. only misses that are not caught by the prefetcher), and in the third column the number of prefetch actions issued. For real measurements, the Pentium-M can count L2 lines read in because of L2 misses alone, as shown in column 4; lines read in only because of a request from the hardware prefetcher are shown in column 5. The last column shows the number of prefetch requests issued by the hardware. All numbers are given in millions of events (mega events). The first line gives a summary for the whole run; the next 2 lines are split up by top functions: RB4W is the red-black Gauss-Seidel smoother, doing the main work, and RESTR is the restriction step in the multigrid cycle. Simulation results show that the function RESTR is hardware-prefetcher friendly, as L2 misses go down by a large amount when the prefetcher is switched on in the simulation. In contrast, for the first function, RB4W, our simulated prefetcher does not seem to work. Looking at actual hardware, the prefetcher in the Pentium-M works better than the simulated one: it reduces the number of L2 misses (col. 4) even more than ours (col. 2); especially, it is doing a better job in RB4W. This shows that our simulation can show the same trend as will be seen in real hardware regarding friendliness to stream prefetching.
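The simulated prefetcher described above (per-4 KB-page stream detection, a trigger after three consecutive line loads, and a prefetch distance of five lines) can be captured in a few lines of C. The small direct-mapped tracking table and the stubbed cache hook are simplifications of this sketch, not the actual simulator code.

#include <stdint.h>

#define LINE_BITS   6                 /* 64-byte cache lines               */
#define PAGE_BITS   12                /* 4 KB pages                        */
#define PAGE_SLOTS  1024              /* small direct-mapped tracking table */
#define PF_DIST     5                 /* prefetch 5 lines ahead            */

typedef struct {
    uint64_t page;                    /* page this slot currently tracks   */
    uint64_t last_line;               /* last missed line in that page     */
    int      dir;                     /* +1 up, -1 down, 0 unknown         */
    int      run;                     /* consecutive sequential misses     */
} StreamSlot;

static StreamSlot slots[PAGE_SLOTS];

/* Stub standing in for the cache model: in the real simulator this would
 * insert the line into the simulated L2 (assumed to be loaded for free). */
static void l2_prefetch_line(uint64_t line_addr) { (void)line_addr; }

/* Called by the cache simulator on every L1 miss address.                */
void hw_prefetch_on_l1_miss(uint64_t addr)
{
    uint64_t line = addr >> LINE_BITS;
    uint64_t page = addr >> PAGE_BITS;
    StreamSlot *s = &slots[page % PAGE_SLOTS];

    if (s->page != page) {            /* new page: restart detection       */
        s->page = page; s->last_line = line; s->dir = 0; s->run = 1;
        return;
    }
    int step = (int)(int64_t)(line - s->last_line);
    if ((step == 1 || step == -1) && (s->dir == 0 || s->dir == step)) {
        s->dir = step;
        s->run++;
        if (s->run >= 3)              /* stream detected: prefetch ahead   */
            l2_prefetch_line((uint64_t)((int64_t)line + PF_DIST * step)
                             << LINE_BITS);
    } else {
        s->dir = 0; s->run = 1;       /* pattern broken                    */
    }
    s->last_line = line;
}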
4 Interleaved Block Prefetching

Pure block prefetching means prefetching multiple sequential cache lines before actually using them. In contrast, interleaved block prefetching (IBPF) prefetches a block to be used in the future interleaved with the computation on the previous block. One has to be careful not to introduce conflicts between the current and the next block. Interleaved block prefetching is best used with the standard loop blocking cache optimization. When a program has to work multiple times on a huge array not fitting
Table 1. Measurements for the multigrid code [MEv]

                 Simulated                        Real
         L2 Misses            Pref.       L2 Lines In                 Pref.
         Pf. Off   Pf. On     Requests    due to Misses  due to Pref. Requests
Summary  361       277        110         241            130          373
RB4W     233       226        -           201            47           270
RESTR    108       37         -           28             76           92
[Figure 1 diagram: address over time for the original and the blocked sweeps; annotations mark the memory bandwidth (MB), the L2 bandwidth (L2B), the L2 cache size, the times t_orig and t_blocking, and which accesses are L2 misses or L2 hits]
Fig. 1. The blocking transformation
into cache, data has to be loaded every time from main memory. When the data dependencies of the algorithm allow reordering the work on array elements such that the multiple passes can be done consecutively on smaller parts of the array, blocking is possible. This improves the temporal locality of memory accesses, i.e. accesses to the same data happen near to each other. By using block sizes that fit into cache (usually L2), this means that only for the first run over a small block, data has to be fetched from main memory. Fig. 1 depicts this scenario on a 1-dimensional array. The time t_blocking is substantially less than t_orig: with blocking, the second iteration of each block can be done with the L2 bandwidth L2B instead of the memory bandwidth MB, and thus gets a speedup of L2B/MB, i.e.

t_blocking = t_orig * (1/2 + MB / (2 * L2B)).

As further runs on the small block do not use main memory bandwidth at all, this time can be used to prefetch the next block. This requires the block size to be chosen such that two blocks fit into the cache. For an inner block, we assume it to be already in the cache when work on it begins. When there are N runs over this block, we can interleave prefetching of the next block with computation over these N iterations. For the multigrid code, N = 2, i.e. the resulting pressure on memory bandwidth is almost cut in half. Fig. 2 shows this scenario. If we suppose that the needed bandwidth for prefetching is not near the memory bandwidth limit, almost all the time we can work with the L2 bandwidth
L2B, depending on the relation of the first block size to the whole array size as shown in the figure. Note that for the first block, memory pressure will be even higher than before because in addition to the first block, half of the second block is prefetched simultaneously.
Fig. 2. The interleaved block prefetching transformation
In the following, a 1-dimensional streaming micro benchmark is used: it runs multiple times over an 8 MB-sized array of double-precision floating point values, doing one load and 1 or 2 FLOPS per element (running sums). Table 2 gives results with 1-dimensional blocking for different values of N. With pure blocking, N = 1 makes no sense, as data is not reused. Still, block prefetching, which is the same as normal prefetching in this case, obviously helps. Again, these results were measured on a 1.4 GHz Pentium-M with DDR-266 memory, i.e. a maximum memory bandwidth of 2.1 GB/s. The benchmark was compiled to use x87 instructions only (no MMX or SSE). For each benchmark and N, we show the runtime in seconds, and the achieved (net) bandwidth from main memory. The net bandwidth is calculated from the runtime and the known amount of data loaded. A gross bandwidth, which gives the bandwidth needed from memory to L2, can be calculated from memory read burst transactions, measured by a performance counter: the benchmark only loads data, and cache line loads are translated to burst requests. The gross bandwidth is always between 50 and 100 MB/s higher than the net bandwidth in this benchmark, and is therefore not shown in the table. The measurements show that runtimes without IBPF can be up to 46 percent higher than runtimes with IBPF (N = 2, 1 Flop/Load). To see the effect on the number of L2 lines read in, on the one hand because of misses, and on the other hand because of hardware prefetch requests, we show these numbers for the first benchmark. Obviously, the software prefetching virtually switches off the hardware prefetching. One has to note that interleaving prefetching with regular computation can increase instruction overhead: in the micro benchmark, an additional loop nesting level had to be introduced for inserting one prefetch instruction each time N ∗ 8 floating point elements are summed up: with a cache line size of 64 bytes, 8 double-precision values fit into
Table 2. Results of a micro benchmark

                               Without Pref.                With IBPF
N                              1     2     3     10         1     2     3     10
1 Flop/Load
  Runtime [s]                  5.52  3.92  3.45  2.73       4.89  2.67  2.50  2.44
  MFlop/s                      194   274   311   393        220   402   430   440
  Net [GB/s]                   1.48  1.04  0.79  0.30       1.68  1.53  1.09  0.34
  Lines In b/o Misses [MEv.]   4.2   2.1   1.4   0.4        134   67    45    13
  Lines In b/o Pref. [MEv.]    130   65    44    13         0.45  0.54  0.27  0.15
2 Flops/Load
  Runtime [s]                  7.20  5.24  4.81  3.23       6.86  4.29  3.60  2.84
  MFlop/s                      298   410   446   665        313   501   597   756
  Net [GB/s]                   1.14  0.78  0.57  0.25       1.19  0.95  0.76  0.29
one cache line, and with blocking of N iterations, the prefetching stream advances at 1/N of the rate of the computation stream. In more complex cases or with other CPUs, the instruction increase can cancel the benefit of interleaved block prefetching. To reduce this effect, future ISAs (instruction set architectures) should introduce block prefetching instructions, minimizing the additional instruction overhead.
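The structure of the micro benchmark with interleaved block prefetching can be sketched as follows. __builtin_prefetch is the GCC/Clang built-in; the block size and the running-sum kernel are illustrative placeholders rather than the exact benchmark code, and the array length n is assumed to be a multiple of the block size.

#include <stddef.h>

#define LINE_DBL    8                  /* doubles per 64-byte cache line    */
#define BLOCK_ELEMS (64 * 1024)        /* block sized to half the L2 cache  */

/* npasses running-sum passes over each block, with the next block being
 * prefetched at 1/npasses of the computation rate: one software prefetch
 * per npasses*LINE_DBL summed elements (the extra loop level below). */
double blocked_sum_ibpf(const double *a, size_t n, int npasses)
{
    double sum = 0.0;
    for (size_t base = 0; base < n; base += BLOCK_ELEMS) {
        const double *cur  = a + base;
        const double *next = cur + BLOCK_ELEMS;        /* block to prefetch */
        size_t pf = 0;
        int last_block = (base + BLOCK_ELEMS >= n);
        size_t chunk = (size_t)npasses * LINE_DBL;

        for (int pass = 0; pass < npasses; ++pass)
            for (size_t i = 0; i < BLOCK_ELEMS; i += chunk) {
                if (!last_block && pf < BLOCK_ELEMS) { /* software prefetch */
                    __builtin_prefetch(next + pf, 0, 0);
                    pf += LINE_DBL;                    /* advance one line  */
                }
                for (size_t k = 0; k < chunk && i + k < BLOCK_ELEMS; ++k)
                    sum += cur[i + k];                 /* 1 FLOP per load   */
            }
    }
    return sum;
}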
5 Conclusion and Future Work

In this paper, we have shown the usefulness of introducing a simple hardware prefetch algorithm into a cache simulator. By comparing a simulation with hardware prefetching switched off with the simulation of the same program run, but with prefetching switched on, one can see the positions in the code where insertion of software prefetching instructions is useful. We extend this result by presenting the block prefetching technique, which is able to lower the pressure on memory bandwidth. Still, for simulation with hardware prefetching to be really worthwhile, visualization possibilities have to be enhanced: even source annotation cannot differentiate between iterations of the same loop body, as only sums are given. But it appears useful to be able to distinguish at least the first few iterations of a loop, as it is expected that prefetch characteristics change there. Thus, more detailed context separation of profile data is needed. The interleaved block prefetching technique has to be integrated into real world applications like our multigrid solver.
References 1. M. Bekerman, S. Jourdan, R. Romen, G. Kirshenboim, L. Rappoport, A. Yoaz, and U. Weiser. Correlated Load-Address Predictors. In Proceedings of the 26th International Symposium on Computer Architecture, pages 54–63, May 1999.
2. E. Berg and E. Hagersten. SIP: Performance Tuning through Source Code Interdependence. In Proceedings of the 8th International Euro-Par Conference (Euro-Par 2002), pages 177–186, Paderborn, Germany, August 2002.
3. Stefan G. Berg. Cache prefetching. Technical Report UW-CSE 02-02-04, University of Washington, February 2002.
4. K. Beyls and E.H. D'Hollander. Platform-Independent Cache Optimization by Pinpointing Low-Locality Reuse. In Proceedings of International Conference on Computational Science, volume 3, pages 463–470, June 2004.
5. T. Brandes. Adaptor. Homepage: http://www.scai.fraunhofer.de/291.0.html.
6. B. Buck and J.K. Hollingsworth. An API for Runtime Code Patching. The International Journal of High Performance Computing Applications, 14:317–329, 2000.
7. L. DeRose, K. Ekanadham, J. K. Hollingsworth, and S. Sbaraglia. SIGMA: A Simulator Infrastructure to Guide Memory Analysis. In Proceedings of SC 2002, Baltimore, MD, November 2002.
8. DynaProf Homepage. http://www.cs.utk.edu/~mucci/dynaprof.
9. H. C. Hsiao and C. T. King. MICA: A Memory and Interconnect Simulation Environment for Cache-based Architectures. In Proceedings of the 33rd IEEE Annual Simulation Symposium (SS 2000), pages 317–325, April 2000.
10. Intel Corporation. IA-32 Intel Architecture: Software Developers Manual.
11. M. Kowarschik, U. Rüde, N. Thürey, and C. Weiß. Performance Optimization of 3D Multigrid on Hierarchical Memory Architectures. In Proc. of the 6th Int. Conf. on Applied Parallel Computing (PARA 2002), volume 2367 of Lecture Notes in Computer Science, pages 307–316, Espoo, Finland, June 2002. Springer.
12. M. Kowarschik and C. Weiß. An Overview of Cache Optimization Techniques and Cache-Aware Numerical Algorithms. In U. Meyer, P. Sanders, and J. Sibeyn, editors, Algorithms for Memory Hierarchies — Advanced Lectures, volume 2625 of Lecture Notes in Computer Science, pages 213–232. Springer, March 2003.
13. J. Levon. OProfile, a system-wide profiler for Linux systems. Homepage: http://oprofile.sourceforge.net.
14. M. Martonosi, A. Gupta, and T. E. Anderson. MemSpy: Analyzing memory system bottlenecks in programs. In Measurement and Modeling of Computer Systems, pages 1–12, 1992.
15. N. Nethercote and J. Seward. Valgrind: A Program Supervision Framework. In Proceedings of the Third Workshop on Runtime Verification (RV'03), Boulder, Colorado, USA, July 2003. Available at http://developer.kde.org/~sewardj.
16. V. S. Pai, P. Ranganathan, S. V. Adve, and T. Harton. An Evaluation of Memory Consistency Models for Shared-Memory Systems with ILP Processors. In Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems, pages 12–23, October 1996.
17. N. Thürey. Cache Optimizations for Multigrid in 3D. Lehrstuhl für Informatik 10 (Systemsimulation), Institut für Informatik, University of Erlangen-Nuremberg, Germany, June 2002. Studienarbeit.
18. J. Weidendorfer, M. Kowarschik, and C. Trinitis. A Tool Suite for Simulation Based Analysis of Memory Access Behavior. In Proceedings of International Conference on Computational Science, volume 3, pages 455–462, June 2004.
Computationally Expensive Methods in Statistics: An Introduction

Organizer: Wolfgang M. Hartmann
SAS Institute, Inc., Cary NC, USA
Assuming a two-dimensional data set with N observations (rows) and n variables (columns), there are two types of large scale data which require intensive computational work:

1. N >> n: traditional statistical methods: many points in low or medium dimensional space; many observations but a moderate number of variables; traditional ANOVA, L2, L1, and L∞ regression; discriminant analysis etc.; applications in traditional data mining.
2. N << n: newer statistical methods: few points in very high dimensional space; few observations N but many variables n; curse of dimensionality; applications in Chemometrics (spectra), microarray expression data (few chips but many genes); text mining and search engines (few terms in many documents).

In this minisymposium we were trying to concentrate on the second form of data, few points in high-dimensional space, where new analysis methods are being developed. Wolfgang Hartmann, SAS Institute Inc., Cary NC, opened the session with an introduction clarifying the difference between dimension reduction and variable selection methods in statistical analyses. A definition of the dimension reduction and variable selection problem in statistics is formulated which is based on work of McCabe (1984). A number of commonly known exploratory and predictive modeling methods in statistics and machine learning (data mining) are classified either as dimension reduction or variable selection methods. Paul Somerville from the University of Central Florida, in his presentation on Stepdown FDR Procedures for Large Numbers of Hypotheses, introduced some new developments for a method of multiple comparisons in data with a large number of variables, such as microarray data. In such large applications stepwise methods are used for computational efficiency but may suffer either from a number of false hypotheses or from a loss of statistical power. This work is still in progress and the talk highlighted some of the statistical problems with the author's stepwise approach compared to others already available. The new procedure seems to result in a reduced number of false rejections with a negligible reduction in power when the expected number of false hypotheses is small. Ulrich Mansmann, then from the University of Heidelberg, Department of Medical Biometry and Informatics, now at the University of Munich, proceeded with his talk on Reproducible Statistical Analysis in Microarray Profiling Studies. It is well known that for more complicated statistical modeling, published results are not easily reproducible. Due to the complexity of the algorithms, the size of the data sets, and the limitations of the medium of printed paper, it is usually not possible to report all the minutiae of the data
processing and statistical computations. Microarrays are a recent biotechnology that offers the hope of improved cancer classification, providing clinicians with the information to choose the most appropriate form of treatment. But reproducing published results in this domain is harder than it may seem. For example, Tibshirani and Efron (2002) report: "We reanalysed the breast cancer data from van't Veer et al. (2002). ... Even with some help of the authors, we were unable to exactly reproduce this analysis." To achieve reproducible calculations and to offer an extensible computational framework, the tool of a compendium (Sawitzki, 1999; Leisch, 2002; Gentleman and Temple Lang, 2004; Sawitzki, 2002) is discussed. A compendium is a document that bundles primary data, processing methods (computational code), derived data, and statistical output with the textual documentation and conclusions. Ulrich's talk presented examples, discussed the problems hidden in the published analyses, and demonstrated a strategy to improve the situation which is based on the vignette technology available from the R and Bioconductor projects (Ihaka and Gentleman, 1996; Gentleman and Carey, 2002).
Frank Bretz, then at the University of Hannover, now at Novartis Pharmaceuticals in Basel, Switzerland, presented some work with P. Westfall, Texas Tech University, on the computationally difficult problem of applying multiple comparisons to the analysis of microarray data. Algorithms for multiple comparisons usually require high-dimensional quadrature of the multivariate normal or t distribution. Analytic formulas were developed for various types of power and error rates of some closed testing procedures. The formulas involve non-convex regions that may be integrated with high, pre-specified accuracy using available software. The non-convex regions are represented as a union of hyper-rectangles. These regions are transformed to the unit hypercube, then summed, to create an expression for power that is the integral of a function defined on the unit hypercube. This function is then evaluated using existing quasi-Monte Carlo methods that are known for their accuracy. Applications include individual, average, complete, and minimal power, for closed pairwise comparisons (max T-based), as well as fixed sequence and non-pairwise comparisons.
Thomas Ragg, co-founder of quantiom bioinformatics GmbH (http://www.quantiom.de), a small but growing company which "develops customized Analysis Strategies and specialized software solutions for bioanalytical questions and offers the necessary consulting, support, training, and service", concluded the session with a talk on Normalization, variable ranking and model selection for high-dimensional genomic data - Design and Implementation of automated analysis strategies, which outlined the procedure for analyzing microarray data as it is commonly encountered by his team. Some crucial problems in building models based on empirical genomic data are:
– normalization within experiments and between experiments to make them comparable
– filtering of irrelevant variables to reduce the complexity of the search space
– selection of an appropriate subset of features (e.g. genes)
– model training based on the available data, where the model complexity needs to be balanced against the training error
– model selection based on a fitness criterion (e.g. model evidence)
– evaluation of the strategy and appropriate presentation of results
The analysis strategy was demonstrated for oligo microarray data and showed the benefits of the quantiom software implementation (in Java) with its modular design principle. For references see the contributions.
Dimension Reduction vs. Variable Selection Wolfgang M. Hartmann SAS Institute, Inc., Cary NC, USA
Abstract. The paper clarifies the difference between dimension reduction and variable selection methods in statistics and data mining. Traditional and recent modeling methods are listed and a typical approach to variable selection is mentioned. In addition, the need for and the types of cross validation in modeling are sketched.
1 Introduction
The purpose of this paper is to highlight the difference between methods of variable selection (VS) and dimension reduction (DR) in statistics and to show some basic features of the most common methods. Both types of methods originally emerged in computational statistics and are now more and more used and developed in the area of data mining, sometimes called machine learning. It is first necessary to explain some of the terms used in later parts of this paper. The machine learning literature distinguishes between two essential types of analysis:
Unsupervised Learning, also called exploratory analysis methods: The analysis data set consists only of one set of variables X for which N measurements (observations, cases) are available. A modeling method is used to find similarities and significant differences either
– among the variables in X
– or among the observations in X.
Supervised Learning, also called predictive modeling: The analysis data set consists of two sets of variables, a set of response (dependent) variables Y and a set of predictor (independent) variables X. A method of linear or nonlinear parametric modeling or a method of nonparametric modeling is used that establishes a relationship between the values of the X and Y variables and allows one to predict the values of the Y variables with sufficient precision. Here we will mostly deal with parametric modeling.
2 Difference Between DR and VS
The following notation outlines the difference between DR and VS and is based on work of McCabe (1984, [21]):
1. Given an N × n matrix X as training data.
2. Usually X is column centered around the mean (sometimes a covariance matrix) or even scaled (sometimes a correlation matrix).
3. Find an n × k (k << n) transformation matrix A_k with V = X A_k, usually such that A_k^T A_k = I_k.
4. For dimension reduction, A_k is dense.
5. For variable selection, A_k = perm([I_k : 0_{k×(n−k)}]^T), where perm(A) is a row-permuted n × k matrix.
This definition does not refer to the criterion for the selection of A_k and therefore covers both supervised and unsupervised methods of a wide range. It also does not specify the criterion used for selecting k, the number of dimensions or selected variables. Therefore, methods for DR and VS are of a very wide range. It follows that we understand variable selection as a discrete process (sparse notation), whereas dimension reduction is a continuous process (dense notation). Therefore, DR shows the grouping of variables: variables with low impact have smaller weights than those with large impact. On the other hand, VS usually picks only one representative from each cluster of similar variables and does not refer to variables which may be very similar to those picked but with only slightly smaller impact. To find the variables highly correlated with those picked by VS, postprocessing of the data is usually needed. As an implication of the discrete selection process of VS we can expect rather high variance in bootstrap and cross validation, compared to a much lower variance for the more continuous DR. That again has an impact on choosing the methods used for tuning hyperparameters of the model. A small code sketch contrasting the two cases follows below.
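As a small illustration of this notation (our own sketch in base R, not code from the paper), the following lines build a dense A_k from the first k principal component loadings (dimension reduction) and a sparse permutation-type A_k that simply picks k of the original columns (variable selection). All object names are ours.

set.seed(1)
N <- 20; n <- 6; k <- 2
X <- scale(matrix(rnorm(N * n), N, n), center = TRUE, scale = FALSE)  # column-centered X

## dimension reduction: dense A_k (orthonormal loadings of the first k components)
A.dr <- prcomp(X, center = FALSE)$rotation[, 1:k]
V.dr <- X %*% A.dr                  # N x k score matrix
round(crossprod(A.dr), 10)          # approximately I_k

## variable selection: sparse A_k = perm([I_k : 0]^T), here selecting columns 3 and 5
sel  <- c(3, 5)
A.vs <- matrix(0, n, k)
A.vs[cbind(sel, 1:k)] <- 1          # exactly one 1 per column, zeros elsewhere
V.vs <- X %*% A.vs                  # just the selected original variables
all.equal(V.vs, X[, sel], check.attributes = FALSE)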
3 Overview on Methods for VS and DR
In the following we try to list the most common methods for VS and DR and refer to software found in the free statistical software R (see [26]), the author's own interactive matrix language CMAT (Hartmann, [15]), and in SAS/STAT (statistics) ([29]), SAS/IML (interactive matrix language), and SAS/EM (enterprise miner). Of course, many other software packages like SPSS and S-PLUS are available that cover most of the listed methods.
3.1 Methods for Dimension Reduction
The purpose is finding p << n components/factors/clusters which are (linear or nonlinear) functions of the n variables and which describe the essential features of the data in some sense (maximizing explained variance, de-noising the data in some sense, sufficient for prediction).
1. Exploratory Modeling (Unsupervised Learning):
(a) (Kernel) Dense Principal Components and Factor Analysis, e.g. see Harman (1976, [14]) and Mulaik (1972, [23])
– functions prcomp and princomp and the kernlab package in R
– functions eig, nlkpca, factor, and sem in CMAT
– functions eigval, eigvec, and eigen in SAS/IML and PROCs PRINCOMP, FACTOR, and CALIS in SAS/STAT
(b) Singular Value Decomposition and Correspondence Analysis, e.g. see Gifi (1981, [11]) and Greenacre (1984, [12])
– function svd and functions corresp, mca, and CoCoAn in R
– functions svd, gsvd, arpack, and svdtrip in CMAT; correspondence analysis currently not in CMAT
– function svd in SAS/IML and PROC CORRESP in SAS/STAT
(c) Multidimensional Scaling, e.g. see Carroll & Arabie (1998, [6])
– function smdscale in R
– currently not in CMAT
– PROC MDS in SAS/STAT
(d) Methods of fuzzy variable clustering, e.g. see Kaufman & Rousseeuw (1990, [17])
– functions cluster, hclust, mclust, cclust in R
– functions cluster and emclus in CMAT
– PROCs ACECLUS, CLUSTER, and MODECLUS in SAS/STAT
2. Predictive Modeling (Supervised Learning):
(a) (Kernel) Partial least-squares, e.g. see Wold (1966, [38]), Rosipal & Trejo (2001, [28])
– packages plspcr and gpls in R
– functions pls and nlkpls in CMAT
– PROC PLS in SAS/STAT (kernel PLS currently not in SAS)
(b) Sliced inverse regression (SIR) and principal Hessian directions (pHd), e.g. see Weisberg (2002, [35])
– function sir in R
– function sir in CMAT
– currently not in SAS
(c) Neural Networks (Ripley, 1994, [27]) and Support Vector Machines (Vapnik, 1995, [37]; Schoelkopf & Smola, 2000, [30])
– function nnet, and function svm in the e1071 package and the kernlab package in R
– function svm in CMAT
– SAS PROCs SVM, NEURAL, and DMNEURL in SAS/EM
3.2 Methods for Variable Selection
The purpose is finding a subset of p << n variables from the original data which describe the essential features of the data in some sense (maximizing explained variance, de-noising the data in some sense, sufficient for prediction).
1. Exploratory Modeling (Unsupervised Learning):
(a) Coloring a correlation matrix
(b) Sparse Principal Components (SPCA) (Zou & Hastie, 2004, [41])
– not in R
– function spca in CMAT
– not in SAS
(c) Principal Variables (McCabe, 1984, [21])
(d) Methods of discrete variable clustering
i. using common cluster methods with the transposed data matrix, e.g. see Kaufman & Rousseeuw (1990, [17])
ii. using hybrids of factor analysis and rotation to sparse factor patterns, e.g. see Harman (1976, [14]) or Anderberg (1973, [1])
– not in R
– function varclus in CMAT
– SAS PROC VARCLUS
iii. K-Median Clustering (Mangasarian & Wild, 2004, [20])
2. Predictive Modeling (Supervised Learning):
(a) Type III analyses of many statistical modeling methods
(b) Subset selection in (generalized linear) regression (Miller, 2002, [22]): specifically stepwise (linear, logistic) regression
– functions step, glm, leaps in R
– functions lrforw, reg, glim, glmixd in CMAT
– PROCs REG, MIXED, LOGISTIC, PHREG, and GENMOD in SAS/STAT and PROC DMINE in SAS/EM
(c) Recursive Partitioning and Regression Trees (CART, CHAID), see Breiman et al. (1984, [4])
– package tree and functions rpart, knntree in R
– function split in CMAT
– PROCs SPLIT, DMSPLIT, and ARBOR in SAS/EM
(d) Step-up and step-down Multivariate Testing, e.g. see Benjamini & Hochberg (1995, [2]), Somerville & Bretz (2001, [32]), Somerville (2004, [31])
– functions multcomp and multtest in the Bioconductor package in R
– function multtest in CMAT
– PROC MULTTEST in SAS/STAT
(e) Garotte by Breiman (1993, [5])
– currently not in R
– function garotte in CMAT
– currently not in SAS
(f) (Univariate) Soft Thresholding (UST) by Chen, Donoho, & Saunders (1999, [7]), Tibshirani, Hastie, Narasimhan, & Chu (2002, [34])
(g) Lasso by Tibshirani (1996, [33]) and Osborne, Presnell, & Turlach (2000, [24]) and LARS by Efron, Hastie, Johnstone, & Tibshirani (2002, [9])
– functions lars and lasso in R
– function lars in CMAT
– currently not in SAS
(h) Elastic Net by Zou and Hastie (2003, [40])
– currently not in R
– function lars in CMAT
– currently not in SAS
(i) Sparse L1 SV Regression (Bi, Bennett, Embrechts, Breneman, & Song, 2002, [3])
– currently not in R
– function svm in CMAT
– currently not in SAS
(j) Sparse L1 SV Classification (Fung & Mangasarian, 2003, [10])
– currently not in R
– function svm in CMAT
– PROC SVM in SAS/EM
(k) SVM Feature Selection by Guyon et al. (2002, [13]), Weston et al. (2000, [36])
– currently not in R
– function svm in CMAT
– PROC SVM in SAS/EM
(l) Feature Selection using Genetic Algorithms, e.g. see Yang & Honavar (1997, [39])
– functions gap, rgenoud, and gafit in R
– function nlp in CMAT
– currently not in SAS
(m) Bayesian methods of variable selection
– naiveBayes in the e1071 package and the BsMD package in R
– currently not in CMAT
– currently not in SAS
Note, even though some of the software has the same name in CMAT and SAS, the implementations are usually very different. For example, the algorithms for variable clustering in CMAT are not necessarily based on a covariance or correlation matrix (which is for n >> N sometimes difficult to store) as with PROC VARCLUS in SAS/STAT.
3.3 Example: Variable Selection with SV Regression
Illustrating the five-step variable selection algorithm by Bi, Bennett, Embrechts, Breneman, & Song (2002, [3]); a rough code sketch follows the list:
1. Attach a small set of randomly generated gauge variables (columns) to the data set which have approximately zero correlation with the response y.
2. An outer iteration performs bootstrap aggregation (bagging): bootstrap samples are used as training data and the results are averaged (reducing the variance makes the results more stable, especially w.r.t. local optima of the internal pattern search).
3. A pattern search cycle for the hyperparameters of linear SV regression, the regularization C and the tube parameter ε, is performed and the best solution is selected for nearly optimal values of (C, ε).
4. The pattern search is based on the K-fold cross validation result. This CV cycle requires optimizing K LPs (linear support vector machine problems) for specific values of (C, ε).
5. Based on the former four steps, a small set of variables is selected. Using this small set of variables, the last two steps of pattern search with cross validation are now performed for a nonlinear kernel. That means, in addition to the regularization C and tube parameter ε, the pattern search now includes additional kernel parameters.
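A minimal sketch of step 1 and of one inner fit, assuming the e1071 package for the linear ε-SV regression; the rule used here (keep variables whose absolute weight exceeds the largest gauge-variable weight) is our own simplification for illustration, not the exact selection criterion of Bi et al., and the bagging and pattern-search loops are omitted.

library(e1071)                         # assumed available; provides svm() for eps-regression
set.seed(2)
N <- 40; n <- 10
X <- matrix(rnorm(N * n), N, n, dimnames = list(NULL, paste0("x", 1:n)))
y <- X[, 1] - 2 * X[, 3] + rnorm(N, sd = 0.5)

## step 1: attach a few randomly generated "gauge" variables with ~zero correlation to y
G  <- matrix(rnorm(N * 3), N, 3, dimnames = list(NULL, paste0("gauge", 1:3)))
Xg <- cbind(X, G)

## one cell of the inner loop: a linear eps-SV regression for fixed (C, epsilon)
fit <- svm(Xg, y, type = "eps-regression", kernel = "linear",
           cost = 1, epsilon = 0.1, scale = FALSE)
w   <- drop(t(fit$coefs) %*% fit$SV)                 # weight vector of the linear model
thr <- max(abs(w[paste0("gauge", 1:3)]))             # largest weight among the gauge columns
names(w)[abs(w) > thr]                               # variables that "beat" the gauge columns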
4 Cross Validation (CV)
The machine learning literature refers to three types of data sets that are useful for modeling:
Training Data: This data set contains N observations with n variables. For predictive modeling the latter can be predictor or response variables. This data set is used to build the model.
Validation Data: This data set contains Nval observations of the same n variables as the training set. It may be used as a second data set in the modeling process to measure the goodness-of-fit of the model and to determine the termination of the modeling process. Using a validation data set is similar to cross validation and can increase the generalizability of the model and reduce the possibility of overfitting. That means the resulting model will also perform well with data other than the training data.
Test Data: This data set contains Ntest observations and is usually not available during the modeling process. Even for predictive modeling it need not contain the response variables, but it must contain all the predictor variables of the training data set. After a parametric model has been established using training and perhaps validation data, the model can be applied to the test data. In predictive modeling that means using the model to compute the expected values of the response variables (scoring).
The purpose of cross validation is to:
1. Increase the generalizability of the model estimates: the model does not only fit one single sample, it also fits other samples of the same population well.
2. Prevent overfitting: too many parameters, some describing data noise specific to the sample used in training.
This is achieved by evaluating the model on data which do not belong to the data used for training the model. (A short code sketch of the partition schemes below is given at the end of this section.)
Block CV: in K-fold block CV there are K training runs: a block of N/K observations is left out from model training and scored, while the remaining N − N/K observations are used for estimation; e.g. left out for K=10 and N=100: i=1,...,10; i=11,...,20; i=21,...,30; ... (biased for sorted data)
Split CV: in K-fold split CV there are K training runs: "successive groups of widely separated observations are held out as the test set", e.g. left out for K=10: i=1,11,21,...; i=2,12,22,...; i=3,13,23,...
Random CV: in K-fold random CV there are K training runs: in each run N/K randomly selected observations are left out from model training; observations are selected from the remaining pool, i.e. after K training runs each observation has been left out exactly once.
Loo CV: in each of N training runs one observation is left out from the training data set: the remaining N − 1 observations are used for parameter estimation, and the left-out observation is scored (here K = N). This is related to the well-known jackknife in statistics.
Test Set Validation: the model is trained (parameters estimated) on the training data; the model is evaluated on the test data set; this assumes that there are two large enough data sets.
Random Validation: inside a loop of multiple training runs, the input data set is divided randomly into a training and a validation data set. The model estimates are computed with the training set and the fit is evaluated on the validation data set. The best model is selected based on the fit with the validation data sets. (If not done in a "balanced" way, the result can be biased, i.e. some observations are scored more often than others.)
A number of functions in CMAT, like pls and svm, offer options for cross validation. Usually, Loo is computationally the most expensive method since it needs the large number of N estimations. But when using an estimation method that exploits good starting values, the difference in computer time compared to 10-fold cross validation can be surprisingly small. An advantage of Loo is that it can be used for sensitivity analysis and the detection of model outliers:
1. During the N leave-one-out runs, the model fit can be evaluated at the remaining N − 1 observations of the training set.
2. If the training model fit improves significantly when leaving out observation Xi, it indicates that observation Xi is a "model misfit", a model outlier.
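The following base-R lines give a minimal sketch of how the block, split and random K-fold partitions described above can be generated; cv_folds is our own illustrative helper, not a function of CMAT or of any package mentioned here.

## for each of the K runs, return the set of left-out observation indices
cv_folds <- function(N, K, type = c("block", "split", "random")) {
  type <- match.arg(type)
  f <- switch(type,
    block  = rep(1:K, each = ceiling(N / K), length.out = N),  # consecutive blocks
    split  = rep(1:K, length.out = N),                         # 1, 11, 21, ... pattern
    random = sample(rep(1:K, length.out = N)))                 # random, each obs once
  unname(split(1:N, f))
}
cv_folds(100, 10, "split")[[1]]   # 1 11 21 31 41 51 61 71 81 91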
References 1. Anderberg, M.R. (1973), Cluster Analysis for Applications, New York: Academic Press, Inc. 2. Benjamini, Y. & Hochberg, Y. (1995), “Controlling the false discovery rate: a practical and powerful approach to multiple testing”, J. R. Statist. Soc., Ser. B, 289-300. 3. Bi, J., Bennett, K., Embrechts, M., Breneman, C. & Song, M. (2002), “Dimensionality Reduction via Sparse Support Vector Machines”, Journal of Machine Learning Research 1 1-48. 4. Breiman, L., Friedman, J.H., Olshen, R.A. & Stone, C.J. (1984), Classification and Regression Trees, Wadsworth. 5. Breiman, L. (1993), “Better subset selection using the non-negative garotte”; Technical Report, Univ. of California, Berkeley. 6. Carroll, J.D. & Arabie, P. (1998), Multidimensional scaling, in M.H. Birnbaum (Ed.), Handbook of Perception and Cognition: Measurement, Judgment and Decision Making, p. 179-250, San Diego, CA: Academic Press. 7. Chen, S.S., Donoho, D.L., & Saunders, M. (1999), “Atomic decomposition by basis pursuit”, SIAM Journal on Scientific Computing, 20, 33-61. 8. Donoho, D., Johnstone, I., & Tibshirani, R. (1995), “Wavelet shrinkage: asymptotia? (with discussion)”; J. R. Statist. Soc., Ser. B, 57, 301-337. 9. Efron, B., Hastie, T., Johnstone, I. & Tibshirani, R. (2002), “Least Angle Regression”, The Annals of Statistics, 32, 407-499. 10. Fung, G. & Mangasarian, O.L. (2003), “A Feature Selection Newton Method for Support Vector Machine Classification”, Computational Optimization and Aplications, 1-18. 11. Gifi, A. (1981), Nonlinear Multivariate Analysis, Dep. of Data Theory, Univ. of Leiden. 12. Greenacre, M.J. (1988), Theory and Applications of Correspondence Analysis, London: Academic Press. 13. Guyon, I., Weston, J., Barnhill, S. & Vapnik, V. (2002), "Gene Selection for cancer classification using support vector machines"; Machine Learning, 46, 389-422.
14. Harman, H.H. (1976), Modern Factor Analysis, Chicago: University of Chicago Press. 15. Hartmann, W. (1995), CMAT Users Manual, see http://www.cmat.pair.com/cmat 16. Joachims, T. (1999), "Making large-scale SVM learning practical", in B. Sch¨olkopf, C.J.C. Burges, and A.J. Smola (eds), Advances in Kernel Methods: Support Vector Learning, Cambridge: MIT Press. 17. Kaufman, L. & Rousseeuw, P. (1990), Finding Groups in Data: An Introduction to Cluster Analysis, New York: Wiley. 18. Li, K.C. (1991), “Sliced inverse regression for dimension reduction”; JASA, 86, 316-342. 19. Li, K.C. (1992), “On principal Hessian directions for data visualization and dimension reduction”; JASA, 87, 1025-1034. 20. Mangasarian, O.L. & Wild, E.W. (2004), “Feature Selection in k-Median Clustering”, Technical Report 04-01, Data Mining Institute, Madison: University of Wisconsin. 21. McCabe, G.P. (1984), “Principal variables”; Technometrics, 26, 139-144. 22. Miller, A. (2002), Subset Selection in Regression, CRC Press, Chapman & Hall. 23. Mulaik, S.A. (1972), The Foundations of Factor Analysis, New York: Mc Graw Hill. 24. Osborne, M.R., Presnell, B., & Turlach, B.A. (2000), “On the LASSO and its Dual”, JCGS, 9, 319-337. 25. Osborne, M.R., Presnell, B., & Turlach, B.A. (2000), “A new approach to variable selection in least squares problems”, IMA Journal of Numerical Analysis, 20, 389-404. 26. R Language and packages see: http://www.r-project.org/ and http://cran.r-project.org/ 27. Ripley, B,D. (1996), Pattern Recognition and Neural Networks, Cambridge: Cambridge University Press. 28. Rosipal, R. & Trejo, L.J. (2001), “Kernel partial least squares regression in reproducing kernel Hilbert space”, Journal of Machine Learning Research, 2, 97-123. 29. SAS/STAT User’s Guide, (1990), Version 6, Second Printing, SAS Institute Inc., Cary, NC. 30. Sch¨olkopf, B. & Smola, A.J. (2002), Learning with Kernels, Cambridge MA and London: MIT Press. 31. Somerville, P.N. (2004), “Step-down FDR Procedures for large numbers of hypotheses”, this volume. 32. Somerville, P.N. & Bretz, F. (2001), “FORTRAN 90 and SAS-IML programs for computation of critical values for multiple testing and simultaneous confidence intervals”, Journal of Statistical Software. 33. Tibshirani, R. (1996), “Regression shrinkage and selection via the Lasso”, J. R. Statist. Soc., Ser. B, 58, 267-288. 34. Tibshirani, R., Hastie, T., Narasimhan, B. & Chu, G. (2002), “Diagnosis of multiple cancer types by shrunken centroids of gene expression”; Proceeding of the National Academy of Sciences, 99, 6567-6572. 35. Weisberg, S. (2002), “Dimension reduction regression with R”, JSS, 7. 36. Weston, J., Mukherje, S., Chapelle, O., Pontil, M., Poggio, T., & Vapnik, V. (2000), “Feature Selection for SVMs”, Neural Information Processing Systems, 13, 668-674. 37. Vapnik, V.N. (1995), The Nature of Statistical Learning, New York: Springer. 38. Wold, H. (1966), “Estimation of principal components and related models by iterative least squares”, Multivariate Analysis, New York: Academic Press. 39. Yang, J. & Honavar, V. (1997), “Feature selection using a genetic algorithm”, Technical Report, Iowa State University. 40. Zou, H., & Hastie, T. (2003), “Regression shrinkage and selection via the elastic net, with applications to micro arrays”, Technical Report, Stanford University. 41. Zou, H., Hastie, T. & Tibshirani, R. (2004), “Sparse principal component analysis”, Technical Report, Stanford University.
Reproducible Statistical Analysis in Microarray Profiling Studies
Ulrich Mansmann (1), Markus Ruschhaupt (2), and Wolfgang Huber (2)
(1) University of Heidelberg, Department for Medical Biometry and Informatics, INF 305, 69120 Heidelberg, Germany, [email protected]
(2) German Cancer Research Center, Division of Molecular Genome Analysis, INF 580, 69120 Heidelberg, Germany
Abstract. Reproducibility of calculations is a longstanding issue within the statistical community. Due to the complexity of the algorithms, the size of the data sets, and the limitations of printed paper as a medium, it is usually not possible to report all the minutiae of the data processing and statistical computations. Like the critical assessment of a mathematical proof, it should be possible to check the software behind a complex data analysis. To achieve reproducible calculations and to offer an extensible computational framework, the tool of a compendium is discussed.
1 Introduction
Microarray technology allows simultaneous measurement of thousands of transcripts within a homogeneous sample of cells [1]. It is of interest to relate these expression profiles to clinical phenotypes to improve the diagnosis of diseases and the prognosis for individual patients [2]. A number of publications presented clinically promising results by combining this new kind of biological data with specifically designed algorithmic approaches. A selection of these papers will be discussed [3,4,5,6] with respect to different aspects of reproducibility. The most evident aspect of reproducibility is that of reproducing a calculation on the same data. Reproducing published results in the domain of microarray profiling studies is harder than it may seem. As an example we look at a study of van 't Veer et al. [3] which was reanalysed by Tibshirani and Efron [7]. They state in their paper: We re-analyzed the breast cancer data from van 't Veer et al. ... Even with some help of the authors, we were unable to exactly reproduce this analysis. We do not know of any example where classification results gained with one microarray technology and a special algorithm were reproduced using an alternative microarray platform and algorithm. Papers with diverging results on profiles for the prognosis of tumour recurrence for breast cancer patients are [3,4]. How does this observation relate to the idea of a common underlying disease process? Should profiles which are developed for the same disease have something in common? Is it of significance if they do not? The confounding of algorithmic problems with biotechnology and biology creates a Gordian knot. The paper applies the tool of a compendium as an interactive strategy to settle the algorithmic backbone of a profiling study and to derive reproducible results
with a state-of-the-art methodology. A compendium is a document that bundles primary data, processing methods (computational code), derived data, and statistical output with the textual documentation and conclusions. It is interactive in the sense that it allows one to modify the processing options, plug in new data, or insert further algorithms and visualisations.
2 Examples and Questions
2.1 Example 1
Van 't Veer et al. [3] classify breast cancer patients after curative resection with respect to the risk of tumour recurrence. The study includes 78 patients. Forty-four patients had a good prognosis and did not suffer from a recurrence during the first 5 years after resection. Thirty-four patients had a bad prognosis and experienced a recurrence during the first 5 years after resection. Agilent microarray technology was used to quantify the transcripts probed by 24,881 oligonucleotides. Additional prognostic factors like tumour grade, ER status, PR status, tumour size, patient age and angioinvasion were also documented. The authors were interested in developing a classifier based on the gene expression and in comparing the relevance of the genomic signature to the prognostic value of standard clinical predictors. They used the following algorithm to establish the signature, which contains 70 genes (a small code sketch of the centroid rule is given at the end of this subsection):
1. Starting with 24,881 genes, filtering on fold-change and a p-value criterion reduced the number of relevant genes to 4,936.
2. Based on an absolute correlation of at least 0.3 between gene expression and the group indicator (0,1), a further reduction to 231 genes.
3. Calculation of the 231-dimensional centroid vector for the 44 good prognosis cases.
4. The correlation of each case with this centroid is calculated; a cut-off of 0.38 is chosen to misclassify exactly 3 cases with poor prognosis.
5. A case is classified to the good prognosis group if the correlation calculated for some number n of genes (1 ≤ n ≤ 231) with the centroid vector is ≥ 0.38, otherwise the case is classified to the bad prognosis group.
6. Starting with the top 5, 10, 15, ... genes, the classification procedure is carried out with leave-one-out cross-validation to pick the optimal number of genes → 70.
Based on this algorithm van 't Veer et al. achieved a correct classification for 26 of 44 patients without recurrence and for 31 of 34 with recurrence. Tibshirani and Efron [7] tried to reproduce these results and report: Even with some help of the authors, we were unable to exactly reproduce this analysis. In fact, the differences between both calculations were not dramatic. But the example shows that algorithmic reproducibility is not a trivial issue and may have subjective elements. The algorithm is quite popular and is also used in [5] and other papers. The van 't Veer example raises the following questions: Why was a heuristic classification algorithm chosen and not a standard algorithm from machine learning? Are its computational aspects well understood? How important is the choice of the parameters for correct classification? Is the result easy to interpret? What justifies the popularity of the algorithm? Is the leave-one-out CV strategy appropriate? What is the effect of other CV strategies on the classification result?
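The centroid-correlation rule of steps 3-5 can be sketched in a few lines of base R; this is our own toy illustration (the gene filtering of steps 1-2 and the gene-number optimization of step 6 are omitted, and only the 0.38 cut-off is taken from the description above).

set.seed(3)
X   <- matrix(rnorm(78 * 231), 78, 231)                  # cases x (already selected) genes
grp <- rep(c("good", "bad"), c(44, 34))
centroid <- colMeans(X[grp == "good", , drop = FALSE])   # step 3: good-prognosis centroid
r    <- apply(X, 1, cor, y = centroid)                   # step 4: correlation with centroid
pred <- ifelse(r >= 0.38, "good", "bad")                 # step 5: cut-off 0.38
table(pred, grp)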
2.2 Example 2
Huang et al. [4] investigated primary tumour samples from 52 patients with breast tumours and 1-3 positive lymph nodes. 18 patients had a recurrence within three years after surgery, and 34 patients did not. The authors concluded that they could predict tumour recurrence with misclassification rates of 2/34 and 3/18, respectively. The authors presented a tree classifier with split decisions based on an a posteriori distribution over all possible splits. A description is available in the form of a technical report on the webpage of one of the authors. Due to ambiguities we did not succeed in translating the statistical ideas into software with which we could reproduce the analysis. More importantly, there is no official software version of the algorithm. The authors perform two dimension reduction steps before they apply the Bayesian tree classifier. First, the 12,625 probe sets on the HGU95Av2 Affymetrix GeneChips are reduced to 7,030 by excluding probe sets with maximum intensities below 29 and low variability across the samples. The second reduction creates 496 metagenes out of the 7,030 probe sets by performing k-means clustering and using the first principal component of each cluster as the expression measure for the metagene. As in example 1, this study is concerned with the prognosis of tumour recurrence of breast cancer patients after curative resection of the tumour. The authors state that they could not find any of the 70 genes used in the classifier of van 't Veer et al. in any of the metagenes which come up in the Bayesian tree classifier. The Huang example raises the following questions: Why is it impossible for us to reproduce the good classification result? Why is the classification based on the idiosyncratic method so good? The preprocessing is not part of the CV loop; what influence may this have on the misclassification rate? How is it possible to disentangle computation, technology, and biology? Can we find a link between the Huang and van 't Veer classifiers?
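The metagene construction mentioned above can be sketched as follows in base R (toy numbers and names of our own; Huang et al. used 496 clusters on 7,030 probe sets):

set.seed(4)
E <- matrix(rnorm(200 * 20), 200, 20)     # toy matrix: 200 probe sets x 20 samples
k <- 10
cl <- kmeans(E, centers = k)$cluster      # cluster the probe sets (rows)
metagene <- t(sapply(1:k, function(j)
  prcomp(t(E[cl == j, , drop = FALSE]))$x[, 1]))   # first PC of each cluster, per sample
dim(metagene)                             # k x samples matrix of metagene values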
3 The Compendium
Publications on microarray profiling studies often present one new microarray data set and one new classification method. Is it necessary to develop an idiosyncratic classification approach for each specific data set? Which classification result could be achieved with standard approaches? What loss in accuracy has to be traded for a rise in interpretability? How can I use new data to validate former results? If validation creates discrepancies, how is it possible to assess the contribution made by algorithmic aspects: error in implementation or no success with validation? Do I have sufficient instructions and details to reproduce the method under validation in an exact way? How dependent is the classification result on the method used? What can be learned from a published profiling study for future projects? To answer these and other questions we introduce the compendium of computational diagnostic tools (CCDT).
3.1 Compendium for Computational Diagnostic Tools – CCDT
The compendium is an interactive document that calculates the misclassification rate (MCR) for different classification methods. The validation strategy is fixed but may be
modified by setting specific parameters. Therefore, an outer and an inner cross-validation are mandatory: the inner CV tunes the algorithm-specific parameters (including parameters for the preprocessing) and the outer CV estimates the misclassification rate. We see the preprocessing as part of the classification algorithm. The CV strategy is sketched in figure 1.
Fig. 1. The cross-validation strategy (CV - cross validation, MCR - misclassification rate)
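A rough base-R skeleton of the nested loop in Fig. 1, with a toy classifier; this is our own illustrative code (all names are ours), not the compendium implementation. The inner CV picks a tuning parameter on the training part only, the outer CV estimates the MCR.

nested.cv <- function(X, y, pars, fit.fun, K.out = 10, K.in = 5) {
  out.folds <- split(sample(seq_along(y)), rep(1:K.out, length.out = length(y)))
  err <- sapply(out.folds, function(test) {
    train    <- setdiff(seq_along(y), test)
    in.folds <- split(sample(train), rep(1:K.in, length.out = length(train)))
    in.err   <- sapply(pars, function(p)          # inner CV error for each candidate p
      mean(sapply(in.folds, function(v) {
        fit <- fit.fun(X[setdiff(train, v), , drop = FALSE], y[setdiff(train, v)], p)
        mean(fit(X[v, , drop = FALSE]) != y[v])
      })))
    best <- pars[which.min(in.err)]               # tuned without touching the test block
    fit  <- fit.fun(X[train, , drop = FALSE], y[train], best)
    mean(fit(X[test, , drop = FALSE]) != y[test]) # outer-loop misclassification
  })
  mean(err)                                       # estimated MCR
}
## toy "classifier": threshold the first variable at the tuning parameter p
toy.fit <- function(Xtr, ytr, p) { force(p); function(Xnew) ifelse(Xnew[, 1] > p, "A", "B") }
set.seed(5)
X <- matrix(rnorm(60 * 5), 60, 5)
y <- ifelse(X[, 1] + rnorm(60, sd = 0.5) > 0, "A", "B")
nested.cv(X, y, pars = c(-0.5, 0, 0.5), fit.fun = toy.fit)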
Things that can be changed easily are: classification methods, preprocessing steps, parameters for the classification method and the validation, and the data set. The compendium allows one to combine guidelines with software and to embed good statistical analysis in a text which is accessible for medical or biological researchers. It is a document that bundles primary data, processing methods (computational code), derived data, and statistical output with the textual documentation and conclusions. In particular, it contains specific tools to represent results. The inclusion of an implemented classification method is via a wrapper. Writing a wrapper function for new classification algorithms is simple. So far, five classification approaches are already implemented: shrunken centroids (PAM) [18], support vector machines (SVM) [19], random forest [20], general partial least squares, and penalized logistic regression [21]. The compendium can be found as a package for the statistical language R at http://www.biometrie.uni-heidelberg.de/technical_reports. The compendium does not implement new algorithms but offers a framework to apply existing implementations of classification algorithms in a correct way.
3.2 Static Versus Interactive Approaches
The answer to questions like "Which classification result could be achieved with standard approaches?" or "What loss in accuracy has to be traded for a rise in interpretability?" is generally given in a paper which compares N different classification algorithms (C) and
M pre-processing (P) strategies on K different data sets (D). The N x M x K performance measures are tabulated and discussed. Dudoit et al. [12] published such a study which is extended by Lee et al. [13]. It is difficult to use their results for guidance because they certainly do not implement all algorithms of interest, the pre-processing strategies may change, and the data will become irrelevant when a new generation of microarrays is introduced. Therefore, it may be wise to replace the static by an interactive approach by offering the machinery which allows the researcher herself to perform such a study on the algorithms and pre-processing strategies of interest together with relevant data. The compendium offers different levels of interactivity. It can be used to produce a textual output comparable with the static approach. On an intermediate level one interacts with the compendium by specifying parameters and data sets. For example, one could change the kernel of a support vector machine by simply changing the parameter poss.pars: > poss.pars = c(list(cost = cost.range, kernel = "linear"),poss.k)
This is the level of sensitivity analyses or of comparing the performance of implemented algorithms on different data sets. The advanced level of interaction consists in introducing new ideas like new classification algorithms or new tools for the presentation of the results. Writing wrapper functions for new classification methods is simple. The following example shows a wrapper for diagonal discriminant analysis. The essential part is the specific definition of the predict.function by user specific needs and ideas: > + + + + + + + +
DLDA.wrap = function(x, y, pool = 1, ...) { require(sma) predict.function = function(testmatrix) { res = stat.diag.da(ls = x, as.numeric(y), testmatrix, pool = pool)$pred return(levels(y)[res]) } return(list(predict = predict.function, info = list())) }
3.3 Sweave Sweave [24] is a specific approach for generation of a dynamic report. We use Sweave as the technology for the compendium. It mixes S (R) and LATEX in a sequence of code and documentation chunks in a Rnw file. It uses S (R) for all tangling and weaving steps and hence has a very fine control over the S (R) output. Options can be set either globally to modify the default behaviour or separately for each code chunk to control how the output of the code chunks is inserted into the LATEX file. The working principle of Sweave is sketched in figure 2 and demonstrated in the following example. The parts between <<...>>= ...@ describe the code chunks which will be evaluated by S (R) and whose results will be woven if required into the output (again a LATEX of postscript document) or only available for later evaluation steps. The paragraphs between the code chunks will be processed as LATEX code for the output document.
944
Ulrich Mansmann, Markus Ruschhaupt, and Wolfgang Huber
Fig. 2. The working principle of Sweave
To obtain normalized expression measures from the microarray data, Huang et al. used Affymetrix’ Microarray Suite (MAS) Version 5 software.Additionally, they transformed the data to the logarithmic scale. Here, we use the function \Rf unction{mas5} in the \Rpackage{af f y} library. \Robject{eset} is an object of classexprSet, which comprises the normalized expression values as well as the tumour sample data. All the following analyses are based on this object. %%normalizing the affy batches <<normalizing,eval=FALSE>>= Huang.RE <- mas5(affy.batch.RE) exprs(Huang.RE) <- log2(exprs(Huang.RE)) @ So we have the following expression set for our further analysis <<show1>>= Huang.RE @
The Sweave output of this part is presented in figure 3.
4 Results The application of the compendium to data of microarray profiling study provides tools to answer crucial questions on the assessment of a new classification algorithm. A few aspects will be discussed. A new classification algorithm can be implemented by writing the specific wrapper function which is an easy exercise. Misclassification results can be calculated and compared to the results of a set of competing algorithms. The compendium presents classification results on the basis of the confusion matrix and individual classification. Individual classification based on a specific pre-processing strategy and classification algorithm can be presented for the whole sample graphically. figure 4 shows a vote plot which presents for each subject the percentage of correct classification in the determined
Reproducible Statistical Analysis in Microarray Profiling Studies
945
Fig. 3. Sweave output of example
Fig. 4. Vote plot for classification result using: reanalysis of the Huang et al. data [4]
number of CV loops. The first 34 patients belong to the group with no tumour recurrence. The remaining 14 patients suffered a relapse. A table giving a synopsis of all individual misclassifications can also be produced. Figure 5 shows all subjects who are misclassified by at least one of the classification strategies under consideration in the data of Huang et al. [4]. The individuals are ordered with respect to the number of classification strategies which lead to misclassification. Other presentations of the comparison are possible, for example a scatter plot to contrast individual MCRs between the new and the standard approaches. This idea can also be
946
Ulrich Mansmann, Markus Ruschhaupt, and Wolfgang Huber
Fig. 5. Reanalysis of Huang et al. data (RF - random forest, PAM - shrunken centroids, logReg penealized logistic regression, SVM - support vector machine, M - using metagenes)
implemented on the advanced level of interaction by writing the code for the respective figure. The Huang strategy misclassifies two of the 34 patients with good prognosis and three of the 18 patients with bad prognosis. The standard algorithms give results on correct classification below 80%. Why is it impossible for us to reproduce the good classification result with standard algorithms? No implementation for the Bayesian classification tree (BCT) is available and thus no direct comparison is possible. The algorithm of the BCT is not available and its description in a technical report does not give explicit advise for its implementation into software. Therefore, we tried a sensitivity analysis by using the intermediate interaction with the compendium. First, we excluded the the preproceeing form the inner CV loop, because the original paper [4] does not take care on the adjustment of the MCR for the pre-processing strategy. This did not improve classification quality in the expected way. Second, we reduced the data set to the twohundred most discriminating genes which introduces a huge selection bias. Only this way we could come up with 6-7 misclassifications which is still worse as the results in the original paper. Can we trust the original result? Our analysis reminds us to be quite critical to the original claims. Huang et al. [4] state that the genes found to be crucial for the classification are different from the genes found as crucial by van ’t Veer et al. [3]. What is the reason for the missing link between the Huang and van ’t Veer classifiers? How is it possible to disentangling computation, technology, and biology? The compendium takes care on the computational part. Additionally, the BCT and the van ’t Veer classifier need to be implemented and wrapper functions for both classification algorithms have to be written. Both wrapper and the respective pre-processing strategies will be applied to both data sets. For each data set correlation between the genes used in the classifiers can be calculated and visualised by a checkerboard figure. The checkerboard for genes
Reproducible Statistical Analysis in Microarray Profiling Studies
947
of both classifiers on the same data set allows to identify genes with similar expression behaviour and to study common biological aspects behind both classifiers. Comparing the checkerboards between both data sets gives hints on sample differences between both studies and discrepancies introduced by the different microarray technologies used.
5 Discussion The literature on the induction of prognostic profiles from microarray studies is a methodological wasteland. Ambroise and McLachlan [9] describe the unthorough use of cross-validation in a number of high-profile published microarray studies. Tibshirani and Efron [7] report the difficulty in reproducing a published analysis. Huang et al. [4] present results with the potential to revolutionize clinical practice in breast cancer treatment but use an ideosyncratic statistical method which is fairly complex and neither easy to implement nor to obtain as software. A series of papers published in Nature [3], NEJM [6], [17], and The Lancet [4], [5] base their impressive results on classification methods which were developed ad-hoc for the problem at hand. The global picture looks like: the MCRs reported are of questionable clinical relevance, reanalysis of a paper does not support the strong claims they made, results are not reproducible with respect to computation, validation, or biology. This situation has several implications: 1) It is nearly impossible to assess the value of the presented studies in terms of statistical quality and clinical impact. 2) Scientists looking for guidance to design similar studies are left puzzled by the plethora of methods. 3) It is left unclear how much potential there is for follow-up studies to incrementally improve on the results.
References 1. Microarray special. Statistical Science, 18:1-117, 2003. 2. Simon R., Rademacher M.D., Dobbin K., McShane L.M.: Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification. J Nat. Cancer Inst., 95: 14-18, 2003. 3. van ’t Veer L., Dai H., van de Vijver M.J., He Y.D., Hart A.A.M., Mao M., Petersen H.L., van de Kooy K., Marton M.J., Witteveen A.T., Schreiber G.J., Kerkhoven R.M., Roberts C., Linsley P.S., Bernards R. and Friend S.H.: Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415:530-536, 2002. 4. Huang E., Cheng S.H., Dressman H., Pittman J., Tsou M.H., Horng C.F., Bild A., Iversen E.S., Liao M., Chen C.M., West M., Nevins J.R. and Huang A.T.: Gene expression predictors of breast cancer outcomes. The Lancet, 361:1590-1596, 2003. 5. Chang J., Wooten E., Tsimelzon A., Hilsenbeck S., Gutierrez C, Elledge R., Mohsin S., Osborne K., Chamness G., Allred C., O’Connell P.: Gene expression profiling for the prediction of therapeutic response to docetaxel in patients with breast cancer. The Lancet, 362:362-369, 2003. 6. Bullinger L., D¨ohner K., Bair E., Fr¨ohling S., Schlenk R.F., Tibshirani R., D¨ohner H., Pollack J.R.: Use of Gene-Expression Profiling to Identify Prognostic Subclasses in Adult Acute Myeloid Leukemia. NEJM, 350:1605-1616, 2004. 7. Tibshirani R.J., Efron B.: Pre-validation and inference in microarrays. Statistical Applications in Genetics and Molecular Biology, 1:1, 2002.
8. Breiman L.: Statistical Modelling: The Two Cultures. Statistical Science, 16:199-231, 2001. 9. Ambroise C., McLachlan G.J.: Selection bias in gene extraction on the basis of microarray gene-expression data. Proc Natl Acad Sci, 99:6562-6566, 2002. 10. Brenton J.D., Caldas C.: Predictive cancer genomics - what do we need? The Lancet, 362:340341, 2003. 11. Leisch F., Rossini A.J.: Reproducible statistical research. Chance, 16:41-46, 2003. 12. Dudoit S., Fridlyand J., Speed T.P.: Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data. Journal of the American Statistical Association, 97:77-87, 2002. 13. Lee J.W., Korea University, Department of statistics, personal communication. 14. Ihaka R., Gentleman R.: R: A language for data analysis and graphics. Journal of Computational and Graphical Statistics, 5:299-314, 1996. 15. Gentleman R., Carey V.: Bioconductor. R News, 2(1):11-16, 2002. 16. Leisch F.: Dynamic generation of statistical reports using literate data analysis. Compstat 2002 - Proceedings in Computational Statistics, 575-580, 2002. 17. van de Vijver M.J., He Y.D., van ’t Veer L.J. et al.: A gene-expression signature as a predictor of survival in breast cancer. N Engl J Med, 347:1999-2009, 2002. 18. Tibshirani R., Hastie T., Narasimhan B., Chu G.: Class prediction by nearest shrunken centroids, with application to DNA microarrays. Statistical Science, 18:104-117, 2003. 19. Vapnik V.: The Nature of Statistical Learning Theory. Springer, New York (1999) 20. Breiman L.: Random Forests. Machine Learning Journal, 45:5-32, 2001. 21. Eilers P.H., Boer J.M., Van Ommen G.J., Van Houwelingen H.C.: Classification of Microarray Data with Penalized Logistic Regression. Proceedings of SPIE volume 4266:progress in biomedical optics and imaging, 2:187-198, 2001. 22. Carey V.J.: Literate Statistical Programming: Concepts and Tools. Chance, 14:46-50, 2001. 23. Sawitzki, G.: Keeping Statistics Alive in Documents. Computational Statistics, 17:65-88, 2002. 24. Leisch, F.: Dynamic generation of statistical reports using literate data analysis. Compstat 2002 - Proceedings in Computational Statistics, 575-580, 2002.
Step-Down FDR Procedures for Large Numbers of Hypotheses Paul N. Somerville Emeritus Professor of Statistics University of Central Florida 2805 N. Hwy. A1A, Unit 502 Indialantic, FL 32903 [email protected]
Abstract. Somerville (2004b) developed FDR step-down procedures which were particularly appropriate for cases where the number of false hypotheses was small. The test statistics were assumed to have a multivariate-t distribution with common correlation. MCV's (minimum critical values) were chosen so that 8 unique critical values resulted. Tables were given for numbers of hypotheses m ranging from 50 to 10,000, for ρ = 0, 0.5, and ν = 15, ∞. In this paper we extend the results, using MCV's resulting in 31 critical values. Tables are given for the same values of m, for ρ = 0, 0.1, 0.5 and ν = 15, ∞. Interpolation rules are given for m, ρ and ν. Use of larger numbers of critical values increases both the power and the number of hypotheses falsely rejected. When the expected number of false hypotheses is small, use of the procedures of this paper results in a reduced number of false rejections with a negligible reduction in power.
1 Introduction
There are many situations where a researcher is interested in the outcome of a family of tests. An example is the case where m experimental drugs are compared to a standard with respect to an outcome. Tests can be made of the m null hypotheses of no effect, each at a level α. This would be testing at a per-comparison error rate (PCER) of α. For m > 1, the probability of rejecting some null hypothesis when it was true would be larger than α. To protect from such situations arising from multiplicity, a standard procedure has been to use a family-wise error rate (FWER). In this case, the family-wise error rate would be the probability of rejecting at least one null hypothesis when in fact they are all true. A common single-step method to achieve the family-wise error rate is to use the Bonferroni procedure, discussed by R. A. Fisher (1935). Other single-step procedures include those of Scheffe (1953) for testing several linear combinations, Tukey (1953) for pair-wise comparisons and Dunnett (1955) for comparisons with a control. Multi-step (step-wise) FWER procedures have been developed which are more powerful than single-step procedures. One of the first was introduced by Naik (1975). Others include step-up and step-down procedures developed for comparisons with a control by Dunnett and Tamhane (1991, 1992). More recently, multi-step procedures which control the "false discovery rate" FDR (the expected proportion of "false discoveries" or Type I errors), say at level q, have been proposed.
One motivation was the under-utilization of FWER procedures because of their perceived lack of "power". Benjamini and Hochberg (1995) presented a step-up procedure, valid for independent test statistics. Benjamini and Liu (1999a) presented a step-down procedure valid under the same conditions. Troendle (2000) developed both step-up and step-down procedures which asymptotically control the FDR when the test statistics have a multivariate-t distribution. Benjamini and Liu (1999b) and Benjamini and Yekutieli (1999) gave distribution-free FDR procedures. Benjamini and Yekutieli (2001) showed that the procedure of Benjamini and Hochberg (1995) was also valid when the p-values were "positively dependent". Benjamini, Krieger and Yekutieli (2001) developed a two-stage step-up procedure. Sarkar (2002) has made important contributions. Somerville (2004b), using least favorable configurations, developed both step-up and step-down procedures which are valid for dependent or independent hypotheses. All calculations assume the test statistics have a joint multivariate-t distribution and a common correlation coefficient ρ. The concept of "Minimum Critical Value" (MCV) was introduced. When the MCV (see Section 2) is equal to 0, the step-down procedure is the same as that of Troendle (2000). Studies comparing the powers of various procedures by Horne and Dunnett (2003) and Somerville (2003) have shown the methods of Troendle (2000) and Somerville (2004b) to be among the most powerful. FDR procedures, while controlling the expected number of "false positives", have also been criticized for not controlling the actual number or proportion. van der Laan, Dudoit and Pollard (2004) generalize FWER procedures, introducing gFWER, which controls u, the number of Type I errors. They also introduce PFP to control γ, the proportion of "false positives". In this paper we introduce step-down FDR procedures which use large MCV's (Minimum Critical Values). These procedures not only control the expected FDR, but also result in values of P[u ≤ 2] which compare with those achieved by Korn et al. and van der Laan et al. when the number of false hypotheses is small, or where finding a small number of false hypotheses is a satisfactory outcome. Six tables of critical values dm−30 to dm are given for 50 ≤ m ≤ 10,000. Instead of using MCV = dm−30 and 31 "unique" critical values, the tables may be "truncated" by elimination of the smallest values. Reducing the number of "unique" critical values reduces the probability of "a or fewer" false positives, where the value of a is arbitrary.
2 Minimum Critical Values
Assuming a multivariate-t distribution of the test statistics with a common correlation coefficient ρ, Somerville (2003) observed that, in most situations, the smallest of the calculated FDR critical values was less than the commonly used critical value of tα,ν for comparing two means, where tα,ν is the value of Student's t with a type I error of α and ν degrees of freedom. In many situations it was negative, and when the number of tests was large, one or more of the critical values could be negatively infinite. This situation would certainly be unappealing to most users, and it motivated the concept of the "minimum critical value" (MCV). The "minimum critical value" is defined to be the smallest critical value which, when exceeded or equaled, would result in an hypothesis rejection. The "minimum critical value" may be more or less arbitrarily chosen. If the calculated value
for di is smaller than the chosen MCV, it is replaced by the MCV value, and di+1, di+2, ... are sequentially calculated, where di ≤ dj when i ≤ j. Somerville (2003, 2004a,b) made many calculations and found convincing empirical evidence, especially for large values of m, that if nF were the actual number of false hypotheses, there was an inverse relationship between nF and the value of the MCV which maximized the "power" of the FDR procedure. That is, if nF were large, the MCV for maximum power of the procedure should be small, and for small values of nF, the MCV should be large. In particular, Somerville observed that, for the values of m studied, choosing the value of the MCV such that the number of "unique" critical values was equal to nF resulted in "powers" near the maximum. An approximately equivalent result could be obtained by automatically "truncating" the step-down FDR process after nF steps. It may be noted (Somerville (2003, 2004b)) that if the MCV is equal to the corresponding critical value for the single-step test, all the critical values of the FDR procedure are equal.
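A small illustration of the MCV replacement rule in R (placeholder numbers, not real FDR critical values; in the actual procedure the subsequent di are recalculated after each replacement):

d   <- sort(rnorm(100))          # placeholder for calculated critical values d1 <= ... <= dm
MCV <- d[100 - 30]               # e.g. choosing MCV = d[m-30]
d.mcv <- pmax(d, MCV)            # any di smaller than MCV is replaced by MCV
length(unique(d.mcv))            # 31 "unique" critical values remain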
3 Tables for the FDR Step-Down Procedure when the Number of False Hypotheses Is Small
Tables 2, 3 and 4 give step-down FDR critical values for m ranging from 50 to 10,000 when ν = 15 or ∞, q = .05 for the procedure, and ρ = 0, .1 and .5 respectively, for one-sided hypotheses. For each combination of m, ρ and ν, MCV was chosen as the smallest value such that d1 = d2 = ... = dm−30 = MCV results in an FDR less than or equal to q. There are thus, for each m in each table, exactly 31 "unique" critical values, a more or less arbitrarily chosen "small" number. There may be error in the third decimal place in the tables. Critical values for m not included in the tables can be obtained by interpolation (or extrapolation if m < 50,000, say). Each critical value di in a table is approximately linear in ln(m). Critical values for 15 < ν < ∞ can be obtained by linear interpolation in 1/ν. Critical values for 0 < ρ < .5 can be approximated using quadratic interpolation. It is worth noting that as ρ → 1, all critical values are equal to zq (or tq), the value of the normal (or t) variate which is exceeded a proportion q of the time.
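The interpolation rules in m and ν can be sketched as follows (the tabulated numbers below are placeholders, not values from Tables 2-4):

m.tab <- c(50, 100, 500, 1000, 5000, 10000)
d.tab <- c(2.1, 2.3, 2.8, 3.0, 3.4, 3.6)           # placeholder critical values
approx(log(m.tab), d.tab, xout = log(2000))$y       # linear interpolation in ln(m) for m = 2000

d.nu15 <- 3.2; d.nuInf <- 3.0                       # placeholder values for nu = 15 and nu = Inf
nu <- 30
d.nuInf + (d.nu15 - d.nuInf) * (1/nu) / (1/15)      # linear interpolation in 1/nu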
4 Calculation of Critical Values and Powers Fortran 90 programs SEQDN and SEQUP can sequentially calculate the critical values d2 to dm for step-down and step-up FDR, respectively, for arbitrary values of m, q, ρ, and ν. N (10^5, 10^6 or 10^7) random normal multivariate vectors of size m were used to obtain each critical value. Fortran 90 programs FDRPWRDN and FDRPWRUP calculate powers, E(Q), P[u ≤ 1, 2, . . . , 7], and P[γ ≤ .05, .10, .15] where u and γ are the number and proportion of false discoveries respectively. Inputs are m, ρ, ν, ∆, nF and a set of m critical values. ∆ is the common standardized mean of the test statistics corresponding to the nF false hypotheses. N random normal multivariate vectors of size m are used. Three kinds of power are always calculated: per pair, all pairs and any pair (see Horn and Dunnett
(2004)), and also E(Q). The probability of rejecting at least one of the false hypotheses is called the any-pair power. The probability of rejecting all false hypotheses is called the all-pairs power. Considering a specific hypothesis, the probability of its rejection is called the per-pair power. Since our calculations assume that all the test statistics corresponding to the false hypotheses have the same location parameter, the per-pair power is identical to the average power.
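The power calculation can be sketched in a few lines. The following is a rough stand-in for the FDRPWRDN computation, written by me under the simplifying assumptions of the normal (ν = ∞) case, a one-factor model for the common correlation ρ, and the usual step-down convention of comparing the k-th largest statistic with the k-th largest critical value; it is not the authors' Fortran code.

    import numpy as np

    rng = np.random.default_rng(0)

    def step_down_rejections(stats, crit):
        # Compare the k-th largest statistic with the k-th largest critical
        # value (d_m, d_{m-1}, ...); stop at the first non-rejection.
        m = len(stats)
        order = np.argsort(stats)[::-1]      # indices, largest statistic first
        crit_desc = np.sort(crit)[::-1]      # d_m >= d_{m-1} >= ...
        reject = np.zeros(m, dtype=bool)
        for k in range(m):
            if stats[order[k]] >= crit_desc[k]:
                reject[order[k]] = True
            else:
                break
        return reject

    def simulate_powers(m, n_false, delta, rho, crit, n_rep=2000):
        # Crude Monte Carlo estimates of any-pair, all-pairs and per-pair power.
        any_pair = all_pairs = per_pair = 0.0
        false_idx = np.arange(n_false)       # first n_false hypotheses are false
        for _ in range(n_rep):
            common = rng.standard_normal()
            z = np.sqrt(rho) * common + np.sqrt(1.0 - rho) * rng.standard_normal(m)
            z[false_idx] += delta            # shift the false hypotheses by Delta
            hits = step_down_rejections(z, crit)[false_idx]
            any_pair += hits.any()
            all_pairs += hits.all()
            per_pair += hits.mean()
        return any_pair / n_rep, all_pairs / n_rep, per_pair / n_rep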
5 Example Korn, Troendle, McShane and Simon (2003) recently proposed two new procedures which control, with specified confidence, the actual number of false discoveries and the actual proportion of false discoveries, respectively. They applied their procedures to analyze a microarray dataset consisting of measurements on approximately 9000 genes in paired tumor specimens, collected both before and after chemotherapy on 20 breast cancer patients. Their study, after elimination of cases of missing data, included 8029 genes for analysis. The object was to simultaneously test the null hypotheses that the mean pre- and post-chemotherapy expression of the genes was the same. Their Procedure A identified 28 genes where u, the number of false discoveries, was ≤ 2, with confidence .05. Procedure B identified the same 28 genes where γ, the false discovery proportion, was ≤ .10 with approximate confidence .95.

Table 1. Table A: Number of genes identified by procedures (m = 8029, q = 0.05). Table B: Minimum values of P[u ≤ 2] (m = 8029, q = .05, ν = 3.464). Rough approximate value of nF at minimum in parentheses.

Table A
            ν = ∞                      ν = 15
MCV       ρ = .0    ρ = .1    ρ = .5   ρ = .0
dm−7        20        21        29       24
dm−11       23        23        33       27
dm−15       24        24        35       29
dm−19       27        27        38       31
dm−23       28        29        40       33
dm−27       29        29       > 50      33
dm−30       29        33       > 50      36

Table B
            ν = ∞                            ν = 15
MCV       ρ = .0     ρ = .1     ρ = .5     ρ = .0
dm−7     .99(40)    .96(70)    .95(50)    .90(50)
dm−11    .98(50)    .94(70)    .93(55)    .90(50)
dm−15    .96(70)    .91(70)    .92(55)    .89(80)
dm−19    .92(80)    .88(70)    .91(60)    .89(80)
dm−23    .88(80)    .85(70)    .91(60)    .88(90)
dm−27    .83(80)    .82(80)    .90(60)    .88(90)
dm−30    .78(90)    .78(90)    .87(70)    .88(100)
Tables 2, 3 and 4 were used to test the same hypotheses, "truncating" each table to utilize exactly 8, 12, 16, 20, 24, 28 and 31 "unique" critical values ("truncating" to exactly 12 "unique" critical values would mean setting the MCV equal to dm−11 and setting di = MCV for i < m − 11). Assuming 15 degrees of freedom for each of the 8029 tests, the corresponding FDR step-down critical values were obtained using the Fortran program SEQDN. Table 1A indicates the number of genes identified by the procedures, and Table 1B illustrates the control of the number of false positives, for m = 8029. The graph of P[u ≤ 2] vs. nF is (shallow) bowl shaped. The value in parentheses is a very rough estimate of the value of nF at the minimum.
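The "truncation" just described is mechanical; a sketch of it (mine, not the SEQDN/FDRPWRDN code) is given below, where table_column stands for a hypothetical column of 31 tabulated values.

    def truncate_critical_values(unique_vals, m, k):
        # Build the full vector (d_1, ..., d_m) from the tabulated "unique"
        # critical values, truncated to exactly k unique values: the MCV
        # becomes d_{m-k+1} and every earlier d_i is set equal to it.
        kept = list(unique_vals)[-k:]        # d_{m-k+1}, ..., d_m (increasing)
        mcv = kept[0]
        return [mcv] * (m - k) + kept

    # e.g. keep 12 unique values out of the 31 tabulated ones for m = 8029 tests:
    # d = truncate_critical_values(table_column, m=8029, k=12)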
Table 2. Step-down FDR Critical Values (31 “unique") for ρ = 0 and q = 0.05 (the first columns should be read as di where i = m, m − 1, . . . , m − 30, . . . , 1)
Table C (ρ = 0, ν = 15, q = 0.05); columns m = 50, 100, 250, 500, 1000, 2500, 5000, 10000
m     3.632 3.928 4.306 4.579 4.842 5.177 5.420 5.656
m−1   3.332 3.647 4.045 4.332 4.614 4.962 5.214 5.464
m−2   3.140 3.467 3.876 4.173 4.458 4.822 5.081 5.340
m−3   2.999 3.335 3.753 4.054 4.345 4.710 4.976 5.225
Table D (ρ = 0, ν = ∞, q = 0.05); columns m = 50, 100, 250, 500, 1000, 2500, 5000, 10000
m     3.083 3.283 3.533 3.713 3.884 4.102 4.259 4.412
m−1   2.867 3.081 3.346 3.537 3.714 3.938 4.103 4.266
m−2   2.731 2.958 3.234 3.426 3.613 3.842 4.009 4.172
m−3   2.629 2.865 3.150 3.349 3.538 3.772 3.940 4.106
m−4
2.885 3.231 3.656 3.961 4.254 4.630 4.899 5.150
m−4
2.545 2.791 3.084 3.286 3.477 3.713 3.887 4.054
m−5
2.789 3.141 3.573 3.884 4.182 4.560 4.827 5.100
m−5
2.474 2.729 3.027 3.234 3.428 3.672 3.842 4.011 2.412 2.676 2.980 3.190 3.387 3.635 3.804 3.974
m−6
2.705 3.065 3.503 3.816 4.066 4.444 4.723 4.979
m−6
m−7
2.630 2.999 3.443 3.760 4.066 4.444 4.723 4.979
m−7
2.354 2.626 2.938 3.150 3.350 3.596 3.771 3.941
m−8
2.561 2.937 3.389 3.705 4.013 4.392 4.672 4.941
m−8
2.302 2.585 2.899 3.116 3.317 3.566 3.742 3.915
m−9 2.499 2.885 3.338 3.660 3.967 4.357 4.631 4.912 m−10 2.441 2.835 3.293 3.617 3.929 4.306 4.585 4.847 m−11 2.383 2.787 3.249 3.577 3.886 4.283 4.567 4.834
m−9 2.254 2.543 2.865 3.084 3.286 3.543 3.717 3.887 m−10 2.207 2.507 2.835 3.054 3.260 3.511 3.694 3.865 m−11 2.165 2.470 2.806 3.029 3.235 3.488 3.669 3.842
m−12 2.333 2.743 3.213 3.543 3.855 4.244 4.527 4.803 m−13 2.282 2.702 3.176 3.504 3.824 4.220 4.488 4.774 m−14 2.233 2.666 3.144 3.474 3.784 4.182 4.459 4.746
m−12 2.122 2.440 2.777 3.003 3.213 3.468 3.655 3.824 m−13 2.082 2.407 2.753 2.980 3.190 3.448 3.633 3.803 m−14 2.044 2.379 2.730 2.960 3.169 3.428 3.613 3.788
m−15 2.187 2.627 3.111 3.446 3.759 3.153 4.441 4.735 m−16 2.141 2.591 3.081 3.416 3.731 4.124 4.426 4.677 m−17 2.079 2.559 3.051 3.388 3.704 4.113 4.394 4.677
m−15 2.006 2.352 2.705 2.937 3.151 3.411 3.602 3.773 m−16 1.969 2.324 2.684 2.919 3.133 3.399 3.582 3.757 m−17 1.932 2.300 2.666 2.901 3.117 3.376 3.566 3.743
m−18 2.052 2.526 3.025 3.362 3.679 4.078 4.359 4.645 m−19 2.010 2.495 3.000 3.338 3.669 4.063 4.346 4.625 m−20 1.969 2.464 3.973 3.315 3.636 4.025 4.329 4.602
m−18 1.897 2.274 2.644 2.881 3.099 3.365 3.557 3.728 m−19 1.861 2.251 2.625 2.867 3.084 3.354 3.537 3.719 m−20 1.826 2.223 2.609 2.849 3.069 3.334 3.526 3.704
m−21 1.926 2.436 2.949 3.292 3.605 4.024 4.303 4.577 m−22 1.884 2.406 2.926 3.272 3.592 3.994 4.279 4.567 m−23 1.842 2.380 2.905 3.250 3.570 3.980 4.249 4.532
m−21 1.791 2.206 2.590 2.833 3.055 3.323 3.516 3.693 m−22 1.755 2.181 2.573 2.821 3.041 3.308 3.502 3.685
m−24 1.801 2.352 2.883 3.227 3.554 3.952 4.238 4.532 m−25 1.759 2.328 2.861 3.207 3.538 3.937 4.221 4.507 m−26 1.716 2.302 2.839 3.192 3.513 3.925 4.203 4.494 m−27 1.670 2.275 2.822 3.169 3.493 3.890 4.185 4.457 m−28 1.625 2.249 2.797 3.152 3.465 3.875 4.185 4.445
m−23 1.720 2.160 2.557 2.804 3.029 3.308 3.484 3.672 m−24 1.685 2.138 2.541 2.792 3.016 3.282 3.484 3.660 m−25 1.648 2.121 2.531 2.780 3.003 3.275 3.469 3.654 m−26 1.621 2.098 2.509 2.764 2.994 3.269 3.455 3.644 m−27 1.549 2.066 2.490 2.746 2.976 3.249 3.443 3.628
m−29 1.625 2.243 2.790 3.138 3.454 3.842 4.136 4.408 m−30 1.464 2.152 2.712 3.062 3.385 3.784 4.065 4.343
m−28 1.549 2.066 2.490 2.746 2.969 3.249 3.443 3.628 m−29 1.549 2.066 2.490 2.746 2.969 3.249 3.443 3.628 m−30 1.397 1.991 2.437 2.700 2.935 3.212 3.408 3.594
. . . . . . . . 1.464 2.152 2.712 3.062 3.385 3.784 4.065 4.343 1 ln(m) 3.912 4.605 5.521 6.215 6.908 7.824 8.517 9.210
. . . . . . . . . 1.397 1.991 2.437 2.700 2.935 3.212 3.408 3.594 1 ln(m) 3.912 4.605 5.521 6.215 6.908 7.824 8.517 9.210
6 Summary and Conclusion Step-down FDR procedures have been developed for the case where there are many hypotheses. The procedures are particularly appropriate when relatively few hypotheses are false, or where obtaining a limited number of “discoveries" is satisfactory. For 50 ≤ m ≤ 10, 000 and q = .05, tables of critical values of the test statistics are presented for ρ = 0, .1 and .5 when ν = ∞, or ν = 15. There are 31 “unique" tabulated critical values, but the tables may be “truncated". Setting MCV equal to dm results in a single unique critical value and a single step procedure. This choice yields the largest P [u ≤ a] where a is arbitrary. Choosing the appropriate value for MCV is an attempt to balance the conflicting goals of rejecting all false hypotheses and not rejecting those hypotheses which are true. Empirical results suggest that “powers" are maximized when the MCV is chosen so that the number of “unique" critical values is equal to nF (of course almost always unknown). Some additional observations and conclusions are: 1. i) P [u ≤ a] decreases as the number of unique critical values used increases. Thus the number of critical values can be chosen so as to control P [u ≤ a].
Table 3. Step-down FDR Critical Values (31 "unique") for ρ = 0.1 and q = 0.05 (the first columns should be read as di where i = m, m − 1, . . . , m − 30, . . . , 1)
Table E (ρ = 0.1, ν = 15, q = 0.05); columns m = 50, 100, 250, 500, 1000, 2500, 5000, 10000
m     3.580 3.860 4.217 4.474 4.719 5.033 5.260 5.479
m−1   3.295 3.597 3.975 4.245 4.505 4.831 5.074 5.296
m−2   3.113 3.427 3.816 4.097 4.364 4.703 4.947 5.171
Table F (ρ = 0.1, ν = ∞, q = 0.05); columns m = 50, 100, 250, 500, 1000, 2500, 5000, 10000
m     3.070 3.268 3.516 3.693 3.864 4.081 4.235 4.386
m−1   2.858 3.072 3.335 3.522 3.700 3.924 4.086 4.242
m−2   2.724 2.949 3.224 3.416 3.599 3.829 3.995 4.153
m−3 2.987 3.300 3.701 3.985 4.261 4.597 4.848 5.092 m−4 2.867 3.201 3.608 3.899 4.176 4.536 4.777 5.018
m−3
2.623 2.859 3.141 3.339 3.526 3.759 3.929 4.091
m−5 2.774 3.115 3.530 3.825 4.107 4.459 4.711 4.957 m−6 2.693 3.041 3.465 3.764 4.050 4.406 4.662 4.900 m−7 2.619 2.980 3.405 3.706 3.997 4.355 4.597 4.853 m−8 2.552 2.918 3.356 3.658 3.934 4.302 4.581 4.822
m−4
2.542 2.786 3.076 3.277 3.468 3.704 3.880 4.041
m−5
2.471 2.725 3.021 3.226 3.420 3.660 3.832 3.999
m−6
2.409 2.670 2.974 3.181 3.377 3.621 3.795 3.963
m−7
2.352 2.623 2.932 3.142 3.343 3.588 3.763 3.932
m−8
2.300 2.580 2.895 3.107 3.309 3.558 3.735 3.907
m−9 2.491 2.868 3.306 3.612 3.906 4.276 4.522 4.778 m−10 2.434 2.820 3.264 3.572 3.851 4.223 4.507 4.742 m−11 2.379 2.773 3.220 3.535 3.834 4.202 4.451 4.722
m−9 2.253 2.541 2.862 3.077 3.280 3.531 3.708 3.876 m−10 2.207 2.503 2.829 3.049 3.255 3.503 3.685 3.858 m−11 2.164 2.469 2.801 3.023 3.227 3.481 3.662 3.834
m−12 2.328 2.730 3.186 3.502 3.796 4.159 4.428 4.691 m−13 2.279 2.691 3.150 3.466 3.759 4.144 4.395 4.650
m−12 2.122 2.437 2.774 2.998 3.205 3.461 3.643 3.817 m−13 2.084 2.407 2.749 2.975 3.185 3.442 3.625 3.797
m−14 2.231 2.655 3.119 3.434 3.742 4.102 4.395 4.623 m−15 2.186 2.617 3.086 3.406 3.708 4.066 4.350 4.623 m−16 2.141 2.583 3.058 3.380 3.679 4.054 4.333 4.573
m−14 2.045 2.377 2.725 2.954 3.161 3.424 3.609 3.778 m−15 2.008 2.350 2.703 2.934 3.147 3.403 3.586 3.766
m−17 2.098 2.549 3.028 3.350 3.648 4.054 4.299 4.556 m−18 2.054 2.519 3.003 3.330 3.641 4.001 4.281 4.545
m−16 1.970 2.324 2.681 2.914 3.127 3.389 3.572 3.749 m−17 1.937 2.298 2.661 2.896 3.110 3.372 3.560 3.735 m−18 1.899 2.274 2.641 2.878 3.096 3.359 3.542 3.720
m−19 2.010 2.486 2.980 3.307 3.601 3.986 4.253 4.545 m−20 1.973 2.456 2.952 3.278 3.601 3.958 4.237 4.495 m−21 1.929 2.432 2.929 3.260 3.563 3.958 4.213 4.484
m−19 1.866 2.250 2.623 2.862 3.078 3.343 3.533 3.710 m−20 1.830 2.226 2.605 2.845 3.064 3.331 3.513 3.698 m−21 1.795 2.204 2.587 2.831 3.052 3.313 3.506 3.680
m−22 1.888 2.400 2.908 3.238 3.538 3.919 4.213 4.440 m−23 1.847 2.375 2.884 3.218 3.527 3.913 4.188 4.439 m−24 1.807 2.346 2.863 3.195 3.501 3.889 4.162 4.419
m−22 1.760 2.182 2.572 2.813 3.033 3.309 3.491 3.672 m−23 1.727 2.161 2.555 2.803 3.025 3.289 3.488 3.663 m−24 1.691 2.138 2.539 2.786 3.008 3.278 3.464 3.652
m−25 1.765 2.324 2.841 3.176 3.494 3.865 4.136 4.419 m−26 1.720 2.297 2.820 3.156 3.457 3.853 4.124 4.396 m−27 1.682 2.269 2.801 3.135 3.439 3.833 4.101 4.355
m−25 1.654 2.121 2.523 2.773 2.997 3.269 3.459 3.639 m−26 1.622 2.099 2.510 2.760 2.985 3.259 3.449 3.630
m−28 1.634 2.244 2.776 3.116 3.425 3.806 4.094 4.350 m−29 1.622 2.229 2.760 3.099 3.425 3.780 4.046 4.307 m−30 1.453 2.133 2.681 3.020 3.325 3.708 3.978 4.240 . . . . . . . . 1.453 2.133 2.681 3.020 3.325 3.708 3.978 4.240 1 ln(m) 3.912 4.605 5.521 6.215 6.908 7.824 8.517 9.210
m−27 1.574 2.074 2.492 2.747 2.969 3.243 3.439 3.623 m−28 1.544 2.059 2.480 2.734 2.969 3.239 3.426 3.607 m−29 1.544 m−30 1.388 . . 1.388 1 ln(m) 3.912
2.059 2.480 2.733 2.956 3.233 3.424 3.593 1.978 2.420 2.681 2.912 3.187 3.381 3.568 . . . . . . . 1.978 2.420 2.681 2.912 3.187 3.381 3.568 4.605 5.521 6.215 6.908 7.824 8.517 9.210
2. ii) The number of hypotheses rejected increases with the number of "unique" critical values used in the step-down procedure. The probability of false rejection also increases. 3. iii) For sufficiently large nF, as nF increases P[u ≤ 2] increases and approaches 1. 4. iv) The tables have been calculated under the assumption that the test statistics have a joint multivariate-t distribution with common correlation coefficient ρ. However, the critical values tabulated for the test statistic can be converted to critical p-values. These converted p-values may be useful for cases where the joint distribution of the test statistics is other than multivariate-t or where any individual test is one- or two-sided. 5. v) Use of the step-down FDR procedure of this paper is an alternative to Procedure A of Korn et al. and the generalized procedure of van der Laan, Dudoit and Pollard. 6. vi) Underestimating ρ results in the identification of fewer false hypotheses. 7. vii) Use of Tables 2, 3 and 4 can make using step-down FDR procedures simple and easily manageable, though requiring some normality and correlation assumptions. The procedures control the FDR, and the number of false positives can be controlled by the "truncation" choice. Critical values for parameters not given by the tables can be obtained using the Fortran program SEQDN.
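Observation iv) can be carried out with any t or normal distribution routine; a small sketch (mine, using SciPy rather than the authors' Fortran) is:

    from scipy import stats

    def critical_values_to_pvalues(crit, nu=None):
        # One-sided critical p-values: p_i = P(T > d_i) under the null,
        # using Student's t with nu degrees of freedom, or the normal
        # distribution when nu is None (nu = infinity).
        if nu is None:
            return [stats.norm.sf(d) for d in crit]
        return [stats.t.sf(d, df=nu) for d in crit]

    # The step-down test can then compare the k-th smallest p-value with the
    # k-th smallest critical p-value instead of working on the t scale.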
Table 4. Step-down FDR Critical Values (31 "unique") for ρ = 0.5 and q = 0.05 (the first columns should be read as di where i = m, m − 1, . . . , m − 30, . . . , 1)
Table G (ρ = 0.5, ν = 15, q = 0.05); columns m = 50, 100, 250, 500, 1000, 2500, 5000, 10000
m     3.243 3.452 3.711 3.896 4.067 4.294 4.455 4.608
m−1   3.039 3.266 3.545 3.738 3.929 4.155 4.322 4.483
m−2   2.902 3.144 3.432 3.634 3.826 4.057 4.242 4.409
Table H (ρ = 0.5, ν = ∞, q = 0.05); columns m = 50, 100, 250, 500, 1000, 2500, 5000, 10000
m     2.883 3.047 3.252 3.393 3.528 3.696 3.816 3.943
m−1   2.714 2.894 3.113 3.272 3.416 3.599 3.717 3.848
m−2   2.605 2.798 3.025 3.183 3.332 3.522 3.647 3.777
m−3 2.798 3.048 3.349 3.558 3.750 3.990 4.164 4.328 m−4 2.713 2.974 3.284 3.497 3.693 3.943 4.118 4.303
m−3
2.522 2.724 2.962 3.119 3.272 3.463 3.599 3.725
m−5 2.641 2.909 3.223 3.444 3.648 3.905 4.076 4.255 m−6 2.575 2.853 3.179 3.397 3.607 3.865 4.038 4.191 m−7 2.514 2.805 3.129 3.358 3.563 3.812 4.004 4.161 m−8 2.460 2.755 3.096 3.314 3.528 3.786 3.972 4.133
m−4
2.454 2.665 2.905 3.073 3.228 3.418 3.558 3.686
m−5
2.394 2.612 2.866 3.036 3.189 3.383 3.527 3.657
m−6
2.340 2.566 2.823 3.003 3.155 3.356 3.497 3.629
m−7
2.291 2.525 2.792 2.967 3.127 3.332 3.472 3.606
m−8
2.249 2.489 2.759 2.938 3.103 3.306 3.448 3.584
m−9 2.410 2.717 3.056 3.289 3.494 3.758 3.944 4.106 m−10 2.362 2.678 3.026 3.256 3.468 3.732 3.918 4.080 m−11 2.316 2.642 2.989 3.233 3.444 3.709 3.894 4.056
m−9 2.205 2.456 2.729 2.912 3.080 3.283 3.426 3.564 m−10 2.164 2.424 2.704 2.887 3.058 3.261 3.406 3.546 m−11 2.129 2.394 2.676 2.864 3.037 3.241 3.388 3.528
m−12 2.274 2.606 2.963 3.199 3.420 3.690 3.873 4.033 m−13 2.232 2.571 2.937 3.181 3.397 3.670 3.853 4.011
m−12 2.093 2.367 2.655 2.843 3.017 3.223 3.370 3.512 m−13 2.059 2.338 2.635 2.823 2.999 3.206 3.354 3.497
m−14 2.189 2.543 2.913 3.145 3.375 3.642 3.835 3.991 m−15 2.152 2.514 2.878 3.131 3.354 3.623 3.819 3.972 m−16 2.111 2.478 2.864 3.103 3.333 3.605 3.803 3.954
m−14 2.022 2.313 2.610 2.804 2.981 3.190 3.339 3.483 m−15 1.990 2.290 2.594 2.786 2.964 3.175 3.325 3.469
m−17 2.077 2.452 2.836 3.095 3.313 3.588 3.789 3.938 m−18 2.058 2.428 2.815 3.064 3.293 3.571 3.777 3.923
m−16 1.959 2.262 2.576 2.769 2.948 3.162 3.312 3.457 m−17 1.925 2.243 2.553 2.753 2.932 3.149 3.300 3.445 m−18 1.891 2.218 2.539 2.738 2.917 3.136 3.288 3.433
m−19 1.995 2.395 2.797 3.045 3.274 3.555 3.760 3.909 m−20 1.967 2.369 2.769 3.026 3.256 3.539 3.749 3.896 m−21 1.924 2.352 2.755 3.019 3.238 3.523 3.729 3.884
m−19 1.864 2.205 2.523 2.723 2.903 3.124 3.276 3.422 m−20 1.831 2.175 2.499 2.708 2.889 3.113 3.265 3.411 m−21 1.800 2.154 2.491 2.694 2.876 3.101 3.254 3.400
m−22 1.884 2.323 2.738 2.991 3.221 3.507 3.717 3.873 m−23 1.849 2.296 2.710 2.973 3.205 3.491 3.695 3.842 m−24 1.811 2.269 2.695 2.957 3.190 3.474 3.676 3.816
m−22 1.767 2.136 2.469 2.679 2.863 3.090 3.243 3.390 m−23 1.734 2.113 2.460 2.665 2.851 3.078 3.232 3.379 m−24 1.697 2.092 2.442 2.650 2.838 3.066 3.221 3.368
m−25 1.770 2.250 2.671 2.936 3.175 3.458 3.659 3.794 m−26 1.726 2.221 2.655 2.915 3.160 3.440 3.644 3.776 m−27 1.685 2.192 2.634 2.901 3.145 3.421 3.626 3.763
m−25 1.670 2.071 2.419 2.635 2.826 3.054 3.209 3.357 m−26 1.630 2.053 2.411 2.620 2.814 3.041 3.197 3.346
m−28 1.639 2.162 2.606 2.872 3.130 3.401 3.611 3.754 m−29 1.585 2.132 2.578 2.857 3.115 3.380 3.573 3.750 m−30 1.396 2.015 2.486 2.762 3.003 3.305 3.499 3.689 . . . . . . . . 1.396 2.015 2.486 2.762 3.003 3.305 3.499 3.689 1 ln(m) 3.912 4.605 5.521 6.215 6.908 7.824 8.517 9.210
m−27 1.594 2.029 2.390 2.603 2.803 3.027 3.184 3.334 m−28 1.554 2.004 2.375 2.586 2.782 3.013 3.170 3.322 m−29 1.512 m−30 1.335 . . 1.335 1 ln(m) 3.912
1.997 2.356 2.569 2.767 2.997 3.154 3.308 1.879 2.278 2.505 2.703 2.936 3.096 3.245 . . . . . . . 1.879 2.278 2.505 2.703 2.936 3.096 3.245 4.605 5.521 6.215 6.908 7.824 8.517 9.210
References
1. Benjamini, Y., Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Statist. Soc. B, 289-300.
2. Benjamini, Y., Liu, W. (1999). A step-down multiple hypotheses testing procedure that controls the false discovery rate under independence. JSPI, 163-170.
3. Benjamini, Y., Liu, W. (2001). A distribution-free multiple-test procedure that controls the false discovery rate. Manuscript available from FDR Website of Y. Benjamini.
4. Benjamini, Y., Yekutieli, D. (2001). The control of the false discovery rate in multiple testing under dependency. Annals of Statistics 29, 1165-1188.
5. Benjamini, Y., Krieger, A., and Yekutieli, D. (2001). Two stage linear step up FDR controlling procedure. Manuscript available from FDR Website of Y. Benjamini.
6. Dunnett, C. W. (1955). A multiple comparison procedure for comparing several treatments with a control. J. Amer. Statist. Assoc., 60, 1096-1121.
7. Dunnett, C. W., and Tamhane, A. C. (1991). Step-down multiple tests for comparing treatments with a control in unbalanced one-way layouts. Statistics in Medicine, 10, 939-947.
8. Dunnett, C. W., and Tamhane, A. C. (1992). A step-up multiple test procedure. J. Amer. Statist. Assoc. 87, 162-170.
9. Horn, M., Dunnett, C. W. (2004). Power and sample size comparisons of stepwise FWE and FDR controlling test procedures in the normal many-one case. Recent Developments in Multiple Comparison, edited by Y. Benjamini, S. Sarkar and F. Bretz. IMS Lecture Notes Monograph Series. 48-64.
10. Korn, Edward L., Troendle, James F., McShane, Lisa M. and Simon, Richard (2003). Controlling the number of false discoveries: Application to high-dimensional genomic data. Journal of Statistical Planning and Inference 124, issue 2, 379-398.
11. Naik, U. D. (1975). Some selection rules for comparing p processes with a standard. Commun. Statist., Ser. A, 4, 519-535.
12. Sarkar, S. K. (2002). Some results on false discovery rate in stepwise multiple testing. Annals of Statistics 30, 239-257.
13. Scheffe, H. (1953). A method for judging all contrasts in the analysis of variance. Biometrika 40, 87-104.
14. Somerville, Paul N. (2003). "Optimum" FDR procedures and MCV values. Technical report TR-03-01, Department of Statistics, University of Central Florida.
15. Somerville, Paul N. (2004a). A step-down procedure for large numbers of hypotheses. Paper presented at the annual meetings of the Florida Chapter of the ASA.
16. Somerville, Paul N. (2004b). FDR step-down and step-up procedures for the correlated case. Recent Developments in Multiple Comparison, edited by Y. Benjamini, S. Sarkar and F. Bretz. IMS Lecture Notes Monograph Series. 100-118.
17. Tukey, J. W. (1953). The problem of multiple comparisons. Department of Statistics, Princeton University.
18. van der Laan, Mark J., Dudoit, Sandrine and Pollard, Katherine S. (2004). Multiple testing. Part II. Step-down procedures for control of the generalized family-wise error rate. Statistical Applications in Genetics and Molecular Biology: Vol. 3: No. 1, Article 13.
Applying Security Engineering to Build Security Countermeasures: An Introduction Organizers: Tai-hoonn Kim1 and Ho-yeol Kwon2 1
San-7, Geoyeo-Dong, Songpa-Gu, Seoul, Korea [email protected] 2 Gangwon University, 192-1, Hyoja2-Dong Chunchon, Kangwon-Do, 200-701, Korea [email protected]
Abstract. The general systems of today are composed of a number of components such as servers and clients, protocols, services, and so on. Systems connected to a network have become more complex and wide-ranging, but research on such systems has focused on 'performance' or 'efficiency'. While most of the attention in system security has been devoted to encryption technology and protocols for securing data transactions, it is critical to note that a weakness (or security hole) in any one of the components may compromise the whole system. Security engineering is needed to reduce the security holes that may be included in software; therefore, more security-related research is needed to reduce the security weaknesses that may be included in software. This paper introduces some methods for reducing threats to a system by applying security engineering, and proposes a method for building security countermeasures.
1 Introduction Many products, systems, and services are needed and used to protect information. The focus of security engineering has expanded from one primarily concerned with safeguarding classified government data to broader applications including financial transactions, contractual agreements, personal information, and the Internet. These trends have elevated the importance of security engineering [1]. There are many security-related standards, methods and approaches, and their common objective is to assure the security of IT systems or software products. ISO/IEC TR 15504, the Software Process Improvement and Capability Determination (SPICE), provides a framework for the assessment of software processes, and it has some considerations for security, although they are relatively weak compared to other criteria; for example, considerations for security related to software development and to the developer are lacking. Considerations for security are expressed well in some evaluation criteria such as ISO/IEC 21827, the Systems Security Engineering Capability Maturity Model (SSE-CMM), and ISO/IEC 15408, the Common Criteria (CC) [2][3][4][5][6]. It is essential not only that the customer's requirements for software functionality be satisfied, but also that the security requirements imposed on the software development be effectively analyzed and implemented so as to contribute to the security objectives of the customer's requirements. Unless suitable requirements are established at the start of the software development process, the resulting end product,
however well engineered, may not meet the objectives of its anticipated consumers. IT products such as firewalls, IDSs (Intrusion Detection Systems) and VPNs (Virtual Private Networks) are made to perform special functions related to security and are used to supply security characteristics. However, relying on these products alone may not be a perfect solution. Therefore, when building software products of any kind, security-related requirements must be considered.
2 A Summary of Security Engineering Security engineering is focused on the security requirements for implementing security in software or related systems. In fact, the scope of security engineering is very wide and encompasses:
– the security engineering activities for secure software or a trusted system, addressing the complete lifecycle of concept definition, analysis of the customer's requirements, high-level and low-level design, development, integration, installation and generation, operation, maintenance and de-commissioning;
– requirements for product developers, secure systems developers and integrators, and organizations that develop software and provide computer security services and computer security engineering;
– all types and sizes of security engineering organizations, from commercial to government and academia.
Security engineering should not be practiced in isolation from other engineering disciplines. Rather, security engineering promotes such integration, taking the view that security is pervasive across all engineering disciplines (e.g., systems, software and hardware) and defining components of the model to address such concerns. The main interest of customers and suppliers may be not the improvement of security characteristics but performance and functionality. If developers consider the security-related aspects of the software they develop, its price may be higher; but considering that a single security hole can compromise the whole system, some additional cost is appropriate.
3 Application of Security Engineering A wide variety of organizations can apply security engineering to their work, such as the development of computer programs, software and middleware for application programs, or the security policy of the organization. Therefore, appropriate approaches, methods and practices are needed for product developers, service providers, system integrators and system administrators, even when they are not security specialists. Some of these organizations deal with high-level issues (e.g., operational use or system architecture), others focus on low-level issues (e.g., mechanism selection or design), and some do both. Security engineering may be applied to all kinds of organizations. Use of the security engineering principle should not imply that one focus is better than another, or that any of these uses are required. An organization's business focus need
not be biased by use of security engineering. Based on the focus of the organization, some, but not all, of the approaches or methods of security engineering may apply very well. In general, some of the approaches or methods of security engineering can be applied to increase the assurance level of software. The following examples illustrate ways in which security engineering may be applied to software, systems, and facilities development and operation by a variety of different organizations. A security service provider may use security engineering to measure the capability of an organization that performs risk assessments. During software or system development or integration, one would need to assess the organization with regard to its ability to determine and analyze security vulnerabilities and assess the operational impacts. In the operational case, one would need to assess the organization with regard to its ability to monitor the security posture of the system, identify and analyze security vulnerabilities, and assess the operational impacts. Countermeasure developers may use security engineering to address determining and analyzing security vulnerabilities, assessing operational impacts, and providing input and guidance to other groups involved (such as a software group). Software or product developers may use security engineering to gain an understanding of the customer's security needs and to append security requirements to the customer's requirements. Interaction with the customer is required to ascertain them. In the case of a product, the customer is generic, as the product is developed a priori, independent of a specific customer; when this is the case, the product-marketing group or another group can be used as the hypothetical customer, if one is required. The main objective of applying security engineering is to provide assurance about the software or system to the customer, and the assurance level of a software product or system may be a critical factor influencing the purchase decision. Therefore, applying security engineering to software means applying assurance methods to the phases of the software development lifecycle. Assurance methods are classified in Fig. 1 [7]. Depending on the type of assurance method, the assurance gained is based on the aspect assessed and the lifecycle phase. Assurance approaches yield different assurance depending on the aspect of the deliverable (IT component or service) examined. Some approaches examine different phases of the deliverable lifecycle, while others examine the processes that produce the deliverable (indirect examination of the deliverable). Assurance approaches include facility, development, analysis, testing, flaw remediation, operational, warranties, personnel, etc. These assurance approaches can be further broken down; for example, the testing assurance approach includes general testing and strict conformance testing assurance methods.
4 Applying Security Engineering to Build Countermeasures In general, threat agents' primary goals may fall into three categories: unauthorized access, unauthorized modification or destruction of important information, and denial of authorized access. Security countermeasures are implemented to prevent threat agents from successfully achieving these goals. Security countermeasures should be chosen in view of the applicable threats and of the security solutions deployed to support appropriate security services and objectives. Subsequently, the proposed security solutions may be evaluated to determine whether residual vulnerabilities exist, and a managed approach to mitigating the risks may be proposed. Countermeasures must be considered and designed
Fig. 1. Categorization of existing assurance methods
from the starting point of the IT system design or software development process. The countermeasure, or group of countermeasures, selected by designers or administrators may cover all possible threats, but a problem exists in this situation: how, and by whom, can it be guaranteed that the countermeasure is trustworthy? Security engineering may be used to solve this problem. In fact, the processes for building security countermeasures cannot be fixed once and for all, because the circumstances of each IT system may be different. We propose a method for building security countermeasures as follows. 4.1 Threats Identification A 'threat' is an undesirable event, which may be characterized in terms of a threat agent (or attacker), a presumed attack method, a motivation of attack, an identification of the information or systems under attack, and so on. Threat agents come from various backgrounds and have a wide range of financial resources at their disposal. Typically, threat agents are thought of as having malicious intent. However, in the context of system and information security and protection, it is also important to consider the threat posed by those without malicious intent. Threat agents may be nation states, hackers, terrorists or cyber-terrorists, organized crime, other criminal elements, the international press, industrial competitors, disgruntled employees, and so on. Most attacks aim at getting inside the information system, and individual motivations for "getting inside" are many and varied. Persons who have malicious intent and wish to achieve commercial, military, or personal gain are known as hackers (or crackers). At the opposite end of the spectrum are persons who compromise the network accidentally. Hackers range from the inexperienced script kiddie to the highly technical expert.
4.2 Determination of Robustness Strategy The robustness strategy is intended for application in the development of a security solution. An integral part of the process is determining the recommended strength and degree of assurance for the proposed security services and mechanisms that become part of the solution set. The strength and assurance features provide the basis for the selection of the proposed mechanisms and a means of evaluating the products that implement those mechanisms. The robustness strategy should be applied to all components of a solution, both products and systems, to determine the robustness of configured systems and their component parts. It applies to commercial off-the-shelf (COTS), government off-the-shelf (GOTS), and hybrid solutions. The process is to be used by security requirements developers, decision makers, information systems security engineers, customers, and others involved in the solution life cycle. Clearly, if a solution component is modified, or if threat levels or the value of the information change, risk must be reassessed with respect to the new configuration. Various risk factors, such as the degree of damage that would be suffered if the security policy were violated, the threat environment, and so on, are used to guide the determination of an appropriate strength and an associated level of assurance for each mechanism. Specifically, the value of the information to be protected and the perceived threat environment are used to obtain guidance on the recommended strength of mechanism level (SML) and evaluation assurance level (EAL).
4.3 Consideration of Strength of Mechanisms SMLs (Strength of Mechanism Levels) focus on specific security services. There are a number of security mechanisms that may be appropriate for providing some security services. To provide adequate information security countermeasures, the desired (or sufficient) mechanisms must be selected by considering the particular situation. An effective security solution will result only from the proper application of security engineering skills to specific operational and threat situations. The strategy does offer a methodology for structuring a more detailed analysis. The security services itemized in these tables have several supporting services that may result in recommendations for the inclusion of additional security mechanisms and techniques.
4.4 Selection of Security Services In general, primary security services are divided into five areas: access control, confidentiality, integrity, availability, and non-repudiation. In practice, however, none of these security services is isolated from or independent of the other services. Each service interacts with and depends on the others. For example, access control is of limited value unless preceded by some type of authorization process; one cannot protect information and information systems from unauthorized entities if one cannot determine whether the entity one is communicating with is authorized. In actual implementations, lines between the security services are also blurred by the use of mechanisms that support more than one service.
4.5 Application of Security Technologies An overview of technical security countermeasures would not be complete without at least a high-level description of the widely used technologies underlying those countermeasures. The following items are some examples of security technologies.
– Application Layer Guard
– Application Program Interface (API), Common Data Security Architecture (CDSA)
– Internet Protocol Security (IPSec), Internet Key Exchange (IKE) Protocol
– Hardware Tokens, PKI, SSL, S/MIME, SOCKS
– Intrusion and Penetration Detection
4.6 Determination of Assurance Level The discussion of the need to view strength of mechanisms from an overall system security solution perspective is also relevant to the level of assurance. While an underlying methodology is offered in a number of ways, a real solution (or security product) can only be deemed effective after a detailed review and analysis that considers the specific operational conditions, the threat situations, and the system context for the solution. Assurance is the measure of confidence in the ability of the security features and architecture of an automated information system to appropriately mediate access and enforce the security policy. Evaluation is the traditional method of ensuring this confidence, and many evaluation methods and criteria exist. Nowadays, many evaluation criteria such as ITSEC have been replaced by the Common Criteria. The Common Criteria provide assurance through active investigation, that is, an evaluation of the actual product or system to determine its actual security properties. The Common Criteria philosophy assumes that greater assurance results from greater evaluation effort in terms of scope, depth, and rigor.
5 Conclusions and Future Work As mentioned earlier, security should be considered from the starting point of every development process. Furthermore, security should be implemented in and applied to IT systems or software products by using security engineering. This paper introduced some methods for reducing threats to a system by applying security engineering, and proposed a method for building security countermeasures. The proposed method cannot cover all cases, however, so more detailed research is needed, and research on generalizing these processes should also proceed.
References
1. ISO. ISO/IEC 21827 Information technology - Systems Security Engineering Capability Maturity Model (SSE-CMM)
2. ISO. ISO/IEC TR 15504-2:1998 Information technology - Software process assessment - Part 2: A reference model for processes and process capability
3. ISO. ISO/IEC TR 15504-5:1998 Information technology - Software process assessment - Part 5: An assessment model and indicator guidance
4. ISO. ISO/IEC 15408-1:1999 Information technology - Security techniques - Evaluation criteria for IT security - Part 1: Introduction and general model
5. ISO. ISO/IEC 15408-2:1999 Information technology - Security techniques - Evaluation criteria for IT security - Part 2: Security functional requirements
6. ISO. ISO/IEC 15408-3:1999 Information technology - Security techniques - Evaluation criteria for IT security - Part 3: Security assurance requirements
7. Tai-Hoon Kim: Approaches and Methods of Security Engineering, ICCMSE 2004
8. Tai-Hoon Kim, Byung-Gyu No, Dong-chun Lee: Threat Description for the PP by Using the Concept of the Assets Protected by TOE, ICCS 2003, LNCS 2660, Part 4, pp. 605-613
9. Tai-hoon Kim, Tae-seung Lee, Kyu-min Cho, Koung-goo Lee: The Comparison Between The Level of Process Model and The Evaluation Assurance Level. The Journal of The Information Assurance, Vol. 2, No. 2, KIAS (2002)
10. Tai-hoon Kim, Yune-gie Sung, Kyu-min Cho, Sang-ho Kim, Byung-gyu No: A Study on The Efficiency Elevation Method of IT Security System Evaluation via Process Improvement, The Journal of The Information Assurance, Vol. 3, No. 1, KIAS (2003)
11. Tai-hoon Kim, Tae-seung Lee, Min-chul Kim, Sun-mi Kim: Relationship Between Assurance Class of CC and Product Development Process, The 6th Conference on Software Engineering Technology, SETC (2003)
12. Ho-Jun Shin, Haeng-Kon Kim, Tai-Hoon Kim, Sang-Ho Kim: A study on the Requirement Analysis for Lifecycle based on Common Criteria, Proceedings of The 30th KISS Spring Conference, KISS (2003)
13. Tai-Hoon Kim, Byung-Gyu No, Dong-chun Lee: Threat Description for the PP by Using the Concept of the Assets Protected by TOE, ICCS 2003, LNCS 2660, Part 4, pp. 605-613
14. Haeng-Kon Kim, Tai-Hoon Kim, Jae-sung Kim: Reliability Assurance in Development Process for TOE on the Common Criteria, 1st ACIS International Conference on SERA
CC-SEMS: A CC Based Information System Security Evaluation Management System Young-whan Bang, Yeun-hee Kang, and Gang-soo Lee Hannam University, Dept. of Computer Science Daejon, 306-791, Korea {bangyh@se,dusi82@se,gslee@eve}.hannam.ac.kr
Abstract. A Common Criteria (CC) based evaluation project for an IT security product or system takes so much cost and time that effective evaluation project management is strongly required. We present a CC based security evaluation management system (CC-SEMS) that is useful for evaluation facilities and for developers of IT security products. The evaluation activity program, evaluation plan, XML-DTDs of deliverables, management objects, and evaluation DB are proposed as the distinctive elements of CC-SEMS.
1 Introduction Various types of information security products have been developed, evaluated and certified since the middle of the 1980s under various evaluation criteria and manuals such as TCSEC [1], ITSEC/ITSEM [2,3], CTCPEC [4] and CC/CEM [5,6]. The CC (Common Criteria), standardized as ISO/IEC 15408, is an evaluation criterion that integrates the others. Most product evaluations (the product being the Target of Evaluation, TOE) are completed within an optimum evaluation period and cost a few hundred thousand dollars from the start. Others, of a more complex nature, can take much longer and cost more, depending on the target Evaluation Assurance Level (EAL) and on the Security Target (ST), which is a security requirement specification. Additionally, the elapsed time and cost of the security evaluation process depend not only on the availability of the correct developer documentation (or deliverables), but also on the nature and complexity of the software and on whether any re-use can be made of work performed in previous evaluations of the same TOE [7]. An evaluation manual such as ITSEM or CEM is overall guidance for evaluation [3,6,7], and an evaluator in an evaluation facility should refer to these manuals. ITSEM is a high-level evaluation manual for ITSEC and has only one Evaluation Activity Program (EAP); we need seven kinds of EAP, one per EAL. CEM has two problems. First, it is such an abstract, high-level evaluation guide that it is of little use to evaluators; in fact, CEM is merely a re-editing of the CC in the context of the "evaluator actions" statements in the CC. Second, it does not provide guidance for EAL 5 through 7, nor for evaluations using other assurance packages. Thus we need practical evaluation guidance, automation of evaluation activities, and a management system for evaluation projects. Guidance for the accreditation and operation of evaluation facilities or test laboratories has been internationally or domestically standardized, for example in EN45001, ISO/IEC 17025, NIST Handbook 150 and 150-20, CCEVS SP #1 and SP #4, Canada's PALCAN, and the
United Kingdom's UK-SP2. These, however, cover merely the accreditation and operation of evaluation facilities. Detailed and practical evaluation guidance, such as an evaluation project program and its management, that is useful to evaluators in evaluation facilities, is strongly required. An evaluation facility should operate an efficient CC Security Evaluation Management System (CC-SEMS) in its evaluation environment for the purpose of cooperatively and concurrently managing the evaluation resources (e.g., evaluators, tools, deliverables, and criteria). CC-SEMS integrates principles of project management [8], workflow management [9], process management [10] and web service technologies [11]. Fig. 1 presents the architecture of CC-SEMS. Note that the EAP templates, the Evaluation Plan and the XML-DTDs of deliverables, which belong to the preparation time, should be prepared before the start of an evaluation. The right part of Fig. 1 (i.e., management of the Management Object and the control engine) operates at evaluation time. In the next section, we present the EAP templates, the Evaluation Plan and the XML-DTDs of deliverables in the preparation time of CC-SEMS. Section 3 presents the management object, the evaluation DB and the control engine at evaluation time. Finally, we implement, analyze and conclude in Section 4.
Fig. 1. Architecture of CC-SEMS
2 Preparation Time of CC-SEMS The preparation time refers to the activities that prepare for an evaluation. The Evaluation Plan and the deliverables should be written by the evaluation project manager and the developer, respectively. 2.1 Evaluation Activity Program Templates and Evaluation Plan Evaluation Activity Program (EAP) Templates: A dependency list is defined for each functional and assurance component in the CC. Dependencies among "functional components" arise when a component is not self-sufficient and relies upon the functionality of, or interaction with, another component for its own proper functioning.
Additionally, dependencies among assurance components arise when a component is not self-sufficient and relies upon the presence of another component. Recall that assurance class, component, evaluator action and "developer action and content and presentation of evidence" correspond to activity, sub-activity, action and "work unit" in an evaluation project, respectively [6]. We regard dependency relations between assurance components as precedence relationships between evaluation sub-activities in the context of an evaluation project. We derive EAP templates, i.e., EAP1 through EAP7, which correspond to EAL1 through EAL7, from the dependency relationships among assurance components. Note that the word "template" means something that is independent of a particular TOE and evaluation environment; templates are derived from the CC only. An EAP is defined as follows:
S# = 1;                                  // stage number
Temp = Original;                         // copy of Original, the list of dependency relations defined in EALi
P = S#;
for each component COM in Temp that has no in-going dependency relation do
    draw COM as an AN on stage P of EAP;     // AN is an activity node
    delete COM's out-going dependency relations from Temp;
end-do
repeat
    S# = S# + 1;  Q = S#;
    for each component COM in Temp that has no in-going dependency relation do
        draw COM as an AN on stage Q of EAP;
        delete COM's out-going dependency relations from Temp;
        draw arcs from the ANs in stage P to this AN by using Original;
    end-do
    P = Q;
until there is no more relation in Original;
// Transitive reduction, for simplification:
if (A has an arc to B) and (B has an arc to C) and (A has an arc to C)
    then delete the arc from A to C;         // A, B and C are ANs in EAP
Fig. 2. Evaluation activity program generation algorithm
EAP = (AN, ARC, STG)
• AN is a set of activity nodes. An AN has three attributes: component name, deliverable name and assignment (e.g., duration, cost, developer, tool). An AN represents the evaluation activity and the necessary deliverables of an assurance component (see Table 1), as well as the assignment of resources such as duration, cost, developer and tool. Note that the assignment is empty in an EAP template; the assignment attribute, in particular, is used for evaluation project management.
• ARC is a set of arcs from an AN to an AN in another stage. An ARC represents a precedence relationship between two ANs. The ANs in a stage can be processed concurrently along with the project management.
• STG is a set of stages. A path from stage 1 to the final stage is a total ordering of the ANs.
• An EAP is produced by means of the "EAP generation algorithm" shown in Fig. 2. The algorithm generates a total ordering from partial orderings. An EAP template is presented by an activity network or a Gantt chart. Fig. 3 shows an activity network presentation of an EAP3 template.
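As a runnable companion to the Fig. 2 pseudocode, the stage-building step can be sketched in Python as a level-by-level topological sort (my sketch: the arc drawing and transitive reduction are omitted, and the example dependencies are illustrative rather than the full CC dependency list).

    def build_eap_stages(components, depends_on):
        # A component is placed in the first stage once everything it
        # depends on has already been placed.
        placed, stages = set(), []
        remaining = set(components)
        while remaining:
            stage = {c for c in remaining if depends_on.get(c, set()) <= placed}
            if not stage:                    # cyclic dependencies: give up
                raise ValueError("dependency cycle among %s" % remaining)
            stages.append(sorted(stage))
            placed |= stage
            remaining -= stage
        return stages

    deps = {"ADV_FSP.2": {"ADV_RCR.1"}, "ADV_HLD.2": {"ADV_FSP.2"}}
    print(build_eap_stages(["ADV_RCR.1", "ADV_FSP.2", "ADV_HLD.2"], deps))
    # [['ADV_RCR.1'], ['ADV_FSP.2'], ['ADV_HLD.2']]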
Fig. 3. An activity network presentation of the EAP3 template and an EP (Note: the numerical values are basic values that will be changed according to the evaluation environment)
Evaluation Plan (EP): An EP is a realization, customization or instantiation of one of the EAP templates. An evaluation project manager shall plan the evaluation project by using an EAP template and a scheduling algorithm for resources such as time, stakeholders, tools and cost. Fig. 3 presents an EP for EAL3 of some TOE; note that the assignment attribute values are filled into the template. In Fig. 3, a value can be interpreted as an evaluation duration (weeks or months), cost ($), evaluator identifier, tool identifier, and so on. The numerical values (i.e., weeks/months, cost) in an EAP are basic evaluation durations or costs derived from the CC and CEM; if the evaluation environment changes, the values in the EAP are changed by the same ratio. An EP is presented by a Gantt chart as shown in Fig. 4. An EP can be scheduled by the following policies:
• Multi-stage waterfall type scheduling: simple but not optimal. The EP of Fig. 3 takes 24 weeks under this policy.
Fig. 4. Optimal scheduling of EP of EAP3
• Optimal scheduling: in Fig. 4, the critical evaluation activity path is ADV_RCR.1 → ADV_FSP.2 → ADV_HLD.2 → AVA_SOF.1, and the total evaluation duration is 17 weeks. If we want to shorten the evaluation duration, we must shorten the durations of the evaluation activities on the critical path.
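A minimal critical-path computation of the kind implied above might look as follows (my sketch; the durations are the illustrative week counts used in the figures, not calibrated CC/CEM values).

    def critical_path_length(duration, preds):
        # Longest (critical) path through the activities, via earliest-finish times.
        finish = {}
        def earliest_finish(a):
            if a not in finish:
                start = max((earliest_finish(p) for p in preds.get(a, [])), default=0)
                finish[a] = start + duration[a]
            return finish[a]
        for a in duration:
            earliest_finish(a)
        return max(finish.values())

    dur = {"ADV_RCR.1": 2, "ADV_FSP.2": 5, "ADV_HLD.2": 6, "AVA_SOF.1": 4}
    pre = {"ADV_FSP.2": ["ADV_RCR.1"], "ADV_HLD.2": ["ADV_FSP.2"],
           "AVA_SOF.1": ["ADV_HLD.2"]}
    print(critical_path_length(dur, pre))    # 17 weeks along the path above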
2.2 Deliverables Necessary content of deliverables: Deliverables are produced by the developer of the TOE and consumed by the evaluator. Deliverables, presented as documents or source code, are an image of the development process and of the product; it is important to note that a TOE includes not only the development process but the product itself. We derive a list of deliverables, and their contents, for each AN in the EAP1 through EAP7 templates by using a "deliverable generation algorithm". The algorithm has two phases, derivation and refinement of the necessary contents. 1. Derivation of necessary contents: The necessary contents of a deliverable (or document) for the evaluation of an assurance component can be derived from the "developer action elements" and "content and presentation of evidence elements" in the assurance components. Each class, family and component in the CC assurance requirements is mapped to a title, chapter and section of a deliverable, respectively. 2. Refinement of necessary contents: If one document is derived from each assurance class, then 13 documents are needed, so at most 91 (13 × 7) types of documents are required. These are redundant, and some are too short or too long. Thus we should not only minimize the number of deliverables (or documents), but also assure consistency among documents: the more redundant the documents are, the less consistent they might be.
We can reduce volumes of documents by using the following rules: • CC conformance rule: A specific document includes a minimal content and presentation of evidence elements that specified in CC • Subset rule: A document for upper EAL includes those of lower EAL’s. • Merge Rule: Two or more documents for assurance classes that are small or redundant, may be merged into one document. • Minimal Rule: Two or more documents for assurance classes, which contents are the same, may be merged into one document. Table 1. List of deliverables Deliverables(documents) Security target report (STR) Configuration management Report (CMR) Delivery and operation report (DOR)
EAL1
EAL2
EAL3
EAL4
CMR.1
CMR.2
CMR.3
CMR.4
EAL6
EAL7
STR.1
DOR.1
DOR.2
Functional Specification (FSR)
FSR.1
High-level design specification (HDR)
HDR.1
FSR.2 HDR.2 IMR.1
Structural specification (INR) LDR.1
TSR.1
CMR.5
FSR.4
HDR.3
HDR.4
HDR.5
IMR.2
IMR.3
IMR.4
INR.1
INR.2
INR.3
LDR.2
LDR.3
LDR.4
TSR.2
ALR.3 TSR.4
VAR.1
VAR.2
VAR.4
VAR.3
DOR.4
FSR.3
SPR.1 GDR.1 ALR.1 ALR.2 TSR.3
Corresponding assurance class-family ASE
CMR.6
DOR.3
Implementation report (IMR)
Low-level design specification (LDR) Security policy sped (SPR) Guidance document (GDR) Lifecycle support report (ALR) Test report (TSR) Vulnerability analysis report (VAR)
EAL5
SPR.2
SPR.3
ALR.4 TSR.5
ALR.5 TSR.6
VAR.5
ACM ADO ADV_FSP, ADV_RCR ADV_HLD, ADV_RCR ADV_IMP, ADV_RCR ADV_INT, ADV_RCR ADV_LLD, ADV_RCR ADV_SPM AGD ALC ATE AVA
We can reduce numbers of deliverables to 51 (from 91) by using the rules as shown in Table 1. Our approach is more useful and practical than FAA’s Data Item Description[13]. XML-DTD for Deliverables: Content and structure (or schema) of deliverables are defined by XML-DTD. XML-DTD is regarded as syntax of deliverable. Each section or paragraph of a deliverable is presented by tag in XML. XML-DTD has many advantages in web based evaluation environment as follows: • Standardization of structure of deliverable • Automatic syntax and consistency checking by using XML processors • Automatic handling of deliverables by XML based application A producer (i.e., developer or sponsor) of deliverable, firstly, downloads a XMLDTD file of a deliverable from an evaluation web server of CC-SEMS. Then, he makes a XML file by using the loaded file and his own content; finally, he submits the XML file to an evaluation web server of CC-SEMS. In the server, a XML processor, firstly, will checks syntax and consistency of the submitted XML formed deliverables. Then, a
Fig. 5. XML-DTD of Configuration Management Report (CMR.4) for EAL4 (DTD listing omitted)
part (i.e., chapter tag or section tag) of deliverable is allocated to a scheduled evaluation activity by CC-SEMS. We define XML-DTDs for deliverables listed in Table 1. Fig.5 presents one of XML-DTD.
3 Evaluation Time of CC-SEMS 3.1 Management Object (MO) MO, that is resource or target of management, is defined as follows: MO = ( STKE, DELI, TOOL, TIME, COST, CRIT), where, • STKE is stakeholder or interested party of CC-SEMS. TOE produce, evaluation participant and evaluation users are subtypes of STKE [12]: TOE producer is a sponsor, developer or consultant. Evaluation participant is a evaluator, certifier and validator. Note that certifier and validator are working in a certification/validation body. CC evaluator an evaluator, approver or accredit or who is working in an evaluation facilities. Finally, evaluation user is a consumer or product vendor. • DELI is deliverables or documents. Subtypes of DELI are input (i.e., an image of TOE), process (temporal doc.) and output (results doc.). • TOOL is a given evaluation tool such as tester program. • TIME is a given evaluation time duration (e.g., 10 months). • COST is a given evaluation cost (e.g., 1 million dollars). • CRIT is evaluation criteria such as CC or PP(Protection Profile) or ST. Note that PP or ST can be regarded as evaluation criteria.
Fig. 6. A Schema of EDB
3.2 WorkFlow Engine (WFE) MO is controlled by the WFE or manager by way of a project plan. Inputs of WFE are EDB, an EP, and MO (e.g., evaluators, deliverables). WFE allocates or deallocates deliverables to evaluator, controls access to deliverables, conforms some work to evaluator, and so on. Outputs of WFE are control signal/message to evaluator or developer by means of online or offline. 3.3 Evaluation DB (EDB) EDB is used for storing state of MO of an evaluation project. EDB Schema, that is a logical structure of EDB, is entities of MO and relations among the entities as shows in Fig.6. A state of MO means a state of evaluation project, in other word, current value (snap shot) of an EDB that has MO’s attribute value. A state is transited to other state when an event (e.g., ‘testing is finished’) is occurred on evaluation time. Thus, processing activities of a state of MO are: identify, measurement and monitoring the MO’s attributes values; store and update EDB by means of manual or automatic. Tables, a data structure for entity and relation in EDB, are organized as below (Italic word is primary key, underline word is foreign key). • Entity tables and attributes: PLAN(plan id, proj id, proj information, link to real plan file), STKE(stke id, type, name, address, major, cost, etc.), DELI(deli id, type, name, content, link to real file, etc.), TOOL(tool id, type, cost, function, location, link to real file, etc.), COST(account id, amount, allocated, etc.), TIME(time id, etc.), CRIT( link to real CC/CEM, etc.). • Relation tables and attributes: access(stke id, deli id, access id, access type, access log, etc.), access type = (creation, update, read, approve, etc.),usage (stke id, tool id, usage id, usage type, usage log, etc.), usage type = (using, update), reference(stke id, crit id, refe id, ref log, etc.), evaluation (crit id, del id, eval id, tool id, etc.).
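A minimal sketch of how two of the EDB tables listed above could be realized (mine, using SQLite through Python; the column sets are abbreviated, the SQL types are my own choice rather than the paper's, and the example row is invented):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE STKE (                -- stakeholders
        stke_id  TEXT PRIMARY KEY,
        type     TEXT,                 -- sponsor, developer, evaluator, certifier, ...
        name     TEXT,
        address  TEXT
    );
    CREATE TABLE access (              -- which stakeholder touched which deliverable
        access_id   TEXT PRIMARY KEY,
        stke_id     TEXT REFERENCES STKE(stke_id),
        deli_id     TEXT,              -- foreign key into the DELI table (omitted here)
        access_type TEXT,              -- creation, update, read, approve, ...
        access_log  TEXT
    );
    """)
    conn.execute("INSERT INTO STKE VALUES ('E01', 'evaluator', 'Hong', 'Daejon')")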
4 Implementation, Analysis and Conclusion A prototype version of CC-SEMS has been implemented using Windows, JBuilder 9.0, MySQL 3.23.38, and PHP v.0.6, and we are currently developing an upgraded version.
Fig. 7. Some screen shots of CC-SEMS (Korean version)
CC-SEMS consists of an EDB server, an application server, clients, and a web server. Note that the XML-DTD files of the deliverables are uploaded to the web server. We have developed a Security Target (ST) for CC-SEMS because CC-SEMS is itself an IT security product; recall that an ST is a requirement specification in terms of security functions and assurance [14,15]. Fig. 7 presents the Korean version of CC-SEMS. CC-SEMS is not only an integrated approach drawing on project management, groupware, workflow management, documentation engineering, XML, and web technologies, but also has the following distinct and useful features: • The necessary content of deliverables, derived from the 'developer action elements' and 'content and presentation of evidence elements' in the CC, together with some rules for reducing the volume of deliverables, is expected to be useful in CC evaluation schemes. • The XML-DTDs of all sorts of deliverables are useful for applying Web services to security evaluation. • The seven EWP templates, derived from the dependency relations among CC components, are useful for developing an Evaluation Plan for evaluation project scheduling in evaluation facilities. • The management object and evaluation DB schema are useful for modeling and developing other evaluation management systems. The authors have not yet found any similar approach or system to CC-SEMS in the CC evaluation community. The system is an ongoing development project; we will therefore formalize our approaches and upgrade the system using the actual data and problems that arise in real applications of the system.
Acknowledgement This work was supported by a grant No. R12-2003-004-01001-0 from KOSEF.
References 1. DoD. Trusted Computer System Evaluation Criteria (TCSEC), Dec. 1985. 2. European Communication, “Information Security Evaluation Criteria (ITSEC)”, Ver. 1.2, June 1991. 3. European Community, Information Technology Security Evaluation Criteria (ITSEM), Ver 1.0, 1993. 4. Canadian System Security Centre, , “The Canadian Trusted Computer Product Evaluation Criteria (CTCPEC)”, Ver.3e, Jan. 1993. 5. CC, Common Criteria for Information Technology Security Evaluation, CCIMB-2004-03, Version 2.2, Jan. 2004. 6. Common Methodology for Information Technology Security Evaluation (CEM), CCIMB2004-01-04, Version 2.2, Jan. 2004. 7. Common Criteria for Information Technology Security Evaluation, User Guide, Oct. 1999. 8. W. Royce, Software project management - A Unified Framework, AW, 1998. 9. P. Lawrence (ed.), Workflow handbook - 1997, John Wiley & Sons, 1997. 10. A. Fuggetta and A. Wolf (ed.). Software Process, John Wiley & Sons, 1996. 11. G. Alonso, et al., Web services - Concept, Architectures and Applications, Springer, 2004. 12. R. Priet-Diaz, The CC Evaluation Process - Process Explanation, Shortcomings, and Research Opportunities, Commonwealth Information Security Center Technical Report CISCTR-2002-003, James Madison University, Dec. 2002. 13. Data Item Description, Federal Aviation Administration, www.faa.gov/aio/cuiefsci/PPlibrary/index.htm. 14. ISO/IEC PDTR 15446, Guide for the production of PP and ST, Draft, Apr 2000. 15. NIAP, CC Toolbox Reference Manual, Version 6.0, 2000.
A Secure Migration Mechanism of Mobile Agents Under Mobile Agent Environments Dongwon Jeong1 , Young-Gab Kim2 , Young-Shil Kim3 , Lee-Sub Lee4 , Soo-Hyun Park5 , and Doo-Kwon Baik2 1
4
Dept. of Informatics & Statistics, Kunsan National University San 68, Miryong-dong, Gunsan, Jeolabuk-do, 573-701, Korea [email protected] 2 Department of Computer Science and Engineering , Korea University Anam-dong 5-ga, Seongbuk-gu, 136-701, Seoul, Korea {ygkim,baik}@software.korea.ac.kr 3 Division of Computer Science & Information, Daelim College Bisan-dong DongAn-ku Anayang-si Kyounggi-do, 431-715 Korea [email protected] Dept. of Computer Engineering, Kumoh National Institute of Technology 188, Shinpyung-dong, Gumi, Gyeongbuk [email protected] 5 School of Business IT, Kookmin University 861-1, Jeongreung-dong, Sungbuk-ku, SEOUL, 136-701, Korea [email protected]
Abstract. Although the mobile agent paradigm provides many advantages for distributed computing, several security issues remain the most important difficulties that must be solved to broaden its application. One of the most important mobile agent security issues is how mobile agents can transfer securely from one host to another. Until now, many mobile agent systems have been developed, but those systems either do not support mechanisms for the secure transmission of mobile agents or have several problems. This paper proposes an integrity mechanism with which mobile agents can migrate to other hosts securely. The mechanism is independent of specific security frameworks, so it can be used easily in various mobile agent environments.
1 Introduction The mobile agent paradigm has evolved from the traditional client/server paradigm [1,2] and it has many advantages such as mobility, autonomy, intelligence, adaptability, cooperation, etc. [3]. Mobility in particular is the most remarkable property of the mobile agent paradigm. Thus, the security of mobile agents and mobile agent systems is recognized as one of the most important problems. The security issues of the mobile agent paradigm are generally classified into two key areas: mobile agent security and mobile agent system security. The security issues of mobile agent systems are related to managing mobile agent system resources, protecting against malicious agents or malicious hosts, and so on. The mobile agent
security services are confidentiality, integrity, authentication, access control, and non-repudiation. Until now, many mobile agent systems have been developed [5,6,7,8,9,10]. These systems provide various security policies and mechanisms to address the issues above. Mobile agents have mobility; in other words, they migrate from one host to other hosts. If a mobile agent can be altered and turned into a malicious agent during its migration, many hosts can be attacked easily. Therefore, the most important security issue is to transfer mobile agents safely from one host to others. Many mobile agent systems have tried to solve the aforementioned security issues, but most provide no mechanism for migrating mobile agents securely. Several mobile agent systems address this issue with a secure transfer protocol; in fact, a secure transfer protocol is different from a genuine integrity mechanism. In addition, such protocols depend on specific security frameworks, which causes compatibility problems between frameworks. In this paper, we propose an integrity checking mechanism that lets mobile agents migrate to other hosts securely. The proposed mechanism is based on the bundle authentication mechanism of the OSGi [4]; we devised it by extending the OSGi mechanism to suit mobile agent system environments. The proposed mechanism is independent of existing security frameworks and was designed to increase usability. Therefore, it is more lightweight than other security frameworks and can be used as an integrity security component.
2 Related Work This section introduces the key security issues of the mobile agent paradigm and describes the security policies of existing mobile agent systems. 2.1 Security Issues in Mobile Agent Paradigm This section describes the security issues of the mobile agent paradigm. The security issues can be classified into two groups: mobile agent security and mobile agent system security. Mobile agent security is to protect mobile agents from the attacks of other malicious objects, and mobile agent system security is to protect mobile agent systems from the attacks of other objects; the objects may be mobile agents or mobile agent systems. Table 1 shows the security issues, i.e., the attack types of the mobile agent paradigm [11]. As shown in Table 1, there are many attack types against mobile agent systems or mobile agents. In the case of mobile agent security, confidentiality, integrity, authentication, access control, and non-repudiation should be supported to secure mobile agents from the illegal attacks described above. Mobile agents have several properties, such as intelligence, autonomy, and adaptability; in particular, they have mobility, which is the most remarkable property of the mobile agent paradigm. Therefore, the integrity issue, i.e., the secure migration of mobile agents, is considered one of the most important security issues. If an anonymous mobile agent is illegally altered during migration, it causes many serious problems. As a result, the integrity of mobile agents should be guaranteed first in mobile agent environments.
Table 1. Attack Types in Mobile Agent and System

Attacker               Target                 Attack type
Mobile Agent           Mobile Agent System    - Camouflage
                                              - Paralysis of mobile agent systems
                                              - Illegal access to resources or services
Mobile Agent           Mobile Agent           - Camouflage
                                              - Interfering with other mobile agents
                                              - Denial of receipt or transmission
                                              - Degeneration of mobile agents
Mobile Agent System    Mobile Agent           - Camouflage
                                              - Tapping status or code of MA
                                              - Altering code or status of MA
Mobile Agent System    Mobile Agent System    - Camouflage
                                              - Illegal accessing of services or data
                                              - Illegal message copy or response
2.2 Security Policy of Mobile Agent Systems This section introduces the security policies of existing mobile agent systems. Table 2 shows a summary of the security policies supported by the mobile agent systems Aglet [5], Voyager [6], Concordia [7], Agent Tcl [8], Ajanta [9], and SOMA [10]. Almost all the mobile agent systems provide secure mechanisms for hosts. On the other hand, in the case of mobile agent security, only a few systems such as SOMA and Ajanta support security mechanisms for the mobile agents themselves. As shown in Table 2, SOMA and Ajanta provide mechanisms for securing agent status and gathered data, respectively. In particular, Ajanta uses the PKI, one of the existing security frameworks, so a certificate authority is required for authentication between hosts. This makes the authentication process slow and the mobile agent system heavy, and the dependency on a specific security framework causes compatibility problems between various mobile agent systems. This paper focuses on the integrity issue; in other words, the goal of this paper is to develop a mechanism that secures both the status and the data of agents and is independent of specific security frameworks.
3 Preconditions and Overall Process This section defines several constraints and the overall process of the proposed integrity checking mechanism. 3.1 Preconditions The integrity mechanism proposed in this paper has several preconditions. As already noted, this paper focuses on the integrity issue for mobile agents, so these preconditions need to be defined. They are summarized as follows:
Table 2. Security Policies of Mobile Agent Systems

System      MAS security                                               MA security
Aglet       - Access control based on proxy object                     N/A
            - Trusted and untrusted
Voyager     - Extending the Java security manager                      N/A
Concordia   - Access control by a security manager                     N/A
            - The security manager depends on users' identity
Agent Tcl   - Access control depending on its security policy          N/A
            - Anonymous and authenticated
Ajanta      - Access control by the proxy-based mechanism              - Securing agent status
SOMA        - Role-based access control based on a hierarchical        - Integrity of gathered data
              security policy
– All hosts in the closed environment are safe from any threats.
– Mutual authentication between all hosts is accomplished at the initialization of the mobile agent systems.
– The authenticated hosts can trust each other.
– The authenticated hosts never alter mobile agents that have migrated from other hosts.
– Mobile agents have an itinerary that is planned in advance.
As aforementioned, there are many security issues in the mobile agent paradigm, but this paper focuses on integrity for the secure migration of agents; thus, these basic preconditions are required to describe the proposed integrity mechanism. First, we suppose that the security of the mobile agent systems, i.e., the security of the hosts, is guaranteed. This means that all hosts are free from any threats and are not altered by other malicious objects. We suppose that all hosts are reliable; thus, all mobile agents are secure from the hosts that they migrate to and stay in. Mobile agents have mobility, and hence an itinerary of the hosts they will visit. Various itinerary distribution policies are possible; in this paper, we suppose all mobile agents have an itinerary including the addresses of all hosts.
3.2 Overall Process of the Integrity Checking Mechanism The proposed integrity mechanism consists of two main steps: a mutual authentication step and a secure migration step. The first step is to authenticate the mobile agent systems (hosts) mutually; the second step is for mobile agents to migrate to and work on other hosts. Figure 1 depicts the overall process of the proposed integrity checking mechanism. At the first step in this figure, all hosts are authenticated mutually; the function h2h() means that mutual authentication is accomplished between each pair of hosts. Therefore, if the number of hosts is N, the mutual authentication operation is performed 1 + 2 + ... + (N-1) = N(N-1)/2 times, and since each mutual authentication comprises two one-way authentications, N(N-1) one-way authentication operations are performed in total.
This paper focuses on the integrity mechanism, not the mutual authentication. Thus, this paper defines a simple authentication mechanism and implements it using the HmacSHA1 algorithm provided by SUN. After the mutual authentication, a closed environment is organized; then an itinerary is prepared, and a mobile agent migrates to the hosts according to the itinerary. The itinerary may include a subset or all of the hosts in the closed environment.
Fig. 1. Main Process of the Integrity Checking Mechanism
4 Integrity Checking Mechanism This section describes the integrity checking mechanism proposed in this paper. 4.1 Mutual Authentication Phase Suppose that a host H1 sends a mobile agent to H2 to examine computer viruses. Before sending the mobile agent to H2, H1 wants to check whether H2 is reliable: if H1 sends the mobile agent without a reliability examination and H2 is not reliable, the mobile agent may be changed by H2. This paper focuses on the secure transfer of mobile agents rather than on the mutual authentication between hosts; however, mutual authentication is required to realize the secure migration of mobile agents, so this paper designs and uses a simple authentication mechanism, shown in the following figure. Figure 2 shows the mutual authentication process between two hosts in detail. Host H1 generates a nonce N_H1. The nonce N_H1 is encrypted with the symmetric key K_H1H2, and the encrypted nonce ~N_H1 is then transferred to
Host H2. H2 decrypts ~N_H1 with K_H1H2 and generates its own nonce N_H2. H2 then encrypts the new nonce together with N_H1 and sends the result back to H1. H1 decrypts the result and compares the original N_H1 with the N_H1 returned by H2. If the nonce returned by H2 differs from the original nonce, the mutual authentication is aborted; if not, H1 encrypts the nonce N_H2 and sends it back to H2, and H2 likewise compares its original nonce with the nonce received from H1. When the mutual authentication between them is completed, a shared secret key is generated: both hosts create the shared secret key with the same hash function. The shared secret key and the two nonce values are used for encryption and decryption of a mobile agent.
Fig. 2. Mutual Authentication Process between Hosts
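The following is a minimal Java sketch of this challenge-response exchange, not the authors' implementation: the cipher choice ("AES"), the nonce length, and the shared-secret derivation are illustrative assumptions, since the paper only states that a symmetric key K_H1H2 and a common hash function are used.

import java.security.MessageDigest;
import java.security.SecureRandom;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.SecretKey;

final class MutualAuthSketch {
    private static final SecureRandom RNG = new SecureRandom();

    static byte[] freshNonce() {
        byte[] nonce = new byte[16];            // N_H1 or N_H2
        RNG.nextBytes(nonce);
        return nonce;
    }

    // Encrypt a nonce (or a concatenation of nonces) under the shared key K_H1H2.
    static byte[] seal(SecretKey kH1H2, byte[] data) throws Exception {
        Cipher c = Cipher.getInstance("AES");   // illustrative cipher choice
        c.init(Cipher.ENCRYPT_MODE, kH1H2);
        return c.doFinal(data);
    }

    static byte[] open(SecretKey kH1H2, byte[] sealed) throws Exception {
        Cipher c = Cipher.getInstance("AES");
        c.init(Cipher.DECRYPT_MODE, kH1H2);
        return c.doFinal(sealed);
    }

    // H1 checks that the nonce echoed by H2 matches the one it originally sent;
    // a mismatch means the authentication is aborted.
    static boolean nonceEchoValid(byte[] original, byte[] echoed) {
        return Arrays.equals(original, echoed);
    }

    // Both hosts derive the shared secret the same way from the two nonces
    // (the exact derivation is an assumption; the paper only names "a same hash function").
    static byte[] deriveSharedSecret(byte[] nH1, byte[] nH2) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        md.update(nH1);
        md.update(nH2);
        return md.digest();
    }
}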
The process described above is performed iteratively until all hosts have authenticated each other. Thus, the mutual authentication operation is required 1 + 2 + ... + (N-1) = N(N-1)/2 times, where N is the number of hosts. 4.2 Secure Mobile Agent Migration Phase This section describes the integrity mechanism, the most crucial part of this paper. As mentioned above, the objective of this integrity mechanism is to allow mobile agents to migrate to other hosts securely. The mechanism uses the shared secret key and the nonce values generated during the mutual authentication process. Figure 3 shows the integrity mechanism between hosts in detail.
Fig. 3. Secure Migration of Mobile Agents in the Integrity Checking Mechanism
In this figure, the MAC is generated by a keyed hash function, the HMAC function; this paper uses the HmacSHA1 algorithm provided by SUN, and we implemented the proposed mechanism using this hash function. The generated MAC is appended to the signed JAR file, and the JAR file is sent to Host B; in other words, the agent on Host A migrates to Host B. Host B receives the JAR file and also generates a MAC using the same hash function as Host A, giving the shared secret key and the nonce values to the hash function. Host B then compares the generated MAC with the MAC in the received JAR file that was generated on Host A. If the agent was not changed by an anonymous malicious object during transmission, both MACs are the same; if the JAR file was changed, the original MAC differs from the final MAC. The sequential processing steps of this mechanism in Figure 3 are as follows: (1) Generating a signed JAR file including the agent to be migrated to Host B. In this step, the manifest file of the original JAR is used as an input to the hash function, which processes the manifest file with the shared secret key and the two nonce values obtained through the mutual authentication; a MAC is generated as the result. (2) Transferring the JAR file (migration of the agent). The final JAR file on Host A is sent to Host B, meaning the agent migrates from Host A to Host B. (3) Generating a MAC on Host B. Host B generates a MAC using the same hash function, the shared secret key, the two nonce values, and the manifest file within the JAR file received from Host A as input. (4) Comparing the two MACs. The MAC computed by Host B is compared with the MAC generated and sent by Host A.
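A minimal Java sketch of steps (1), (3), and (4) is given below. It uses the HmacSHA1 algorithm named in the paper, but the exact way the shared secret and the two nonces are combined with the manifest is our assumption, and the class and method names are hypothetical.

import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

final class AgentIntegritySketch {
    // Host A computes the MAC over the manifest with the shared secret and the two nonces.
    static byte[] macOfManifest(byte[] sharedSecret, byte[] nH1, byte[] nH2,
                                Path manifestFile) throws Exception {
        Mac hmac = Mac.getInstance("HmacSHA1");
        hmac.init(new SecretKeySpec(sharedSecret, "HmacSHA1"));
        hmac.update(nH1);                       // mix in the nonces from mutual authentication
        hmac.update(nH2);
        hmac.update(Files.readAllBytes(manifestFile));
        return hmac.doFinal();
    }

    // Host B recomputes the MAC and compares it with the MAC shipped with the JAR.
    static boolean agentUnchanged(byte[] receivedMac, byte[] recomputedMac) {
        return MessageDigest.isEqual(receivedMac, recomputedMac);
    }
}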
4.3 Comparison This section compares the proposed integrity mechanism with the existing mobile agent systems. Table 3 summarizes the comparison results.

Table 3. Qualitative Comparison

Comparative item        SOMA     Ajanta    Proposed mechanism
Independency            N/A      N/A       Supported
Size of encrypted MA    Large    Large     Small
Encryption speed        Slow     Slow      Fast
Transmission speed      Slow     Slow      Fast
Compatibility           Low      Low       High
Application             Low      Low       High
The goal of this paper is to provide an integrity mechanism for secure migration. Until now, many mobile agent systems have been developed, but most of them do not provide such an integrity mechanism, and the few that do depend on specific security frameworks. For example, Ajanta is based on the PKI, so its size is large and it also requires a certain authenticating organization. The proposed integrity mechanism does not depend on existing security frameworks; therefore, it is lightweight and can be integrated into other security architectures. The existing mobile agent systems encrypt the entire agent, which makes the encryption slow and decreases the migration speed. The proposed mechanism, however, processes only the manifest file rather than the whole agent; therefore, this integrity mechanism solves the issues above.
5 Conclusion In this paper, we proposed an integrity mechanism for the secure migration of mobile agents. The mechanism has two key stages: mutual authentication and secure migration. This paper focuses on the secure migration, i.e., the mechanism that provides integrity for mobile agents while they are transferred; hence, we used a simple mutual authentication mechanism to authenticate the hosts. The goal of this paper is to provide an integrity mechanism for the secure migration of mobile agents. To achieve this goal, this paper extended the bundle authentication mechanism of OSGi. The OSGi bundle authentication mechanism is based on the PKI and has several problems, as noted in Section 4.3. The integrity mechanism proposed in this paper is based on a MAC, and we use only the manifest file rather than the whole file in the MAC computation. As a result, it has several advantages: first, it increases the transmission speed of the agents; second, the MAC operation is fast; finally, this mechanism does not depend on any other security frameworks. Therefore, the mobile agent systems using
this mechanism become lightweight. In addition, we can integrate this mechanism into other security architectures easily.
References 1. D.B. Lange and M. Oshima, “Programming and deploying Java mobile agents with aglets," Addison-Wesley, 1998. 2. M.M. Karnik and A.R. Tripathi, “Design Issues in Mobile Agent Programming Systems," IEEE Concurrency, pp. 52-61, July 1998. 3. D.B. Lange and M. Oshima, “Seven Good Reasons for Mobile Agents," Communications of the ACM, vol. 42, no. 3, March 1999. 4. OSGi, http://www.osgi.org/ 5. G. Karjoth, D.B. Lange, and M. Oshima, “A Security Model for Aglets," Lecture Notes in Computer Science, vol. LNCS 1419, pp. 188-205, 1998. 6. Voyager, http://www.recursionsw.com/products/voyager 7. T. Walsh, N. Paciorek, and D. Wong, “Security and Reliability in Concordia," The 31st Annual Hawaii International Conference on System Sciences, Hawaii, 1998. 8. R.S. Gray et al., “D'Agents: Security in a multiple-language, mobile-agent system," Lecture Notes in Computer Science, vol. LNCS 1419, 1998. 9. N. Karnik and A. Tripathi, “Security in the Ajanta Mobile Agent System," Software – Practice and Experience, January 2001. 10. A. Corradi, R. Montanari, and C. Stefanelli, “Mobile Agents Integrity in E-commerce Applications," Proceedings of the 19th IEEE ICDCS'99, Austin, Texas, May 1999. 11. W. Jansen and T. Karygiannis, “Mobile Agent Security," NIST, Special Publication 800-19, August 1999.
A Flexible Privilege Management Scheme for Role Graph Model Yuna Jung1 and Eenjun Hwang2, 1
The Graduate School of Information and Communication, Ajou University Suwon, Korea, 442-749 [email protected] 2 Department of Electronics and Computer Engineering, Korea University Seoul, Korea, 136-701 Tel: +82-2-3290-3256 [email protected]
Abstract. Since role-based access control was introduced in the early 1990s, it has been considered one of the most promising access control methods. The role graph model was suggested as a reference model for role-based access control; however, its privilege management is too strict to be applied to various applications. In this paper, we therefore propose a flexible privilege management scheme based on the refinement of privileges in the role graph model, and show its effectiveness through several scenarios. We expect that this scheme will make the role graph model more powerful and applicable.
1 Introduction The Role-Based Access Control (RBAC) model [1] has emerged since the 1990s as a promising approach for managing and enforcing security in huge and complex systems. The essential idea of RBAC is that privileges are associated with roles, and users obtain privileges by being assigned to appropriate roles. The role is the major component in constructing a role-based access control model. In the RBAC model, system administrators can create roles, grant privileges to those roles, and then assign users to the roles on the basis of their specific job responsibilities and organizational policy. During this step, role-privilege relationships are defined, which makes it very simple to assign users to the defined roles. Without RBAC, it would be difficult to determine which privileges have been authorized to which users. Generally, assigning users to roles requires less technical skill than assigning privileges to roles. Besides, the role-based approach is more realistic than a user-based approach since the structure of access rights is based on roles in most organizations. Due to these properties, RBAC has become the most popular access control model. The RBAC model has evolved from RBAC0 to RBAC3. As a fundamental stage, RBAC0 incorporated the minimal requirements of an RBAC system. RBAC1 added role hierarchies to RBAC0, while RBAC2 added constraints to RBAC0; RBAC3 combined RBAC1 and RBAC2. The role graph model, which was introduced by Nyanchama and Osborn, is a reference model for role-based access control and provides a
way of visualizing interactions among roles. In 1996, algorithms for manipulating role graphs were also implemented by them [3]. These algorithms allow roles and privileges to be created, deleted, and altered; they have improved the role graph model and provided a basis for implementing and maintaining RBAC. So far, most research has focused on role engineering, such as constructing and maintaining the role hierarchy, assigning users to roles, and so on. However, in the RBAC model, privilege changes are more frequent than role changes: the objects of a privilege are continuously created, deleted, and modified, and the privileges themselves also change dynamically. In spite of the importance of privilege management, however, existing management schemes are too simple and strict. Therefore, it is necessary to improve privilege engineering to make the RBAC model more practical. In this paper, we propose a privilege refinement scheme to achieve such privilege engineering. The rest of the paper is organized as follows. In Section 2, we first give a brief introduction to the role graph model and then describe how privilege conflicts are handled in the existing role graph model. In Section 3, we propose a scheme to extend the role graph model through refining privileges, and in Section 4 we illustrate it with several scenarios. Finally, in Section 5, we conclude the paper and discuss future work.
2 Role Graph Model The role graph model consists of three distinct entities: privileges, roles, and users. A privilege denotes an access authorization for an object and is described by two elements: one is a set of objects such as system resources or data, and the other is an operation on those objects. A role is a named set of privileges and is represented by a role name and a set of privileges. Role-role relationships are based on the subset relation between roles and are represented by the is-junior relationship in the role hierarchy [4]. A user is the subject of access control models and wants to access system objects. To efficiently manage a number of users who should have the same authorization, the role graph model defines a group for a set of users, very much like the user group in the UNIX operating system. The role graph is, in essence, an acyclic directed graph in which a node represents a role in the system and an edge represents an is-junior relationship [2]. Every role graph has one MaxRole and one MinRole. MaxRole is a common senior of all the roles in the role graph; according to the is-junior relationship, it has all the privileges of the roles in the role graph. MinRole is a common junior of all the roles in the role graph and represents the minimum set of privileges available to all the roles; according to the is-junior relationship, the privileges of MinRole are inherited by all the roles in the graph. However, MaxRole and MinRole need not necessarily be assigned to any user, because they can exist as purely conceptual roles. In the role graph model, senior roles have all the privileges of their junior roles. Therefore, the actual privileges of a role are of two types: its own privileges, called direct privileges, and the privileges inherited from its juniors. The set of actual privileges of a role is called its effective privilege set. In Fig. 1 and Fig. 2, the p_i's in brackets represent the direct privileges of a role. 2.1 Handling Privilege-Privilege Conflicts on the Role Graph Model To apply the role-based access control model to real applications, we should handle conflicts of interest efficiently [5]. Negligent manipulation of such conflicts might cause
a serious security threat. For example, in a complex environment where the actions of an ill-intentioned user can cause harmful damage, granting a user a certain combination of privileges that could be a threat should be prevented. Because of such threats, it is very important in the role graph model to deal with conflicts of interest. Nyanchama and Osborn [3] discussed conflicts of interest on the role graph model in 1997. First, they defined five different kinds of conflicts that can arise in the role graph model: user-user/group-group/user-group conflicts, role-role conflicts, privilege-privilege conflicts, user-role assignment conflicts, and role-privilege assignment conflicts. Then, they suggested algorithms for preventing two kinds of conflicts: role-role conflicts and privilege-privilege conflicts. A role-role conflict means that a user/group has two roles that should not appear together; a privilege-privilege conflict means that a role has two privileges that should not appear together. To manipulate privilege-privilege conflicts, they defined P-Conflicts, a set of pairs of conflicting privileges, and ensured that no role (except for MaxRole) could have two privileges that were defined as a conflicting pair in P-Conflicts. For integrity, P-Conflicts must be consulted whenever a system administrator performs maintenance operations on the role graph, such as adding a privilege to an existing role, creating a new role, or inserting an edge between roles. If such an operation would cause a conflicting pair in the effective privilege set of any role, the operation is aborted. 2.2 Motivation The strict privilege manipulation of the current role graph model can cause serious problems. First of all, the role graph can become very complex, since privileges and roles are created and modified even when the modification is trivial. Moreover, such a modification is rejected if a conflict occurs, even though the actual conflicting portion of most conflicting pairs is only partial; the existing role graph model strictly aborts all modifications that cause a conflict. Such handling keeps the RBAC model from being practical. Let us examine two cases to see how privileges are manipulated in the current role graph model. Fig. 1 (a) shows the case where a new object is added to the object set of p2. Here, we assume that roles L2 and L4 can access the new object of p2, but L3 cannot. In this case, we should create a new role that contains the expanded object set of p2 and should cut the edges between L2 and S2 and between L4 and S2. That is, we might have to create a new role graph whenever privileges are modified. In a real environment, privileges change dynamically and role objects change frequently; consequently, the role graph is continuously changed, which makes it more and more complex. Fig. 1 (b) shows the case where a conflict arises between p11 and p14 when a new privilege p14 is inserted into L4. In the current role graph model, the insertion operation is aborted by the conflict even though the conflicting part between p11 and p14 is very small. Consequently, such simple and strict handling of privilege conflicts is not practical at all.
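As an illustration of these definitions (our sketch, not the authors' implementation), the following Java code shows direct privileges, effective privileges inherited through is-junior edges, and the strict P-Conflicts check that aborts an insertion in the current model; all class and method names are hypothetical.

import java.util.HashSet;
import java.util.Set;

final class RoleSketch {
    final String name;
    final Set<String> directPrivileges = new HashSet<>();
    final Set<RoleSketch> juniors = new HashSet<>();   // roles this role is senior to

    RoleSketch(String name) { this.name = name; }

    // Effective privilege set = direct privileges plus everything inherited from juniors.
    Set<String> effectivePrivileges() {
        Set<String> all = new HashSet<>(directPrivileges);
        for (RoleSketch junior : juniors) {
            all.addAll(junior.effectivePrivileges());
        }
        return all;
    }

    // Strict conflict handling of the original model: reject the insertion if the
    // new privilege forms a conflicting pair with any effective privilege.
    boolean canInsert(String newPrivilege, Set<String[]> pConflicts) {
        Set<String> effective = effectivePrivileges();
        for (String[] pair : pConflicts) {
            if ((pair[0].equals(newPrivilege) && effective.contains(pair[1]))
                    || (pair[1].equals(newPrivilege) && effective.contains(pair[0]))) {
                return false;   // operation aborted, as in the current role graph model
            }
        }
        return true;
    }
}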
3 Managing Privileges on the Role Graph Model With strict privilege handling, it is very difficult to apply the role graph model to various applications.
Fig. 1. (a) Adding a new object to privilege p2 , (b) Inserting a new privilege into role L4
To relieve this situation, the conflict handling method should be modified in such a way as to give flexibility. In this paper, we propose a flexible privilege management scheme using privilege refinement on the role graph model. 3.1 Privilege Refinement Scheme When a privilege-privilege conflict arises due to some privilege change, the operation is abandoned in the current role graph model. In this paper, we improve privilege conflict management by refining privileges. For privilege refinement, the pairs of conflicting privileges in P-Conflicts are divided into two groups: one contains the strictly conflicting pairs only, and the other contains all the remaining pairs; the latter group can exist in a role graph without any trouble. For the flexibility of privilege management, we modified the role graph model as follows; with several exceptions related to privilege refinement, we stick to the definitions of the current role graph model. A set O = {o1, ..., ov} is the set of objects, such as system resources or data, that users want to access. X_i = {x_i | x_i is an object of p_i, x_i ∈ O} is the set of objects belonging to p_i, and each of its elements is an element of O. Another set P = {p1, ..., pn} is the set of all privileges in a role graph; P is the same as the effective privilege set of MaxRole. A privilege p_i is represented by a pair (X_i, m), where X_i is a set of objects and m is the operation. E = {e1, e2, e3, ..., el} is the set of edges connecting one role to another, and each edge represents the is-junior relationship between two roles. We say that role r_a is-junior to r_b iff r_a.rpset ⊂ r_b.rpset. An edge e_l = (r_a, r_b) is represented by an ordered pair, where the first element r_a is-junior to r_b and the second element r_b is-senior to r_a. The set of roles R = {r1, r2, r3, ..., rm} contains all roles in a role graph, where r_j = (rname, rpset) is a named set of privileges: rname is the name of the role and rpset is its set of privileges. rpset, the effective privilege set of the role, is the union of the direct privileges of the role and the privileges inherited from all its juniors. (Definition 1) A role graph G = {R, E} is an acyclic graph, where R is a set of roles and E is a set of edges between roles; G has one MaxRole and one MinRole. A privilege p_n can be refined into two privileges p_n' and p_n'' by separating its objects, where p_n' = (X_n', m) with X_n' a subset of X_n, and p_n'' = (X_n'', m) with X_n'' = X_n − X_n'.
There are two kinds of privilege-privilege conflicts: full privilege-privilege conflicts and partial privilege-privilege conflicts. A full privilege-privilege conflict cannot be fragmented and should be handled strictly; (p_a, p_b, full) denotes a full privilege-privilege conflict between p_a and p_b. On the other hand, a partial privilege-privilege conflict can be fragmented into the problematic part and the rest; (p_a, p_b, partial) denotes a partial privilege-privilege conflict. Except for the conflicting part of (p_a, p_b), the remaining part can exist in the role graph. Each partial privilege-privilege conflict is classified again into four types as follows.
A partial conflict pair is separated into four subsets: (p_a', p_b'), (p_a', p_b''), (p_a'', p_b'), and (p_a'', p_b''). Here, p_a' is the problematic part of p_a that brings on the conflict between p_a and p_b, p_b' is the problematic part of p_b, and p_a'' and p_b'' are the remaining parts. First of all, we fragment each privilege into its problematic part and the rest, and make four combinations using the four privilege pieces (each privilege is divided into two parts). In the table for a fragmented conflict pair, the possibility of insertion is represented as O (approval) or X (disapproval). In this way, we can insert the authorized portion of a partially conflicting privilege even if that portion is only partial. However, it is in fact impossible to insert (p_a', p_b'), because both p_a' and p_b' are exactly the problematic parts that caused the conflict; therefore, we use the three subsets of a conflict pair other than (p_a', p_b'). Here, the security administrator defines p_a' and p_b' directly. If the problematic parts are on both sides, p_a' and p_b', only p_a'' and p_b'' can coexist in a role graph. If the problematic part is on one side only, p_a or p_b, then either (p_a'', p_b) or (p_a, p_b'') can exist by separating the problematic privilege: when (p_a', p_b'') and (p_a'', p_b'') are marked as O (approval), (p_a, p_b'') can be accepted; on the other side, when (p_a'', p_b') and (p_a'', p_b'') are marked as O (approval), (p_a'', p_b) can be accepted. As can be seen, we can improve the flexibility of privilege insertion using the partial insertion of conflicting privileges. (Definition 2) P-Conflicts = {(p_a, p_b, full), (p_c, p_d, partial), ..., (p_i, p_j, partial)} is the set of pairs of conflicting privileges, composed of full privilege-privilege conflict pairs and partial privilege-privilege conflict pairs. 3.2 Flexible Privilege Management Using the Privilege Refinement As we mentioned, the existing privilege management ignores the real situation. Therefore, in this paper, we propose a more flexible handling of privileges using privilege refinement. Fig. 2 (a) and (b) show how we improve privilege management in the role graph using privilege refinement.
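The following Java sketch (our illustration, with hypothetical names) shows the refinement step itself: a privilege's object set is split into the problematic subset chosen by the security administrator and the remainder, yielding p_n' and p_n'' as defined above.

import java.util.HashSet;
import java.util.Set;

final class PrivilegeSketch {
    final String name;                 // e.g. "p12"
    final Set<String> objects;         // X_n: the objects the operation applies to
    final String operation;            // m

    PrivilegeSketch(String name, Set<String> objects, String operation) {
        this.name = name;
        this.objects = objects;
        this.operation = operation;
    }

    // Refine this privilege against the problematic object subset chosen by the
    // security administrator: returns {p_n' (problematic part), p_n'' (the rest)}.
    PrivilegeSketch[] refine(Set<String> problematicObjects) {
        Set<String> problem = new HashSet<>(objects);
        problem.retainAll(problematicObjects);          // X_n', a subset of X_n
        Set<String> rest = new HashSet<>(objects);
        rest.removeAll(problematicObjects);             // X_n'' = X_n − X_n'
        return new PrivilegeSketch[] {
            new PrivilegeSketch(name + "'", problem, operation),
            new PrivilegeSketch(name + "''", rest, operation)
        };
    }
}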
Fig. 2. (a) Privilege refinement for Fig.1 (a), (b) Privilege refinement for Fig.1 (b)
4 Scenarios In this section, we consider four scenarios in which our proposed scheme supports flexible insertion. To make the situation clearer, we use the role graph in Fig. 1 as a sample role graph. In addition, we assume the following P-Conflicts: P-Conflicts = {(p10, p11, full), (p6, p12, partial), (p4, p13, partial), (p1, p14, partial)}. 4.1 Scenario 1: Handling a Full Privilege Conflict Let us add p10 to VP2 as a direct privilege of VP2. However, p10 and p11 fully conflict with each other. In the case of a full conflict, there is no room for compromise because every object of the two privileges collides with the other; therefore, we manage such a case with the strict method used in the existing role graph model, and the operation is denied. This process is shown in Fig. 3.
Fig. 3. Handling a full privilege conflict
4.2 Scenario 2: Modifying an Inserted Privilege Let us assume that we try to add p12 to L3. However, p12 conflicts with p6 in the effective privilege set of VP2. Fortunately, the conflict is a partial conflict (the type of the conflict can be read from its notation). In the case of a partial conflict, we can insert a part of p12 by separating the problematic part from the privilege. We assume that the table of conflict refinement is as follows: the authorized forms are (p6', p12'') and (p6'', p12''). This means that p12'' does not cause any trouble, so the conflict pair should be modified into (p6, p12'') to be inserted into the role graph. Therefore, p12'' is inserted instead of p12. This process is shown in Fig. 4.
Fig. 4. An example of modifying an inserted privilege
4.3 Scenario 3: Modifying an Existing Direct Privilege We want to add p13 to L2, but p13 conflicts with p4, an existing privilege of L2. However, we can insert a part of the conflicting privilege because this conflict is partial. Let us assume that the table of conflict refinement is as follows: according to it, p4'' and p13 can coexist without any conflict, because (p4'', p13') and (p4'', p13'') are marked with O. Therefore, p4'' replaces p4 as a direct privilege of L2, and then p13 is inserted. This process is shown in Fig. 5. 4.4 Scenario 4: Modifying an Existing Inherited Privilege Let us suppose that we want to add p14 to VP1. However, p14 conflicts with p1, an existing privilege of VP1.
Fig. 5. An example of modifying an existing direct privilege
Fig. 6. An example of modifying an existing inherited privilege
We can insert a part of the conflicting privilege because this conflict is partial. However, p1 is a privilege inherited from S1. If the table of conflict refinement is as follows, the authorized form is (p1'', p14), because (p1'', p14') and (p1'', p14'') are approved. Therefore, we should change the conflicting pair (p1, p14)
into the non-conflicting pair (p1'', p14). Thus, p1'' replaces p1 as a direct privilege of S1, and then p14 can be inserted into VP1. This process is shown in Fig. 6.
5 Conclusion Since the role-based access control model was introduced in the early 1990s, it has been applied to various security systems as a promising access control method. However, its privilege conflict management is too strict to be adapted to real applications. In this paper, we proposed a more flexible handling scheme for privilege conflicts using privilege refinement, and examined various cases to show its effectiveness.
Acknowledgements This research has been supported by University IT Research Center Project. We would like to thank Prof. Hongjin Yeh for his valuable comments.
References 1. R.S. Sandhu, "Role-Based Access Control", IEEE Computer, pp. 38-47, Feb., (1996). 2. Nyanchama, M., Osborn, S. L., Access rights administration in role-based security systems. In Proc. of the IFIP Working Group 11.3, Working Conference on Database Security, Elsevier North-Holland, Amsterdam, The Netherlands, (1994). 3. Nyanchama, M., Osborn, S. L., The Role Graph Model and Conflict of Interest, ACM transactions on Information and System Security, Vol. 2, No. 1, Feb. (1999). 4. R.S. Sandhu et.al, Role Hierarchies and Constraints for Lattice-based access controls, In Proc. of the Conf. On Computer Security, springer-Verlag, New York, (1996). 5. Simon R., Zurko M. E., Separation of duty in role based access control environments, In Proc. of the 10th IEEE Computer Society Press, Los Alamitos, CA, (1997). 6. Matunda Nyanchama and Sylvia Osborn, "Access rights administration in role-based access control revisited", In Proceedings of the IEIP Working Group 11.3 Working Conference on Database Security,Amsterdam, The Netherlands,(1994). 7. Matunda Nyanchama and Sylvia Osborn, "The Role Graph Model and Conflict of Interest", ACM Transactions on Information and System Security, Vol.2, No.1, pp. 3-33, Feb., (1999).
The System Modeling for Detections of New Malicious Codes EunYoung Kim, CheolHo Lee, HyungGeun Oh, and JinSeok Lee National Security Research Institute 62-1 Hwa-am-dong, Yu-seong-gu Daejeon, 305-718, Republic of Korea {eykim,chlee,hgoh,jinslee}@etri.re.kr
Abstract. During the last several years, malicious codes have become increasingly sophisticated. At the same time, the Internet's growing popularity and the steady adoption of broadband technologies have allowed malicious codes to spread quickly. However, the traditional detection method for malicious code is pattern matching, which can detect only known malicious codes. In the past, vendors were usually able to ship a new pattern for most malicious codes before they could achieve widespread distribution; if a vendor cannot provide a new pattern, however, the new malicious code cannot be detected and users are left exposed to attack. In this article, we suggest a new malicious code detection algorithm and system model that do not require a malicious code pattern DB. Keywords: Malicious Code, Realtime monitoring.
1 Introduction As the popularity of the Internet and computers increases continuously along with the fast development of IT, the number of computer crimes increases dramatically. In particular, malicious codes take advantage of various security holes of computer systems, and these codes are easily and widely distributed through the Internet. These malicious codes threaten the core infrastructure of a country: the number of hacking incidents at colleges, research institutes, and governmental organizations increases annually, and so do the damage costs. In addition, hacking techniques and malicious code development techniques continue to advance, but malicious code detection techniques cannot catch up with them. Statistics of computer incidents between 2002 and 2003 in Korea are shown in [Fig. 1]; these statistical data are published by CERTCC-KR in Korea [1]. Occurrences of malicious code hacking grew rapidly after 2001, and they continue to grow every year. Because the number of hacking incidents increases, system administrators encourage users to install a personal firewall on their PCs. Personal firewalls such as ZoneAlarm, BlackICE, and Tiny are bundled with other security products, and some firewalls are available to users free of charge. Therefore, when some data are going to be transmitted from a PC without the knowledge of its user, a personal firewall can check the data transmission and may block it.
Fig. 1. The Number of Hacking Incidents in South Korea
Even though a personal firewall can detect hacking activities that use network transmissions, it may not be able to detect hacking activities that use malicious codes. In this article, we suggest a malicious code detection system that consists of three modules: a realtime system monitoring module, a process profiling module, and a malicious code detection module. The next chapter describes related work on various malicious code detection methods; the following chapter presents the architecture of the suggested malicious code detection system; we then describe the experimental setting and experimental results, and the last chapter summarizes our conclusions and further research issues.
2 Related Work This chapter describes various malicious code detection methods. There are four kinds of methods: pattern matching, heuristic methods, policy-based behavior blocking, and expert-based behavior blocking [8]. Traditional pattern-matching-based anti-virus or anti-malicious-code software detects malicious code by searching for tens of thousands of digital patterns in all scanned files, disks, and network traffic. Each pattern is a short sequence of bytes extracted from the body of a specific virus strain; if a given pattern is found, the content is reported as infected. However, because anti-virus patterns are based on known sequences of bytes from known infections, this technique often fails to detect new viruses. In contrast to the pattern matching technique, heuristic anti-virus technology detects infections by scrutinizing a program's overall structure, its computer instructions, and other data contained in the file. The heuristic scanner then assesses the likelihood of infection based on this generally suspicious logic rather than looking for specific patterns. Behavior blocking software integrates with the operating system of a host computer and monitors program behavior in real time for malicious actions; it then blocks potentially malicious actions before they have a chance to affect the system. Monitored behaviors can include:
– Attempts to open, view, delete, and/or modify files
– Attempts to format disk drives and other unrecoverable disk operations
– Modifications to the logic of executable files, scripts, or macros
– Modification of critical system settings, such as start-up settings
– Scripting of e-mail and instant messaging clients to send executable content
– Initiation of network communications
If the behavior blocker detects that a program is initiating would-be malicious behaviors as it runs, it can block these behaviors in real time and stop the offending software. This gives it a fundamental advantage over such established anti-virus detection techniques as patterns or heuristics. Existing behavior blocking systems can be split into two categories: policy-based blocking systems and expert-based blocking systems. Policy-based systems allow the administrator to specify an explicit blocking policy stating which behaviors are allowed and which are blocked. Each time a program makes a request to the operating system, the behavior blocker intercepts the request, consults its policy database, and either allows the request to proceed or blocks the request entirely. A policy-based behavior blocking system for Java might provide the following options [Table 1]:

Table 1. Policy-based behavior blocking

Operation description                                      Block request
Allows applets to open files                               Yes
Allows applets to delete files                             Yes
Allows applets to initiate network connections             No
Allows applets to access files in the system directory     No
In contrast to policy-based systems, expert-based systems employ a more opaque method of operation. In these systems, human experts have analyzed entire classes of malicious code and then designed their behavior blocking systems to recognize and block suspicious behaviors. Under some circumstances a would-be dangerous behavior is allowed, and under others it is blocked. For example, a behavior blocking expert may know that 80% of malicious code first attempts to modify the startup area of the registry before accessing system files, so he can design his behavior blocking system to block a program's access to system files only after first seeing it modify the startup area of the registry. Such a rule is less likely to block legitimate programs and generate false alarms, yet still blocks a high percentage of threats. While a policy-based system might offer an option to 'block access to system files', the expert-based system would offer the option to 'block virus-like behavior'. Clearly, with such a system, the administrator must take a leap of faith that the experts who built the system have chosen their blocking criteria wisely.
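As an illustration of a policy-based blocker of the kind described above (not any particular product), the following Java sketch consults a small policy database for each intercepted request; the operation names and the default decision are our assumptions.

import java.util.HashMap;
import java.util.Map;

final class PolicyBlockerSketch {
    // Policy database: monitored operation -> allowed?
    private final Map<String, Boolean> policy = new HashMap<>();

    PolicyBlockerSketch() {
        policy.put("OPEN_FILE", true);                     // cf. Table 1
        policy.put("DELETE_FILE", true);
        policy.put("INITIATE_NETWORK_CONNECTION", false);
        policy.put("ACCESS_SYSTEM_DIRECTORY", false);
    }

    // Called when an intercepted request reaches the blocker: unknown or
    // disallowed operations are blocked before they affect the system.
    boolean allow(String program, String operation) {
        boolean allowed = policy.getOrDefault(operation, false);
        if (!allowed) {
            System.out.println("Blocked " + operation + " requested by " + program);
        }
        return allowed;
    }
}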
3 The Architecture of the Malicious Code Detection System The malicious code detection system proposed in this paper consists of three modules: a realtime system monitoring module, a process profiling module, and a malicious code detection module [Fig. 2].
Fig. 2. The Malicious Code’s Detection System
Fig. 3. Real Time System Monitoring Module’s Architecture
Process log data from the realtime system monitoring module are transmitted to the process profiling module, which re-creates the log data per process. The process profiling module lists the information transmitted through process monitoring and network monitoring; therefore, we can monitor user systems while process profiling is running. The malicious code detection module detects malicious activities using the malicious code detection policies. The results of the malicious code detection process and the realtime system monitoring process are reported to the user through a GUI (Graphical User Interface). We describe each module in detail in the following sections. The realtime system monitoring module is composed of three submodules: the process monitoring module, the registry monitoring module, and the network monitoring module [Fig. 3]. The process monitoring module collects various kinds of information about the processes that are executed in the user area. The registry monitoring module monitors the registry information that all processes access; it is implemented using a registry hooking driver. The network monitoring module gathers network state information, such as which IP addresses and ports are connected.
Table 2. Output data of the realtime system monitoring module

Process monitoring    – Process list and ID
                      – Process thread number
                      – Process create time
                      – CPU utility time of the process and total CPU utility time
                      – Process DLL list
Registry monitoring   – Registry path
                      – Registry request type
                      – Registry access time
                      – Registry access result
Network monitoring    – Source address and port
                      – Destination address and port
                      – Port connection state
The network monitoring module obtains network information using LSP (Layered Service Protocol). The output data of the realtime system monitoring module are listed in Table 2. The process profiling module records a profile of each process based on the logs created by the realtime system monitoring module [9]. The process profiling algorithm is as follows. First, each process's realtime monitoring data are stored in queues; because the realtime system monitoring has three monitoring modules, three queues are created for each process. Second, monitoring information is stored sequentially in each queue in the order of occurrence in the system log. Profiling then proceeds by always consuming the event with the smallest occurrence time [Fig. 4]:

ProfilingSelectTime = Min(Time_FMData, Time_RMData, Time_NMData)   (3.1)

We perform this process profiling so that we can monitor the system state and the actions of each process. If the profiles match the malicious code detection policies, detection is possible for known malicious codes as well as unknown ones; in addition, we can find unknown malicious actions using malicious code profiles. The proposed system detects malicious codes in the malicious code detection module using the information in the table created by the process profiling module. If a process performs malicious actions, the module changes the state of the relevant process from 'Normal Status' to 'Monitoring Status'. If a process in 'Monitoring Status' continues malicious actions and its score exceeds a dangerous point, its state is changed to 'Critical Status'; however, if the process in 'Monitoring Status' does not perform suspicious actions, the process profiling module does not record a profile for it. The state cycle of a process is shown in [Fig. 5]. The malicious code detection module detects malicious code on the basis of the malicious actions of a process through this three-step status change; this staged approach is expected to reduce the false-positive rate of the malicious code detection system. The malicious code detection engine can detect and remove malicious codes according to the process status. An actual hacking simulation of our proposed malicious code detection system has been carried out.
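A minimal Java sketch of this selection rule follows (our illustration, not the authors' implementation): three timestamped per-process queues are merged by repeatedly taking the event with the smallest occurrence time, as in Eq. (3.1); all names are hypothetical.

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class ProcessProfilerSketch {
    record Event(long time, String source, String detail) {}   // source: FM, RM, or NM

    // Queues are assumed to be filled in order of occurrence by the three monitors.
    final Deque<Event> fmQueue = new ArrayDeque<>();  // process monitoring data
    final Deque<Event> rmQueue = new ArrayDeque<>();  // registry monitoring data
    final Deque<Event> nmQueue = new ArrayDeque<>();  // network monitoring data

    // Produce the per-process profile: repeatedly select the queue whose head
    // has the minimum timestamp (ProfilingSelectTime) and append that event.
    List<Event> buildProfile() {
        List<Event> profile = new ArrayList<>();
        while (!fmQueue.isEmpty() || !rmQueue.isEmpty() || !nmQueue.isEmpty()) {
            Deque<Event> next = null;
            for (Deque<Event> q : List.of(fmQueue, rmQueue, nmQueue)) {
                if (!q.isEmpty() && (next == null || q.peek().time() < next.peek().time())) {
                    next = q;
                }
            }
            profile.add(next.poll());
        }
        return profile;
    }
}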
Fig. 4. Process Profiling Algorithm (each sample process has queues of File Monitoring Data (FMData), Registry Monitoring Data (RMData), and Network Monitoring Data (NMData); at every step the profiling time is selected as Min(time_FMData, time_RMData, time_NMData) and the corresponding event is consumed)
Fig. 5. Process Status Change (transitions (1)–(5) among Normal Status, Monitoring Status, and Critical Status)
4 Experiments
We have implemented a simulated system. The malicious codes used in the experiments fall into two main groups: known Trojan horses and malicious codes that we developed ourselves. Malicious code data were gathered from 200 Trojan horses found on the Internet; well-known examples among them are NetBus, SchoolBus, and Back Orifice. The malicious codes we developed have characteristics similar to those of the known Trojan horses. The simulated environment was built on Windows 2000 Professional machines (Pentium 1.4 GHz CPU, 256 MB memory) consisting of an attacking system and a victim system; the attacking system played the role of the hacker [Fig. 6].
Table 3. Process status change condition

State Number 1:
1. During process monitoring:
   (a) the file name matches a file name registered as malicious code
   (b) the process references a DLL that is used by malicious code
   (c) the information obtained through process monitoring matches the malicious code database
2. During registry monitoring:
   (a) the process accesses a specific registry key that is accessed by malicious code
3. During network monitoring:
   (a) the process connects to a specific port that is used by malicious code
   (b) even if the port is not one used by malicious code, the port is used by a program that the user has not admitted as a normal program

State Number 2: the process no longer satisfies the 'Monitoring Status' conditions

State Number 3: the conditions of State Number 1 ('Monitoring Status') are satisfied again

State Number 4: the process no longer satisfies the 'Critical Status' conditions

State Number 5:
1. the process disappears
2. the process is exited by the user
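To make the staged transition concrete, the sketch below tracks a single process through the three statuses; it is only an assumed illustration of Table 3 and Fig. 5 (the danger point and scoring are hypothetical, as the paper does not give numeric values), not the authors' detection engine.

```python
# Assumed illustration of the Normal -> Monitoring -> Critical status change
# (cf. Table 3 and Fig. 5); threshold and scoring are hypothetical.
DANGER_POINT = 10

class ProcessState:
    def __init__(self, pid):
        self.pid = pid
        self.status = "Normal"
        self.score = 0

    def on_malicious_action(self, weight=1):
        if self.status == "Normal":
            self.status = "Monitoring"       # a Table 3 condition was met
        self.score += weight
        if self.status == "Monitoring" and self.score > DANGER_POINT:
            self.status = "Critical"         # score exceeded the danger point

    def on_conditions_cleared(self):
        if self.status == "Critical":
            self.status = "Monitoring"       # left the 'Critical Status' condition
        elif self.status == "Monitoring":
            self.status = "Normal"           # left the 'Monitoring Status' condition

    def on_process_exit(self):
        self.status = "Normal"               # process disappeared or exited by user
        self.score = 0
```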
Fig. 6. Experiments architecture
The following results were obtained. We tested the Trojan horses against the suggested malicious code detection system and obtained a 93% detection rate [3]. The result shows that our suggested malicious code detection system can detect new malicious codes without a malicious code pattern DB. These results lead us to the conclusion that new malicious codes can be detected by the malicious code detection policy [2].
5 Conclusions and Future Directions
In this paper, we have presented a malicious code detection system that concentrates on real-time system monitoring. From what has been discussed above, we can conclude that our suggested malicious code detection system can detect new malicious codes without a malicious code pattern DB; as it turned out, we obtained a 93% detection rate. A continued examination of the mechanism of the new malicious code detection algorithm would further strengthen this proposition. A further direction of this study will be to provide a more efficient detection policy using machine learning.
References
1. Korea CERT Coordination Center, http://www.krcert.or.kr/
2. H. Han, X.L. Lu, J. Lu, C. Bo, R.L. Yong: Data mining aided signature discovery in network-based intrusion detection system. ACM SIGOPS Operating Systems Review, Volume 36, Issue 4, October 2002.
3. PC Flank Experts: The Best Tools to Neutralize Trojan Horses. http://www.pcflank.com/art17d.htm/. PC Flank Experts team, USA.
4. G. McGraw, G. Morrisett: Attacking Malicious Code. A Report to the Infosec Research Council, 2000.
5. M. Roesch: Lightweight intrusion detection for networks. In Proceedings of the Thirteenth USENIX Systems Administration Conference, Seattle, WA, USA, 1999.
6. M. Schmall: Heuristic Techniques in AV Solutions: An Overview. http://online.securityfocus.com/infocus/1542. SecurityFocus, February 2002.
7. M. Schmall: Building an Anti-Virus Engine. http://online.securityfocus.com/infocus/1552. SecurityFocus, February 2002.
8. C. Nachenberg: Behavior Blocking: The Next Step in Anti-Virus Protection. http://online.securityfocus.com/infocus/1557. SecurityFocus, March 2002.
9. T.F. Lunt: Automated Audit Trail Analysis and Intrusion Detection: A Survey. 11th National Computer Security Conference, October 1988.
Information Hiding Method Using CDMA on Wave Files Young-Shil Kim1 , Sang Yun Park1 , Suk-Hee Wang1 , and Seung Lee2 1 Division of Computer Science & Information, Daelim College 526-7, Bisan-dong, Dongan-gu, Anyang-si, Gyeonggi-do, 431-715, Korea {pewkys,sypark,shwang}@daelim.ac.kr 2 Dept. of Automatic System Engineering, Daelim College 526-7, Bisan-dong, Dongan-gu, Anyang-si, Gyeonggi-do, 431-715, Korea [email protected]
Abstract. Although many information hiding paradigms provide advantages for protecting important information, we introduce a new method of concealment. One of these paradigms is steganography. Many efforts have been made to encrypt data as well as to hide it, and most users do not want to lose the data they hide. Steganography is one of the methods by which users can hide data. Some steganography software uses audio data among the various multimedia data. However, the commercialized audio steganography software has the disadvantages that the existence of hidden messages can be easily recognized visually and that only data of a certain size can be hidden. To solve these problems, this study suggested, designed, and implemented the Dynamic Message Embedding (DME) algorithm. Also, to improve the security level of the secret message, a file encryption algorithm has been applied. Based on the DME algorithm and the file encryption algorithm, the StegoWaveK system, which performs audio steganography, was designed and implemented. The suggested system and the commercialized audio steganography systems were then compared and analyzed on several criteria. Keywords: Steganography, Information Hiding, CDMA, Stego-Data, Cover-Data, Wave file
1 Introduction
As computers and communication systems develop rapidly, several methods have been introduced to keep data safe. The basic method is encryption; the other is data hiding. The most typical techniques for hiding data are steganography and watermarking. Steganography is a technique that keeps unauthorized users from finding hidden data by embedding it in a variety of media such as text, images, video, and audio. Even if an attacker finds secret messages that were encrypted and hidden, the attacker generally still needs to decrypt them, which means a higher security level for the message. Also, users tend to compress the secret data before hiding it, because the required cover size depends on the size of the secret data. Here, the data that covers the secret data is called "Cover-data", and the data in which the secret has been hidden by the steganography method is called "Stego-data". While image-based steganography is used and developed more often than other steganography methods, audio-based steganography is also actively researched. In audio steganography, data
is generally hidden where the listener does not actually listen, so listeners are not aware that data is hidden. Besides, since the Stego-data size is almost the same as the Cover-data size, most listeners cannot tell the difference. Many methods have appeared not only for authentication and copyright but also for concealing files that contain important information. We propose a method that hides information in wave files. The method extracts the information without using the original file, so we can discriminate between a file that includes hidden information and one that does not. The main method used on wave files is low-bit encoding, but we propose a new method based on CDMA. CDMA (Code Division Multiple Access) is a digital spread-spectrum modulation technique used mainly with personal communications devices such as mobile phones. CDMA digitizes the conversation and tags it with a special frequency code. The data is then scattered across the frequency band in a pseudo-random pattern. The receiving device is instructed to decipher only the data corresponding to a particular code to reconstruct the signal. In order to protect the signal, the code used is pseudo-random: it appears random, but is actually deterministic, so that the receiver can reconstruct the code for synchronous detection. This pseudo-random code is also called pseudo-noise (PN). We used an 8-bit PN code. We compared the CDMA method with StegoWaveK and found that the proposed method performs similarly to StegoWaveK.
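For background, the sketch below shows the classic low-bit (LSB) encoding mentioned above, in which a hidden bit replaces the least significant bit of each 16-bit sample; it is a generic illustration of that existing technique, not the DME or CDMA method of this paper.

```python
# Generic low-bit (LSB) embedding into 16-bit PCM samples, shown only for
# contrast with DME/CDMA; one message bit replaces the LSB of each sample.
def embed_lsb(samples, message_bits):
    out = list(samples)
    for i, bit in enumerate(message_bits):
        out[i] = (out[i] & ~1) | bit          # clear the LSB, then set the message bit
    return out

def extract_lsb(samples, n_bits):
    return [s & 1 for s in samples[:n_bits]]

stego = embed_lsb([1000, -2000, 3000, 4000], [1, 0, 1, 1])
assert extract_lsb(stego, 4) == [1, 0, 1, 1]
```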
2 StegoWaveK
The StegoWaveK model, in which a message is hidden using the sine accumulation function, has been designed and implemented. The commercialized audio steganography software hides the secret message one bit at a time in the low bit of each 16-bit sample, which allows attackers to filter out the hidden message visually and lowers capacity. To solve this problem, the Dynamic Message Embedding (DME) module has been designed and implemented for the StegoWaveK model (the suggested audio steganography model). The DME module analyzes the characteristics of the Cover-data and the secret message to determine the number of bits hidden in every 16 bits of the Cover-data and the embedding locations, selects the most suitable embedding function, and hides the secret message. Fig. 1 shows the flow of the StegoWaveK audio steganography model, which consists of compression, encryption, and insertion of the secret message into the Cover-data. Users can select encryption. In the Dynamic Message Embedding (DME) module, the secret message is inserted at a certain critical value, not at the low bit, as long as the characteristics of the Cover-data are maintained. To obtain the critical value, features of the Cover-data and the secret message are analyzed and processed. Also, the most suitable algorithm for inserting the secret message is selected from the algorithm that hides one bit and the algorithm that improves capacity using a sine-curve form. In the pre-processing, the sound volume of the Cover-data, which is a wave file, is analyzed and the distribution of the sound volume of the wave file is studied. Then, referring to the ratio with the secret message, the critical value at which to hide the secret message is decided. After the location to hide the secret message is decided, the message embedding algorithm is decided.
Fig. 1. Flowchart of StegoWaveK Model (the secret message undergoes lossless compression and encryption; the Cover-data's data chunk is read and the DME module embeds the encrypted secret message, producing the Stego-data)

Fig. 2. Message Embedding Flowchart (a processing unit is extracted from the Cover-data and cleared, the compressed and encrypted message is embedded unit by unit, and the loop repeats until message hiding is finished, yielding the Stego-data)

Fig. 3. Message Extracting Flowchart (data is extracted from the Stego-data and the message is composed until all hidden data has been extracted, then decompressed and decrypted, yielding the Cover-data plus the original message)
Fig. 2 and Fig. 3 show the actual procedures used to hide and extract the secret message. Basically, the Fmt chunk (the header of the wave file) is 18 bytes, and the last 2 of these 18 bytes describe the size of the extension information. However, if extension information is not included, the Fmt chunk consists of the remaining 16 bytes; in fact, in most cases the Fmt chunk is 16 bytes. The commercialized audio steganography software recognizes a file as a wave file only if the data chunk identifier appears in the 4 bytes starting at the 37th byte (the location defined under the assumption that the Fmt chunk of the wave file is 16 bytes). In other words, the commercialized audio steganography software does not recognize wave files with an 18-byte Fmt chunk. In particular, when MP3 music files are converted into wave files, they mostly have an 18-byte or 14-byte Fmt chunk, which is not supported by the commercialized audio steganography software. To solve this problem, the Chunk Unit Read (CUR) module, which processes the wave file chunk by chunk, has been designed. In the CUR module, data characteristics are not judged by reading
fixed location values; instead, the wave file is processed according to its chunks. Therefore, not only wave files with an 18-byte Fmt chunk (the basic format) but also wave files with 16-byte and 14-byte Fmt chunks can be used as the Cover-data.
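In the spirit of the CUR module, the sketch below walks the RIFF chunks of a wave file instead of reading fixed byte offsets; it is an assumed illustration, not the authors' implementation.

```python
# Minimal sketch (an assumption, not the authors' CUR module): iterate over the
# RIFF chunks of a wave file so 14-, 16-, and 18-byte Fmt chunks are all handled.
import struct

def read_chunks(path):
    chunks = {}
    with open(path, "rb") as f:
        riff, _size, wave = struct.unpack("<4sI4s", f.read(12))
        assert riff == b"RIFF" and wave == b"WAVE"
        while True:
            header = f.read(8)
            if len(header) < 8:
                break
            cid, csize = struct.unpack("<4sI", header)
            chunks[cid] = f.read(csize)
            if csize % 2:              # chunk data is word-aligned
                f.read(1)
    return chunks

# chunks[b"fmt "] may be 14, 16, or 18 bytes long; chunks[b"data"] holds the samples.
```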
3 CDMA Method and CDStego
This section introduces the CDMA method used for information hiding, in comparison with StegoWaveK [9]. Fig. 4 shows the case where the spreading signal is demodulated with a spreading code different from the sender's. If the spreading signal is demodulated with a different spreading code, for example one with the value '01101001010', or if it is not demodulated at the right time, the receiver cannot decode the information data exactly. Therefore, in CDMA the receiver has to know the exact spreading code and timing to recover the original information data.
Fig. 4. The case where the spreading signal is demodulated by a different spreading code from the sender's (spreading signal, spreading signal at the receiving side, and received data)
Spreading codes, which resemble white noise, have no relevance to the information data. Because spreading codes have random-looking values and can be generated without limit, it is next to impossible for eavesdroppers to guess them. Fig. 5 and Fig. 6 show a rule and an example for generating spreading codes, respectively. This rule is known as the Orthogonal Variable Spreading Factor (OVSF) code.

Cch,1,0 = (1)
Cch,2,0 = (Cch,1,0  Cch,1,0) = (1, 1)
Cch,2,1 = (Cch,1,0  -Cch,1,0) = (1, -1)
Cch,2^(n+1),0 = (Cch,2^n,0  Cch,2^n,0)
Cch,2^(n+1),1 = (Cch,2^n,0  -Cch,2^n,0)
Cch,2^(n+1),2 = (Cch,2^n,1  Cch,2^n,1)
Cch,2^(n+1),3 = (Cch,2^n,1  -Cch,2^n,1), etc.

Fig. 5. Rule for generating spreading codes
Let us look at CDMA more closely. The output of a CDMA encoder, Z_i,m, can be represented as the multiplication of the i-th data bit d_i and the m-th chip of the CDMA code c_m, that is, Z_i,m = d_i · c_m. After receiving the encoded bits Z_i,m without interference, a receiver can decode the original data bit d_i by correlating with the code and normalizing: d_i = (1/M) · Σ_{m=1..M} Z_i,m · c_m, where M is the length of the code.
Cch,1,0 = (1)
Cch,2,0 = (1, 1)                 Cch,2,1 = (1, -1)
Cch,4,0 = (1, 1, 1, 1)           Cch,4,1 = (1, 1, -1, -1)
Cch,4,2 = (1, -1, 1, -1)         Cch,4,3 = (1, -1, -1, 1)
Cch,8,0 = (1, 1, 1, 1, 1, 1, 1, 1)
Cch,8,1 = (1, 1, 1, 1, -1, -1, -1, -1)
Cch,8,2 = (1, 1, -1, -1, 1, 1, -1, -1)
Cch,8,3 = (1, 1, -1, -1, -1, -1, 1, 1)
Cch,8,4 = (1, -1, 1, -1, 1, -1, 1, -1)
Cch,8,5 = (1, -1, 1, -1, -1, 1, -1, 1)
Cch,8,6 = (1, -1, -1, 1, 1, -1, -1, 1)
Cch,8,7 = (1, -1, -1, 1, -1, 1, 1, -1)

Fig. 6. Example for generating spreading codes
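The construction of Fig. 5 and Fig. 6 can be reproduced with the short sketch below; it is an illustration, not code from the paper.

```python
# Sketch reproducing the OVSF rule of Fig. 5: every code c of length n spawns
# the two children (c, c) and (c, -c) of length 2n.
def ovsf_codes(length):
    """Return the OVSF spreading codes of the given power-of-two length."""
    codes = [[1]]
    while len(codes[0]) < length:
        codes = [child for c in codes
                 for child in (c + c, c + [-x for x in c])]
    return codes

# Prints Cch,8,0 ... Cch,8,7 exactly as listed in Fig. 6.
for i, c in enumerate(ovsf_codes(8)):
    print(f"Cch,8,{i} =", tuple(c))
```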
When N senders transmit data simultaneously without interference, the mixed CDMA encoder output Z*_i,m can be represented as the sum of the individual encoded bits. As in the single-sender case, receivers can decode the original data bit d_i through the same equation. The interface of the encoding process consists of five steps, including the additional step 4, while the interface of the decoding process is unchanged compared with the previous case. The interface for step 4 has a checkbox option that indicates whether the CDMA encoding function has been selected, and radio buttons that select the size of the pseudo-random number (PN) code; three PN code sizes are supported: 8, 16, and 32 bits. We will now examine the encoding process more closely. When the data size is 8 bits, an 8-bit PN code can be generated using the generation rule above. When the 8-bit data has the value '10110010' and the value '0' is represented as '-1', the data can be expressed as (1, -1, 1, 1, -1, -1, 1, -1). The data can then be encoded with the 8-bit PN codes as shown above. The detailed encoding process is shown in Fig. 7.
Fig. 7. CDMA encoding process when the number of the PN codes is 7 (the hidden-message bits are spread by the 8-bit PN codes, the chips are summed column-wise, and the sums are mapped to the Stego-data bit stream)
If we sum up the above result along each column, the calculation can be expressed as (0, 4, -4, 0, 4, 0, 0, 4), which is then stored in memory as a bit stream. The calculation consists of eight numbers, and each number ranges from -4 to 4. But if all the entries in a column are 1 or all are -1, each number of the calculation can range from -8 to 8. In fact, each number is one of the nine values (-8, -6, -4, -2, 0, 2, 4, 6, 8), because the PN code consists of symmetrical 1 or -1 as
shown above. Since 4 bits of memory are required to express these nine values, the total memory required to encode 8 bits of data with the 8-bit PN code is 32 bits (8 bits × 4 bits). A 4-bit memory can represent fifteen kinds of numbers; however, since the calculation only produces nine kinds of numbers, the capacity for the other kinds is wasted. To improve memory efficiency, if the last of the above PN codes is not used, each number of the calculation is one of the eight values (-7, -5, -3, -1, 1, 3, 5, 7). Since only 3 bits are required to express these eight values, the total required memory is 24 bits (8 bits × 3 bits). If we apply only seven PN codes to the above example, the calculation can be expressed as (1, 3, -5, 1, 3, 1, 1, 3). When the number of PN codes is 7, the total memory required to encode 8 bits of data is 48 bits (8 bits × 3 bits × 2 times), because the remaining 1 bit is encoded in a second pass. So, for 8 bits of data, the total memory encoded with seven PN codes is larger than with eight PN codes. However, for 56 bits of data, when the number of PN codes is 8 the total required memory is 224 bits (8 bits × 4 bits × 7 times), whereas with seven PN codes it is 192 bits (8 bits × 3 bits × 8 times). Therefore, for such data the total memory encoded with seven PN codes is smaller than with eight PN codes.

In the decoding process, the PN codes can be generated in the same way as shown in Fig. 5 and Fig. 6. Let us decode the fifth bit from the received bit stream shown in Fig. 7. The bit stream is divided into 3-bit fragments as shown in Fig. 8, and each fragment is mapped to a value in the mapping table. After mapping, the received bit stream is represented as (1, 3, -5, 1, 3, 1, 1, 3). We multiply the mapped values element-wise by the fifth PN code, (1, -1, 1, -1, 1, -1, 1, -1), obtaining (1, -3, -5, -1, 3, -1, 1, -3). Summing these numbers gives -8, which is divided by 8 because the size of the PN code is 8; we thus obtain the fifth original bit, 0, corresponding to -1. Fig. 8 shows the CDMA decoding process when the number of the PN codes is 7.

We define the required classes and functions for CDMA as follows.
• CBitData class: provides functions for the conversion between bits and bytes to process the modulation between the original data and a spreading bit stream. This class is similar to the existing CBitByteData class.
• EncodeSize and DecodeSize functions: provide operations for calculating the size of the result data in the encoding/decoding process.
• EncodeCDMA and DecodeCDMA functions: provide operations for encoding to CDMA and decoding from CDMA.
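The following sketch illustrates the spreading and despreading just described; it is an assumed stand-in for the EncodeCDMA/DecodeCDMA functions, not the authors' C++ classes, and it uses eight PN codes (the paper's Fig. 7 uses seven).

```python
# Minimal sketch (assumption, not the authors' EncodeCDMA/DecodeCDMA code):
# each hidden bit is spread by its own PN code, the chips are summed column-wise,
# and a bit is recovered by correlating the chip sums with its PN code and
# dividing by the code length.
def ovsf_codes(length):                       # same rule as the earlier sketch
    codes = [[1]]
    while len(codes[0]) < length:
        codes = [child for c in codes
                 for child in (c + c, c + [-x for x in c])]
    return codes

def cdma_encode(bits, pn_codes):
    symbols = [1 if b else -1 for b in bits]  # 0 -> -1, 1 -> +1
    return [sum(d * code[m] for d, code in zip(symbols, pn_codes))
            for m in range(len(pn_codes[0]))]

def cdma_decode_bit(chips, pn_code):
    value = sum(z * c for z, c in zip(chips, pn_code)) / len(pn_code)
    return 1 if value > 0 else 0

codes = ovsf_codes(8)
data = [1, 0, 1, 1, 0, 0, 1, 0]               # the example bit pattern '10110010'
chips = cdma_encode(data, codes)
recovered = [cdma_decode_bit(chips, codes[i]) for i in range(len(data))]
assert recovered == data
```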
Fig. 8. CDMA decoding process when the number of the PN codes is 7 (the Stego-data bit stream is split into 3-bit fragments, mapped back to the chip sums (1, 3, -5, 1, 3, 1, 1, 3), multiplied by the corresponding PN code, summed, and divided by 8 to recover each hidden-message bit)
Fig. 9. Waveform Comparison of Cover-data and applied Audio Steganography WAVE Files
4 Performance Evaluation of CDStego
In this section, StegoWaveK, which has been implemented in VC++.NET, is compared with Invisible Secrets 2002 (hereinafter "CS I") and Steganos Security Suite 4 (hereinafter "CS II"), which are commercialized and in use now. According to [7], steganography analysis uses visual, audible, structural, and statistical techniques. Therefore, the comparison and analysis in this study were based on the criteria of the Human Visible System (HVS), Human Auditory System (HAS), Statistical Analysis (SA), and Audio Measurement (AM). Since the HAS can be relatively subjective, audio measurement analysis was added to analyze and compare the Stego-data and the Cover-data more objectively. For the comparison and analysis data, the song Dolgo by Yun Do Hyen was used; experiments with music of other genres gave similar results. According to the waveform analysis of the Stego-data created by Invisible Secrets 2002, CDStego, and StegoWaveK using an audio editor, CoolEdit 2000, it is hard to tell the difference visually due to HVS characteristics. Fig. 9 shows waveforms of the Stego-data captured in CoolEdit 2000; as Fig. 9 shows, it is extremely difficult to tell whether data is hidden merely by visually checking the waveform. To analyze and compare the suggested system with the existing systems on the criteria of the Human Auditory System (HAS), 13 messages of different sizes and 4 wave files were selected as Cover-data.
Fig. 10. Result of WAVE Listening Test (proportion of listeners reporting no difference, a recognized difference, or a complete difference for CS I, CS II, StegoWaveK, and CDStego)

Fig. 11. Comparison of SNR (dB) between Cover-data and Stego-data for CS I, CS II, StegoWaveK, and CDStego
We played the Stego-data in which the message was hidden by CS I and CS II using low-bit encoding, and the Stego-data in which the message was hidden by the StegoWaveK system and CDStego, to 100 students. Although the test is somewhat subjective, most students could not tell the difference between the Cover-data and the Stego-data; Fig. 10 shows the result. Audio measurement and analysis includes frequency response, gain or loss, harmonic distortion, intermodulation distortion, noise level, phase response, and transient response; these parameters involve the signal level or phase and the frequency. For example, the Signal to Noise Ratio (SNR) is a level measurement represented in dB or as a ratio. In [4, 5], the quality of the Stego-data in which a message is hidden was measured using SNR; SNR represents a ratio of relative values [1, 2]. Fig. 11 shows the SNRs between the Cover-data and the Stego-data created by CS I, CS II, CDStego, and StegoWaveK. The SNR of the Stego-data created by the suggested system is not significantly different from that of the Stego-data created by CS I; however, the SNR of the Stego-data created by CS II differs noticeably from that of the suggested system. Finally, the Cover-data and the Stego-data were analyzed with the Perceptual Evaluation of Speech Quality (PESQ). It is difficult to completely trust the result of an automated sound measurement such as PESQ, but the result is reliable enough to be used in various kinds of related tests [6].
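As a reference for how such an SNR figure can be obtained, the sketch below computes the SNR in dB between Cover-data and Stego-data samples; it is an assumption about the measurement, not the authors' tooling.

```python
# Minimal sketch (assumed measurement, not the authors' code): SNR in dB between
# Cover-data samples and Stego-data samples, treating their difference as noise.
import math

def snr_db(cover, stego):
    signal_power = sum(c * c for c in cover)
    noise_power = sum((s - c) ** 2 for c, s in zip(cover, stego))
    if noise_power == 0:
        return float("inf")       # identical signals
    return 10.0 * math.log10(signal_power / noise_power)

print(snr_db([100, -120, 80, -90], [101, -120, 79, -90]))
```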
5 Conclusion
The audio steganography that hides data uses a wave file as Cover-data; in the suggested system the Stego-data file has the same size as the Cover-data, and most listeners and users cannot tell any difference in sound quality. Therefore, they cannot recognize that data is hidden in the Stego-data. Also, neither the HVS nor the HAS can intuitively detect from the waveform what could only be found by analyzing it with a simple audio editing tool. Therefore, the method can be useful for sending secret data. The commercialized steganography software uses only certain types of wave files as the Cover-data, and since one bit of the secret message is inserted into the low bit of the Cover-data, the Cover-data can be easily filtered; moreover, because only one bit at a time is inserted to hide the secret message, the required Cover-data size increases. The CDStego model suggested in this study has been specially designed to conceal data more safely and to improve on the problems of existing commercial audio steganography software. Thus, CDStego has properties similar to StegoWaveK [9] and supports a larger hidden message. While data can only be hidden visually through characteristics of the file system in the Windows OS, the CDStego model hides data in audio data that is used every day. Although the hidden data no longer exists as a separate file on the computer, the user can extract it whenever he or she wants without any loss. Therefore, the CDStego and StegoWaveK models can be useful for hiding important design drawings, program files, and confidential documents. Further studies will address migration to embedded systems. By introducing the new methods suggested in [8], or by using high-performance lossless compression formats such as TIFF, it would be possible to improve the performance of CDStego further.
References
1. S.K. Pal, P.K. Saxena, S.K. Muttoo: "The Future of Audio Steganography", STEG'02, July 11-12, 2002.
2. http://www-ccrma.stanford.edu/CCRMA/Courses/422/projects/WaveFormat
3. Fabien A.P. Petitcolas, Ross J. Anderson, and Markus G. Kuhn: "Information Hiding – A Survey", Proceedings of the IEEE, special issue on protection of multimedia content, 87(7):1062-1078, July 1999.
4. Stefan Katzenbeisser and Fabien A.P. Petitcolas: "Information Hiding Techniques for Steganography and Digital Watermarking", Artech House Publishers, 2000.
5. J.D. Gordy and L.T. Bruton: "Performance Evaluation of Digital Audio Watermarking Algorithms", IEEE MWSCAS 2000.
6. Djimitri Wiggert: "Codes for Error Control and Synchronization", Artech House Inc, 1988.
7. Peter Wayner: "Disappearing Cryptography – Information Hiding: Steganography & Watermarking", second edition, chapter 17, Morgan Kaufmann, 2002.
8. Ira S. Moskowitz, Garth E. Longdon, and LiWu Chang: "A New Paradigm Hidden in Steganography", New Security Paradigms Workshop 2000, September 19-21, 2000, Cork, Ireland.
9. Y.S. Kim et al.: "Information Hiding System StegoWaveK for Improving Capacity", The 2003 International Symposium on Parallel and Distributed Processing and Applications.
Efficient Key Distribution Protocol for Electronic Commerce in Mobile Communications Jin Kwak1 , Soohyun Oh2 , and Dongho Won1 1
Information and Communications Security Lab. School of Information and Communication Engineering Sunkyunkwan University 300 Choencheon-Dong, Jangan-Gu, Suwon, Gyeonggi-Do, 440-746, Korea {jkwak,dhwon}@dosan.skku.ac.kr 2 Division of Computer Science Hoseo University, Asan, Chuncheongnam-Do, 336-795, Korea [email protected]
Abstract. Recently, with the development of mobile communication technology, the number of mobile terminal users has been increasing. In the mobile environment, security services are very important for providing secure mobile communication, such as mobile electronic commerce. The security of information is essential for secure electronic commerce in mobile communications, so security services must be provided in the mobile communication environment. However, the mobile communication environment has some disadvantages: low power capacity, low CPU performance, limited memory, and so on. In this paper, we propose an efficient key distribution protocol for mobile communications.
1 Introduction
With the development of mobile communication technology, the number of mobile terminal users is increasing. As a result, Internet services and applications for mobile terminals are increasing rapidly. In order to provide these services and applications over mobile communications, security services are very important for providing secure mobile communication. Therefore, security services such as confidentiality, authentication, integrity, and non-repudiation must be provided in the mobile communication environment, the same as in the wired Internet. However, the mobile communication environment has some disadvantages: low power capacity, low CPU performance, limited memory, and so on. The security technology for secure mobile communication will also be applicable to the ubiquitous computing domain in the near future. Several well-known protocols have been proposed [1,2,3,4,6,7,10,11,12,13]; however, the previously proposed protocols have some weaknesses against active attacks. Therefore, in this paper we consider active attacks and then propose an efficient key distribution protocol for the mobile environment that is secure against them.
This work was supported by the University IT Research Center Project of the MIC (Ministry of Information and Communication), Republic of Korea.
The goal of this paper is to introduce a simple and efficient key distribution method that is nevertheless secure against active attacks. We then propose an efficient key distribution protocol for electronic commerce in mobile communications (M-Commerce). This paper is organized as follows: first we describe the security requirements of the mobile communication environment, and then we propose an efficient key distribution protocol for M-Commerce. Next, we analyze the security and properties of the proposed protocol in comparison with existing protocols. Finally, we draw conclusions.
2 Security Requirements Primer
For secure communication in the mobile environment, a mobile key distribution protocol has to satisfy the requirements below [6].
– n-pass: the number of passes the protocol needs to compute the session key from one party's point of view; fewer passes mean better efficiency.
– mutual entity authentication: one party is assured, through the acquisition of corroborative evidence, of the identity of a second party involved in the protocol, and that the second party has actually participated in the key agreement protocol.
– secret key agreement: to protect transaction data between the client and the server, the secret key is generated from information contributed by both entities.
– key authentication: one party is assured that no party other than a specifically identified second party could possibly compute the shared secret key.
– key freshness: a key used to compute a shared secret value is updated after the execution of a new exchange protocol.
– user anonymity: the state of being unidentifiable within a set of subjects; the confidentiality of the user identity within the communication.
– non-repudiation: prevents an entity from denying previous actions, such as sending or receiving data. When a dispute arises because an entity denies that certain actions were taken, a means to resolve the situation is necessary. It is a necessary security requirement in electronic commerce applications.
3 Proposal Protocol
In this section, we propose a key distribution method for secure M-Commerce. It is secure against several active attacks (see Section 4.1) and is also efficient (see Section 4.2). The proposed protocol consists of the registration and key distribution phases, and the notations are shown below.

[Notations and Definitions]
– i : an entity, the server or the client
– p : a prime defining the Galois field GF(p)
– q : a prime factor of p − 1
– g : a primitive element of order q in Zp
– Pi : entity i's public key, Pi = g^Si mod p
– Si : entity i's secret key
– ri : random number selected by entity i
– SIGi : signature of entity i
– IDi : identification information of entity i
– meta-IDC : temporary ID of the client
– PINC : personal identification number of the client
– h() : hash function
– A || B : concatenation of A and B
[Registration]
When the client starts communication using a mobile device, he needs a registration process with the server (e.g., a shopping mall or mobile station). Fig. 1 shows the registration process of the proposed protocol.
Client → Server: X1 = PS{ h(PINC) || IDC }
Server: SS{X1}; rS1 ∈R Zp*; meta-IDC = h{ h(PINC) || IDC || rS1 }
Server → Client: X2 = PC{ meta-IDC || rS1 }
Client: SC{X2}; meta-ID'C = h{ h(PINC) || IDC || rS1 }; check meta-ID'C =? meta-IDC
Client → Server: h(h(meta-ID'C))
Server: check h(h(meta-ID'C)) =? h(h(meta-IDC))
Fig. 1. Registration phase
1. The client selects PINC and computes X1 using his own IDC and h(PINC):
X1 = PS{ h(PINC) || IDC }
2. The server decrypts the received X1 using his secret key SS and selects a random number rS1. The server computes the temporary meta-IDC and X2 after storing rS1 and h(PINC), and then sends X2 to the client:
X2 = PC{ meta-IDC || rS1 }
3. When the client receives X2, he decrypts it using his secret key SC. He then computes a temporary meta-ID'C using the server's random number rS1. If it equals the received meta-IDC, the client sends the hashed value to the server:
meta-ID'C =? meta-IDC
4. The server checks that the received hashed meta-ID'C equals the hashed meta-IDC generated by the server:
h(h(meta-ID'C)) =? h(h(meta-IDC))
[Key Distribution]
In the previous protocols, the client authenticates the server, but the server cannot authenticate the client (one-way authentication). Therefore, mutual entity authentication is needed for secure mobile communication [8]. The operation of the proposed protocol is as follows; Fig. 2 shows the key distribution process of the proposed protocol for secure mobile communication.
Client: rC ∈R Zp*; C1 = SIGC{ rC || meta-IDC }
Client → Server: Q1 = ErS1{ C1 || meta-IDC }
Server: DrS1{Q1}; rS2 ∈R Zp*; C2 = SIGS{ rS2 || IDS }; Q2 = ErS1(C2); SKS = h( Eh(PINC)(C1) || C2 )
Server → Client: Q2, h(h(SKS))
Client: DrS1{Q2}; SKC = h( Eh(PINC)(C1) || C2 ); check h(h(SKC)) =? h(h(SKS))
Fig. 2. Key Distribution phase
1. The client selects a random number rC and computes C1 with his signature. He then computes Q1 and sends it to the server:
Q1 = ErS1{ C1 || meta-IDC }
2. The server decrypts the received Q1 and selects a random number rS2. The server computes C2 and Q2 and generates the session key SKS. He then sends Q2 and the hashed value of the session key SKS:
Q2 = ErS1(C2),  SKS = h{ Eh(PINC)(C1) || C2 }
3. The client decrypts Q2 and computes the session key SKC. He then checks that the received hash of SKS equals the hash of the SKC he computed:
SKC = h{ Eh(PINC)(C1) || C2 }
h(h(SKC)) =? h(h(SKS))
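As a rough illustration of the key confirmation step, the sketch below derives SK = h(E_h(PINC)(C1) || C2) on both sides and compares the double hashes. The symmetric encryption E is stubbed with a keyed hash purely as a placeholder, and the signed values are treated as opaque byte strings, so this is only an assumed illustration of the check, not an implementation of the protocol.

```python
# Illustration only (E stubbed with a keyed hash; C1, C2 treated as opaque bytes).
import hashlib, hmac

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def E(key: bytes, msg: bytes) -> bytes:        # placeholder for the cipher E_k(.)
    return hmac.new(key, msg, hashlib.sha256).digest()

pin_c = b"1234"                                # client's PIN (hypothetical value)
C1 = b"SIG_C(rC || meta-IDC)"                  # client's signed value (opaque here)
C2 = b"SIG_S(rS2 || IDS)"                      # server's signed value (opaque here)

SK_S = h(E(h(pin_c), C1) + C2)                 # server: SK_S = h(E_h(PINc)(C1) || C2)
SK_C = h(E(h(pin_c), C1) + C2)                 # client computes the same value
assert h(h(SK_C)) == h(h(SK_S))                # key confirmation check
```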
4 Analysis
In this section, we analyze the security and properties of the proposed protocol. The proposed protocol is based on the difficulty of the discrete logarithm problem; therefore, for an attacker, computing the session key from public information is equivalent to solving a discrete-logarithm-based problem. For the security analysis, we consider several active attack models [9].

[Security]
Definition 1. Active Impersonation: An adversary enters a session with a legitimate entity, impersonates another entity, and finally computes a session key with that entity. That is, an adversary who does not know the secret information of each entity masquerades as another entity, participates in the protocol, and tries to share a session key with the valid entity. The proposed protocol uses hash values and secret information; such an attack is impossible, since the adversary knows neither the secret key of each entity nor the randomly selected rS2 needed to compute a valid session key.

Definition 2. Forward Secrecy: The secrecy of previous session keys established by a legitimate entity is compromised. A distinction is sometimes made between the scenario in which a single entity's secret key is compromised (half forward secrecy) and the scenario in which the secret keys of both entities are compromised (full forward secrecy). If the secret key of the client is exposed and the adversary obtains it, the adversary still cannot compute a previous session key: the adversary would have to obtain the hash value h(PINC) and the randomly selected rS2, so the attack is impossible. Likewise, if the secret key of the server is exposed, the adversary cannot compute a previous session key, for the same reason. Even an attacker who obtains the secret keys of both entities faces the same difficulty as a passive attacker, because the adversary does not know the hash value h(PINC) and the randomly selected rS2; therefore the attack is impossible. Accordingly, the proposed protocol satisfies full forward secrecy, since session keys remain secure even if the old secret keys of both entities are exposed.

Definition 3. Key-Compromise Impersonation: Suppose the secret key of an entity is disclosed. Clearly, an adversary who knows this value can now impersonate that entity, since it is precisely this value that identifies the entity. In some circumstances, it may be desirable that this loss does not enable the adversary to impersonate the other entity and share a key with the entity as the legitimate user. In the proposed protocol, an adversary cannot masquerade as the client to the server or vice versa, even if the secret key of the client is exposed, because the adversary cannot
know h(PINC) and rS2. Therefore, an adversary cannot masquerade as a valid client even if he obtains the secret key.

Definition 4. Known Key Secrecy: A protocol should still achieve its goal in the face of an adversary who learns some other session keys; a protocol that satisfies this condition is said to provide known key security. Since the proposed protocol uses a random value rS2 chosen by the server and rC chosen by the client in each session to compute the session key, exposure of previous transmission information gives no advantage in obtaining the session key of the current session. Therefore, a known-key passive attack is as hard as a passive attack with no information about previous sessions. A known-key impersonation attack is also impossible, because the adversary cannot directly join or participate in the session: the adversary cannot masquerade as the server or the client, or set the session key, using the transmitted information of the current or previous sessions.

[Properties]
We analyze the efficiency of the proposed protocol and compare it with the existing protocols. For the efficiency analysis, we consider the security requirements described in Section 2 [5,9]. Table 1 shows the characteristics of the proposed protocol in comparison with the existing protocols. The proposed protocol is a 2-pass protocol, the same as the previous protocols. The previous BCY protocol [1] does not provide entity authentication, and PACS [10] provides one-way entity authentication; the proposed protocol, on the other hand, provides mutual entity authentication, since the server and the client generate the session key using secret information belonging to both the client and the server. When the session key is computed, the proposed protocol provides implicit key authentication to the client and explicit key authentication to the server; this is the same as the PACS protocol, whereas BCY provides explicit key authentication to both parties. The proposed protocol provides mutual key freshness, since both parties compute the session key with random values chosen by each other. In addition, BCY and PACS provide one-way user anonymity, while the proposed protocol provides mutual anonymity. The proposed protocol also provides a mutual non-repudiation service, since the parties use each other's signatures when they compute the session key.
5 Conclusion
In this paper, we proposed a secure and efficient key distribution protocol for the mobile environment that withstands active attacks such as active impersonation and key-compromise impersonation and that provides forward secrecy and known key secrecy. Furthermore, the proposed protocol provides mutual entity authentication, key authentication, mutual key freshness, user anonymity, and non-repudiation. Table 1 summarizes these properties in comparison with the previous protocols.
Table 1. Properties of the proposed protocol in comparison with previous protocols

properties              BCY                PACS               proposed protocol
n-pass                  2                  2                  2
entity authentication   ×                  one-way            mutual
key authentication      user: implicit     user: implicit     user: explicit
                        server: explicit   server: explicit   server: explicit
key freshness           ×                  ×                  mutual
user anonymity          one-way            one-way            mutual
non-repudiation         ×                  one-way            mutual
As a result, the proposed protocol will be useful for electronic commerce using mobile terminals. Furthermore, the proposed protocol will be useful in ubiquitous computing environments, because it is simple but secure. The proposed protocol can be implemented efficiently and provides the additional security and efficiency properties that have been scrutinized in this paper.
References
1. M.J. Beller, L.F. Chang, and Y. Yacobi: Security for Personal Communication Services: Public-key vs. private key approaches. Proceedings of the Third IEEE International Symposium on Personal, Indoor and Mobile Radio Communications – PIMRC'92, pages 26–31, 1992.
2. M.J. Beller, L.F. Chang, and Y. Yacobi: Privacy and authentication on a portable communications system. IEEE Journal on Selected Areas in Communications, 11(6):821–829, 1993.
3. M.J. Beller and Y. Yacobi: Fully-fledged two-way public key authentication and key agreement for low-cost terminals. IEE Electronics Letters, 29(11):999–1001, 1993.
4. U. Carlsen: Optimal Privacy and Authentication on a Portable Communications System. ACM Operating Systems Review, 28(3):16–23, 1994.
5. J.S. Go and K.J. Kim: Wireless Authentication Protocols Preserving User Anonymity. Proceedings of SCIS 2001, 2(1):159–164, 2001.
6. G. Horn, K.M. Martin, and C.J. Mitchell: Authentication protocols for mobile network environment value-added services. IEEE Transactions on Vehicular Technology, 51:383–392, 2002.
7. K.H. Lee and S.J. Moon: AKA Protocols for Mobile Communications. Australasian Conference on Information Security and Privacy – ACISP 2000, number 1841 in Lecture Notes in Computer Science, pages 400–411, 2000.
8. K. Nyberg and R.A. Rueppel: Message recovery for signature schemes based on the discrete logarithm problem. Advances in Cryptology – EUROCRYPT'94, number 950 in Lecture Notes in Computer Science, pages 182–193, 1995.
9. S.H. Oh, J. Kwak, S.W. Lee, and D.H. Won: Security Analysis and Applications of Standard Key Agreement Protocols. Computational Science and Its Applications – ICCSA 2003, number 2668 in Lecture Notes in Computer Science, part 2, pages 191–200, 2003.
10. PACS: PACS (Personal Access Communications System) Air Interface Standard. J-STD-014, 1995.
11. V. Varadharajan and Y. Mu: On the Design of Security Protocols for Mobile Communications. Australasian Conference on Information Security and Privacy – ACISP'96, pages 134–145, 1996.
12. Y. Zheng: An Authentication and Security Protocol for Mobile Computing. Proceedings of IFIP, pages 249–257, 1996.
13. Y. Zheng: Digital Signcryption or How to Achieve Cost(Signature and Encryption) << Cost(Signature) + Cost(Encryption). Advances in Cryptology – CRYPTO'97, number 1294 in Lecture Notes in Computer Science, pages 165–179, 1997.
A Framework for Modeling Organization Structure in Role Engineering HyungHyo Lee1 , YoungLok Lee2 , and BongNam Noh2 1
2
Div. of Info. and Electronic Commerce Wonkwang Univ., Iksan, 570-749, Korea [email protected] Dept. of Computer Science Chonnam National Univ., Gwangju, 500-757, Korea {yrlee,bongnam}@chonnam.ac.kr
Abstract. RBAC model is renowned as a security model for corporate environment, since its components, especially role hierarchy, are suitable for modeling an organization structure. But the functional role hierarchy constructed through the existing role engineering approaches does not reflect an organization structure, because they do not take the structural characteristics of the organization into account. Also, it has been observed that the unconditional permission inheritance property in functional role hierarchy may breach a least privilege security principle and make it impossible to define separation of duty requirements on roles that have a common senior role. In this paper, we propose a role engineering methodology considering organizational roles as well as functional roles to provide a practical RBAC model for corporate environment. We also elaborate the characteristics of organizational roles relatively neglected in the previous work, and compare them with those of functional roles. And models for associating organizational and functional roles and those role hierarchies (unified vs. separate) are proposed and the advantages and shortcomings of those models are given.
1 Introduction
Role-Based Access Control (RBAC) is now widely accepted as an alternative to the traditional access control models such as MAC and DAC. The basic idea of associating a set of privileges or permissions with a named role and assigning that role to users is well established and is deployed in several commercial computer systems and applications [6]. While the concepts of a role and a role hierarchy are central to many RBAC models, most research on RBAC has focused on the representation, administration, and activation of roles, assuming that roles and the role hierarchy are already determined [8]. Recently, a number of studies have been conducted to develop a methodology for extracting the main components of the RBAC model, such as roles and role hierarchies, from an organization's business processes and documents [3, 4, 7, 8]. Role engineering for RBAC is the process of defining roles, permissions, role hierarchies, and constraints, and assigning the permissions to the roles [9]. It is the first step in implementing an RBAC system and also a crucial component of security engineering. Through such work on role engineering, researchers found that there exist different role classes, such as functional, organizational, and hierarchical roles.
This paper was supported by Wonkwang University in 2004.
They suggested that these role classes are all needed to model a given organization's security policy completely. However, little research clearly explains how those hierarchies are interrelated or proposes a role engineering methodology employing more than one role class; most of the research dealt only with processes for deriving functional roles.

The RBAC model is renowned as a security model for the corporate environment, since its components, especially the role hierarchy, are suitable for modeling an organization structure. But the functional role hierarchy constructed through the existing role engineering approaches does not reflect an organization structure, because those approaches do not take the structural characteristics of the organization into account. Also, it has been observed that the unconditional permission inheritance property in a functional role hierarchy may breach the least privilege security principle and make it impossible to define separation of duty requirements on roles that have a common senior role.

Researchers found that elements of an organization structure, such as departments and teams, inherently provide an important means for deriving functions (i.e., permissions) and a natural interface through which security administrators manage user-to-role assignment. For example, the Personnel and Accounting departments have their own functionalities, and the organizational roles Personnel Manager and Accounting Director are authorized in the context of their departments' tasks. In addition, it is natural for a security administrator to assign a user to an organizational role, i.e., Personnel Manager, rather than to several functional roles, such as Personnel Information Management and Personal Rating Information Management, which are the job functions of the Personnel Manager role. Unlike a functional role hierarchy, a senior role in an organizational role hierarchy need not inherit the permissions of its junior roles, since the inheritance scope of an organizational role hierarchy is typically bound to its structural elements.

In this paper, we propose a role engineering methodology that considers organizational roles as well as functional roles to provide a practical RBAC model for the corporate environment. We also elaborate the characteristics of organizational roles, relatively neglected in previous work, and compare them with those of functional roles. Models for associating organizational and functional roles and their role hierarchies (unified vs. separate) are proposed, and the advantages and shortcomings of those models are given. The rest of this paper is organized as follows: Section 2 describes related work; Section 3 describes the methodology of modeling RBAC with the organization structure from two viewpoints; and finally the paper concludes by comparing the suggested framework with the conventional framework and describing further work.
2 Related Works In [6], the RBAC framework is extended to include the notion of role hierarchies. The model allows the occupants of superior roles to inherit all the positive access rights of their inferiors, and conversely ensures that the occupants of inferior positions inherit any prohibitions that apply to their superiors. However, the authors observed that in some situations inheritance of access rights down the organizational hierarchy may be undesirable. They outline two possible ways of avoiding this. First, they use some other
ordering than the organizational hierarchy to define the role hierarchy. Second, they define subsidiary (“private") roles outside the hierarchy. In [7], the authors used a generalization (is-a relationship) hierarchy based on professional competencies for the analysis of role hierarchies. On the other hand [8] used a generalization hierarchy based on an organization’s functional hierarchy. In [4], they outlined the control principles which are applied in many large organizations and their impact on inherited access rights, and came to the conclusion that the interaction of control principles and role hierarchies could have undesirable consequences for access control. They pointed out that the generalization hierarchy is not the only one that could be validly used for a role hierarchy. That paper was limited mainly to pointing out the problems. In [4], they describe the work in progress with a process-oriented approach for role-finding to implement Role-Based Security Administration. They pointed out that there are a few authorizations which do not belong to functional roles. To overcome that problem, they have agreed on a few nonfunctional role-classes to be introduced to the RBAC concept. The roles are organized within different role classes. Role classes defined at Siemens ICN are functional, organizational, basic, hierarchical, and special roles.
3 A Framework for Modeling Organization Structure
As earlier research shows, functional roles alone cannot model the authorization characteristics of an organization and do not provide an intuitive mapping between an organizational structure and role hierarchies. In this section, we propose two different approaches to solve those problems. The first is a unified role hierarchy approach: functional sub-role hierarchies and an organizational sub-role hierarchy are interconnected and formed into a unified role hierarchy. The other approach is based on separate role hierarchies: rather than combining functional and organizational roles, two separate role hierarchies are modeled and managed; the functional role hierarchy is constructed in terms of permission inheritance for efficient permission management, while the organizational role hierarchy is modeled to reflect the organization structure.

3.1 Unified Role Hierarchy Approach
We briefly describe the notions of sub-role and sub-role hierarchy, the key characteristics of the unified role hierarchy approach. They provide the functional (i.e., job function) and organizational (i.e., job position) authorization features through inter-related sub-role hierarchies. The detailed components and features of the model are in [5].

3.1.1 Roles and Sub-roles. A role is a job function in the organization that describes the authority and responsibility conferred on a user assigned to the role [6]. Administrators can create roles according to the job functions performed in their organizations. This is a very intuitive and efficient approach except for unconditional permission inheritance: in RBAC models, a senior role inherits the permissions of all its junior roles. This property
1020
HyungHyo Lee, YoungLok Lee, and BongNam Noh
may breach the least privilege principle, one of the main security principles RBAC models support [3]. In order to address this drawback, we divide a role into a number of sub-roles based on the characteristics of job functions and the degree of inheritance. Figure 1 shows a proposed sub-role concept compared with an original or macro role.
Fig. 1. Sub-role concept for corporate environment
3.1.2 Permission Inheritance in Sub-role Hierarchies. There are two kinds of sub-role hierarchies: horizontal and vertical. A horizontal sub-role hierarchy (i.e., the job position plane) is a partially ordered relation between sub-roles that are divided from the same macro role, and only unconditional permission inheritance exists between these sub-roles. A vertical sub-role hierarchy (i.e., the job function plane) is a partially ordered relation between sub-roles in the same sub-role category (i.e., CC, DC, RI, PR sub-roles) but in different macro roles, and both unconditional and restricted inheritance exist in it. Sub-roles in the RI category need to specify the sub-roles to which their permissions should be inherited; only the specified sub-roles in the RI hierarchy can inherit the permissions of their junior sub-roles. In Figure 2, even though sub-role ri3 belongs to the General Manager role, a senior role to the Manager role, the permissions of rj3 are not inherited by ri3.
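A minimal sketch of this restricted inheritance idea is given below; it is an assumed illustration (role names and permissions are hypothetical), not the authors' formal model.

```python
# Assumed illustration of restricted inheritance: a normal junior always
# propagates its permissions upward, while an RI sub-role propagates them only
# to seniors it explicitly designates.
class SubRole:
    def __init__(self, name, permissions=(), restricted=False, allowed_seniors=()):
        self.name = name
        self.permissions = set(permissions)
        self.restricted = restricted                 # True for RI-category sub-roles
        self.allowed_seniors = set(allowed_seniors)  # seniors that may inherit
        self.juniors = []

    def add_junior(self, junior):
        self.juniors.append(junior)

    def effective_permissions(self):
        perms = set(self.permissions)
        for junior in self.juniors:
            if junior.restricted and self.name not in junior.allowed_seniors:
                continue                             # restricted inheritance blocks this senior
            perms |= junior.effective_permissions()
        return perms

rj3 = SubRole("Manager.RI", {"approve_leave"}, restricted=True)   # hypothetical permission
ri3 = SubRole("GeneralManager.RI")
ri3.add_junior(rj3)                                  # senior, but not designated by rj3
print(ri3.effective_permissions())                   # set() -- rj3's permissions not inherited
```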
Fig. 2. Example of sub-role hierarchies
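The restricted-inheritance rule of the RI category can be made concrete with a small data-structure sketch. All names below (SubRole, allowed_seniors, inherits_from) are illustrative assumptions rather than part of the model in [5]; the sketch only shows that an RI sub-role hands its permissions up only to the seniors it explicitly lists, while the other categories are inherited unconditionally.

    #include <stdbool.h>
    #include <stddef.h>

    typedef enum { CAT_CC, CAT_DC, CAT_RI, CAT_PR } SubRoleCategory;

    typedef struct SubRole {
        const char       *name;             /* e.g. "Manager.RI"                 */
        SubRoleCategory   category;         /* CC, DC, RI or PR                  */
        struct SubRole  **allowed_seniors;  /* RI only: seniors that may inherit */
        size_t            n_allowed;
    } SubRole;

    /* Unconditional inheritance: a senior always inherits from CC/DC/PR juniors.
     * Restricted inheritance: an RI junior is inherited only by the seniors it
     * explicitly names in allowed_seniors.                                      */
    static bool inherits_from(const SubRole *senior, const SubRole *junior)
    {
        if (junior->category != CAT_RI)
            return true;
        for (size_t i = 0; i < junior->n_allowed; i++)
            if (junior->allowed_seniors[i] == senior)
                return true;
        return false;
    }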
Figure 3 shows an example of an organization's structure and its corresponding sub-role hierarchies. A sub-role hierarchy in the same horizontal plane represents a job position role in an organization unit, such as the Project Leader of Project Team 1, while a sub-role hierarchy in the same vertical plane, such as CC, DC, RI, or PR, is analyzed and constructed from the perspective of permission inheritance. As we can see, the vertical planes in (b) correspond to the organization units in (a). Users are assigned to private sub-roles, since a private sub-role inherits all the permissions of its junior roles. Despite the complex structure of the sub-role hierarchies, the sub-role hierarchy approach gives security administrators a role engineering method for analyzing functional and organizational roles and their relationships.
Fig. 3. Example of an organizational structure and its sub-role hierarchies
3.2 Separate Role Hierarchy Approach In this section, we briefly review the shortcomings of the traditional role engineering process, in which a discrepancy between the role hierarchy and the organization structure usually exists. We then propose another role engineering approach, in which a role engineer constructs organizational and functional role hierarchies separately and associates the roles in the two hierarchies according to the organization's regulations. 3.2.1 Discrepancy Between Role Hierarchy and Organization Structure. Every organization has its own organization structure, which consists of units such as departments, sections, or divisions. Furthermore, in each unit of the organization there are job positions, such as General Manager or Assistant General Manager, with their own tasks and responsibilities. Therefore, the organization's structure provides important authorization information to security analyzers in the role engineering process. But in most traditional role engineering processes, since the role engineer focuses on efficient permission management through permission inheritance, the constructed role hierarchy usually has a structure different from that of the organization. With
a role hierarchy that is inconsistent with the organization's structure, it is not easy for a security administrator to perform security management tasks and to reflect changes in the security policy. In addition, considering that the user assignment (UA) task of the RBAC model is usually performed by employees of the personnel department, the discrepancy between the role hierarchy and the organization's structure makes UA tasks complicated and error-prone.
Fig. 4. Example of traditional role hierarchy
Figure 4 shows a typical role hierarchy that models two different properties in a single hierarchy: the permission inheritance feature and the organization structure. As a result, there is no clear distinction between the two concepts, which makes security management tasks difficult and unmanageable. For example, a structure like the hierarchy among the PE1, QE1, and E1 roles can rarely be found in a real enterprise environment, since it is not common for an organization unit to have two different senior units. That kind of structure stems from the goal of efficient permission management: in Figure 4, the permissions common to PE1 and QE1 are extracted and assigned to a new virtual role, E1. As mentioned earlier, it is desirable for personnel employees to perform user assignment tasks with a role hierarchy whose structure is as close as possible to the organization structure. Furthermore, let us assume that roles PE1 and QE1 are mutually exclusive. In order to meet the separation of duty principle, the role PL1 cannot be a common senior of both PE1 and QE1. To keep the organization structure while satisfying the separation of duty principle, a role engineer needs to add private roles for PE1 and QE1 (i.e., PE1' and QE1', respectively), which prevent the conflicting permissions from being inherited by their senior role. The resulting role hierarchy has a more complex form, different from the original one. All these problems are caused by different types of information, i.e., organization structure and permission inheritance, being mingled in a single role hierarchy without analyzing their own characteristics. 3.2.2 Separate Organizational Role Hierarchy and Functional Role Hierarchy. In the second role engineering approach, in order to solve the problems mentioned in 3.2.1,
we use both an organizational role hierarchy and a functional role hierarchy. The organizational role hierarchy provides non-security experts, including employees of the personnel department, with intuitive information on how the organization's units are related, which job positions exist, and how they are inter-related. The functional role hierarchy, on the other hand, is constructed by role engineers through an analysis of permissions from the viewpoint of efficient permission management, regardless of the organization's structure and the job positions. The major advantage of this approach is the independent and transparent management of each structure. A personnel employee assigns a user (i.e., a new employee or a transferred employee) to a specific job position without detailed knowledge of the functional role hierarchies. A role engineer analyzes authorization information based on the organization's regulations and constructs the organizational role hierarchy; he/she then maps each job position to its corresponding organizational roles. Figure 5 shows examples of organizational and functional role hierarchies. Comparing Figure 5 (a) with Figure 4 (a), no virtual role (i.e., E1 or E2) appears in the organizational role hierarchy. Also, mutually exclusive roles (i.e., PE1 and QE1, PE2 and QE2) have common senior roles in the organizational role hierarchy, but the conflicting permissions are prevented from being inherited by senior roles (i.e., by assigning the permissions that conflict between PE1 and QE1 to roles r2 and r3, respectively, in the functional role hierarchies, roles PE1 and QE1 can have the common senior role PL1 in the organizational role hierarchy).
Fig. 5. Example of organizational and functional role hierarchies
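A compact way to picture the separate-hierarchy approach is sketched below. The declarations are only an illustrative assumption (OrgRole, FuncRole and the mapped field are our names, not notation from the paper): permissions hang off functional roles only, the organizational roles mirror the organization chart, and the role engineer's mapping is an explicit per-role list.

    #include <stddef.h>

    typedef struct FuncRole {          /* functional role: carries permissions      */
        const char  *name;             /* e.g. "r2"                                 */
        const char **permissions;
        size_t       n_permissions;
    } FuncRole;

    typedef struct OrgRole {           /* organizational role: mirrors the
                                          organization chart, holds no permissions  */
        const char       *name;        /* e.g. "PL1"                                */
        struct OrgRole  **juniors;     /* organizational hierarchy                  */
        size_t            n_juniors;
        FuncRole        **mapped;      /* functional roles assigned by the engineer */
        size_t            n_mapped;
    } OrgRole;

    /* User assignment touches only the organizational side: the personnel
     * department binds a user to a job position, and permissions are found by
     * following the mapping into the functional hierarchies.                    */
    typedef struct {
        const char *user;
        OrgRole    *position;
    } UserAssignment;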
4 Conclusions and Future Work Role engineering, in the RBAC system analysis and design phase, is the process of defining roles, permissions, role hierarchies, and constraints, and of assigning the permissions to the roles. It is the first step in implementing an RBAC system and also a crucial component of security engineering. In this paper, we suggest two different role engineering methodologies: the unified and the separate role hierarchy approaches. In the unified role engineering
approach, roles are divided into sub-roles and a number of sub-role hierarchies are constructed based on the job functions and positions. Then, the sub-role hierarchies are interconnected to form a single hierarchy for the organization. Despite its complex structure, the resulting role hierarchy provides an integrated view of the role-based authorization structure. In contrast, in the separate role hierarchy approach, organizational and functional role hierarchies are constructed independently, and the roles in the two hierarchies are then mapped to each other. This enables role engineers to define the role hierarchy with a focus on efficient permission management, and personnel employees to perform user assignment tasks transparently to the details of the functional role hierarchies. Our future work is to develop detailed processes for constructing the unified role hierarchy and the organizational and functional role hierarchies. In addition, the administration of each role hierarchy is another important research topic.
References 1. Ezedin Barka and Ravi Sandhu, "Framework for Role-Based Delegation Models," Proceedings of 23rd National Information Systems Security Conference, pp. 101-114, Baltimore, Oct. 16-19, 2000. 2. Virgil Gligor, Serban Gavrila and David Ferraiolo, "On the Formal Definition of Separation-of-Duty Policies and Their Composition," Proceedings of the IEEE Symposium on Security and Privacy, pp. 172-183, May 3-6, 1998. 3. Jonathan D. Moffett, "Control Principles and Role Hierarchies," 3rd ACM Workshop on Role Based Access Control (RBAC), George Mason University, Fairfax, Virginia, USA, 1998. 4. Haio Roecle, Gerhard Schimpf, Rupert Weidinger, "Process-oriented approach for role-finding to implement role-based security administration in a large industrial organization," Proceedings of the fifth ACM workshop on Role-based access control, Germany, pp. 103-110, 2000. 5. HyungHyo Lee, YoungLok Lee, BongNam Noh, "A New Role-Based Delegation Model Using Sub-Role Hierarchies", International Symposium on Computer and Information Sciences (ISCIS 2003), LNCS 2869, pp. 811-818, 2003. 6. Ravi Sandhu, Edward Coyne, Hal Feinstein and Charles Youman, "Role-based access control model," IEEE Computer, Volume 29, pp. 38-47, Feb. 1996. 7. Lupu, E.C. and M.S. Sloman, "Reconciling Role-Based Management and Role-Based Access Control," in 2nd ACM Workshop on Role-Based Access Control, 1997. 8. Awischus, R., "Role Based Access Control with the Security Administration Manager," in 2nd ACM Workshop on Role-Based Access Control, 1997. 9. Gustaf Neumann, Mark Strembeck, "A Scenario-driven Role Engineering Process for Functional RBAC Roles," in SACMAT'02, pp. 33-42, 2002.
An Efficient Pointer Protection Scheme to Defend Buffer Overflow Attacks
Yongsu Park (2) and Yookun Cho (1)
(1) Department of Computer Science and Engineering, Seoul National University, San 56-1 Shilim-Dong, Gwanak-Ku, Seoul 151-742, Korea, [email protected]
(2) The College of Information and Communications, Hanyang University, 17 Haengdang-dong, Seongdong-gu, Seoul 133-791, Korea, [email protected]
Abstract. We present a new, efficient pointer protection method to defend against buffer overflow attacks. It uses a simple watermark to protect the pointer: when the pointer variable is dereferenced (written), a watermark is also written/updated, and before the pointer variable is referenced (read), the consistency of the watermark is verified. If the pointer's watermark does not exist or has been damaged, our scheme regards this as an intrusion and stops the execution. The proposed scheme has the following strong points. First, unlike other randomization methods, our scheme has no possibility of malfunction caused by the execution of arbitrary instructions. Second, we conducted various experiments on a prototype implementation, which showed that our scheme is as secure as the previous randomization schemes. Third, the experimental results showed that the performance degradation is not high. Fourth, unlike other randomization schemes, our scheme can support attack profiling. Keywords: system security, buffer overflow, randomization
1 Introduction In spite of countless methods designed to cope with buffer overflow vulnerabilities, new attacks such as format string attacks, heap overflows, and multiple-free errors continue to appear. Up to now, buffer overflows have remained a major cause of exploited vulnerabilities. In order to address this problem, new approaches have recently been developed: code randomization, address randomization, and pointer randomization. Among them, code randomization and address randomization have demerits such as significant performance degradation or partial defense coverage. PointGuard [3], which appeared at USENIX Security 2003, uses pointer encryption/decryption to provide both efficiency and wide defense coverage. However, its shortcoming is that, under attack, PointGuard may execute arbitrary instructions, which can cause the privileged process to malfunction. In this paper, we present a new, efficient pointer protection method to defend against buffer overflow attacks. It uses a simple watermark to protect the pointer: when the pointer variable is dereferenced, a watermark is also written/updated, and before the pointer variable is referenced, the consistency of the watermark is verified. If the pointer's watermark
does not exist or has been damaged, our scheme regards this as an intrusion and terminates the execution. The proposed scheme has the following strong points. First, unlike other randomization schemes such as RISE [1], [6], and PointGuard [3], our scheme has no possibility of malfunction caused by the execution of arbitrary instructions. Second, we conducted various experiments on a prototype implementation, which showed that our scheme is as secure as the previous randomization schemes. Third, the experimental results showed that the performance degradation is not high. Fourth, unlike other randomization schemes, our scheme can support attack profiling. Our scheme can be viewed as an approach to increase the trustworthiness as well as to enhance the security of the systems to be protected. The rest of this paper is organized as follows. In Section 2 we describe previous work. In Section 3, we explain PointGuard and our method. In Sections 4 and 5, we analyze the security and performance of our scheme. Finally, we offer some conclusions in Section 6.
2 Related Work To defend against buffer overflow attacks, there have been many research activities, such as imposing run-time type checking [5,8] or enhancing conventional programming languages [9]. As mentioned in the Introduction, in spite of these methods new attacks continue to appear and buffer overflows remain a major cause of exploited vulnerabilities. In this section, we deal with the recently proposed approach called "randomization". To the best of our knowledge, the first randomization scheme is ASLR (Address Space Layout Randomization), provided by the PaX project [11]. It randomly assigns the address of each memory object, such as program code or data. Since most attacks overflow a specific memory region to overwrite the adjacent memory objects, ASLR seems to work effectively. Moreover, the execution overhead is very low (only 10∼18% slower). Bhatkar et al. [2] enhanced this concept and made a new implementation that covers a wider range of defenses. However, ASLR and [2] cannot defend against GOT (Global Offset Table) attacks or function pointer attacks, as will be shown in Section 4. Moreover, they suffer from address fragmentation. Another approach is to randomize the instruction codes. The schemes in [1,6] encrypt each instruction in the program; when an instruction is to be fetched, it is decrypted and then executed. This approach relies on a CPU emulator, which results in significant performance degradation. Cowan et al. presented StackGuard, which checks the integrity of activation records in the stack [4]. It is very efficient, and its performance degradation is negligible. However, its defense coverage is relatively narrow: it can only defend against stack-related attacks. To the best of our knowledge, the recently proposed PointGuard [3] provides both wide defense coverage and efficiency. We explain PointGuard and discuss its demerits in Sections 3 and 4.
3 Proposed Method In Section 3.1, we explain PointGuard [3] and discuss some of its demerits. Then, we describe our method, which relies on watermarking, in Section 3.2. 3.1 PointGuard PointGuard exploits the property that buffer overflow attacks modify function/data pointers. It operates as follows: 1. Initialization: first, a symmetric key K is initialized to a random value, which is obtained by reading /dev/urandom. Then, K is stored in a protected page in the memory space. 2. Dereference of the pointer: whenever the pointer value P is to be initialized or modified, P is encrypted with K and the result is stored in memory. 3. Reference of the pointer: after reading the encrypted value from memory, it is deciphered using K. Then, the decrypted pointer value can be used. The main idea of PointGuard is that, by defending only the symmetric key K, attackers cannot guess the pointer values. Moreover, illegally modified pointer values become meaningless, since the decrypted addresses would have arbitrary values. Pointer accesses occur frequently, so efficiency is very important. This is why the authors use a single xor operation to implement encryption/decryption of the pointer. The PointGuard algorithm can be implemented in one of the following three ways [3]. First, the source code of the target process can be modified to support PointGuard. Second, the compiler can be modified to insert pointer encryption/decryption. Third, the loader can be modified. The authors of [3] modified the gcc compiler to implement a prototype version of PointGuard. 3.2 Proposed Scheme One of the demerits of instruction/pointer randomization methods is that, in the case of an attack, arbitrary code is executed. Considering that randomization methods are mainly applied to privileged processes, the execution of arbitrary code could result in a serious disaster, such as corrupting important configuration files or causing system crashes. The main idea of the proposed scheme is that we insert a watermark for each pointer to detect the modification of either the pointer or the watermark before the reference. 1. Initialization: first, symmetric keys K1 and K2 are initialized to random values obtained from /dev/urandom. Then, they are stored in a protected page in the memory space. 2. Dereference of the pointer: whenever the pointer value P is to be initialized or modified, P is encrypted with K1 and the result is stored in memory. P is also encrypted with another key K2, and this value is inserted at the address P.
Fig. 1. Dereference of the pointer

Fig. 2. Reference of the pointer
3. Reference of the pointer: after reading the encrypted value from memory, it is deciphered using K1 to produce P. Then, the watermark at the address P is decrypted using K2. If this value is identical to P, the decrypted pointer value (= P) can be safely used. Otherwise, the target process is terminated. As in PointGuard, we use a xor operation to encrypt/decrypt the pointer, since performance degradation should be minimized. In this case, P ⊕ K1 and P ⊕ K2 are stored in memory for the pointer P. If an attacker A reads these values, he can xor them to obtain K1 ⊕ K2. If A replaces them with C and C ⊕ K1 ⊕ K2, he successfully fools the above check routine. However, this attack is unlikely to occur, since P ⊕ K2 is stored at the address P: if A can read P ⊕ K2, this means that he already knows P. Since we assume that K1 and K2 cannot be revealed (as in PointGuard), A is unable to learn the value P, by the property of the xor operation. Another shortcoming of PointGuard is that it cannot protect all the pointers; more specifically, pointers invisible in the source code cannot be protected. Among them, we think that the most important pointers to be protected are the frame pointer and the return address in the stack, since numerous buffer overflow attacks modify them. By inserting the watermark and the checking routine for them, the defence coverage of the proposed scheme would be significantly extended.
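The dereference/reference pair can be illustrated with a hand-instrumented C sketch. This is only an illustration under simplifying assumptions: the keys are ordinary globals here instead of living in a protected page, the first word of every protected object is assumed to be reserved for the watermark (as in Fig. 1, with the data starting at the following word), and the store_ptr/load_ptr names are ours, not taken from the prototype.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    static uintptr_t K1, K2;   /* in the prototype these live in a protected page */

    /* Dereference (store): keep P xor K1 in the pointer slot and plant the
     * watermark P xor K2 in the word addressed by P itself.                     */
    static void store_ptr(uintptr_t *slot, void *p)
    {
        uintptr_t P = (uintptr_t)p;
        *slot = P ^ K1;
        *(uintptr_t *)P = P ^ K2;            /* watermark written at address P   */
    }

    /* Reference (load): decrypt the slot, then check that the watermark at the
     * decrypted address is consistent; otherwise treat it as an intrusion.      */
    static void *load_ptr(const uintptr_t *slot)
    {
        uintptr_t P = *slot ^ K1;
        if ((*(uintptr_t *)P ^ K2) != P) {
            fprintf(stderr, "pointer watermark damaged: aborting\n");
            abort();                         /* stop instead of running on       */
        }
        return (void *)P;
    }

An overflow that overwrites the encrypted slot decrypts to an address whose watermark check fails, so the process stops before any corrupted pointer is followed.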
4 Security Analysis As mentioned in Section 3, the proposed scheme can be implemented in one of the following three ways: source code modification, compiler modification, or loader modification. To simplify the implementation, we chose the first approach. We used the C language and modified all the pointer accesses in the source code to perform encryption/decryption and to check the watermarks.
To protect the frame pointer and the return address, we used a modified gcc that implements the xor random canary method of StackGuard [4]. Although the algorithm is different, the xor random canary method has the same objective as our scheme for protecting the frame pointer and return address in the stack. The test programs consist of the straw-man vulnerability example of Section 6 of [4] (overflowing a function pointer), the examples of Phrack 49 (typical buffer overflow and small buffer overflow) [10], and the examples of the Omega project (GOT attack) [7]. In all tests, the programs were safely terminated and we were unable to find any successful attack against the proposed scheme. Table 1 compares various buffer overflow defences. The compared schemes are PointGuard [3], PAX/ASLR [2], StackGuard [4], Bochs [6], and RISE [1].

Table 1. Comparison of buffer overflow defences

    Methods            Execution of      Defence against various attacks
                       arbitrary codes   Stack Smashing Attacks   Function Pointer Attacks
    PAX/ASLR [2]       Yes               Yes/No                   Yes/No
    StackGuard [4]     No                Yes                      No
    PointGuard [3]     Yes               No                       Yes
    RISE [1]           Yes               Yes/No                   Yes/No
    Bochs [6]          Yes               Yes/No                   Yes/No
    Proposed Scheme    No                Yes                      Yes
5 Performance Analysis To evaluate the performance of the proposed scheme, we conducted the following three tests:
1. Read: repeatedly reference the pointers in the pointer arrays.
2. Write: dereference the pointers by storing values in the pointer arrays.
3. Read/Write: repeatedly read a pointer value and then store it in another place.
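For reference, the kind of loop we timed can be reproduced with a few lines of C. The array size, iteration count and use of clock() below are illustrative choices, not the exact harness used for the measurements.

    #include <stdio.h>
    #include <time.h>

    #define N_PTRS 1024
    #define N_OPS  10000000L                 /* 10^7 pointer operations */

    static int  values[N_PTRS];
    static int *ptrs[N_PTRS];                /* pointer array under test */
    static int *copies[N_PTRS];

    int main(void)
    {
        for (int i = 0; i < N_PTRS; i++)
            ptrs[i] = &values[i];

        volatile int sink = 0;
        clock_t t0 = clock();
        for (long k = 0; k < N_OPS; k++)     /* Read: reference the pointers */
            sink += *ptrs[k % N_PTRS];
        double read_s = (double)(clock() - t0) / CLOCKS_PER_SEC;

        t0 = clock();
        for (long k = 0; k < N_OPS; k++)     /* Read/Write: read a pointer and
                                                store it in another place     */
            copies[k % N_PTRS] = ptrs[k % N_PTRS];
        double rw_s = (double)(clock() - t0) / CLOCKS_PER_SEC;

        printf("read %.1fs  read/write %.1fs  (%d)\n", read_s, rw_s, sink);
        return 0;
    }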
6 Conclusion In this paper we presented a new efficient pointer protection method to defend buffer overflow attacks. It uses a simple watermark to protect the pointer: during dereferencing the pointer variable, a watermark is also written/updated and before referencing
Table 2. Comparison of elapsed time (in seconds)

    Test         Unmodified source code   Proposed scheme
    Read         17.9                     34.1
    Write        25.6                     42.6
    Read/Write   28.9                     46.3
the pointer variable, the consistency of the watermark is verified. If the pointer's watermark does not exist or has been damaged, our scheme regards this as an intrusion and stops the execution. The proposed scheme has the following strong points. First, unlike other randomization methods, our scheme has no possibility of malfunction caused by the execution of arbitrary instructions. Second, we conducted various experiments on a prototype implementation, which showed that our scheme is as secure as the previous randomization schemes. Third, the experimental results showed that the performance degradation is not high. Fourth, unlike other randomization schemes, our scheme can support attack profiling. Our scheme can be viewed as an approach to increase the trustworthiness as well as to enhance the security of the systems to be protected.
References 1. Gabriela Barrantes, David H. Ackley, Stephanie Forrest, Trek S. Palmer, Darko Stefanovic, and Dino Dai Zovi. Randomized Instruction Set Emulation to Disrupt Binary Code Injection Attacks. In 10th ACM Conference on Computer and Communication Security, pages 281–289, October 2003. 2. Sandeep Bhatkar, Daniel C. DuVarney, and R. Sekar. Address Obfuscation: an Efficient Approach to Combat a Broad Range of Memory Error Exploits. In 12th USENIX Security Symposium, pages 105–120, August 2003. 3. Crispin Cowan, Steve Beattie, John Johansen, and Perry Wagle. PointGuard: Protecting Pointers From Buffer Overflow Vulnerabilities. In 12th USENIX Security Symposium, pages 91–104, August 2003. 4. Crispin Cowan, Calton Pu, Dave Maier, Jonathan Walpole, Peat Bakke, and Steve Beattie. StackGuard: Automatic Adaptive Detection and Prevention of Buffer-overflow Attacks. In 7th USENIX Security Symposium, pages 63–78, January 1998. 5. Richard Jones and Paul Kelly. Bounds Checking for C. available at http://wwwala.doc.ic.ac.u/˜phjk/BoundsChecking.html, July 1995. 6. Gaurav S. Kc, Angelos D. Keromytis, and Vassilis Prevelakis. Countering Code-Injection Attacks With Instruction-Set Randomization. In 10th ACM Conference on Computer and Communication Security, pages 272–280, October 2003. 7. Lamagra. Project OMEGA. available at http://ouah.kernsh.org/omega1lam.txt. 8. Greg McGary. Bounds Checking for C and C++ Using Bounded Pointers. available at http://gcc.gnu.org/projects/bp/main.html, 2000. 9. George C. Necula, Scott McPeak, and Westley Weimer. CCured: Type-Safe Retrofitting of Legacy Code. In 29th ACM Symposium on Principles of Programming Languages (POPL02), 2002. 10. Aleph One. Smashing The Stack For Fun And Profit. Phrack 49, File 14 of 16, 1996. 11. PaX team. The PaX Project. available at http://pageexec.virtualave.net, 2001.
Parallel Hierarchical Radiosity: The PIT Approach
Fabrizio Baiardi, Paolo Mori, and Laura Ricci
Dipartimento di Informatica, Università di Pisa, Via F. Buonarroti, 56125 Pisa, Italy, {ricci,baiardi,mori}@di.unipi.it
Abstract. Radiosity is a method to compute the global illumination of a scene. To reduce its complexity, hierarchical radiosity decomposes the scene into a hierarchy of patches and computes the light exchanged between patches at different levels, according to their distance and/or to the amount of light they emit. A distributed memory implementation of this method has been developed through PIT, a problem independent library that supports hierarchical applications on distributed memory architectures. PIT functions exploit a distributed version of the tree representing the hierarchical decomposition.
1 Introduction Radiosity [5] defines the global illumination of a scene by decomposing objects into sets of patches and by computing the form factor of each pair of patches, i.e. the fraction of the light they exchange. To reduce the overall complexity, a hierarchical approach exploits hierarchical patches and computes the light exchanged between two patches at distinct levels, according to their distance and/or to the amount of light they emit. One of the problems posed by a parallel implementation of this method is the irregular and dynamic distribution of the patches in the scene. Furthermore, the computational loads of the patches may be very different because the light exchange is computed at distinct levels. This paper evaluates a parallel version of a hierarchical radiosity method developed through PIT [3,1,2], a problem-independent programming library to support the development of irregular hierarchical applications on distributed memory architectures. Its key assumption is that both the sequential application and the parallel one can be built around a tree that represents the distribution of elements in the domain. The definition of element is problem dependent and it constitutes the elementary component of the considered problem. As an example, in the case of radiosity an element is one patch, while in the case of the Barnes-Hut method it is a body. The interactions among elements are computed by applying proper operators to this tree. In the parallel version, the tree is mapped into the local memories of the processing nodes, pnodes. A new data type, the Parallel Irregular Tree, describes this mapping and defines the operations to access information mapped onto other pnodes, to insert new nodes into the tree, and to balance the computational load. To focus on the adoption of PIT to develop hierarchical radiosity on a distributed memory architecture, we have considered scenes in flatland, i.e. a bi-dimensional world. Sect. 2 describes the main functions of PIT. The sequential hierarchical approach to radiosity in flatland is revised in Sect. 3. Sect.
4 describes the parallelization of the method by PIT. Experimental results are discussed in Sect. 5.
2 The PIT Library Parallel Irregular Tree, PIT, is a library for the parallelization of irregular problems on distributed memory architectures. The library adopts a hierarchical method to solve irregular problems, and it is based upon the assumption that both the sequential and the parallel versions of a hierarchical method may be structured in terms of operations on a hierarchical tree, the Htree. The Htree represents both the subdomains resulting from the hierarchical domain decomposition and the relations among them and any element they include. Any hierarchical method to solve irregular problems iteratively visits the Htree and updates the properties of its elements. For each hnode hn, the method updates the properties of element(hn), the element paired with hn, after collecting the neighbors of element(hn). The operations to update the properties of an element depend upon the nature of the problem to be solved, the target problem, and are denoted target problem operators, or simply operators. A goal of PIT is to preserve the code of these operators even though, while in a sequential application the Htree is recorded in just one memory, in a distributed memory architecture the Htree is distributed across the local memories of all the pnodes. The functions of PIT support the application of these operators in spite of the mapping onto distinct memories. They are defined in terms of the Parallel Hierarchical Tree, PITree, the distributed version of the Htree. A PITree P(H) is a tuple (h0, ..., hnp−1, mht), where np is the number of pnodes, that describes the mapping of an Htree H. Each hi is the private Htree of Pi, the i-th pnode, i.e. the subset of all and only the hnodes of H that represent domains or elements mapped into the local memory of Pi. mht, instead, is the mapping Htree that represents the hierarchical relations among h0, ..., hnp−1. mht includes the root R of H and all the hnodes on the paths from R to the root of any hj. Each hnode of mht corresponds to an hnode hp of a private Htree hj and records j, the identifier of the pnode where hp has been mapped. mht is the minimal amount of information to be replicated in all the pnodes to enable the PIT functions to determine where any hnode of H has been mapped. In the following, we focus on the functions of PIT used to define the parallel version of hierarchical radiosity methods. PIT is fully described in [3], together with the algorithm to map the Htree onto the pnodes. PITree_creation is the function that partitions the initial set of elements among the pnodes and builds the PITree, i.e. both the private and the mapping Htrees. This function returns to each pnode a reference to the root of its private Htree. Then, this reference is transmitted to the PIT functions that support the application of the target problem operators. To reduce the communication among the pnodes, the strategy to partition the elements takes into account the locality property. Since PIT is independent of the target problem and, in particular, of the element properties, the user has to define both a structure to record these properties and three functions, include_el, decompose_el and remove_el, to handle the elements. These functions are exploited by PITree_creation to properly partition the elements. Each pnode Pi applies the target problem operators to each hnode in its private Htree hi, in an SPMD style. However, hi does not include all the elements the operator requires,
because some neighbours of an hnode hn of hi, the remote neighbours of hn, may have been mapped onto distinct pnodes. PITree_completion is the PIT function that collects all the elements required by an operator and creates, for each private Htree hi, the essential Htree ehi. ehi is the union of hi with all the remote neighbours of any element in hi. In the general case, PITree_completion is invoked before the execution of each target problem operator, so that the same operator of the sequential version of the application can be correctly applied to ehi ∩ hi. The parameters of this function are the root of the private Htree and a user-defined function stencil to compute the neighborhood stencil, i.e. the set of neighbors of each element. The stencil function depends upon the target problem operator, so that distinct operators of the same problem may have distinct neighborhood stencils. To reduce the communication overhead, PITree_completion implements a fault prevention strategy [1], where each pnode Pk determines, through the neighborhood stencil, which of its elements are required by Ph, ∀h ≠ k, and sends these elements to Ph without an explicit request from Ph to Pk. PITree_balance updates the mapping of the PITree onto the pnodes to balance the computational load, and it is invoked anytime the distribution of the elements in the domain is updated. Hence, in the simplest case, it is invoked after the execution of each operator. Its input parameters are a reference to the roots of the private Htrees and a threshold to prevent the recovery of very low unbalances. It returns a reference to the root of the updated private Htree.
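As an illustration only, the per-pnode view of a PITree described above can be pictured with the C declarations below; the real library may organize its nodes differently, and the prototypes merely paraphrase the calls that appear later in Fig. 2, so all field and parameter names here are assumptions.

    /* One node of an Htree: either a subdomain (space) or an element.            */
    typedef struct hnode {
        void          *element;       /* problem-dependent element, NULL for spaces */
        struct hnode **children;
        int            n_children;
        int            owner;         /* pnode identifier recorded in the mht       */
    } hnode;

    /* The PITree as seen by one pnode: its private Htree h_i plus the replicated
     * mapping Htree mht that locates every other pnode's subtrees.                */
    typedef struct {
        hnode *private_root;          /* root of h_i                                */
        hnode *mapping_root;          /* root of mht (paths from R to each h_j)     */
    } PITree;

    /* Indicative prototypes of the three library calls used in Fig. 2.            */
    hnode *PITree_creation(void *elements,
                           void (*include_el)(hnode *, void *),
                           void (*decompose_el)(hnode *),
                           void (*remove_el)(hnode *, void *));
    void   PITree_completion(void (*stencil)(hnode *), int level);
    hnode *PITree_balance(hnode *private_root);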
3 Hierarchical Radiosity in Flatland Hierarchical radiosity has its roots in the computation of the heat transfer between surfaces in [9]. It has been applied in computer graphics to compute the global illumination of unoccluded environments in a scene in [6] and to compute the illumination of scenes with occlusions in [4]. Hierarchical radiosity [7] discretizes the objects of the scene through small elements, denoted as patches, whose brightness is assumed to be constant. The radiosity Bi of a patch pi depends upon those of all the other patches pj according to the equation $B_i A_i = E_i A_i + \rho_i \sum_{j=0}^{n-1} B_j A_j F_{j,i}$, where Ai is the area of pi, Ei is its emissivity, i.e. the light emitted by pi, ρi is its reflectance, and Fj,i, the form factor from pj to pi, is the percentage of light leaving the patch pj that reaches pi. In flatland [8,10], the bidimensional world we consider, a scene is a set of polygons. Each polygon is decomposed into a set of segments, each with two sides. Only one side interacts with the other polygons, because the other one is internal to the polygon. Let us consider a pair of segments Si and Sj. A popular and efficient formula for the form factor Fi,j is the string rule [9], defined in terms of the strings stretched from the endpoints of Si to the corresponding endpoints of Sj, i.e. the uncrossed strings, and those that link up the opposite endpoints, i.e. the crossed strings. Fi,j is proportional to the difference between the total length of the crossed strings and that of the uncrossed ones. If some parts of Si and Sj are not mutually visible, the strings are "stretched" around the obstruction, so that the string rule can consider an arbitrary number of occluding segments.
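For an unoccluded pair of segments, a common closed form of this crossed-strings rule (with the endpoint labels below being our own notation) is

$$ F_{i,j} = \frac{\bigl(|a_i b_j| + |b_i a_j|\bigr) - \bigl(|a_i a_j| + |b_i b_j|\bigr)}{2\,L_i}, $$

where $a_i, b_i$ and $a_j, b_j$ are the endpoints of $S_i$ and $S_j$ (with $a_j$ corresponding to $a_i$), the first parenthesis sums the lengths of the crossed strings, the second those of the uncrossed ones, and $L_i$ is the length of $S_i$; in the occluded case each straight string is replaced by the shortest path stretched around the blockers.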
Hierarchical radiosity recursively partitions each segment into a hierarchy of patches and computes the interactions among patches, so that a segment interacts with the other ones at distinct levels. Approximated interactions are computed among low-level patches, while accurate interactions are computed among high-level ones. An interaction list paired with each patch pi records the interacting patches of the other segments. By pairing each element of this list with the list of the related occlusions, we produce the visibility list, which includes the information needed to compute the radiosity of the corresponding patch. Hierarchical radiosity computes a sequence of approximations starting from a domain that includes all the segments of the polygons in the scene. The algorithm computes the visibility list of each segment and sets the initial radiosity of each segment to its emissivity. Then, an iterative procedure computes the global radiosity of each segment. Each iteration applies the functions Gather, PushPull and RefineLink. For each segment Si, Gather collects the light emitted by the other segments that reaches Si, according to the current decomposition. The function is recursively applied to Si and to any patch produced by the decomposition of Si, and it applies the string rule to compute the light contributions received from any patch in the visibility list. For each segment Si, PushPull redistributes among the patches of Si the light contributions received at all the levels of the hierarchy, as computed by Gather. PushPull includes two steps. The Push step pushes the radiosity of all the patches of Si down the hierarchy, from the root patch that represents the segment Si to the leaf ones: the radiosity of a patch pi is updated by the current value of the radiosity of the patch that pi belongs to. The Pull step pulls the radiosity of each patch of Si up the hierarchy, from the leaf patches to the root one: the radiosity of pi is the weighted average of those of all the subpatches of pi. PushPull also computes the overall radiosity of the scene, and the method terminates if the difference between the current value and the one at the previous iteration is smaller than a user-defined threshold. RefineLink partitions a patch pi when its interactions have to be computed with better accuracy, i.e. when its radiosity is larger than a predefined threshold. The partition is defined according to the geometric properties of pi. RefineLink updates the visibility lists of pi and of each patch pj in the interaction list of pi.
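The Push/Pull pass can be summarized by the short sequential sketch below. The field names (gathered, radiosity, area) and the exact bookkeeping are illustrative assumptions based on the description above, not on the actual library sources.

    typedef struct patch {
        double         radiosity;   /* current radiosity of the patch            */
        double         gathered;    /* contributions collected by Gather here    */
        double         area;        /* length of the patch in flatland           */
        struct patch **sub;         /* subpatches created by RefineLink          */
        int            n_sub;
    } patch;

    /* Push the contributions received at every level down to the leaves, then
     * pull back up the area-weighted average of the subpatches, so that all
     * levels of the segment agree on its radiosity. Returns the pulled value.   */
    static double push_pull(patch *p, double down)
    {
        down += p->gathered;                      /* Push step                   */
        if (p->n_sub == 0) {
            p->radiosity = down;                  /* leaf patch                  */
        } else {
            double up = 0.0;
            for (int k = 0; k < p->n_sub; k++)    /* Pull step                   */
                up += push_pull(p->sub[k], down) * p->sub[k]->area;
            p->radiosity = up / p->area;
        }
        return p->radiosity;
    }

Calling push_pull on the root patch of a segment after Gather leaves every patch of that segment with a consistent radiosity; the parallel version in Sect. 4 replaces this depth-first recursion with level-by-level Push and Pull phases.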
4 Parallelizing Hierarchical Radiosity: The PIT Approach In general, radiosity methods represent distinct objects of the domain by distinct trees, where the nodes of the tree represent the patches. In our approach, instead, a single Htree represents the whole domain and includes all the objects. The hnodes represent the spaces that include the patches. Fig. 1 shows the Htree and the domain it represents. The scene includes two segments, S0 and S1, partitioned into patches. The white hnodes of the Htree represent the spaces of the domain, while the black ones represent the elements, i.e. the patches. A node corresponding to a patch stores the patch properties, i.e. the coordinates of the segment, its orientation with respect to the polygon that includes it, the emissivity, the reflectance and the current radiosity. The initial domain is partitioned until each space includes one element only. For instance, the segment S1 of Fig. 1 has been partitioned into two segments, p0 and p1, at level l = 1, because the smallest space that includes it, that is the space that represents
the whole domain, also includes the segment S0.

Fig. 1. Domain decomposition and PITree

This agrees with the philosophy of this method, which partitions a segment Si before the computation if another segment Sj belongs to the smallest space including Si. In this case, Si and Sj are close in the domain, and they will be decomposed anyway by the refinement function to compute their interactions. Fig. 1 shows a simple domain decomposition and the related PITree. Fig. 2 describes the parallel version of the hierarchical radiosity method developed through the PIT library by inserting the proper PIT functions into the sequential code. The first PIT function invoked by the parallel code is PITree_creation, which, starting from the initial set of elements of the domain, creates the PITree and returns to each pnode a reference to its private Htree. As an example, in Fig. 1 the segment S0 and all its patches have been mapped onto the pnode P0, while S1 and all its patches have been mapped onto P1. Before the first iteration, PITree_completion gathers the information to compute the visibility list of each segment. In this case, the neighborhood stencil of a segment Si, i.e. vis_stencil, includes all the spaces that are visible from the external side of Si. Since the radiosity of a patch pi depends upon those of other patches in the scene, and some of the patches that pnode Pa needs to compute the radiosity of pi may have been mapped onto another pnode Pb, a ≠ b, PITree_completion is applied before each operator that requires the remote neighbours of a patch. Gather and PushPull compute the radiosity of a local patch in terms of that of local and of remote ones, while RefineLink requires local values only. However, if a patch and all the subpatches it has been recursively partitioned into are mapped onto the same pnode, PushPull also requires local values only. The patches that Gather needs to compute the contributions to the radiosity of pi are a function of the visibility list of pi. This list includes the references to all the patches that interact with pi and to all those that occlude these interactions. To collect the remote neighbors Gather requires, the fault prevention strategy can be applied. In the parallel version of the method the two steps of PushPull are implemented by two distinct procedures, Push and Pull. The neighborhood stencil of PushPull is very simple because, in the Push step, the stencil of pi includes only the patches including pi. In the Pull step the stencil of pi includes any patch resulting from the decomposition of pi. Furthermore, both procedures are implemented through a breadth-first visit of the tree. The adoption of an anticipated visit is not convenient because the execution of the Push
step and/or the Pull one on the objects whose patches are not mapped onto the same pnode strongly serializes the overall computation.

    hierarchical_radiosity_method(element_list *scene) {
        pht_root = PITree_creation(scene, include_el, decompose_el, remove_el);
        PITree_completion(vis_stencil, all_levels);
        Visib_list_determ_H(pht_root);
        while (not end) {
            PITree_completion(interaction_list, all_levels);
            Gather(pht_root)
            for level from L_min to L_max
                PITree_completion(push_stencil, level);
                Push(pht_root, level)
            for level from L_max to L_min
                PITree_completion(pull_stencil, level);
                Pull(pht_root, level)
            end = RefineLink_H(pht_root);
            pht_root = PITree_balance(pht_root);
        }
    }

Fig. 2. Hierarchical Radiosity PIT Parallel Code
RefineLink inserts into the Htree a new hnode for each new patch of the scene. Since the new patches created by a pnode Pi partition the patches already mapped onto Pi, they are mapped onto Pi as well. This is coherent with the adopted methodology because, if the PIT strategy maps a space A onto a pnode Pi, it then maps onto Pi all the subspaces of A before the other spaces of the domain. To determine the computational load of each element, we observe that most of the computation is devoted to computing the form factors. Hence, l(pi), the load of a patch pi, can be estimated in terms of the number of interactions executed by pi, which is proportional to the size of the visibility list of pi. RefineLink updates the load of the pnodes, because it partitions the existing patches and the new patches generate new interactions. The partitioning of pi affects both the visibility list of pi and that of any patch pj that interacts with pi. This decreases l(pi), because the interaction with pj has been removed from its visibility list, and it increases l(pj), because the interaction with pi has been replaced by those with the subpatches of pi. Moreover, new elements, i.e. the subpatches of pi, have been added to the domain. The initial load of these elements depends upon the visibility list they inherit from pi. PITree_balance is invoked after RefineLink to recover any unbalance in the computational load.
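As a concrete illustration of this bookkeeping (with our own function names and under the assumption that the load of a patch is simply the length of its visibility list), the rebalancing decision can be written as:

    /* Load of a patch, estimated as the number of interactions it performs,
     * i.e. the size of its visibility list.                                     */
    static double patch_load(int visibility_list_length)
    {
        return (double)visibility_list_length;
    }

    /* After RefineLink, PITree_balance only recovers unbalances that exceed a
     * given fraction of the average pnode load (e.g. threshold = 0.2 for the
     * 20% setting used in Sect. 5).                                             */
    static int needs_rebalancing(const double *pnode_load, int np, double threshold)
    {
        double total = 0.0, max = 0.0;
        for (int i = 0; i < np; i++) {
            total += pnode_load[i];
            if (pnode_load[i] > max)
                max = pnode_load[i];
        }
        double avg = total / np;
        return (max - avg) > threshold * avg;
    }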
5 Experimental Results To evaluate both the effectiveness and the performance of the PIT approach, this section presents some experimental results for the PIT parallel version of the hierarchical radiosity
method. The first scene we consider is very simple and it is shown in Fig. 3. Initially it includes 224 segments that compose 48 polygons.
Fig. 3. Test scene: 224 segments
After only 5 iterations of the method, the total number of patches in the scene is about 3500% of the initial one. Fig. 4 shows the subdomain assigned to pnode 0 after 5 iterations in the case of an execution on 10 pnodes.
Fig. 4. Subdomain of pnode 0 after 5 iterations
The hierarchical decomposition of the domain is highly irregular, because some segments, or parts of them, have been decomposed more deeply than others. We recall that this decomposition depends upon both the geometrical features of the scene and the exchange of light among the segments. The first parallel architecture we consider in our experiments is a cluster of 10 workstations. Each workstation is a PC with an Intel Pentium II CPU (266 MHz) and 256 Mbytes of local memory, and the interconnection network is a switched 100 Mbit Fast Ethernet.
Fig. 5. Completion time with alternative load balancing thresholds
The first set of experimental results concerns the importance of load balancing for achieving good efficiency. Fig. 5 shows the completion time of 30 iterations of hierarchical radiosity applied to the scene of Fig. 3 for different values of the load balance threshold. We can notice that the lowest completion time is achieved by adopting a balance threshold of 20%; in this case, the largest unbalance tolerated by the parallel application is 20% of the average load of the pnodes. Moreover, the figure shows that, by adopting a load balance threshold of 100%, the completion time is 116% of the optimal one. Hence, considerable performance improvements can be achieved by updating the data mapping at run time to recover load unbalances. Moreover, this also confirms the effectiveness of the PIT approach. The next set of experimental results evaluates the performance of the parallel implementation. Fig. 6 shows the efficiency of the hierarchical radiosity method for a variable number of pnodes. The scene considered in this experiment is that of Fig. 3, and 30 iterations have been executed.
Fig. 6. Efficiency for the Hierarchical Radiosity: 224 segments
The figure shows that, even on 10 pnodes, our implementation achieves an efficiency close to 60%. The efficiency improves for a lower number of pnodes: for instance, it is about 80% in the case of 4 pnodes. A simple way to increase the efficiency is to consider larger scenes. Fig. 7 shows the other scene considered in our experiments, which includes 896 segments defining 192 polygons.
Fig. 7. Test scene: 896 segments

Fig. 8. Efficiency for Hierarchical Radiosity: 896 segments
Fig. 8 shows the performance of the PIT hierarchical radiosity method on two architectures. The left side of the figure shows the experiments executed on the architecture previously described, resulting in an efficiency larger than 80% on 10 pnodes. Such a good efficiency is mostly due to the high locality of the scene. In fact, the large number of polygons in the scene results in a large number of occlusions among segments. In turn, this implies that each segment mainly interacts with close segments. The mapping strategy of PIT exploits this locality to minimize the communication overhead. We have also considered an IBM Linux Cluster including 64 nodes, each with 2 Intel Pentium III CPUs (1.133 GHz) and 1 Gbyte of local memory. The interconnection network is a Myricom LAN "C" version. The right side of Fig. 8 shows the performance on this architecture. These results confirm the previous ones, because the efficiency is larger than 70% even on 32 pnodes.
References 1. F. Baiardi, S. Chiti, P. Mori, and L. Ricci. Parallelization of irregular problems based on hierarchical domain representation. Proceedings of HPCN 2000: Lecture Notes in Computer Science, 1823:71–80, May 2000.
2. F. Baiardi, S. Chiti, P. Mori, and L. Ricci. Integrating load balancing and locality in the parallelization of irregular problems. Future Generation Computer Systems: Elsevier Science, 17:969–975, June 2001. 3. F. Baiardi, P. Mori, and L. Ricci. Pit: A library for the parallelization of irregular problems. Proceedings of the 6th International Conference on Applied Parallel Computing (PARA 2002): Lecture Notes in Computer Science, 2367:185–194, 2002. 4. M.F. Cohen and D.P. Greenberg. The hemi-cube: a radiosity approach for complex environments. Computer Graphics, 19(3):31–40, 1985. 5. F.X. Sillion and C. Puech. Radiosity and Global Illumination. Kaufmann Publ., 1994. 6. C.M. Goral, K.E. Torrance, D.P. Greenberg, and B. Battaile. Modelling the interaction of light between diffuse surfaces. Computer Graphics, 18(3):213–222, 1984. 7. P. Hanrahan, D. Salzman, and L. Aupperle. A rapid hierarchical radiosity algorithm. Computer Graphics, 25(4):197–206, 1991. 8. P.S. Heckbert. Radiosity in flatland. Eurographics 92, 11(3):181–192, 1992. 9. H.C. Hottel. Radiant heat transmission, chapter 4. McGraw Hill, 1954. 10. R. Orti, S. Rivière, F. Durand, and C. Puech. Radiosity for dynamic scenes in flatland with the visibility complex. Computer Graphics Forum, 15(3):237–248, 1996.
Optimizing Locationing of Multiple Masters for Master-Worker Grid Applications
Cyril Banino
Norwegian University of Science and Technology (NTNU), Department of Computer and Information Science (IDI), Sem Sælandsvei 7-9, NO-7491 Trondheim, Norway, [email protected]
Abstract. The problem of allocating a large number of independent tasks to a heterogeneous computing platform is considered. A non oriented graph is used to model a Grid, where resources can have different speeds of computation and communication. We claim that the use of multiple masters is necessary to achieve good performance on large-scale platforms. The problem considered is to find the most profitable master locations in order to optimize the platform throughput.
1 Introduction A recent trend in high performance computing is to deploy computing platforms that span large networks in order to harness geographically distributed computing resources. The aim is often to provide computing power to applications at unprecedented scale. Good candidates for such environments are Master-Worker applications, composed of a large number of computational tasks independent from each other, i.e. where no inter-task communications take place and where the tasks can be computed in any order. Many applications have been and can be implemented under the Master-Worker paradigm. They include the processing of large measurement data sets, as in the SETI@home project [1], biological sequence comparisons [2], and distributed problems organized by companies like Entropia [3]. The Master-Worker paradigm consists in the execution of independent tasks by a set of processors, called workers, under the supervision of a particular processor, the master. The master initially holds all the tasks and sends them out to the workers over a network. Workers compute their tasks and send the results of the computation back to the master. This scheduling problem is well recognized, and several studies [4,5] have recently revisited the Master-Worker paradigm for clusters and Grids. Following previous studies [4,5], the performance measure adopted in this paper is the steady-state throughput of the platform, i.e. the number of tasks computed per time unit. This work does not directly address the dynamically changing nature of large-scale computing platforms. Although this is an important consideration, this paper nonetheless suggests a simple but mandatory departure from traditional implementations: the need to deploy several masters to efficiently utilize currently emerging large-scale platforms. Multiple masters are needed since the centralization of the tasks in one single place not only makes the system more vulnerable to failures [6], but also clearly limits the
scalability of the system. In this paper, we propose to circumvent the master bottleneck by introducing multiple masters. The problem becomes to determine how many masters should be used, and where these masters should be located on the platform graph in order to achieve better performance. As a consequence, a discrete network location problem arises, and this paper contributes insights on the problem's difficulty within static settings. Static knowledge cannot be underestimated, as, after all, a dynamic context may often be viewed as a succession of static contexts. This work hence provides an important first step for efficient deployments of large Master-Worker applications on computational Grids. Growing the number of control resources to unblock the serial bottleneck has been successful in many different domains: multi-mastering is employed to allow simultaneous bus transfers between peripherals within a single system, multi-server architectures are used for massively multi-player network games, and replication in Data Grid environments ensures efficient access to widely distributed data [7]. Closely related to our work is the hierarchical Master-Worker implementation proposed by Aida et al. [8], where a supervisor controls multiple processor sets, each of which is composed of a master and several workers. The supervisor achieves load balancing by migrating tasks among masters. However, the work of Aida et al. does not take into consideration the topology of the platform, an essential aspect for large-scale Grids that we address in this paper. To the best of our knowledge, Shao et al. [5] were the first to consider resource selection problems within the steady-state Master-Worker scheduling framework. The authors aim at selecting performance-efficient hosts for both the master and worker nodes. To that end, an exhaustive search is performed, consisting in solving n network-flow problems, where n is the number of processors composing the platform. Then the configuration that achieved the highest throughput is selected. Unfortunately, this approach is not applicable with several masters. There are indeed $\binom{n}{p}$ possible master location sets, where p is the number of masters to be located on the platform. For this reason, we cannot simply compute the best scheduling strategy for each set and then select the best result. As an example, for n = 50 and p = 10, a problem that is not large, the resulting number of possibilities is 10,272,278,170. Clearly, even for moderate values of n and p, such an enumeration is not realistic, and we need more advanced techniques. We introduce a cost model to establish and operate masters on the platform graph. Given a general interconnection graph, we consider the problem of selecting a master location set that optimizes the throughput of the platform within a budget constraint B. We term this problem the B-COVER problem. The rest of this paper is organized as follows: Related work is exposed in Section 2. Our model of computation and communication is introduced in Section 3. The B-COVER problem is formally stated and shown to be NP-hard in Section 4. We give a simple heuristic in Section 5, and present our experimental results in Section 6. Finally, conclusions and future work are discussed in Section 7. The extended version of this paper [9] introduces another discrete location problem, whose goal is to utilize all the computing resources of the platform at minimal cost.
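The count quoted above is just a binomial coefficient: $\binom{n}{p} = \frac{n!}{p!\,(n-p)!}$, and indeed $\binom{50}{10} = 10{,}272{,}278{,}170$.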
Possible extensions of our model, as well as a wide set of simulation results are also presented in [9].
2 Related Problems The B-COVER problem is related to the Facility Location class of problems [10,11]. A classic facility location problem is a spatial resource allocation problem in which one or more service facilities have to be located to serve a geographically distributed set of population demands according to some objective function. The term facility is used in its broadest sense, as it is meant to include entities such as air and maritime ports, factories, warehouses, schools, hospitals, subway stations, satellites, etc. [10]. In our case, we can assimilate the facilities to the master processors, the service to the input files, and the set of demands to the computing power of the worker processors. Of particular interest, because they are close in spirit to the B-COVER problem, are the Maximal Covering Location Problem [12], which addresses planning situations with an upper limit P on the number of facilities to be located, and the Fixed Charge Location Problem [10], which introduces capacity as well as cost constraints on the facilities to be located. However, the B-COVER problem slightly differs from traditional facility location problems since the tasks allocated to workers produce some results that must be gathered at master location sites. In fact, the B-COVER problem can be expressed as a Supply Chain Management problem [13]. A supply chain can be defined as a network of facilities that manufactures finished products and distributes these products to customers. Supply chain management involves deciding (i) where to produce, what to produce, and how much to produce at each site, and (ii) where to locate plants and distribution centers [13]. In our case, we can assimilate the manufactured products to output files and the plants to worker processors. The master processors play two roles: supply raw material (i.e. input files) to the plants, and then collect and distribute the final products to the user.
3 Multiple Master Model Our model builds on the model presented by Banino et al. [14], but differs in that we here separate the link bandwidths from the communication times between processors, as well as by differentiating input files from output files. We also introduce a model for the master location selection. The target architectural framework is represented by a graph G = (V, E), as illustrated in Fig. 1. Each vertex Pi ∈ V represents a computing resource of weight wi, meaning that processor Pi requires wi units of time to process one task. Each edge ei,j : Pi → Pj represents a communication resource with a bandwidth equal to γi,j, which limits the amount of data that can be transferred on link ei,j per time unit in both directions. The target application tasks are modeled as requiring an input data file of size βI and producing an output data file of size βO. It takes ci,j time units to transfer one input file from processor Pi to a neighbor processor Pj, and c'i,j time units to transfer one output file from Pi to Pj. The communication times ci,j and cj,i can possibly differ, due to, say, different I/O hardware devices of processors Pi and Pj; the same holds for c'i,j and c'j,i. Computations and communications are supposed to be atomic, i.e., these operations, once initiated, cannot be preempted. All wi are assumed to be positive rational numbers since they represent the processor computing times.
Fig. 1. Grid graph. Vertices and edges stand for processors and communication links
We disallow wi = 0, since it would permit processor Pi to perform an infinite number of tasks, but we allow wi = +∞; then Pi has no computing power, but can still forward tasks to other processors (e.g., to model a switch). Similarly, we assume that all ci,j and c'i,j are positive rational numbers since they correspond to communication times between two processors. The full-overlap, single-port model [14] is adopted in order to represent the operation mode of the processors. Under this model, a processor can simultaneously receive data from one of its neighbors, perform some (independent) computation, and send data to one of its neighbors. At any given time-step, a processor may open only two connections, one in emission and one in reception. Stating the communication model more precisely: if Pi sends an input file to Pj at time-step t, then (i) Pj cannot start executing the associated task, or forwarding this input file, before time-step t + ci,j; (ii) Pj cannot initiate a new receive operation before time-step t + ci,j (but it can perform a send operation and independent computation); and (iii) Pi cannot initiate another send operation before time-step t + ci,j (but it can perform a receive operation and independent computation). The same model holds for the communication of output files. Finally, let Jm ⊆ V denote the index set of the processors eligible to be chosen as a master, and for each processor Pi ∈ Jm, let xi ∈ {0, 1} be the decision variable for placing a master at location Pi, i.e., xi = 1 if Pi is chosen as a master, and xi = 0 otherwise. Let fi be the fixed cost of establishing a master at location Pi, and ti the per-task cost of operating a master at location Pi. All fi and ti are assumed to be positive constants since, from a practical viewpoint, negative costs for establishing master locations would make little sense.
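The quantities introduced in this section can be collected in a small set of data structures. The sketch below (Python, purely illustrative; the field names are ours, not the paper's) mirrors the model one-to-one:

```python
from dataclasses import dataclass, field

@dataclass
class Platform:
    """Platform graph G = (V, E) together with the cost parameters of the model."""
    vertices: list                 # processor ids P_i
    edges: list                    # undirected links as (i, j) pairs
    w: dict                        # w[i]: time for P_i to process one task (may be float('inf'))
    gamma: dict                    # gamma[(i, j)]: bandwidth of link e_{i,j}
    c_in: dict                     # c_in[(i, j)]: time to send one input file from P_i to P_j
    c_out: dict                    # c_out[(i, j)]: time to send one output file from P_i to P_j
    f: dict                        # f[i]: fixed cost of establishing a master at P_i
    t: dict                        # t[i]: per-task cost of operating a master at P_i
    masters: set = field(default_factory=set)   # candidate master locations J_m

    def neighbors(self, i):
        """Index set n(i) of the neighbors of P_i."""
        return [b if a == i else a for (a, b) in self.edges if i in (a, b)]
```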
4 The B-COVER Problem 4.1 Mathematical Formulation of B-COVER To formally define the B-COVER problem, let n(i) denote the index set of the neighbors of processor Pi. During one time unit, αi is the fraction of time spent by Pi computing, si,j and s'i,j are the fractions of time spent by Pi sending input and output files to its neighbor Pj ∈ n(i), and ri,j and r'i,j are the fractions of time spent by Pi receiving input and output files from its neighbor Pj ∈ n(i). Defined on a platform graph G, a mathematical formulation of the B-COVER problem is given by the following integer linear program, whose objective is to maximize the throughput ntask(G) of the platform graph G.

Maximize ntask(G) = Σi∈V αi/wi, subject to:

(1) ∀i, 0 ≤ αi ≤ 1
(2) ∀i, ∀j ∈ n(i), 0 ≤ si,j ≤ 1
(3) ∀i, ∀j ∈ n(i), 0 ≤ s'i,j ≤ 1
(4) ∀i, ∀j ∈ n(i), 0 ≤ ri,j ≤ 1
(5) ∀i, ∀j ∈ n(i), 0 ≤ r'i,j ≤ 1
(6) ∀i ∈ Jm, xi ∈ {0, 1}
(7) ∀i, ∀j ∈ n(i), si,j = rj,i
(8) ∀i, ∀j ∈ n(i), s'i,j = r'j,i
(9) ∀i, Σj∈n(i) (si,j + s'i,j) ≤ 1
(10) ∀i ∈ Jm, xi + Σj∈n(i) s'i,j ≤ 1
(11) ∀i, Σj∈n(i) (ri,j + r'i,j) ≤ 1
(12) ∀i ∈ Jm, xi + Σj∈n(i) ri,j ≤ 1
(13) ∀ei,j ∈ E, (si,j/ci,j + ri,j/cj,i) βI + (s'i,j/c'i,j + r'i,j/c'j,i) βO ≤ γi,j
(14) ∀i ∉ Jm, gi = 0
(15) ∀i ∈ Jm, 0 ≤ gi ≤ (1/wi + µi) xi
(16) ∀i ∉ Jm, g'i = 0
(17) ∀i ∈ Jm, 0 ≤ g'i ≤ (1/wi + µ'i) xi
(18) ∀i, gi + Σj∈n(i) ri,j/cj,i = αi/wi + Σj∈n(i) si,j/ci,j
(19) ∀i, g'i + Σj∈n(i) s'i,j/c'i,j = αi/wi + Σj∈n(i) r'i,j/c'j,i
(20) Σi∈Jm (fi xi + ti gi) ≤ B
• Equations (1), (2), (3), (4) and (5) specify that all the activity variables must belong to the interval [0, 1], as they correspond to fractions of activity during one time unit.
• Equation (6) specifies candidate locations for establishing the masters.
• Equations (7) and (8) ensure communication consistency: the time spent by Pi to send input (output) files to Pj should be equal to the time spent by Pj to receive these input (output) files from Pi.
• Equation (9) ensures that send operations to neighbors of Pi are sequential.
• Equation (10) enforces that masters do not send output files to other processors.
• Equation (11) ensures that receive operations from neighbors of Pi are sequential.
• Equation (12) enforces that masters do not receive input files from other processors.
• Equation (13) ensures that link bandwidths cannot be exceeded. This constraint is due to our hypothesis that the same link ei,j may be used in both directions simultaneously.
• Equations (14) and (15) specify that only the masters can generate input files to be allocated on the Grid. Let gi be the number of input files generated by Pi per time unit. In practice gi will be limited by the number of tasks that Pi can process per time unit (i.e., 1/wi) plus the maximum number of input files µi that Pi can communicate to its neighbors per time unit. Under the base model, µi is bounded by the inverse of the smallest communication time ci,j of the neighbors of Pi. We hence have µi = 1 / min{ci,j | j ∈ n(i)}.
• Equations (16) and (17) specify that only masters can collect output files generated on the Grid. Let g'i be the number of output files collected by Pi per time unit, and µ'i
be the maximum number of output files that Pi can receive from its neighbors per time unit. We hence have µ'i = 1 / min{c'j,i | j ∈ n(i)}.
• Equations (18) and (19) stand for conservation laws. For every processor Pi, the number of input files generated, plus the number of input files received, should be equal to the number of input files processed, plus the number of input files sent (equation (18)). Similarly, for every processor Pi, the number of output files collected, plus the number of output files sent, should be equal to the number of input files processed, plus the number of output files received (equation (19)). It is important to understand that equations (18) and (19) really apply to the steady state. Consider an initialization phase during which some input files are forwarded to processors, but no computation is performed. Then some computation takes place, and some output files are produced. At the end of this initialization phase, we enter the steady state: during each time period in steady state, each processor can send/receive data and simultaneously perform some computation. Equations (7), (8), (18) and (19) ensure that there are as many input files produced as output files consumed. Indeed, by combining these four equations, we can obtain the following equation: Σi∈Jm gi = Σi∈Jm g'i.
• Equation (20) ensures that the costs generated by establishing (fi) and operating (ti) the chosen master locations do not exceed the budget constraint B.
• Finally, the objective function is the number of tasks computed within one unit of time, i.e., the platform throughput Σi∈V αi/wi.
4.2 Complexity of B-COVER Theorem 1. B-COVER is NP-hard. Proof. We reduce the MAXIMUM KNAPSACK (MK) problem [15] to the B-COVER problem. MAXIMUM KNAPSACK INSTANCE: Finite set U, for each u ∈ U a size s(u) ∈ Z+ and a value v(u) ∈ Z+, and a positive integer B ∈ Z+. SOLUTION: A subset U' ⊆ U such that Σu∈U' s(u) ≤ B. MEASURE: Total weight of the chosen elements, i.e., Σu∈U' v(u).
Let us construct an instance of the B-COVER problem as follows:
• Create a set V containing |U| processors.
• Create a bijective function f : V → U.
• ∀Pi ∈ V, wi = 1/v(f(Pi)).
• ∀Pi ∈ V, fi = s(f(Pi)).
• ∀Pi ∈ V, ti = 0.
• E = ∅ and Jm = V.
The graph of the B-COVER instance is edgeless (E = ∅), meaning that no tasks can be communicated among processors. Consequently, tasks can only be computed at
the location where they are generated. A solution of the B-COVER instance consists of determining a subset V' ⊆ V such that ΣPi∈V' fi ≤ B, in order to maximize the platform throughput, i.e., ΣPi∈V' 1/wi. Thus, a solution of the B-COVER problem instance provides a solution of the MK one. This proves that B-COVER is at least as difficult as MK. Since MK is NP-hard [15] and since the transformation is done in polynomial time, B-COVER is also NP-hard.
5 Heuristic Based on LP-Relaxation LP-relaxations, i.e., relaxations of the integer constraints of an integer linear program, are of considerable interest since they provide the basis both for various heuristics and for the determination of bounds for integer linear programs [11]. If we replace equation (6) by the following equation: ∀i ∈ Jm, 0 ≤ xi ≤ 1, we obtain a linear program in rational numbers, which can be solved in polynomial time. Solving the relaxed linear program associated with B-COVER gives an upper bound on the optimal platform throughput reachable without exceeding the budget constraint. This bound will be our benchmark for testing and comparing our heuristic. It is important to notice that (i) this bound might not be achievable by any discrete solution and (ii) this bound might not be tight. Our heuristic for solving the B-COVER problem consists of first solving the relaxed linear program, and then greedily selecting the vertices Pi that have the highest gi values without exceeding the budget constraint. We then obtain a set M ⊆ Jm of master locations, and we replace equation (6) by the following equations: ∀i ∈ M, xi = 1 and ∀i ∈ Jm\M, xi = 0. We then deal with a new linear program, equivalent to the one proposed in [14], which can be solved in polynomial time. The solution of the linear program gives rational values for all the variables, which means that we obtain a description of the resource activities during one time unit. The way to construct a schedule (i.e., one where an integer number of input/output files are communicated and an integer number of tasks is executed) out of this description is shown in [14]. In this paper, our main focus is to determine optimal master locations. Reconstructing the associated schedules is therefore not relevant in our context. A sketch of the two-phase heuristic is given below.
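The following sketch illustrates the two phases of the heuristic with the PuLP modeling library (an assumption of ours; the author's own solver setup is not described here). For brevity it adopts the simplified setting of Section 6 — βO = 0 (no output files, so the primed variables disappear), ti = 0, and Jm = V — and omits the link-capacity constraint (13).

```python
import pulp

def solve_bcover(V, E, w, c, f, budget, relax=True, fixed_masters=None):
    """V: vertex ids; E: directed pairs (i, j), both orientations present for every link;
    w[i]: time per task on P_i; c[(i, j)]: input-file transfer time P_i -> P_j;
    f[i]: cost of establishing a master at P_i; budget: B."""
    nbrs = {i: [j for (a, j) in E if a == i] for i in V}
    prob = pulp.LpProblem("B_COVER", pulp.LpMaximize)
    alpha = pulp.LpVariable.dicts("alpha", V, 0, 1)
    s = pulp.LpVariable.dicts("s", E, 0, 1)        # fraction of time P_i spends sending to P_j
    r = pulp.LpVariable.dicts("r", E, 0, 1)        # fraction of time P_i spends receiving from P_j
    g = pulp.LpVariable.dicts("g", V, 0)           # input files generated per time unit
    cat = pulp.LpContinuous if relax else pulp.LpBinary
    x = pulp.LpVariable.dicts("x", V, 0, 1, cat=cat)

    prob += pulp.lpSum(alpha[i] * (1.0 / w[i]) for i in V)            # throughput n_task(G)
    for i in V:
        mu = 1.0 / min(c[(i, j)] for j in nbrs[i])                    # files P_i can forward per time unit
        prob += pulp.lpSum(s[(i, j)] for j in nbrs[i]) <= 1           # single-port emission, eq. (9)
        prob += x[i] + pulp.lpSum(r[(i, j)] for j in nbrs[i]) <= 1    # masters receive no input, eq. (12)
        prob += g[i] <= (1.0 / w[i] + mu) * x[i]                      # only masters generate files, eq. (15)
        # conservation law, eq. (18): generated + received = processed + sent
        prob += g[i] + pulp.lpSum(r[(i, j)] * (1.0 / c[(j, i)]) for j in nbrs[i]) == \
                alpha[i] * (1.0 / w[i]) + pulp.lpSum(s[(i, j)] * (1.0 / c[(i, j)]) for j in nbrs[i])
        for j in nbrs[i]:
            prob += s[(i, j)] == r[(j, i)]                            # communication consistency, eq. (7)
    prob += pulp.lpSum(f[i] * x[i] for i in V) <= budget              # budget constraint, eq. (20), t_i = 0
    if fixed_masters is not None:                                     # phase two: master set already chosen
        for i in V:
            prob += x[i] == (1 if i in fixed_masters else 0)
    prob.solve()
    return pulp.value(prob.objective), {i: g[i].value() for i in V}

def greedy_masters(V, f, g_values, budget):
    """Greedy phase: pick the vertices with the highest g_i from the relaxed solution."""
    chosen, spent = set(), 0
    for i in sorted(V, key=lambda k: -g_values[k]):
        if spent + f[i] <= budget:
            chosen.add(i)
            spent += f[i]
    return chosen
```

Phase one calls solve_bcover with relax=True; phase two fixes the greedily chosen set via fixed_masters and re-solves. The relaxed objective value also plays the role of the upper bound used for comparison in Section 6.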
6 Simulations To verify our theoretical findings, a software simulator has been developed [9]. The inputs of the simulator are the number of vertices and edges in the graph and a probability function for the communication and computation costs. Because input and output operations are symmetric, we let βO = 0, i.e., tasks do not produce output files. Consequently, the masters initially hold input files, and only distribute them to the workers. We have also enforced that ∀i, ∀j ∈ n(i), ci,j = cj,i, i.e., communication times are equal between two neighbor processors, and we fixed γi,j = ci,j and βI = 1. We hence retrieve the model proposed by Banino et al. in [14]. For the sake of generality, we let Jm = V, i.e., any vertex can be chosen as a master. We let the fixed cost fi of establishing
a master location be equal to 1, and the per-task cost ti of operating a master be equal to 0. In other words, there are no differences in terms of cost among the different potential master locations, and the decisive factor for choosing master locations becomes the platform heterogeneity. We hence retrieve the model proposed by Shao et al. in [5]. As a consequence, the objective of the B-COVER problem becomes to maximize the platform throughput when using at most B masters. The problem then becomes similar to the Maximal Covering Location Problem [12], also known to be NP-hard. The aim of these arbitrary decisions is (i) to keep the number of parameters as low as possible, while maintaining the problem complexity, and (ii) to follow previous models proposed in the literature. Nevertheless, we expect our heuristic to perform better in the presence of more heterogeneity, e.g., with different communication times for ci,j and c'i,j, as well as heterogeneous cost distributions, since the problem will become more specific. Several random connected graphs based on these parameters have been generated. For each graph, the throughput of our heuristic is computed and compared to the upper bound. The following results are values averaged over 1000 random graphs, each composed of 20 vertices and 39 edges. In the simulation depicted in Fig. 2, the probability functions for the integer computation costs follow a uniform distribution on the interval [200, 400], while the integer communication costs follow a uniform distribution on (i) the interval [100, 200]: the average computation-to-communication ratio (C/C) is then equal to 2, (ii) the interval [40, 80]: C/C is then equal to 5, and (iii) the interval [20, 40]: C/C is then equal to 10.
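Test instances of this kind can be generated along the following lines (a sketch using the networkx library; this is not the simulator of [9], and the helper is purely illustrative):

```python
import random
import networkx as nx

def random_instance(n_vertices=20, n_edges=39, comp_range=(200, 400), comm_range=(100, 200)):
    """Draw G(n, m) random graphs until a connected one is found, then attach integer
    computation and communication costs drawn uniformly from the given intervals."""
    while True:
        G = nx.gnm_random_graph(n_vertices, n_edges)
        if nx.is_connected(G):
            break
    w = {i: random.randint(*comp_range) for i in G.nodes}        # computation costs w_i
    c = {}
    for (i, j) in G.edges:                                       # symmetric communication costs
        c[(i, j)] = c[(j, i)] = random.randint(*comm_range)
    f = {i: 1 for i in G.nodes}                                  # unit establishment cost f_i
    return G, w, c, f
```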
Fig. 2. Average platform utilization (y-axis, 0–100%) versus the number of masters used (x-axis, 0–5) for the B-COVER problem, for C/C ratios of 2, 5 and 10. Square curves correspond to the upper bound, while triangle curves correspond to our heuristic
An important choice for the decision maker is deciding what level of expenditure (i.e., how many masters should be used) can be justified by the resultant throughput [12]. In our experiments, for instance, 6, 3, and 2 masters would be appropriate when C/C equals 2, 5, and 10, respectively.
7 Conclusion and Future Work The problem of allocating a large number of independent, equal-sized tasks to a Grid composed of a heterogeneous collection of computing resources was considered. This paper suggests a simple but mandatory departure from traditional implementations: the need to deploy several masters to efficiently utilize currently emerging large-scale platforms. A cost model for establishing and operating master locations was provided, and the problem became that of finding the most profitable locations for the masters in order to optimize the platform throughput without violating a budget constraint. On the one hand, we derived a pessimistic theoretical result by showing that the B-COVER problem is NP-hard. On the other hand, we proposed a simple heuristic that achieves very good performance on a wide range of simulations. This heuristic can be embedded within a multi-round dynamic approach that could cope with, and respond to, variations in computation speeds or network bandwidths. Similarly to the hierarchical Master-Worker approach [8], before each round a supervisor could decide how many masters should be used, where to place them, and how many tasks each master will produce for the current round. We also showed that the B-COVER problem presents tight connections with well-known problems from the Operations Research field. Facility Location and Supply Chain Management problems have been the subject of a wealth of research, and we believe that models and solutions to these problems can be adapted for Grid computing. Of particular interest are facility location models under uncertainty (i.e., under dynamic conditions) and facility location models with facility failures [13]. Resource failures and variations in resource availability are very likely to occur on these new emerging large-scale platforms, especially when the overall processing time is large. This work can be extended in the following directions. First, we need approximate solutions of the B-COVER problem. Although our simple heuristic achieves very good performance on a wide range of simulations, there are no guarantees on how far the heuristic is from the optimal solution. Second, the B-COVER problem might be easier to solve (i.e., in polynomial time) on tree-shaped platforms, as is the case for the Maximal Covering Location Problem [16]. Finally, enabling the use of several masters transparently is a challenging task, but a mandatory one in order to promote distributed computing technology. The work of Chen [6] is a step in this direction.
Acknowledgments The author would like to thank his PhD advisor Anne C. Elster for her comments and suggestions, which greatly improved the final version of the paper. This research was supported by the Department of Computer and Information Science (IDI), at the Norwegian University of Science and Technology (NTNU).
References 1. Anderson, D.P., Cobb, J., Korpela, E., Lebofsky, M., Werthimer, D.: Seti@home: An Experiment in Public-Resource Computing. Communications of the ACM 45 (2002) 56–61
2. Sittig, D.F., Foulser, D., Carriero, N., McCorkle, G., Miller, P.L.: A Parallel Computing Approach to Genetic Sequence Comparison: The Master Worker Paradigm with Interworker Communication. Computers and Biomedical Research 24 (1991) 152–169 3. Chien, A., Calder, B., Elbert, S., Bhatia, K.: Entropia: Architecture and Performance of an Enterprise Desktop Grid System. Journal of Parallel and Distributed Computing 63 (2003) 597 4. Beaumont, O., Legrand, A., Robert, Y.: The Master-Slave Paradigm with Heterogeneous Processors. IEEE Trans. on Parallel and Distr. Sys. 14 (2003) 897–908 5. Shao, G., Berman, F., Wolski, R.: Master/slave Computing on the Grid. In: Proceedings of the 9th Heterogeneous Computing Workshop, IEEE Computer Society (2000) 3 6. Chen, I.: A Practical Approach to Fault-Tolerant Master/Worker in Condor. Master’s thesis, Dept. of Computer Science, Univ. of California at San Diego (2002) 7. Lamehamedi, H., Szymanski, B.: Data Replication Strategies in Grid Environments. In: Proc. Of the Fifth International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP’02) (2002) 8. Aida, K., Natsume, W., Futakata, Y.: Distributed Computing with Hierarchical Master-worker Paradigm for Parallel Branch and Bound Algorithm. In: Proceedings of the 3rd International Symposium on Cluster Computing and the Grid, Tokyo, Japan (2003) 9. Banino, C.: Optimizing Locationing of Multiple Masters for Master-Worker Grid Applications: A Thorough Study. Tech. Report 09/04, Dept. of Computer and Info. Science, NTNU (2004) URL: http://www.idi.ntnu.no/∼banino. 10. Current, J., Daskin, M., Schilling, D.: Discrete Network Location Models. In Drezner, Z., Hamacher, H., eds.: Facility Location Theory: Applications and Methods. Springer-Verlag, Berlin (2002) 11. Krarup, J., Pruzan, P.: The Simple Plant Location Problem: Survey and Synthesis. European Journal of Operations Research 12 (1983) 36–81 12. Church, R.L., ReVelle, C.S.: The Maximal Covering Location Problem. Papers of the Regional Science Association 32 (1974) 101–118 13. Daskin, M. S., Snyder, L.V., Berter, R.T.: Facility Location in Supply Chain Design. In Langevin, A., Riopel, D., eds.: Logistics Systems: Design and Optimization. Kluwer (2005) (forthcoming). 14. Banino, C., Beaumont, O., Carter, L., Ferrante, J., Legrand, A., Robert, Y.: Scheduling Strategies for Master-Slave Tasking on Heterogeneous Processor Platforms. IEEE Trans. Parallel Distributed Systems 15 (2004) 319–330 15. Ausiello, G., Protasi, M., Marchetti-Spaccamela, A., Gambosi, G., Crescenzi, P., Kann, V.: Complexity and Approximation: Combinatorial Optimization Problems and Their Approximability Properties. Springer-Verlag New York, Inc. (1999) 16. Megiddo, N., Zemel, E., Hakimi, S.L.: The Maximum Coverage Location Problem. SIAM Journal on Algebraic and Discrete Methods 4 (1983) 253–261
An OGSA-Based Bank Service for Grid Accounting Systems
Erik Elmroth¹, Peter Gardfjäll¹, Olle Mulmo², and Thomas Sandholm²
¹ Dept. of Computing Science and HPC2N, Umeå University, SE-901 87 Umeå, Sweden {elmroth,peterg}@cs.umu.se
² Dept. of Numerical Analysis and Computer Science and PDC, Royal Institute of Technology, SE-100 44 Stockholm, Sweden {mulmo,sandholm}@pdc.kth.se
Abstract. This contribution presents the design and implementation of a bank service, constituting a key component in a recently developed Grid accounting system. The Grid accounting system maintains a Grid-wide view of the resources consumed by members of a virtual organization (VO). The bank is designed as an online service, managing the accounts of VO projects. Each service request is transparently intercepted by the accounting system, which acquires a reservation on a portion of the project’s bank account prior to servicing the request. Upon service completion, the account is charged for the consumed resources. We present the overall bank design and technical details of its major components, as well as some illustrative examples of relevant service interactions. The system, which has been implemented using the Globus Toolkit, is based on state-of-the-art Web and Grid services technology and complies with the Open Grid Services Architecture (OGSA). Keywords: Grid accounting, allocation enforcement, OGSA, SweGrid.
1 Introduction This contribution presents the design and implementation of a bank service, constituting a key component in the SweGrid Accounting System (SGAS) [10] – a recently developed Grid accounting system, initially targeted for use in SweGrid [15]. A Grid accounting system maintains a Grid-wide view of the resources consumed by members of a virtual organization (VO) [6]. The information gathered by the system can serve several useful purposes, such as to allow enforcement of project quotas. The bank is designed as an online service, handling accounts that contain the resource allocations of Grid projects. Each service request (job submission) is transparently intercepted by the accounting system, which acquires a reservation on a portion of the project account prior to job execution (cf. credit card reservations). Upon job completion, the project account is charged for the consumed resources and the reservation is released. The decentralized nature of Grids, assuming the absence of any central point of control, adds complexity to the problem. The problem is further complicated by the distributed
This work was funded by The Swedish Research Council (VR) under contracts 343-2003-953 and 343-2003-954 and The Faculty of Science and Engineering, Umeå Univ.
resource administration in Grids, i.e., the requirement that resource owners at all times retain local control over their resources. The rest of this paper is outlined as follows. Section 2 gives a brief introduction to the accounting issues addressed, the SweGrid environment, the concept of Grid services, and an overview of the accounting system in which the bank is a key component. The bank design is described in Section 3, including a presentation of the services constituting the bank and their relationships. Some implementation details, including a fine-grained and customizable authorization framework, are summarized in Section 4. Finally, Section 5 concludes the paper.
2 Background The accounting system is primarily targeted towards allocation enforcement in SweGrid [15], although it has been designed to allow simple integration into different Grid environments. SweGrid is a Swedish national Grid, initially including 6 geographically distributed Linux clusters with a total of 600 CPUs dedicated to Grid usage 24 hours a day. The computer time of SweGrid is controlled by the Swedish National Allocations Committee (SNAC) [12], which issues computer time, measured in node hours per month, to research projects. Node hours are assigned to projects based on their scientific merits and resource requirements, after a peer-reviewed application process. The accounting system must provide coordinated enforcement of these quotas across all sites. That is, the whole allocation may be consumed at one cluster or in parts at any number of clusters. The design considerations for the bank service are closely related to those of the accounting system as a whole. The design is based on the assumption that a user can be a member of several VOs and participate in one or more projects within each VO. Each project's resource allocation is kept in an account, which is handled by the bank. The information maintained by the accounting system can, e.g., form the basis for direct economic compensation, usage quota enforcement, tracking of user jobs, resource usage evaluation, and dynamic priority assignment of user requests based on previous resource usage. The system performs soft real-time allocation enforcement. The enforcement is real-time in that resources can deny access at the time of job submission, e.g., if the allocation has been used up, and soft in that the level of enforcement strictness is subject to local policies. 2.1 Accounting System Overview Figure 1 presents the main entities and interactions of SGAS. Entity interactions are illustrated in a scenario where a job is submitted to a computer cluster, although it could be generalized to cover a generic service request to an arbitrary Grid resource. Each VO has an associated Bank service to manage the resource allocations of the VO research projects. The Bank is primarily responsible for maintaining a consistent view of the resources consumed by each research project, and enables coordinated
Fig. 1. Interactions among accounting system entities (shaded) during job submission
quota enforcement across the Grid sites. From a scalability perspective, a single bank per VO might seem restrictive. However, it should be stressed that the Bank service is not confined to a single site. It could be implemented as a virtual resource, composed of several distributed services, to achieve load-balancing and scalability. The Job Account Reservation Manager (JARM) is the (single) point of integration between SGAS and the underlying Grid environment. On each Grid resource, a JARM intercepts incoming service requests, performs account reservations prior to resource usage and charges the requester’s account after resource usage. A Log and Usage Tracking Service (LUTS) collects and publishes Usage Records [8], holding detailed information about the resources consumed by particular service interactions. In a typical scenario, the user, or an entity acting on behalf of the user, such as a broker, sends a job request to a resource that has been selected to execute the job. During the job request process, mutual authentication is performed and the user’s credentials are delegated to the resource (1). The job request, which includes the identity of the project account, is intercepted by the resource’s JARM (2), which contacts the VO Bank to acquire a time-limited reservation on a portion of the project allocation (3). Such an account reservation is referred to as a hold. If a hold can be granted, the JARM forwards the job request to a local resource manager (4) that runs the job and gathers information about the resources consumed by the job. At job completion, the JARM collects the usage information (5), charges the project account utilizing the hold (6), and records the usage information in a Usage Record which is logged in a LUTS (7). Any residual amount of the hold is released. Notably, a user can query both the Bank and the LUTS, e.g., for various account and job information. 2.2 The Grid Service Concept The implementation of the accounting system, including its bank component, is based on the concept of Grid services as defined by the Open Grid Services Architecture (OGSA) [5]. OGSA extends the concept of Web services [17] by introducing a class of transient
(bounded lifetime) and stateful (maintain state between invocations) Web services. Web services represent an XML-based distributed computing technology that is independent of platform, operating system and programming language. Web services are ideal for loosely coupled systems where interoperability is a primary concern. The transient nature of Grid services makes them suitable for representing not only physical resources but also more lightweight entities/activities, such as a video conference session, a data transfer, or in this case, different parts of a Grid accounting system. All services expose service data, a dynamic set of XML-encapsulated information about service metadata and local service state. Grid services provide operations for querying (service introspection) and updating service data. The Open Grid Services Infrastructure (OGSI) [16] defines a core set of composable interfaces which are used for constructing Grid services. The bank component presented in Section 3 makes use of the OGSI interfaces for, e.g., service creation, lifetime management, and service introspection. The interfaces are defined in Web Services Description Language (WSDL) [18] portTypes and specify the operations as well as the service data exposed by each service.
3 Bank Design The Bank component of SGAS has been designed with generality and flexibility in mind. Specifically, the bank does not assume any particular type of resources, and as such it can be integrated into any Grid environment. For example, all bank transactions are performed using Grid credits – an abstract, unit-less currency that can be used to charge for arbitrary resource usage. Prior to charging an account, a resource may apply any transformation function to map different kinds of resource usage into Grid credits. Since SweGrid resources are currently homogeneous and node-hours is the only resource type being accounted for, the resource-to-Grid credits mapping is trivial. In the general case of a heterogeneous Grid environment, different types of resource usage as well as differing resource characteristics need to be considered. For example, storage utilization could be accounted for as well, and faster processors might be more expensive. In case dedicated allocations need to be maintained for different types of resources, separate accounts may be provided for each resource type. The bank is also neutral with respect to policies. Policy decisions are left to users, resource managers and allocation authorities. The bank provides the flexibility to accommodate such policies, as well as means of enforcing them. For example, policies dictated by the allocation authority or the resource might allow a job to run even though the project allocation has been used up. However, if project quotas are not strictly enforced by a resource, the user can still decide only to perform safe withdrawals that do not exceed the project allocation. The SGAS bank is composed of three tightly integrated Grid services (Bank, Account, Hold), whose relationships are illustrated in Figure 2. Bank Service. The Bank service is responsible for creating as well as locating Accounts. The Bank service implements the factory pattern as provided by OGSI.
[UML diagram: the Bank and Account interfaces use the ServiceAuthorizationManagement interface and the OGSI GridService and Factory interfaces; a bank administrator creates Accounts through the Bank, and an Account creates Holds, which are used by the account holder.]
Fig. 2. Bank interface relationships
This allows the Bank to create new Account service instances. Clients can also query the Bank to obtain a list of the Accounts they are authorized to use. Account Service. An Account service manages the resource allocation of a research project. A project member can request a hold on the Account, effectively reserving a portion of the project allocation for a bounded period of time. A successful hold request results in the creation of a Hold service, acting as a lock on the reserved account quota. We refer to the Account service that created a Hold service as the parent account of that hold. Account services publish transaction history and account state, which can be queried by authorized Account members. The set of authorized Bank and Account users, as well as their individual access permissions, is dynamic and can be modified at run-time by setting an authorization policy, defined using XML. To this end, the Bank and Account interfaces extend the ServiceAuthorizationManagement (SAM) [10] service interface. SAM allows authorization policies to be associated with a service. The authorization policy could, e.g., contain an access control list associating a set of authorized Account members with their individual privileges. SAM is customizable, in that it allows different back-end authorization engines, also referred to as Policy Decision Points (PDP), to be configured with the service. Note that there is nothing Bank-specific about SAM; it can be used by any service requiring fine-grained management of authorization policies. Hold Service. A Hold service represents a time-limited reservation on a portion of the parent account’s allocation. Holds are usually acquired prior to job submission and committed after job completion to charge their parent account for the resources consumed by the job. Hold services are created with an initial lifetime and can be
destroyed either explicitly or through expired lifetime. Hold expiry is controlled using the lifetime management facilities provided by OGSI. On commit or destruction, the Hold service is destroyed and any residual amount is returned to the parent account. In the case of destruction, the entire reservation is released. On commit, a specified portion of the Hold, corresponding to the actual resource usage, is withdrawn from the parent account, and a transaction entry is recorded in the transaction log.
Service Interfaces. The operations exposed by the bank services are presented in Table 1. Note that the service interfaces extend OGSI interfaces, which are used for such purposes as lifetime management, service creation and service introspection. The Bank and Account services provide batch commit operations (commitHolds), which allow resources to perform asynchronous commits of sets of Hold services in a single service invocation. The benefit of this approach is twofold: the perceived resource response time decreases, allowing higher job throughput during periods of high load, and the bank request load is reduced, improving overall system scalability. Table 2 gives an overview of the service data exposed by the bank services. Note that the service data set of each service also contains the service data of extended OGSI interfaces. Furthermore, note that since Bank and Account extend the SAM interface, these services also expose the servicePolicy service data element. Only authorized service clients are allowed to query the service data.

Table 1. The operations exposed by the SGAS bank interfaces

portType  Operation      Description
SAM       setPolicy      Set an authorization policy to be associated with the service.
Bank      getAccounts    Get all accounts that the caller is authorized to use.
          commitHolds    Batch commit of several Holds (created by any account in the Bank).
Account   requestHold    Creates an account Hold if enough funds are available.
          addAllocation  Add a (potentially negative) amount to the account allocation.
          commitHolds    Batch commit of several Holds (created by this account).
Hold      commit         Withdraws a specified amount of the hold from the parent account.
Table 2. The service data exposed by the SGAS bank interfaces

portType  serviceData     Description
SAM       servicePolicy   The authorization policy associated with the service.
Bank      none            -
Account   accountData     The account state: total allocation, reserved funds and spent funds.
          transactionLog  Publishes all transaction log entries.
Hold      holdData        The reserved amount and the identity of the parent account.
[Sequence diagram: a Resource invokes getAccounts() on the Bank, then requestHold() on the Account, which creates a Hold via createService(); the Resource may adjust the Hold termination time with requestTerminationAfter(), and finally invokes commit() on the Hold.]
Fig. 3. Hold creation and usage
Figure 3 illustrates typical interactions between a resource and an Account service. A job is submitted by a user to a resource, which invokes the getAccounts operation on the Bank to get the user Account (unless it is specified in the job request). The resource requests a Hold on the Account, prior to job execution. The request includes an initial Hold expiry time. The expiry time can be reset at any time using the lifetime management operations provided by OGSI. A resource may choose to extend the Hold lifetime, e.g., due to a long batch queue. The commit operation is invoked by the resource after job completion to charge the Hold’s parent account for the consumed resources. If the Hold is destroyed, either explicitly or through expired lifetime, the Hold amount is returned to the parent Account.
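The same flow can be summarized as pseudo-client code (Python is used only for illustration; the client objects and method signatures are hypothetical stand-ins for the generated service stubs — only the operation names come from Table 1):

```python
def run_job_with_hold(bank, job, estimated_cost, measure_usage):
    """Hypothetical resource-side flow around a job submission: reserve, run, charge."""
    # Locate an account the submitting user may use, unless the job request names one.
    account = job.account or bank.getAccounts(user=job.user)[0]

    # Reserve a portion of the project allocation; the request carries an initial expiry time.
    hold = account.requestHold(amount=estimated_cost, initial_expiry=job.deadline)

    try:
        # The hold lifetime may be extended later, e.g., while the job sits in a long batch queue.
        hold.requestTerminationAfter(job.deadline + job.queue_margin)

        result = job.run()                      # local resource manager executes the job
        consumed = measure_usage(result)        # map actual usage to Grid credits

        hold.commit(amount=consumed)            # charge the parent account; any residue is released
        return result
    except Exception:
        hold.destroy()                          # destruction returns the whole reservation
        raise
```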
4 Bank Implementation Interoperability is a primary concern in Grid computing. Thus, rather than implementing our own middleware with ad-hoc protocols and communication primitives, we leverage
the latest Grid standardization efforts and toolkits. Current Grid standardization focuses on Web service technologies in general and OGSA in particular. Globus Toolkit. Our implementation is based on the Globus Toolkit (GT) [11] open-source Java reference implementation of the OGSI specification, implemented on top of the Axis SOAP engine [19]. GT provides a container framework to host Grid services and a set of tools and implementations of core OGSI interfaces, allowing developers to build and deploy custom Grid services. By basing our solution on GT we conform to the latest standard specifications, thereby achieving desirable interoperability. Furthermore, composing our solution of toolkit primitives cuts down development time. GT also provides a security framework orthogonal to application code, which includes mutual authentication, message encryption and message signing. These security primitives are based on the WS-SecureConversation [9], XML-Signature [3] and XML-Encryption [7] standards. Service Implementations. All bank services are implemented using operation providers, a delegation-based implementation approach where a service is defined in terms of operation providers, each implementing part of the service functionality. Operation providers facilitate implementation reuse as well as a development model based on composition of primitives. For example, a service can be given the capabilities of the SAM interface simply by configuring the SAM operation provider with the service. In order to guarantee recoverability, all services are checkpointed to a back-end database that runs embedded in the GT container. Our implementation uses Xindice [2], an open-source native XML database that conforms to the standards developed by the XML:DB group [21]. In the event of a server crash, the state of all services needs to be recovered from secondary storage on server restart. The recovery is performed by means of GT service loaders. Besides enabling state checkpointing and recovery, the database solution further allows users to pose non-trivial queries against bank information, such as the transaction log, by exposing database content as service data and embedding database query expressions in service data queries. The GT query evaluator framework allows us to redirect those service data queries to the back-end database, effectively exposing a database view through the service data framework. For example, an Account member can run XPath [20] queries against the transaction log to obtain specific transaction information. The XPath query approach further avoids the inconvenience of defining a query language by means of different interface operations.
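As an illustration of such a query, the sketch below runs an XPath expression over a transaction-log document (the XML element and attribute names are invented for the example; the actual SGAS log schema is not shown in this paper):

```python
# Purely illustrative: the element and attribute names below are hypothetical,
# not the actual SGAS transaction-log schema.
from lxml import etree

log = etree.fromstring(b"""<transactionLog>
  <transaction id="42" holdId="h-17" amount="120" date="2004-05-03"/>
  <transaction id="43" holdId="h-18" amount="15"  date="2004-05-04"/>
</transactionLog>""")

# All transactions that charged more than 100 Grid credits.
for t in log.xpath("/transactionLog/transaction[@amount > 100]"):
    print(t.get("id"), t.get("amount"))
```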
Authorization Framework. The GT framework provides a declarative security infrastructure through the use of security deployment descriptors. The authentication method, as well as the handling of delegated credentials, can be specified on a per-operation basis. The available framework for authorization, on the other hand, only allows authorization to be specified on a per-service basis. As the current all-or-nothing access to a service is too coarse for our purposes, we have developed a fine-grained authorization framework [10]. Through the SAM interface, an authorization policy and an authorization engine (a PDP) can be associated with a service. The SAM design is highly flexible and customizable since it allows different authorization back-ends, using different policy languages, to be configured with a service. The current implementation uses a default authorization back-end based on the eXtensible Access Control Markup Language (XACML) [1]. Specifically, Sun's XACML PDP implementation [14] is used as the underlying authorization engine. However, successful experiments have been carried out using other authorization engines as well. Using XACML we have also included an overdraft policy, allowing an administrator to set an upper limit on the acceptable account overdraft. GT uses different handlers that intercept a SOAP [13] request before it reaches the target service. To incorporate our authorization framework, we provide an authorization handler that delegates the authorization decision to the target service's PDP, if one is available. The authorization framework is orthogonal to the service implementation. That is, the service implementation is not affected by customization or replacement of the security implementation.
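The handler-to-PDP delegation can be pictured with a small, generic interceptor sketch (Python, illustration only; the class and method names are invented and do not mirror the GT handler or XACML APIs):

```python
# Generic sketch of an authorization handler delegating to a per-service PDP.
class AccessListPDP:
    """A trivial policy engine: an access-control list mapping subjects to allowed operations."""
    def __init__(self, acl):
        self.acl = acl                          # e.g. {"CN=Alice": {"requestHold", "commit"}}
    def authorize(self, subject, operation, target):
        return operation in self.acl.get(subject, set())

class AuthorizationHandler:
    """Intercepts a request before it reaches the target service and asks the service's PDP."""
    def __init__(self, service):
        self.service = service
    def handle(self, subject, operation, *args, **kwargs):
        pdp = getattr(self.service, "pdp", None)
        if pdp is not None and not pdp.authorize(subject, operation, self.service):
            raise PermissionError(f"{subject} may not invoke {operation}")
        return getattr(self.service, operation)(*args, **kwargs)
```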
5 Concluding Remarks We have presented the design and implementation of a bank service for use in a recently developed Grid accounting system. The bank, as well as the accounting system as a whole, is designed to be general and customizable, allowing non-intrusive integration into different Grid environments. The system, which is standards-based and leverages state-of-the-art Web and Grid services technology, offers user-transparent enforcement of project allocations while providing fine-grained end-to-end security. Our planned future work includes a transition towards the Web Services Resource Framework (WSRF) [4]. Other areas that are subject to further investigation include more flexible handling of project allocations and measures of improving scalability.
Acknowledgments We acknowledge Lennart Johnsson, Royal Institute of Technology, and Bo Kågström, Umeå University, for fruitful discussions and constructive comments. We would also like to thank Martin Folkman, Uppsala University, for his work on developing an SGAS administration tool. We are also grateful for the constructive comments in the feedback from the refereeing process.
References 1. A. Anderson, A. Nadalin, B. Parducci, D. Engovatov, H. Lockhart, M. Kudo, P. Humenn, S. Godik, S. Anderson, S. Crocker, and T. Moses. eXtensible Access Control Markup Language (XACML) Version 1.0, OASIS, 2003. 2. Apache Xindice, 2004. http://xml.apache.org/xindice/. 3. M. Bartel, J. Boyer, B. Fox, B. LaMacchia, and E. Simon. XML-Signature Syntax and Processing, W3C, 2002.
4. K. Czajkowski, D. Ferguson, I. Foster, J. Frey, S. Graham, I. Sedukhin, D. Snelling, S. Tuecke, and W. Vambenepe. The WS-Resource Framework: Version 1.0, 2004. http://www.globus.org/wsrf/specs/ws-wsrf.pdf. 5. I. Foster, C. Kesselman, J.M. Nick, and S. Tuecke. The Physiology of the Grid: An Open Grid Services Architecture for Distributed Systems Integration. 2002. http://www.globus.org/research/papers/ogsa.pdf. 6. I. Foster, C. Kesselman, and S. Tuecke. The Anatomy of the Grid: Enabling Scalable Virtual Organizations. International Journal of High Performance Computing Applications, 15(3):200 – 222, 2001. 7. T. Imamura, B. Dillaway, and E. Simon. XML Encryption Syntax and Processing, W3C, 2002. 8. S. Jackson and R. Lepro. Usage Record – XML Format, Global Grid Forum, 2003. 9. G. Della-Libera, B. Dixon, P. Garg, and S. Hada. Web Services Secure Conversation (WSSecureConversation), Microsoft, IBM, VeriSign, RSA Security, 2002. 10. T. Sandholm, P. Gardfj¨all, E. Elmroth, L. Johnsson, and O. Mulmo. An OGSA-Based Accounting System for Allocation Enforcement Across HPC Centers. Proceedings of the 2nd International Conference on Service Oriented Computing (ICSOC’04), ACM, New York, USA, November 15-19, 2004 (to appear). 11. T. Sandholm and J. Gawor. Globus Toolkit 3: A Grid Service Container Framework, 2003. http://www-unix.globus.org/toolkit/3.0/ogsa/docs/gt3 core.pdf. 12. SNAC - Swedish National Allocations Committee, 2004. http://www.snac.vr.se/. 13. SOAP Specifications, 2004. http://www.w3.org/TR/soap/. 14. Sun’s XACML Implementation, Sun Microsystems, 2004. http://sunxacml.sourceforge.net/. 15. SweGrid, 2004. http://www.swegrid.se/. 16. S. Tuecke, K. Czajkowski, J. Frey, S. Graham, C. Kesselman, T. Maquire, T. Sandholm, D. Snelling, and P. Vanderbilt. Open Grid Services Infrastructure: Version 1.0, Global Grid Forum, 2003. 17. Web Services, 2004. http://www.w3.org/2002/ws/. 18. Web Services Description Language (WSDL), 2004. http://www.w3.org/TR/wsdl. 19. WebServices – AXIS, 2004. http://ws.apache.org/axis/. 20. XML Path Language (XPath), 2004. http://www.w3.org/TR/xpath. 21. XML:DB Initiative, 2004. http://xmldb-org.sourceforge.net/.
A Grid Resource Broker Supporting Advance Reservations and Benchmark-Based Resource Selection
Erik Elmroth and Johan Tordsson
Dept. of Computing Science and HPC2N, Umeå University, SE-901 87 Umeå, Sweden {elmroth,tordsson}@cs.umu.se
Abstract. This contribution presents algorithms, methods, and software for a Grid resource manager, responsible for resource brokering and scheduling in early production Grids. The broker selects computing resources based on actual job requirements and a number of criteria identifying the available resources, with the aim to minimize the total time to delivery for the individual application. The total time to delivery includes the time for program execution, batch queue waiting, input/output data transfer, and executable staging. Main features of the resource manager include advance reservations, resource selection based on computer benchmark results and network performance predictions, and a basic adaptation facility. Keywords: Resource broker, scheduling, production Grid, benchmark-based resource selection, advance reservations, adaptation, Globus toolkit.
1 Introduction The task of a Grid resource broker and scheduler is to dynamically identify and characterize the available resources, and to select and allocate the most appropriate resources for a given job. The resources are typically heterogeneous, locally administered, and accessible under different local access policies. The broker operates without global control, and its decisions are entirely based on the information provided by individual resources and information services with aggregated resource information. In addition to information about what resources are available, each resource may provide static information about architecture type, memory configuration, CPU clock frequency, operating system, local scheduling system, various policy issues, etc., and dynamic information such as current load and batch queue status. For more information on typical requirements for a good resource broker, see [3]. A reservation capability is vital for enabling co-allocation of resources in highly utilized Grids. This feature also provides a guaranteed alternative to predicting batch queue waiting times. For general discussions about resource reservations (and co-allocations), see, e.g., [5,8]. A reservation feature naturally depends on the reservation support provided by local schedulers [12], and the use of advance reservations also has implications for the utilization of each local resource [21].
This work was funded by The Swedish Research Council (VR) under contract 343-2003-953 and The Faculty of Science and Engineering, Umeå University.
The performance differences between Grid resources, and the fact that their relative performance characteristics may vary for different types of applications, make resource selection difficult; see, e.g., [9,16,19,20]. Our approach to handling this is to use a benchmark-based procedure for resource selection. Based on the user's identification of relevant benchmarks and an estimated execution time on some specified resource, the broker estimates the execution time for all resources of interest. This requires that a relevant set of benchmark results is available from the resources' information systems. As most Grid resources are still offered without performance and queue-time guarantees, the adaptation feature provided by our resource broker is useful. It allows a user to request that the broker, after an initial job submission, strive to redirect the job to another resource likely to give a shorter total time to delivery. A resource broker and scheduler is fundamental in any large-scale Grid environment for scientific applications. Our software is mainly targeted for the NorduGrid and SweGrid infrastructures. These are both Globus-based production environments for 24-hour-per-day Grid usage. The outline of the paper is as follows. Section 2 gives a brief introduction to the NorduGrid software and the general resource brokering problem. The main algorithms and techniques of our resource broker are presented in Section 3, including, e.g., support for making advance resource reservations and for selecting resources based on estimates of the total time to delivery. The latter feature builds on benchmark-based execution time estimates and network performance predictions. Minor extensions to the NorduGrid user interface are presented in Section 4. Sections 5 and 6 present some concluding remarks and acknowledgments, respectively, followed by a list of references.
2 Background and Motivation Our development of resource brokering algorithms and prototype implementations is mainly focused on the infrastructure and usage scenarios typical for NorduGrid [13] and SweGrid [23]. The main Grid middleware used is the NorduGrid software [6]. The Grid resources are typically Linux-based clusters (so, in the rest of this paper, the word cluster is often used instead of the more general term Grid resource). NorduGrid includes over 40 clusters of varying size, most of them located in the Nordic countries. SweGrid currently includes six clusters, each with 100 Pentium 4 processors. 2.1 The NorduGrid Software The NorduGrid middleware is based on standard protocols and software such as OpenLDAP [14], OpenSSL [15] and Globus Toolkit version 2 [7,10]. The latter is not used in full, as some Globus components, such as the GRAM (with the gatekeeper and jobmanager components), are replaced by custom-made components [6]. NorduGrid makes use of the Grid Security Infrastructure (GSI) in Globus Toolkit 2. GSI specifies a public key infrastructure, and SSL (TLS) for authenticated and private communication. GSI also specifies short-term user-proxy certificates, signed either by the user or by another proxy. The proxy allows the user to access various remote resources without (re)typing passwords and to delegate authority. An extension of a subset
of the Globus resource specification language (RSL) [17] is used to specify resource requirements and necessary job execution information. The NorduGrid user interface consists of command line tools for managing jobs. Users can submit jobs, monitor the execution of their jobs, and cancel jobs. The resource broker is part of the job submission tool, ngsub. Other tools allow the user, e.g., to retrieve output from jobs, get a peek preview of job output, and remove job output files from a remote resource. Communication with remote resources is handled by a simple GridFTP client module. Each Grid resource runs one instance of the NorduGrid GridFTP server. When submitting a job, the user invokes the broker, which uploads an RSL job submission request to the GridFTP server. The GridFTP server specifies plugins for custom handling of FTP protocol messages. NorduGrid makes use of these for Grid access to the local file systems of the resources, to manage Grid access control lists, and, most importantly, for Grid job management. Each resource also runs a Grid manager that manages the Grid jobs through the various phases of their execution. The Grid manager periodically searches for recently submitted jobs. For each new job, the RSL job description is analyzed, and any required input files are staged to the resource using GridFTP. The job description is translated into the language of the local scheduler, and the job is submitted to the batch system. Upon job completion, the Grid manager stages the output files to the locations specified in the job description. 2.2 The Resource Brokering Problem In general, we can identify two major classes of Grid resource brokers, namely centralized and distributed brokers. A centralized broker manages and schedules all jobs submitted to the Grid, while a distributed broker typically handles jobs submitted by a single user only. Centralized brokers are, at least in theory, able to produce optimal schedules as they have full knowledge of the jobs and resources, but such a broker can easily become a performance bottleneck and a single point of failure. Moreover, they turn a Grid environment more into a single virtual computer than into a dynamic infrastructure for flexible resource sharing. A distributed broker architecture, on the other hand, scales well and makes the Grid more fault-tolerant, but the partial information available to each instance of the broker complicates the scheduling decisions. A hybrid type is the hierarchical broker architecture, in which distributed brokers are scheduled by centralized brokers, attempting to combine the best of both worlds. Grid systems utilizing a centralized broker include, e.g., EDG [18] and Condor [11]. Distributed brokers are implemented by, e.g., AppLeS [2] and NetSolve [4]. For further discussions on a similar classification of brokers, see, e.g., [1]. In the following we focus on algorithms and software for a distributed resource broker. Such a broker typically seeks to fulfill the user's resource requests by selecting the resources that best suit the user's application. Selecting the most suitable resources often means identifying the resources that provide the shortest Total Time to Delivery (TTD) for the job. TTD is the total time from the user's job submission to the time when the output files are stored where requested. This includes the time required for transferring the input files and the executable to the resource, the waiting time, e.g., in a batch
queue, the actual execution time, and the time to transfer the output files to the requested location(s). In order to manage this task, the broker needs to identify what resources are available to the user, to characterize the resources, to estimate all parts of the TTD, etc. Other requests the user may put on the broker include making advance resource reservations, e.g., at a specific time, or asking the broker not only to submit the job but also to adapt to possible changes in resource load and possibly perform job migration.
3 Resource Brokering Algorithms

Our main brokering algorithm performs a series of tasks: it processes the RSL specifications of job requests, identifies and characterizes the available resources, estimates the total time to delivery for each resource of interest, makes advance reservations of resources, and performs the actual job submission. Figure 1 presents a high-level outline of the algorithm.

Input: RSL-specification(s) of one or more job requests.
Action: Select and submit jobs to the most appropriate resources.
Output: none.
1. Process the RSL specification and create a list of all individual job requests.
2. Contact GIIS server(s) to obtain a list of available clusters.
3. Contact each resource's GRIS for static and dynamic resource information (hardware and software characteristics, current queue and load, etc.).
4. For each job:
   (a) Select the cluster to which the job will be submitted:
       i. Filter out clusters that do not fulfill the requirements on memory, disk space, architecture etc., and clusters that the user is not authorized to use.
       ii. Estimate the TTD for each remaining resource (see Section 3.2). If requested, resource reservation is performed during this process.
       iii. Select the cluster with the shortest predicted TTD.
   (b) Submit the job to the selected resource.
   (c) Release any reservations made to non-selected clusters.

Fig. 1. Brokering algorithm overview
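As a rough illustration of steps 4(a)-4(c), the following Python sketch shows how the per-job selection could be organized. It is not taken from the actual broker: the callables passed in (is_feasible, predict_ttd, reserve, release, submit) are hypothetical stand-ins for the corresponding broker internals described in the text.

```python
from typing import Callable, Dict, Iterable, List, Optional

def run_broker(jobs: Iterable[dict],
               clusters: List[str],
               is_feasible: Callable[[dict, str], bool],
               predict_ttd: Callable[[dict, str], float],
               reserve: Callable[[dict, str], Optional[str]],
               release: Callable[[str], None],
               submit: Callable[[dict, str, Optional[str]], None],
               use_reservations: bool = False) -> None:
    """Sketch of steps 4(a)-4(c) of the brokering algorithm in Fig. 1."""
    for job in jobs:
        # 4(a)i: drop clusters that cannot run the job, or that the user may not use
        candidates = [c for c in clusters if is_feasible(job, c)]
        if not candidates:
            continue  # no suitable resource for this job
        # 4(a)ii: optionally make advance reservations, then estimate the TTD
        reservations: Dict[str, Optional[str]] = {}
        if use_reservations:
            reservations = {c: reserve(job, c) for c in candidates}
        ttd = {c: predict_ttd(job, c) for c in candidates}
        # 4(a)iii and 4(b): submit to the cluster with the shortest predicted TTD
        best = min(candidates, key=lambda c: ttd[c])
        submit(job, best, reservations.get(best))
        # 4(c): release reservations made at the non-selected clusters
        for c, rid in reservations.items():
            if c != best and rid is not None:
                release(rid)
```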
The input RSL specification(s) contains one or more job requests including information about the job to run (e.g., executable, input/output files, arguments), actual job requirements (e.g., amount of memory needed, architecture requirement, computer time required, requests for advance reservations), and optionally, job characteristics that can be used to improve the resource selection (e.g., a listing of benchmarks with relevant performance characteristics). In Step 1, the user's request is processed and separated into individual jobs. In Step 2, the broker identifies what resources are available by contacting one or more Grid Index Information Services. The specific characteristics of the resources found are identified in Step 3, by contacting the Grid Resource Information Service on
each individual resource. The actual brokering process is mainly performed in Step 4, where resources are evaluated, selected, and optionally reserved in advance for each job. Finally, the jobs are actually submitted and any non-utilized reservations are released. In the following presentation, we focus on the more intricate issues of performing advance reservations and on how to determine an estimate (prediction) of the total time to delivery. Notably, this algorithm does not reorder the individual job requests (when multiple jobs are submitted in a single invocation). This can possibly be done in order to reduce the average batch queue waiting time, at least by submitting shorter jobs before longer ones given that they require the same number of CPUs. In the general case, factors such as local scheduling algorithms and competing Grid brokers make the advantage of job reordering less obvious. 3.1 Advance Resource Reservations The advanced reservation feature makes it possible to obtain a guaranteed start time in advance. A guaranteed start time brings two advantages. It makes it possible to coordinate the job with other activities, and resource selection can be improved as the resource comparison is based on a guaranteed start time rather than on an estimate. The reservation protocol developed supports two operations: requesting a reservation and releasing a reservation. A reservation request contains the start time and the requested length of the reservation, the requested number of CPUs, and optionally, an account to be charged for the job. Upon receiving a reservation request from the broker, the GridFTP server adds the user’s local identity to the request, information unknown to the broker but required by the local scheduler. Then, the GridFTP server invokes a script to request a reservation from the local scheduler. If the scheduler accepts the request and creates the reservation, the GridFTP server sends a unique identifier and the start time of the reservation to the broker. If no reservation could be created, a message indicating failure is returned to the broker. The GridFTP server saves the reservation identifier and a copy of the user’s proxy for every successful reservation, enabling subsequent identification of the user who made the reservation. For releasing a reservation, the broker uploads a release message containing the reservation identifier and the GridFTP server confirms that the reservation is released. Job Submission with a Reservation. Upon a successful reservation, the broker adds the received reservation identifier to the RSL job description before submitting the job to the resource. Before the job is submitted to the local scheduler, the Grid manager analyzes the job description and detects the reservation identifier. The Grid manager inspects the saved proxy files and their associated reservation identifiers to ensure that the reservation exists. Furthermore, the Grid manager compares the proxy used to submit the job with the one used to make the reservation, ensuring that no user can submit jobs to another user’s reservation by spoofing the reservation id. The job is not submitted to the local scheduler unless the specified reservation exists and was created by the user submitting the Grid job. When the job finishes executing on the resource, the Grid manager may remove the reservation, allowing the user to run only the requested job. Alternatively, resources
may allow the user to submit more jobs to the reservation once the first has finished. The configuration of the Grid manager determines the policy to be used. The advance resource reservation feature requires a reservation capability in the local scheduler. The current implementation supports the Maui scheduler [22], although any local scheduler may be used. Support for other schedulers can easily be added by adapting the scripts interacting with the local scheduler (see, e.g., [12]).

3.2 Estimating the Total Time to Delivery

The estimation of the total time to delivery (TTD), from the user's job submission to the final delivery of output files to the requested storage, requires that the time to perform the following operations is estimated:
1. Stage in: transfer of input files and executable to the resource,
2. Waiting, e.g., waiting in the batch queue and for operation 1 to complete,
3. Execution, and
4. Stage out: transfer of output files to the requested location.
Notably, the waiting time is here defined as the maximum of the time for stage in and all other waiting times before the job actually can start to execute. The estimated TTD is given as the sum of estimated times for operations 2, 3, and 4. Sometimes, the time for stage out cannot be estimated, due to lack of information about output file sizes. In these cases that part is simply omitted from the TTD used for comparisons. Below, we summarize how we make these estimates.
Benchmark-Based Time Predictions. The execution time estimate needs to be based both on the performance of the resource and the characteristics of the applications. This issue is made slightly more intricate by the fact that the relative performance difference between different computing resources typically varies with the character of the application. In order to circumvent this problem, we give the user the opportunity to specify one or more benchmarks with performance characteristics similar to those of the application. This information is given together with an execution time estimate on a resource with a specified benchmark result. Based on this information and benchmark results for each individual resource, we make execution time estimates for all resources of interest. In doing this, we assume linear scaling of the application in relation to the benchmark, i.e., a resource with a benchmark result a factor k better is assumed to execute the application a factor k faster. We remark that a good execution time estimate serves two purposes. First, a sufficient but short estimated execution time may lead to an earlier job start, due to standard batch system scheduling algorithms. Second, a good estimate is more likely not to be too short, and hence reduces the risk for job preemption by the local scheduler. The user can specify k benchmarks as triples {b_i, r_i, t_i}, i = 1, ..., k, where b_i is the benchmark name and r_i is the benchmark result on a system where the application requires the time t_i. The broker matches these specifications with the benchmark results provided by each cluster. For each requested benchmark that is available at the resource, an execution time for the application is predicted.
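As an illustration of this scaling scheme (including the penalty factor described below), here is a small Python sketch. It is an assumption-laden reading of the text, not the broker's code: in particular it assumes that a numerically larger benchmark result means a faster machine, and the function and parameter names are invented for the example.

```python
def predict_runtimes(benchmarks, cluster_results, penalty=1.25):
    """Benchmark-based execution time estimates (sketch of Section 3.2).

    benchmarks      -- list of (name, reference_result, runtime_seconds) triples
                       supplied by the user.
    cluster_results -- dict mapping benchmark names to this cluster's results
                       (a larger result is assumed to mean a faster machine).
    Returns (average, maximum) over the per-benchmark estimates: the average
    is used when comparing clusters, the maximum in the actual job request.
    """
    estimates, missing = [], []
    for name, ref_result, runtime in benchmarks:
        if name in cluster_results:
            # linear scaling: a k times better benchmark result is assumed
            # to give a k times shorter runtime
            k = cluster_results[name] / ref_result
            estimates.append(runtime / k)
        else:
            missing.append(name)
    if not estimates:
        return None  # no common benchmark: no prediction possible (assumption)
    # benchmarks the cluster does not report get a penalised estimate
    estimates += [penalty * max(estimates)] * len(missing)
    return sum(estimates) / len(estimates), max(estimates)

# Example: NAS LU result 250 on a machine needing 65 min; a cluster reporting 500
# is predicted to need 32.5 min for that benchmark.
print(predict_runtimes([("nas-lu-c", 250.0, 65.0)], {"nas-lu-c": 500.0}))
```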
If the cluster does not provide a result for a requested benchmark, the corresponding time estimate is taken to be a penalty factor c times the longest execution time estimated from other benchmarks for that cluster. The penalty factor can be configured by the user. As a default value, c = 1.25 is used. When comparing the performance of different clusters during the resource selection procedure, one part of the TTD estimate for each resource is given from this procedure. The value used is the average execution time estimate of the k estimates obtained from different benchmarks. At job submission, after the selection procedure, an execution time must be included in the request sent to the resource. In order to reduce the risk for scheduler preemption, this execution time is chosen as the maximum of the k estimates calculated.
Network Performance Predictions. The time estimation for the stage in and stage out procedures is based on the actual (known) sizes of input files and the executable file, user-provided estimates for the sizes of the output files, and network bandwidth predictions. The network bandwidth predictions are performed using the Network Weather Service (NWS) [24]. NWS combines periodic bandwidth measurements with statistical methods to make short-term predictions about the available bandwidth.

3.3 Job Queue Adaptation

Information gathered about the state of a Grid is necessarily old. Network load and batch queue sizes may change rapidly, new resources may appear and others become unavailable. The load predictions used by the broker as a basis for resource selection will become out-of-date. Nevertheless, more recent information will always be available as Grid resources periodically advertise their state. With this in mind, the broker can keep searching for better resources once the initial job submission is done. If a new resource that is likely to result in an earlier job completion time is found (taking into account all components of TTD, including file restaging), the broker migrates the job to the new resource. This procedure is repeated until the job starts to execute on the currently selected resource.
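To make the composition of the TTD estimate and the adaptation test concrete, here is a minimal sketch. The formula follows the definition above (waiting time is the maximum of the stage-in time and the other waiting times), while the migration condition is one reading of "earlier completion including file restaging"; all argument names are invented for the example.

```python
def estimate_ttd(stage_in_s, queue_wait_s, execute_s, stage_out_s=None):
    """Total time to delivery as defined in Section 3.2 (sketch).
    The stage-out term is dropped when output sizes are unknown."""
    wait = max(stage_in_s, queue_wait_s)
    return wait + execute_s + (stage_out_s or 0.0)

def should_migrate(current_completion_s, new_cluster_ttd_s, restage_s):
    """Job queue adaptation (Section 3.3, sketch): move the job only if the
    new cluster is expected to finish earlier even after re-staging files."""
    return new_cluster_ttd_s + restage_s < current_completion_s
```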
4 User Interface Extensions We have extended the standard NorduGrid user interface with some new options and added some new attributes to the NorduGrid RSL, making the new features available to users. Benchmarks and Advance Resource Reservations. In order to make use of the feature of benchmark-based execution time prediction, the user must provide relevant benchmark information as described by the following example. Assume that the user knows that the performance of the application my app is well characterized by the NAS benchmarks LU, BT and CG. For each of these benchmarks, the user needs to specify a benchmark result and an expected execution time on a system corresponding to that benchmark
result. Notably, the expected execution time must be given for each benchmark, as the benchmark results may be from different reference computers. This is specified using the new RSL attribute benchmarks. Figure 2 illustrates how the user specifies that the application requires 65 minutes on a cluster where the results for the NAS LU and BT benchmarks, class C, are 250 and 200, respectively. The estimated execution time is 50 minutes on an (apparently different) cluster where the CG benchmark result is 90.

&(executable = my_app)(stdin = my_app.in)
 (stdout = my_app.out)
 (benchmarks = (nas-lu-c 250:65)(nas-bt-c 200:65)(nas-cg-c 90:50))

Fig. 2. Sample RSL request including benchmark-based execution time predictions
Network Transfers. In the example in Figure 3, the job involves the transfer of large input and output files. The broker will determine the actual sizes of the input files when estimating the transfer time for these. The new, optional RSL attribute outputfilessizes allows the user to provide an estimate of the size of the job output. As shown in the figure, the user need not include all output files in the outputfilessizes relation. Estimated file sizes can be specified in bytes or with any of the suffixes kB, MB or GB. The NorduGrid middleware uses the inputfiles and outputfiles attributes as a replacement for file stage in and file stage out from the RSL specification.

&(executable = my_program)(arguments = params input)
 (stdout = logfile)
 (inputfiles = (params gsiftp://host1/file1)
               (input http://host2/file2))
 (outputfiles = (results gsiftp://host3/my_program.results)
                (data gsiftp://host3/my_program.data)
                (logfile gsiftp://host4/my_program.log))
 (outputfilessizes = (results 230MB) (data 5GB))

Fig. 3. Sample RSL request with specifications required to estimate file transfer times
Command Line Options. In addition to the RSL extension, the broker supports some new command line options. The option -A is used to request the broker to perform queue adaptation. The reservation feature is invoked using the option -R. The option -S is used to build a pipeline between jobs, so that output from one job is used as input to the next.
5 Concluding Remarks The resource broker presented is developed with focus on the NorduGrid software and the NorduGrid and SweGrid production environments. It includes support for making advance resource reservations and selects resources based on benchmark-based execution
time estimates and network performance predictions. The broker is a built-in component of the user’s submission software, and is hence a user-owned broker acting with no need for global control, entirely basing its decisions on the dynamic information provided by the resources.
Acknowledgments We acknowledge Åke Sandgren, High Performance Computing Center North (HPC2N), Umeå, Sweden, for technical support. We are also grateful for the constructive comments in the feedback from the refereeing process.
References
1. C. Anglano, T. Ferrari, F. Giacomini, F. Prelz, and M. Sgaravatto. WP01 report on current technology. http://server11.infn.it/workload-grid/docs/DataGrid-01-TED-0102-1 0.pdf.
2. F. Berman and R. Wolski. The AppLeS project: A status report. In N. Koike, editor, Proceedings of the 8th NEC Research Symposium, 1997.
3. J. Brooke and D. Fellows. Draft discussion document for GPA-WG – Abstraction of functions for resource brokers. http://grid.lbl.gov/GPA/GGF7 rbdraft.pdf.
4. H. Casanova and J. Dongarra. Netsolve: A network server for solving computational science problems. Int. J. Supercomput. Appl., 11(3):212–223, 1997.
5. K. Czajkowski, I. Foster, C. Kesselman, V. Sander, and S. Tuecke. SNAP: A protocol for negotiating service level agreements and coordinating resource management in distributed systems. In D.G. Feitelson et al., editors, Job Scheduling Strategies for Parallel Processing, volume 2537 of Lecture Notes in Computer Science, pages 153–183, Berlin, 2002. Springer-Verlag.
6. P. Eerola, B. Kónya, O. Smirnova, T. Ekelöf, M. Ellert, J.R. Hansen, J.L. Nielsen, A. Wäänänen, A. Konstantinov, J. Herrala, M. Tuisku, T. Myklebust, F. Ould-Saada, and B. Vinter. The NorduGrid production Grid infrastructure, status and plans. In Proc. 4th International Workshop on Grid Computing, pages 158–165. IEEE CS Press, 2003.
7. I. Foster and C. Kesselman. Globus: A metacomputing infrastructure toolkit. Int. J. Supercomput. Appl., 11(2):115–128, 1997.
8. I. Foster, C. Kesselman, C. Lee, B. Lindell, K. Nahrstedt, and A. Roy. A distributed resource management architecture that supports advance reservations and co-allocation. In 7th International Workshop on Quality of Service, pages 27–36, Washington - Brussels - Tokyo, 1999. IEEE.
9. I. Foster, J.M. Schopf, and L. Yang. Conservative scheduling: Using predicted variance to improve scheduling decisions in dynamic environments. In Proceedings of the ACM/IEEE SC2003: International Conference for High Performance Computing and Communications, 2003.
10. Globus. http://www.globus.org.
11. M. Litzkow, M. Livny, and M. Mutka. Condor - a hunter of idle workstations. In Proceedings of the 8th International Conference of Distributed Computing Systems, pages 104–111, June 1988.
12. J. MacLaren. Advance reservations state of the art. http://www.fz-juelich.de/zam/RD/coop/ggf/graap/sched-graap-2.0.html.
13. NorduGrid. http://www.nordugrid.org.
14. OpenLDAP. http://www.openldap.org.
15. OpenSSL. http://www.openssl.org.
16. K. Ranganathan and I. Foster. Simulation studies of computation and data scheduling algorithms for data Grids. Journal of Grid Computing, 1(1):53–62, 2003.
17. The Globus Resource Specification Language RSL v1.0. http://www-fp.globus.org/gram/rsl spec1.html.
18. M. Ruda, C. Anglano, S. Barale, L. Gaido, A. Guarise, S. Lusso, A. Werbrouck, S. Beco, F. Pacini, A. Terracina, A. Maraschini, S. Cavalieri, S. Monforte, F. Donno, A. Ghiselli, F. Giacomini, E. Ronchieri, D. Kouril, A. Krenek, L. Matyska, M. Mulac, J. Popisil, Z. Salvet, J. Sitera, J. Visek, M. Vocu, M. Mezzadri, F. Prelz, M. Sgaravatto, and M. Verlato. Integrating Grid tools to build a computing resource broker: activities of DataGrid WP1. In CHEP'01 computing in high energy and nuclear physics.
19. W. Smith, I. Foster, and V. Taylor. Predicting application run times using historical information. In Dror G. Feitelson and Larry Rudolph, editors, Job Scheduling Strategies for Parallel Processing, volume 1459 of Lecture Notes in Computer Science, Berlin, 1999. Springer-Verlag.
20. W. Smith, I. Foster, and V. Taylor. Using run-time predictions to estimate queue wait times and improve scheduler performance. In Dror G. Feitelson and Larry Rudolph, editors, Job Scheduling Strategies for Parallel Processing, volume 1459 of Lecture Notes in Computer Science, pages 202–219, Berlin, 1999. Springer-Verlag.
21. W. Smith, I. Foster, and V. Taylor. Scheduling with advanced reservations. In 14th International Parallel and Distributed Processing Symposium, pages 127–132, Washington - Brussels - Tokyo, 2000. IEEE.
22. Supercluster.org. Center for HPC Cluster Resource Management. http://www.supercluster.org.
23. Swegrid. http://www.swegrid.se.
24. Rich Wolski. Dynamically forecasting network performance using the network weather service. Journal of Cluster Computing, 1(1):119–132, 1998.
The Dragon Graph: A New Interconnection Network for High Speed Computing

Jywe-Fei Fang
St. John's and St. Mary's Institute of Technology, Taipei 251, Taiwan, R.O.C.
Abstract. A new interconnection network, called the Dragon graph, is proposed. The Dragon graph is a variation of the hypercube and the cube-connected cycles with constant degree four. It has several advantages: it has a smaller diameter and cost than the comparable cube-connected cycles, and it is node-symmetric and edge-symmetric. A routing algorithm and a broadcasting algorithm are proposed in this paper.
1 Introduction

The aim of studying interconnection networks and their combinatorial properties is to find good topologies for massively parallel computing. Ideally, attributes of a good topology include the diameter, cost, communication capability and symmetry [4]. The diameter is the maximum distance between any pair of nodes in a graph, so it can be used to measure the maximum communication delay. The degree can be used to measure the hardware complexity of each processor. Usually, an interconnection network with a small degree implies a potentially large diameter, so there appears to be a tradeoff between the degrees and diameters of interconnection networks. Thus, there is a commonly used metric, called cost, which is the product of diameter and degree. It is widely recognized that interprocessor communication is one of the most important issues for interconnection networks because the communication problem is the key issue in many parallel algorithms [1]. Routing and broadcasting are two primitive communication problems. The former is to send a message from a source node to a destination node, and the latter is to distribute the same message from a source node to all other nodes. They appear in applications such as matrix operations (e.g., matrix multiplication, factorization, inversion, transposition), database operations (e.g., polling, master-slave operation) and transitive closure algorithms [4]. An interconnection network is node-symmetric if and only if for any two nodes U and V, there exists an automorphism of the network that maps U to V. An interconnection network is edge-symmetric if and only if for any two edges e and g, there exists an automorphism of the network that maps e to g. Node-symmetry and edge-symmetry imply that the network can be implemented regularly. In recent years, among many interconnection networks, the hypercube has been the focus of many researchers due to its structural regularity, its potential for the parallel computation of various algorithms, and its high degree of fault tolerance [8]. However, the hypercube has a practical limitation. Each processor of the n-dimensional hypercube is connected to n other processors. Consequently, the hypercube is unfeasible for practical implementation of massively parallel computing because its large fanout is
logarithmically proportional to the number of processors. To overcome this disadvantage of the hypercube, some variations of the hypercube with constant degree, such as cube-connected cycles, butterfly networks, de Bruijn networks and shuffle exchange graphs, have been proposed [4]. The cube-connected cycles have received considerable attention [7], and researchers have devoted themselves to their various aspects. Meliksetian and Chen devised an optimal routing algorithm for the cube-connected cycles and derived the exact diameter [6]. Kuo and Fuchs proposed two reconfigurable architectures of cube-connected cycles, which are capable of tolerating classes of multiple failures [3]. Chen and Lau presented a new layout of the cube-connected cycles which uses less than half the area of the Preparata-Vuillemin layout [2,7]. On the other hand, various variations of the cube-connected cycles have been proposed. Lo and Chen proposed a generalized cube-connected cycle called k-ccc to enhance the extensibility [5]. Sebastian et al. proposed the folded cube-connected cycle, another variation of the cube-connected cycles, to improve the diameter [9]. However, both topologies are neither node-symmetric nor edge-symmetric; thus, laying them out is potentially complex. In this paper, the Dragon graph, a new variation of the cube-connected cycles with constant degree (degree four), is proposed. It has several advantages. For example, it is both node-symmetric and edge-symmetric, so it has the potential to be implemented regularly. We derive topological properties such as the diameter and cost of the Dragon graph. Compared with the corresponding cube-connected cycles, it has a smaller diameter and cost. We also propose a routing algorithm and a broadcasting algorithm for the Dragon graph. The rest of this paper is organized as follows. In Section 2, the structure of the Dragon graph is described, and we present some notation and background that will be used throughout this paper. In Section 3, we propose a routing algorithm for the Dragon graph. In Section 4, a broadcasting scheme for the Dragon graph is introduced. Some topological properties of the Dragon graph are discussed in Section 5. Finally, in Section 6, we give some concluding remarks.
2 Notation and Background

Let H(n) denote the n-dimensional hypercube comprising 2^n nodes. H(n) is constructed by numbering the nodes from 0 to 2^n − 1 and linking each pair of nodes whose addresses differ in exactly one bit position. A link is assigned a link number i if it connects two nodes whose addresses differ at exactly the ith bit position (starting with the least significant bit as bit 0). The n-dimensional cube-connected cycles, denoted by CCC(n), is derived from the n-dimensional binary hypercube by replacing each node of the hypercube with a cycle of n nodes. Each node of CCC(n) is labelled by a pair (w, i), where w is the label of the hypercube node that corresponds to the cycle and i is the position of the node within the cycle. Two nodes (w, i) and (w', i') of CCC(n) are connected by an edge if and only if either (1) w = w' and i' ≡ (i + 1) mod n, or (2) i = i' and w' differs from w in exactly the ith bit. An edge of the first type (second type) is called a cycle (hypercube) edge. It is easy to see that CCC(n) is not edge-symmetric because cycle edges and hypercube edges play different roles in the network. In Figure 1, the structure of CCC(3) is displayed.
Fig. 1. The structure of CCC(3)

Fig. 2. The structure of DG(3), where the dashed lines and dashed nodes show the hypercube from which DG(3) is derived
The n-dimensional Dragon graph, denoted by DG(n), is derived from CCC(n) by merging each pair of nodes that are connected to each other by a hypercube edge in CCC(n). Each node U of DG(n) is labelled by a string of n symbols u_{n-1} u_{n-2} ... u_i ... u_1 u_0 over the set {0, 1, f}, where f labels the hypercube edge in the corresponding CCC(n). For example, the pair of nodes (000,3) and (001,3) of CCC(3) is merged into the node 00f of DG(3). It is easy to see that f occurs exactly once in each node address of DG(n). In Figure 2, the structure of DG(3) is shown.
Definition 1. Let U = u_{n-1} u_{n-2} ... u_i ... u_1 u_0, where u_i is f, be a node in DG(n). The flag bit of U, denoted by flag(U), is defined as i. Two nodes of DG(n), U with flag bit i and V with flag bit j, are connected by an edge if and only if both (1) j ≡ i + 1 mod n or j ≡ i − 1 mod n, and (2) every bit of U and V except the ith bit and the jth bit is common. For example, 0f0 is connected to f00, f10, 00f and 01f, and f00 is connected to 00f, 10f, 0f0 and 1f0. In fact, each node of the Dragon graph is connected to four neighbors: its (r,1), (r,0), (l,0) and (l,1) neighbors.
Definition 2. The (r,1) ((r,0)) neighbor of a node U = u_{n-1} u_{n-2} ... u_i ... u_1 u_0 with flag bit i can be derived by shifting the flag bit to bit (i − 1) mod n (that is, u_{(i−1) mod n} is replaced by the symbol f) and replacing u_i by the symbol 1 (0).
Definition 3. The (l,1) ((l,0)) neighbor of a node U = u_{n-1} u_{n-2} ... u_i ... u_1 u_0 with flag bit i can be derived by shifting the flag bit to bit (i + 1) mod n (that is, u_{(i+1) mod n} is replaced by the symbol f) and replacing u_i by the symbol 1 (0).
For example, the (r,1), (r,0), (l,1) and (l,0) neighbors of 0100f are f1001, f1000, 010f1 and 010f0, respectively.
Definition 4. The binary sequence of a node U = u_{n-1} u_{n-2} ... u_1 u_0 with flag bit i, denoted by Binseq(U), is defined as u_{n-1} u_{n-2} ... u_{i+1} u_{i-1} ... u_1 u_0. For example, Binseq(10f1) is 101. Nodes with the same binary sequence are said to be in a common binary sequence class. For example, f010, 0f10, 01f0 and 010f are in the common binary sequence class 010.
Definition 5. Within a common binary sequence class, comparing nodes by their flag bit positions, a node with a higher (lower) flag bit position is called a senior (junior). For example, f010 and 0f10 are seniors of 01f0, and 010f is a junior of 01f0.
Definition 6. Specifically, U = u_{n-1} u_{n-2} ... u_1 u_0 with flag bit i is called the ith element of the common binary sequence class u_{n-1} u_{n-2} ... u_{i+1} u_{i-1} ... u_1 u_0. By the structure of the Dragon graph and the above definitions, we have the following proposition.
Proposition 1. In a common binary sequence class, the ith element is connected to the (i − 1)th element. For example, in the common binary sequence class 1101 of DG(5), each pair of consecutive nodes among f1101, 1f101, 11f01, 110f1, 1101f is connected.
Definition 7. Let U = u_{n-1} u_{n-2} ... u_1 u_0 and V = v_{n-1} v_{n-2} ... v_1 v_0 be two nodes in DG(n). The ith bit is referred to as an active bit if and only if neither u_i nor v_i is the flag bit and u_i ≠ v_i.
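The label manipulation in Definitions 2-3 is easy to express in code. The following Python helpers are a sketch written for this text, not code from the paper; labels are strings such as "0100f", with string index 0 holding bit n−1.

```python
def flag(u: str) -> int:
    """Bit position of the flag symbol 'f'; labels list bits u_{n-1} ... u_0,
    so string index 0 holds bit n-1."""
    return len(u) - 1 - u.index("f")

def neighbor(u: str, side: str, b: str) -> str:
    """The (side, b) neighbor of u, with side in {'r', 'l'} and b in {'0', '1'}
    (Definitions 2-3): the flag moves one position right or left, cyclically,
    and the position it leaves receives the symbol b."""
    n, i = len(u), flag(u)
    j = (i - 1) % n if side == "r" else (i + 1) % n
    bits = list(u)
    bits[n - 1 - i] = b       # old flag position gets b
    bits[n - 1 - j] = "f"     # new flag position
    return "".join(bits)

# The example from the text: the (r,1), (r,0), (l,1) and (l,0) neighbors of 0100f.
print([neighbor("0100f", s, b)
       for s, b in (("r", "1"), ("r", "0"), ("l", "1"), ("l", "0"))])
# ['f1001', 'f1000', '010f1', '010f0']
```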
3 Routing Algorithm

Given a source node and a destination node of an interconnection network, the routing problem is to find a path along which a message can be transmitted from the source node to the destination node. To present the idea of this algorithm, suppose that the source node and the destination node are f0000 and 1101f. First, if there exists any active bit, it is required to "eliminate" these active bits, that is, to find a path from the source node to an intermediate node with no active bit between this node and the destination node. For example, shifting the flag bit rightward and "eliminating" the active bits, a path can be derived as f0000, 1f000, 11f00, 110f0. Likewise, shifting the flag bit leftward and "eliminating" the active bits, a path can be derived as f0000, 1000f, 100f0, 10f10, 1f010. There is a choice between rightward and leftward; the shorter path is selected. Second, it is required to shift the flag bit to the "right" position, that is, to find a path from the intermediate node to the destination node. For example, from the intermediate node 110f0, shifting the flag bit rightward, a path can be derived as 110f0 and 1101f. Likewise, shifting the flag
bit leftward, a path can be derived as 110f0, 11f10, 1f010, f1010 and 1101f. There is also a choice between rightward and leftward, and again the shorter path is selected. Clearly, this routing algorithm can route a message from the source node to the destination node. We state the routing algorithm formally as follows. Suppose that the routing message carries the address of the destination node D = d_{n-1} d_{n-2} ... d_1 d_0.

Algorithm 1. Routing algorithm of DG(n)
Step 1. When a node V = v_{n-1} v_{n-2} ... v_1 v_0 with flag bit i receives the message carrying the address of the destination D = d_{n-1} d_{n-2} ... d_1 d_0 with flag bit j, it starts at bit i, finds the nearest active bits between V and D rightward and leftward, respectively, and acts according to the following conditions:
1) If there exists an active bit and i ≠ j: if the rightward one is nearer than the leftward one, route the message to the (l, d_i) neighbor; otherwise, route the message to the (r, d_i) neighbor.
2) If there exists an active bit and i = j: if the rightward one is nearer than the leftward one, route the message to the (l, 0) neighbor; otherwise, route the message to the (r, 0) neighbor.
3) If there exists no active bit and i ≠ j: if (j > i and j − i < ⌊n/2⌋) or (j < i and i − j ≥ ⌊n/2⌋), route the message to the (l, d_i) neighbor; otherwise, route the message to the (r, d_i) neighbor.
4) If there exists no active bit and i = j, then V = D and the route terminates.
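The two-phase idea (eliminate the active bits, then align the flag) can be prototyped directly. The sketch below reuses the flag and neighbor helpers from the earlier snippet; instead of encoding the per-hop distance test of Algorithm 1, it simply tries both directions in each phase and keeps the shorter path, which reproduces the example paths above.

```python
def no_active_bits(v: str, dst: str) -> bool:
    """True if no bit position is 'active' between v and dst (Definition 7)."""
    n, i, j = len(v), flag(v), flag(dst)
    return all(p in (i, j) or v[n - 1 - p] == dst[n - 1 - p] for p in range(n))

def walk(v: str, dst: str, direction: str, stop):
    """Move the flag of v step by step in `direction` ('r' or 'l'), writing the
    destination's bit (or 0 where the destination keeps its flag) at every
    position the flag leaves, until stop(current node) holds. Returns the path."""
    n = len(v)
    path = [v]
    while not stop(path[-1]):
        u = path[-1]
        b = dst[n - 1 - flag(u)]
        path.append(neighbor(u, direction, "0" if b == "f" else b))
    return path

def route(src: str, dst: str):
    """Two-phase routing: eliminate active bits, then align the flag;
    in each phase the shorter of the two directions is kept."""
    phase1 = min((walk(src, dst, d, lambda u: no_active_bits(u, dst))
                  for d in ("r", "l")), key=len)
    phase2 = min((walk(phase1[-1], dst, d, lambda u: flag(u) == flag(dst))
                  for d in ("r", "l")), key=len)
    return phase1 + phase2[1:]

print(route("f0000", "1101f"))
# ['f0000', '1f000', '11f00', '110f0', '1101f'] -- the rightward example path
```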
4 Broadcasting Algorithm

Broadcasting on an interconnection network can be realized with a broadcasting tree, that is, a spanning tree rooted at the source node. In this section, we introduce a broadcasting algorithm for Dragon graphs by building a broadcasting tree. Since DG(n) is node-symmetric, the same broadcasting algorithm can be applied to each node of DG(n). Without loss of generality, suppose that the source node is f00...0. The algorithm comprises two phases.
Phase 1, building a primitive tree. Initialization: the broadcasting tree has the single node f00...0 as its root. While the flag bit of a leaf node is not 0, append its (r,0) neighbor and (r,1) neighbor as its left child and right child, respectively. In Figure 3, the primitive tree of DG(4) is shown. Building the primitive tree is very simple; however, the tree does not include every node of the Dragon graph. In Phase 2, the remaining nodes are appended. Observe that each node and its left child have the same binary sequence; thus, the juniors of a node are its descendants in the primitive tree. Recall that, according to Proposition 1, in a common binary sequence class the ith element is connected to the (i − 1)th element.
Phase 2, the appending phase. If a node is a right child, append its seniors one by one, as a linear chain, to the primitive tree. In Figure 4, the broadcasting tree of DG(4) is shown.
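Building on the flag and neighbor helpers above, the two phases can be sketched as follows. This is an illustrative prototype, not the paper's construction verbatim; in particular, the helper adjacent_senior encodes one way of stepping to the next class member that is consistent with Proposition 1.

```python
from collections import deque

def adjacent_senior(u: str) -> str:
    """Class member with the next higher flag position (Proposition 1):
    the (l, b) neighbor, where b is u's bit just above the flag."""
    n, i = len(u), flag(u)
    b = u[n - 1 - (i + 1)]           # only called while the flag is below bit n-1
    return neighbor(u, "l", b)

def broadcast_tree(n: int) -> dict:
    """Two-phase broadcasting tree of Section 4, rooted at f00...0.
    Returns a parent map {node: parent}; the root maps to None."""
    root = "f" + "0" * (n - 1)
    parent = {root: None}
    right_children = []
    queue = deque([root])
    while queue:                     # Phase 1: the primitive tree
        u = queue.popleft()
        if flag(u) == 0:
            continue                 # leaves of the primitive tree
        left, right = neighbor(u, "r", "0"), neighbor(u, "r", "1")
        parent[left], parent[right] = u, u
        right_children.append(right)
        queue.extend((left, right))
    for u in right_children:         # Phase 2: append chains of seniors
        prev = u
        for _ in range(flag(u) + 1, n):
            senior = adjacent_senior(prev)
            parent[senior] = prev
            prev = senior
    return parent

print(len(broadcast_tree(4)) == 4 * 2 ** 3)   # True: covers all 32 nodes of DG(4)
```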
Fig. 3. The primitive tree of DG(4)

Fig. 4. The broadcasting tree of DG(4)

Table 1. Symmetric properties of some constant degree hypercubic networks

Networks            Node symmetry   Edge symmetry
Shuffle exchange    None            None
Debruijn network    None            None
Butterfly network   None            None
CCC                 Yes             None
Dragon graph        Yes             Yes
5 Properties of Dragon Graphs

Since CCC(n) is node-symmetric, clearly DG(n) is node-symmetric, too. Furthermore, because only the cycle edges of CCC(n) are retained in DG(n), DG(n) is also edge-symmetric. In Table 1, we show the symmetric properties of some constant degree hypercubic networks [2]. Clearly, the Dragon graph is superior to the other interconnection networks with respect to symmetry.
The distance between two nodes of an interconnection network is defined as the length of their shortest path, and the diameter of an interconnection network is defined as the maximum distance over all pairs of nodes. By using Algorithm 1, when n is even each node of DG(n) can route a message to any other node within (n − 1) + (n − 2)/2 hops (n − 1 hops for "eliminating" the active bits and (n − 2)/2 hops for shifting the flag bit to the "right" position). Likewise, when n is odd each node can route a message to any other node within (n − 1) + (n − 3)/2 hops (n − 1 hops for "eliminating" the active bits and (n − 3)/2 hops for shifting the flag bit to the "right" position). In Table 2, we summarize the degrees, diameters and costs of DG(n) and CCC(n). The diameters and costs of DG(n) are smaller than those of the comparable CCC(n).

Table 2. The degrees, diameters and costs of DG(n) and CCC(n)

      DG(n)                      CCC(n)
 n   degree  diameter  cost   degree  diameter  cost
 3     4        3       12      3        7       21
 4     4        4       16      3        9       27
 5     4        5       20      3       12       36
 6     4        7       28      3       14       42
 7     4        8       32      3       17       51
 8     4       10       40      3       19       57
 9     4       11       44      3       22       66
10     4       13       52      3       24       72
11     4       14       56      3       27       81
12     4       16       64      3       29       87
13     4       17       68      3       32       96
14     4       19       76      3       34      102
15     4       20       80      3       37      111
16     4       22       88      3       39      117
17     4       23       92      3       42      126
18     4       25      100      3       44      132
19     4       26      104      3       47      141
20     4       28      112      3       49      147
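For reference, the routing bound discussed above can be written compactly; this is merely a restatement of the two cases in the text, not an additional result.

```latex
\[
  D\bigl(DG(n)\bigr) \;\le\;
  \begin{cases}
    (n-1) + \dfrac{n-2}{2}, & n \text{ even},\\[6pt]
    (n-1) + \dfrac{n-3}{2}, & n \text{ odd},
  \end{cases}
  \qquad
  \operatorname{cost}\bigl(DG(n)\bigr) = 4 \cdot D\bigl(DG(n)\bigr).
\]
```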
6 Conclusions

In this paper, the Dragon graph, a new variation of the hypercube and the cube-connected cycles with constant degree (four), has been proposed. The Dragon graph has several advantages: it has a smaller diameter and cost than the comparable cube-connected cycles, and it is node-symmetric and edge-symmetric. We also proposed a routing algorithm that is simple enough to be applied in practical implementations, and, by building a broadcasting tree, we presented a broadcasting algorithm that likewise lends itself to practical implementation.
Acknowledgements The author would like to thank the reviewers for their valuable comments and suggestions. This work was supported in part by the National Science Council of the Republic of China under contract number NSC92-2213-E-129-010.
References
1. S. G. Akl, The Design and Analysis of Parallel Algorithms, Prentice-Hall, 1989.
2. G. Chen and F. C. M. Lau, "Tighter layouts of the cube-connected cycles," IEEE Trans. Parallel and Distributed Systems, vol. 11, pp. 182-191, 2000.
3. S. Y. Kuo and W. K. Fuchs, "Reconfigurable cube-connected cycles architectures," Journal of Parallel and Distributed Computing, vol. 9, pp. 1-10, 1990.
4. F. T. Leighton, Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes, California, Morgan Kaufmann, 1992.
5. H. Y. Lo and J. D. Chen, "A routing algorithm and generalization for cube-connected cycles," IEICE Trans. Information Systems, vol. E80-D, pp. 829-836, 1997.
6. D. S. Meliksetian and C. Y. R. Chen, "Optimal routing algorithm and the diameter of the cube-connected cycles," IEEE Trans. Parallel and Distributed Systems, vol. 4, pp. 1172-1178, 1993.
7. F. P. Preparata and J. Vuillemin, "The cube-connected cycles: A versatile network for parallel computations," Commun. ACM, vol. 25, pp. 300-309, 1981.
8. Y. Saad and M. H. Schultz, "Topological properties of hypercubes," IEEE Trans. Computers, vol. 37, pp. 867-872, 1988.
9. M. P. Sebastian, P. S. N. Rao and L. Jenkins, "Properties and performance of folded cube-connected cycles," Journal of Systems Architecture, vol. 44, pp. 359-374, 1998.
Speeding up Parallel Graph Coloring

Assefaw H. Gebremedhin1, Fredrik Manne2, and Tom Woods2

1 Computer Science Dept., Old Dominion University, Norfolk, VA 23529-0162, USA
[email protected]
2 University of Bergen, N-5020 Bergen, Norway
{Fredrik.Manne,tomw}@ii.uib.no
Abstract. This paper presents new efficient parallel algorithms for finding approximate solutions to graph coloring problems. We consider an existing shared memory parallel graph coloring algorithm and suggest several enhancements both in terms of ordering the vertices so as to minimize cache misses, and performing vertex-to-processor assignments based on graph partitioning instead of random allocation. We report experimental results that demonstrate the performance of our algorithms on an IBM Regatta supercomputer when up to 12 processors are used. Our implementations use OpenMP for parallelization and Metis for graph partitioning. The experiments show that we get up to a 70 % reduction in runtime compared to the previous algorithm.
1 Introduction

A graph coloring asks for an assignment of as few colors (or positive integers) as possible to the vertices V of a graph G = (V, E) so that no two adjacent vertices receive the same color. This is often a crucial stage in the development of efficient parallel algorithms for many scientific and engineering applications; see [8] and the references therein for examples. As a graph coloring is often needed to perform some later task concurrently, it is natural to consider performing the coloring itself in parallel. In the current paper, we consider how to parallelize fast, greedy coloring algorithms where vertices are colored one at a time and each vertex is assigned the smallest possible color. Several papers have appeared in the literature over the last couple of years dealing with this issue [5,7,8,11]. We consider the algorithm in [8] and show how its performance can be enhanced both by paying attention to its sequential memory access pattern, and by employing graph partitioning to assign vertices to processors. The rest of the paper is organized as follows: In Section 2 we present relevant background information on parallel graph coloring, in Section 3 we motivate our methods, in Section 4 we report on experimental results, and in Section 5 we conclude.
2 Parallel Graph Coloring The problem of finding a legal coloring using the minimum number of colors is known to be NP-hard for general graphs [6]. Moreover, it is a difficult problem to approximate [3].
Supported by the U.S. National Science Foundation grant ACI 0203722
However, in practice greedy sequential coloring heuristics have been found to be quite effective [2]. Previous efforts in designing fast efficient parallel graph coloring algorithms have been centered around various ways of computing an independent set in parallel. This research direction was initiated by Luby’s algorithm for computing a maximal independent set in parallel [10,15]. The first algorithms proposed along this line enforce that one makes exactly one choice for the color of any vertex (i.e. no backtracking). To achieve this, while coloring a vertex v, the colors of already colored neighbors of v must be known, and none of the uncolored neighbors of v can be colored at the same time as v. The work of Jones and Plassmann [12], Gjertsen et al. [9], and Allwright et al. [1] follow this approach. All of these algorithms are based on the distributed memory model where the vertices of a graph are partitioned into the same number of components as there are processors. Each component including information about its inter- and intra-processor edges is assigned to and colored by one processor. To overcome the restriction that two adjacent vertices on different processors cannot be colored at the same time, Johansson [11] proposed a parallel algorithm where each processor is assigned exactly one vertex. The vertices are then colored simultaneously by randomly choosing a color from the interval [1, ∆ + 1], where ∆ is the maximum vertex degree in the graph. This may lead to an inconsistent coloring, and hence one may have to repeat the process recursively for the vertices that did not obtain legal colors. This algorithm was further investigated by Finocchi et al. [5] who performed extensive sequential simulations both of this algorithm and a variant where the upper-bound on the range of legal colors is initially set to be smaller than ∆ + 1 and then increases only when needed. Gebremedhin and Manne [8] developed a parallel graph coloring algorithm suitable for shared memory computers. Here it is assumed that the number of processors p is substantially less than the number of vertices n and each processor is assigned n/p vertices. Each processor colors its vertices sequentially, at each step assigning a vertex the smallest color not used by any of its neighbors (both on and off-processor). An inconsistent coloring arises only when a pair of adjacent vertices that reside on different processors are colored simultaneously. Inconsistencies are then detected in a subsequent phase and resolved in a final sequential phase. Figure 1 outlines this scheme as presented in [8]. This algorithm was later extended to various coloring problems in numerical optimization including distance-2 (d2) coloring, a case in which a vertex v is required to receive a color distinct from the colors of vertices within distance 2 edges from v [7]. This shared memory approach is the only algorithm that we know of that has been shown through parallel implementations to actually give lower running time as more processors are applied. In the current work we look at various ways in which this algorithm can be enhanced to run even faster. In [8] it is shown that the expected number of conflicts in most cases is small and the resulting running time is bounded by O(∆n/p). Although Phase 0 states that the vertices be randomly distributed among the processors, practical experiments showed that the best running time was obtained by keeping the original ordering of the vertices of the graph.
GM-Algorithm(G = (V, E))
Phase 0: Partition
  Randomly partition V into p equal blocks V_1, ..., V_p. Processor P_i is responsible for coloring the vertices in block V_i.
Phase 1: Pseudo-color
  for i = 1 to p do in parallel
    for each v ∈ V_i do
      assign a legal color to v, paying attention to already colored vertices (both on- and off-processor).
Phase 2: Detect conflicts
  for i = 1 to p do in parallel
    for each v ∈ V_i do
      if ∃(v, w) ∈ E s.t. color(v) = color(w) and v ≤ w then L_i ← L_i ∪ {v}
Phase 3: Resolve conflicts
  Color the vertices in the conflict list L = ∪ L_i sequentially.

Fig. 1. The GM-algorithm as presented in [8]
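To make the control flow concrete, here is a small Python sketch of the three coloring phases. It is only an illustration of the logic: the blocks are processed one after another here, so Phase 1 cannot actually produce conflicts, whereas in the real OpenMP implementation the blocks are colored concurrently and Phases 2-3 repair the inconsistencies that arise on cross-edges.

```python
def smallest_free_color(v, adj, color):
    """Smallest positive color not used by an already colored neighbor of v."""
    used = {color[w] for w in adj[v] if color[w] != 0}
    c = 1
    while c in used:
        c += 1
    return c

def gm_color(adj, blocks):
    """Sketch of the GM-algorithm in Fig. 1 (color 0 means 'uncolored')."""
    color = [0] * len(adj)
    for block in blocks:                       # Phase 1: pseudo-color
        for v in block:                        # (one block per processor)
            color[v] = smallest_free_color(v, adj, color)
    conflicts = [v for block in blocks for v in block          # Phase 2
                 if any(w > v and color[w] == color[v] for w in adj[v])]
    for v in conflicts:                        # Phase 3: sequential repair
        color[v] = smallest_free_color(v, adj, color)
    return color

# Tiny example: a triangle plus a pendant vertex, split over two "processors".
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
print(gm_color(adj, blocks=[[0, 1], [2, 3]]))   # e.g. [1, 2, 3, 1]
```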
3 Speeding up the Algorithm In the following we describe the various enhancements we have considered in order to speed up the GM-algorithm. Sections 3.1 and 3.2 deal with issues related to the partitioning of the graph onto the processors and the subsequent internal ordering, while Section 3.3 deals with algorithmic issues. In a parallel application the graph is usually already distributed among the processors in a reasonable way. This also includes distinguishing between interior and boundary vertices since the communication volume typically grows as a function of the number of boundary vertices while the amount of computation grows as a function of the total number of vertices assigned to a processor. This is true both for shared and distributed memory parallel computers. Thus it is not unreasonable to make the initial assumption that the graph is already partitioned in a reasonable way, with the vertices on each processor classified as either interior or boundary, and with an internal ordering of the vertices such that traversing the graph should be efficient. 3.1 Sequential Optimization It is a well known fact that sequential optimization can in many cases yield speedups comparable to those obtained by parallelization. The main objective here is to make the code more cache friendly by reusing data elements already in the cache. The GM-algorithm uses a compressed adjacency list representation of a graph. Each vertex has a pointer into this list that shows where its neighbors are located. The actual color values are stored in a separate vertex-indexed array. When accessing the list of neighbors of a vertex, the information will be in consecutive memory locations. However, accessing the colors of neighboring vertices might lead to sporadic memory access patterns with ensuing cache misses. This is particularly true if the neighbors are scattered rather than being clustered.
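The access pattern just described is easy to see in code. The following sketch (mine, not the authors' Fortran 90/OpenMP code) colors a graph stored in the usual compressed (CSR) form: the neighbor list adjncy[xadj[v]:xadj[v+1]] is read contiguously, while the color[w] lookups jump around the color array, which is what a locality-improving vertex ordering is meant to tame.

```python
def color_csr(xadj, adjncy, order):
    """Sequential greedy coloring over a compressed adjacency list."""
    n = len(xadj) - 1
    color = [0] * n                               # 0 = uncolored
    for v in order:
        # contiguous read of v's neighbor indices ...
        neighbors = adjncy[xadj[v]:xadj[v + 1]]
        # ... but scattered reads of their colors
        used = {color[w] for w in neighbors if color[w]}
        c = 1
        while c in used:
            c += 1
        color[v] = c
    return color

# A 4-cycle 0-1-2-3-0 in CSR form, colored in the natural order.
xadj, adjncy = [0, 2, 4, 6, 8], [1, 3, 0, 2, 1, 3, 0, 2]
print(color_csr(xadj, adjncy, order=range(4)))    # [1, 2, 1, 2]
```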
In light of this, if the vertex set is randomly permuted as was suggested in Phase 0 of the GM-algorithm, one would expect the neighbors of a vertex to be randomly distributed throughout the set and thus cause a large number of cache misses. This could be one explanation why this scheme performed worse than keeping the natural order of the vertices. To reduce the number of cache misses, one would order the vertices in such a way that the neighbors of each vertex span as few cache lines as possible. One way this can be solved approximately is by employing a bandwidth or envelope size reducing ordering on the graph such as the reverse Cuthill-McKee ordering [4]. For an adjacency matrix representation this would have the effect of clustering the non-zero elements close to the diagonal.

3.2 Graph Partitioning

There exist several graph partitioning packages that can be used for partitioning the vertices of a graph into a predefined number of components in such a way that the number of vertices in each component is nearly the same and the number of cross-edges (edges with endpoints in different components) is as small as possible. In a parallel application this can be used to partition the graph into the same number of components as there are processors and then assign one component to each processor. Graph partitioning is also exploited in the parallel coloring algorithms presented in [12] and [9] where one takes advantage of the fact that boundary vertices (vertices incident on inter-processor edges) are the only vertices that call for communication, and interior vertices (vertices incident only on intra-processor edges) can be colored concurrently. The use of graph-partitioning software can help lower the run-time of the GM-algorithm in three different ways.
i) Cross-processor memory accesses. In Phases 1 and 2 of the GM-algorithm, part of the run-time associated with the coloring of a vertex v can be divided into two separate parts: the time required to obtain the color of an adjacent on-processor vertex, and the time required to obtain the color of an adjacent off-processor vertex. For most architectures, the latter will be larger since it most likely involves accessing memory that is associated with another processor. For this reason, one would like to maximize the number of neighbors of v that are allocated to the same processor as v itself. Thus an assignment of the vertices to processors that keeps the number of cross-edges low will also reduce the time spent on off-processor memory accesses.
ii) Number of conflicts. In the context of the GM-algorithm, minimizing the number of cross-edges also reduces the number of conflicts since cross-edges are the only edges that can give rise to conflicts. This is because vertices in the same component will be colored (sequentially) by the same processor. Reducing the number of conflicts will subsequently reduce the time needed by Phase 3 of the algorithm. In fact, the bound of O(δ̄p) on the number of expected conflicts shown in [8] can easily be improved to O(δ̄pm′/m), where δ̄ is the average vertex degree, m′ is the number of cross-edges, and m is the total number of edges in the graph.
iii) Vertices that need to be checked for conflicts. As we have already seen, any conflict that arises in Phase 1 of the GM-algorithm will involve at least one boundary
vertex. Thus, by distinguishing between interior and boundary vertices, it is possible to avoid checking the interior vertices for conflicts in Phase 2. This should save up to half the processing time spent on the interior vertices. Although reducing the number of boundary vertices is not the main objective of graph-partitioning software, we expect this number to correlate with the number of cross-edges. Finally, we note that in the case where a d2-coloring is desired, any conflict will still include at least one boundary vertex. This is because the d2-neighbors of an interior vertex v will either be in the same component as v or be boundary vertices of some adjacent component. Thus even in this case it is sufficient to check the boundary vertices to detect all possible conflicts. For distance-k coloring with k > 2 this is no longer true, as the distance relationship then stretches further into each component.

3.3 Conflict Reduction

In [7] it was shown that for dense graphs the number of conflicts could become sufficiently high that the final sequential conflict resolution step starts to dominate the overall runtime. Also, for dense graphs the effect of graph partitioning is less pronounced. One solution for reducing the number of conflicts suggested in [11] is to assign a vertex a random color from the interval [1, ∆ + 1]. However, with this scheme one is very likely to end up using close to ∆ + 1 colors while the actual number of colors needed is much smaller. To reduce the number of colors used, one would like each processor to choose a small legal color when coloring a vertex. If each processor always chooses the lowest possible color, the number of conflicts is likely to increase. This is particularly true during the early steps of the coloring process where there are relatively few forbidden colors. Thus one would like different processors to choose different low-numbered colors. In the following paragraphs we describe two solutions that have been suggested to this problem. Both are based on randomization. Gebremedhin et al. [7] suggested that one use a geometrically distributed random variable to determine which of the available legal colors to use. This was implemented by processing the available legal colors in increasing order, each time choosing a color with probability q. The process terminates as soon as a successful color choice is made. The disadvantages of this method are that it requires some fine-tuning of the parameter q and that, for small values of q, one might need to generate a large number of random numbers for each vertex. We note that the latter shortcoming can be alleviated by an inverse transformation method to simulate the required random variable and hence require only one random number generation [16]. Finocchi et al. [5] recently suggested the "Trivially Hungarian" (TH) method, where the color of a vertex v is initially restricted to be chosen from the range [1, r], where r = deg(v)/s, deg(v) is the degree of v, and s is an a priori determined shrink factor of the algorithm. Only when there are no legal colors in the given range can the bound r be increased to 1 + min{c, deg(v)}, where c is the largest color used in the neighborhood of v. Again, this algorithm requires that a suitable value of s be found. The TH-method has not previously been implemented in an actual working parallel code.
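The two randomized choices can be sketched as follows. This is an illustrative reading of the two schemes, not code from [7] or [5]: the handling of the case where no color in the allowed range is free, and the uniform choice within the TH range, are assumptions made for the example.

```python
import random

def geometric_color(allowed_colors, q=0.3):
    """Sketch of the scheme of Gebremedhin et al. [7]: scan the legal colors
    in increasing order and accept each with probability q, giving a
    (truncated) geometric distribution; falling back to the last legal color
    when nothing is accepted is a simplification made here."""
    for c in allowed_colors:
        if random.random() < q:
            return c
    return allowed_colors[-1]

def th_color(forbidden, degree, s=4):
    """Sketch of the 'Trivially Hungarian' scheme of Finocchi et al. [5]:
    pick a free color in the shrunken range [1, deg(v)/s]; only if that range
    is exhausted, widen it to 1 + min(c, deg(v)), with c the largest color
    seen in the neighborhood."""
    r = max(1, degree // s)
    free = [c for c in range(1, r + 1) if c not in forbidden]
    if not free:
        r = 1 + min(max(forbidden, default=0), degree)
        free = [c for c in range(1, r + 1) if c not in forbidden]
    return random.choice(free) if free else r + 1
```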
Table 1. Structural properties of graphs (left) and run-times for d1 and d2 coloring using various vertex orderings (right)

Name    |V|         |E|          density      ∆    δ̄      Natural    Random      RCM
mrng4   7,533,224   14,991,280   5.3 × 10^-7   4    3.98   3.6/17.2   10.5/73.9   2.5/10.4
auto    448,695     3,314,611    3.3 × 10^-5   37   14.77  0.5/8.0    0.8/12.1    0.4/5.0
4 Experiments

In the following we report on experiments performed on an IBM p690 Regatta supercomputer using up to 12 Power4 1.3 GHz processors. Our objective is to show how the various strategies described in Section 3 can be used to speed up the GM-algorithm. The algorithms have been implemented in Fortran 90 and parallelized using OpenMP. Metis [14] was used for graph partitioning.

4.1 Vertex Ordering

Here we present experimental results that demonstrate the effect of vertex ordering on the practical run-time of the coloring algorithms. Due to space limitations, we only provide results for two representative graphs from our testbed, both of which are from finite element methods [13]. The left part of Table 1 lists relevant structural properties of these graphs. The column labeled density shows the quantity 2|E| / (|V| × (|V| − 1)). The column marked ∆ gives the maximum vertex degree, while column δ̄ gives the average vertex degree in the graph. In the right part of Table 1, the columns labeled Natural, Random, and RCM show the time (in seconds) used by a sequential greedy coloring algorithm when the vertices are visited in their natural order, in a random order, and in the reverse Cuthill-McKee order, respectively. For each ordering, run-times for a d1-coloring and a d2-coloring are provided. Table 1 shows that a random vertex ordering destroys any available locality and hence increases the running time by a factor of nearly three for d1-coloring and by a factor of four for d2-coloring. The RCM ordering reduces the running time by 31% and 20% for the d1-coloring and by 40% and 37% for the d2-coloring.

4.2 Vertex to Processor Mapping

Figure 2 shows the parallel running times of our algorithms for d1-coloring as the number of processors is increased from 2 to 12. A substantial speedup was observed in going from one to two processors even if the parallel algorithm performs nearly twice as many operations as the sequential algorithm. (We believe this is due to access to more cache when using more than one processor.) We do not show this as it would disrupt the figures. The line labeled Original refers to the basic GM-algorithm as described in Section 2 run on the graph with the vertices in their natural order and partitioned into contiguous blocks of equal size (block-partition). The line labeled RCM corresponds to an RCM ordering of the vertices prior to a block-partition. The line RCM+Metis shows the case where the vertices are ordered using RCM and the graph partitioned using Metis. Finally,
RCM+Metis+Local shows the case when both RCM and Metis are used and the interior vertices are not checked for inconsistencies. It should be noted that Metis only specifies to which partition a particular vertex should belong; it does not alter the relative order among the vertices within a particular partition. In Figure 2, for a fixed number of processors, the decrease in running time relative to the basic algorithm ranges from 63% to 75% with a mean of 68%. Although not reported, we have also performed experiments using only Metis, which gave similar results to the ones obtained using only RCM. In all cases, the number of conflicts was observed to be consistently low and did not influence the overall running time. As expected, we also observed that for a fixed number of processors, the number of conflicts decreased with the number of boundary vertices (and cross-edges). Moreover, the number of colors used remains close to that used by the sequential greedy algorithm.
Fig. 2. d1-coloring of the graphs mrng4 (left) and auto (right): time (in seconds) versus number of processors for the Original, RCM, RCM+Metis, and RCM+Metis+Local variants
To help further explain the decrease in running time we also present Figure 3, which shows the percentage of boundary vertices when using the original, RCM, and Metis orderings. Applying Metis to the RCM-ordered graph has little effect, as Metis is fairly insensitive to the original ordering. Note that the RCM ordering improves locality, but not as much as Metis does. Since applying only RCM or only Metis gives approximately the same running times, the gain obtained by the Metis ordering, where there are fewer cross-processor memory accesses and a smaller set of boundary vertices, seems to be compensated for by the more efficient memory access pattern given by the RCM ordering. As is evident from Figure 3, the percentage of boundary vertices increases as the number of processors increases. But at the same time the number of interior vertices handled by each processor decreases. Thus one might suspect that the obtained speedup is mainly due to each processor spending less and less time on the interior vertices. This would be similar to the algorithm by Gjertsen et al. [9]. To see that this is not the case, note that already with four processors more than 80% of the vertices of both graphs are on the boundary, yet the speedup still continues.
Fig. 3. Percentage of boundary vertices for graphs auto (left) and mrng4 (right)
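The quantity plotted in Fig. 3 can be computed directly from the partition: a vertex is a boundary vertex if at least one of its neighbors is assigned to a different processor. A minimal sketch (the partition array and the function name are illustrative assumptions):

    #include <vector>
    #include <cstddef>

    // Returns the percentage of vertices that have at least one neighbor
    // owned by a different processor, given a vertex-to-processor map `part`.
    double boundary_vertex_percentage(const std::vector<std::size_t>& adj_ptr,
                                      const std::vector<int>& adj,
                                      const std::vector<int>& part) {
        const std::size_t n = part.size();
        std::size_t boundary = 0;
        for (std::size_t v = 0; v < n; ++v) {
            for (std::size_t e = adj_ptr[v]; e < adj_ptr[v + 1]; ++e) {
                if (part[adj[e]] != part[v]) { ++boundary; break; }  // cross-edge found
            }
        }
        return 100.0 * static_cast<double>(boundary) / static_cast<double>(n);
    }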
In Gjertsen et al. [9] the interior and boundary vertices are colored separately. We found that in our setting it is advantageous to store and color the interior and boundary vertices interleaved. We believe that this is due to external memory references being evened out in time, thus avoiding communication congestion.

4.3 Randomization

In the experiments in the previous section, the number of conflicts that had to be resolved was small enough that the final sequential phase did not influence the overall running time. In this section we report on experiments where this is not the case. Specifically, we test how the randomized conflict reduction schemes described in Section 3.3 can be applied to improve the running time. For these experiments we report results obtained on a random graph (d2-random) with 30,000 vertices, 8,999,700 edges, a maximum vertex degree of 699, and an average vertex degree of 600. This graph was chosen such that the original parallel d2-coloring algorithm would give a large number of conflicts. We have also performed experiments using graphs tailored toward d1-coloring. These gave results (omitted for space considerations) similar to those reported here. For the randomized algorithms we performed experiments similar to those discussed in Section 4.2. However, these did not yield any significant improvements in the running times over the original algorithm, but kept the results fairly stable. The reason for this is that, irrespective of the chosen ordering technique, almost all the vertices were on the boundary. Hence the experiments shown here are based on the basic algorithm with the vertices in their natural order. The main parameter that must be determined for the TH-algorithm is the shrinking factor s. In [5] a value between four and six is suggested for d1-coloring. Our configuration differs from theirs in that we rely on a shared-memory model while they assume a distributed-memory model. Also, they apply the algorithm recursively to the vertices that do not receive a legal color after the first round, while our approach is to recolor illegally colored vertices in a sequential phase.
Fig. 4. The TH-algorithm run on the d2-random graph
Figure 4 shows the performance of the TH-algorithm for a d2-coloring of the graph d2-random using different values of s. The leftmost panel shows the number of colors used, the middle panel shows the number of conflicts, and the rightmost panel depicts the time used. As is evident from Figure 4, the shrinking factor in the TH-algorithm is critical for its performance. If it is set too small, the algorithm can use more than double the number of colors used by the sequential greedy algorithm. On the other hand, if the shrinking factor is set too large, the number of conflicts, and consequently the running time, increases. The interval for the shrinking factor that produces the best values, both in terms of the number of colors used and run-time, is fairly small. Interestingly, in our experiments we found that the best results are obtained if the shrinking factor is set in such a way that the number of initially available colors is equal or almost equal to the number of colors used by the sequential greedy algorithm. We note that for practical purposes this might be difficult to estimate a priori. In order to compare the TH-algorithm with the GMP-algorithm [7], we ran the GMP-algorithm on the d2-random graph using various values for the parameter q. Due to space limitations we do not show these numbers here, but we observed that the optimal value of q in terms of speed seems to decrease as the number of processors increases. In terms of best possible running times, the TH- and the GMP-algorithm have fairly similar values, but obtaining these requires fine tuning of the algorithms.
5 Conclusion

One interesting open question is the following: given a partition, how should one order the vertices within each component to minimize cache misses both locally and between processors? Even though the randomized methods seem to be promising in terms of handling dense graphs, more work remains to be done to determine how these need to be tailored for specific applications. It would also be of interest to know how these techniques could be applied in a distributed-memory setting.
References
1. J. Allwright, R. Bordawekar, P. D. Coddington, K. Dincer, and C. Martin, A comparison of parallel graph coloring algorithms, NPAC technical report SCCS-666, Northeast Parallel Architectures Center at Syracuse University, 1994.
2. T. F. Coleman and J. J. Moré, Estimation of sparse Jacobian matrices and graph coloring problems, SIAM J. Numer. Anal., 1 (1983), pp. 187–209.
3. P. Crescenzi and V. Kann, A compendium of NP optimization problems. http://www.nada.kth.se/~viggo/wwwcompendium/.
4. E. Cuthill and J. McKee, Reducing the bandwidth of sparse symmetric matrices, in Proceedings of the ACM National Conference, 1969, pp. 157–172.
5. I. Finocchi, A. Panconesi, and R. Silvestri, Experimental analysis of simple, distributed vertex coloring algorithms, in Proc. 13th ACM-SIAM Symposium on Discrete Algorithms (SODA 02), 2002.
6. M. R. Garey and D. S. Johnson, Computers and Intractability, Freeman, 1979.
7. A. H. Gebremedhin, F. Manne, and A. Pothen, Parallel distance-k coloring algorithms for numerical optimization, in Proceedings of Euro-Par 2002, vol. 2400, Lecture Notes in Computer Science, Springer, 2002, pp. 912–921.
8. A. H. Gebremedhin and F. Manne, Scalable parallel graph coloring algorithms, Concurrency: Practice and Experience, 12 (2000), pp. 1131–1146.
9. R. K. Gjertsen Jr., M. T. Jones, and P. Plassmann, Parallel heuristics for improved, balanced graph colorings, J. Par. Dist. Comput., 37 (1996), pp. 171–186.
10. A. Grama, A. Gupta, G. Karypis, and V. Kumar, Introduction to Parallel Computing, 2nd ed., Addison Wesley, 2003.
11. Ö. Johansson, Simple distributed (∆ + 1)-coloring of graphs, Inf. Proc. Letters, 70 (1999), pp. 229–232.
12. M. T. Jones and P. Plassmann, A parallel graph coloring heuristic, SIAM J. Sci. Comput., (1993), pp. 654–669.
13. G. Karypis. Private communications.
14. G. Karypis and V. Kumar, Multilevel k-way partitioning scheme for irregular graphs, J. Par. Dist. Comp., 48 (1998), pp. 96–129.
15. M. Luby, A simple parallel algorithm for the maximal independent set problem, SIAM J. Comput., (1986), pp. 1036–1053.
16. S. M. Ross, Introduction to Probability Models, 7th ed., Academic Press, 2000.
On the Efficient Generation of Taylor Expansions for DAE Solutions by Automatic Differentiation
Andreas Griewank (1) and Andrea Walther (2)
(1) Department of Mathematics, Humboldt-Universität Berlin, Germany, [email protected]
(2) Institute of Scientific Computing, Technische Universität Dresden, Germany, [email protected]
Abstract. Under certain conditions the signature method suggested by Pantelides and Pryce facilitates the local expansion of DAE solutions by Taylor polynomials of arbitrary order. The successive calculation of Taylor coefficients involves the solution of nonlinear algebraic equations by some variant of the Gauss-Newton method. Hence, one needs to evaluate certain Jacobians and several right hand sides. Without advocating a particular solver we discuss how this information can be efficiently obtained using ADOL-C or similar automatic differentiation packages.
1 Introduction

The differential algebraic systems in question are specified by a vector function F(t, y) ≡ F(t, y_1, y_2, ..., y_n) with

    F : IR^{1+m} ≡ IR × IR^{1+m_1} × IR^{1+m_2} × · · · × IR^{1+m_n} → IR^n .
Here, t denotes the independent “time” variable, n is the dimension of the state space and the yj are (1 + mj )-dimensional vectors whose rth component yj,r represents the rth derivative of the state space component yj ≡ yj,0 . However, this relation is really of no importance as far as the pure automatic differentiation task is concerned. In principle, F can be an arbitrary algebraic mapping from IR1+m to IRn but for our purposes it should be defined by an evaluation procedure in a high level computer language like Fortran or C. Then, the technique of automatic differentiation (AD) offers an opportunity to provide derivative information of any order for the given code segment by applying the chain rule systematically to statements of computer programs. For that purpose, the code is decomposed into a typically very long sequence of simple evaluations, e.g. additions, multiplications, and calls to elementary functions such as sin(x) or exp(x). The derivatives with respect to the arguments of these operations can be easily calculated. Exploiting the chain rule yields the derivatives of the whole sequence of statements with respect to the input variables. Depending on the starting point of this methodology—either at the beginning or at the end of the chain of computational steps—one distinguishes between the forward mode and the reverse mode
of AD. (This work was supported by the DFG research center MATHEON, Mathematics for Key Technologies in Berlin, and DFG grant WA 1607/2-1.) Using the forward mode, one computes the required derivatives together with the function evaluation in one sweep. Applying the reverse mode of AD, the derivative values are propagated during a backward sweep. Hence, after a function evaluation, one starts computing the derivatives of the dependents with respect to the last intermediate values and traverses backwards through the evaluation process until the independents are reached. A comprehensive exposition of these techniques of AD can be found in [7]. For many DAEs the computational graph related to the code segment to evaluate F(t, y) will be of quite manageable size, but still we will try to keep the number of sweeps through it as low as possible. Moreover, we will try to amortize the overhead of each sweep by performing reasonably intense calculations for each graph vertex, which represents an intermediate quantity in the evaluation of the algebraic function F(t, y). The structure of the paper is the following. In Section 2 we discuss the efficient computation of first and higher-order derivatives using automatic differentiation. The structural analysis required for the signature method suggested by Pantelides [9] and Pryce [10,11] is the subject of Section 3. New drivers of the AD-tool ADOL-C [8] are presented for the first time in Section 4. They are tailored especially for use in the DAE context. Preliminary numerical results illustrating the effort to compute higher-order derivatives are discussed in Section 5. Section 6 contains some conclusions and an outlook.
2 Computing First and Higher Order Derivatives

Throughout the paper we assume that the time t has been shifted so that its current value is simply t = 0. Then we obtain for any analytic path

    y(t) = Σ_{r=0}^{r̄} y^{(r)} t^r

a corresponding value path

    F(t, y(t)) = Σ_{r=0}^{r̄} t^r F̂_r(0, y^{(0)}, y^{(1)}, ..., y^{(r)}) + O(t^{r̄+1}) .
The coefficient functions F̂_r : IR^{1+m·r} → IR^n are analytic provided this is true for F, as we assume for simplicity. To compute the desired higher-order information, we will first examine the derivative computation for intrinsic functions. For a given Taylor polynomial x(t) = x_0 + x_1 t + x_2 t^2 + · · · + x_{r̄−1} t^{r̄−1} ∈ IR^n, a naive implementation to compute higher-order derivatives could be based on "symbolic" differentiation. This approach would yield for a general smooth v(t) = ϕ(x(t)) the derivative expressions
    v_0 = ϕ(x_0)
    v_1 = ϕ_1(x_0) x_1
    v_2 = ϕ_2(x_0) x_1 x_1 + ϕ_1(x_0) x_2
    v_3 = ϕ_3(x_0) x_1 x_1 x_1 + 2 ϕ_2(x_0) x_1 x_2 + ϕ_1(x_0) x_3
    v_4 = ϕ_4(x_0) x_1 x_1 x_1 x_1 + 3 ϕ_3(x_0) x_1 x_1 x_2 + ϕ_2(x_0) (x_2 x_2 + 2 x_1 x_3) + ϕ_1(x_0) x_4
    v_5 = ...
Hence, the overall complexity grows rapidly in the highest degree r̄ of the Taylor polynomial. To avoid these prohibitively expensive calculations, the standard higher-order forward sweep of automatic differentiation is based on Taylor arithmetic [2], yielding an effort that grows like r̄^2 times the cost of evaluating F(t, y). This is quite obvious for arithmetic operations as shown below. For a general elemental function ϕ, one also finds a recursion with quadratic complexity by interpreting ϕ as the solution of a linear ODE. The following table illustrates the resulting computation of the Taylor coefficients for a simple multiplication and the exponential function:

    v(t) =          Recurrence for k = 1 ... r̄−1                  OPS      MOVES
    x(t) * y(t)     v_k = Σ_{j=0}^{k} x_j y_{k−j}                  ~ r̄^2    3 r̄
    exp(x(t))       k v_k = Σ_{j=1}^{k} j v_{k−j} x_j              ~ r̄^2    2 r̄
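As an illustration of these recurrences, the following minimal C++ sketch propagates Taylor coefficients through a product and through exp; the plain-vector data layout is an assumption made for the example and is not ADOL-C's internal representation.

    #include <vector>
    #include <cmath>

    using Taylor = std::vector<double>;   // Taylor[k] holds the coefficient of t^k

    // Coefficients of v(t) = x(t) * y(t): v_k = sum_{j=0}^{k} x_j * y_{k-j}
    Taylor taylor_mul(const Taylor& x, const Taylor& y) {
        const std::size_t d = x.size();
        Taylor v(d, 0.0);
        for (std::size_t k = 0; k < d; ++k)
            for (std::size_t j = 0; j <= k; ++j)
                v[k] += x[j] * y[k - j];
        return v;
    }

    // Coefficients of v(t) = exp(x(t)):
    // v_0 = exp(x_0),  k * v_k = sum_{j=1}^{k} j * x_j * v_{k-j}  for k >= 1
    Taylor taylor_exp(const Taylor& x) {
        const std::size_t d = x.size();
        Taylor v(d, 0.0);
        v[0] = std::exp(x[0]);
        for (std::size_t k = 1; k < d; ++k) {
            double s = 0.0;
            for (std::size_t j = 1; j <= k; ++j)
                s += static_cast<double>(j) * x[j] * v[k - j];
            v[k] = s / static_cast<double>(k);
        }
        return v;
    }

Both loops perform O(r̄^2) operations per elemental function, which is the quadratic growth referred to above.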
Similar formulas can be found for all intrinsic functions. This fact permits the computation of higher-order derivatives for the vector function F(t, y) as a composition of elementary components. The AD-tool ADOL-C [8] uses the Taylor arithmetic as described above to provide an efficient calculation of higher-order derivatives. Furthermore, the AD-tools FADBAD [1] and CppAD [4] use the same approach to compute higher-order information. In ODE and DAE solving, the coefficients y^{(r)} are generated in increasing order using successive values of the residuals F̂_r. For that purpose, we have to compose the input coefficients of the vectors

    y^{(r)} = [y_1^{(r)}, y_2^{(r)}, ..., y_n^{(r)}]

from the values y_{j,s} obtained so far. This simple application of the chain rule is the only extra procedure we have to attach to our AD software to facilitate the calculation of the desired Taylor coefficients. Specifically, using ADOL-C we must set

    y_{j,s}^{(r)} = y_{j,r+s} / s!

because the computations performed by ADOL-C are based on the unscaled Taylor coefficients. If the F̂_s for s ≤ r are reevaluated from scratch every time, this requires r sweeps and thus a computational effort of order r^3 times the cost of evaluating the underlying
algebraic mapping F(t, y). As observed in [7, Section 10] there are at least two ways in which this effort can be reduced to being quadratic in r. The first option is to store and retrieve the partial Taylor polynomials of all intermediate quantities that occur during the evaluation of F. This has been done in the Fortran package ATOMFT [5] for the solution of ODEs by the Taylor series method. The second possibility is to exploit the property that F_r is linear in all y^{(s)} with s > r/2, so that in fact

    F_r(0, y^{(0)}, y^{(1)}, ..., y^{(r)}) = F_r(0, y^{(0)}, ..., y^{(s−1)}, 0, ..., 0) + Σ_{k=s}^{r} A_{k−s}(0, y^{(0)}, ..., y^{(k−s)}) y^{(s)} .

Here the A_s(0, y^{(0)}, ..., y^{(k−s)}) ∈ IR^{n×n} for s ≤ r are the Taylor coefficients of the Jacobian path J(0, y(t)). They can also be evaluated by standard AD methods and are needed anyway if one wishes to compute sensitivities of the Taylor coefficients with respect to the basis point y_0 in an implicit Taylor method. In contrast to the save-and-restore option, exploiting the linearity reduces the number of sweeps through the computational graph essentially to the logarithm of the maximal order r̄.
3 Structural Analysis in Terms of the Jacobian J

The structural analysis used by Pantelides [9] has become part of professional simulation software [3,6] and has been applied successfully to a wide variety of systems. Nevertheless, it has to be mentioned that Pantelides' algorithm applied to DAEs of index 1 may perform an arbitrarily high number of iterations and differentiations [12]. This behaviour is due to the fact that the structural index of the DAE may exceed the index of the DAE and that the differentiations needed by Pantelides' algorithm relate to the structural index. However, the present paper focuses on the derivative computation. Therefore, these possible difficulties are just a side note for computing the consistent point. They might be overcome in the future by a better suited determination method for the nonnegative shifts mentioned below. In the structural analysis used by Pantelides [9] and Pryce [10,11], the elements σ_ij of the signature matrix Σ are defined by

    σ_ij = max{r | y_{j,r} occurs in F_i} ,   and   σ_ij = −∞ if no component of y_j occurs in F_i ,

where F_i denotes the i-th component function of F. Here, "y_{j,r} occurs in F_i" means that the value of the latter depends nontrivially on the former, which leads to its occurrence in a symbolic expression for F_i. The signature matrix is used to determine vectors c = (c_i)_{i=1}^{n} and d = (d_j)_{j=1}^{n} of nonnegative shifts. It is shown in [10] that if a solution

    Y* = (y*_{1,0}, ..., y*_{1,d_1}, ..., y*_{n,0}, ..., y*_{n,d_n})

of the equations

    F_1^{(0)} = F_1^{(1)} = ... = F_1^{(c_1)} = 0 ,
    ...
    F_n^{(0)} = F_n^{(1)} = ... = F_n^{(c_n)} = 0

exists and the system Jacobian J with

    J_ij = ∂F_i / ∂y_{j,d_j−c_i}   if y_{j,d_j−c_i} occurs in F_i ,   and   J_ij = 0   otherwise,
is nonsingular at the solution, then Y* is a consistent point of the DAE at time t. The required derivative information for constructing J can be obtained by a single reverse sweep in vector mode to evaluate the rectangular Jacobian

    J(0, y) ≡ [J_1(0, y), J_2(0, y), ..., J_n(0, y)] ∈ IR^{n×m}   with   J_j(0, y) ≡ ∂F(0, y)/∂y_j ∈ IR^{n×(1+m_j)} .
Irrespective of the size of m, the operations count for this will be about n times that of evaluating F(t, y) by itself. To compute a solution Y* one has to solve a sequence of underdetermined systems of nonlinear equations. For that purpose, one needs the corresponding Jacobian. This matrix is given by a part of the system Jacobian and can be evaluated using again either standard higher-order automatic differentiation or the more efficient variants discussed above. However, the initialization of the input Taylor coefficients is comparatively easy since one simply has to choose the corresponding unit vectors. Once a consistent point at time t = 0 is computed, one may for example apply an explicit Taylor method to integrate the DAE. For that purpose, only one solve of a linear system with the system Jacobian J as linear operator is required for each order of the Taylor method. Hence, one performs one LU-factorization of J. Subsequently, one only has to evaluate higher-order derivatives occurring in the right-hand sides. For this purpose, the technique explained in the preceding section can be used.
4 Implementation Details

For simplicity we have assumed that the DAE system has been written in autonomous form, but an explicit time dependence could certainly be accounted for too. Although variations are possible, we suggest that the problem be specified by an evaluation code of the following form

    void sys_eval(int n, adouble** y, adouble* F)

using the active variable type adouble provided by ADOL-C. Here y[j][k] represents y_{j,k}, i.e. the k-th derivative of the j-th variable with k ≤ m_j. In other words, the calling program must have allocated n adouble pointers y[j] for j = 0, ..., n − 1, where each of them is itself a vector of m_j + 1 =: m[j] adoubles. On exit the components F[i] for i = 0, ..., n − 1 contain the function components F_i in Pryce notation.
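As an illustration, a sys_eval for the first pendulum of the test problem used later in Section 5 (F_1 = x'' + xλ, F_2 = y'' + yλ − g, F_3 = x² + y² − L) might look as follows; the variable numbering, the derivative orders m = {2, 2, 0}, and the constants are assumptions made for this sketch rather than code taken from the paper.

    #include <adolc/adolc.h>   // ADOL-C header providing adouble (exact name may differ between versions)

    // Hypothetical encoding: variable 0 is x (derivatives up to order 2),
    // variable 1 is y (up to order 2), variable 2 is lambda (order 0 only).
    void sys_eval(int n, adouble** y, adouble* F) {
        const double g = 1.0;   // gravity constant used in Section 5
        const double L = 1.0;   // pendulum length used in Section 5
        (void)n;                // n == 3 for this example

        F[0] = y[0][2] + y[0][0] * y[2][0];                 // x'' + x*lambda
        F[1] = y[1][2] + y[1][0] * y[2][0] - g;             // y'' + y*lambda - g
        F[2] = y[0][0] * y[0][0] + y[1][0] * y[1][0] - L;   // x^2 + y^2 - L
    }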
Provided the code sys_eval does not contain any branches, it must be called only once before the actual DAE solving begins. Before the call, y and F must be allocated and y initialized by a loop of the form

    int tag = 1;
    trace_on(tag);
    y = new adouble*[n];
    F = new adouble[n];
    for (j = 0; j < n; j++) {
      y[j] = new adouble[m[j]+1];
      for (i = 0; i <= m[j]; i++)
        y[j][i] <<= yp[j][i];    // mark y[j][i] as an independent, initialized with yp[j][i]
    }
    sys_eval(n, y, F);
    for (j = 0; j < n; j++)
      F[j] >>= Fp[j];            // mark F[j] as a dependent, returning its value in Fp[j]
    trace_off();

Here, yp[j][i] is an array of double values at which sys_eval can be sensibly called. The loop after the call to sys_eval determines the dependent variables in ADOL-C terminology, where Fp[j] is, like yp[j][i], of type double. Now the DAE system has been taped. If the function evaluation contains branches, the generated tape can be reused as long as the control flow does not change. If the control flow changes, the return values of the drivers for computing the desired derivatives indicate that the tape is no longer valid and a retaping has to be performed. Hence, by monitoring the corresponding return values the correctness of the derivative information can be ensured while keeping the effort for the taping as low as possible. Once the tape is generated, function and derivative evaluations are performed for example by the routines zos_forward_partx(..), fos_forward_partx(..), hos_forward_partx(..), and jacobian_partx(..), which are problem independent. For example the call zos_forward_partx(tag,n,n,m,yp,Fp) will yield as output the system values Fp[j] for arbitrary inputs yp[j][i] and the array m describing the partition of yp. Here, zos_forward stands for zero-order scalar forward mode, since no derivatives are required. Now suppose we have allocated and assigned values to a three-dimensional tensor yt[j][i][r] for j < n, i < m[j], r ≤ b. Mathematically, this is interpreted as the family of Taylor expansions

    y[j][i] ≡ Σ_{r=0}^{b} yt[j][i][r] t^r .

For ADOL-C, the values yt[j][i][r] are completely independent, but for use in the integration method proposed by Pryce they must be given values that are consistent in that

    yt[j][i][r] = y_{j,i+r} / r! .

Here the y_{j,i+r}, i.e. the (i+r)-th derivatives of y_j, are the already known or guessed solution values and derivatives. All derivatives of higher order should be set to zero. Now the call
fos_forward_partx(tag,n,n,m,yt,Ft) yields the Taylor coefficients Ft[j][r] for r < 2, where fos_forward stands for first-order scalar forward mode, and the call hos_forward_partx(tag,n,n,m,b,yt,Ft) yields the Taylor coefficients Ft[j][r] for r ≤ b of the resulting expansion

    F[i] ≡ Σ_{r=0}^{b} Ft[i][r] t^r .
In other words, the value r! · Ft[j][r] corresponds exactly to the value F_{j,r} needed for the approach described by Pryce in [10,11]; the scaling by r! has to be done by the user. For computing the system Jacobian that is also required by the method of Pryce and similar integration methods, ADOL-C also provides a special driver. For using it one must allocate the array jac[i][j][k] for i < n, j < n, k ≤ m_j. In order to obtain the values jac[i][j][k] ≡ ∂F_i/∂y_{j,k} of this Jacobian, the user has to call the new ADOL-C Jacobian driver jacobian_partx(tag,n,m,n,xp,jac). The presented new drivers of ADOL-C are available in the current version 1.9.0 and have been incorporated into a software prototype in order to test the calculation of higher-order derivatives for the integration of high-index DAEs. The achieved numerical results are presented in the next section.
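Putting the pieces together, a driver sequence for one evaluation of the residual Taylor coefficients and the system Jacobian might look as follows. Only the driver call forms are taken from the description above; the wrapper, its parameter types, and the assumption that all arrays have already been allocated and consistently initialized are sketch-level choices (the ADOL-C driver prototypes are assumed to be declared via the ADOL-C headers).

    // tag identifies the tape created as above; n and m[] describe the variable
    // partition; yt[j][i][r] holds consistent input Taylor coefficients.
    void evaluate_dae_information(int tag, int n, int* m, int b,
                                  double** yp, double*** yt,
                                  double* Fp, double** Ft, double*** jac) {
        // Plain function values F_j at the point yp.
        zos_forward_partx(tag, n, n, m, yp, Fp);

        // Taylor coefficients Ft[j][r], r <= b, of the residual expansion.
        hos_forward_partx(tag, n, n, m, b, yt, Ft);

        // Rescale to the derivative values F_{j,r} needed in Pryce's method.
        double fact = 1.0;
        for (int r = 0; r <= b; ++r) {
            if (r > 0) fact *= r;          // fact == r!
            for (int j = 0; j < n; ++j)
                Ft[j][r] *= fact;
        }

        // Partial derivatives jac[i][j][k] = dF_i / dy_{j,k}.
        jacobian_partx(tag, n, m, n, yp, jac);
    }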
5 Numerical Example

We implemented a very simple version of the algorithm given in [10] to illustrate the capabilities of ADOL-C to provide the required higher-order derivatives. Furthermore, the resulting code may form one possibility to verify the run-time saving that can be achieved using the improvements stated in Section 2. As test example, we choose a model of two pendula from [10], where the λ component of the first one controls the length of the second one. This system with index 5 is described by the DAE

    F_1 = x'' + xλ = 0          F_4 = u'' + uκ = 0
    F_2 = y'' + yλ − g = 0      F_5 = v'' + vλ − g = 0
    F_3 = x² + y² − L = 0       F_6 = u² + v² − (L + cλ)² = 0
and has four degrees of freedom. For the numerical tests presented below, we use the gravity constant g = 1, the length L = 1 of the first pendulum, c = 0.1 and simulate the behaviour of both pendula for the time interval [0, 40]. This choice of T ensures that we simulate the pendula for a period where they do not show chaotic behaviour, which
Fig. 1. Index 5 two-pendulum problem, numerical results
Fig. 2. Index 5 two-pendulum problem, run-times
is the case for T ≥ 50, cf. [10]. The values (x_0, y_0) = (u_0, v_0) = (1, 0) and (x'_0, y'_0) = (u'_0, v'_0) = (0, 1) serve as initial point. In order to judge the influence of the derivative calculation on the run-time, we consider three discretizations based on the step sizes h = 0.0025, 0.005, 0.01, which result in 16 000, 8 000 and 4 000 time steps, respectively. In addition we use explicit Taylor methods of order 5, 10, 15, 20, 25, 30 for the integration of the DAE. The computed results for the two pendula are illustrated by Fig. 1, where the first one obviously shows a periodic behaviour. Since the length of the second one varies in dependence on the first pendulum, one can observe for the second one a non-periodic solution which results eventually in a chaotic behaviour. However, for the chosen combinations of step size and integration order the computed solutions are identical for the first pendulum and close or at least comparable for the second one. This fact was acceptable for us since the influence of the derivative order on the computing time was the main subject. The simulations were computed using a Red Hat Linux system, with an AMD Athlon XP 1666 MHz processor and 512 MB RAM. The required computing times for the different step sizes and integration orders are illustrated by Fig. 2. Here, for one specific step size the computational effort varies mainly due to the integration order of the explicit Taylor method. The predicted nonlinear behaviour can be observed very clearly
for the step size h = 0.0025. For the larger step sizes, i.e. h = 0.01 and h = 0.005, the derivative calculation is dominated by the linear algebra cost. Therefore, Fig. 2 shows only a slight nonlinear influence of the integration order on the run-time. Nevertheless, the numerical experiment confirms that the effort for computing higher-order derivatives increases only moderately when using the AD-tool ADOL-C. This is in accordance with the theoretical results sketched in Section 2.
6 Conclusion and Outlook

This paper discusses the computation of higher-order derivatives using automatic differentiation in the context of high-index DAEs. For that purpose, the standard higher-order forward sweep of automatic differentiation based on Taylor arithmetic is discussed, as well as two possible improvements that are valuable for very high order derivatives. This methodology is embedded in the structural analysis for high-index DAEs. One case study presents numerical results achieved with the AD-tool ADOL-C that confirm the theoretical complexity of computing higher-order derivatives. Certainly, the implementation of the improvements for the generation of Taylor expansions presented in this paper forms one specific future challenge. This may also include a possibly thread-based parallelization. Here one can exploit the coarse-grained nature of Taylor computations, which differs significantly from the situation when computing first-order derivatives using AD. An additional task is to ease the use of existing AD-tools in connection with the structural approach to integrate high-index DAEs. For that purpose, we presented a description of new drivers provided by ADOL-C 1.9.0 that take into account the special structure of a given DAE system, where in addition to the values of the variables also the values of specific derivatives of the variables enter the evaluation of the system function.
References 1. C. Bendtsen and O. Stauning: FADBAD, a flexible C++ package for automatic differentiation. Department of Mathematical Modelling, Technical University of Denmark, 1996. 2. R. Brent and H. Kung: Fast algorithms for manipulating formal power series. Journal of the Association for Computing Machinery 25, 581–595, 1978. 3. F. Cellier and H. Elmquist: Automated formula manipulation supports object-oriented continuous-system modelling. IEEE Control System Magazine 13, 28–38, 1993. 4. http://www.seanet.com/∼ bradbell/CppAD/ 5. Y.F. Chang and G. Corliss: Solving ordinary differential equations using Taylor series. ACM Trans. Math. Software 8, 114–144, 1982. 6. W. Feehery and P. Barton: Dynamic optimization with state variable path constraints. Comput. Chem. Engrg. 22, 1241–1256, 1998. 7. A. Griewank: Evaluating Derivatives, Principles and Techniques of Algorithmic Differentiation, Frontiers in Appl. Math. 19, SIAM, Phil., 2000. 8. A. Griewank, D. Juedes, and J. Utke: ADOL-C: A package for the automatic differentiation of algorithms written in C/C++. TOMS 22, 131–167, 1996. 9. C.C. Pantelides: The consistent initialization of differential-algebraic systems. SIAM J. Sci. Statist. Comput. 9, 213–231, 1988.
10. J. Pryce: Solving high-index DAEs by Taylor series. Numer. Algorithms 19, 195–211, 1998. 11. J. Pryce: A simple structural analysis method for DAEs. BIT 41, 364–394, 2001. 12. G. Reißig, W. Martinson, and P. Barton: Differential-algebraic equations of index 1 may have an arbitrarily high structural index. SIAM J. Sci. Comput. 21, 1987–1990, 2000.
Edge-Disjoint Hamiltonian Cycles of WK-Recursive Networks
Chien-Hung Huang (1), Jywe-Fei Fang (2), and Chin-Yang Yang (3)
(1) National Formosa University, Yun-Lin 632, Taiwan, R.O.C.
(2) St. John's and St. Mary's Institute of Technology, Taipei 251, Taiwan, R.O.C.
(3) National Taiwan University Hospital, Taipei 100, Taiwan, R.O.C.
Abstract. In this paper, we show that there exist n edge-disjoint Hamiltonian cycles in the WK-Recursive networks with amplitude 2n + 1. By the aid of these edge-disjoint Hamiltonian cycles, nearly optimal all-to-all broadcasting communication on all-port WK-Recursive networks can be derived.
1 Introduction

In massively parallel MIMD systems, the topology plays a crucial role in issues such as communication performance, hardware cost, potentialities for efficient applications and fault tolerance capabilities. A topology named WK-Recursive network has been proposed by Vecchia and Sanges under the CAPRI (Concurrent Architecture and Programming environment for highly Integrated systems) project supported by the Strategic Program on Parallel Computing of the National Research Council of Italy [1]. The topology has many attractive properties, such as a high degree of regularity, symmetry and efficient communication. Particularly, for any specified degree, it can be expanded to an arbitrary size level without reconfiguring the links. WK-Recursive networks have received considerable attention. Researchers have devoted themselves to various issues of WK-Recursive networks such as broadcasting algorithms [2], topological properties [6], substructure allocation [5] and communication analysis [4]. d'Acierno et al. built a neural network on the WK-Recursive network, which gained considerable performance enhancement [15]. In recent years, Fang et al. have proposed a novel broadcasting scheme for the WK-Recursive network, which requires only constant data included in each message and constant time to determine the neighbors to forward the message [12]. Fu has discussed the Hamiltonian-connected property of the WK-Recursive network [13]. Wu et al. have proposed a generalized and efficient allocation scheme for Recursively Decomposable Interconnection Networks, which include the WK-Recursive network [14]. A graph G is said to be Hamiltonian-decomposable if its edge set can be partitioned into edge-disjoint Hamiltonian cycles. Thus, some communication problems, such as many-to-many broadcasting and many-to-many scattering in the all-port model, can be nearly optimally solved [8]. In fact, many researchers have studied this issue for various topologies [9, 10]. To the best of our knowledge, there exists no article addressing this issue for WK-Recursive networks. In this paper, we show that WK-Recursive networks with amplitude 2n + 1 are Hamiltonian-decomposable.
2 Notations and Background

A complete graph with n nodes, denoted by K_n, is a graph in which every two distinct nodes are adjacent. A WK-Recursive network with amplitude W and level L, denoted by WK(W, L), can be recursively constructed as follows: WK(W, 1) is a K_W in which each node has one free link and W − 1 links that are used for connecting to other nodes. Clearly, WK(W, 1) has W nodes and W free links. WK(W, C) consists of W copies of WK(W, C − 1) as supernodes and the W supernodes are connected as a K_W, where 2 ≤ C ≤ L. By induction, it is easy to see that WK(W, L) has W^L nodes and W free links. Consequently, for any specified degree W, WK-Recursive networks can be expanded to an arbitrary level L without reconfiguring the links. In Fig. 1, the structures of WK(5, 1) and WK(5, 2) are shown.
Fig. 1. The structures of WK(5, 1) and WK(5, 2)
The following addressing scheme for WK(W, L) is described in [7]. After fixing an origin and an orientation (clockwise or counterclockwise), within each WK(W, 1) subnetwork, a node is labeled with an index digit d_1 from 0 to W − 1. Similarly, within each WK(W, C) subnetwork, a WK(W, C − 1) subnetwork is labeled with an index d_C from 0 to W − 1, where 2 ≤ C ≤ L. Hence, each node of WK(W, L) is labeled with a unique address d_L d_{L−1} ... d_2 d_1, as illustrated in Fig. 1. Likewise, a subnetwork of
WK(W, L) can be represented by a string of L symbols over the set {0, 1, ..., W − 1} ∪ {*}, where * is a "don't care" symbol. That is, each WK(W, C) subnetwork of WK(W, L) can be denoted by d_L d_{L−1} ... d_{C+1} (*)^C, where (*)^C represents C consecutive *'s. For example, in WK(5, 3), 0** is the subnetwork {0d_2d_1 | 0 ≤ d_2 ≤ 4 and 0 ≤ d_1 ≤ 4}. For a subnetwork d_L d_{L−1} ... d_{C+1} (*)^C in WK(W, L), a node with address d_L d_{L−1} ... d_{C+1} (d_C)^C is called a corner node of d_L d_{L−1} ... d_{C+1} (*)^C. For example, in WK(5, 3), 000, 011, 022, 033 and 044 are corner nodes of 0**. Specifically, the node d_L d_{L−1} ... d_{C+1} (d_C)^C is called the d_C-corner of d_L d_{L−1} ... d_{C+1} (*)^C. For example, in WK(5, 3), 022 is called the 2-corner of 0**. In this paper, a link within a WK(W, 1) subnetwork is called an inner-cluster link.

Definition 1. The inner-cluster links of node d_L d_{L−1} ... d_2 d_1 are defined as (d_L d_{L−1} ... d_2 d_1, d_L d_{L−1} ... d_2 h), where 0 ≤ h ≤ W − 1 and d_1 ≠ h.

For example, in WK(5, 3), (002, 000), (002, 001), (002, 003) and (002, 004) are inner-cluster links of node 002. Clearly, each node has W − 1 inner-cluster links in WK(W, L). A link connecting two WK(W, C) subnetworks, where 1 ≤ C ≤ L − 1, is called an inter-cluster link and specifically a C-level link.

Definition 2. The C-level inter-cluster link of node d_L d_{L−1} ... d_{C+1} (d_C)^C, where d_{C+1} ≠ d_C, is defined as (d_L d_{L−1} ... d_{C+1} (d_C)^C, d_L d_{L−1} ... d_{C+2} d_C (d_{C+1})^C).

For example, in WK(5, 3), (022, 200) is a 2-level link and (012, 021) is a 1-level link. Note that each node except the corner nodes (d_L)^L, where 0 ≤ d_L ≤ W − 1, has exactly one inter-cluster link in WK(W, L). Each corner node (d_L)^L of WK(W, L) has no inter-cluster link but a free link. In this paper, the outline graph of WK(W, L) is obtained by taking each WK(W, 1) subnetwork as a supernode. Since WK(W, L) can be constructed recursively, we have the following proposition.

Proposition 1. The outline graph of WK(W, L) is WK(W, L − 1).
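Definitions 1 and 2 translate directly into a routine that enumerates the neighbors of a node from its address string; the representation of addresses as digit vectors and the function name below are choices made for this sketch.

    #include <vector>

    // Address of a WK(W, L) node as digits d_L ... d_2 d_1 stored
    // most-significant first, i.e. addr[0] = d_L and addr[L-1] = d_1.
    using Address = std::vector<int>;

    std::vector<Address> neighbors(const Address& addr, int W) {
        const int L = static_cast<int>(addr.size());
        std::vector<Address> result;

        // Inner-cluster links: change the last digit d_1 to any other value h.
        for (int h = 0; h < W; ++h) {
            if (h == addr[L - 1]) continue;
            Address a = addr;
            a[L - 1] = h;
            result.push_back(a);
        }

        // Inter-cluster link: if the address ends in C copies of d_C preceded by
        // a different digit d_{C+1}, swap d_{C+1} and d_C and replicate d_{C+1}.
        int C = 1;
        while (C < L && addr[L - 1 - C] == addr[L - 1]) ++C;   // length of constant suffix
        if (C < L) {                                           // corner nodes (d_L)^L have none
            Address a = addr;
            const int dC  = addr[L - 1];        // the repeated digit
            const int dC1 = addr[L - 1 - C];    // the digit just before the suffix
            a[L - 1 - C] = dC;
            for (int k = 0; k < C; ++k) a[L - 1 - k] = dC1;
            result.push_back(a);
        }
        return result;
    }

For example, for node 022 of WK(5, 3) the routine returns 020, 021, 023, 024 and the 2-level neighbor 200, matching the examples above.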
3 The Results

In this section, we show how to partition the edge set of WK(2n+1, L) into n edge-disjoint Hamiltonian cycles. First, we show that the result is true for L = 1. According to the definition, WK(2n+1, 1) is K_{2n+1}. It is well known that K_{2n+1} can be decomposed into n Hamiltonian cycles [9]. Let these 2n + 1 nodes of K_{2n+1} be labeled as v_0, v_1, ..., v_{2n}. H_i, where 0 ≤ i ≤ n − 1, is defined as follows:

    v_{2n}, v_i, v_{i+1}, v_{i−1}, v_{i+2}, v_{i−2}, ..., v_{n+i−1}, v_{n+i+1}, v_{n+i}, v_{2n}

All subscripts except 2n are positive and expressed modulo 2n. H_i, where 0 ≤ i ≤ n − 1, is a Hamiltonian cycle of K_{2n+1}, and these n Hamiltonian cycles are edge-disjoint [9]. Thus we can state the following lemma.

Lemma 1. The edge set of K_{2n+1} can be partitioned into n edge-disjoint Hamiltonian cycles.

HP_i, where 0 ≤ i ≤ n − 1, is defined as follows:

    v_i, v_{2n}, v_{n+i}, v_{n+i+1}, v_{n+i−1}, ..., v_{i−2}, v_{i+2}, v_{i−1}, v_{i+1}   (for i even)
    v_{n+i}, v_{2n}, v_i, v_{i+1}, v_{i−1}, v_{i+2}, v_{i−2}, ..., v_{n+i−1}, v_{n+i+1}   (for i odd)

All subscripts except 2n are positive and expressed modulo 2n. Clearly, HP_i, where 0 ≤ i ≤ n − 1, is a Hamiltonian path of K_{2n+1}, and these n Hamiltonian paths are edge-disjoint.
Moreover, the source and destination nodes of the HP_i, where 0 ≤ i ≤ n − 1, are pairwise disjoint. The source node and destination node of HP_i are (v_0, v_1), (v_2, v_3), (v_4, v_5), ..., (v_{2m}, v_{2m+1}) for i = 2, 4, 6, ..., 2m, where 2m ≤ n − 1. The source node and destination node of HP_i are (v_{n+1}, v_{n+2}), (v_{n+3}, v_{n+4}), (v_{n+5}, v_{n+6}), ..., (v_{n+2m+1}, v_{n+2m+2}) for i = 1, 3, 5, ..., 2m + 1, where 2m + 1 ≤ n − 1, and all subscripts except 2n are positive and expressed modulo 2n. Now we show that there exist n edge-disjoint Hamiltonian cycles in WK-Recursive networks with amplitude 2n + 1.

Theorem 2. WK(2n + 1, L) contains n edge-disjoint Hamiltonian cycles.

Proof. We will prove the theorem by induction on L. For L = 1, WK(2n + 1, 1) is K_{2n+1}. According to Lemma 1, the result is true for L = 1. Hypothesis: Assume that WK(2n + 1, k) contains n edge-disjoint Hamiltonian cycles. Induction step: The outline graph of WK(2n + 1, k + 1) is WK(2n + 1, k). By hypothesis, WK(2n + 1, k) contains n edge-disjoint Hamiltonian cycles. Let HC(k, i), where 1 ≤ i ≤ n, be n edge-disjoint Hamiltonian cycles in the outline graph of WK(2n + 1, k + 1). Clearly, each WK(2n + 1, 1) subnetwork V_j*, where 0 ≤ j ≤ (2n + 1)^k − 1, is incident to exactly two edges in HC(k, i). That is, for each WK(2n + 1, 1) subnetwork V_j*, there are exactly two nodes whose inter-cluster links are contained in HC(k, i). We label these two nodes as v_{j,i} and v_{j,i+1} (v_{j,n+i} and v_{j,n+i+1}) for i even (odd), where 1 ≤ i ≤ n, and the second subscripts except 2n are all positive and expressed modulo 2n. In each WK(2n + 1, 1) subnetwork V_j*, the node which is not incident to an inter-cluster link in any HC(k, i) is labeled as v_{j,2n}. For each WK(2n + 1, 1) subnetwork V_j*, where 0 ≤ j ≤ (2n + 1)^k − 1 and 1 ≤ i ≤ n, we define HP_{j,i} as follows:

    v_{j,i}, v_{j,2n}, v_{j,n+i}, v_{j,n+i+1}, v_{j,n+i−1}, ..., v_{j,i−2}, v_{j,i+2}, v_{j,i−1}, v_{j,i+1}   (for i even)
    v_{j,n+i}, v_{j,2n}, v_{j,i}, v_{j,i+1}, v_{j,i−1}, v_{j,i+2}, v_{j,i−2}, ..., v_{j,n+i−1}, v_{j,n+i+1}   (for i odd)

The second subscripts except 2n are all positive and expressed modulo 2n. Clearly, HP_{j,i} is a Hamiltonian path of the WK(2n + 1, 1) subnetwork V_j*, and these n Hamiltonian paths are edge-disjoint. For each 0 ≤ i ≤ n − 1, let HC(k + 1, i) = HC(k, i) ∪ HP_{0,i} ∪ HP_{1,i} ∪ ... ∪ HP_{N−2,i} ∪ HP_{N−1,i}, where N = (2n+1)^k. HC(k + 1, i) is a Hamiltonian cycle of WK(2n + 1, k + 1), and these n Hamiltonian cycles are edge-disjoint. This extends the induction and completes the proof. □

For example, the edges of WK(5, 2) can be partitioned into Hamiltonian cycles as follows. First, according to Lemma 1, WK(5, 1) can be partitioned into Hamiltonian cycles as shown in Fig. 2(a). Note that the outline graph of WK(5, 2) is WK(5, 1). If each WK(5, 1) subnetwork is regarded as a supernode, the inter-cluster links of WK(5, 2) can be partitioned into two edge-disjoint Hamiltonian cycles. Clearly, K_5 is node symmetric. Thus, we can label the nodes of each WK(5, 1) subnetwork according to the proof of Theorem 2. Consequently, the edges of WK(5, 2) can be partitioned into two edge-disjoint Hamiltonian cycles as illustrated in Fig. 2(b).
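The base-case decomposition of K_{2n+1} in Lemma 1 is easy to generate and check mechanically; the following sketch builds the n cycles H_i from the pattern above and verifies that no edge is used twice. The containers and function names are implementation choices for the sketch only.

    #include <vector>
    #include <set>
    #include <algorithm>
    #include <utility>

    // Build the Hamiltonian cycle H_i of K_{2n+1}:
    // v_2n, v_i, v_{i+1}, v_{i-1}, v_{i+2}, v_{i-2}, ..., v_{n+i} (closing back to v_2n),
    // with all subscripts except 2n taken modulo 2n.
    std::vector<int> cycle_H(int n, int i) {
        auto mod = [n](int x) { int m = 2 * n; return ((x % m) + m) % m; };
        std::vector<int> cyc;
        cyc.push_back(2 * n);                 // the special vertex v_{2n}
        cyc.push_back(mod(i));
        for (int k = 1; k <= n; ++k) {
            cyc.push_back(mod(i + k));
            if (k < n) cyc.push_back(mod(i - k));
        }
        return cyc;                           // 2n + 1 vertices
    }

    // Check that H_0, ..., H_{n-1} are pairwise edge-disjoint and together use
    // all n(2n+1) edges of K_{2n+1} exactly once.
    bool check_decomposition(int n) {
        std::set<std::pair<int, int>> used;
        for (int i = 0; i < n; ++i) {
            std::vector<int> cyc = cycle_H(n, i);
            cyc.push_back(2 * n);             // close the cycle
            for (std::size_t k = 0; k + 1 < cyc.size(); ++k) {
                int lo = std::min(cyc[k], cyc[k + 1]);
                int hi = std::max(cyc[k], cyc[k + 1]);
                if (!used.insert({lo, hi}).second) return false;   // edge reused
            }
        }
        return used.size() == static_cast<std::size_t>(n * (2 * n + 1));
    }

For n = 2, cycle_H produces the two cycles of K_5 used in the WK(5, 2) example.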
4 Conclusions

In this paper, we have shown that WK-Recursive networks with amplitude 2n + 1 contain n edge-disjoint Hamiltonian cycles. With the aid of these edge-disjoint Hamiltonian cycles, nearly optimal all-to-all broadcasting communication on all-port WK-Recursive
Fig. 2. Hamiltonian decomposition of WK(5, 2)
networks can be derived. However, the issue of one-port WK-Recursive networks is still open. This work was supported in part by the National Science Council of the Republic of China under the contract number: NSC92-2213-E-129-010.
References 1. G. D. Vecchia and C. Sanges, “A recursively scalable network VLSI implementation,“ Future Generat. Comput. Syst. vol. 4, pp. 235-243, 1988. 2. G. D. Vecchia and C. Sanges, “An optimal broadcasting technique for WK-Recursive topologies,“ Future Generat. Comput. Syst. vol. 5, pp. 353-357, 1989/1990. 3. F. T. Leighton, Introduction to Parallel Algorithms and Architectures: Arrays,Trees, Hypercubes, California: Mogran Kaufmann, 1992. 4. R. Fernandes, “Recursive interconnection networks for multicomputer networks,“ Proceed. Int. Conf. Parallel Process., vol. 1, pp. 76-79, 1992. 5. R. Fernandes and A. Kanevsky, “Substructure allocation in recursive interconnection networks,“ Proceed. Int. Conf. Parallel Process., vol. 1, pp. 315-318, 1993. 6. A. I. Mahdaly, H. T. Mouftah, N. N. Hanna, “Topological properties of WK-recursive networks," Proceed. Second IEEE Workshop on Future Trends of Distributed Computing Systems, pp. 374-380, 1990. 7. G. D. Vecchia and C. Sanges, “Recursively scalable network for message passing architecture," Proceed. Int. Conf. Parallel Processing and Applications, vol. 1, pp. 33-40, May 1987. 8. S. L. Johnsson and C. T. Ho, “Optimum broadcasting and personalized communication in hypercubes," IEEE Transaction on Computers, vol. 38, pp. 1249-1268,1989.
9. G. Chartrand and O. R. Oellermann, Applied and Algorithmic Graph Theory, McGrawHill,1993. 10. D. Barth and A. Raspaud, “Two edge-disjoint Hamiltonian cycles in the butterfly graph," Inform. Process. Lett., vol. 51, pp. 175-179,1994. 11. J. C. Bermond, O. Favaron and M. Maheo, “Hamiltonian decomposition of Cayley graphs of degree four," J. Combin. Theory, vol. 46, pp. 142-153, 1989. 12. J. F. Fang, G. J. Lai, Y. C. Liu and S. T. Fang, “A novel broadcasting scheme for WK-Recursive networks," Proceed. IEEE Pacific RIM Conference on Communications, Computers, and Signal Processing, pp. 1028-1031, 2003. 13. J. S. Fu, “Hamiltonian-connectedness of the WK-recursive network," Proceed. Int. Conf. Symposium on Parallel Architectures, Algorithms and Networks, pp.569-574, 2004. 14. F. Wu and C. C. Hsu, “A generalized processor allocation scheme for recursively decomposable interconnection networks," IEICE Transactions on Information and Systems, vol. E85-D, pp. 694-713, 2002. 15. A. d’Acierno, R. Del Balio, G. De Pietro and R. Vaccaro, “A parallel simulation of fully connected neural networks on a WK-recursive topology," Proceed. IEEE International Joint Conference on Neural Networks, pp. 1850-1854, 1991.
Simulation-Based Analysis of Parallel Runge-Kutta Solvers
Matthias Korch and Thomas Rauber
University of Bayreuth, Department of Mathematics, Physics, and Computer Science
{matthias.korch,rauber}@uni-bayreuth.de
Abstract. We use simulation-based analysis to compare and investigate different shared-memory implementations of parallel and sequential embedded Runge-Kutta solvers for systems of ordinary differential equations. The results of the analysis help to provide a better understanding of the locality and scalability behavior of the implementations and can be used as a starting point for further optimizations.
1 Introduction

The modeling of many scientific and engineering problems leads to systems of ordinary differential equations (ODEs). The numerical solution of such problems requires large amounts of computational resources, particularly if the ODE system is large or the evaluation of the right-hand-side function is expensive. We aim at the development of new, fast realizations of parallel ODE solvers that efficiently exploit the memory hierarchy of current microprocessors. We start with an analysis of existing parallel and sequential methods that assists in the detection of scalability bounds and further sources of performance degradation. The paper considers the simulation-based analysis of embedded Runge-Kutta (RK) methods. In particular, we study locality optimizations for general embedded RK methods based on program transformations similar to [11], and locality optimization and parallelization using pipelining to exploit the characteristic access structure of an important class of ODEs resulting from the spatial discretization of partial differential equations (PDEs) [6]. In contrast to [6], the parallel realizations are based on a multi-threaded realization using Pthreads. Investigations on real computer systems, as presented in [6,11], show that there are several limitations when performing a software analysis on real systems. For example, only a limited number of hardware events is measurable as provided by the manufacturer of the system, the execution of the program to be analyzed may be disturbed by other user processes competing for shared resources, and the measurement itself may influence its outcome if additional code needs to be inserted into the program or if the program has to be interrupted in order to read the state of the machine. Additional experiments using simulators can help overcome these problems. Particularly promising are instruction-set-level simulators which enable full-system simulation of real parallel architectures and provide a wide range of instrumentation facilities. We use Simics [9] for analyzing the locality behavior and scalability properties of several sequential and parallel multithreaded versions of embedded RK solvers. Simics has already been used successfully
to investigate the cache memory behavior of PDE solvers [13]. We present experiments performed on a simulated SPARC V9 ISA, and compare our results with experiments performed on a real Sun Fire server. As examples we use two sparse ODE systems resulting from applying the method of lines to time-dependent PDEs. The rest of the paper is organized as follows: Section 2 describes the ODE solution methods considered. Section 3 shows the ODE test set used, and Section 4 gives an overview of the different implementations. Section 5 presents simulation experiments and provides a comparison with runtime tests on a real system. Section 6 discusses related work, and Section 7 concludes the paper.
2 Solution of Initial Value Problems Using Embedded Runge-Kutta Methods

In this paper, we consider the solution of initial value problems (IVPs)

    y'(t) = f(t, y(t)) ,   y(t_0) = y_0 ,   y : IR → IR^n ,   f : IR × IR^n → IR^n ,        (2.1)
of systems of ODEs. Such systems can be solved efficiently by explicit RK methods with stepsize control using embedded solutions. An embedded RK method with s stages uses the stage vectors v_1, ..., v_s to compute two new approximations η_{κ+1} and η̂_{κ+1} from the two previous approximations η_κ and η̂_κ. This is represented by the computation scheme

    v_l = f(x_κ + c_l h_κ, η_κ + h_κ Σ_{i=1}^{l−1} a_{li} v_i) ,   l = 1, ..., s ,
    η_{κ+1} = η_κ + h_κ Σ_{l=1}^{s} b_l v_l ,   η̂_{κ+1} = η_κ + h_κ Σ_{l=1}^{s} b̂_l v_l .        (2.2)
The coefficients aij , ci , bi , and ˆbi are determined by the RK method used. A straightforward implementation of (2.2) leads to the program shown in Fig. 1. It contains a non-tightly nested loop structure consisting of a loop over the stage vector computation with the computation of the argument vector w as inner loop and the loop over vector components as innermost loop. Because the function evaluations of the right-hand-side function f may access all components of w, two barriers must be used in the dataparallel implementation to ensure that the computation of w has been finished before f is evaluated and that w is not modified before the function evaluation is complete.
3 ODE Test Set

Our first test problem is the two-dimensional Brusselator equation (BRUSS2D)

    ∂u/∂t = 1 + u²v − 4.4u + α (∂²u/∂x² + ∂²u/∂y²) ,
    ∂v/∂t = 3.4u − u²v + α (∂²v/∂x² + ∂²v/∂y²) ,        (3.3)
    // compute the stage vectors v_1, ..., v_s
    for (j = first; j <= last; j++) v_1[j] = f_j(x + c_1 h, η);
    for (i = 2; i <= s; i++) {
      barrier();
      for (j = first; j <= last; j++) w[j] = a_{i1} v_1[j];
      for (l = 2; l < i; l++)
        for (j = first; j <= last; j++) w[j] += a_{il} v_l[j];
      for (j = first; j <= last; j++) w[j] = h*w[j] + η[j];
      barrier();
      for (j = first; j <= last; j++) v_i[j] = f_j(x + c_i h, w);
    }
    // compute u = η_{κ+1} − η_κ and t = η̂_{κ+1} − η_{κ+1}  (using b̃ = b̂ − b)
    for (j = first; j <= last; j++) { u[j] = b_1 v_1[j]; t[j] = b̃_1 v_1[j]; }
    for (i = 2; i <= s; i++)
      for (j = first; j <= last; j++) { t[j] += b̃_i v_i[j]; u[j] += b_i v_i[j]; }
    use t to perform error control and stepsize selection
    use u to update η_{κ+1} if step is accepted
Fig. 1. Program fragment of one time step of parallel implementation (A)
where the unknown functions u and v represent the concentrations of two substances in the domain 0 ≤ x ≤ 1, 0 ≤ y ≤ 1 [4]. A Neumann boundary condition ∂u/∂η = 0, ∂v/∂η = 0, and the initial conditions u(x, y, 0) = 0.5 + y, v(x, y, 0) = 1 + 5x are used. A standard five-point-star discretization of the spatial derivatives on a uniform N × N grid leads to a sparse ODE system of dimension n = 2N² for the discretized solution {U_ij}_{i,j=1,...,N} and {V_ij}_{i,j=1,...,N}:

    dU_ij/dt = 1 + U_ij² V_ij − 4.4 U_ij + α (N−1)² (U_{i+1,j} + U_{i−1,j} + U_{i,j+1} + U_{i,j−1} − 4 U_{i,j}) ,
    dV_ij/dt = 3.4 U_ij − U_ij² V_ij + α (N−1)² (V_{i+1,j} + V_{i−1,j} + V_{i,j+1} + V_{i,j−1} − 4 V_{i,j}) ,        (3.4)
where a constant number of components of the argument vector are accessed in each evaluation of the right-hand-side function. This is a non-stiff system for α = 2 · 10^{−4} and small values of N. The second problem considered is the Medical Akzo Nobel problem (MEDAKZO) [8]. This problem originates from the penetration of radio-labeled antibodies into a tissue that has been infected by a tumor. It consists of two PDEs

    ∂u/∂t = ∂²u/∂x² − kuv ,   ∂v/∂t = −kuv .        (3.5)

Semi-discretization leads to an ODE system of the form
    y'(t) = f(t, y) ,   y(0) = g ,   y ∈ IR^{2N} ,

where

    f_{2j−1} = α_j (y_{2j+1} − y_{2j−3}) / (∆ζ) + β_j (y_{2j−3} − 2y_{2j−1} + y_{2j+1}) / (∆ζ)² − k y_{2j−1} y_{2j} ,
    f_{2j} = −k y_{2j} y_{2j−1} .        (3.6)
Like the Brusselator equation, this ODE system is sparse, i.e., the function evaluations only access a small, constant number of components of the argument vector. A detailed derivation of (3.6) together with initial and boundary conditions can be found in [8]. For the investigation of the locality behavior, the system size and the number of stages of the RK method should be as large as possible such that not all memory locations accessed during one time step can be stored in the cache and a good locality behavior is profitable. Therefore, we use the grid size N = 250 (resulting in 125000 components) together with DOPRI5(4) (7 stages) for BRUSS2D, and N = 15000 (resulting in 30000 components) together with DOPRI8(7) (13 stages) in the experiments with MEDAKZO. Refining the mesh size to increase the system size, however, also increases the degree of stiffness of the problems. Therefore, RK methods of lower order would be more suitable for a practical numerical solution. However, a high number of stages is required for the creation of large working sets (cf. Section 5), since the only alternative is to increase the system size and thus to reduce the mesh size even further.
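The limited access distance that the specialized implementations exploit is visible directly in the discretized right-hand side (3.4). The sketch below evaluates one grid point of BRUSS2D and is an illustration only; the interleaved data layout and the mirroring used for the Neumann boundary are assumptions of this sketch, not the paper's actual implementation.

    #include <vector>

    // Evaluates dU_ij/dt and dV_ij/dt for one grid point of BRUSS2D (3.4).
    // y is assumed to store U and V row by row: y[2*(i*N+j)] = U_ij,
    // y[2*(i*N+j)+1] = V_ij; out-of-range neighbors are mirrored back
    // into the grid as one possible realization of the Neumann condition.
    void bruss2d_point(int N, double alpha, const std::vector<double>& y,
                       int i, int j, double& dU, double& dV) {
        auto idx = [N](int a, int b) {
            if (a < 0) a = 1; if (a >= N) a = N - 2;   // mirror at the boundary
            if (b < 0) b = 1; if (b >= N) b = N - 2;
            return 2 * (a * N + b);
        };
        const double c = alpha * (N - 1) * (N - 1);
        const double u = y[idx(i, j)], v = y[idx(i, j) + 1];
        dU = 1.0 + u * u * v - 4.4 * u
             + c * (y[idx(i + 1, j)] + y[idx(i - 1, j)]
                  + y[idx(i, j + 1)] + y[idx(i, j - 1)] - 4.0 * u);
        dV = 3.4 * u - u * u * v
             + c * (y[idx(i + 1, j) + 1] + y[idx(i - 1, j) + 1]
                  + y[idx(i, j + 1) + 1] + y[idx(i, j - 1) + 1] - 4.0 * v);
    }

Each evaluation touches only the grid point itself and its four direct neighbors, i.e., components at most about 2N array positions away, which is the bounded access distance assumed by the pipelining scheme described in Section 4.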
4 Overview of the Implementations To develop efficient parallel implementations of embedded RK solvers, we derive several different implementation variants by applying modifications to the loop structure and taking the special dependences of particular ODE systems into account. The modifications of the loop structure increase the locality and allow a better utilization of the memory hierarchy. On shared-memory machines a higher locality usually increases the scalability since the memory operations issued by the processors can be satisfied from one of the local caches more often, and the traffic on the system interconnect is reduced. Additionally, the loop transformations may enable more efficient synchronization. The implementations investigated include general implementations that can be used with arbitrary ODEs, and pipelining implementations which are specialized in sparse ODE systems with a limited access distance of the function evaluations. General Implementations. The general implementations have been derived starting from the straightforward implementation (A) (Fig. 1) by applying loop transformation techniques such as loop interchange and loop fusion. Further, the data structures have been adapted to the modified loop structure. As a result, these transformations lead to different working spaces of the loops and thus change the locality behavior of the implementations. All general implementations have in common that the function evaluations of the right-hand-side function may access all components of the argument vector. Therefore, the parallelization of all general implementations is performed similarly to implementation (A) by inserting appropriate barriers at every stage to ensure that all components of the argument vector w are available before the function evaluation starts (Fig. 1).
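To give a flavor of the kind of transformation involved, the fragment below fuses the separate argument-vector loops of Fig. 1 into a single sweep over the vector components for one stage. It is a generic illustration of loop fusion on this computation, written as a self-contained sketch, and is not a reproduction of any particular variant examined in the paper.

    #include <vector>

    // Fused computation of the argument vector w for stage i (cf. Fig. 1):
    // the initialization, accumulation, and scaling loops over j are merged
    // so that each component w[j] is produced in a single pass.
    void compute_stage_argument(int i, double h,
                                const std::vector<std::vector<double>>& a,
                                const std::vector<std::vector<double>>& v,
                                const std::vector<double>& eta,
                                std::vector<double>& w,
                                std::size_t first, std::size_t last) {
        for (std::size_t j = first; j <= last; ++j) {
            double acc = a[i][1] * v[1][j];        // contribution of stage 1
            for (int l = 2; l < i; ++l)            // remaining previous stages
                acc += a[i][l] * v[l][j];
            w[j] = h * acc + eta[j];               // scale and add old approximation
        }
    }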
(b) Cache line size 1024 bytes
Fig. 2. Effect of different associativities on implementation (D) for BRUSS2D
Pipelining. The fact that for sparse ODE systems the function evaluations access only a small number of argument vector components within a bounded access distance enables us to perform a blockwise diagonal sweep across the stages rather than processing the stages successively [6]. Consequently, the working space of these implementations decreases to Θ(s2 b), where b is the blocksize (e.g., b = 2N for BRUSS2D, b ≥ 2 for MEDAKZO). In contrast, the working space of the general implementations consists of Θ(sn) vector components with n = 2N 2 for BRUSS2D and n = 2N for MEDAKZO. The parallelization of the pipelining implementations can be performed using mutex variables to synchronize neighboring processors when they are finalizing their pipelines. But, using a modified data distribution, it is possible to reduce the synchronization overhead such that only one barrier is required at each time step.
5 Experimental Results

To investigate the influence of different cache parameters on the locality behavior, we simulate a one-level unified cache and vary its parameters, i.e., the number of cache lines, the line size, and the associativity. In these experiments, starting at t = 0, we execute two time steps to warm up the cache, and then measure the number of cache misses which occur during the execution of three further time steps. The results of the simulation are verified by runtime experiments on a real Sun Fire 6800 machine, where we measure the execution times of the implementations and investigate the cache behavior using hardware performance counters.

Influence of Associativity and Line Size. Figure 2 shows the number of cache misses of one general implementation as a function of cache size for different associativities and two different line sizes when solving the BRUSS2D problem. As expected, the number of cache misses of our implementations can be reduced by increasing the associativity of the cache. This particularly holds for caches with large cache lines. However, if the cache is sufficiently large (the exact size depends on the line size used), increasing the associativity to more than two ways improves the number of cache misses only insignificantly.
Fig. 3. Locality behavior for BRUSS2D for varying cache sizes and cache line sizes using one processor with a 16-way set-associative cache [four panels: (a) Implementation (A), (b) Implementation (D), (c) Implementation (Dblock), (d) Implementation (PipeD); total number of cache misses vs. cache size for line sizes of 8 to 1024 bytes]
The influence of the cache line size can be observed from diagrams such as those shown in Fig. 3, where for the different implementations the number of cache misses as a function of cache size is displayed for varying line sizes. All of our implementations profit from large line sizes. In fact, we observed the lowest number of cache misses for the largest line size we used (1024 bytes). The reason is that the implementations spend the main part of their execution time processing vectors, i.e., consecutive memory regions, and can therefore take greater advantage of spatial locality if the line size is increased.

Working Sets. Figure 3 can also be used to identify the working sets [2] of the implementations, i.e., cache sizes where critical portions of an implementation's data set start fitting into the cache. The working sets correspond to the points of inflection of the curve representing the number of cache misses as a function of cache size and are best characterized by simulating a fully associative cache using a line size of one word.
However, in our simulations we use a setup closer to practical systems, i.e., a 16-way set-associative cache and line sizes between 8 bytes and 1024 bytes. The most significant working set of all implementations is found between 8 and 16 MB for BRUSS2D and between 2 and 4 MB for MEDAKZO. This working set corresponds to the data accessed in one time step and is therefore independent of the loop structure used to compute the time steps. In addition, depending on the implementation, one or two other working sets can be identified. The first starts at about 1 to 4 KB. Its exact size depends on the problem, the implementation, and the cache line size used. This working set is characterized by a very smooth change of the number of cache misses, resulting from the fact that the bounds of one loop, which occurs in all implementations, depend on the index of another loop. The pipelining implementations such as (PipeD) show another significant working set. Again, the exact size of this working set depends on the line size. For BRUSS2D, this working set starts between 128 KB and 512 KB. It results from the data accessed in the outer loop executed at every time step. The iterations of this loop, which runs over the elements of the ODE system, correspond to pipelining steps of the blockwise diagonal sweep across the stages.
Comparison of the Implementations. Figure 4 shows a comparison of the number of cache misses of several embedded RK implementations. Comparing the two general implementations (A) and (D), we observe that (D) usually attains a lower number of cache misses than (A) if the cache size exceeds a certain size (e.g., 4–16 KB). The improved locality behavior of (D) is the result of several loop transformations (e.g., interchange and fusion) aiming at the frequent re-use of the results of the function evaluations [11]. These transformations replace the characteristic working set of (A) by a new working set which is slightly larger, but once it starts fitting into the cache, it decreases the number of cache misses further than the working set of the non-tightly nested loop structure of (A). Our motivation for implementing (Dblock) was to use blocking (also called tiling) to combine the good properties of (A) and (D). Our simulations show that this approach can lead to a good locality behavior for large as well as small caches. For example, the simulation of the sequential solution of MEDAKZO (Fig. 4(c)) and the simulation of the parallel solution of BRUSS2D (Fig. 4(b)) show a number of cache misses close to (A) for cache sizes up to 16 KB, where (A) leads to fewer cache misses than (D). If the cache size is larger than 16 KB, the number of cache misses of (Dblock) is close to (D) and significantly lower than that of (A). But implementation (Dblock) is not always successful in reducing the cache misses. For example, it cannot reach the low number of cache misses of (D) in the simulation of the sequential solution of BRUSS2D (Fig. 4(a)). In the simulation of the parallel solution of MEDAKZO, (Dblock) even generates a higher number of cache misses than (A) (Fig. 4(d)). The pipelining implementation (PipeD) exploits the special access structure of the two problems by performing a blockwise diagonal sweep across the stages, i.e., the outer loop of the computational kernel of this implementation runs across the system dimension, and at every iteration of this loop (called a pipelining step) only a small part of the overall data set required to compute one time step is accessed. Therefore, this implementation shows a very low number of cache misses compared to the other implementations in most of our experiments if the cache size is large enough to hold the working set of one pipelining step.
Fig. 4. Comparison of the locality behavior of different implementations on different numbers of processors (line size 256 bytes) [four panels: (a) BRUSS2D on one processor, (b) BRUSS2D on eight processors, (c) MEDAKZO on one processor, (d) MEDAKZO on eight processors; total number of cache misses vs. cache size for (A), (D), (Dblock), (PipeD)]
Real System. To verify the results of our simulations, we have performed experiments on a real Sun Fire 6800 where we investigate the runtimes (Tab. 1) and the locality behavior (Tab. 2). The cache hierarchy of the real Sun Fire consists of a 64 KB 4-way set-associative data cache (DC) and a 32 KB 4-way set-associative instruction cache (IC) on the first level, and an 8 MB 2-way set-associative unified 'enhanced cache' (EC) on the second level. In these experiments, where we use the integration intervals [0, 0.5] for BRUSS2D and [0, 10^-4] for MEDAKZO, the runtime is often closely related to the number of cache misses. Thus the order of the implementations w.r.t. the runtime is often the same as the order w.r.t. the number of cache misses. But it is also very important how well the functional units of the processors can be exploited. Therefore, (A) can outperform the other implementations in the experiments using MEDAKZO on one processor even though it shows a worse locality.
Since the large L2 cache can store a large part of the data accessed during one time step, the order of the implementations w.r.t. the number of cache misses is sometimes different compared with our simulations. E.g., using BRUSS2D with N = 250, implementation (A) attains a lower number of cache misses than (D) and (Dblock). But if we use N = 750, thus increasing the data set of one time step, (D) and (Dblock) show fewer cache misses than (A). (Dblock) usually attains at least a similar number of cache misses as (D), but often the number of cache misses is significantly lower than that of (D). In all our experiments, (Dblock) obtains a lower execution time than (D). (PipeD) shows the best locality behavior in most of our experiments. Therefore, only (A) can obtain a better runtime than (PipeD) in some experiments, because the compiler can exploit the simpler source code structure of (A) to produce more efficient machine code.

Table 1. Execution time (in seconds) on a real Sun Fire 6800

Problem      BRUSS2D N = 250   BRUSS2D N = 750   MEDAKZO N = 15000
Processors       1       8         1       8          1        8
(A)           4.65    0.61     651.3    48.4      134.0     25.9
(D)           4.92    0.63     530.5    49.3      143.2     24.5
(Dblock)      4.79    0.62     517.0    48.8      137.0     23.8
(PipeD)       4.74    0.62     382.6    45.4      136.5     21.3

Table 2. Number of cache misses (in thousands) on a real Sun Fire 6800

Problem      BRUSS2D N = 250   BRUSS2D N = 750   MEDAKZO N = 15000
Cache           DC      EC        DC      EC         DC       EC
(A)            981     785     53562   32062        120       98
(D)           1303    1156     26477   25013         69       76
(Dblock)      1192    1244     26542   20389         30       36
(PipeD)        254     650      4138    6424         21       85
6 Related Work

Loop transformations to improve the locality of scientific computations have been investigated by several authors. For example, [10] proposes computation regrouping to increase temporal locality, and [12] suggests the use of Cache Miss Equations (CMEs) to guide tiling and padding transformations. [7] gives an overview of different cache optimization techniques. There exist several alternative simulation environments suitable for investigating program locality, e.g., full-system simulators such as SimpleScalar [1] and SimOS [5], and specialized cache simulators such as Dinero IV [3].
7 Conclusions

Simulation-based analysis enables us to investigate many interesting features of ODE solvers and other applications. Our experiments show that locality has a large influence on the execution speed of embedded RK implementations, particularly when executed in parallel. But other factors also influence the execution speed, such as synchronization and the utilization of functional units. Which implementation performs best for a given problem depends on many factors, e.g., the system dimension, the access structure of the function evaluations, the particular memory hierarchy of the target system, and the optimizing capabilities of the compiler. If the access structure of the function evaluations enables the use of the pipelining computation scheme, it can be used to break down the working set of the outermost loop
to fit into the cache. Otherwise, if a general implementation is required, we recommend using implementation (D) since its most significant working set reduces the number of cache misses further than the most significant working set of (A), and most modern microprocessors are equipped with L1 data caches of 64 KB or larger which usually can store the working set of (D). (Dblock) can run faster than (D) in many cases, but requires tuning of the blocksize.
References

1. T. Austin, E. Larson, and D. Ernst. SimpleScalar: An infrastructure for computer system modeling. Computer, 35(2):59–67, February 2002.
2. P. J. Denning. The working set model for program behavior. Communications of the ACM, 11(5):323–333, May 1968.
3. J. Edler and M. D. Hill. Dinero IV: Trace-driven uniprocessor cache simulator. http://www.cs.wisc.edu/~markhill/DineroIV/.
4. E. Hairer, S. P. Nørsett, and G. Wanner. Solving Ordinary Differential Equations I: Nonstiff Problems. Springer, Berlin, 2nd edition, 2000.
5. S. A. Herrod. Using Complete Machine Simulation to Understand Computer System Behavior. PhD thesis, Stanford University, February 1998.
6. M. Korch and T. Rauber. Scalable parallel RK solvers for ODEs derived by the method of lines. In Euro-Par 2003. Parallel Processing (LNCS 2790), pages 830–839. Springer, August 2003.
7. M. Kowarschik and C. Weiß. An overview of cache optimization techniques and cache-aware numerical algorithms. In Proceedings of the GI-Dagstuhl Forschungsseminar: Algorithms for Memory Hierarchies. Springer (LNCS 2625), 2003.
8. W. M. Lioen and J. J. B. de Swart. Test Set for Initial Value Problem Solvers, Release 2.1. CWI, Amsterdam, The Netherlands, September 1999.
9. P. S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hållberg, J. Högberg, F. Larsson, A. Moestedt, and B. Werner. Simics: A full system simulation platform. Computer, 35(2):50–58, February 2002.
10. V. S. Pingali, S. A. McKee, W. C. Hsieh, and J. B. Carter. Restructuring computations for temporal data cache locality. International Journal of Parallel Programming, 31(4):306–338, August 2003.
11. T. Rauber and G. Rünger. Optimizing locality for ODE solvers. In Proceedings of the 15th ACM International Conference on Supercomputing, pages 123–132. ACM Press, 2001.
12. X. Vera, N. Bermudo, J. Llosa, and A. Gonzalez. A fast and accurate framework to analyze and optimize cache memory behavior. ACM Transactions on Programming Languages and Systems (TOPLAS), 26(2):263–300, March 2004.
13. D. Wallin, H. Johansson, and S. Holmgren. Cache memory behavior of advanced PDE solvers. Technical Report 2003-044, Department of Information Technology, Uppsala University, August 2003.
A Novel Task Scheduling Algorithm for Distributed Heterogeneous Computing Systems

Guan-Joe Lai

Department of Computer and Information Science, National Tai-Chung Teachers College, Taichung, Taiwan, R.O.C.
[email protected]
Abstract. This paper proposes a novel task scheduling algorithm to exploit the potential of parallel processing, allowing for system heterogeneity in distributed heterogeneous computing environments. Its goal is to maximize parallelization and minimize communication. Because the algorithm avoids the max-min anomaly in the parallelization problem and exploits schedule holes, it can produce better schedules than those obtained by existing algorithms. Experimental results are presented to verify the preceding claims. Three comparative algorithms are applied to demonstrate the proposed algorithm's effectiveness. As the system's heterogeneity increases, the performance improvement of the proposed algorithm becomes more pronounced than that of the others. Therefore, the proposed scheduling algorithm may be used in designing efficient parallel environments for situations where the system heterogeneity is the performance bottleneck.
1 Introduction

This study proposes a novel task scheduling algorithm for distributed heterogeneous computing environments. A distributed heterogeneous computing system generally consists of a heterogeneous suite of machines, i.e., workstations or PCs. Although distributed heterogeneous computing systems offer significant advantages for high-performance computing, the effective use of heterogeneous resources still remains a major challenge. For effective utilization of diverse resources, an application can be partitioned into a set of tasks represented by an edge-weighted directed acyclic graph, such that each task can be scheduled to the best-suited machine to minimize the total completion time. Many heuristics have focused on solving the NP-complete problem [3] of efficiently scheduling tasks to heterogeneous computing systems, e.g., the generalized Dynamic Level Scheduling (DLS) algorithm [4], the Heterogeneous Earliest Finish Time (HEFT) [8] algorithm, the Critical Path on a Processor (CPOP) [8] technique, the Iso-Level Heterogeneous Allocation (ILHA) algorithm [2], and the Partial Completion Time (PCT) algorithm [11]. However, most of these algorithms focus solely on computational aspects, so that the communication may become the system bottleneck as the computational power increases, particularly while executing applications with huge communication requirements. These strategies may perform poorly in distributed heterogeneous computing systems, owing to the heterogeneity of computational power and of communication mechanisms.
A list-scheduling based algorithm, called the Dominant Tasks Scheduling (DTS) algorithm, is proposed here to exploit the potential of parallel processing, allowing for system heterogeneity and network bandwidth. In general, the list-scheduling approach assigns priorities to tasks and then allocates these tasks to machines in the order of their priorities to minimize a predefined cost function [18]. Most such approaches examine a ready task for scheduling only after all parents of the task have been scheduled. However, the operation of scheduling some ready tasks should not affect the strict reduction of the earliest starting time of the tasks on the critical path; otherwise, it may extend the overall parallel scheduling length. This situation is the so-called max-min anomaly in the parallelization problem [1]. Exploiting schedule holes [13] is another important issue in the scheduling problem. Schedule holes arise primarily because a task is scheduled after some other tasks with higher scheduling priorities, although it could be scheduled before these tasks without affecting their earliest starting times. The DTS algorithm avoids the max-min anomaly by taking account of partial global scheduling information [15,16], and exploits schedule holes at each scheduling step. Therefore, it can produce better schedules than those obtained by existing algorithms.
2 Dominant Tasks Scheduling Algorithm

Before introducing the proposed algorithm, related terminology is presented. A parallel program is represented as a directed acyclic graph (DAG). The DAG is defined as G = (N, E, W, C), where N is the set of tasks, E is the set of communication edges, W is the set of task computation volumes, and C is the set of communication volumes. The value of w_i ∈ W is the computation volume for n_i ∈ N. The value of c_ij ∈ C is the communication volume occurring along e_ij ∈ E. Suppose that a distributed heterogeneous computing system is represented as M = (P, Q, A, B), where P is the set of heterogeneous processor elements, Q is the set of communication channels, A = {α_i | i = 1, ..., |P|} is the set of execution rates, and B = {β_ij | i = 1, ..., |P|, j = 1, ..., |P|} is the set of transfer rates for the communication channels. Let p(n_i) ∈ P be the processor element to which n_i is allocated, α(p(n_i)) ∈ A be the execution rate of p(n_i), and w_i × α(p(n_i)) be the computation cost when the task n_i is allocated to p(n_i). When n_i is allocated to p(n_i) and n_j is allocated to p(n_j), let β(p(n_i), p(n_j)) ∈ B be the transfer rate from p(n_i) to p(n_j), and c_ij × β(p(n_i), p(n_j)) be the communication cost from n_i to n_j. Let est(n_i) be the earliest starting time at which n_i can start execution; the earliest completion time of n_i is then defined as ect(n_i) = est(n_i) + w_i × α(p(n_i)). The least completion time lct(n_i) = max_{n_j ∈ succ(n_i)} (c_ij × β(p(n_i), p(n_j)) + w_j × α(p(n_j)) + lct(n_j)) is the least execution time from this node to the sink node, where succ(n_i) is the set of immediate successors of n_i. Given a DAG and a system as described above, the scheduling problem is to obtain the minimal-length, non-preemptive schedule of the task graph on the distributed heterogeneous computing system. To prevent the max-min anomaly, the DTS algorithm monotonically reduces the intermediate scheduled length at each scheduling step to achieve a shorter final schedule length. The proposed algorithm identifies the dominant task during the scheduling progress. A dominant task belongs to the critical path which dominates the partial scheduled length during the scheduling process.
Therefore, the DTS algorithm can take advantage of partial global information to monotonically reduce the scheduled length. In each scheduling step, the DTS algorithm finds two ready candidates: the dominant task, n_dt, and the task, n_MaxPriority, with maximal scheduling priority. The dominant task is the one with the maximal value of est(n_dt) + w(n_dt) × α(p(n_dt)) + lct(n_dt); the scheduling priority function is max(lct(n_dt) − est(n_dt) − w(n_dt) × α(p(n_dt))). The DTS algorithm then schedules n_MaxPriority so as to minimize est(n_MaxPriority) under the restriction that est(n_dt) must not be delayed. In order to exploit schedule holes, the DTS algorithm also looks for a task, n_sh, which could be scheduled on p(n_MaxPriority) before n_MaxPriority, when the following conditions are satisfied: ready(p(n_MaxPriority)) ≤ est(n_sh) and ect(n_sh) ≤ est(n_MaxPriority). The DTS algorithm first finds the lct for each task and then proceeds to deal with the set of ready tasks. While the set of ready tasks is not empty, DTS determines n_dt and n_MaxPriority. If n_dt is n_MaxPriority, then DTS sets n_dt to be the candidate. If n_dt is not n_MaxPriority, then DTS checks whether n_dt and n_MaxPriority would be scheduled onto the same processor element. If they would be scheduled onto the same processor element, then DTS tries to avoid delaying the earliest starting time of n_dt. If they would not be scheduled onto the same processor element, then DTS sets n_MaxPriority to be the candidate. When the candidate is found, DTS tries to find a schedule hole before the candidate on the same processor element. If such a schedule hole exists, then DTS sets n_sh to be the candidate. Finally, DTS schedules the candidate to its corresponding processor element and updates the ready set. The DTS algorithm is listed as follows.
1.  Algorithm DTS
2.  Input: A system M = (P, Q, A, B); a DAG = (N, E, W, C).
3.  Output: A schedule with minimal parallel completion time.
4.  Finding lct for each node.
5.  Finding the ready nodes.
6.  While the set of ready nodes is not empty
7.      Finding n_dt and n_MaxPriority.
8.      If n_dt = n_MaxPriority then
9.          candidate = n_dt
10.     else
11.         If p(n_dt) = p(n_MaxPriority) then
12.             If scheduling n_MaxPriority would delay est(n_dt) then
13.                 Exiting while loop, and finding the next node with maximal scheduling priority.
14.             else
15.                 candidate = n_MaxPriority.
16.             end if
17.         else
18.             candidate = n_MaxPriority.
19.         end if
20.     end if
21.     For candidate, finding the task, n_sh.
22.     If n_sh exists then candidate = n_sh.
23.     Scheduling candidate to its corresponding processor element.
24.     Updating the set of ready nodes.
25. end while

In order to provide constant measures of the scheduling performance, the performance bound of the DTS algorithm is analyzed to evaluate the accuracy of the heuristic solution. According to a previous theorem [6], the performance bound of the DTS scheduling algorithm ensures its performance within a factor of two of the optimum for general directed acyclic task graphs, as shown in the following theorem.

Theorem 1. For any DAG G = (N, E, W, C) to be scheduled on a system M = (P, Q, A, B), the schedule length ω obtained by DTS always satisfies

    ω ≤ (2 − 1/|P|) ω_opt + c,    (2.1)

where ω_opt is the length of the optimal schedule and c is a constant value.

Proof. The time in (0, ω) can be partitioned into two sets S1 and S2. S1 is defined as the set of all points of time at which all processor elements are executing some tasks, and S2 is defined as the set of all points of time at which at least one processor element is idle. If S2 is empty, all processor elements complete their last assignment at ω and no idle interval can be found within (0, ω). The schedule is then indeed optimal and, thus, the theorem holds. Therefore, we assume that S2 is non-empty. Moreover, we also assume that S2 is the disjoint union of q open intervals: S2 = (I_l1, I_r1) ∪ (I_l2, I_r2) ∪ ... ∪ (I_lq, I_rq), where I_l1 < I_r1 < I_l2 < I_r2 < ... < I_lq < I_rq. Without loss of generality, we claim that a chain of tasks (i.e., X: n_1 → n_2 → ... → n_x) can be found after applying the DTS algorithm, such that the chain of tasks satisfies one of the following three cases: (a) st(n_x) ≤ I_l1; (b) st(n_x) ∈ S2, i.e., ∃ an integer h, h ≤ q, s.t. I_lh < st(n_x) < I_rh; (c) st(n_x) ∈ S1 but st(n_x) > I_l1, i.e., ∃ an integer h, h ≤ q − 1, s.t. I_rh ≤ st(n_x) ≤ I_l,h+1 or I_rq ≤ st(n_x), where n_x denotes the task that finishes in the DTS schedule at time ω, and st(n_x) denotes the starting time scheduled by DTS for n_x. If the task n_x is the entry task, the first possibility occurs; then the task n_x by itself constitutes a chain that satisfies our claim. If the task n_x is not the entry task, the second and third possibilities may occur. In the scheduling process, there is always some task n_g such that there exists a resource contention relationship with n_x; otherwise, DTS could advance the earliest starting time of the candidate, as shown in the algorithm steps described above.
Then there always exists an h which satisfies (b) or (c). The cycle can be repeated by considering the above-mentioned three possibilities until the starting time of the last added task satisfies (a). Finally, according to the theorem in [6], the performance bound of the DTS scheduling algorithm ensures its performance within a factor of two of the optimum for general directed acyclic task graphs.
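To make the two selection criteria used in each DTS scheduling step concrete, the following C fragment is a hedged sketch (not the author's implementation): it only computes the dominant task and the maximal-priority task from a list of ready tasks, assuming that est, lct, the computation volume w and the execution rate α of the tentative processor assignment have already been determined; the data structure and function names are illustrative.

typedef struct {
    double est;    /* earliest starting time on the tentative processor */
    double lct;    /* least completion time down to the sink node       */
    double w;      /* computation volume                                */
    double alpha;  /* execution rate of the tentative processor p(n)    */
} ReadyTask;

/* dominant task: maximal est + w*alpha + lct
 * (it dominates the partial scheduled length) */
int find_dominant(const ReadyTask *r, int nready)
{
    int best = 0;
    for (int i = 1; i < nready; i++) {
        double ki = r[i].est + r[i].w * r[i].alpha + r[i].lct;
        double kb = r[best].est + r[best].w * r[best].alpha + r[best].lct;
        if (ki > kb) best = i;
    }
    return best;
}

/* maximal scheduling priority: maximal lct - est - w*alpha */
int find_max_priority(const ReadyTask *r, int nready)
{
    int best = 0;
    for (int i = 1; i < nready; i++) {
        double ki = r[i].lct - r[i].est - r[i].w * r[i].alpha;
        double kb = r[best].lct - r[best].est - r[best].w * r[best].alpha;
        if (ki > kb) best = i;
    }
    return best;
}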
3 Experimental Results

Experimental results are presented to verify the preceding theoretical claims. Three algorithms (i.e., DLS [4], HEFT [8] and CPOP [8]) are applied to comparatively demonstrate the proposed algorithm's effectiveness. The scheduling performance of these algorithms is compared for different communication/computation ratios (CCRs). In order to evaluate these algorithms, six practical parallel applications are applied, including the fork tree, the join tree, the fork-join, the FFT, the Gaussian elimination, and the LU decomposition. The experimental results show the superiority of the DTS algorithm.
Fig. 1. Average scheduled lengths in six parallel applications [scheduled length vs. number of processor elements (PE #) 1–32 for DTS, HEFT, CPOP, DLS]
As the number of processor elements increases, the average scheduled lengths generated by DTS, HEFT and CPOP decrease for the six parallel applications, as shown in Fig. 1. However, the average scheduled length generated by DLS does not decrease gradually when the number of processor elements is larger than eight. This is because the DLS algorithm does not take the strict-reduction problem into consideration. When the communication/computation ratio is set to 0.05, 0.1, or 1, the average scheduled lengths generated by DTS, HEFT, CPOP and DLS for the six applications are shown in Fig. 2. When CCR = 0.05, the average scheduled lengths generated by DTS and HEFT are similar, except that DTS outperforms HEFT in terms of the schedule lengths for the FFT application, as shown in Fig. 2(a). When CCR = 0.1, the average scheduled lengths generated by DTS and HEFT are also similar; however, HEFT outperforms DTS in terms of the schedule lengths for the Fork-Join application, as shown in Fig. 2(b). As the communication/computation ratio increases to one, DTS outperforms HEFT for the bulk of the applications, as shown in Fig. 2(c). The variance of β indicates the heterogeneity of the communication mechanisms, and the variance of α indicates the heterogeneity of the computation power. As the value of the variance of β (or α) increases, the heterogeneity of the communication mechanisms (or computation power) becomes more pronounced, and vice versa.
Fig. 2. Average scheduled lengths in different CCRs [three panels: (a) CCR = 0.05, (b) CCR = 0.1, (c) CCR = 1; scheduled length for DTS, HEFT, CPOP, DLS on the FFT, G.E., LU, Fork, Join and F.J. applications]
Fig. 3. Average scheduled lengths in different variances of β [three panels: (a) variance of β is 10, (b) variance of β is 100, (c) variance of β is 200; scheduled length vs. PE# (2–32) for DTS, HEFT, CPOP, DLS]
In Fig. 3(a)-(c), the scales of the Y-axis are fixed from 0 to 120000. As the variance of β increases, the average scheduled lengths generated by the four algorithms also increase, as shown in Fig. 3. We have given a theoretical proof in our previous work [5] that the scheduling performance is affected by the heterogeneity of the computational power and that of the communication mechanisms, and Fig. 3 confirms this. [5] shows that the bounding margins of the scheduling performance are affected by the heterogeneity of the computational power and that of the communication mechanisms.
Fig. 4. Average scheduled lengths in different variances of α [three panels: (a) variance of α is 10, (b) variance of α is 100, (c) variance of α is 200; scheduled length vs. PE# (2–32) for DTS, HEFT, CPOP, DLS]
As either heterogeneity increases, the spread of the performance bounds also increases. Because the scheduled length generated by DLS exceeds the upper bound of the Y-axis when the variance of β is 200, the exceeding part is clipped. From Fig. 3, we can observe that as the system heterogeneity increases, the performance improvement of the DTS algorithm becomes more stable than that of the others. Similar behavior can also be observed in Fig. 4.
4 Conclusions

In this paper, we present a novel task scheduling algorithm, DTS, for distributed heterogeneous computing systems. The DTS algorithm takes advantage of partial global information to monotonically reduce the scheduled length and to exploit schedule holes in each scheduling step. The performance of the DTS algorithm is demonstrated by evaluating six practical application benchmarks. Experimental results show the superiority of our proposed algorithm over those presented in the previous literature, and that the scheduling performance is affected by the heterogeneity of the computational power, the heterogeneity of the communication mechanisms and the program structure of the applications. As the system heterogeneity increases, the performance improvement of the DTS algorithm becomes more stable than that of the others. Therefore, the proposed scheduling algorithm may be used in designing efficient parallel environments for situations where the system heterogeneity is the performance bottleneck.
Acknowledgements

This work was sponsored in part by the National Science Council of the Republic of China under contract number NSC92-2213-E-018-002.
References 1. B. Kruatrachue and T. G. Lewis. Grain Size Determination for Parallel Processing. IEEE Software, 23–32, 1988. 2. B. Olivier, B. Vincent and R. Yves. The Iso-Level Scheduling Heuristic for Heterogeneous Processors. Proc. Of 10th Euromicro Workshop on Parallel, Distributed and Network-based Processing, 2002. 3. E. G. Coffman and P. J. Denning, Eds. Operating Systems Theory, Englewood Cliffs, NJ:Prentice-Hall, 1973. 4. G.C. Shih and E.A. Lee. A Compile-Time Scheduling Heuristic for InterconnectionConstrained Heterogeneous Processor Architectures. IEEE Trans. on Parallel and Distributed Systems, 4(2):175–187, 1993. 5. Guan-Joe Lai. Scheduling Communication-Aware Tasks on Distributed Heterogeneous Computing Systems. International Journal of Embedded Systems, accepted to be appeared, 2004. 6. Guan-Joe Lai. Performance Analysis of Communication-Aware Task Scheduling Algorithms for Heterogeneous Computing. IEEE Proceedings of the 2003 Pacific Rim Conference on Communications, Computers and Signal Processing, 788–791, August 2003. 7. H. El-Rewini and T. G. Lewis. Scheduling parallel program tasks onto arbitrary target machines. Journal of Parallel Distribution Computing, 9(2):138–153, June 1990. 8. H. Topcuoglu, S. Hariri, and M.Y. Wu. Performance-Effective and Low-Complexity Task Scheduling for Heterogeneous Computing. IEEE Trans. on Parallel and Distributed Systems, 13(3):260–274, March 2002. 9. I. Ahmad and Y. Kwok. On Parallelizing the Multiprocessor Scheduling. IEEE Trans. on Parallel and Distributed Systems, 10(4):414–432, Apr. 1999. 10. M.A. Palis, J.-C. Liou, and D.S.L. Wei. Task Clustering and Scheduling for Distributed Memory Parallel Architectures. IEEE Trans. on Parallel and Distributed Systems, 7(1):46–55, Jan. 1996. 11. M. Maheswaran, S. Ali, H. J. Siegel, D. Hensgen, and R. F. Freund. Dynamic mapping of a class of independent tasks onto heterogeneous computing systems. Journal of Parallel Distribution Computing, 59(2):107–121, Nov. 1999. 12. R. Andrei and J.C. Arjan. Low-Cost Task Scheduling for Distributed-Memory Machines. IEEE Trans. on Parallel and Distributed Systems, 13(6), June 2002. 13. S. Selvakumar and C. Siva Ram Murthy. Scheduling Precedence Constrained Task Graphs with Non-Negligible Intertask Communication onto Multiprocessors. IEEE Trans. on Parallel and Distributed Systems, 5(3):328–336, 1994. 14. T. Braun, H.J. Siegel, N. Beck, L.L. Boloni, M. Maheswaran, A.I. Reuther, J.P. Robertson, M.D. Theys, B. Yao, D. Hengsen, and R.F. Freund. A Comparison Study of Static Mapping Heuristics for a Classes of Meta-Tasks on Heterogeneous Computing Systems. 1999 Proc. Heterogeneous Computing Workshop, 15–29, 1999. 15. T. Yang and A. Gerasoulis. DSC: Scheduling Parallel Tasks on an Unbounded Number of Processors. IEEE Trans. on Parallel and Distributed Systems, 5(9):951–967, Sep. 1994. 16. Y. Kwok and I. Ahmad. Dynamic Critical-Path Scheduling: An Effective Technique for Allocating Task Graphs onto Multi-Processors. IEEE Trans. on Parallel and Distributed Systems, 7(5):506–521, May 1996. 17. Y. Kwok and I. Ahmad. Static Scheduling Algorithms for Allocating Directed Task Graphs to Multiprocessors. IEEE Trans. on Parallel and Distributed Systems, 31(4):406–471, Dec. 1999. 18. Y. Kwok and I. Ahmad. Static Scheduling Algorithms for Allocating Directed Task Graphs. ACM Computing Surveys, 31(4):406–471, Dec. 1999.
Study of Load Balancing Strategies for Finite Element Computations on Heterogeneous Clusters

Kalyani Munasinghe1 and Richard Wait2

1 Dept. of Computer Science, University of Ruhuna, Sri Lanka
  [email protected]
2 Dept. of Scientific Computing, Uppsala University, Sweden
  [email protected]
Abstract. We study strategies for redistributing the load in an adaptive finite element computation performed on a cluster of workstations. The cluster is assumed to be a heterogeneous, multi-user computing environment. The performance of a particular processor depends both on static factors, such as the processor hardware, and on dynamic factors, such as the system load and the work of other users. On a network, it is assumed that all processors are connected, but the topology of the finite element sub-domains can be interpreted as a processor topology, and hence for each processor it is possible to define a set of neighbours. In finite element analysis, the quantity of computation on a processor is proportional to the size of its sub-domain plus some contribution from the neighbours. We consider schemes that modify the sub-domains by, in general, moving data to adjacent processors. Numerical experiments show the efficiency of the approach.
1 Introduction

The price-performance ratio of networks of workstations (NOWs) makes them attractive system architectures for parallel processing applications. Several projects have been initiated to deliver parallel supercomputing power by connecting large numbers of dedicated workstations. In addition, shared cluster networks of workstations also provide an attractive environment for parallel applications. The increasing availability of low-cost computers and networks has generated increased interest in the use of distributed systems as a parallel computational resource with which to solve large applications. Even though the demand for computing power is large, many existing systems are underutilized and a large amount of computing capacity remains unused. At any instant, on a particular network only a few servers may be working at capacity; most of the systems will not be. Therefore many computing cycles that could be utilised to fulfill the increasing computational demand are wasted. The search for ways to do more work with the available resources by handling dynamically changing workloads is the main aim of our work. Scheduling parallel applications in shared environments is different from scheduling in dedicated environments. The scheduling strategies of the dedicated environment depend on exclusively accessible, identical processing elements. Scheduling in a shared environment is more challenging, as the actual computing power available for parallel applications is dynamically changing. Some of the reasons are that the speeds of the machines
are different, machines may fail, and primary users generate workload which should be processed without interference from additional parallel computations. In a cluster environment, it is possible to obtain both high performance and high throughput if resources are allocated and managed efficiently. One of the problems in achieving high performance on clusters is that the resources cannot be controlled by an individual program. Usually, networks, computers and disks are shared among competing applications. When parallel programs execute in this type of environment, they compete for resources with other programs and have to cope with resource fluctuations during execution. It is necessary to adapt such applications to the changing system environment in order to achieve high performance. As the distributed resources cannot be controlled by a single global scheduler, the applications have to be scheduled by the application developer. It is also necessary to take into account both application-specific and dynamic system information when developing a schedule. Several factors, such as the type of resources allocated and the machine performance, can be considered. In this work, we examine how to adapt parallel applications to CPU sharing on workstation networks and clusters to achieve goals such as minimizing the execution time. For some irregular grid applications, the computational structure of the problem changes from one computational phase to another. For example, in an adaptive mesh, areas of the original graph are refined in order to model the problem accurately. This can lead to a highly localized load imbalance in the subdomain sizes. Alternatively, load imbalance may arise due to the variation of computational resources. For example, in a shared network of workstations, the computing power available for parallel applications is dynamically changing. The reasons may be that the speeds of the machines are different or that there are other users on some part of the cluster, possibly with higher priority. The partitioning has to be altered to obtain a balanced load. We propose an algorithm which reduces the load imbalance by local adjustments of the current loads, in order to reduce load spikes as quickly as possible and to achieve a load balance. It is assumed that the connections for data transfers between the processors are determined by the data locality, but data movement should be kept as low as possible. The load balance is adjusted in general by migrating data to adjacent processors, with modifications to the connectivity where necessary. On a homogeneous cluster of dedicated processors (e.g. Beowulf [3]) with a fixed problem size, the partition may be uniform and static.
2 Background

2.1 Some Definitions

Let p be the number of processors. The processor graph is represented by a graph (V, E) with |V| = p vertices and |E| edges. Two vertices i and j form an edge if processors i and j share a boundary of the partitioning. Hence the processor graph is defined by the topology of the data subdomains. As the edges of the processor graph are defined by the partitioning of the domain, the graph topology may change when the partition changes. Each vertex i is associated with a scalar l_i, which represents the load on processor i.
The total load is

    L = Σ_{i=1}^{p} l_i    (2.1)

The average load per processor is

    l̄ = L / p    (2.2)

and we can define the vector, b, of load imbalances as

    b_i = l_i − l̄    (2.3)

This definition is based on the assumption that, in order to achieve a perfectly balanced computation, all the loads should be equal. If, however, the processor environments are heterogeneous and corresponding to each processor there is a load index α_i which can be computed using current system and/or processor information, then the ideal load l̃_i is defined as

    l̃_i = (α_i / Σ_j α_j) L    (2.4)

The load difference from the ideal load can be defined as

    d_i = l_i − l̃_i    (2.5)
A processor is therefore overloaded if d_i > 0. These simple definitions assume that the computation can be broken down into a large number of small tasks, each of which can be performed on any processor at the same computational cost. This is not necessarily true: for example, in a finite element computation, the cost of the computation might depend on the number of edges between subdomains in addition to the cost proportional to the size of the subdomains. So the total distributed computational cost is not necessarily equal after a redistribution of the data. In order to reduce unnecessary data fragmentation, data will in general only be moved between contiguous subdomains. It is assumed that any processor is equally accessible from all other processors.
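For illustration, the following small C routine (a sketch, not taken from the paper) evaluates the definitions (2.4) and (2.5): given the current loads l[i] and the load indices alpha[i], it computes the ideal loads and the load differences; the array names are assumptions.

/* a processor i is overloaded if d[i] > 0 */
void ideal_loads(int p, const double *l, const double *alpha,
                 double *l_ideal, double *d)
{
    double L = 0.0, alpha_sum = 0.0;
    for (int i = 0; i < p; i++) {
        L += l[i];                 /* total load, eq. (2.1) */
        alpha_sum += alpha[i];
    }
    for (int i = 0; i < p; i++) {
        l_ideal[i] = alpha[i] / alpha_sum * L;   /* eq. (2.4) */
        d[i] = l[i] - l_ideal[i];                /* eq. (2.5) */
    }
}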
3 Existing Approaches

Some works [7] assume that the processors involved are continuously lightly loaded, but commonly the load on a workstation varies in an unpredictable manner. Weissman [11] worked on the problem of adapting data parallel applications in a shared, dynamic environment of workstation clusters and developed an analytical framework to compare different adaptation strategies. Several algorithms exist for scheduling parallel tasks. The Distributed Self Scheduling (DSS) [8] technique uses a combination of static and dynamic scheduling. During the initial static scheduling phase, p chunks of work are assigned to the p processors in the system. The first processor to finish executing its tasks from the static scheduling phase designates itself as the centralized processor; it stores the information about which tasks are yet to be executed and which processors are idle, and dynamically distributes the tasks to the processors as they become idle.
Bevilacqua [4] introduced a method to obtain load balancing through data assignment on a heterogeneous cluster of workstations. This method is based on a modified manager-workers model and achieves workload balancing by maximizing the useful CPU time for all the processes involved. The dynamic load balancing scheme for distributed systems in [6] considers the heterogeneity of processors by generating a relative performance weight for each processor. When distributing the workload among processors, the load is balanced proportionally to these weights. Many methods proposed in the literature to solve the load balancing problem are applicable to adaptive mesh computations. One of the earliest schemes was an iterative diffusion algorithm [5]. At each iteration, the new load is calculated by combining the original load and the load of the neighbouring processors. The advantage of this approach is that it requires only local communication, but its drawback is slow convergence. Horton [12] proposed a multilevel diffusion method based on recursively bisecting a graph into two subgraphs and balancing the load between the two subgraphs. This method assumes that the graph can be recursively bisected into two connected graphs. An alternative multilevel diffusion scheme is described in [13]. Walshaw [14] implemented a parallel partitioner and a direct diffusion partitioner that is based on the diffusion solver [15]. Several scratch-remap [9] and diffusion-based [10] adaptive partitioning techniques have also been proposed. These different approaches are suitable for different system environments and different computational environments. In our approach, we try to identify sharp increases of load and to reduce them as quickly as possible without necessarily achieving a perfect load balance.
4 Repartitioning with Minimum Data Movement

In this section, we describe our proposed approach. It operates on the processor graph which describes the interconnection of the subdomains of a mesh that has been partitioned and distributed among the processors. A processor is highly overloaded if the load difference is excessive, d_i > c·l_i (for some constant c < 1), and a partition is balanced if no processor is overloaded. We assume that an overloaded node initiates the load balancing operation whenever it detects that it is overloaded. One of the important features of our approach is to capture the need for the processor load to adapt very quickly to external factors.

4.1 Proposed Approach

Define: W := {nodes with zero load} = ∅ (initially empty)
Define: H = {overloaded nodes}
        H0 = {highly overloaded nodes}
– For each overloaded node, define its underloaded neighbours and a list of its underloaded neighbours of neighbours, so for each m ∈ H, N_m = {underloaded neighbours of m}.

The algorithm is explained in detail in the box below.
For each m ∈ H
    For each k ∈ N_m
        NN_k = {underloaded neighbours of k} \ N_m
        If NN_k ≠ ∅
            W_k = {nodes to be removed} ⊂ {NN_k ∪ {k}}
            If W_k ≠ ∅
                Redistribute load from W_k among neighbours
                Modify N_m by removing nodes in W_k and adding neighbours as appropriate
                W := W ∪ W_k
            endif
        endif
    endfor
endfor
For each m ∈ H
    ∀ i ∈ N_m move l_i − l̃_i from m to its neighbour i by diffusion
endfor
For each m ∈ H0
    Assign a set W_m ⊂ W of empty nodes to m
    Assign load l̃_i − l_i from node m to i ∈ W_m by repartitioning the remaining part of l_m into s + 1 "unequal" parts
endfor
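A minimal sketch of the diffusion step in the second loop above might look as follows; it is not the authors' implementation, and it caps each transfer by both the neighbour's deficit and the remaining surplus of m, which is one possible reading of the scheme. Only the scalar loads are adjusted; the selection of the actual elements to migrate is application specific and omitted, and all names are illustrative.

void diffuse_surplus(int m, int nneigh, const int *neigh,
                     double *l, const double *l_ideal)
{
    double surplus = l[m] - l_ideal[m];          /* d_m of the overloaded node */
    for (int k = 0; k < nneigh && surplus > 0.0; k++) {
        int i = neigh[k];
        double deficit = l_ideal[i] - l[i];
        if (deficit <= 0.0) continue;            /* neighbour is not underloaded */
        double amount = (deficit < surplus) ? deficit : surplus;
        l[m] -= amount;                          /* migrate 'amount' of load     */
        l[i] += amount;                          /* from m to neighbour i        */
        surplus -= amount;
    }
}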
5 Experimental Results

The experiments were performed on the problem of using the finite element method on an unstructured grid. Here we assumed that the computation is element based, so that the load to be redistributed can be considered as a repartitioning of the elements into subdomains. A test grid created using FEMLAB [2] was used in the experiments; Figure 1 shows the test grid used in the tests. Our proposed algorithm was implemented on 8 Sun workstations connected by a 100 Mb/s Ethernet. All Suns share a common file server and all files are equally accessible from each host due to the Network File System. MPI was the communication library and gcc was the C compiler used. In this work, we used a simple load sensor which used Unix commands to collect the system information. The load sensor calculated the work load of each machine. We used a combination of the processor speed and the work load of each machine as the load index (i.e. speed/work load). Figure 2 shows the speed of the machines, Figure 3 the work load of each machine collected by the load sensor, and Figure 4 the load index used in the experiments.
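As an example of how such a load index can be obtained, the following sketch computes speed/work load from a Linux-style /proc/loadavg; the paper's sensor used Unix commands on Sun workstations, so this is only an assumed stand-in.

#include <stdio.h>

double load_index(double processor_speed)
{
    double work_load = 0.0;
    FILE *f = fopen("/proc/loadavg", "r");
    if (f) {
        if (fscanf(f, "%lf", &work_load) != 1)   /* 1-minute load average */
            work_load = 0.0;
        fclose(f);
    }
    if (work_load <= 0.0)
        work_load = 1e-3;     /* avoid division by zero on an idle host */
    return processor_speed / work_load;
}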
Fig. 1. Test Grid

Fig. 2. Speed of processors [processor speed vs. processor number 1–8]

Fig. 3. Work load of processors [processor work load vs. processor number 1–8]
We ran the test without applying the load balancing approach and for 3 iterations with the load balancing approach. Table 1 shows the advantage of the load balancing approach, and Figure 5 shows how the computational work load is distributed in each iteration. We also tested how the performance varies when the load index varies; Table 2 gives a comparison of the timings.
Fig. 4. Load Index for each processor [load index vs. processor number 1–8]

Fig. 5. Load Distribution in each Iteration [panel title: Reduction of Spike; load per processor for the Initial Load and Iterations 1–3]

Table 1. Running Times in milliseconds

    Without Load Balance                16.7276
    With Load Balance   iteration 1      2.41231
                        iteration 2      2.21345
                        iteration 3      2.16542
Table 2. Different Load Index

    Load Index                          Run time in milliseconds
    Processor speed x (1/work load)      2.41483
    1/work load                          5.6781
    Available CPU                        6.2142
    Processor speed                     11.2134
6 Conclusions

In this paper, we have presented an approach to reduce sharp increases of load as quickly as possible. The experimental results show a performance improvement with this approach. They also show that the load index plays an important role. Our future work will focus on including more heterogeneous machines and larger real data sets in our experiments.
References

1. http://www.-cse.ucsd.edu/users/breman/apples.html/.
2. http://www.comsol.com.
3. http://www.beowulf.org/.
4. Alessandro Bevilacqua, A dynamic load balancing method on a heterogeneous cluster of workstations, Informatica 23 (1999), no. 1, 49–56.
5. G. Cybenko, Dynamic load balancing for distributed memory multiprocessors, Parallel and Distributed Computing 7 (1989), 279–301.
6. Zhilling Lan and Valerie E. Taylor, Dynamic load balancing of SAMR applications on distributed systems, Scientific Programming 10 (2002), no. 21, 319–328.
7. C. K. Lee and M. Hamdi, Parallel image processing application on a network of distributed workstations, Parallel Computing 26 (1995), 137–160.
8. J. Lin and V. A. Saletore, Self scheduling on distributed memory machines, Supercomputing (1993), 814–823.
9. L. Oliker and R. Biswas, Plum: Parallel load balancing for adaptive structured meshes, Parallel and Distributed Computing 52 (1998), no. 2, 150–177.
10. Kirk Schloegel, George Karypis, and Vipin Kumar, Multilevel diffusion schemes for repartitioning of adaptive meshes, Journal of Parallel and Distributed Computing 47 (1997), no. 2, 109–124.
11. Jon B. Weissman, Predicting the Cost and Benefit of Adapting Data Parallel Applications in Clusters, Parallel and Distributed Computing 62 (2002), 1248–1271.
12. G. Horton, A multilevel diffusion method for dynamic load balancing, Parallel Computing 9 (1993), 209–218.
13. Kirk Schloegel, George Karypis, and Vipin Kumar, Multilevel Diffusion Schemes for Repartitioning of Adaptive Meshes, Journal of Parallel and Distributed Computing 47 (1997), no. 2, 109–124.
14. C. Walshaw, M. Cross, and M. G. Everett, Parallel Dynamic Graph Partitioning for Adaptive Unstructured Meshes, Journal of Parallel and Distributed Computing 47 (1997), no. 2, 102–108.
15. Y. F. Hu and R. J. Blake, An improved diffusion algorithm for dynamic load balancing, Parallel Computing 25 (1999), no. 4, 417–444.
Parallel Algorithms for the Determination of Lyapunov Characteristics of Large Nonlinear Dynamical Systems

Günter Radons1, Gudula Rünger2, Michael Schwind2, and Hong-liu Yang1

1 Institute of Physics, Technical University Chemnitz, 09111 Chemnitz, Germany
  {radons,hya}@physik.tu-chemnitz.de
2 Department of Computer Science, Technical University Chemnitz, 09111 Chemnitz, Germany
  {ruenger,schwi}@informatik.tu-chemnitz.de
Abstract. Lyapunov vectors and exponents are of great importance for understanding the dynamics of many-particle systems. We present results of performance tests, on different processor architectures, of several parallel implementations for the calculation of all Lyapunov characteristics. For the most time-consuming reorthogonalization steps, which have to be combined with molecular dynamics simulations, we tested different parallel versions of the Gram-Schmidt algorithm and of QR decomposition. The latter gave the best results with respect to runtime and stability. For large systems the blockwise parallel Gram-Schmidt algorithm yields comparable runtime results.
1 Introduction

The Lyapunov characteristics of many-particle systems, consisting of Lyapunov exponents and vectors, have found much interest recently [1,2]. Especially the dynamics of the Lyapunov vectors, which may show collective wave-like behavior, provides a challenge for current research [3,4,5,6,7,8,9,10,11]. In this paper we therefore consider means for effectively calculating the full spectrum of Lyapunov exponents and the associated Lyapunov vectors for high-dimensional, continuous-time nonlinear dynamical systems. The standard method of [12] is employed to calculate these quantities for our many-particle (N = 100 − 1000) Lennard-Jones system in d dimensions. A system of 2dN × 2dN linear and 2dN nonlinear ordinary differential equations has to be integrated simultaneously in order to obtain the dynamics of 2dN offset vectors in tangent space and the reference trajectory in phase space, respectively. For the calculation of the Lyapunov exponents and vectors the offset vectors have to be reorthogonalized periodically using either Gram-Schmidt orthogonalization or QR decomposition. To obtain scientifically useful results one needs particle numbers in the above range and long integration times for the calculation of certain long-time averages. This enforces the use of parallel implementations of the corresponding algorithms. It turns out that the repeated reorthogonalization is the most time-consuming part of the algorithm. The algorithmic challenge for the parallelization lies on the one hand in the parallel implementation of the reorthogonalization algorithm itself. On the other hand the results of the latter have to be fed back to the molecular dynamics integration routine, providing new initial conditions for the next reorthogonalization step, and so on. This alternation between reorthogonalization and molecular dynamics integration requires in
addition an optimal balancing of computation and communication on distributed memory machines or clusters of SMPs. As the parallel reorthogonalization procedure we have realized and tested several parallel versions of Gram-Schmidt orthogonalization and of QR factorization based on blockwise Householder reflection. The parallel version of classical Gram-Schmidt (CGS) orthogonalization is enriched by a reorthogonalization test which avoids a loss of orthogonality by dynamically using iterated CGS. All parallel procedures are based on a 2-dimensional logical processor grid and a corresponding block-cyclic data distribution of the matrix of offset vectors. Row-cyclic and column-cyclic distributions are included due to parameterized block sizes, which can be chosen appropriately. Special care was also taken to provide a modular structure and the possibility of including efficient sequential basic operations, such as those from BLAS, in order to exploit the processor or node architecture efficiently. For comparison we consider the standard library routine for QR factorization from ScaLAPACK. The interface between the parallel reorthogonalization and the integration procedure guarantees an invariant data distribution, which may require an automatic redistribution, so that different parallel orthogonalizations can be used within the integration process to adapt the parallel efficiency to the specific hardware properties. Performance tests are performed on a Beowulf cluster, a cluster of dual Xeon nodes, and an IBM Regatta p690+. Our emphasis is on a flexible combination of parallel reorthogonalization procedures and molecular dynamics integration that allows an easy restructuring of the parallel program. The goal is to exploit the characteristics of the processors or nodes and of the interconnection network of the parallel hardware effectively by choosing an efficient combination of basic routines and parallel orthogonalization algorithm, so that the computation of Lyapunov spectra and Lyapunov vectors can be performed in the most efficient way, in order to be able to simulate very large problems. The rest of the paper is organized as follows. Section 2 presents the physical problem of determining Lyapunov characteristics. Section 3 describes the parallel implementation of orthogonalization algorithms. Section 4 presents corresponding runtime results on different parallel machines and Section 5 concludes.
2 Determining Lyapunov Characteristics The equations of motion for a many-body system may always be written as a set of first-order differential equations Γ̇(t) = F(Γ(t)), where Γ is a vector in the D-dimensional phase space. The tangent space dynamics describing infinitesimal perturbations around a reference trajectory Γ(t) is given by

δΓ̇ = M(Γ(t)) · δΓ    (2.1)

with the Jacobian M = dF/dΓ. The time-averaged expansion or contraction rates of δΓ(t) are given by the Lyapunov exponents. For a D-dimensional dynamical system there exist in total D Lyapunov exponents for D different directions in tangent space. The orientation vectors of these directions are the Lyapunov vectors e^(α)(t), α = 1, . . . , D. In practice the Lyapunov vectors (LVs) e^(α)(t) and Lyapunov exponents λ^(α) (LEs) are obtained by the so-called standard method consisting of repeated Gram-Schmidt
orthogonalization of an evolving set of D perturbation vectors δΓ(t) [12]. The time evolution of a matrix E(t) composed of the D perturbation vectors is written as E(t_{i+1}) = L · E(t_i), where L is the propagator of δΓ. The computation of LEs and LVs can be presented as:

Algorithm 1. Computation of all the LEs and LVs of a dynamical system
  Initialize E0 as a D × D identity matrix
  Initialize LE vector as a zero D-dimensional vector
  for i = 1 to m iterations
      Ei = Li · E_{i-1}        (integration by MD simulation)
      Ei = Qi · Ri             (compute the QR factorization of Ei)
      LE vector = LE vector + log(diag(|Ri|))
      for j = 1 to D
          LVj = j-th row of Qi
      end
      Ei = Qi
  end
  LE vector = LE vector / m iterations
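As an illustration of Algorithm 1, the following minimal NumPy sketch evolves the offset matrix and reorthogonalizes it by QR in every iteration. The callback propagate(E) stands in for the coupled MD/tangent-space integration over one reorthogonalization interval (the application of the propagator L_i); it is an assumption of this sketch, not part of the paper's code, and the Lyapunov vectors are kept here as the columns of Q.

```python
import numpy as np

def lyapunov_spectrum(propagate, D, m_iterations):
    """Standard method (Algorithm 1): repeatedly propagate D offset
    vectors and reorthogonalize them with a QR factorization.
    `propagate(E)` must return L_i @ E for the next interval."""
    E = np.eye(D)                        # E_0: D x D identity
    le = np.zeros(D)                     # accumulated log stretching factors
    for _ in range(m_iterations):
        E = propagate(E)                 # E_i = L_i E_{i-1} (MD + tangent space)
        Q, R = np.linalg.qr(E)           # E_i = Q_i R_i
        le += np.log(np.abs(np.diag(R)))
        E = Q                            # restart from the orthonormal frame
    # Divide additionally by the reorthogonalization interval to obtain
    # exponents per unit time.
    return le / m_iterations, E
```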
3 Parallel Realization The entire parallel algorithm consists of integration steps which are periodically interrupted in order to apply a reorthogonalization of the offset matrix. Both parallel parts are based on distributions of the matrix, which differ according to efficiency needs. A natural data distribution in the integration part is a column-block-cyclic distribution. In the orthogonalization part, however, a different distribution may lead to the fastest algorithm, so that a redistribution of data is needed when combining both parts. The redistribution of data is performed by an interface routine. The communication overhead of the redistribution can outweigh the advantage of a faster orthogonalization, so that a suboptimal orthogonalization can lead to the best overall parallel runtime.
3.1 Parallel Orthogonalization Algorithms
As described, orthogonalization plays an important role in the interplay with the molecular dynamics simulation. To perform the reorthogonalization efficiently in such a context, we have implemented and compared several parallel algorithms for Gram-Schmidt orthogonalization and for QR decomposition. The basis for the parallel algorithms is a two-dimensional block-cyclic data distribution and a logical two-dimensional processor grid. The parameters for the block size and the number of processors in each matrix dimension can be chosen according to the needs of the algorithms. For use within the program for the determination of Lyapunov characteristics, the following three parallel Gram-Schmidt versions and two QR decompositions have been implemented.
– The Parallel-Classical-Gram-Schmidt-Algorithm (PCGS) is a parallel implementation of the classical Gram-Schmidt algorithm (CGS) [13]. Due to the structure of the Gram-Schmidt algorithm and the distribution of the matrix, the R-matrix is available in transposed form, distributed on the processor grid. The PCGS algorithm may have the disadvantage that the calculated Q-matrix is not always orthogonal for matrices with high condition numbers [14].
– The Iterated-Parallel-Classical-Gram-Schmidt-Algorithm (PICGS) has a similar structure to PCGS but includes an iterative reorthogonalization which avoids the disadvantage of PCGS. The need for a repeated orthogonalization is determined by a test described in [15].
– The Parallel-Blockwise-Classical-Gram-Schmidt-Algorithm (PBCGS) is a block version of the CGS algorithm (Algorithm 2.4 in [16]) which uses highly optimized, cache-efficient matrix-matrix operations (BLAS [17,18] Level 3 kernels).
– The Parallel-QR-Algorithm (PQR) is a parallel implementation of the QR algorithm with Householder reflectors. The parallel implementation consists of two steps. First the R-matrix and the Householder reflectors are calculated. In the second step the Q-matrix is constructed.
– The Parallel-Blockwise-QR-Algorithm (PBQR) is a parallel QR algorithm with Householder reflectors [19,20,21] which computes column blocks instead of single columns. As in the PQR algorithm, the R-matrix is calculated before the matrix Q is formed in the second step. For the blocking, the WY representation for products of Householder vectors from [21] is used, which allows the use of cache-efficient matrix-matrix operations. A description of the optimizations of the algorithm is given in Section 3.3.

The parallel algorithms are mainly derived from the sequential algorithms by substituting the building blocks (vector, matrix-vector, or matrix-matrix operations) with parallel equivalents which work over the block-cyclic distribution on the logical processor grid. All parallel programs are designed in an SPMD programming style and use the Message Passing Interface (MPI) standard for communication [23]. The Gram-Schmidt algorithms construct the R-matrix column by column. The decision to apply a second orthogonalization step in the PICGS algorithm is made in one processor column of the processor grid and is then broadcast within the processor rows. The difference between our algorithm and the ScaLAPACK routine PDGEQRF lies in the fact that our algorithm caches some calculations in the first stage to reuse them in the second, which is not done in ScaLAPACK [24].

3.2 Interface to Orthogonalization Routines
To include the parallel orthogonalization modules in the entire program we have developed a generic interface routine which calls the different orthogonalization algorithms. The routine is able to redistribute the input matrix to the data distribution format needed by the integration part of the program. The data distribution formats differ in the block sizes and the number of processors in each matrix dimension. The distribution in the orthogonalization part is adapted to use the underlying hardware platform as efficiently as possible. As input data this routine requires the matrix to be orthogonalized
and a number of parameters which describe the distribution of the input matrix and the distribution needed for the calculation. The routine outputs the orthogonalized matrix and, on request, the R-matrix or the diagonal elements of the R-matrix.
3.3 Optimizations for the PBQR-Algorithm
As mentioned above, the PBQR-Algorithm uses the WY representation for products of Householder matrices described in [21], which has the form I − V T V^T. Here V is a trapezoidal matrix with ones on its diagonal and T is an upper triangular matrix. The block representation is used both for computing the R-matrix and the Q-matrix. Thus, in contrast to the ScaLAPACK implementation, we avoid a second computation of the T-matrix by storing it for the computation of the Q-matrix. A straightforward parallelization of the blocked QR algorithm (like PBQR) with distributed BLAS operations (e.g. PBLAS) would result in a load imbalance because of idle times for some processors: while one column of the two-dimensional processor grid computes the block of Householder reflectors with Level 2 BLAS operations, the processors in the other processor columns wait for the result of this computation before they can update the remaining matrix. As an optimization we have realized a more flexible computation scheme on the logical processor grid which makes an overlapping of communication and computation possible. A similar scheme has been presented in [25] for a ring of processors. The more flexible implementation requires a non-blocking broadcast operation, which is not available in the MPI standard. While the JUMP architecture provides an MPE_Ibcast, we have implemented non-blocking broadcast operations on the two other architectures by using non-blocking send-receives.
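To make the WY form concrete, here is a small sequential NumPy sketch (not the parallel PBQR code itself) that factors one tall column block with Householder reflectors and accumulates V and T such that H_1 · · · H_b = I − V T V^T; the function names are ours and chosen for illustration only.

```python
import numpy as np

def householder(x):
    """Return (v, beta) such that (I - beta * outer(v, v)) @ x is a
    multiple of the first unit vector."""
    v = x.astype(float).copy()
    normx = np.linalg.norm(x)
    if normx == 0.0:
        return v, 0.0
    alpha = -np.copysign(normx, x[0])
    v[0] -= alpha
    return v, 2.0 / np.dot(v, v)

def block_wy(A_block):
    """Blocked QR of a tall m x b block: returns V, T, R such that
    Q = I - V @ T @ V.T equals H_1 ... H_b and Q.T @ A_block has R
    in its leading b rows (zeros below)."""
    m, b = A_block.shape
    R = A_block.astype(float).copy()
    V = np.zeros((m, b))
    T = np.zeros((b, b))
    for j in range(b):
        v, beta = householder(R[j:, j])
        R[j:, j:] -= beta * np.outer(v, v @ R[j:, j:])   # apply H_j to the block
        V[j:, j] = v
        if j == 0:
            T[0, 0] = beta
        else:
            # extend T so that I - V T V^T stays the product H_1 ... H_j
            T[:j, j] = -beta * (T[:j, :j] @ (V[:, :j].T @ V[:, j]))
            T[j, j] = beta
    return V, T, np.triu(R[:b, :])
```

A quick check of the sketch: with Q = np.eye(m) - V @ T @ V.T, the product Q.T @ A_block reproduces R up to rounding and Q is orthogonal, which is exactly the property the blocked update of PBQR relies on.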
4 Performance Results The left column of Figure 1 shows the runtime plots for the entire molecular dynamics simulation on the three architectures listed in Table 1. The PBQR algorithm outperforms the Gram-Schmidt versions on the CLIC and XEON clusters and has a similar runtime on the JUMP architecture. The moderate increase in runtime of PICGS indicates a good condition of the matrix of offset vectors. The PQR algorithm gives the best performance on the CLIC architecture for the column-block-cyclic distribution which is used by the integration part. The speedup of the program is mainly limited by the orthogonalization routine, as can be seen from the right column of Figure 1: the percentage of runtime used by the orthogonalization part increases with higher numbers of processors. Since the molecular dynamics part scales well with higher numbers of processors, the total runtime decreases even if the runtime of the orthogonalization algorithm does not decrease or even increases slightly. Test runs with redistribution show that the overhead of redistributing the input and output matrices limits the advantage of the optimal processor grid for small matrices and small numbers of processors. Figure 2 shows a comparison of parallel runtimes for distributions on different processor grids: a column-block-cyclic distribution where all
Table 1. Systems used for runtime measurements

Name | Processor                       | Communication         | BLAS
CLIC | Pentium III @ 800 MHz           | LAM-MPI over Ethernet | libgoto for PIII [22]
XEON | dual Xeon @ 2 GHz               | Scali over SCI        | libgoto for PIV [22]
JUMP | 32x Power4+ @ 1.7 GHz (1 node)  | Shmem MPI             | essl
processors form one row of the form (1 × P), a row-block-cyclic distribution where all processors are in one column of the form (P × 1), and a block-cyclic distribution in both dimensions on a square grid of P processors. The distributions 1 × P and P × 1 are not optimal and are outperformed by the square processor grid layout. An analysis of our algorithms in terms of block sizes for the block-cyclic distribution demonstrates that the PCGS algorithm has a minimum runtime for a column block size of one. The blocked algorithms PBCGS and PBQR show a behavior similar to that of Figure 3, which shows the runtimes on the XEON cluster for a 1200 × 1200 matrix on a 4 × 4 processor grid. This figure shows a strong runtime increase for small column block sizes. The runtime of these two algorithms is not very sensitive with respect to a change of the row block size. The optimal column block size for the blocked algorithms decreases with an increasing number of processor-grid columns when the number of processors per row is held constant. Figure 4 compares the runtime of the different QR algorithms with Householder reflectors with the ScaLAPACK implementation consisting of one call to PDGEQRF (computing the R-matrix) and PDORGQR (forming the Q-matrix). The measurement was done with 16 processors in the grid configurations (1 × 16), (2 × 8), (4 × 4), (8 × 2) and (16 × 1). In Figure 4 only the best runtime measured for a specific matrix dimension is plotted. Figure 4 shows that our algorithms are comparable in runtime on the CLIC architecture and outperform the ScaLAPACK implementation on the JUMP and XEON clusters by up to 35 and 46 percent, respectively. The best runtimes have been measured with the PBQR with the optimization described in Subsection 3.3 (PBQR2 in Figure 4).
5 Conclusion We have developed and tested several parallel implementations for the calculation of all Lyapunov characteristics, vectors and exponents, of large dynamical systems. The latter are of great importance for an understanding of the dynamics of many-particle systems. It turned out that in the interplay between molecular dynamics (MD) integration and repeated reorthogonalization the latter is the critical factor, which for large systems limits the overall performance. Among the five tested reorthogonalization algorithms the Parallel-Blockwise-QR algorithm based on Householder reflectors (PBQR) gave the best results with respect to runtime and stability. It also makes the most effective use of the column-cyclic data distribution generated naturally by the MD algorithm. For large systems the blockwise parallel Gram-Schmidt algorithm yields comparable results. With such an implementation the investigation of Lyapunov instabilities in 2- and 3-dimensional systems with several hundred particles becomes possible.
Fig. 1. Parallel runtime per iteration step of the molecular dynamics simulation including reorthogonalization (left column) for the CLIC, XEON, and JUMP systems (from the top). The right column shows the corresponding percentages of runtime used for the orthogonalization part alone. The measurements are made on the architectures listed in Table 1. The experiments have been performed with a 600×600 matrix of offset vectors per run. The orthogonalization algorithm works without redistribution
Acknowledgment We thank the National Supercomputer Centre John von Neumann-Institute for Computing (NIC) in Jülich, Germany, for providing access to the Jump Cluster. We gratefully acknowledge financial support from the DFG within Sonderforschungsbereich 393 Parallele Numerische Simulation für Physik und Kontinuumsmechanik.
Fig. 2. Runtime for a single orthogonalization for the PBCGS-algorithm on CLIC for a 3000×3000 matrix

Fig. 3. Runtime on the XEON-Cluster for a 4×4 processor grid with a total of 16 processors for the PBQR-algorithm
Fig. 4. Parallel runtime of the QR-Algorithms with Householder-Reflectors in isolation, without the simulation and redistribution parts, for the block-cyclic distribution, shown in dependence of the matrix dimension (panel A: CLIC, panel B: JUMP, panel C: XEON). The measurement was done with random square matrices on 16 processors. The ScaLAPACK algorithm consists of a call to PDGEQRF and PDORGQR
References
1. P. Gaspard, Chaos, Scattering, and Statistical Mechanics (Cambridge University Press, Cambridge, 1998).
2. J.P. Dorfman, An Introduction to Chaos in Nonequilibrium Statistical Mechanics (Cambridge University Press, Cambridge, 1999).
3. H.A. Posch and R. Hirschl, "Simulation of Billiards and of Hard-Body Fluids", in Hard Ball Systems and the Lorenz Gas, edited by D. Szasz, Encyclopedia of the Mathematical Sciences 101, p. 269 (Springer, Berlin, 2000).
4. C. Forster, R. Hirschl, H.A. Posch and Wm.G. Hoover, Physica D 187, 294 (2004).
5. H.A. Posch and Ch. Forster, Lyapunov Instability of Fluids, in Collective Dynamics of Nonlinear and Disordered Systems, p. 301, Eds. G. Radons, W. Just, and P. Häussler (Springer, Berlin, 2004).
6. J.-P. Eckmann and O. Gat, J. Stat. Phys. 98, 775 (2000).
7. S. McNamara and M. Mareschal, Phys. Rev. E 64, 051103 (2001); M. Mareschal and S. McNamara, Physica D 187, 311 (2004).
8. A. de Wijn and H. van Beijeren, Goldstone modes in Lyapunov spectra of hard sphere systems, nlin.CD/0312051.
9. T. Taniguchi and G.P. Morriss, Phys. Rev. E 65, 056202 (2002); ibid. 68, 026218 (2003).
10. H.L. Yang and G. Radons, Lyapunov instability of Lennard-Jones fluids, nlin.CD/0404027.
11. G. Radons and H.L. Yang, Static and Dynamic Correlations in Many-Particle Lyapunov Vectors, nlin.CD/0404028.
12. G. Benettin, L. Galgani and J.M. Strelcyn, Phys. Rev. A 14, 2338 (1976).
13. J. Stoer and R. Bulirsch, Introduction to Numerical Analysis, Springer, 2002.
14. Å. Björck, Numerics of Gram-Schmidt Orthogonalization, Linear Algebra Appl., 197-198:297-316, 1994.
15. L. Giraud, J. Langou and M. Rozložník, On the round-off error analysis of the Gram-Schmidt algorithm with reorthogonalization, www.cerfacs.fr/algor/reports/2002/TR_PA_02_33.pdf.
16. D. Vanderstraeten, An accurate parallel block Gram-Schmidt algorithm without reorthogonalization, Numer. Linear Algebra Appl., 7(4):219-236, 2000.
17. J. Dongarra, J. Du Croz, I. Duff, S. Hammarling and R.J. Hanson, An Extended Set of Fortran Basic Linear Algebra Subroutines, ACM Trans. Math. Soft., 14(1):1-17, March 1988.
18. J. Dongarra, J. Du Croz, I. Duff and S. Hammarling, A Set of Level 3 Basic Linear Algebra Subprograms, ACM Trans. Math. Soft., 16(1):1-17, March 1990.
19. G.H. Golub and C.F. Van Loan, Matrix Computations, Johns Hopkins University Press, 1996.
20. C.H. Bischof and C. Van Loan, The WY Representation for Products of Householder Matrices, SIAM Journal on Scientific and Statistical Computing, 8:s2-s13, 1987.
21. R. Schreiber and C. Van Loan, A Storage-Efficient WY Representation for Products of Householder Transformations, SIAM Journal on Scientific and Statistical Computing, 10:53-57, 1989.
22. K. Goto and R. van de Geijn, On Reducing TLB Misses in Matrix Multiplication, FLAME Working Note #9, The University of Texas at Austin, Department of Computer Sciences, Technical Report TR-2002-55, Nov. 2002.
23. Message Passing Interface Forum: MPI: A Message-Passing Interface Standard, Technical Report UT-CS-94-230, 1994.
24. L.S. Blackford et al., ScaLAPACK Users' Guide, Society for Industrial and Applied Mathematics, Philadelphia, PA, 1997.
25. K. Dackland, E. Elmroth, and B. Kågström, A Ring-Oriented Approach for Block Matrix Factorizations on Shared and Distributed Memory Architectures, Proceedings of the Sixth SIAM Conference on Parallel Processing for Scientific Computing, 330-338, Norfolk, 1993. SIAM Publications.
Online Task Scheduling on Heterogeneous Clusters: An Experimental Study
Einar M.R. Rosenvinge, Anne C. Elster, and Cyril Banino
Norwegian University of Science and Technology (NTNU), Department of Computer and Information Science (IDI)
[email protected], {elster,banino}@idi.ntnu.no
Abstract. This paper considers efficient task scheduling methods for applications on heterogeneous clusters. The Master/Worker paradigm is used, where the independent tasks are maintained by a master node which hands out batches of a variable number of tasks to requesting worker nodes. The Monitor strategy is introduced and compared to other strategies suggested in the literature. Our online strategy is especially suitable for heterogeneous clusters with dynamic loads.
1 Introduction In today's international high-performance computing arena, there is a clear trend from traditional supercomputers towards cluster and Grid computing solutions. This is mainly motivated by the fact that clusters can typically be constructed at a cost that is modest compared to that of traditional supercomputers with equivalent computing power. The operation, use and performance characteristics of such clusters are, however, significantly different from those of traditional supercomputers. For instance, clusters typically have a much slower communication medium between nodes (e.g. a high-latency, low-bandwidth interface such as Ethernet, or the faster Myrinet). Clusters also give rise to some challenges not typically found on traditional supercomputers. A cluster may be a heterogeneous environment, meaning that its nodes may have different performance characteristics. Also, if the nodes composing a cluster are not dedicated, the cluster will be a dynamic environment, because its nodes may have a non-negligible background processing load. These challenges imply that one needs an adequate scheduling strategy to get good performance on a cluster. Our work aims at developing effective scheduling strategies for clusters for the class of applications that fall into the Master/Worker paradigm. These applications can be divided into a large number of independent work units, or tasks. There is no inter-task communication, so the tasks can be computed in any order. Finally, the tasks are atomic, i.e. their computation cannot be preempted. Many applications can be parallelized in such a way, including matrix multiplication, Gaussian elimination, image processing applications such as ray-tracing [1], and Monte Carlo simulations [2]. The rest of this paper is organized as follows: The test-case application and the test-bed cluster are described in Section 2. In Section 3, previous scheduling strategies, as well as our scheduling strategy, are presented. Section 4 exposes implementation-specific issues, and Section 5 discusses empirical results from our work. Finally, conclusions and suggestions for future work are provided in Section 6.
2 Framework 2.1 Test-Case Application: Matched Filtering The test-case application used in this study is an image filtering application known as matched filtering [3]. The application has been developed by O. C. Eidheim, a PhD student at IDI/NTNU, who was looking for speed improvements, which gave us direct access to a developer of the application. The matched filtering application is used in medical imaging in order to detect blood vessels in Computed Tomography (CT) images. A CT image is a cross-section of the human body. By using this application to detect blood vessels in multiple adjacent CT images, one is able to construct a 3D representation of the blood vessels in the human body. The input to the application is a grayscale image, see Fig. 1 (a), which is filtered through an image correlation step. The correlation kernel that is used is a Gaussian hill, which is rotated in all directions and scaled to several sizes. For more detailed information about this filtering technique, see [3]. The noise in the input image makes the blood vessel identification quite challenging. After filtering, see Fig. 1 (b), the noise has been removed and the blood vessels are identifiable.
Fig. 1. Image before filtering (a), and after filtering (b). CT image courtesy of Interventional Center, Rikshospitalet, Oslo, Norway
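As a rough illustration of the correlation step described above, the following NumPy/SciPy sketch correlates the image with an elongated, zero-mean Gaussian kernel rotated to several orientations and keeps the maximum response per pixel. The kernel shape, sizes and parameter values are illustrative assumptions and not the ones used in [3] or in the application itself.

```python
import numpy as np
from scipy.ndimage import correlate, rotate

def matched_filter(image, sigma=2.0, length=15, n_angles=12):
    """Correlate `image` with a zero-mean, elongated Gaussian kernel at
    several orientations; return the per-pixel maximum response."""
    x = np.arange(length) - length // 2
    profile = np.exp(-x**2 / (2.0 * sigma**2))      # Gaussian cross-section
    kernel = np.tile(profile, (length, 1))          # elongated "Gaussian hill"
    kernel -= kernel.mean()                         # zero mean for matched filtering
    response = np.full(image.shape, -np.inf)
    for k in range(n_angles):
        rk = rotate(kernel, 180.0 * k / n_angles, reshape=False)
        response = np.maximum(response, correlate(image.astype(float), rk))
    return response
```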
Since the input image can be divided into tasks corresponding to different parts (lines, columns, blocks), each node can process one or more tasks and thus produce the corresponding parts of the output image. This application therefore parallelizes easily in a homogeneous and static environment. However, in the heterogeneous and/or dynamic environment provided by most of today's clusters, parallelizing this application efficiently is more complicated. 2.2 Test-Bed Platform: Athlon-Based Cluster ClustIS is a fairly homogeneous cluster composed of 38 nodes, with AMD Athlon XP/MP CPUs at clock frequencies of 1.4 to 1.66 GHz and 0.5 to 2 GB of RAM. A few of the nodes
are dual-CPU nodes. The nodes are connected through 100 Mbit switched Ethernet. The operating system is Linux, the MPI implementation is MPICH 1.2.5.2, and the queuing system is OpenPBS. On ClustIS, data storage is provided by one node, Story, which provides home directories through NFS. Consequently, all disk I/O from the nodes goes through this slow Ethernet interface. One solution could be to use local disk I/O instead. However, the scattering of input data and gathering of output data would add to the total application execution time, so regardless, input and output data would have to travel through the network. Nevertheless, we were able to demonstrate some I/O parallelism on this cluster. In fact, we got more or less linear speedup when reading data concurrently from up to 8 processes. This indicates that having the worker nodes read their part of the data themselves will be faster than having the master scatter and gather data to/from workers. Writing data in parallel also gave a significant speedup compared to centralized writing, but the speedup was not quite as linear. See [4] for details.
3 Scheduling Master/Worker Applications On a cluster, each processor might have very different performance characteristics (heterogeneity), as well as varying background workloads (dynamism). To a certain degree, heterogeneity can be handled through the job scheduler, by requesting processors with a certain CPU frequency. Such functionality, however, is not available with many job scheduling systems. Dynamism, however, cannot be handled through a job scheduler. The background processing load of the processors is unknown before the computation starts, and might vary throughout the computation. This must therefore be handled by the scheduling strategy used. 3.1 Previous Scheduling Strategies All popular scheduling strategies give out batches of tasks to workers, but since the workers might have different and possibly varying processing speeds, giving only one batch to each worker might lead to non-equal finish times for the workers. To compensate for this, some strategies give batches to workers in several rounds. In the following, N denotes the total number of tasks, p denotes the number of workers (processors), and R denotes the number of remaining unassigned tasks on the master at a given time. The Static Chunking (SC) strategy [1] assigns one batch of N/p tasks to each worker. At the other end of the spectrum is the Self Scheduling (SS) strategy [1], where tasks are handed out one by one. The Fixed-Size Chunking (FSC) strategy uses batches of tasks of one fixed size, and it is possible to approximate the optimal batch size [5]. The Guided Self Scheduling (GSS) strategy [6] gives each worker batches of size R/p. GSS thus uses exponentially decreasing batch sizes. The Trapezoid Self-Scheduling (TSS) strategy [7] also uses decreasing batch sizes, but the batch sizes decrease linearly from a first size f to a last size l. They advocate the use of f = N/(2p) and l = 1. The Factoring (Fac.) and Weighted Factoring (WF) strategies also use decreasing batch sizes. At each time
step, half of the remaining tasks are given out. The WF strategy works by assigning a weight to each processor corresponding to the computing speed of the processor before the computation starts, and allocates tasks based on these weights in every round [1,8]. 3.2 The Monitor Strategy The Monitor strategy is fairly similar to Weighted Factoring (WF), which assigns tasks to workers in a weighted fashion in each round, where each worker has a static weight. For WF, this weight has to be computed in advance, before the actual computations start, which is a disadvantage in a dynamic environment. The Monitor strategy, however, performs such benchmarking online throughout the computation and thus uses dynamic weights, which also allows for good performance in a truly dynamic environment. The strategy uses an initialization phase and several batch computation phases. During the initialization phase, workers request tasks from the master, which are handed out one by one. Workers measure the time it takes to compute their respective task, and report these timings to the master when they request another task. When all workers have reported their task computation times, the initialization phase is done. Formally, let xi be the number of tasks that worker wi will be given in the current batch computation phase, and yi be the number of uncomputed tasks queued by worker wi. Let ti denote the time taken to process one task, and Ti denote the time taken for worker wi to finish the current phase. Recall that R denotes the number of unassigned tasks held by the master, and p denotes the number of workers. In a batch computation phase, the master starts by solving the following system of equations:

(1)  Ti = (yi + xi) × ti    for all i ∈ [0, p)
(2)  Ti = Ti−1              for all i ∈ [1, p)
(3)  Σ_{i=0}^{p−1} xi = R/2
In a given phase, worker wi receives xi tasks, which are added to the yi uncomputed tasks already stored in its local task queue. It will hence finish its execution of the current phase at time Ti = (yi + xi) × ti (equation 1). For the total execution time to be minimized, all workers must finish their computations simultaneously, hence Ti = Ti−1 for all i ∈ [1, p) (equation 2). This condition has been proved in the context of Divisible Load Theory [9]. The sum of all xi is equal to R/2, meaning that R/2 tasks are allocated during the current phase (equation 3). It has been found experimentally [8] that handing out half of the remaining tasks in each round gives good performance. Throughout the computation, workers periodically (i.e. every r computed tasks) report their task computation times ti and the number of uncomputed tasks yi waiting in their local task queues to the master. Hence the master is continuously monitoring the worker states. Consequently, after the first batch computation phase, there is no need for another initialization phase, since the master has up-to-date knowledge of the performance of the workers. Note that the parameter r must be tuned for the application and cluster in use. As soon as a worker is done with its local tasks, a request is sent to the master, which then enters the next computation phase. A new system of equations is solved with the last up-to-date values of Ti and yi.
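The system above can be solved in closed form: equating all finish times to a common value T and imposing the sum constraint gives T = (R/2 + Σ yi) / Σ (1/ti) and xi = T/ti − yi. A minimal Python sketch of this allocation step, including the removal of workers whose share would be negative, is given below; the function name and the rounding-free real-valued shares are our own simplifications, not the paper's implementation.

```python
import numpy as np

def monitor_batch_sizes(t, y, R):
    """Allocate R/2 tasks so that the predicted finish times
    (y_i + x_i) * t_i are equal for all participating workers.
    Workers whose share would be negative are dropped and the
    reduced system is solved again, as in the Monitor strategy."""
    t = np.asarray(t, dtype=float)   # last reported per-task times
    y = np.asarray(y, dtype=float)   # uncomputed tasks queued per worker
    x = np.zeros(len(t))
    active = np.ones(len(t), dtype=bool)
    while True:
        T = (R / 2.0 + y[active].sum()) / (1.0 / t[active]).sum()
        share = T / t[active] - y[active]
        if (share >= 0).all():
            x[active] = share
            return x                 # real-valued; round to integers in practice
        idx = np.where(active)[0]
        active[idx[share < 0]] = False
        x[:] = 0.0
```

With the step-1 timings of Fig. 3 (t = 0.10, 0.56, 0.89, 0.75), y = 0 and R = 504, this closely reproduces the first Monitor batch 177, 32, 20, 23 shown in Fig. 2 (up to integer rounding).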
Throughout the computation, yi has a great significance. Suppose that at phase k, worker wi has no external load, and can thus supply a large amount of computing power to our application. The master will then delegate a large number of tasks to wi. Suppose now that during phase k, wi receives a large external load, slowing down its task execution rate. At the end of phase k, worker wi will still have a lot of uncomputed tasks. The master has up-to-date knowledge of this, and allocates only a few (or no) new tasks to wi in phase k + 1. Note that if some workers are slowed down drastically, the above system of equations may yield negative xi values. Since the Monitor strategy does not consider withdrawing tasks from workers, the corresponding equations are removed and the linear system is solved once more, thus distributing the tasks among the remaining workers. This process is repeated until the solution yields no negative xi values. The task computation time ti reported by worker wi will typically be the mean value of its η last computation times. Having η = 1 might give a non-optimal allocation, since the timings can vary a lot in a dynamic environment. At the other end of the spectrum, a too high value of η conceals changes in processing speeds, which is also non-optimal. The parameter η needs to be adjusted for the individual application and/or environment. Fig. 2 shows the allocated batch sizes for the scheduling strategies described in Section 3.1 as well as the Monitor strategy, when the processors report the task computation times shown in Fig. 3. Note that for the Monitor strategy, we assume yi = 0 at the beginning of every phase, meaning that all the processors have computed all their assigned tasks from the previous phase.
Strategy | Batch sizes
SC       | 128 128 128 128
SS       | 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ...
GSS      | 128 96 72 54 40 30 23 17 13 9 7 5 4 3 2 2 1 1 1 1 1 1 1
TSS      | 64 60 56 52 48 44 40 36 32 28 24 20 8
Fac.     | 64 64 64 64 32 32 32 32 16 16 16 16 8 8 8 8 4 4 4 4 2 2 2 2 1 1 1 1 1 1 1 1
WF       | 180 32 20 24 90 16 10 12 45 8 5 6 22 4 3 3 11 2 1 2 6 1 0 1 3 1 0 0 1 1 0 0 1 0 0 0 1 0 0 0
Monitor  | 1 1 1 1 1 1 1 1 177 32 20 23 73 27 12 14 11 23 13 16 4 7 14 6 6 3 3 4 4 1 1 1 0 2 1 1 1 2 1 1

Fig. 2. Batch sizes for various scheduling strategies with N = 512 tasks and p = 4 workers
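The deterministic sequences in Fig. 2 are easy to regenerate. The sketch below produces the SC, GSS and Factoring rows (the weighted and online strategies additionally depend on the reported timings); it is our own illustration, using floor/ceiling rounding, which matches the figure for N = 512 and p = 4.

```python
def static_chunking(N, p):
    """SC: one batch of N/p tasks per worker."""
    return [N // p] * p

def guided_self_scheduling(N, p):
    """GSS: each request gets R/p of the remaining R tasks (at least 1)."""
    batches, R = [], N
    while R > 0:
        b = max(R // p, 1)
        batches.append(b)
        R -= b
    return batches

def factoring(N, p):
    """Factoring: in each round, half of the remaining tasks are split
    evenly over the p workers (rounded up, at least 1 per batch)."""
    batches, R = [], N
    while R > 0:
        b = max(-(-R // (2 * p)), 1)     # ceil(R / (2p))
        for _ in range(p):
            b_eff = min(b, R)
            if b_eff == 0:
                break
            batches.append(b_eff)
            R -= b_eff
    return batches

# guided_self_scheduling(512, 4) -> [128, 96, 72, 54, 40, 30, 23, 17, 13, 9, ...]
```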
4 Implementation 4.1 Data Staging In order to improve I/O parallelism, the worker nodes read the necessary input data from the NFS disk themselves, much like the data staging technique presented in [10].
Processor | Task computation times at time steps 1-9
1 | 0.10 0.15 1.01 0.90 0.28 0.29 0.99 0.90 0.89
2 | 0.56 0.40 0.50 0.48 0.52 0.53 0.47 0.49 0.50
3 | 0.89 0.90 0.89 0.24 0.67 0.88 0.60 0.66 0.63
4 | 0.75 0.76 0.74 0.50 0.45 0.70 0.69 0.63 0.62

Fig. 3. Examples of task computation times for 4 processors at 9 time steps throughout a computation. Note that for WF, the times at step 1 are used as weights
The master receives short requests from workers and answers these requests with a short message containing only a pointer to the input data to be fetched, thus circumventing the master bottleneck. However, because our test-case application has such a high computation-to-I/O ratio, our experiments showed that data staging did not have a great performance impact for this application [4]. Data staging is nevertheless a valuable technique for applications whose computation-to-I/O ratio is lower. 4.2 Multithreaded Processes In order to avoid processor idleness, we decided to implement a multithreaded approach. Every worker process is composed of three threads: a main thread for communicating with the master, a thread for reading input data from disk, and a thread for computing output data. The main thread requests tasks and buffers them in a task queue until the number of tasks buffered is above a user-defined threshold φ. It then goes to sleep and wakes up when the number of tasks in the queue falls below φ. The goal of the main thread is to keep φ tasks queued at all times. The input reader thread fetches tasks from the task queue and, if the queue is empty, sleeps while waiting for a task. Once the task queue is non-empty, the input reader thread reads and stores input data, then adds pointers to the data locations to the input data queue, until the number of input data units in the queue is above φ. It then goes to sleep, and wakes up when the number of input data units in the queue falls below φ. The procedure is then repeated. The computer thread works in much the same way as the input reader thread. See [4] for details. The threshold φ regulates how soon the workers will request tasks from the master. Intuitively, φ = 1 might be non-optimal, since the task computer thread might become idle while the main thread is waiting for the master to allocate a task. Note that since each worker has two queues of size φ, it buffers 2φ tasks. The master process is also multithreaded, with one thread continuously probing for, receiving and sending messages to/from workers using MPI, one thread executing the scheduling strategy in use, and one worker thread computing tasks on the master processor. The MPI thread terminates as soon as all tasks have been allocated, but until that point it consumes quite a lot of CPU time that could have been used by the worker thread. This means that for the worker thread on the master, a high φ value is optimal, since the workers will request tasks quickly and the MPI thread will terminate early. This
is a side effect of our non-optimal master architecture, since the MPI thread consumes unnecessary CPU power. One possible optimization would be to merge the thread for communicating through MPI and the thread for executing the scheduling strategy, but this would lead to non-modular code. Another possible optimization would be to use two threads calling MPI, one for receiving and one for sending, but this is impossible with the non-thread-safe MPICH library we had at our disposal. For more on this side effect, see [4].
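For illustration, a minimal Python sketch of the three-thread worker organization of Section 4.2 is shown below, with the two queues bounded by φ so that roughly φ tasks and φ input units stay buffered. The callbacks request_task, read_input and compute are assumptions standing in for the MPI request to the master, the NFS read, and the filtering step; a task value of None signals termination.

```python
import queue
import threading

def run_worker(request_task, read_input, compute, phi):
    """Three-thread worker: task prefetching, input reading and computing
    are decoupled by two bounded queues of size phi."""
    tasks = queue.Queue(maxsize=phi)     # task queue (main thread -> reader)
    data = queue.Queue(maxsize=phi)      # input data queue (reader -> computer)

    def main_thread():                   # keeps about phi tasks buffered
        while True:
            task = request_task()        # ask the master for another task
            tasks.put(task)              # blocks once phi tasks are queued
            if task is None:
                return

    def input_reader():                  # keeps about phi inputs buffered
        while True:
            task = tasks.get()
            if task is None:
                data.put(None)
                return
            data.put(read_input(task))   # blocks once phi inputs are queued

    def computer():                      # consumes input data continuously
        while True:
            item = data.get()
            if item is None:
                return
            compute(item)

    threads = [threading.Thread(target=f) for f in (main_thread, input_reader, computer)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
```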
5 Empirical Results and Analysis The implemented scheduling strategies were compared for different values of φ on our dedicated cluster which is a static environment [4]. Our goal was to find the best scheduling strategy combined with the optimal parameters φ, η and r for the test-case application running on the test-bed cluster. Note that for the Monitor strategy, we experimentally found r = 4 and η = 20 to be optimal values for our application and cluster [4], and these values have been used in the following experiments. The results from our experiments are shown in Fig. 4; for more experiments, see [4].
Fig. 4. Comparison of scheduling strategies (SC, SS, GSS, TSS, Fac., WF, Monitor) with increasing φ values in a static environment; the plot shows the total application running time in seconds against φ
These experiments were conducted using 8 nodes, and an image of size 2048 × 2048 pixels decomposed into 1024 blocks, each block corresponding to a task. One interesting finding is that the Static Chunking strategy performs linearly better when using a larger φ value. When φ increases, the workers request new tasks earlier, hence causing the termination of the master MPI thread earlier. This frees resources for the worker thread
on the master, and thus makes it process tasks faster. One might argue that the MPI thread on the master should have been terminated early regardless of φ, since SC only uses one round of allocation, but in order to be fair we kept the same master implementation for all the scheduling strategies. Consequently, all scheduling strategies must take into account the worker thread on the master, which is very slow compared to the dedicated workers. Therefore, SC is a bad choice in a heterogeneous environment, while it is good in a homogeneous environment, as shown when φ = 64. The SS, Fac., WF and Monitor strategies all perform quite similarly. The reason why we get better results with φ = 32 than with φ = 3 is the same as for the SC case: the master MPI thread is stopped earlier, and we have one faster worker for the rest of the computation. With φ > 32, the Monitor strategy performs very badly. One possible explanation is that during the initialization phase, the master computing thread is very slow and will be given a small number of tasks, less than φ. Consequently, the master computing thread will constantly request more tasks. As a result, the scheduling thread will solve a lot of unnecessary systems of equations, further slowing down the computing thread. Nevertheless, it should be noted that using very high φ values prevents good load balancing, since in order to allocate tasks when the workers need them (or slightly before, to avoid idle time), φ must be kept relatively low. It is quite surprising that the Self-Scheduling strategy, which has the highest amount of communication of all strategies, is among the very fastest scheduling strategies. A possible explanation is that our multithreaded implementation is able to hide the communication delays, and that our application has a high computation-to-I/O ratio. However, our environment is relatively homogeneous and static, and we expect the Monitor strategy to outperform SS in a strongly heterogeneous and dynamic environment. Fig. 5 shows speedup results. Note that using e.g. 2 processors means using the master with its separate worker thread and 1 dedicated worker. With a 512×512-pixel image, the speedup drops significantly when using more than 4 processors. This is due to using a suboptimal task size for this relatively small image size [4]. For the larger image sizes, the speedup increases when adding more processors. Intuitively, this comes from the fact that the relatively slow worker thread of the master processor plays a smaller role when adding more processors. With a sufficiently large image and 8 or more processors, we have close to linear speedup.
6 Conclusions and Future Work A novel online scheduling strategy has been designed and implemented. The Monitor strategy was experimentally compared to implementations of six other popular scheduling strategies found in the literature. Experiments show that the Monitor strategy performs excellently compared to these strategies, and should be especially well suited for dynamic environments. The Monitor strategy implementation involved multi-threading and data staging, two techniques that decrease processor idle time and increase master utilization. Our test-case application, the matched filtering algorithm, has a very high computation-to-I/O ratio, and consequently data staging is probably unnecessary for this application. Experimental tests on our cluster show, however, that data staging is a valuable technique for
Fig. 5. The speedup of the application for image sizes 512x512, 1024x1024 and 2048x2048, plotted against the number of processors (2-16) together with the ideal speedup
applications whose computation-to-I/O ratio is lower. The multithreaded implementation of the worker processes, using task queuing mechanisms, is able to hide communication delays and keep the workers processing data continuously. The excellent performance of both the Self-Scheduling and the Monitor strategy substantiates this. This work could be extended in the following directions. First, running more experiments on other clusters which provide more heterogeneity and more dynamism would enable us to measure the full potential of the Monitor strategy. Second, the Monitor strategy has three user-specifiable parameters, r, η and φ. A way to determine the optimal values of these parameters online would be desirable. It might be that an optimal solution in a heterogeneous environment necessitates different values of these parameters for different workers. Third, the optimal task size used in this study has been found experimentally, but it would be desirable to determine it online; two procedures for doing this are discussed in [4]. Finally, a thread-safe MPI library would enable us to implement the master process differently [4], which would increase the performance.
References 1. Hummel, S.F., Schonberg, E., Flynn, L.E.: Factoring: A method for scheduling parallel loops. Comm. of the ACM 35 (1992) 90–101 2. Basney, J., Raman, R., Livny, M.: High Throughput Monte Carlo. In: Proceedings of the Ninth SIAM Conference on Parallel Processing for Scientific Computing. (1999) 3. Chaudhuri, S., Chatterjee, S., Katz, N., Nelson, M., Goldbaum, M.: Detection of blood vessels in retinal images using two-dimensional matched filters. IEEE Trans. on Med. Imaging 8 (1989) 263–269
4. Rosenvinge, E.M.R.: Online Task Scheduling On Heterogeneous Clusters: An Experimental Study. Master's thesis, NTNU (2004) http://www.idi.ntnu.no/~elster/students/ms-theses/rosenvinge-msthesis.pdf
5. Kruskal, C.P., Weiss, A.: Allocating independent subtasks on parallel processors. IEEE Trans. on Software Eng. 11 (1985) 1001-1016
6. Polychronopoulos, C.D., Kuck, D.J.: Guided self-scheduling: A practical scheduling scheme for parallel supercomputers. IEEE Trans. on Comp. 36 (1987) 1425-1439
7. Tzen, T.H., Ni, L.M.: Dynamic loop scheduling for shared-memory multiprocessors. In: Proc. of the 1991 Int'l Conference on Parallel Processing, IEEE Computer Society (1991) II247-II250
8. Hummel, S.F., Schmidt, J., Uma, R.N., Wein, J.: Load-sharing in heterogeneous systems via weighted factoring. In: Proceedings of the eighth annual ACM symposium on Parallel algorithms and architectures, ACM Press (1996) 318-328
9. Bharadwaj, V., Ghose, D., Mani, V., Robertazzi, T.G.: Scheduling Divisible Loads in Parallel and Distributed Systems. Computer Society (1996)
10. Elwasif, W., Plank, J.S., Wolski, R.: Data staging effects in wide area task farming applications. In: IEEE Int'l Symposium on Cluster Computing and the Grid, Brisbane, Australia (2001) 122-129
A Parallel Method for Large Sparse Generalized Eigenvalue Problems by OmniRPC in a Grid Environment
Tetsuya Sakurai 1,3, Kentaro Hayakawa 2, Mitsuhisa Sato 1, and Daisuke Takahashi 1
1 Department of Computer Science, University of Tsukuba, Tsukuba 305-8573, Japan, {sakurai,msato,daisuke}@cs.tsukuba.ac.jp
2 Doctoral Program of Systems and Information Engineering, University of Tsukuba, [email protected]
3 Core Research for Evolutional Science and Technology (CREST)
Abstract. In this paper we present a parallel method for finding several eigenvalues and eigenvectors of a generalized eigenvalue problem Ax = λBx, where A and B are large sparse matrices. A moment-based method is used to find all of the eigenvalues that lie inside a given domain. In this method, a small matrix pencil that has only the desired eigenvalues is derived by solving large sparse systems of linear equations constructed from A and B. Since these equations can be solved independently, we solve them on remote hosts in parallel. This approach is suitable for master-worker programming models. We have implemented and tested the proposed method in a grid environment using a grid RPC (remote procedure call) system called OmniRPC. The performance of the method was evaluated on PC clusters used over a wide-area network.
1 Introduction The generalized eigenvalue problem Ax = λBx, where A and B are n × n real or complex matrices, arises in many scientific applications. In such applications the matrix is often large and sparse, and we may need to find a number of the eigenvalues that have some special property, together with their corresponding invariant subspaces. Several methods for such eigenvalue problems build sequences of subspaces that contain the desired eigenvectors. Krylov subspace based techniques are powerful tools for large-scale eigenvalue problems [1,2,9,10]. The relations among Krylov subspace methods, the moment-matching approach and Padé approximation are shown in [2]. In this paper, we present a parallel method for finding several eigenvalues and eigenvectors of a generalized eigenvalue problem. A moment-based method is used to find all of the eigenvalues that lie inside a given domain. In this method, a small matrix pencil that has only the desired eigenvalues is derived by solving large sparse linear equations constructed from A and B. Because these equations can be solved independently, we solve them on remote hosts in parallel. In this approach, we do not need to
exchange data between remote hosts. Therefore, the presented method is suitable for master-worker programming models. Moreover, the method has a good load balancing property. We have implemented and tested our method in a grid environment using a grid RPC (remote procedure call) system called OmniRPC [8,13]. OmniRPC is a grid RPC system that allows seamless parallel programming in both cluster and grid environments and supports remote hosts with "ssh", as well as a grid environment with Globus. The performance of the presented method on PC clusters used over a wide-area network was evaluated. As a test problem, we used the matrices that arise in the calculation of the electronic state of molecular hydrogen. The results show that the method is efficient in a grid environment.
2 A Moment-Based Method Let A, B ∈ C^{n×n}, and let λ1, . . . , λd (d ≤ n) be the finite eigenvalues of the matrix pencil A − λB. The pencil A − λB is referred to as regular if det(A − λB) is not identically zero for λ ∈ C. For nonzero vectors u, v ∈ C^n, we define f(z) := u^H (zB − A)^{−1} v. The function f(z) is analytic when zB − A is nonsingular. Suppose that m distinct eigenvalues λ1, . . . , λm are located inside a positively oriented closed Jordan curve Γ. We consider the problem of finding all of the poles of f(z) inside Γ. Define the complex moments

μk := (1 / (2πi)) ∮_Γ (z − γ)^k f(z) dz,  k = 0, 1, . . . ,    (2.1)

where γ is located inside Γ. Let the m × m Hankel matrices Hm and Hm^< be Hm := [μ_{i+j−2}]_{i,j=1}^m and Hm^< := [μ_{i+j−1}]_{i,j=1}^m. Then we have the following theorem ([12]).
Theorem 1. Let the pencil A − λB be regular. Then the eigenvalues of the pencil Hm^< − λHm are given by λ1 − γ, . . . , λm − γ.

Therefore, we can obtain the eigenvalues λ1, . . . , λm by solving the generalized eigenvalue problem Hm^< x = λHm x, where x ∈ C^m. If we find an appropriate Γ that includes a small number of eigenvalues, then the derived eigenproblem is small. This approach is based on the root finding method for an analytic function proposed in [6]. This method finds all of the zeros that lie in a circle using numerical integration. Error analyses for the eigenvalues of the pencil Hm^< − λHm whose elements are obtained by numerical integration are presented in [7] and [11]. Let sk be

sk := (1 / (2πi)) ∮_Γ (z − γ)^k (zB − A)^{−1} v dz,  k = 0, 1, . . . ,    (2.2)

and let q1, . . . , qm be eigenvectors corresponding to the eigenvalues λ1, . . . , λm. From the results of [7] and [12], we have the following result.
Theorem 2. Let Wm be an m × m matrix such that

Hm^< Wm = diag(ζ1, . . . , ζm) Hm Wm,

where ζj = λj − γ, j = 1, . . . , m. Then

[q1, . . . , qm] = [s0, . . . , s_{m−1}] Wm.    (2.3)

Let ν1, . . . , νm be the residues corresponding to ζ1, . . . , ζm. Then

μk = Σ_{j=1}^{m} νj ζj^k,  k = 0, 1, . . . .    (2.4)

It follows from (2.4) that (μ0, . . . , μ_{m−1}) = (ν1, . . . , νm) Vm, where Vm is the m × m Vandermonde matrix Vm := [ζ_i^{j−1}]_{i,j=1}^m. Since the column vectors of Vm^{−1} are given by the eigenvectors of the matrix pencil Hm^< − λHm ([7]), the residues νj can be evaluated by

(ν1, . . . , νm) = (μ0, . . . , μ_{m−1}) Vm^{−1} = (μ0, . . . , μ_{m−1}) Wm.    (2.5)
We next consider the case in which Γ is given by a circle and the integration is evaluated via a trapezoidal rule on the circle. Let γ and ρ be the center and the radius, respectively, of the given circle. Let N be a positive integer, and let

ωj := γ + ρ e^{2πi(j+1/2)/N},  j = 0, 1, . . . , N − 1.

By approximating the integral of equation (2.1) via the N-point trapezoidal rule, we obtain the following approximations for μk:

μk ≈ μ̂k := (1/N) Σ_{j=0}^{N−1} (ωj − γ)^{k+1} f(ωj),  k = 0, 1, . . .    (2.6)

Let the m × m Hankel matrices Ĥm and Ĥm^< be Ĥm := [μ̂_{i+j−2}]_{i,j=1}^m and Ĥm^< := [μ̂_{i+j−1}]_{i,j=1}^m. Let ζ̂1, . . . , ζ̂m be the eigenvalues of the matrix pencil Ĥm^< − λĤm. We regard λ̂j = γ + ζ̂j, 1 ≤ j ≤ m, as the approximations for λ1, . . . , λm.

Let Ŵm be the matrix whose column vectors are eigenvectors of Ĥm^< − λĤm. The approximate vectors q̂1, . . . , q̂m for the eigenvectors q1, . . . , qm are obtained by

[q̂1, . . . , q̂m] = [ŝ0, . . . , ŝ_{m−1}] Ŵm,    (2.7)

where

yj = (ωj B − A)^{−1} v,  j = 0, 1, . . . , N − 1,

and

ŝk := (1/N) Σ_{j=0}^{N−1} (ωj − γ)^{k+1} yj,  k = 0, 1, . . . .    (2.8)

We then obtain the following algorithm:
Algorithm:
Input: u, v ∈ C^n, N, m, γ, ρ
Output: λ̂1, . . . , λ̂m, q̂1, . . . , q̂m
1. Set ωj ← γ + ρ exp(2πi(j + 1/2)/N), j = 0, . . . , N − 1
2. Solve (ωj B − A) yj = v for yj, j = 0, . . . , N − 1
3. Set f(ωj) ← u^H yj, j = 0, . . . , N − 1
4. Compute μ̂k, k = 0, . . . , 2m − 1, by (2.6)
5. Compute the eigenvalues ζ̂1, . . . , ζ̂m of the pencil Ĥm^< − λĤm
6. Compute q̂1, . . . , q̂m by (2.7)
7. Set λ̂j ← γ + ζ̂j, j = 1, . . . , m

In practical use, the exact number of eigenvalues in the circle is not given in advance. Criteria for finding an appropriate m were discussed in [3,6] for the Hankel matrices. If we take m larger than the exact number of eigenvalues, then the set of eigenvalues of the matrix pencil Ĥm^< − λĤm includes approximations of the eigenvalues in Γ.
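For reference, a compact sequential NumPy/SciPy sketch of the algorithm above is given below. In the paper the N independent linear solves of Step 2 are dispatched to remote hosts via OmniRPC, whereas here they run in a plain loop; eigenvector recovery (Step 6) and the residue-based filtering of spurious eigenvalues are omitted, and the function name is ours.

```python
import numpy as np
from scipy.linalg import eig, hankel

def moment_eigensolver(A, B, gamma, rho, N=32, m=10, seed=0):
    """Approximate the eigenvalues of the pencil A - lambda*B that lie
    inside the circle |z - gamma| < rho (Steps 1-5 and 7, sequential)."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    u, v = rng.random(n), rng.random(n)
    omega = gamma + rho * np.exp(2j * np.pi * (np.arange(N) + 0.5) / N)
    f = np.empty(N, dtype=complex)
    for j in range(N):                       # independent linear systems
        y = np.linalg.solve(omega[j] * B - A, v)
        f[j] = u @ y                         # u is real, so u^H y = u^T y
    mu = np.array([np.mean((omega - gamma) ** (k + 1) * f) for k in range(2 * m)])
    H = hankel(mu[0:m], mu[m - 1:2 * m - 1])     # [mu_{i+j-2}]
    Hs = hankel(mu[1:m + 1], mu[m:2 * m])        # [mu_{i+j-1}]
    zeta = eig(Hs, H, right=False)               # pencil H^< - lambda*H
    return gamma + zeta
```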
3 Parallel Implementation on OmniRPC In this section we describe a parallel implementation of the algorithm. To evaluate the value of f(z) at z = ωj, j = 0, . . . , N − 1, we solve the systems of linear equations

(ωj B − A) yj = v,  j = 0, 1, . . . , N − 1.    (3.9)
When the matrices A and B are large and sparse, the computational costs of solving the linear systems (3.9) are dominant in the algorithm. Since these linear systems are independent, we solve them on remote hosts in parallel. We have implemented the algorithm in a grid RPC system called OmniRPC. OmniRPC efficiently supports typical master-worker parallel grid applications such as parametric execution programs. The user can define an initialization procedure in the remote executable to send and store data automatically in advance of actual remote procedure calls in OmniRPC_Module_Init. OmniRPC_Call is a simple client programming interface for calling remote functions. When OmniRPC_Call makes a remote procedure call, the call is allocated to an appropriate remote host. The programmer can use asynchronous remote procedure calls via OmniRPC_Call_Async. The procedure to solve the N systems in Step 2 of the presented algorithm is performed by OmniRPC_Call_Async in the source code. Since A, B and v are common to each system of linear equations, we only need to send these data to each host the first time. To solve another equation on a remote host, only a scalar parameter ωj is sent. If we do not need eigenvectors, the scalar value f(ωj) is returned from the remote host after the procedure in Step 3 is performed. Otherwise, yj is returned in addition. We can easily extend the method to the case where several circular regions are given. Suppose that Nc circles are given. Then we solve N × Nc systems of linear equations

(ωj^(k) B − A) yj^(k) = v,  j = 0, . . . , N − 1,  k = 1, . . . , Nc,
where ωj^(k), j = 0, . . . , N − 1, are the points on the kth circle. Now we consider the efficiency loss caused by the communication of the matrix data to the remote hosts. Let α be the ratio of the time to send the matrix data to a remote host to the total time to solve all systems of linear equations. We assume that the matrix data are sent to the remote hosts one after another, and that the computation on a host starts after it has received the matrix data. We also assume that no load imbalance occurs in the computation. Then, from a simple observation, we have the following estimate of the speedup:

Sp ≈ np / (1 − α + (α/2) np (np + 1)),

where np is the number of remote hosts.
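As a quick sanity check of this estimate, the one-line function below evaluates Sp and, with the values of α reported in Section 4, reproduces the Sp columns of Tables 1 and 2 to within rounding.

```python
def predicted_speedup(n_p, alpha):
    """Speedup estimate S_p = n_p / (1 - alpha + (alpha/2) n_p (n_p + 1))
    under the assumptions above: sequential transfer of the matrix data,
    computation starting only after the transfer, perfect load balance."""
    return n_p / (1.0 - alpha + 0.5 * alpha * n_p * (n_p + 1))

# Example 1 (alpha = 0.0052): predicted_speedup(10, 0.0052) -> about 7.8
```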
4 Numerical Examples In this section, we present numerical examples of the proposed method. The algorithm was implemented in OmniRPC on a Linux operating system. Computation was performed with double precision arithmetic in FORTRAN. The remote hosts were Intel Xeon computers with a clock speed of 2.4 GHz, and the local host was an Intel Celeron computer with a clock speed of 2.4 GHz and 100Base-T Ethernet. The local host and the remote hosts were connected over a wide-area network. To evaluate the speed of execution of the parallel algorithm, the elapsed wall-clock time (in seconds) was observed. The time for loading the input matrices into the memory of the local host and the time required to start up all OmniRPC processes were not included. When A and B are real matrices, f(z̄) is the complex conjugate of f(z). Thus we evaluated the N/2 function values f(ω0), . . . , f(ω_{N/2−1}), and obtained the remaining function values as complex conjugates, f(ω_{N−1−j}) being the conjugate of f(ωj). The elements of u and v were distributed randomly on the interval [0, 1] by a random number generator, and they were calculated on the remote hosts. Example 1. The matrices were
    A = In,

    B =  [  5  −4   1                          ]
         [ −4   6  −4   1                      ]
         [  1  −4   6  −4   1                  ]
         [         .    .    .    .    .       ]
         [          1  −4   6  −4   1          ]
         [              1  −4   6  −4          ]
         [                  1  −4   5          ],

where In is the n × n identity matrix and n = 2000000. The exact eigenvalues are given by

    λ∗j = 16 cos^4( jπ / (2(n + 1)) ),    j = 1, 2, . . . , n.
The linear systems were solved by a direct solver for band matrices. The interval [2, 2.005] was covered by 100 circles with radius ρ = 2.5 × 10^−5. The number of points on each circle was N = 32, the size of the Hankel matrices was m = 10, and an eigenvalue in a circle was accepted if the corresponding residue satisfied |νj| ≥ 10^−8. A total of 368 eigenvalues in the interval were obtained. The resulting eigenvalues and their errors were as follows:

    j      λj                      |λj − λ∗j|
    1      2.000012164733324       6.22E-15
    2      2.000025723848086       6.22E-15
    3      2.000039283082696       1.15E-14
    ...    ...                     ...
    368    2.004996418973247       8.88E-16

In Table 1, the wall-clock times in seconds, the speedup, and Sp are shown. In this example, we set α = 0.0052 from the results of a test program. The speedup was 7.66 with 10 remote processors, and we can see that Sp is a good estimate of the observed speedup.
Table 1. Wall-clock times in seconds, speedup and Sp in Example 1

    Number of nodes   Time (sec)   Speedup   Sp
     1                1149         1.00      1.00
     2                 583         1.97      1.98
     4                 304         3.78      3.82
     6                 214         5.37      5.43
     8                 174         6.60      6.76
    10                 150         7.66      7.80
Example 2. The test matrices were derived from a finite element method for a molecular electronic state described in [4]. Both A and B were real symmetric, with n = 12173 and 509901 nonzero elements. We computed 15 eigenvalues and the corresponding eigenvectors in 8 circles, located at γ = −6.2, −5.8, −5.6, −3.4, −3.0, −2.8, −2.6, −2.0 with radius ρ = 0.1. The parameters were chosen as N = 32 and m = 10. Since the matrix ωjB − A with complex ωj is complex symmetric, the COCG method [14] with incomplete Cholesky decomposition was used to solve the linear systems; the stopping criterion for the relative residual was 10^−12. In Table 2, the wall-clock times in seconds, the speedup, and Sp with α = 0.0045 are shown. The speedup was similar to that of Example 1. Although the computational time to solve the systems of linear equations differed for each ωj, Sp was still a good estimate of the observed speedup.
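The paper does not list the solver code; the following is a minimal, unpreconditioned COCG sketch for a dense complex symmetric system, written by us purely for illustration (the authors use a sparse matrix together with an incomplete Cholesky preconditioner, and all function names below are ours, not theirs).

#include <complex.h>
#include <math.h>
#include <stdlib.h>

/* Unconjugated bilinear form x^T y used by COCG (note: no complex conjugation). */
static double complex dotu(int n, const double complex *x, const double complex *y)
{
    double complex s = 0.0;
    for (int i = 0; i < n; i++) s += x[i] * y[i];
    return s;
}

/* Euclidean norm. */
static double nrm2(int n, const double complex *x)
{
    double s = 0.0;
    for (int i = 0; i < n; i++) s += creal(x[i] * conj(x[i]));
    return sqrt(s);
}

/* y = A x for a dense complex symmetric matrix A stored row-major. */
static void matvec(int n, const double complex *A,
                   const double complex *x, double complex *y)
{
    for (int i = 0; i < n; i++) {
        double complex s = 0.0;
        for (int j = 0; j < n; j++) s += A[(size_t)i * n + j] * x[j];
        y[i] = s;
    }
}

/* Unpreconditioned COCG for A x = b with complex symmetric A (A = A^T).
   Returns the number of iterations performed; x is overwritten with the solution. */
int cocg(int n, const double complex *A, const double complex *b,
         double complex *x, double tol, int maxit)
{
    double complex *r = malloc((size_t)n * sizeof *r);
    double complex *p = malloc((size_t)n * sizeof *p);
    double complex *q = malloc((size_t)n * sizeof *q);
    int it;

    for (int i = 0; i < n; i++) { x[i] = 0.0; r[i] = b[i]; p[i] = b[i]; }
    double bnorm = nrm2(n, b);
    double complex rho = dotu(n, r, r);

    for (it = 0; it < maxit && nrm2(n, r) > tol * bnorm; it++) {
        matvec(n, A, p, q);
        double complex alpha = rho / dotu(n, p, q);
        for (int i = 0; i < n; i++) { x[i] += alpha * p[i]; r[i] -= alpha * q[i]; }
        double complex rho_new = dotu(n, r, r);
        double complex beta = rho_new / rho;
        for (int i = 0; i < n; i++) p[i] = r[i] + beta * p[i];
        rho = rho_new;
    }
    free(r); free(p); free(q);
    return it;
}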
Table 2. Wall-clock times in seconds, speedup and Sp in Example 2

    Number of nodes   Time (sec)   Speedup   Sp
     1                199          1.00      1.00
     2                103          1.93      1.98
     4                 53          3.75      3.84
     6                 38          5.24      5.50
     8                 29          6.86      6.91
    10                 26          7.65      8.04
The reason that the observed speedup was slightly smaller than Sp is load imbalance. If more circles are given, or if the linear systems require more computational time, then α becomes smaller; in such cases the speedup will increase.
5 Conclusions

In this paper we presented a parallel algorithm for finding eigenvalues and eigenvectors of generalized eigenvalue problems using a moment-based method. In this approach, no data need to be exchanged between remote hosts, so the method is well suited to master-worker programming models. We implemented and tested the proposed method in a grid environment using the grid RPC system OmniRPC. The remote procedures have large granularity, and the method has good load-balancing properties.

The computation with explicit moments is often numerically unstable, so it is important to choose appropriate parameters for the circular regions. In some cases, a rough estimate of the distribution of the eigenvalues is available in advance from physical properties or from approximate methods; for example, eigenvalues appearing in the computation of molecular orbitals can be approximated by fragment molecular orbitals [5]. The estimation of suitable parameters will be the subject of future work.
Acknowledgements

This work was partially supported by CREST, Japan Science and Technology Agency (JST), and by Grant-in-Aid for Scientific Research No. 15560049 from the Ministry of Education, Culture, Sports, Science and Technology of Japan.
References

1. W. E. Arnoldi, The principle of minimized iteration in the solution of the matrix eigenproblem, Quarterly of Appl. Math., 9:17–29, 1951.
2. Z. Bai, Krylov subspace techniques for reduced-order modeling of large-scale dynamical systems, Appl. Numer. Math., 43:9–44, 2002.
3. G. H. Golub, P. Milanfar and J. Varah, A stable numerical method for inverting shape from moments, SIAM J. Sci. Comp., 21(4):1222–1243, 1999.
4. S. Hyodo, Meso-scale fusion: A method for molecular electronic state calculation in inhomogeneous materials, Proc. the 15th Toyota Conference, Mikkabi, 2001, in Special issue of J. Comput. Appl. Math., 149:101–118, 2002.
5. Y. Inadomi, T. Nakano, K. Kitaura and U. Nagashima, Definition of molecular orbitals in fragment molecular orbital method, Chemical Physics Letters, 364:139–143, 2002.
6. P. Kravanja, T. Sakurai and M. Van Barel, On locating clusters of zeros of analytic functions, BIT, 39:646–682, 1999.
7. P. Kravanja, T. Sakurai, H. Sugiura and M. Van Barel, A perturbation result for generalized eigenvalue problems and its application to error estimation in a quadrature method for computing zeros of analytic functions, J. Comput. Appl. Math., 161:339–347, 2003.
8. OmniRPC: http://www.omni.hpcc.jp/OmniRPC.
9. A. Ruhe, Rational Krylov algorithms for nonsymmetric eigenvalue problems II: matrix pairs, Linear Algebra Appl., 197:283–295, 1994.
10. Y. Saad, Iterative Methods for Large Eigenvalue Problems, Manchester University Press, Manchester, 1992.
11. T. Sakurai, P. Kravanja, H. Sugiura and M. Van Barel, An error analysis of two related quadrature methods for computing zeros of analytic functions, J. Comput. Appl. Math., 152:467–480, 2003.
12. T. Sakurai and H. Sugiura, A projection method for generalized eigenvalue problems, J. Comput. Appl. Math., 159:119–128, 2003.
13. M. Sato, T. Boku and D. Takahashi, OmniRPC: a Grid RPC system for parallel programming in cluster and grid environments, Proc. CCGrid 2003, 206–213, 2003.
14. H. A. van der Vorst and J. B. M. Melissen, A Petrov-Galerkin type method for solving Ax = b, where A is a symmetric complex matrix, IEEE Trans. on Magnetics, 26(2):706–708, 1990.
An Implementation of Parallel 3-D FFT Using Short Vector SIMD Instructions on Clusters of PCs

Daisuke Takahashi, Taisuke Boku, and Mitsuhisa Sato

Graduate School of Systems and Information Engineering, University of Tsukuba,
1-1-1 Tennodai, Tsukuba, Ibaraki 305-8573, Japan
{daisuke,taisuke,msato}@cs.tsukuba.ac.jp
Abstract. In this paper, we propose an implementation of a parallel three-dimensional fast Fourier transform (FFT) using short vector SIMD instructions on clusters of PCs. We vectorized the FFT kernels using Intel’s Streaming SIMD Extensions 2 (SSE2) instructions and show that the combination of vectorization and a block three-dimensional FFT algorithm improves performance effectively. Performance results of three-dimensional FFTs on a dual Xeon 2.8 GHz PC SMP cluster are reported; we achieved a performance of over 5 GFLOPS on a 16-node dual Xeon 2.8 GHz PC SMP cluster.
1 Introduction

The fast Fourier transform (FFT) [1] is an algorithm widely used today in science and engineering. Parallel three-dimensional FFT algorithms on distributed-memory parallel computers have been well studied [2,3,4,5,6]. Today, many processors have short vector SIMD instructions, e.g., Intel’s SSE/SSE2/SSE3, AMD’s 3DNow!, and Motorola’s AltiVec. These instructions provide substantial speedup for digital signal processing applications. Efficient FFT implementations with short vector SIMD instructions have also been well studied [7,8,9,10,11,12].

Many FFT algorithms work well when the data sets fit into a cache. When a problem size exceeds the cache size, however, the performance of these FFT algorithms decreases dramatically. The key issue in the design of large FFTs is to minimize the number of cache misses. Thus, both vectorization and high cache utilization are particularly important for achieving high performance on processors that have short vector SIMD instructions.

In this paper, we propose an implementation of a parallel three-dimensional FFT using short vector SIMD instructions on clusters of PCs. Our proposed parallel three-dimensional FFT algorithm is based on the multicolumn FFT algorithm [13,14]. Conventional three-dimensional FFT algorithms require three multicolumn FFTs and three data transpositions, and the three transpose steps typically are the chief bottlenecks on cache-based processors. Some previous three-dimensional FFT algorithms [14,15] separate the multicolumn FFTs from the transpositions. Taking the opposite approach, we combine the multicolumn FFTs and transpositions to reduce the number of cache misses, and we modify the conventional three-dimensional
FFT algorithm to reuse data in the cache memory. We use the block three-dimensional FFT algorithm to implement the parallel three-dimensional FFT. We have implemented the parallel block three-dimensional FFT algorithm on a 16-node dual Xeon PC SMP cluster, and in this paper we report the performance results.

Section 2 describes the vectorization of the FFT kernels. Section 3 describes a block three-dimensional FFT algorithm used for problems that exceed the cache size. Section 4 gives a parallel block three-dimensional FFT algorithm. Section 5 describes the in-cache FFT algorithm used for problems that fit into the data cache. Section 6 gives performance results. In Section 7, we provide some concluding remarks.
2 Vectorization of FFT Kernels

Streaming SIMD Extensions 2 (SSE2) were introduced into the IA-32 architecture in the Pentium 4 and Intel Xeon processors [16]. These extensions were designed to enhance the performance of IA-32 processors. The most direct way to use the SSE2 instructions is to inline the assembly language instructions into source code. However, this can be time-consuming and tedious, and assembly language inline programming is not supported on all compilers. Instead, Intel provides easy access through API extension sets referred to as intrinsics [17]. We used the SSE2 intrinsics to access the SIMD hardware.

An example of complex multiplication using the SSE2 intrinsics is shown in Fig. 1. The __m128d data type used there is provided by the SSE2 intrinsics and holds two packed double-precision floating-point values; in the complex multiplication it serves as a double-precision complex data type. The inline function ZMUL in Fig. 1 multiplies two double-precision complex values; to add two double-precision complex values, the intrinsic _mm_add_pd can be used. With the SSE2 intrinsics and the inline function ZMUL, the FFT kernels can be vectorized. An example of a vectorized radix-2 FFT kernel using the SSE2 intrinsics is shown in Fig. 2.
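As a usage illustration (not part of the original paper), the small driver below exercises the ZMUL routine of Fig. 1, whose definition is assumed to be in scope; it multiplies 1 + 2i by 3 + 4i and should print −5 + 10i.

#include <stdio.h>
#include <emmintrin.h>

/* The ZMUL inline function from Fig. 1 is assumed to be defined above. */

int main(void)
{
    double a[2] = {1.0, 2.0};   /* 1 + 2i, real part in the low element */
    double b[2] = {3.0, 4.0};   /* 3 + 4i */
    double c[2];

    _mm_storeu_pd(c, ZMUL(_mm_loadu_pd(a), _mm_loadu_pd(b)));
    printf("%g + %gi\n", c[0], c[1]);   /* expected output: -5 + 10i */
    return 0;
}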
3 A Block Three-Dimensional FFT Algorithm

The three-dimensional discrete Fourier transform (DFT) is given by

    y(k1, k2, k3) = Σ_{j1=0}^{n1−1} Σ_{j2=0}^{n2−1} Σ_{j3=0}^{n3−1} x(j1, j2, j3) ω_{n3}^{j3 k3} ω_{n2}^{j2 k2} ω_{n1}^{j1 k1},    (3.1)

where ω_{nr} = e^{−2πi/nr} (1 ≤ r ≤ 3) and i = √−1. The three-dimensional FFT based on the multicolumn FFT algorithm is as follows:

Step 1: Transpose
    x1(j3, j1, j2) = x(j1, j2, j3).
#include <emmintrin.h>

__m128d ZMUL(__m128d a, __m128d b);

static __inline __m128d ZMUL(__m128d a, __m128d b)
{
    __m128d ar, ai;

    ar = _mm_unpacklo_pd(a, a);              /* ar = [a.r a.r] */
    ar = _mm_mul_pd(ar, b);                  /* ar = [a.r*b.r a.r*b.i] */
    ai = _mm_unpackhi_pd(a, a);              /* ai = [a.i a.i] */
    ai = _mm_xor_pd(ai, _mm_set_sd(-0.0));   /* ai = [-a.i a.i] */
    b  = _mm_shuffle_pd(b, b, 1);            /* b  = [b.i b.r] */
    ai = _mm_mul_pd(ai, b);                  /* ai = [-a.i*b.i a.i*b.r] */

    return _mm_add_pd(ar, ai);               /* [a.r*b.r-a.i*b.i a.r*b.i+a.i*b.r] */
}

Fig. 1. An example of double-precision complex multiplication using SSE2 intrinsics
Step 2: n1 n2 individual n3-point multicolumn FFTs
    x2(k3, j1, j2) = Σ_{j3=0}^{n3−1} x1(j3, j1, j2) ω_{n3}^{j3 k3}.
Step 3: Transpose
    x3(j2, j1, k3) = x2(k3, j1, j2).
Step 4: n1 n3 individual n2-point multicolumn FFTs
    x4(k2, j1, k3) = Σ_{j2=0}^{n2−1} x3(j2, j1, k3) ω_{n2}^{j2 k2}.
Step 5: Transpose
    x5(j1, k2, k3) = x4(k2, j1, k3).
Step 6: n2 n3 individual n1-point multicolumn FFTs
    y(k1, k2, k3) = Σ_{j1=0}^{n1−1} x5(j1, k2, k3) ω_{n1}^{j1 k1}.
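To make the transposition steps concrete, the following loop (written by us for illustration; it is not the paper's code) performs the Step 1 reordering x1(j3, j1, j2) = x(j1, j2, j3) for contiguously stored arrays. Its long-stride accesses are exactly what makes such transpose steps cache-unfriendly.

#include <complex.h>
#include <stddef.h>

/* Step 1 transpose: x1(j3,j1,j2) = x(j1,j2,j3).
   x is an n1 x n2 x n3 array and x1 an n3 x n1 x n2 array, both stored
   contiguously in C row-major order. */
void transpose_step1(const double complex *x, double complex *x1,
                     int n1, int n2, int n3)
{
    for (int j1 = 0; j1 < n1; j1++)
        for (int j2 = 0; j2 < n2; j2++)
            for (int j3 = 0; j3 < n3; j3++)
                x1[((size_t)j3 * n1 + j1) * n2 + j2] =
                    x[((size_t)j1 * n2 + j2) * n3 + j3];
}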
The distinctive features of the three-dimensional FFT can be summarized as follows:
– Three multicolumn FFTs are performed in Steps 2, 4 and 6. Each column FFT is small enough to fit into the data cache.
– The three-dimensional FFT has three transpose steps, which typically are the chief bottlenecks on cache-based processors.
We combine the multicolumn FFTs and transpositions in order to reduce the number of cache misses, and we modify the conventional three-dimensional FFT algorithm to reuse data in the cache memory. As in the conventional three-dimensional FFT above, it is assumed in the following that n = n1 n2 n3 and that nb is the block size. We assume that each processor has a multilevel cache memory. A block three-dimensional FFT algorithm [6] can be stated as follows.
#include <emmintrin.h>

__m128d ZMUL(__m128d a, __m128d b);

void fft_vec(double *a, double *b, double *w, int m, int l)
{
    int i, i0, i1, i2, i3, j;
    __m128d t0, t1, w0;

    for (j = 0; j < l; j++) {
        w0 = _mm_load_pd(&w[j << 1]);
        for (i = 0; i < m; i++) {
            i0 = (i << 1) + (j * m << 1);
            i1 = i0 + (m * l << 1);
            i2 = (i << 1) + (j * m << 2);
            i3 = i2 + (m << 1);
            t0 = _mm_load_pd(&a[i0]);
            t1 = _mm_load_pd(&a[i1]);
            _mm_store_pd(&b[i2], _mm_add_pd(t0, t1));
            _mm_store_pd(&b[i3], ZMUL(w0, _mm_sub_pd(t0, t1)));
        }
    }
}

Fig. 2. An example of a vectorized radix-2 FFT kernel using SSE2 intrinsics
1. Consider the data in main memory as an n1 × n2 × n3 complex matrix. Fetch and transpose the data nb rows at a time into an n3 × nb matrix. The n3 × nb array fits into the L2 cache.
2. For each nb column, perform nb individual n3-point multicolumn FFTs on the n3 × nb array in the L2 cache. Each column FFT also fits into the L1 data cache.
3. Transpose each of the resulting n3 × nb matrices, and return the resulting nb rows to the same locations in main memory from which they were fetched.
4. Fetch and transpose the data nb rows at a time into an n2 × nb matrix.
5. For each nb column, perform nb individual n2-point multicolumn FFTs on the n2 × nb array in the L2 cache.
6. Transpose each of the resulting n2 × nb matrices, and return the resulting nb rows to the same locations in main memory from which they were fetched.
7. Perform n2 n3 individual n1-point multicolumn FFTs on the n1 × n2 × n3 array.
We note that this algorithm is a three-pass algorithm. Fig. 3 gives the Fortran program for this block three-dimensional FFT algorithm. Here, the arrays YWORK and ZWORK are work arrays, and the parameters NB and NP are the blocking parameter and padding parameter, respectively.
      COMPLEX*16 X(N1,N2,N3)
      COMPLEX*16 YWORK(N2+NP,NB),ZWORK(N3+NP,NB)
      DO J=1,N2
        DO II=1,N1,NB
          DO KK=1,N3,NB
            DO I=II,MIN0(II+NB-1,N1)
              DO K=KK,MIN0(KK+NB-1,N3)
                ZWORK(K,I-II+1)=X(I,J,K)
              END DO
            END DO
          END DO
          DO I=II,MIN0(II+NB-1,N1)
            CALL IN_CACHE_FFT(ZWORK(1,I-II+1),N3)
          END DO
          DO K=1,N3
            DO I=II,MIN0(II+NB-1,N1)
              X(I,J,K)=ZWORK(K,I-II+1)
            END DO
          END DO
        END DO
      END DO
      DO K=1,N3
        DO II=1,N1,NB
          DO JJ=1,N2,NB
            DO I=II,MIN0(II+NB-1,N1)
              DO J=JJ,MIN0(JJ+NB-1,N2)
                YWORK(J,I-II+1)=X(I,J,K)
              END DO
            END DO
          END DO
          DO I=II,MIN0(II+NB-1,N1)
            CALL IN_CACHE_FFT(YWORK(1,I-II+1),N2)
          END DO
          DO J=1,N2
            DO I=II,MIN0(II+NB-1,N1)
              X(I,J,K)=YWORK(J,I-II+1)
            END DO
          END DO
        END DO
        DO J=1,N2
          CALL IN_CACHE_FFT(X(1,J,K),N1)
        END DO
      END DO

Fig. 3. A block three-dimensional FFT algorithm

4 Parallel Block Three-Dimensional FFT Algorithm

Let N = N1 × N2 × N3. On a distributed-memory parallel computer which has P processors, the three-dimensional array x(N1, N2, N3) is distributed along the first dimension, N1. If N1 is divisible by P, each processor has distributed data of size N/P. We introduce the notation Nˆr ≡ Nr/P and we denote the corresponding index as Jˆr, which indicates that the data along Jr are distributed across all P processors. Here, we use the subscript r to indicate that this index belongs to dimension r. The distributed array is represented as xˆ(Nˆ1, N2, N3). At processor m, the local index Jˆr(m) corresponds to the global index as the cyclic distribution:

    Jr = Jˆr(m) × P + m,    0 ≤ m ≤ P − 1,  1 ≤ r ≤ 3.    (4.2)
To illustrate the all-to-all communication, it is convenient to decompose Ni into two dimensions, N˜i and Pi, where N˜i ≡ Ni/Pi. Although Pi is the same as P, we are using the subscript i to indicate that this index belongs to dimension i. Starting with the initial data xˆ(Nˆ1, N2, N3), the parallel three-dimensional FFT can be performed according to the following steps:

Step 1: Transpose
    xˆ1(J3, Jˆ1, J2) = xˆ(Jˆ1, J2, J3).
Step 2: (N1/P) · N2 individual N3-point multicolumn FFTs
    xˆ2(K3, Jˆ1, J2) = Σ_{J3=0}^{N3−1} xˆ1(J3, Jˆ1, J2) ω_{N3}^{J3 K3}.
Step 3: Rearrangement
    xˆ3(Jˆ1, J2, K˜3, P3) = xˆ2(P3, K˜3, Jˆ1, J2) ≡ xˆ2(K3, Jˆ1, J2).
Step 4: All-to-all communication
    xˆ4(J˜1, J2, Kˆ3, P1) = xˆ3(Jˆ1, J2, K˜3, P3).
Step 5: Rearrangement
    xˆ5(J2, J˜1, Kˆ3, P1) = xˆ4(J˜1, J2, Kˆ3, P1).
Step 6: N1 · (N3/P) individual N2-point multicolumn FFTs
    xˆ6(K2, J˜1, Kˆ3, P1) = Σ_{J2=0}^{N2−1} xˆ5(J2, J˜1, Kˆ3, P1) ω_{N2}^{J2 K2}.
Step 7: Rearrangement
    xˆ7(J1, K2, Kˆ3) ≡ xˆ7(P1, J˜1, K2, Kˆ3) = xˆ6(K2, J˜1, Kˆ3, P1).
Step 8: N2 · (N3/P) individual N1-point multicolumn FFTs
    yˆ(K1, K2, Kˆ3) = Σ_{J1=0}^{N1−1} xˆ7(J1, K2, Kˆ3) ω_{N1}^{J1 K1}.
The distinctive features of the parallel three-dimensional FFT algorithm can be summarized as follows:
– The parallel three-dimensional FFT is accompanied by local transposes (data rearrangements).
– N^{2/3}/P individual N^{1/3}-point multicolumn FFTs are performed in Steps 2, 6 and 8 for the case N1 = N2 = N3 = N^{1/3}.
– The all-to-all communication occurs just once.
If both N1 and N3 are divisible by P, the workload on each processor is also uniform. Although the input data xˆ(Nˆ1, N2, N3) are distributed along the first dimension N1, the output data yˆ(N1, N2, Nˆ3) are distributed along the third dimension N3. If the input and output data are required to have the same distribution, one additional all-to-all communication step is needed.
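The paper does not show the communication call itself; as a rough sketch of what Step 4 amounts to (our own code, with hypothetical names, and with each complex value transferred as two doubles), the single collective exchange could look like this:

#include <mpi.h>

/* One all-to-all exchange for Step 4. block_elems is the number of complex
   elements destined for each of the P processes, roughly N1*N2*N3/(P*P)
   when N1 and N3 are divisible by P; the data for each destination process
   is assumed to be packed contiguously by the Step 3 rearrangement. */
void alltoall_step4(double *sendbuf, double *recvbuf, int block_elems)
{
    MPI_Alltoall(sendbuf, 2 * block_elems, MPI_DOUBLE,
                 recvbuf, 2 * block_elems, MPI_DOUBLE, MPI_COMM_WORLD);
}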
5 In-cache FFT Algorithm

We use the radix-2, 4 and 8 Stockham autosort algorithm [18] for the in-cache FFT. Although the Stockham autosort algorithm requires a scratch array the same size as the input data array, it does not need the digit-reversal permutation. If the Stockham autosort algorithm is used for the individual FFTs, the additional scratch requirement is at most O(N^{1/3}) (where N = N1 × N2 × N3). The higher radices are more efficient in terms of both memory operations and floating-point operations, and a high ratio of floating-point instructions to memory operations is particularly important on a cache-based processor. In view of this ratio, the radix-8 FFT is more advantageous than the radix-4 FFT. A power-of-two FFT (except for the 2-point FFT) can be performed by a combination of radix-8 and radix-4 steps containing at most two radix-4 steps; that is, a power-of-two FFT of length n = 2^p can be factored as n = 4^q 8^r with p ≥ 2, 0 ≤ q ≤ 2 and r ≥ 0.
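The factorization rule above determines the number of radix-4 and radix-8 steps uniquely; the small helper below (ours, not the paper's) computes q and r for a given exponent p under that rule.

#include <assert.h>

/* For n = 2^p with p >= 2, choose q and r such that n = 4^q * 8^r,
   i.e. 2*q + 3*r = p, using at most two radix-4 steps (0 <= q <= 2). */
void radix_schedule(int p, int *q, int *r)
{
    assert(p >= 2);
    *q = (p % 3 == 0) ? 0 : ((p % 3 == 1) ? 2 : 1);
    *r = (p - 2 * (*q)) / 3;
}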
6 Performance Results

To evaluate the implemented parallel three-dimensional FFT, named FFTE (version 3.2, available at http://www.ffte.jp), we compared its performance against that of the FFTW library (version 2.1.5) [15], which is known as one of the fastest FFT libraries for many processors.
Table 1. Performance of parallel three-dimensional FFTs on dual Xeon PC SMP cluster

                                    FFTE 3.2 (SSE2)       FFTE 3.2 (x87)        FFTW 2.1.5
    P (Nodes×CPUs)  N1×N2×N3        Time       MFLOPS     Time       MFLOPS     Time       MFLOPS
     1×1            2^8×2^8×2^8     7.68317     262.04     8.33619    241.51    12.86992    156.43
     1×2            2^8×2^8×2^8     4.29060     469.23     4.38417    459.21    10.77560    186.84
     2×1            2^8×2^8×2^9     6.31334     664.36     6.76740    619.78    16.21513    258.67
     2×2            2^8×2^8×2^9     5.40536     775.95     5.58136    751.48    11.89469    352.62
     4×1            2^8×2^9×2^9     7.91637    1102.04     8.31000   1049.84    17.43453    500.40
     4×2            2^8×2^9×2^9     6.61002    1319.84     6.75353   1291.79    13.59440    641.75
     8×1            2^9×2^9×2^9     8.80932    2056.84     9.38208   1931.28    17.98921   1007.24
     8×2            2^9×2^9×2^9     7.21630    2510.90     7.48793   2419.81    13.78780   1314.16
    16×1            2^9×2^9×2^10    9.35400    4017.64    10.28623   3653.52    18.42153   2040.06
    16×2            2^9×2^9×2^10    7.45187    5043.16     8.02600   4682.40    14.56551   2580.13
Although the latest FFTW (version 3.0.1) supports SSE/SSE2/3DNow!/AltiVec (new in version 3.0), MPI parallel transforms are still only available in version 2.1.5. We averaged the elapsed times obtained from 10 executions of complex forward FFTs. The parallel FFTs were performed on double-precision complex data, and the table of twiddle factors was prepared in advance. We used transposed-order output to avoid the additional all-to-all communication step for both the parallel block three-dimensional FFT and the FFTW.

A 16-node dual Xeon PC SMP cluster (Prestonia 2.8 GHz, 12 K uops L1 instruction cache, 8 KB L1 data cache, 512 KB L2 cache, 2 GB DDR-SDRAM main memory per node, Linux 2.4.20smp) was used. The nodes of the PC SMP cluster are interconnected through a Myrinet-2000 switch. MPICH-SCore [19] was used as the communication library, together with an intranode MPI library for the PC SMP cluster. The Intel C++ Compiler (icc, version 7.0) and the Intel Fortran Compiler (ifc, version 7.0) were used on the dual Intel Xeon PC cluster. For FFTE (SSE2), the compiler options were “icc -O3 -xW -fno-alias” and “ifc -O3 -xW -fno-alias”; for FFTE (x87), “ifc -O3 -fno-alias”; and for FFTW, “icc -O3 -xW” and “ifc -O3 -xW”.

Table 1 compares the FFTE (SSE2 and x87) and the FFTW in terms of their run times and MFLOPS. The first column of the table indicates the number of processors, and the second column gives the problem size. The next six columns contain the average elapsed time in seconds and the average execution performance in MFLOPS. The MFLOPS values are each based on 5N log2 N operations for a transform of size N = 2^m.

Table 2 shows the results of the all-to-all communication timings on the dual Xeon PC SMP cluster. The first column of the table indicates the number of processors.
Table 2. All-to-all communication performance on dual Xeon PC SMP cluster

    P      N1×N2×N3        Time      MB/sec
     1×2   2^8×2^8×2^8     0.73685    91.08
     2×1   2^8×2^8×2^9     1.44647    92.79
     2×2   2^8×2^8×2^9     1.38669    72.59
     4×1   2^8×2^9×2^9     2.36666    85.07
     4×2   2^8×2^9×2^9     2.28441    51.41
     8×1   2^9×2^9×2^9     3.05918    76.78
     8×2   2^9×2^9×2^9     2.85692    44.04
    16×1   2^9×2^9×2^10    3.37355    74.60
    16×2   2^9×2^9×2^10    3.11048    41.80
The second column gives the problem size, and the next two columns contain the average elapsed time in seconds and the average bandwidth in MB/sec.

In Tables 1 and 2, we can clearly see that the all-to-all communication overhead contributes significantly to the execution time. For this reason, the difference in performance between the implemented parallel three-dimensional FFT and the FFTW decreases as the number of processors increases. For N = 2^9 × 2^9 × 2^10 and P = 16×2, FFTE (SSE2) runs about 1.95 times faster than the FFTW, as shown in Table 1. The performance of the implemented parallel three-dimensional FFT remains at a high level even for the larger problem sizes because of cache blocking. Moreover, our implementation of the parallel three-dimensional FFT exploits the SSE2 instructions. These two factors explain why the implemented parallel three-dimensional FFT is the most advantageous on the dual Xeon PC SMP cluster, and the results clearly indicate that it is superior to the FFTW. We note that, on the 16-node dual Xeon 2.8 GHz PC SMP cluster, over 5 GFLOPS was realized with size N = 2^9 × 2^9 × 2^10 in the FFTE, as shown in Table 1.
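For reference, the MFLOPS figures quoted above follow directly from the 5N log2 N operation count; the small program below (ours, for illustration only) reproduces the 16×2 FFTE (SSE2) entry of Table 1.

#include <math.h>
#include <stdio.h>

/* MFLOPS figure based on 5*N*log2(N) operations for an N-point transform. */
static double fft_mflops(double n, double seconds)
{
    return 5.0 * n * log2(n) / seconds / 1.0e6;
}

int main(void)
{
    /* N = 2^9 * 2^9 * 2^10 = 2^28 and the 16x2 FFTE (SSE2) time of 7.45187 s
       give about 5043 MFLOPS, matching Table 1. */
    printf("%.0f MFLOPS\n", fft_mflops(268435456.0, 7.45187));
    return 0;
}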
7 Conclusion

In this paper, we proposed an implementation of a parallel three-dimensional FFT using short vector SIMD instructions on clusters of PCs. We vectorized the FFT kernels using the SSE2 instructions and parallelized the block three-dimensional FFT. The performance of the implemented parallel three-dimensional FFT remains at a high level, even for larger problem sizes, because of cache blocking, which demonstrates that the proposed parallel block three-dimensional FFT algorithm utilizes cache memory effectively. We succeeded in obtaining a performance of over 5 GFLOPS on a 16-node dual Xeon 2.8 GHz PC SMP cluster.
References

1. Cooley, J.W., Tukey, J.W.: An algorithm for the machine calculation of complex Fourier series. Math. Comput. 19 (1965) 297–301
2. Brass, A., Pawley, G.S.: Two and three dimensional FFTs on highly parallel computers. Parallel Computing 3 (1986) 167–184
3. Agarwal, R.C., Gustavson, F.G., Zubair, M.: An efficient parallel algorithm for the 3-D FFT NAS parallel benchmark. In: Proceedings of the Scalable High-Performance Computing Conference. (1994) 129–133
4. Hegland, M.: Real and complex fast Fourier transforms on the Fujitsu VPP 500. Parallel Computing 22 (1996) 539–553
5. Calvin, C.: Implementation of parallel FFT algorithms on distributed memory machines with a minimum overhead of communication. Parallel Computing 22 (1996) 1255–1279
6. Takahashi, D.: Efficient implementation of parallel three-dimensional FFT on clusters of PCs. Computer Physics Communications 152 (2003) 144–150
7. Nadehara, K., Miyazaki, T., Kuroda, I.: Radix-4 FFT implementation using SIMD multimedia instructions. In: Proc. 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’99). Volume 4. (1999) 2131–2134
8. Franchetti, F., Karner, H., Kral, S., Ueberhuber, C.W.: Architecture independent short vector FFTs. In: Proc. 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2001). Volume 2. (2001) 1109–1112
9. Rodriguez V, P.: A radix-2 FFT algorithm for modern single instruction multiple data (SIMD) architectures. In: Proc. 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2002). Volume 3. (2002) 3220–3223
10. Kral, S., Franchetti, F., Lorenz, J., Ueberhuber, C.W.: SIMD vectorization of straight line FFT code. In: Proc. 9th International Euro-Par Conference (Euro-Par 2003). Volume 2790 of Lecture Notes in Computer Science, Springer-Verlag (2003) 251–260
11. Frigo, M., Johnson, S.G.: The design and implementation of FFTW3. In: Proc. IEEE. Volume 93. (2005) 216–231
12. Franchetti, F., Kral, S., Lorenz, J., Ueberhuber, C.W.: Efficient utilization of SIMD extensions. In: Proc. IEEE. Volume 93. (2005) 409–425
13. Bailey, D.H.: FFTs in external or hierarchical memory. The Journal of Supercomputing 4 (1990) 23–35
14. Van Loan, C.: Computational Frameworks for the Fast Fourier Transform. SIAM Press, Philadelphia, PA (1992)
15. Frigo, M., Johnson, S.G.: FFTW: An adaptive software architecture for the FFT. In: Proc. 1998 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 98). (1998) 1381–1384
16. Intel Corporation: IA-32 Intel Architecture Software Developer’s Manual Volume 1: Basic Architecture. (2004)
17. Intel Corporation: Intel C++ Compiler for Linux Systems User’s Guide. (2004)
18. Swarztrauber, P.N.: FFT algorithms for vector computers. Parallel Computing 1 (1984) 45–63
19. Sumimoto, S., Tezuka, H., Hori, A., Harada, H., Takahashi, T., Ishikawa, Y.: High performance communication using a commodity network for cluster systems. In: Proc. Ninth International Symposium on High Performance Distributed Computing (HPDC-9). (2000) 139–146
Other Para’04 Contributed Talks

The contributed talks below can be found in the following report:

• J. Dongarra, K. Madsen and J. Waśniewski (Eds.), Complementary proceedings of the Para’04 Workshop on State-of-the-Art in Scientific Computing, Lyngby, Denmark, June, 2004. IMM-Technical report-2005-09. Informatics and Mathematical Modelling, Technical University of Denmark, DK-2800 Lyngby, Denmark. URL: http://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=3927

• Masking Latency with Data Driven Program Variants by Scott B. Baden.
• Dynamic Code Generation and Component Composition in C++ for Optimising Scientific Codes at Run-time by Olav Beckmann and Paul H J Kelly.
• Communication Strategies for Parallel Cooperative Ant Colony Optimization on Clusters and Grids by Siegfried Benkner, Karl F. Doerner, Richard F. Hartl, Guenter Kiechle and Maria Lucka.
• Fully Self Organized Public Key Management for Mobile Ad Hoc Network by Daeseon Choi, Seunghun Jin, and Hyunsoo Yoon.
• Predicting Protein-Protein Interactions in Parallel by Yoojin Chung, Sang-Young Cho and Chul-Hwan Kim.
• On the Parallelization of the Lattice-Boltzmann Method by Salvatore Filippone, Nicola Rossi, Gino Bella and Stefano Ubertini.
• Parallel and Distributed Techniques for Extracting Large Ontologies as a Resource in a Grid Environment by Andrew Flahive, Mehul Bhatt, Carlo Wouters, Wenny Rahayu, David Taniar, and Tharam Dillon.
• Adaptive Fuzzy Active Queue Management by Mahdi Jalili-Kharaajoo, Mohammadreza Sadri and Farzad Habibipour Roudsari.
• Inversion and Division Architecture in Elliptic Curve Cryptography over GF(2^n) by Jun-Cheol Jeon, Kyo-Min Ku, and Kee-Young Yoo.
• Efficient On-the-fly Detection of First Races in Nested Parallel Programs by Keum-Sook Ha, Yong-Kee Jun, and Kee-Young Yoo.
• New Architecture for Inversion and Division over GF(2^m) by Kyeoung Ju Ha, Kyo Min Ku, and Kee Young Yoo.
• Solving Linear Systems on Cluster Computers with High Accuracy by Carlos Amaral Hölbig, Paulo Sérgio Morandi Jr., Bernardo Frederes Krämer Alcalde, Tiarajú Asmuz Diverio, and Dalcidio Moraes Claudio.
• Finite Fields Multiplier based on Cellular Automata by Hyun-Sung Kim and Il-Soo Jeon.
• Efficient Systolic Architecture for Modular Multiplication over GF(2^m) by Hyun-Sung Kim and Sung-Woon Lee.
• Analyzing the Safety Problem in Security Systems using SPR Tool by Il-Gon Kim, Jin-Young Choi, Peter D. Zegzhda, Maxim O. Kalinin, Dmitry P. Zegzhda, In-Hye Kang, Pil-Yong Kang and Wan S. Yi.
• A CBD-based SSL Component Model by Young-Gab Kim, Lee-Sub Lee, Dongwon Jeong, Young-Shil Kim, and Doo-Kwon Baik.
• Exponentiation over GF(2^m) For Public Key Crypto System using Cellular Automata by Kyo Min Ku, Kyeoung Ju Ha, and Kee Young Yoo.
• Area Efficient Multiplier based on LFSR Architecture by Jin-Ho Lee and Hyun-Sung Kim.
• Efficient Authentication and Key Agreement for Client-Server Environment by Sung-Woon Lee, Hyun-Sung Kim, and Kee-Young Yoo.
• New Efficient Digit-Serial Systolic AB^2 Multiplier & Divider in GF(2^m) by Won-Ho Lee and Kee-Young Yoo.
• Parallel Co-Processor for Ultra-fast Line Drawing by Pere Marès, Antonio B. Martínez and Joan Aranda.
• The Design Patterns of Performance-Decision Factors in High-Speed NIDS by Jongwoon Park, Keewan Hong, Kiyoong Hong, Dongkyoo Kim, and Bongnam No.
• Scalable Race Visualization for Debugging Message-Passing Programs by Mi-Young Park, So-Hee Park, Su-Yun Bae, and Yong-Kee Jun.
• Hierarchical Structures for Multi-Resolution Visualization of AMR Data by Sanghun Park.
• The Development of Domain Specific Languages From Scientific Libraries by Daniel Quinlan.
• A Method to Derive the Cache Performance of Irregular Applications on Machines with Direct Mapped Caches by Carsten Scholtes.
• Interfacing C++ Member Functions with C Libraries by Kurt Vanmechelen and Jan Broeckhove.
• Compilation Techniques for a Chip-Multiprocessor with Two Execution Modes by Chao-Chin Wu.
• Explicit Formulas and Library of Images of Electromagnetic Fields for Anisotropic Materials by Valery Yakhno, Tatyana Yakhno and Mustafa Kasap.