PERFORMANCE-ORIENTED APPLICATION DEVELOPMENT FOR DISTRIBUTED ARCHITECTURES
Performance-Oriented Application Development for Distributed Architectures Perspectives for Commercial and Scientific Environments Edited by
M. Gerndt Department of Informatics, Technical University Munich, Munich, Germany
IOS Press / Ohmsha
© 2002, IOS Press All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, without the prior written permission from the publisher. ISBN 1 58603 267 4 (IOS Press) ISBN 4 274 90514 4 C3055 (Ohmsha) Library of Congress Control Number: 2002106946 This is the book edition of the journal Scientific Programming, Volume 10, No. 1 (2002). ISSN 1058-9244.
Publisher IOS Press Nieuwe Hemweg 6B 1013 BG Amsterdam The Netherlands fax: +31 20 620 3419 e-mail:
[email protected]
Distributor in the UK and Ireland IOS Press/Lavis Marketing 73 Lime Walk Headington Oxford OX3 7AD England fax: +44 1865 750079
Distributor in the USA and Canada IOS Press, Inc. 5795-G Burke Centre Parkway Burke, VA 22015 USA fax: +1 703 323 3668 e-mail:
[email protected]
Distributor in Germany, Austria and Switzerland IOS Press/LSL.de Gerichtsweg 28 D-04103 Leipzig Germany fax: +49 341 995 4255
Distributor in Japan Ohmsha, Ltd. 3-1 Kanda Nishiki-cho Chiyoda-ku, Tokyo 101-8460 Japan fax: +81 3 3233 2426
LEGAL NOTICE The publisher is not responsible for the use which might be made of the following information. PRINTED IN THE NETHERLANDS
Contents

Guest-editorial: PADDA 2001, M. Gerndt  1
Performance engineering, PSEs and the GRID, T. Hey and J. Papay  3
NINJA: Java for high performance numerical computing, J.E. Moreira, S.P. Midkiff, M. Gupta, P. Wu, G. Almasi and P. Artigas  19
Dynamic performance tuning supported by program specification, E. Cesar, A. Morajko, T. Margalef, J. Sorribes, A. Espinosa and E. Luque  35
Memory access behavior analysis of NUMA-based shared memory programs, J. Tao, W. Karl and M. Schulz  45
SKaMPI: a comprehensive benchmark for public benchmarking of MPI, R. Reussner, P. Sanders and J.L. Traff  55
Efficiently building on-line tools for distributed heterogeneous environments, G. Rackl, T. Ludwig, M. Lindermeier and A. Stamatakis  67
Architecting Web sites for high performance, A. Iyengar and D. Rosu  75
Mobile objects in Java, L. Moreau and D. Ribbens  91
Author Index

Almasi, G.  19
Artigas, P.  19
Cesar, E.  35
Espinosa, A.  35
Gerndt, M.  1
Gupta, M.  19
Hey, T.  3
Iyengar, A.  75
Karl, W.  45
Lindermeier, M.  67
Ludwig, T.  67
Luque, E.  35
Margalef, T.  35
Midkiff, S.P.  19
Morajko, A.  35
Moreau, L.  91
Moreira, J.E.  19
Papay, J.  3
Rackl, G.  67
Reussner, R.  55
Ribbens, D.  91
Rosu, D.  75
Sanders, P.  55
Schulz, M.  45
Sorribes, J.  35
Stamatakis, A.  67
Tao, J.  45
Traff, J.L.  55
Wu, P.  19
Scientific Programming 10 (2002) 1-2 IOS Press
Guest-Editorial
PADDA 2001
This special issue is devoted to programming models, languages, and tools for performance-oriented program development in commercial and scientific environments. The included papers are based on presentations given at the workshop PADDA 2001 (Performance-Oriented Application Development for Distributed Architectures - Perspectives for Commercial and Scientific Environments), held at Technische Universität München on April 19-20, 2001. The workshop was funded by KONWIHR (Competence Network for Scientific High Performance Computing in Bavaria, konwihr.in.tum.de). The goal of the workshop was to identify common interests and techniques for performance-oriented program development in scientific and commercial environments.

Distributed architectures currently dominate the field of highly parallel computing. Highly parallel machines range from clustered shared-memory systems, such as the Hitachi SR8000 and SGI Origin 2000, to PC-based (Beowulf) clusters connected via Myrinet, SCI or Fast Ethernet. In addition, meta-computing applications are executed in parallel on high-performance systems and special-purpose systems connected via a wide-area network.

Distributed architectures based on Internet and mobile computing technologies are important target architectures in the domain of commercial computing too. Program development is facilitated by distributed object standards such as CORBA, DCOM, and Java RMI. These standards provide support for handling remote objects in a client-server fashion, but must also ensure certain guarantees on the quality of service. A large number of applications, such as data mining, video on-demand, and e-commerce, require powerful server systems that are quite similar to the parallel systems used in high performance computing. In both areas performance tuning plays a very important role.
For high performance computing, both individual applications and co-operating applications in meta-computing environments have to be tuned to
improve speedup and enable new, complex simulations to be undertaken. In commercial data processing, performance is crucial for handling huge data sets and large numbers of customer requests. For example, e-commerce applications must deliver rapid response times to customers, which may become a deciding factor in whether customers remain with a given service provider. Similarly, such electronic businesses may need to sustain a large number of transactions and manage large data sets. A similar requirement is found in emerging applications in scientific computing that make use of distributed visualization and data management.

On the one hand, there is an uptake of programming standards from the commercial arena into high performance computing, since these standards facilitate meta-computing and since programmers trained at universities are familiar with these programming styles. On the other hand, the area of high performance computing has accumulated extensive experience in performance analysis that might be applicable to the commercial arena.

The papers included in this special issue come from both areas, scientific computing and commercial computing. The first five papers are devoted to scientific computing, while the other three present aspects and techniques for commercial computing.

The first paper, by Tony Hey et al., on Performance Engineering, PSEs and the Grid, gives an introduction to requirements and standard techniques for performance analysis in scientific environments. It also presents new requirements and techniques in the areas of problem solving environments and grid computing.

The second paper, by Jose Moreira et al., on NINJA: Java for High Performance Numerical Computing, discusses the challenges in using Java for scientific programs. It presents the motivation for using Java for numerical applications as well as the current major performance problems in doing so.
It also presents recent techniques for improving the performance of numerical codes written in Java.
The third paper, by Eduardo Cesar et al., on Dynamic Performance Tuning Supported by Program Specification, presents an environment for automatic performance analysis and program tuning of message passing codes. Kappa-Pi analyzes traces from message passing programs, detects performance bottlenecks automatically, and gives recommendations for improving the code. The environment is currently being extended towards on-line tuning of applications based on abstract program structures, such as SPMD or divide and conquer.

The fourth paper, by Jie Tao et al., on Memory Access Behavior Analysis of NUMA-based Shared Memory Programs, presents a new monitoring infrastructure for SCI-based NUMA machines. It is based on a low-level hardware monitoring facility in SCI interfaces and includes software layers and tools that allow the user to investigate remote access performance in the context of the program's control flow and data structures.

The fifth paper, by Ralf Reussner et al., on SKaMPI: A Comprehensive Benchmark for Public Benchmarking of MPI, describes a benchmark for MPI programs which incorporates well-accepted mechanisms for ensuring accuracy and reliability. The benchmarking results are maintained in a public performance database covering different hardware platforms and MPI implementations. It thus assists users in developing performance-portable MPI applications.

The second group consists of three papers presenting aspects of performance-oriented programming in commercial environments. The first of these, by Gunther Rackl et al., on Efficiently Building On-line Tools for Distributed Heterogeneous Environments, presents a methodology for developing on-line tools based on MIMO, a novel middleware monitor. The MIMO environment implements an event-based on-line approach in the context of CORBA applications frequently used in commercial environments. The second paper in this group, by Arun Iyengar et al., on
Architecting Web Sites for High Performance, gives an overview of recent solutions concerning the architectures and software infrastructures used in building Web site applications. The challenge for such Web sites is the specific combination of highly variable workload characteristics with high performance demands regarding service level, scalability, availability, and cost.

The third paper in this group, by Luc Moreau et al., on Mobile Objects in Java, discusses a novel algorithm for routing method calls. Mobile objects have emerged as a powerful paradigm for distributed applications. Important for the success of that paradigm are the performance of transparent method invocation and the availability of distributed garbage collection. The article compares different implementations based on synthetic benchmarks.

The broad spectrum of topics covered by the papers in this special issue will, hopefully, enable readers to benefit from techniques developed in both areas. New research and development projects should take advantage of the similar performance requirements in both areas and thus broaden the opportunities for exploitation of the results. Certainly, the workshop was very successful in steering discussions and triggering new ideas for joint research.

I would like to thank the members of the PADDA 2001 program committee, Marios Dikaiakos (University of Cyprus), Wolfgang Gentzsch (SUN Microsystems), John Gurd (University of Manchester), Gabriele Kotsis (Universität Wien), Frank Mueller (North Carolina State University), Jim Laden (SUN Microsystems), and Omer F. Rana (University of Wales-Cardiff), for setting up an exciting program and for selecting these excellent papers.

Michael Gerndt
Technische Universität München
Germany
Scientific Programming 10 (2002) 3-17 IOS Press
Performance engineering, PSEs and the GRID

Tony Hey^a and Juri Papay^b

^a Director UK e-Science Programme, EPSRC, Polaris House, North Star Avenue, Swindon SN2 1ET, UK
E-mail: [email protected]

^b Department of Electronics and Computer Science, University of Southampton, Southampton SO17 1BJ, UK
E-mail: [email protected]
Abstract: Performance Engineering is concerned with the reliable prediction and estimation of the performance of scientific and engineering applications on a variety of parallel and distributed hardware. This paper reviews the present state of the art in 'Performance Engineering' for both parallel computing and meta-computing environments and attempts to look forward to the application of these techniques in the wider context of Problem Solving Environments and the Grid. The paper compares various techniques such as benchmarking, performance measurements, analytical modelling and simulation, and highlights the lessons learned in the related projects. The paper concludes with a discussion of the challenges of extending such methodologies to computational Grid environments.
1. Introduction

Performance has been a central issue in computing since the earliest days [3]: 'As soon as the Analytical Engine exists, it will necessarily guide the future course of science. Whenever any result is sought by its aid, the question will then arise - by what course of calculation can these results be arrived at by machine in the shortest time?' Performance engineering may be defined as a systematic approach in which components of both the application and the computer system are modelled and validated. Although performance is probably one of the most frequently used words in the vocabulary of computing, paradoxically it is evident that there is a substantial "knowledge gap" between the software development process and actual performance estimation and optimisation. It is still often the case that programmers and software system designers have insufficient knowledge of the performance implications of their design choices. Indeed, it is clear that systematic performance engineering is not yet an integral part of the software development process and that performance issues often arise very late in the process. As a result it is not surprising that performance problems are a frequent cause of failure of large software development projects.
One possible reason why performance issues do not feature explicitly in current software methodologies is Moore's Law. Up to now, the exponential growth in microprocessor performance has usually enabled users to avoid hitting any serious 'performance wall'. However, in the relatively near future, it is likely that growth in processor performance will slow down and begin to deviate from Moore's Law. Software developers will then be forced to pay more attention to the efficient use of the available silicon real estate. Furthermore, if the Grid becomes a reality and computer resource 'marketplaces' begin to emerge, software performance on different hardware platforms will be directly related to real costs. In such a 'computational economy', it is clear that performance engineering and reliable performance estimation will play a pivotal role in the establishment of realistic 'performance contracts'. A performance contract is the product of the negotiating process between the suppliers and customers of computing resources. It contains information about the resource demands of applications and the available computing capacity. At present this feature is not available, mainly due to the lack of reliable performance estimation techniques. The Grid is assumed to be 'an infrastructure that enables flexible, secure, coordinated resource sharing among dynamic collections of individuals, institutions and resources' [13]. The resources accessible
T. Hey and J. Papay / Performance engineering, PSEs and the GRID
via the Grid include computational systems, data storage and specialized facilities and are thus a richer set of 'informational utilities' than the Web. In this context it is helpful to consider the Grid as providing the global middleware infrastructure that will enable the establishment of transient "virtual organizations" on a transparent and routine basis. There will be many different types of applications for the Grid and in many cases it is likely that Problem Solving Environments (PSEs) generalized to the Grid will play an important role. A PSE is an application-specific environment that provides the user with support for all stages of the problem solving process from program design and development to compilation and performance optimisation. Such an environment also provides access to libraries and integrated toolsets, as well as support for visualization and collaboration. It may also implement some form of automated workflow management. Performance engineering - including estimation, monitoring and measurement - will be an integral component of any Grid PSE since reliable models of performance prediction will be required for any realistic Grid scheduling and accounting packages. The paper is organized as follows. Section 2 reviews several different approaches that have been attempted for performance engineering and gives a short account of some performance benchmarking, monitoring and simulation techniques. Section 3 takes a brief look at Problem Solving Environments in the context of performance and presents a short account of two recent UK-based PSE projects. The next section outlines some of the challenges represented by the Grid for the performance evaluation community and reviews some EU experiments on Europe-wide meta-computing. Finally we offer some conclusions and challenges for the performance engineering community.
2. Performance engineering approaches

2.1. Benchmarks

The goal of benchmarking is to understand and predict the key parameters that determine the performance of computing platforms and full scale applications. There have been numerous benchmarking efforts undertaken in the past but no general agreement on how to conduct the measurements and how to interpret the results. Examples of benchmarking efforts include the Livermore Loops [38], the NAS Kernels [4] and the Parkbench initiative [16]. It is worthwhile for us to
summarize the objectives and achievements of Parkbench. The main objectives of the ParkBench initiative were:

1) To establish a comprehensive set of parallel benchmarks that is generally accepted by both users and vendors of parallel systems.
2) To provide a focus for parallel benchmark activities and avoid unnecessary duplication of effort and proliferation of benchmarks.
3) To set standards for benchmarking methodology and result-reporting, and to establish a database/repository for both benchmarks and results.
4) To make parallel and sequential versions of the benchmarks and results freely available in the public domain.

As a result of this effort a benchmark suite was developed which contains sequential and message passing versions of the following codes:

- 5 Low Level Communication codes
- 5 Low Level Sequential codes
- 5 Parallel Linear Algebra Kernels
- 2 NAS Parallel Benchmark Kernels
- 3 NAS Compact Application codes
- ORNL Shallow Water Model Application code
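The low-level communication codes in suites of this kind typically measure message latency and bandwidth with a ping-pong pattern. The sketch below illustrates the idea using Python's multiprocessing pipes rather than a real message passing library; the function names and parameters are illustrative, not taken from Parkbench.

```python
import time
from multiprocessing import Process, Pipe

def echo(conn, rounds):
    # Echo every received message straight back to the sender.
    for _ in range(rounds):
        conn.send(conn.recv())

def pingpong(size, rounds=200):
    """Average one-way transfer time (seconds) for messages of `size` bytes."""
    parent, child = Pipe()
    p = Process(target=echo, args=(child, rounds))
    p.start()
    payload = bytes(size)
    start = time.perf_counter()
    for _ in range(rounds):
        parent.send(payload)
        parent.recv()
    elapsed = time.perf_counter() - start
    p.join()
    return elapsed / (2 * rounds)  # half a round trip = one-way time

if __name__ == "__main__":
    # Like a COMMS1-style benchmark, sweep the message length: the
    # small-message limit estimates the startup latency, the slope at
    # large messages the asymptotic bandwidth.
    for n in (1, 1024, 65536):
        print(f"{n:6d} bytes: {pingpong(n) * 1e6:8.1f} us one-way")
```

A real low-level benchmark would repeat each measurement until the timing is statistically stable; this sketch only shows the measurement pattern.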
OpenMP versions of some of these codes are now available [18]. By providing three tiers of benchmark complexity - low-level, kernel and compact application - it was hoped that the performance of real applications could be understood. The low-level codes provide basic machine parameters, the kernels provide information about compute intensive algorithms and the compact applications add the complexities of start-up, I/O and so on. In the event, Parkbench was only partially successful: lack of dedicated funding for such a benchmark evaluation programme prevented its full exploration and realization. Nonetheless, there were significant achievements - a serious programming methodology was defined, parallel versions of the NAS Parallel Benchmarks were made available in the public domain and a repository for results with a graphical interface was established. In addition, an electronic journal for the rapid publication of performance results was established [19], and this now continues as a special section of the journal Concurrency and Computation: Practice and Experience. Two examples will illustrate the type of data that was made available by these benchmarks. Figure 1 shows the performance of the parallel LU kernel benchmark
[Fig. 1. Performance of vector vs. distributed memory machines on the LU kernel: performance (log scale) against number of processors for the C-90 (Aug. 92), T3D (Oct. 94), SP-1 (Feb. 94), SP-2 (Aug. 94) and Paragon-OSF1.2 (Jul. 94).]
on distributed memory (DM) machines. The data is from 1994 and shows, perhaps for the first time, a parallel DM system outperforming the largest vector supercomputer of the time. Figure 2 shows results of the low-level communication benchmark COMMS1 for the Intel Delta and the Intel iPSC/860 systems.

The main concern of performance engineering is to develop techniques that enable prediction of the performance and resource requirements of applications; in this sense benchmarking has little to offer. The results of benchmarks are usually expressed as a single number, which is sufficient for comparing and rating various computers, but this number provides little information about the key parameters governing the resource requirements of real applications.

2.2. Performance measurements

Performance measurements are based on event profiling and tracing. These techniques assume an event model of program execution. During program execution, profiling tools accumulate summary data for significant events such as function calls, cache misses, communications etc. Typically, this approach has a low overhead since such profiling is usually implemented by simple event counters. Using this method users can obtain statistical information on the percentage of time their application spends in various functions and can use this information to identify potential problem areas. Trace tools, on the other hand, provide the user with much more detailed information on the sequence of events as they happen during program execution. Trace files record time-ordered events and can constitute a large volume of data. The information recorded in a trace can represent various levels of abstraction. In the case of parallel platforms, the recorded traces for the different nodes need to be collected, sorted according to time stamps and merged into a global event trace. Although trace tools provide more detailed information about parallel program execution than profiling tools, there can be a significant overhead for trace generation. Furthermore, the large volumes of trace data generated can be overwhelming.

There are numerous commercial and academic tracing and profiling tools available. Examples include Apprentice [8], gprof, Vampir [41] and Paradyn [5]. Apprentice is a product of Cray Research which uses source code instrumentation through compiler switches to provide statistics at the level of functions and basic blocks. The advantage of source code instrumentation is that the results of monitoring can be easily interpreted in terms of programming language statements and give very direct feedback to the programmer. A problem with this approach is that libraries cannot generally be monitored at the same level of detail since they are usually only available in binary format. Unlike the
[Fig. 2. Benchmarking of communication latency: latency (log scale) against message length in bytes.]
Cray product, Vampir is a commercial graphical event trace browser from Pallas that is available on many different platforms. Vampir also provides visualization and statistical analysis of trace files.

Measurement and profiling tools have achieved a high level of maturity; however, the use of these tools for reliable performance estimation is still an open issue. These tools can generate a large volume of detailed data, but in order to gain some understanding of the application's runtime behaviour this data needs to be processed and interpreted. The interpretation is not a simple and straightforward process; it requires user intervention and considerable knowledge of the problem domain. An important aspect of these tools is the level of intrusion, which affects the accuracy of measurements and can even alter the behaviour of the system; this is often not mentioned or not quantified. The amount of data collected during the measurement is related to the level of intrusion, and it is therefore vital to find the right balance between the volume of data and the acceptable level of intrusion.

2.3. Analytical models

Historically, Hockney's n1/2 and r∞ 'pipeline' model provided a useful abstraction of vector supercomputer architecture [15]. This pipeline model has
been extended to characterize communication performance in parallel DM message-passing systems. In this case the pipeline parameters captured the communication latency and asymptotic communication bandwidth. Typically, the nodes of such systems are scalar processors but it is also possible to use Hockney-style pipeline parameters to provide a simple characterization of the memory hierarchy of the node. These are basic hardware parameters. It is also possible to characterize the 'computational intensity' of an application in terms of the ratio of the number of arithmetic operations performed per off-chip memory access. The parallel program is represented as an alternation of nonoverlapping computation and communication stages. The application program is described in terms of the number of scalar floating point operations, the amount of data transferred and the number of messages. The output of the model is assumed to be the sum of processing and communication times. The model assumes perfect load balance and the timing formula derived for a single processor is used for the performance characterisation of the whole parallel program. A weakness of this model is that it is only valid for the performance analysis of parallel algorithms with regular structures and good load balancing. In recent years, several other cost models have been developed. These include the BSP (Bulk Synchronous
Parallel) model [46,36,37] and the LogP model [9]. Both approaches attempt to get beyond the usual (unrealistic) assumptions made in 'classical' PRAM complexity analysis. In particular, an attempt is made to take into account the limitations of real systems such as network bandwidth and communication and synchronisation overheads. These models aim to provide a machine-independent framework for parallel algorithm design and performance prediction. The LogP model characterises the parallel system by the following parameters:

- the upper bound on communication latency from source to target (L),
- the overhead of send and receive (o),
- the minimum time interval between consecutive transmissions or receptions (g),
- the number of processor/memory modules (P).

The model assumes an asynchronous execution mechanism and finite network capacity, and specifies the work (W) between communications. The BSP model, on the other hand, has an equally simplistic and unrealistic cost model. In fact, the Oxford 'BSP model' bears almost no resemblance to Valiant's actual BSP complexity analysis. In order to prove any useful results, in his original BSP model Valiant requires parallel slackness at the nodes to hide communication delays, two-phase random routing of messages to avoid possible network congestion, and data hashed randomly across the processors with a sufficiently random class of hash functions to avoid memory 'hot spots'. All that is left of Valiant's BSP analysis in the Oxford BSP model is the programming methodology of the bulk synchronous programming style! Both the BSP and LogP models are similar in the sense that they attempt to provide an abstraction of the performance of the communication network and processing nodes using a minimal number of 'average' performance parameters. Details of the network topology and memory hierarchy are ignored. Such models represent, at best, a "back-of-the-envelope" approach to performance prediction.
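To illustrate how the LogP parameters are used in cost estimation, the sketch below computes the completion time of a greedy single-source broadcast, in which every informed processor keeps sending to uninformed ones as fast as the gap g permits. This is an illustrative calculation under stated assumptions (g >= o, one small message per send), not an algorithm from the cited papers, and the function name is invented here.

```python
import heapq

def broadcast_time(P, L, o, g):
    """Estimated time to broadcast one message to P processors under LogP.

    A single message costs o (send overhead) + L (latency) + o (receive
    overhead) end-to-end; a sender may start its next send g after the
    previous one. The broadcast schedule here is greedy, not provably
    optimal.
    """
    assert g >= o, "classic LogP analysis assumes g >= o"
    ready = [0.0]              # times at which informed processors can send
    informed, finish = 1, 0.0
    while informed < P:
        s = heapq.heappop(ready)        # earliest available sender
        arrival = s + o + L + o         # receiver is informed at this time
        informed += 1
        finish = max(finish, arrival)
        heapq.heappush(ready, s + g)    # sender is free again after the gap
        heapq.heappush(ready, arrival)  # receiver can start sending too
    return finish
```

For P = 2 this reduces to the point-to-point cost 2o + L, and the time grows with P as the tree of informed processors widens.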
Nevertheless, it must be said that in some cases, as the experiments on CM-5 showed, the LogP model was able to provide a close match to actual performance measurements [11]. As we will see below, a similar 'average' speed analysis of the Livermore loops ignoring any effects of the memory hierarchy gives performance results that can be over 100% under- or over-estimated. Other interesting approaches to performance modelling include Carter's Parallel Memory Hierarchy Model [2] and the Manchester group's Overhead Analysis approach [40,44].
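For reference, Hockney's pipeline parameters mentioned above are conventionally related as follows (a standard presentation of the model, not formulas quoted from this paper):

```latex
% t_0: startup time (latency), r_inf: asymptotic rate, and n_{1/2}: the
% message or vector length at which half of r_inf is achieved, n_{1/2} = r_inf t_0.
t(n) = t_0 + \frac{n}{r_\infty} = \frac{n + n_{1/2}}{r_\infty},
\qquad
r(n) = \frac{n}{t(n)} = \frac{r_\infty}{1 + n_{1/2}/n}.
```

The same two-parameter form describes both vector pipelines (n = vector length) and message passing (n = message length in bytes).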
Analytical modelling is in effect complexity analysis: it involves an abstract model of program execution and cost models of computation and communication operations. This technique attempts to combine the parameters of the application and the computer to produce a mathematical expression for performance estimation. The approach involves numerous assumptions which approximate the system and application behaviour. Although these approximations make the performance evaluation analytically tractable, they significantly reduce the accuracy of predictions and limit the applicability of analytical models to certain classes of applications or computer architectures.

2.4. Simulation

Simulation can provide very detailed information about both the computer system and the application program. This information can be at various levels: in terms of hardware architecture, ranging from simulation of only the main components of the architecture right down to simulation at the gate level; and on the application side, from programming language statements down to machine code. The simulation model characterizes the system by a number of state variables that are updated as the simulation progresses. Simulation techniques can be classified into four basic types: instruction driven, trace driven, execution driven and event driven.

Instruction driven simulation is based on interpreting instructions of the target machine. This technique gives high accuracy, achieved by step-by-step simulation of each instruction, but requires long simulation times. Full instruction level simulation is too time consuming for practical use on real applications and complex machine architectures.

Trace driven simulation uses records of measurements obtained from the real system, or synthetic traces generated by a trace generator. The trace is a sequence of user-defined events generated by an instrumented program.
This technique can require large amounts of memory and processing time to produce reliable results. The DIMEMAS tool is an example of a trace-driven performance prediction tool for message-passing parallel programs [34]. This tool uses the Vampir trace file and scales the CPU time spent in each block of code, and the parameters of communication events, according to the target machine parameters.

Event driven simulation maintains a global queue of events. The operation cycle consists of event fetching, the simulation step and the update of the data structure representing the simulated system [42]. The inputs of
[Fig. 3. Prediction error of static statement analysis.]
event driven simulation models are probability distributions of response times, request arrivals and delays. The main disadvantage of probabilistic workload models is that they do not directly represent the parameters of specific application programs, and the results of the simulation require in-depth statistical processing in order to determine the accuracy of the model.

Finally, execution driven simulation interleaves the execution of an application with the simulation of the target system. The main advantages of execution driven simulation are its speed and its use of actual programs, rather than distribution-based or trace driven workloads, for the simulation of parallel architectures. That such an execution driven approach is feasible for the simulation of parallel systems has been demonstrated by systems such as the Rice Parallel Processing Testbed (RPPT) [7] and the Wisconsin Wind Tunnel [43]. The disadvantage of this technique is that it is more difficult to implement than the trace driven approach, due to the complex interactions between the application program and the simulator.

A key lesson learned from previous work is that any simulation method must take account of the memory hierarchy for realistic performance estimation. Simple static statement analysis of the source code is well known to be unpredictably unreliable, as illustrated by Fig. 3, which shows the difference between predicted and actual execution times of the Livermore Fortran kernels on SPARC 1 and SPARC 5 workstations.
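The event driven operation cycle described above - fetch the earliest event, perform the simulation step, update the state - can be sketched with a global priority queue. This is a minimal illustrative skeleton (all names are invented here), not a description of any of the cited simulators:

```python
import heapq

def simulate(initial_events, handlers, until=float("inf")):
    """Minimal event driven simulation loop.

    `initial_events` is a list of (time, kind, data) tuples; `handlers`
    maps each event kind to a function returning follow-up events.
    """
    queue = list(initial_events)
    heapq.heapify(queue)
    state = {"clock": 0.0, "log": []}
    while queue:
        t, kind, data = heapq.heappop(queue)     # event fetching
        if t > until:
            break
        state["clock"] = t                       # advance simulated time
        state["log"].append((t, kind))
        for ev in handlers[kind](state, data):   # simulation step
            heapq.heappush(queue, ev)            # schedule follow-up events
    return state

# Example workload: jobs arrive every 2.0 time units, service takes 1.5.
def arrival(state, job_id):
    evs = [(state["clock"] + 1.5, "departure", job_id)]
    if job_id < 3:
        evs.append((state["clock"] + 2.0, "arrival", job_id + 1))
    return evs

def departure(state, job_id):
    return []

result = simulate([(0.0, "arrival", 0)],
                  {"arrival": arrival, "departure": departure})
```

A real simulator would of course carry a much richer state (queues, resources, statistics); the point here is only the fetch-step-update cycle around a single time-ordered event queue.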
The PERFORM system developed by Dunlop at Southampton is an execution-driven simulation tool that uses a novel 'Fast Simulation Method' to improve the accuracy of prediction and overcome some of the problems of a full simulation of the memory hierarchy [10,14]. The model uses the "program slicing" technique to isolate the control variables and array indices of the source code, retaining sufficient information to simulate data movement within the memory hierarchy. The sliced program is then augmented with calls to the PERFORM simulator, which models the effects of the memory hierarchy (cache memory), computation and message passing. Fast simulation is achieved by providing feedback between the simulator and the source, and by curtailing loop execution once the cache behaviour of the iterations can be reliably estimated. The main stages of the Fast Simulation Method used in the PERFORM tool are illustrated in Fig. 4. An example of the accuracy of the predictions that can be achieved by the PERFORM tool is shown in Fig. 5, which compares actual and predicted performance on a SPARC system. As can be seen, the predicted lower bound provided by PERFORM captures the detailed cache effects very accurately.

Simulation is a useful technique for the performance evaluation of systems at the design stage of development. Concerns with such simulation-based approaches are the level of detail of the simulation model, its accuracy and the simulation time. Simulation models are usually large programs, their development is expensive, and simulations of the architecture at the instruction level, for example, require a long run-time in order to provide meaningful results.
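The curtailment idea can be illustrated with a toy direct-mapped cache model: simulate the loop iteration by iteration and stop as soon as the per-iteration miss count has stabilised, then extrapolate. This is only a sketch of the general technique, not PERFORM's actual slicing and feedback machinery, and all parameters below are invented:

```python
def predict_misses(addr_stream_for_iter, n_iters, cache_lines=256,
                   line_bytes=32, stable_window=8):
    """Curtailed cache simulation: run a direct-mapped cache model one
    iteration at a time and stop early once the per-iteration miss count
    has been identical for `stable_window` consecutive iterations, then
    extrapolate to the full loop."""
    tags = [None] * cache_lines
    history = []
    for it in range(n_iters):
        misses = 0
        for addr in addr_stream_for_iter(it):
            line = addr // line_bytes
            idx = line % cache_lines
            if tags[idx] != line:          # cache miss: fill the line
                tags[idx] = line
                misses += 1
        history.append(misses)
        if len(history) >= stable_window and \
           len(set(history[-stable_window:])) == 1:
            # behaviour reliably estimated: curtail and extrapolate
            remaining = n_iters - (it + 1)
            return sum(history) + misses * remaining
    return sum(history)

# Unit-stride sweep over a 64 KB array of 8-byte elements, 1000 iterations.
stream = lambda it: range(0, 65536, 8)
est = predict_misses(stream, 1000)
```

Because the sweep thrashes the small cache identically in every iteration, the model settles after eight iterations and the remaining 992 are never simulated, which is precisely the source of the speed-up.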
Fig. 4. Key stages of fast simulation method.

Fig. 5. Predictive power of PERFORM tool.

3. Problem Solving Environments

A Problem Solving Environment (PSE) is an integrated computing environment which incorporates all stages of the problem-solving process, such as problem specification, computation, analysis and optimization. The key issues in the architectural design of PSEs are interoperability, modularity and reusability of components. The problem of interoperability between different software packages stems from the different (and often proprietary) file formats produced by the various components. This is often the case, for example, in mechanical engineering, where data exchange must frequently be implemented between various CAD packages and CAE tools. Several recent papers [20,21] highlight the need to adopt a universal file format based on XML that will simplify data exchange and structure specification. Many of these pleas come from users with real industrial applications, but vendors of component software packages see little commercial incentive to make their software easily interoperable with packages from other vendors.

At present there is no generally accepted methodology for the specification, analysis and design of modular, reusable systems. In recent years there has been significant progress in the development of middleware technologies that provide support for object-based system integration. Examples of these technologies include CORBA, DCOM and, in the context of Web Services, the recently proposed SOAP protocol. Although these technologies share many common features, there is no universally accepted definition of objects, and consequently their claims to provide full interoperability within the same system are somewhat questionable. At present, CORBA is the dominant middleware technology in the PSE world. Key advantages of CORBA are the existence of a single specification document and the participation of more than 800 companies in the consortium. Nevertheless, despite the existence of the IIOP inter-ORB protocol, there is still a problem for applications that attempt to mix two or more of the large number of vendor-specific implementations of CORBA. An unwelcome result of this diversity is the problem of interoperability between competing middleware products. It should also be emphasised that programming with any of the CORBA implementations is not trivial, and the C++ syntax is rather complex. By contrast, DCOM is Microsoft's answer to CORBA, and it has the definite advantage of a single specification, a single implementation and a single vendor. A limitation is that DCOM runs only on Windows.
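As a toy illustration of the XML-based exchange format that papers such as [20,21] argue for, the snippet below serialises a small mesh in a self-describing form that any XML-aware tool can parse without knowledge of the producer's internal file format. The element and attribute names are invented for illustration and come from no real CAD/CAE schema:

```python
import xml.etree.ElementTree as ET

# Build a small, self-describing mesh document of the kind an XML-based
# exchange format might use (all names here are hypothetical).
mesh = ET.Element("mesh", {"dim": "2"})
nodes = ET.SubElement(mesh, "nodes")
for nid, (x, y) in enumerate([(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]):
    ET.SubElement(nodes, "node", {"id": str(nid), "x": str(x), "y": str(y)})
elems = ET.SubElement(mesh, "elements")
ET.SubElement(elems, "tri", {"id": "0", "nodes": "0 1 2"})

document = ET.tostring(mesh, encoding="unicode")

# A consuming tool recovers the structure from the document alone.
parsed = ET.fromstring(document)
coords = [(float(n.get("x")), float(n.get("y"))) for n in parsed.find("nodes")]
```

The point is not the particular schema but that producer and consumer need agree only on the tag vocabulary, not on each other's binary formats.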
Fig. 6. A PSE architecture built from Cardiff/Southampton PSE components.
Arguably, two of the most successful PSEs are the commercial products Matlab and Mathematica. Both of these mature software environments have successfully combined good usability with functionality. Strictly speaking, these products perhaps do not qualify as genuine PSEs, since they both provide rather generic environments for a broad range of application areas rather than an environment targeted at a single application. Nevertheless, these products have a wide user base and demonstrate what is possible in principle, though neither has seen the development of a version for parallel systems as a high commercial priority. Apart from these examples, there are many 'research' PSEs either in existence or under development at universities and research institutes around the world. Examples include GasTurbnLab from Purdue [22], BioPSE from Utah [23] and Autobench from Stuttgart [24]. Unfortunately, it is not clear whether any of these PSE systems is much used by real users. As examples of PSE projects, we shall briefly describe two ongoing UK projects: one with considerable direct industrial involvement and the other a purely 'academic' project. These are the Swansea/BAE Systems project [49] and the Cardiff/Southampton PSE project [47]. The Swansea/BAE Systems project represents a complete industrial environment for multi-disciplinary computational fluid dynamics, electro-magnetics and structural mechanics simulations. This PSE includes a geometry builder, mesh repair, unstructured grid generation, grid quality analysis, post-processing and data analysis, execution on remote/parallel platforms, help facilities and application integration. The system is based on CORBA and uses a parallel architecture (VIPar) for image processing. This PSE has been further developed in the CAESAR and JULIUS EU projects. A key problem in the implementation of their 'Computational Science Pipeline' is the data transfer between the different components.

The aim of the Cardiff/Southampton PSE project is to leverage modern software technologies such as CORBA, Java and XML and to develop key modules which can be used for the rapid prototyping of application-specific PSE environments [47]. The main components developed in this project are a Visual Component Composition Environment, an Intelligent Resource Manager based on the Southampton Intrepid Scheduler [1], and a Software Component Repository. The two applications targeted by this project are Molecular Dynamics and Photonic Crystal Structures simulations. A PSE architecture incorporating the developed components is presented in Fig. 6. The system is based on the object-web concept, in which services are represented as network objects. The problem is formulated as an XML request by the user and the response, also in XML, is produced by the Web server. The user interface is embedded in the browser environment and enables visual programming by allowing "drag-and-drop" of objects in a task-graph design area. The interaction with the user is implemented as a sequence of Web pages. As the commercial middleware ORB, the ORBacus implementation of CORBA was selected; it is a mature product and provides numerous services for naming, trading and an interface repository.

The main task of the Monitor is to collect and store information about machines and tasks running in the system. The information about machines includes data about the available resources, such as memory, processors, disk and load. The task information is a representation of the resources used by the given task, such as the size of occupied memory, communication and I/O traffic, and disk volume used. The information collected by the Monitor is stored in a database and used by the Scheduler for task allocation and load balancing. The interactions of the Monitor with the other components of the system are represented in Fig. 7. On each computer an Object Server is deployed which instantiates the Reporter object. The Reporter registers with the Name Server, which maintains a list of remote object addresses. The Monitor queries the Reporter objects at regular intervals and updates the Machine and Task tables in the database. The Scheduler provides task allocation to resources, run-time forecasting and dynamic load balancing. The scheduling is based on machine-independent application load models for CPU, memory size, I/O traffic and disk volume. These algebraic expressions are included in the description of each task and represent the task's resource requirements. The scheduling algorithm performs the following steps for each task in the task-graph:

- check the availability of licenses, memory and disk
- generate a list of candidate machines
- compute the time components
- select the minimum execution time
- include the task-machine binding in the schedule
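The five steps above can be sketched as a greedy per-task loop. The machine and task attributes here are invented for illustration, and the 'time components' are collapsed into a single work/speed estimate rather than the full CPU, I/O and disk models used by the Intrepid scheduler:

```python
def schedule(tasks, machines):
    """Greedy sketch of the per-task loop: filter candidate machines by
    resource availability, estimate the execution time on each, and bind
    the task to the machine with the minimum estimate."""
    bindings = {}
    for task in tasks:                      # tasks in task-graph order
        # Step 1: check availability of licenses, memory and disk.
        candidates = [m for m in machines
                      if task["license"] in m["licenses"]
                      and m["free_mem"] >= task["mem"]
                      and m["free_disk"] >= task["disk"]]
        if not candidates:
            raise RuntimeError("no machine can run " + task["name"])
        # Steps 2-3: candidate list with a time estimate for each
        # (here simply CPU work divided by machine speed).
        times = {m["name"]: task["work"] / m["speed"] for m in candidates}
        # Step 4: select the machine giving the minimum execution time.
        best = min(times, key=times.get)
        # Step 5: include the task-machine binding in the schedule.
        bindings[task["name"]] = best
    return bindings

machines = [
    {"name": "sp2", "speed": 4.0, "free_mem": 512, "free_disk": 10,
     "licenses": {"nastran"}},
    {"name": "o2k", "speed": 2.0, "free_mem": 128, "free_disk": 50,
     "licenses": {"nastran", "sysnoise"}},
]
tasks = [
    {"name": "static3d", "license": "nastran", "mem": 256, "disk": 5,
     "work": 100.0},
    {"name": "acoustic", "license": "sysnoise", "mem": 64, "disk": 2,
     "work": 40.0},
]
plan = schedule(tasks, machines)
```

A production scheduler would also update each machine's free resources as tasks are bound, and re-plan dynamically from the Monitor's measurements.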
The three-year project is nearing completion, and a full evaluation with performance measurements will be available soon. Preliminary indications are that such a component-based approach can bring real advantages in terms of software development and deployment. In the final analysis, however, it will be the reaction of users to such an environment that will provide the real measure of success or failure! On the whole, the lessons learnt from many of these PSE experiments are not very encouraging. There are major problems in automating the data flow between CAD and CAE tools. Furthermore, there is little incentive for vendors of legacy codes to make their products interoperable with tools from other vendors. In addition, present PSEs focus almost entirely on the design
and simulation part of the engineering process. There is a real need to incorporate the experimental validation and testing part of the process. Thus a complete PSE would offer support for the recording and analysis of experimental as well as simulation data. The incorporation of databases of experimental measurements and simulation results will allow the development of data mining and knowledge discovery components of the PSE. PSEs have a long way to go to prove their worth in real engineering environments! PSEs often represent large-scale meta-applications which require massive computing resources. In this case, the role of performance engineering is to predict the resource requirements of the application, to ensure that sufficient computing capacity is available, and to assign the individual tasks to the most appropriate computers.
4. Grids

4.1. The Grid as a new paradigm

The significant investments currently being made in Grid research show that governments around the world are taking the development of such infrastructure middleware very seriously. With the recent announcement of IBM's support for the Grid, it is not unreasonable to expect that the Grid will eventually become the key middleware not only for science and engineering but also for industry and commerce. In the US, several agencies are funding major Grid initiatives. Examples include:

- NASA Information Power Grid [25]
- NSF Science Grid [26]
- NSF GriPhyN Project [27]
- DOE PPDG [28]
- NSF NVO [29]
- NSF NEESGrid [30]
Most of these Grid infrastructure and application projects make use of the Globus Toolkit [12] as the basic platform on which to provide Grid services. In addition, the NSF-funded GrADS project [6] identifies many important research issues for Grid computing. Europe has also been active in Grid R&D. In addition to two initial EU Grid projects, DataGrid [17] and EuroGrid [31], several new EU Grid-centred projects are currently under negotiation. National governments in the EU have also recognised the potential strategic importance of the Grid. For example, under its new 'e-Science Programme', the Office of Science and Technology (OST) in the UK has allocated £120M for the deployment of e-Science Grids spanning a wide range of application areas and for the development, with industry, of the associated Grid middleware.

Fig. 8. Grid vision.

The Computational Grid is perhaps best envisaged as an infrastructure that integrates computing resources, data, knowledge, instruments and people. The construction of such an environment will enable the sharing of computational resources, data repositories and facilities in a routine way, just as the Web now allows us to
share information. In cartoon form, this Grid vision is depicted in Fig. 8.

There are many genuine Computer Science research challenges to be overcome before we can realize this vision. In the context of this paper, an obvious issue is the need for realistic performance estimation. Together with mechanisms for monitoring and accounting, reliable performance estimation will allow the creation of global marketplaces for Grid resources. As a starting point for a discussion of Grid performance estimation, it is worthwhile to review results from some recent meta-computing projects.

4.2. Meta-computing experiments

There have been numerous meta-computing projects involving performance engineering. Here we restrict our discussion to several EU-funded meta-computing projects: Promenvir [35], Toolshed [32] and HPC-VAO [33]. The main application focus of these projects is engineering design optimisation by simulation. In these simulations, the parameter space of key design parameters is explored to find a set of optimal values: simulations must be performed for every new set of parameter values. The applications were drawn from a variety of engineering domains, including satellite alignment analysis, surface accuracy analysis, reflector deployment, crash analysis, vibro-acoustic optimization and CFD computation. Optimisation by simulation is computationally expensive, and the user needs these simulations to execute in the shortest possible time, or within a set period or resource cost. There is a clear economic incentive to achieve efficient utilisation of the available resources with as little intervention as possible. Several of these meta-computing experiments utilized Europe-wide computing resources. An illustration of the results of the PROMENVIR project, performed by connecting up the resources of project partners across Europe, is given in Table 1. Table 1 contains statistics of the resource usage obtained by running a large-scale Monte Carlo simulation of satellite deployment. The program and associated data were small enough that each simulation could run on a single workstation or node: non-trivial parallelism of the application code was not required. In this simulation, a thousand-'shot' (parameter set) computational experiment was performed.

Table 1
WAN experiment statistics

Partner                     CPUs  In PVC  Available  Used  Shots failed  Shots successful  Shots total
Southampton (PAC)             15      15         15    14             1               150          151
Southampton University        10      10          9     6             0                40           40
Barcelona (UPC)               16      16         16    16             1               275          276
Stuttgart (RUS)               12      12         11     9             0               104          104
Madrid (CASA)                 15      15         15    14            15               184          199
Bilbao (CEIT)                 12      12         12     8             0                98           98
Torino (ItalDesign)           11      11          6     5             2                63           65
Torino (Blue Engineering)     11      11         11     7             6                61           67
Grand totals                 102     102         95    79            25               975         1000

Total CPUs installed: 102; defined in PVC: 102; available: 95; used in the WAN: 79.
Elapsed execution time: 4:39:16. Approximate single-CPU time: 250 hrs.
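A shot-based experiment of this kind maps naturally onto a master/worker pattern: idle machines pull the next parameter set from a central queue. The sketch below uses a thread pool to stand in for the PVC's machines, and an invented response function in place of the real satellite-deployment solver:

```python
import random
from concurrent.futures import ThreadPoolExecutor

def run_shot(params):
    """Stand-in for one Monte Carlo 'shot': a single simulation run on one
    sampled parameter set. The real codes were large solvers; this response
    function is invented for illustration."""
    misalignment, temperature = params
    return misalignment * 0.1 + temperature * 0.01

def farm_shots(n_shots, n_workers=8, seed=42):
    rng = random.Random(seed)
    shots = [(rng.uniform(0, 1), rng.uniform(-40, 60)) for _ in range(n_shots)]
    # Each idle worker ('machine' in the PVC) pulls the next shot; no
    # parallelism is needed inside a shot, as in the PROMENVIR experiment.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(run_shot, shots))
    return shots, results

shots, results = farm_shots(1000)
```

Because the shots are independent, throughput scales with the number of available machines, which is exactly why the experiment above could exploit nearly 100 processors with no change to the application code.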
Initially, all the machines listed were specified as comprising the Parallel Virtual Computer (PVC). During the actual run, however, some of them were either not available or not used due to the pre-existing high load on them. As can be seen, the experiment was very successful: it utilized nearly 100 processors and resulted in a very significant improvement in the exploration of the design space.

The key module of a distributed computing environment is the scheduler. This must perform task allocation to resources, run-time prediction and dynamic load balancing. Resource management decisions must be made using platform-independent application load (resource demand) models for CPU, memory size, I/O traffic and disk volume. In the above-mentioned meta-computing projects, such models were developed by benchmarking large industrial codes such as NASTRAN and Sysnoise. The process of obtaining load models and their use by the scheduler is presented in Fig. 9.

Fig. 9. Development and utilisation of platform independent load models.

The accuracy of performance predictions obtained by this technique is illustrated by the case study of a static analysis with NASTRAN [39]. It is important to stress that, as is commonly the case for commercial codes, the source code was not available for instrumentation. A series of 2D and 3D test problems was used for benchmarking on two different architectures: a distributed-memory IBM SP2 and a shared-memory SGI Power Challenge. During benchmarking, the run-time, memory, disk traffic and disk space parameters were measured. These measurements were then used for the development of analytical performance models. At the first stage, a machine-independent model of the application is derived. Figure 10 illustrates that the derived CPU-load models (number of floating point operations) for the SP2 and the Power Challenge show a close match, so that, as might be expected, the number of FPU operations is approximately the same in both cases.

Fig. 10. Derived CPU-loads of static analysis in NASTRAN for 2D-3D problems for PowerChallenge and SP2.

The derived load models are used for the development of an analytic expression that incorporates the key application parameters, such as the degrees of freedom, front size, number of extracted eigenvalues, etc. The accuracy of the analytical model of CPU-load for the NASTRAN static analysis code is illustrated in Fig. 11.
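The derivation of such an analytic expression can be sketched as a least-squares fit of a power law t = a * N^b in log-log space. The benchmark points below are invented, not the NASTRAN measurements of Figs 10 and 11:

```python
import math

def fit_power_law(points):
    """Fit t = a * N**b by linear least squares on (log N, log t)."""
    xs = [math.log(n) for n, t in points]
    ys = [math.log(t) for n, t in points]
    n = len(points)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = math.exp(my - b * mx)
    return a, b

# Invented benchmark runs: (problem size N, measured CPU load).
runs = [(100, 2.0e5), (1000, 2.0e7), (10000, 2.0e9), (100000, 2.0e11)]
a, b = fit_power_law(runs)
predict = lambda N: a * N ** b   # analytic load model usable by a scheduler
```

A real model would use several application parameters (degrees of freedom, front size, etc.) rather than a single size N, but the fitting principle is the same: a handful of benchmark runs determines the coefficients of a simple closed-form expression.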
Similar models to the one presented in Fig. 11 have been developed for I/O load, disk volume and memory size. The advantage of this approach is that it provides simple mathematical expressions that include the key parameters governing the performance of the application. The main drawback is that the development of these models requires substantial benchmarking effort and also some knowledge of the algorithm and application. Nevertheless, these experiments demonstrate the level of accuracy that can be obtained in an industrially relevant environment in which the source code of the application package is unavailable.

Fig. 11. CPU-load model for static analysis in NASTRAN.

4.3. Performance and the Grid

Performance estimation and forecasting will be vital ingredients of the future Grid environment, as has been emphasised in several US projects such as GrADS [6], AppLeS [45] and the Network Weather Service [48]. The GrADS project envisages a "performance contract" framework as the basis for a dynamic negotiation process between resource providers and consumers. The Network Weather Service monitors the available performance capacity of distributed resources, forecasts future performance levels using statistical forecasting models, and reports monitoring data to client schedulers, applications and visual interfaces. Such a service is important for the Grid environment, but it needs to be scalable, portable and secure. There are many open issues that need further investigation, such as the balance between the intrusiveness of sensors and the accuracy of measurements, fault diagnosis and adaptive sensors.

5. Concluding remarks

In this paper, various techniques used for performance engineering on parallel and distributed systems have been reviewed. We conclude with two remarks:
(1) The national and international levels of investment in Grid computing make it clear that performance estimation, modelling and measurement on the Grid will assume an increasingly important role in any future computational Grid economy. Over the last decade, we have seen a shift in the software industry towards an object-oriented, component-based software methodology. At present, although the programming interfaces and functionality of these components are exposed, there is no methodology for expressing performance trade-offs in the software development process. We therefore suggest that, in addition to specifying interfaces and functions, software methodologies need to incorporate some form of "performance metadata". Such metadata would contain information about the performance and resource requirements of software constructs and components. Only with the availability of such performance metadata will the construction of truly intelligent schedulers become possible. An internationally coordinated effort to define a common format for performance metadata seems long overdue.

(2) As we have seen, performance models range from simple algebraic models that attempt to identify a few key parameters to complex simulation models with many parameters, involving powerful mathematical techniques such as queuing theory. However, the key to realistic performance prediction lies in understanding the interaction between the application and the computer architecture. It is also important to note that in a typical industrial application, users will not have access to the source code of a software package or library routine. These requirements highlight the need for performance model abstractions that are relatively simple and easy to use, yet sufficiently accurate in their predictions to be useful as input to a scheduler or intelligent agent. Reliable performance estimation becomes even more relevant when we consider payment for services in a computational Grid economy.
Users will require answers to questions such as best value for money as well as guarantees for specified turn-around times. Finally, we have seen that there are many existing tools for performance monitoring, some of which have a non-negligible user community. When it comes to performance estimation, there are few tools and few users. Although the computer science community has
been researching performance for a long time, we believe that such research needs to become more systematic and scientific. A common approach to performance metadata together with a methodology that allows independent verification and validation of performance results would be a good start.
References

[1] N.K. Allsopp, T.P. Cooper and P. Ftakas, Porting Legacy Engineering Applications onto Distributed NT Systems, in: Proceedings of the 3rd USENIX Windows NT Symposium, Seattle, Washington, USA, July 12-15, 1999.
[2] B. Alpern, L. Carter, E. Feig and T. Selker, The Uniform Memory Hierarchy Model of Computation, Algorithmica 12(2/3) (1994), 72-109.
[3] C. Babbage, Passages from the Life of a Philosopher, Longman et al., London, 1864.
[4] D. Bailey, J. Barton, T. Lasinski and H. Simon, eds, The NAS Parallel Benchmarks, Technical Report RNR-91-02, NASA Ames Research Center, Moffett Field, CA 94035, January 1991.
[5] B.P. Miller, M.D. Callaghan, J.M. Cargille, J.K. Hollingsworth, R.B. Irvin, K.L. Karavanic, K. Kunchithapadam and T. Newhall, The Paradyn Parallel Performance Measurement Tools, IEEE Computer 28(11) (November 1995), 37-46.
[6] F. Berman, A. Chien, K. Cooper, J. Dongarra, I. Foster, D. Gannon, L. Johnsson, K. Kennedy, C. Kesselman, D. Reed, L. Torczon and R. Wolski, The GrADS Project: Software Support for High-Level Grid Application Development, http://www.hipersoft.rice.edu/grads/publications/tr/grads.project.pdf, February 15, 2000.
[7] R.G. Covington, S. Dwarkadas, J.R. Jump, J.B. Sinclair and S. Madala, Efficient Simulation of Parallel Computer Systems, International Journal in Computer Simulation 1 (1991), 31-58.
[8] CRAY Research, Introducing the MPP Apprentice Tool, CRAY Manual IN-2511, 1994.
[9] D. Culler, R. Karp and D. Patterson, LogP: Towards a Realistic Model for Parallel Computation, ACM SIGPLAN Notices 28(7) (1993), 1-12.
[10] A. Dunlop, Ph.D. thesis, University of Southampton, 1997.
[11] A.C. Dusseau, D.E. Culler, K.E. Schauser and R. Martin, Fast Parallel Sorting under LogP: Experience with the CM-5, IEEE Transactions on Parallel and Distributed Systems (August 1996).
[12] I. Foster and C. Kesselman, Globus: A Metacomputing Infrastructure Toolkit, International Journal of Supercomputer Applications 11(2) (1997), 115-128.
[13] I. Foster and C. Kesselman, eds, The Grid: Blueprint for a New Computing Infrastructure, Morgan Kaufmann, 1999.
[14] T. Hey, A. Dunlop and E. Hernandez, Realistic Parallel Performance Estimation, Parallel Computing 23 (1997), 5-21.
[15] R.W. Hockney, Performance parameters and benchmarking of supercomputers, in: Computer Benchmarks, J.J. Dongarra and W. Gentzsch, eds, Elsevier Science Publishers, Holland, 1993, pp. 41-63.
[16] R.W. Hockney and M. Berry, PARKBENCH Report: Public international benchmarks for parallel computing, Scientific Programming 3(2) (1994), 101-146.
[17] W. Hoschek, J. Jaen-Martinez, A. Samar, H. Stockinger and K. Stockinger, Data Management in an International Data Grid Project, in: Proc. 1st IEEE/ACM International Workshop on Grid Computing, Springer Verlag, 2000.
[18] http://www.netlib.org/parkbench/.
[19] http://hpc-journals.ecs.soton.ac.uk/PEMCS/.
[20] http://www-106.ibm.com/developerworks/xml/.
[21] http://www.xml.com/.
[22] http://www.cs.purdue.edu/research/cse/gasturbn/.
[23] http://ampano.cs.utah.edu/software/.
[24] http://wwwvis.informatik.uni-stuttgart.de/eng/research/proj/autobench/.
[25] http://www.ipg.nasa.gov/.
[26] http://www.ncsa.uiuc.edu/About/PACI/.
[27] http://www.griphyn.org/.
[28] http://www.ppdg.net/.
[29] http://www.hoise.com/primeur/01/articles/monthly/AE-PR04-01-15.html.
[30] http://www.neesgrid.org/.
[31] http://www.eurogrid.org/.
[32] http://www.cse.clrc.ac.uk/ActivityResources/16.
[33] http://www.beasy.com/projects/hipsid/pac.html.
[34] J. Labarta, S. Girona and T. Cortes, Analyzing scheduling policies using Dimemas, Parallel Computing 23(1-2) (1997), 23-34.
[35] J. Marczyk, Principles of Simulation-Based Computer-Aided Engineering, FIM Publications, Barcelona, 1999, 174 pp.
[36] W.F. McColl, BSP Programming, in: Proc. DIMACS Workshop on Specification of Parallel Algorithms, Princeton, 9-11 May 1994.
[37] W.F. McColl, Truly, madly, deeply parallel, New Scientist, February 1996, pp. 36-40.
[38] F.H. McMahon, The Livermore Fortran Kernels test of the numerical performance range, in: Performance Evaluation of Supercomputers (1988), 143-186.
[39] MSC/NASTRAN Quick Reference Guide, Version 70, The MacNeal-Schwendler Corporation, 1997.
[40] N. Mukherjee, G. Riley and J. Gurd, FINESSE: A Prototype Feedback-guided Performance Enhancement System, in: Proceedings of the 8th Euromicro Workshop on Parallel and Distributed Processing, Rhodes, Greece, IEEE Computer Society Press, January 19-21, 2000, pp. 101-109.
[41] W.E. Nagel, A. Arnold, M. Weber, H.-C. Hoppe and K. Solchenbach, VAMPIR: Visualization and Analysis of MPI Resources, Supercomputer 12(1) (1996), 69-80.
[42] D. Pease et al., PAWS: A Performance Evaluation Tool for Parallel Computing Systems, IEEE Computer (January 1991), 18-29.
[43] S.K. Reinhardt, M.D. Hill, J.R. Larus, A.R. Lebeck, J.C. Lewis and D.A. Wood, The Wisconsin Wind Tunnel: Virtual Prototyping of Parallel Computers, in: Proceedings of the 1993 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, May 1993, pp. 48-60.
[44] G. Riley, M. Bull and J. Gurd, Performance Improvement through Overhead Analysis: A Case Study in Molecular Dynamics, in: Proceedings of the 1997 International Conference on Supercomputing, ACM Press, 1997.
[45] N. Spring and R. Wolski, Application Level Scheduling of Gene Sequence Comparison on Metacomputers, in: Proceedings of the 12th ACM International Conference on Supercomputing, Melbourne, Australia, July 1998.
[46] L.G. Valiant, A bridging model for parallel computation, Communications of the ACM 33(8) (1990), 103-111.
[47] D.W. Walker, M. Li, O.F. Rana, M.S. Shields and Y. Huang, The Software Architecture of a Distributed Problem-Solving Environment, Concurrency: Practice and Experience 12(15) (2001), 1455-1480.
[48] R. Wolski, Forecasting Network Performance to Support Dynamic Scheduling Using the Network Weather Service, in: Proc. 6th IEEE Symp. on High Performance Distributed Computing, Portland, Oregon, 1997.
[49] Y. Zheng, N.P. Weatherill, E.A. Turner-Smith, M.I. Sotirakos, M.J. Marchant and O. Hassan, Visual Steering of Grid Generation in a Parallel Simulation User Environment, Chapter 27 in: Enabling Technologies for Computational Science: Frameworks, Middleware and Environments, The Kluwer International Series in Engineering and Computer Science, Vol. 548, Kluwer Academic Publishers, Boston, 2000, pp. 339-349.
Scientific Programming 10 (2002) 19-33 IOS Press
NINJA: Java for high performance numerical computing

Jose E. Moreira (a), Samuel P. Midkiff (a), Manish Gupta (a), Peng Wu (a), George Almasi (a) and Pedro Artigas (b)

(a) IBM Thomas J. Watson Research Center, Yorktown Heights, NY 10598-0218, USA. Tel: +1 914 945 3018; Fax: +1 914 945 4270; E-mail: {jmoreira,smidkiff,mgupta,pengwu,gheorghe}@us.ibm.com

(b) School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213-3891, USA. E-mail: [email protected]
Abstract: When Java was first introduced, there was a perception that its many benefits came at a significant performance cost. In the particularly performance-sensitive field of numerical computing, initial measurements indicated a hundred-fold performance disadvantage between Java and more established languages such as Fortran and C. Although much progress has been made, and Java now can be competitive with C/C++ in many important situations, significant performance challenges remain. Existing Java virtual machines are not yet capable of performing the advanced loop transformations and automatic parallelization that are now common in state-of-the-art Fortran compilers. Java also has difficulties in implementing complex arithmetic efficiently. These performance deficiencies can be attacked with a combination of class libraries (packages, in Java) that implement truly multidimensional arrays and complex numbers, and new compiler techniques that exploit the properties of these class libraries to enable other, more conventional, optimizations. Two compiler techniques, versioning and semantic expansion, can be leveraged to allow fully automatic optimization and parallelization of Java code. Our measurements with the NINJA prototype Java environment show that Java can be competitive in performance with highly optimized and tuned Fortran code.
1. Introduction

When Java(TM) was first introduced, there was a perception (properly founded at the time) that its many benefits, including portability, safety and ease of development, came at a significant performance cost. In few areas were the performance deficiencies of Java so blatant as in numerical computing. Our own measurements, with second-generation Java virtual machines, showed differences in performance of up to one hundred-fold relative to C or Fortran. The initial experiences with such poor performance caused many developers of high performance numerical applications to reject Java out of hand as a platform for their applications. The Java Grande Forum [11] was organized to facilitate cooperation and the dissemination of information among those researchers and application writers wanting to improve the usefulness of Java in these environments.

ISSN 1058-9244/02/$8.00 © 2002 - IOS Press. All rights reserved
J.E. Moreira et al. / NINJA: Java for high performance numerical computing

Fig. 1. Although Java performance on numerical computing has improved significantly in the past few years (a), that performance is inconsistent across platforms (b) and still not up to par with state-of-the-art C and Fortran compilers. (Data courtesy of Ron Boisvert and Roldan Pozo, of the National Institute of Standards and Technology.)

Much has changed since those early days. More attention to optimization techniques in the just-in-time (JIT) compilers of modern virtual machines has resulted in performance that can be competitive with popular C/C++ compilers [4]. Figure 1(a), with data from a study described in [4], shows the performance of a particular hardware platform (a 333 MHz Sun Sparc10) for different versions of the Java Virtual Machine (JVM). The results reported are the aggregate performance for the SciMark [16] benchmark. We note that performance has improved from 2 Mflops (with JVM version 1.1.6) to better than 30 Mflops (with JVM version 1.3). However, as Fig. 1(b), with data from the same study, shows, the performance of Java is highly dependent on the platform. Often, the better hardware platform does not have a virtual machine implementing the more advanced optimizations. Despite the rapid progress that has been made in the past few years, the performance of commercially available Java platforms is not yet on par with state-of-the-art Fortran and C compilers. Programs using complex arithmetic exhibit particularly bad performance [21]. Furthermore, current Java platforms are incapable of automatically applying important optimizations for numerical code, such as loop transformations and automatic parallelization [20]. Nevertheless, our thesis is that there are no technical barriers to high performance computing in Java. To prove this thesis, we have developed a prototype Java environment, called Numerically INtensive JAva (NINJA), which has demonstrated that Fortran-like performance can be obtained by Java on a variety of problems. We have successfully addressed issues such as dense and irregular matrix computations, calculations with complex numbers, automatic loop transformations, and automatic parallelization. Moreover, our techniques are straightforward to implement, and allow reuse of existing optimization components already deployed by software vendors for other languages [17], lowering the economic barriers to Java's acceptance. The primary goal of this paper is to convince virtual machine and application developers alike that Java can deliver both on the software engineering and performance fronts. The technology is available to make Java perform as well for numerical computing as highly tuned Fortran or C code. Once it is accepted that Java performance is only an artifact of particular implementations of Java, and that there are no technical barriers to Java achieving excellent numerical performance, our techniques will allow vendors and researchers to
quickly deliver high performance Java platforms to program developers.

The rest of this paper is organized as follows. Section 2 describes the main sources of difficulties in optimizing Java performance for numerical computing. Section 3 covers the solutions that we have developed to overcome those difficulties. Section 4 discusses how those solutions were implemented in our prototype Java environment and provides various results that validate our approach to delivering high performance in numerical computing with Java. Finally, Section 5 presents our conclusions. Two appendices provide further detail on technologies of importance to numerical computing in Java: Appendix A gives the flavor of a multidimensional array package and Appendix B discusses a library for numerical linear algebra.

A note about the examples in this paper: the Java compilation model involves a Java source code to Java bytecode translation step, with the resulting bytecode typically compiled into native (machine) code using a dynamic (i.e., just-in-time) compiler. The NINJA compiler performs its optimizations during this bytecode to machine code compilation step, but we present our examples using source code for readability.
2. Java performance difficulties

Among the many difficulties associated with optimizing numerical code in Java, we identify three characteristics of the language that are, in a way, unique: (i)
exception checks for null-pointer and out-of-bounds array accesses, combined with a precise exception model, (ii) the lack of regular-shaped arrays, and (iii) weak support for complex numbers and other arithmetic systems. We discuss each of these in more detail.

2.1. The Java exception model

Java requires all array accesses to be checked for null-pointer dereferencing and out-of-bounds indices. An exception must be thrown if either violation happens. Furthermore, the precise exception model of Java states that when the execution of a piece of code throws an exception, all the effects of instructions prior to the exception must be visible, and no effect of instructions after the exception should be visible [8]. This has a negative impact on performance in two ways: (i) checking the validity of array references contributes to runtime overhead, and (ii) code reordering in general, and loop iteration reordering in particular, is prohibited, thus preventing almost all optimizations for numerical codes. The first of these problems can be alleviated by aggressive hardware support that masks the direct cost of the tests. The second problem is more serious and requires compiler support.

2.2. Arrays in Java

Unlike Fortran and C, Java has no direct support for truly rectangular multidimensional arrays. Java allows some simulation of multidimensional arrays through arrays of arrays, but that is not an ideal solution. Arrays of arrays have two major problems. First, arrays of arrays are not necessarily rectangular. Determining the shape of an array of arrays is, in general, an expensive runtime operation. Even worse, the shape of an array of arrays can change during computation. Figure 2(a) shows an array of arrays being used to simulate a rectangular two-dimensional array. In this case, all rows have the same length. However, arrays of arrays can be used to construct far more complicated structures, as shown in Fig. 2(b).
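The shape problem is easy to reproduce in plain Java. The following is a minimal sketch of our own (the class and method names are illustrative, not part of any package described in this paper); it shows that an array of arrays can be ragged, that rows can be shared, and that computing even the minimum row length forces a scan of every row at run time:

```java
public class RaggedDemo {
    // Scans every row to find the minimum length -- the expensive
    // runtime "shape" computation described in the text.
    static int minRowLength(double[][] x) {
        int min = Integer.MAX_VALUE;
        for (double[] row : x) min = Math.min(min, row.length);
        return min;
    }

    public static void main(String[] args) {
        // An array of arrays need not be rectangular: each row is a
        // separate object with its own length.
        double[][] x = new double[4][];
        x[0] = new double[5];
        x[1] = x[0];          // rows can be shared: intra-array aliasing
        x[2] = new double[2];
        x[3] = new double[7];
        System.out.println(minRowLength(x));   // prints 2
    }
}
```

A compiler that sees only the declaration double[][] cannot rule out any of these irregularities without an expensive analysis.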
We note that such structures, even if unusual for numerical codes, may be natural for other kinds of applications. When a compiler is processing a Java program, it must assume the most general case for an array of arrays unless it can prove that a simpler structure exists. Determining rectangularity of an array of arrays is a difficult compiler analysis problem, bound to fail in many cases. One could advocate the use of pragmas to help identify rectangular arrays. However, to maintain the overall safety
of Java, a virtual machine must not rely on pragmas that it cannot independently verify, and we are back to the compiler analysis problem. It would be much simpler to have data structures that make this property explicit, such as the rectangular two-dimensional arrays of Fig. 2(c). Knowing the shape of a multidimensional array is necessary to enable some key optimizations that we discuss below. As can be seen in Fig. 2(b), the only way to determine the minimum length of a row is to examine all rows. In contrast, determining the size of a true rectangular array, as shown in Fig. 2(c), only requires looking at a small number of parameters. Second, arrays of arrays may have complicated aliasing patterns, with both intra- and inter-array aliasing. Again, alias disambiguation - that is, determining when storage locations are not aliased - is a key enabler of various optimization techniques, such as loop transformations and loop parallelization, which are so important for numerical codes. The aliasing problem is illustrated in Fig. 2. For the arrays of arrays shown in Fig. 2(b), two different arrays can share rows, leading to inter-array aliasing. In particular, row 4 of array X and row 3 of array Y refer to the same storage, but with two different names. Furthermore, intra-array aliasing is possible, as demonstrated by rows 0 and 1 of array X. For the true multidimensional arrays shown in Fig. 2(c) (Z and T), alias analysis is easier. There can be no intra-array aliasing for true multidimensional arrays, and inter-array aliasing can be determined with simpler tests [20].

2.3. Complex numbers in Java

From a numerical perspective, Java only has direct support for real numbers. Fortran has direct support for complex numbers also. For even more versatility, both Fortran and C++ provide the means for efficiently supporting other arithmetic systems.
Fig. 2. Examples of (a) array of arrays simulating a two-dimensional array, (b) array of arrays in a more irregular structure, and (c) rectangular two-dimensional array.

Efficient support for complex numbers and other arithmetic systems in Fortran and C++ comes from the ability to represent low-cost data structures that can be efficiently allocated on the stack or in registers. Java, in contrast, represents any non-primitive data type as a full-fledged object. Complex numbers are typically implemented as objects of a class Complex, and every time an arithmetic operation generates a new complex value, a new Complex object has to be allocated. That is true even if the value is just a temporary, intermediate result. We note that an array of n complex numbers requires the creation of n objects of type Complex, further complicating alias analysis and putting more pressure on the memory allocation and garbage collection system. We have observed the largest differences in performance between Java and Fortran when executing code that manipulates arrays of complex numbers. Because Complex objects are created to hold the result of each arithmetic operation, almost all of the execution time of an application with complex numbers is spent creating and garbage collecting Complex objects used to hold intermediate values. In that case, even modern virtual machines may perform a hundred times slower than equivalent Fortran code. The three difficulties described above are at the core of the performance deficiencies of Java. They prevent the application of mature compiler optimization technology to Java and, thus, prevent it from being truly competitive with more established languages such as Fortran and C. We next describe our approach to eliminating these difficulties, and we will show that, with the proper technology, the performance of Java numerical code can be as good as with any other language.
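The object-creation overhead described above is easy to reproduce with a naive complex class. The sketch below is our own illustration (not the Array package class discussed later); the point is that every arithmetic method allocates a fresh heap object, so a simple loop over complex values generates several short-lived objects per iteration:

```java
public class ComplexCost {
    static final class Complex {
        final double re, im;
        Complex(double re, double im) { this.re = re; this.im = im; }
        // Each arithmetic method allocates a new object, even when the
        // result is only a short-lived temporary.
        Complex plus(Complex z)  { return new Complex(re + z.re, im + z.im); }
        Complex times(Complex z) {
            return new Complex(re * z.re - im * z.im, im * z.re + re * z.im);
        }
    }

    // Accumulates a * x over n iterations. Each iteration allocates
    // three Complex objects (the operand, the product, the new sum),
    // so the loop is dominated by allocation and garbage collection,
    // not arithmetic.
    static double sumReal(int n) {
        Complex a = new Complex(2.0, 0.0);
        Complex sum = new Complex(0.0, 0.0);
        for (int k = 0; k < n; k++) {
            sum = sum.plus(a.times(new Complex(1.0, 0.0)));
        }
        return sum.re;
    }

    public static void main(String[] args) {
        System.out.println(sumReal(1000));   // prints 2000.0
    }
}
```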
3. Java performance solutions

Our research showed that the performance difficulties of Java could be solved by a careful combination of language and compiler techniques. We developed new class libraries that "enrich" the language with some important constructs for numerical computing. Our compiler techniques take advantage of these new constructs to perform automatic optimizations. Above all, we were able to overcome the Java performance problems mentioned earlier while maintaining full portability of Java across all virtual machines. The performance results on a particular virtual machine, however, depend on the extent to which that virtual machine (more precisely, its Java bytecode to machine code compiler) implements the automatic optimizations we describe below.
3.1. The Array package and semantic expansion

To attack the absence of truly multidimensional arrays in Java, we have defined an Array package with multidimensional arrays (denoted in this text as Arrays, with a capital A) of various types and ranks (e.g., doubleArray2D, ComplexArray3D, ObjectArray1D). This Array package introduces true multidimensional arrays in Java through a class library. (See Appendix A, The Array package for Java, for further discussion.) Element accessor methods (get and set methods for individual array elements), sectioning operations, gather and scatter operations, and basic linear algebra subroutines (BLAS) are some of the operations defined for the Array data types. By construction, the Arrays have an immutable rectangular and dense shape, which simplifies testing for aliases and facilitates the optimization of runtime checks. The Array classes are written in fully compliant Java code, and can be run on any JVM. This ensures that programs written using the Array package are portable. When Array elements are accessed via the get and set element operations, each element access is encumbered by the overhead of a method invocation, which is unacceptable for high performance computing. This problem is avoided by a compiler technique known as semantic expansion. In semantic expansion, the compiler looks for specific method calls, and substitutes efficient code for the call. This allows programs using the Array package to achieve high performance when executed on JVMs that recognize the Array package methods. As an example, consider the operation of computing C(i,j) = A(i,j) + B(j,i) for all elements of n x n Arrays A, B, and C. The code for that operation would look something like:

doubleArray2D A, B, C;
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        C.set(i, j, A.get(i, j) + B.get(j, i));
    }
}
which requires three method calls (two gets and one set) in every loop iteration. If the compiler knows that A, B, and C are multidimensional arrays, it can generate code that directly accesses the elements of the Arrays, much like a Fortran compiler generates code for the source fragment

do i = 1, n
   do j = 1, n
      C(i,j) = A(i,j) + B(j,i)
   end do
end do
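The effect of the expansion can be imitated by hand in plain Java with a flat backing array. The sketch below is our own illustration, not part of the Array package: it shows the dense, contiguous layout that a rectangular Array makes explicit, and the direct index arithmetic that semantic expansion substitutes for the get/set calls:

```java
public class Flat2D {
    // A rectangular 2-D array stored in one contiguous 1-D block --
    // the dense, immutable shape a doubleArray2D guarantees.
    final int rows, cols;
    final double[] data;

    Flat2D(int rows, int cols) {
        this.rows = rows;
        this.cols = cols;
        this.data = new double[rows * cols];
    }

    double get(int i, int j)           { return data[i * cols + j]; }
    void   set(int i, int j, double v) { data[i * cols + j] = v; }

    // C(i,j) = A(i,j) + B(j,i): after semantic expansion, the get/set
    // calls below become direct indexed loads and stores.
    static Flat2D addTransposed(Flat2D a, Flat2D b, int n) {
        Flat2D c = new Flat2D(n, n);
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                c.set(i, j, a.get(i, j) + b.get(j, i));
        return c;
    }

    public static void main(String[] args) {
        int n = 3;
        Flat2D a = new Flat2D(n, n), b = new Flat2D(n, n);
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) { a.set(i, j, i + j); b.set(i, j, i - j); }
        System.out.println(addTransposed(a, b, n).get(2, 1));   // prints 2.0
    }
}
```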
Note that this is different from the important, but more conventional, optimization of inlining. The compiler does not replace the invocation of get and set by their library code. Instead, the compiler knows about them: it knows the semantics of the classes and of the methods. Semantic expansion is an escape mechanism for efficiently extending a programming language through standard class libraries.

3.2. The complex class and semantic expansion

A complex number class is also defined as part of the Array package, along with methods implementing arithmetic operations on complex numbers. (See Fig. 3.) Again, semantic expansion is used to convert calls to these methods into code that uses a value-object version of Complex objects (containing only the primitive values, not the full Java object representation). Figure 3 illustrates the differences between value-objects and regular objects. A value-object version of Complex contains only fields for the real and imaginary parts of the complex number represented, as shown in Fig. 3(b). It is akin to a C struct, and can be easily allocated on the stack and even in registers. For Complex to behave as a true Java object, a different representation is necessary, shown in Fig. 3(c). In particular, every Java object requires an object header, which can represent a significant fraction of the object size. (For example, a Complex object with double-precision real and imaginary parts occupies 32 bytes in modern virtual machines, even though only 16 bytes are dedicated to the numerical fields.) Even worse is the overhead of creating and destroying objects, which typically are allocated on the heap. Any computation involving the arithmetic methods can be semantically expanded to use complex values. Conversion to Complex objects is done in a lazy manner, upon encountering a method or primitive operation that truly requires object-oriented functionality. Thus, the programmer continues to treat complex numbers as
objects (maintaining the clean semantics of the original language), while our compiler transparently transforms them into value-objects for efficiency. We illustrate those concepts with an example. Consider the computation of y(i) = a * x(i) for all n elements of arrays x and y of complex numbers. This operation would typically be coded as

ComplexArray1D x, y;
Complex a;
for (i = 0; i < n; i++) { y.set(i, a.times(x.get(i))); }
A straightforward execution of this code would require the creation of 2n temporary objects. For every iteration, an object has to be created to represent x(i). A second object is created to hold the result of a * x(i). The cost of creating and destroying these objects completely dominates execution. If the compiler knows the semantics of Complex and ComplexArrays, it can replace the method calls by code that simply manipulates values. Only the values of the real and imaginary parts of x(i) are generated by x.get(i). Only the values of the real and imaginary parts of a * x(i) are computed by a.times(x.get(i)). Finally, those values are used to update y(i). As a result, the object code generated would not be significantly different from that produced by a Fortran compiler for the source fragment

complex*16 x(n), y(n)
complex*16 a
do i = 1, n
   y(i) = a * x(i)
end do
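The value form that semantic expansion reduces this loop to can be written out by hand. The sketch below is our own illustration (the class and method names are ours): the complex values live in split re/im arrays of primitives, and no Complex object is created inside the loop:

```java
public class ComplexScale {
    // y(i) = a * x(i) on split re/im arrays: the expanded, value-only
    // form of the complex loop -- no Complex objects are allocated.
    static void scale(double are, double aim,
                      double[] xre, double[] xim,
                      double[] yre, double[] yim) {
        for (int i = 0; i < xre.length; i++) {
            yre[i] = are * xre[i] - aim * xim[i];
            yim[i] = aim * xre[i] + are * xim[i];
        }
    }

    public static void main(String[] args) {
        int n = 4;
        double[] xre = new double[n], xim = new double[n];
        double[] yre = new double[n], yim = new double[n];
        for (int i = 0; i < n; i++) xre[i] = i + 1;   // x = 1, 2, 3, 4 (real)
        scale(0.0, 1.0, xre, xim, yre, yim);          // a = i (imaginary unit)
        System.out.println(yre[2] + " " + yim[2]);    // prints 0.0 3.0
    }
}
```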
3.3. Versioning for safe and alias-free regions

For Java programs written with the Array package, the compiler can perform simple transformations that eliminate the performance problems caused by Java's precise exception model. The idea is to create regions of code that are guaranteed to be free of exceptions. Once these exception-free (also called safe) regions have been created, the compiler can apply traditional code-reordering optimizations, constrained only by data and control dependences [20]. The safe regions are created by versioning of loop nests. For each optimized loop nest, the compiler creates two versions - safe and unsafe - guarded by a runtime test. This runtime test establishes whether all Arrays in the loop nest are valid (not null), and whether all the indexing operations
public final class Complex {
    private double re, im;

    public Complex minus(Complex z) {
        return new Complex(re - z.re, im - z.im);
    }

    public Complex times(Complex z) {
        return new Complex(re*z.re - im*z.im, im*z.re + re*z.im);
    }
}

(a) partial code for the Complex class

(b) Complex value-object representation: fields re (0.0) and im (0.0) only

(c) Complex object representation: descriptor (object header) followed by fields re (0.0) and im (0.0)

Fig. 3. A Java class for complex numbers.
inside the loop will generate in-bound accesses. If the tests pass, the safe version of the loop is executed. If not, the unsafe version is executed. Since the safe version cannot throw an exception, explicit runtime checks can be omitted from the code. We take the versioning approach a step further. Application of automatic loop transformation (and parallelization) techniques by a compiler requires, in general, alias disambiguation among the various arrays referenced in a loop nest. We rely on a key property of Java that two object references (the only kind of pointers allowed in Java) must either point to identical or completely non-overlapping objects. Use of the Array package facilitates checking for aliasing by representing a multidimensional array as a single object. Therefore, we can further specialize the safe version of a loop nest into two variants: (i) one in which all multidimensional arrays are guaranteed to be distinct (no aliasing), and (ii) one in which there may be aliasing between arrays. The safe and alias-free version is the perfect target for compiler optimizations. The mature loop optimization techniques, including loop parallelization, that have been developed for Fortran and C programs can be easily applied to the safe and alias-free region.
We note that the "no aliasing" property between two Arrays is invariant to garbage collection activity. Garbage collection may remove aliasing, but it will never introduce it. Therefore, it is enough to verify once that two Arrays are not aliased to each other. We have to make sure, however, that there are no assignments to Array references (e.g., A = B) in a safe and alias-free region, as that can introduce new aliasing. Assignments to the elements of an Array (e.g., A[i] = B[j]) never introduce aliasing. An example of the versioning transformation to create safe and alias-free regions is shown in Fig. 4. Figure 4(a) illustrates the original code for computing A[i] = F(B[i+1]) for n-element arrays A and B. Figure 4(b) explicitly shows all null-pointer and array-bounds runtime checks that are performed when the code is executed by a Java virtual machine. The check chknull(A) verifies that Array reference A is not a null pointer, whereas check chkbounds(i) verifies that the index i is valid for the corresponding Array. Figure 4(c) illustrates the versioned code. A simple test for the values of the A and B pointers and a comparison between loop bounds and array extents can determine if the loop will be free of exceptions or not. If
for (i = 0; i < n; i++) {
    A[i] = F(B[i + 1])
}

(a) original code

for (i = 0; i < n; i++) {
    /* code for A[i] = F(B[i + 1]) with explicit checks */
    chknull(A)[chkbounds(i)] = F(chknull(B)[chkbounds(i + 1)])
}

(b) original code with explicit runtime checks

if ((A != null) && (B != null) && (n - 1 < A.length) && (n < B.length)) {
    /* This region is free of exceptions */
    if (A != B) {
        /* This region is free of aliases */
        for (i = 0; i < n; i++) {
            A'[i] = F(B'[i + 1])
        }
    } else {
        /* This region may have aliases */
        for (i = 0; i < n; i++) {
            A[i] = F(B[i + 1])
        }
    }
} else {
    /* This region may have exceptions and aliases */
    for (i = 0; i < n; i++) {
        chknull(A)[chkbounds(i)] = F(chknull(B)[chkbounds(i + 1)])
    }
}

(c) code after safe and alias-free region creation

Fig. 4. Creation of safe and alias-free regions.
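The versioning test itself is cheap enough to write out by hand in plain Java. The following is our own sketch of the transformation (the names f, compute, and the alias test are ours; NINJA performs this rewriting internally, on bytecode):

```java
public class VersionedCopy {
    static double f(double v) { return 2.0 * v; }   // stand-in for F

    // Hand-written version of the Fig. 4 transformation: one cheap
    // up-front test selects the safe, alias-free fast path.
    static void compute(double[] a, double[] b, int n) {
        if (a != null && b != null && n - 1 < a.length && n < b.length && a != b) {
            // Safe and alias-free version: a compiler may omit the
            // implicit null/bounds checks here and freely reorder or
            // parallelize the loop.
            for (int i = 0; i < n; i++) a[i] = f(b[i + 1]);
        } else {
            // Fallback version: the JVM's implicit checks remain in force.
            for (int i = 0; i < n; i++) a[i] = f(b[i + 1]);
        }
    }

    public static void main(String[] args) {
        double[] a = new double[3];
        double[] b = {1.0, 2.0, 3.0, 4.0};
        compute(a, b, 3);
        System.out.println(a[0] + " " + a[2]);   // prints 4.0 8.0
    }
}
```

Written at the source level, both branches look identical; the payoff comes from what the compiler is allowed to do inside the first one.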
the test passes, then the safe region is executed. Note that the array references in the safe region do not need any explicit checks. The array references in the unsafe region, executed if the test fails, still need all the runtime checks. One more comparison is used to disambiguate between the storage areas for arrays A and B. A successful disambiguation will cause execution of the alias-free version. Otherwise, the version with potential aliases must be executed. At first, there seems to be no difference between the alias-free version and the version with potential aliases. However, the compiler internally annotates the symbols in the alias-free region as not being aliased with each other. We denote these new, alias-free symbols by A' and B'. This information is later used to enable the various loop transformations. We note that the representation shown in Fig. 4(c) only exists as a compiler-internal intermediate representation, after the versioning is automatically performed and before object code is generated. Neither the Java language nor the Java bytecode can directly represent that information. The concepts illustrated by the example of Fig. 4 can be extended to loop nests of arbitrary depth operating on multidimensional arrays. The tests for safety and
aliasing are much simpler (and cheaper) if the arrays are known to be truly multidimensional (rectangular), as in Fig. 2(c). The Arrays from the Array package have this property.

3.4. Libraries for numerical computing

Optimized libraries are an important vehicle for achieving high performance in numerical applications. In particular, libraries provide the means for delivering parallelism transparently to the application programmer. There are two main trends in the development of high-performance numerical libraries for Java. In one approach, existing native libraries are made available to Java programmers through the Java Native Interface (JNI) [5]. In the other approach, new libraries are developed entirely in Java [3]. Both approaches have their merits, with the right choice depending on the specific goals and constraints of an application. Using existing native libraries through JNI is very appealing. First, it provides access to a large body of existing code. Second, that code has already been debugged and its performance tuned by previous programmers. Third, in many cases (e.g., BLAS, MPI, LAPACK, ...) the same native library is available for a variety of platforms, properly tuned by the vendor of each platform. However, using libraries that are themselves written in Java also has its advantages. First, those libraries are truly portable, and one does not have to worry about idiosyncrasies that typically occur in versions of a native library for different platforms, such as maintaining Java floating-point semantics. Second, Java libraries typically fit better with Java applications. One does not have to worry about parameter translation and data representations that can cause performance problems and/or unexpected behavior. Third, and perhaps most importantly, by writing the libraries in Java, the more advanced optimization and programming techniques that are being developed, and will be developed, for Java will be exploited in the future without the additional work of performing another port. The discussion in Appendix B describes one technique, easier to implement with Java, that can lead to improved performance. The Array package itself is a library for numerical computing. In addition to focusing on properties that enable compiler optimizations, we also designed the Array package so that most operations could be performed in parallel. We have implemented a version of the Array package which uses multiple Java threads to exploit multiprocessor parallelism inside some key methods. This is a convenient approach for the application developer. The application code itself can be kept sequential, and parallelism is exploited transparently inside the methods of the Array package. We report results with this approach in the next section. For further information on additional library support for numerical computing in Java, see Appendix B, Numerical linear algebra in Java.

3.5. A comment on our optimization approaches

We want to close this section by emphasizing that the class libraries and compiler optimizations that we presented are strictly Java compliant. They do not require any changes to the base language or the virtual machines, and they do not change existing semantics. The Array and complex classes are just tools for developing numerical applications in a style that is familiar to scientific and technical programmers. The compiler optimizations (versioning and semantic expansion) are exactly that: optimizations that can improve the performance of code significantly (by orders of magnitude, as we will see in the next section) without changing the observed behavior.
4. Implementation and results

We have implemented our ideas in the NINJA prototype Java environment, based on the IBM XL family of compilers. Figure 5 shows the high-level organization of these compilers. The front-ends for different languages transform programs to a common intermediate representation called W-Code. The Toronto Portable Optimizer (TPO) is a W-Code to W-Code transformer which performs classical optimizations, like constant propagation and dead code elimination, and also high-level loop transformations based on aggressive dataflow analysis. TPO can also perform both directive-assisted and automatic parallelization of loops and other constructs. Finally, the transformed W-Code is converted into optimized machine code by an architecture-specific back-end. The particular compilation path for Java programs is illustrated in the top half of Fig. 5. Java source code is compiled by a conventional Java compiler (e.g., javac) into bytecode for the Java Virtual Machine. We then use the IBM High Performance Compiler for Java [19] (HPCJ) to statically translate bytecode into W-Code. In other words, HPCJ plays the role of a front-end for bytecode. Once W-Code for Java is generated, it follows the same path through TPO and the back-ends as W-Code generated from other source languages. Semantic expansion of the Array package methods [2] is implemented within HPCJ, as it is Java specific. Safe region creation and alias versioning have been implemented in TPO, and those techniques can be applied to W-Code from any other language. We note that the use of a static compiler (HPCJ) represents a particular implementation choice. In principle, nothing prevents the techniques described in this article from being used in a dynamic compiler. Moreover, by using the quasi-static dynamic compilation model [18], the more expensive optimization and analysis techniques employed by TPO can be done off-line, sharply reducing the impact of compilation overhead.
We should also mention that our particular implementation is based on IBM products for the RS/6000 family of machines and the AIX operating system. However, the organization of our implementation is representative of typical high-performance compilers [15] and it is adopted by other vendors. Obviously, a reimplementation effort is necessary for each different platform, but the approach we followed serves as a template for delivering high-performance solutions for Java. We used a suite of eight real and five complex arithmetic benchmarks to evaluate the performance impact
[Figure 5 (diagram): front-ends (including HPCJ, which consumes Java source compiled to bytecode) produce W-Code; TPO performs the portable optimizations on W-Code; back-ends (TOBEY for POWER/PowerPC code, others for other targets) turn W-Code into machine code.]

Fig. 5. Architecture of the IBM XL compilers.
of our techniques. We also applied our techniques to a production data mining application. These benchmarks and the data mining application are described further in [2,13,14]. The effectiveness of our techniques was assessed by comparing the performance produced by the NINJA compiler with that of the IBM Development Kit for Java version 1.1.6 and the IBM XLF Fortran compiler on a variety of platforms.

4.1. Sequential execution results

The eight real arithmetic benchmarks are matmul (matrix multiply), microdc (electrostatic potential computation), lu (LU factorization), cholesky (Cholesky factorization), shallow (shallow water simulation), bsom (neural network training), tomcatv (mesh generation and solver), and fft (FFT with explicit real arithmetic). Results for these benchmarks, when running in strictly sequential (single-threaded) mode, are summarized in Fig. 6(a). Measurements were made on an RS/6000 model 260 machine, with a 200 MHz POWER3 processor. The height of each bar is proportional to the best Fortran performance achieved in the corresponding benchmark. The numbers at the top of the bars indicate actual Mflops. For the Java 1.1.6 version, arrays are implemented as double[][]. The NINJA version uses doubleArray2D Arrays from the Array package and semantic expansion. For six of the benchmarks (matmul, microdc, lu, cholesky, bsom, and shallow) the performance of the Java version (with the Array package and our compiler) is 80% or more of the performance of the Fortran version. This high performance is due to well-known loop transformations, enabled by our techniques, which enhance data locality. The Java version of tomcatv performs poorly because one of the outer loops in the program is not covered by a safe region. Therefore, no further loop transformations can be applied to this particular loop. The performance of fft is significantly lower than its Fortran counterpart because our Java implementation does not use interprocedural analysis, which has a big impact on the optimization of the Fortran code.

4.2. Results for complex arithmetic benchmarks

The five complex benchmarks are matmul (matrix multiply), microac (electrodynamic potential computation), lu (LU factorization), fft (FFT with complex arithmetic), and cfd (two-dimensional convolution). Results for these benchmarks are summarized in Fig. 6(b). Measurements were made on an RS/6000 model 590 machine, with a 67 MHz POWER2 processor. Again, the height of each bar is proportional to the best Fortran performance achieved in the corresponding benchmark, and the numbers at the top of the bars indicate actual Mflops. For the Java 1.1.6 version, complex arrays are represented using a Complex[][] array of Complex objects. No semantic expansion was applied. The NINJA version uses ComplexArray2D Arrays from the Array package and semantic expansion. In all cases we observe significant performance improvements between the Java 1.1.6 and NINJA versions. Improvements range from a factor of 35 (1.7 to 60.5 Mflops for cfd) to a factor of 75 (1.2 to 89.5 Mflops for matmul). We achieve Java performance that ranges from 55% (microac) to 85% (fft and cfd) of fully optimized Fortran code.
J.E. Moreira et al. /NINJA: Java for high performance numerical computing
Fig. 6. Performance results of applying our Java optimization techniques to various cases.
4.3. Parallel execution results

Loop parallelization is another important transformation enabled by safe region creation and alias versioning. We report speedup results from applying loop parallelization to our eight real arithmetic Java benchmarks. All experiments were conducted using the Array package version of the benchmarks, compiled with our prototype compiler with automatic parallelization enabled. Speedup results, relative to the single-processor performance of the parallel code optimized with NINJA, are shown in Fig. 6(c). Measurements were made on a machine with four 200 MHz POWER3 processors. The compiler was able to parallelize some loops in each of the eight benchmarks. Significant
speedups were obtained (better than 50% efficiency on 4 processors) in six of those benchmarks (matmul, microdc, lu, shallow, bsom, and fft).

4.4. Results for parallel libraries

We further demonstrate the effectiveness of our solutions by applying NINJA to a production data mining code [14]. In this case, we use a parallel version of the Array package which uses multithreading to exploit parallelism within the Array operations. We note that the user application is a strictly sequential code, and that all parallelism is exploited transparently to the application programmer. Results are shown in Fig. 6(d). Measurements were made with an RS/6000 model F50
machine, with four 332 MHz PowerPC 604e processors. The conventional (Java arrays) version of the application achieves only 26 Mflops, compared to 120 Mflops for the Fortran version. The single-processor Java version with the Array package (bar Array x 1) achieves 109 Mflops. Furthermore, when run on a multiprocessor, the performance of the Array package version scales with the number of processors (bars Array x 2, Array x 3, and Array x 4 for execution on 2, 3, and 4 processors, respectively), achieving almost 300 Mflops on 4 processors.
5. Conclusions

Our results show that there are no serious technical impediments to the adoption of Java as a major language for numerically intensive computing. The techniques we have presented are simple to implement and allow existing compiler optimizers to be exploited. The Java-specific optimizations are relatively simple, and most of the benefits accrue from leveraging well-understood, language-independent optimizations that are already implemented in current compilers. Moreover, Java has many features, like simpler pointers and flexibility in choosing object layouts, which facilitate the application of the optimization techniques we have developed.

The impediments to high-performance computing in Java are instead economic and social: an unwillingness on the part of vendors of Java compilers to commit the resources to develop product-quality compilers for technical computing; the reluctance of application developers to make the transition to new languages for developing new codes; and, finally, the widespread belief that Java is simply not suited for technical computing. The consequences of this situation are severe: a large pool of programmers is being underutilized, and millions of lines of code are being developed using programming languages that are inherently more difficult and less safe to use than Java. The maintenance of these programs will be a burden on scientists and application developers for decades.

We have already engaged with companies that are interested in doing numerical computing in Java, which represents a first step towards wider adoption of Java in that field. Java already has a strong user base in commercial computing. For example, IBM's Websphere suite is centered around Java and is widely used in the industry. However, the characteristics of the commercial computing market are significantly different,
in both size and requirements, from the technical computing market. It is our hope that the concepts and results presented in this paper will help overcome the difficulties of establishing Java as a viable platform for numerical computing and accelerate the acceptance of Java, positively impacting the technical computing community in the same way that Java has impacted the commercial computing community.

Appendix A. The Array package for Java

The Array package for Java (provisionally named com.ibm.math.array) provides the functionality and performance associated with true multidimensional arrays. The difference between arrays of arrays, directly supported by the Java Programming Language and Java Virtual Machine, and true multidimensional arrays is illustrated in Fig. 2. Multidimensional arrays (Arrays) are rectangular collections of elements characterized by three immutable properties: type, rank, and shape. The type of an Array is the type of its elements (e.g., int, double, or Complex). The rank (or dimensionality) of an Array is its number of axes. For example, the Arrays in Fig. 2 are two-dimensional. The shape of an Array is determined by the extent of its axes. The dense and rectangular shape of Arrays facilitates the application of automatic compiler optimizations. Figure 7 illustrates the class hierarchy for the Array package. The root of the hierarchy is an Array abstract class (not to be confused with the Array package). From the Array class we derive type-specific abstract classes. The leaves of the hierarchy correspond to final concrete classes, each implementing an Array of specific type and rank. For example, doubleArray2D is a two-dimensional Array of double-precision floating-point numbers. The shape of an Array is defined at object creation time. For example, intArray3D A = new intArray3D(m,n,p); creates an m x n x p three-dimensional Array of integer numbers.
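To make the contrast with Java's arrays of arrays concrete, the following is a minimal sketch of what such a true two-dimensional Array class could look like. The class name, accessor methods, and row-major dense storage shown here are illustrative assumptions, not the Array package's actual implementation.

```java
// Sketch of a true two-dimensional Array: immutable shape, dense
// contiguous storage. Unlike double[][], every row is guaranteed the
// same length and elements are contiguous, which is what enables
// Fortran-style loop optimizations.
final class DoubleArray2D {
    private final double[] data;   // dense, contiguous storage
    private final int rows, cols;  // immutable shape

    public DoubleArray2D(int rows, int cols) {
        this.rows = rows;
        this.cols = cols;
        this.data = new double[rows * cols];
    }

    public int rows() { return rows; }
    public int cols() { return cols; }

    // Row-major addressing into the flat backing array.
    public double get(int i, int j) { return data[i * cols + j]; }
    public void set(int i, int j, double v) { data[i * cols + j] = v; }
}
```

Because the shape is fixed at construction and the storage is flat, a compiler can hoist bounds checks and reason about aliasing in ways that are impossible for ragged arrays of arrays.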
Defining a specific concrete final class for each Array type and rank effectively binds the semantics to the syntax of a program, enabling the use of mature compiler technology that has been developed for languages like Fortran and C. Arrays can be manipulated element-wise or as aggregates. For instance, if one wants to compute a two-dimensional Array C of shape m x n in which each element is the sum of the corresponding elements of Arrays A and B, also of shape m x n, then one can write either
Fig. 7. Simplified partial class hierarchy chart for the Array package.

[Figure 8 plot: ESSL and Java BLAS performance for SGEMM on RS/6000 260; Mflops (0 to 800) versus problem size (100 to 1000).]
Fig. 8. Performance results for ESSL and Java BLAS for SGEMM operation.
an element-wise loop of the form

for (int i=0; i<m; i++)
    for (int j=0; j<n; j++)
        C.set(i,j, A.get(i,j)+B.get(i,j));

or a single aggregate Array operation. Both forms can be aggressively optimized, as with state-of-the-art Fortran compilers.
Fig. 9. Illustration of the block recursive layout.
Fig. 10. Performance results for Java DGEMM with two array layouts.
Community Process [12]. The standardization is an important step in making Java practical for numerical computing. We note that the current naming conventions for the Array package do not follow recommended Java practice (e.g., some classes start with lower case letters). We expect this will change with the standardization process. It is also likely that the class hierarchy of the standardized package will be somewhat different. Nevertheless, the key properties of truly rectangular multidimensional arrays, important for enabling compiler optimizations, will be preserved. Appendix B. Numerical linear algebra in Java Numerical linear algebra operations are important building blocks for scientific and engineering applications. Many problems in those domains can be expressed as a system of linear equations. Much work has been done, by industry, academia, and government, to develop libraries of routines that manipulate and solve
these diverse systems of equations using numerical linear algebra. The Basic Linear Algebra Subprograms (BLAS) and the Linear Algebra Package (LAPACK) are two popular examples of such libraries available to Fortran and C programmers [7]. Part of our work in optimizing Java performance for numerically intensive computing involved the development of a linear algebra library for Java. This library is part of the Array package for Java. We call it Java BLAS. We chose to develop this library entirely in Java, with no native code components. We took advantage of Java's object-oriented features to arrive at a design that is easy to maintain, portable, and achieves high performance [1]. The implementation of our linear algebra library in Java also allowed us to pursue new optimization techniques. Linear algebra algorithms (e.g., solving for vector x in the equation Ax = b) are expressed in terms of vector and matrix operations. For that reason, we defined two interfaces, BlasVector and BlasMatrix, that define the behavior of vectors and matrices, respectively.
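The interface-based design described above might be sketched as follows. The method signatures, the DenseMatrix class, and the simplified gemm implementation are illustrative assumptions for this sketch, not the library's actual declarations.

```java
// Sketch of an interface-based linear algebra design: the algorithm
// lives in the interface, so any conforming matrix type inherits it.
interface BlasMatrix {
    int rows();
    int cols();
    double get(int i, int j);
    void set(int i, int j, double v);

    // Simplified gemm: C := alpha*A*B + beta*C, where C is 'this'.
    default void gemm(double alpha, BlasMatrix a, BlasMatrix b, double beta) {
        for (int i = 0; i < rows(); i++)
            for (int j = 0; j < cols(); j++) {
                double s = 0.0;
                for (int k = 0; k < a.cols(); k++)
                    s += a.get(i, k) * b.get(k, j);
                set(i, j, alpha * s + beta * get(i, j));
            }
    }
}

// A concrete matrix class only supplies storage and accessors; the
// gemm algorithm above works unchanged for every implementation,
// which is how one instance of an algorithm can serve several
// element types.
final class DenseMatrix implements BlasMatrix {
    private final double[] data;
    private final int m, n;
    DenseMatrix(int m, int n) { this.m = m; this.n = n; data = new double[m * n]; }
    public int rows() { return m; }
    public int cols() { return n; }
    public double get(int i, int j) { return data[i * n + j]; }
    public void set(int i, int j, double v) { data[i * n + j] = v; }
}
```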
For example, any implementation of the BlasMatrix interface must provide methods gemm (for matrix multiplication), trsm (for solution of triangular systems), and syrk (for update of symmetric matrices). Linear algebra algorithms are then expressed strictly in terms of the methods defined by the BlasVector and BlasMatrix interfaces. This approach is particularly appropriate for the implementation of linear algebra algorithms in recursive form [9]. The one- and two-dimensional floating-point Arrays in the Array package (namely floatArray1D, floatArray2D, doubleArray1D, doubleArray2D, ComplexArray1D, ComplexArray2D) implement the BlasVector and BlasMatrix interfaces, respectively. Therefore, a single instance of a linear algebra algorithm works for single-precision, double-precision, and complex floating-point numbers. This results in our linear algebra library being much smaller than equivalent implementations in C and Fortran. We have been able to achieve very respectable performance with our all-Java implementation. Figure 8 compares the performance of our Java BLAS library and the highly tuned ESSL product [10] when performing the SGEMM BLAS operation (i.e., computing C = βC + αA × B for single-precision floating-point matrices A, B, and C). In those measurements, all three matrices are of size n x n, where n is the problem size. We observe that the Java BLAS version achieves 80% of ESSL performance and 75% of the machine peak performance (800 Mflops). The area where Java allowed us to pursue new optimization techniques is in the exploitation of memory hierarchies, the multilevel cache structure of most current machines. It has been known for a while that neither the column-major layout of Fortran nor the row-major layout of C for storing multidimensional arrays is optimal for linear algebra algorithms. Java in general, and the Array package in particular, hide the specific memory layout of an array.
Therefore, we are free to organize arrays in any form that we find convenient, totally transparent to the application programmer. In particular, we have experimented with a block recursive storage layout [6]. The idea behind block recursive layouts is illustrated in Fig. 9. We start by dividing the array into two blocks and laying each block contiguous in memory. We repeat the partitioning for each block until we arrive at some convenient block size (e.g., that fits into level-1 data cache). Our experiments with a block recursive storage layout have shown significant performance improvements above and beyond what is achieved by already highly
optimized code. The performance impact of the recursive blocked layout can be observed in Fig. 10. The bottom (lighter) plot in that figure shows the performance of the BLAS DGEMM operation (i.e., the double-precision version of SGEMM), as a function of problem size, for an optimized code operating on an array with row-major layout. The top (darker) plot shows the performance for the same code operating on an array with block recursive layout. For large problem sizes, the Mflops rate for the block recursive layout can be up to 30% higher. Furthermore, we observe that the performance of the block recursive layout is more stable with respect to problem size.
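The recursive halving of Fig. 9 can be illustrated with a small sketch. If the halving alternates between dimensions and recurses all the way down to single elements, the resulting position of element (i, j) is the Morton (Z-order) index, obtained by interleaving the bits of the row and column indices. This is a deliberate simplification for illustration: the layout described in the text stops recursing at a cache-sized block rather than at single elements.

```java
// Sketch of a fully recursive blocked layout for a 2^k x 2^k array.
// Interleaving the bits of i and j gives the Morton (Z-order) offset,
// so each recursively halved block occupies a contiguous run of
// offsets, which is the property the text exploits for cache reuse.
final class MortonLayout {
    // Offset of element (i, j) in the recursively blocked storage.
    public static long offset(int i, int j) {
        long z = 0;
        for (int bit = 0; bit < 32; bit++) {
            z |= (long) (j >> bit & 1) << (2 * bit);      // column bit
            z |= (long) (i >> bit & 1) << (2 * bit + 1);  // row bit
        }
        return z;
    }
}
```

For example, the four elements of the top-left 2 x 2 block map to offsets 0 through 3, so the block is contiguous in memory regardless of the overall array size.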
References

[1] G. Almasi, F.G. Gustavson and J.E. Moreira, Design and Evaluation of a Linear Algebra Package for Java, in: Proceedings of the ACM 2000 Conference on Java Grande, ACM, June 3-4, 2000, pp. 150-159.
[2] P.V. Artigas, M. Gupta, S.P. Midkiff and J.E. Moreira, High performance numerical computing in Java: Language and compiler issues, in: 12th International Workshop on Languages and Compilers for Parallel Computing, J. Ferrante et al., eds, Vol. 1863 of Lecture Notes in Computer Science, Springer-Verlag, San Diego, CA, August 1999, pp. 1-17. Also available as IBM Research Report RC21482.
[3] R.F. Boisvert, J.J. Dongarra, R. Pozo, K.A. Remington and G.W. Stewart, Developing numerical libraries in Java, Concurrency: Pract. Exp. (UK) 10(11-13) (September-November 1998), 1117-1129. ACM 1998 Workshop on Java for High-Performance Network Computing. URL: http://www.cs.ucsb.edu/conferences/java98.
[4] R.F. Boisvert, J.E. Moreira, M. Philippsen and R. Pozo, Java and Numerical Computing, Computing in Science and Engineering 3(2) (March/April 2001), 18-24.
[5] H. Casanova, J. Dongarra and D.M. Doolin, Java Access to Numerical Libraries, Concurrency: Pract. Exp. (UK) 9(11) (November 1997), 1279-1291. Java for Computational Science and Engineering - Simulation and Modeling II, Las Vegas, NV, USA, 21 June 1997.
[6] S. Chatterjee, V.V. Jain, A.R. Lebeck, S. Mundhra and M. Thottethodi, Nonlinear array layouts for hierarchical memory systems, in: Proceedings of the 1999 International Conference on Supercomputing, Rhodes, Greece, 1999, pp. 444-453.
[7] J.J. Dongarra, I.S. Duff, D.C. Sorensen and H.A. van der Vorst, Solving Linear Systems on Vector and Shared Memory Computers, Society for Industrial and Applied Mathematics, 1991.
[8] J. Gosling, B. Joy and G. Steele, The Java(TM) Language Specification, Addison-Wesley, 1996.
[9] F.G. Gustavson, Recursion Leads to Automatic Variable Blocking For Dense Linear Algebra Algorithms, IBM Journal of Research and Development 41(6) (November 1997), 737-755.
[10] International Business Machines Corporation, IBM Parallel Engineering and Scientific Subroutine Library for AIX - Guide and Reference, December 1997.
[11] Java Grande Charter, http://www.javagrande.org/public.htm.
[12] J.E. Moreira et al., JSR-083, Java(TM) Multiarray Package, URL: http://java.sun.com/aboutJava/communityprocess/jsr/jsr_083_multiarray.html.
[13] J.E. Moreira, S.P. Midkiff, M. Gupta, P.V. Artigas, M. Snir and R.D. Lawrence, Java Programming for High Performance Numerical Computing, IBM Systems Journal 39(1) (2000), 21-56. Also available as IBM Research Report RC21481.
[14] J.E. Moreira, S.P. Midkiff, M. Gupta and R.D. Lawrence, Parallel Data Mining in Java, in: Proceedings of SC'99. Also available as IBM Research Report 21326, Nov. 1999.
[15] S.S. Muchnick, Advanced Compiler Design and Implementation, Morgan Kaufmann, San Francisco, California, 1997.
[16] R. Pozo and B. Miller, SciMark: A Numerical Benchmark for Java and C/C++, National Institute of Standards and Technology, Gaithersburg, MD, http://math.nist.gov/SciMark.
[17] V. Sarkar, Automatic selection of high-order transformations in the IBM XL Fortran compilers, IBM Journal of Research and Development 41(3) (May 1997), 233-264.
[18] M.J. Serrano, R. Bordawekar, S.P. Midkiff and M. Gupta, Quicksilver: a quasi-static compiler for Java, in: Proceedings of the Conference on Object-Oriented Programming Systems, Languages, and Applications (OOPSLA'00), Minneapolis, MN, USA, Oct. 2000, pp. 66-82.
[19] V. Seshadri, IBM High Performance Compiler for Java, AIXpert Magazine, September 1997, URL: http://www.developer.ibm.com/library/aixpert.
[20] M.J. Wolfe, High Performance Compilers for Parallel Computing, Addison-Wesley, 2000.
[21] P. Wu, S.P. Midkiff, J.E. Moreira and M. Gupta, Efficient Support for Complex Numbers in Java, in: Proceedings of the 1999 ACM Java Grande Conference, 1999, pp. 109-118. Also available as IBM Research Report RC21393.
Scientific Programming 10 (2002) 35-44 IOS Press
Dynamic performance tuning supported by program specification

Eduardo Cesar(a), Anna Morajko(a), Tomas Margalef(a), Joan Sorribes(a), Antonio Espinosa(b) and Emilio Luque(a)
(a) Computer Science Department, Universitat Autonoma de Barcelona, 08193 Bellaterra, Barcelona, Spain. Tel: +34 93 581 2888; Fax: +34 93 581 2478; E-mail: [email protected], {eduardo.cesar, tomas.margalef, joan.sorribes, emilio.luque}@uab.es
(b) Isoco (Intelligent Software Components). E-mail: Tonie@isoco.com
Abstract: Performance analysis and tuning of parallel/distributed applications are very difficult tasks for non-expert programmers, so it is necessary to provide tools that carry out these tasks automatically. Such tools can either perform a static, post-mortem analysis or tune the application on the fly. Both kinds of tools have their target applications: static automatic analysis tools are suitable for stable applications, while dynamic tuning tools are more appropriate for applications with dynamic behaviour. In this paper, we describe KappaPi as an example of a static automatic performance analysis tool, and also a general environment based on parallel patterns for developing and dynamically tuning parallel/distributed applications.
1. Introduction

The main goal of parallel and distributed computing is to obtain the highest performance in a given environment. Designers of parallel applications are responsible for providing the best possible behaviour on the target system. To reach this goal it is necessary to carry out a tuning process of the application, through performance analysis and the modification of critical application/system parameters. This tuning process implies monitoring the application execution in order to collect the relevant information, analysing this information to find the performance bottlenecks, and determining the actions to be taken to eliminate these bottlenecks. The classical way of carrying out this process has been to use a monitoring tool that collects the information generated during the execution, together with a visualisation tool that presents users with the information in a more comprehensible way, trying to help in the performance analysis [1-3]. These tools help users in the collection and presentation of information, but oblige them to carry out the performance analysis on their own. Therefore, this process requires a high degree of expertise in detecting the performance bottlenecks and, moreover, in relating them to the source code of the application or to the system components. To complete the tuning cycle, it is necessary to modify the application code or the system parameters in order to improve application performance. Consequently, the participation of users in the whole process is very significant. Many tools have been designed and developed to support this approach. However, the degree of expertise required of users and the time consumed in this process have not facilitated widespread use of such tools in real applications. To overcome these difficulties, it is very important to offer users a new generation of tools that guide them in the tuning process, avoiding the degree of expertise required by the visualisation tools. This new generation of tools must introduce certain automatic features that help and guide users in the tuning process, or even carry out certain steps automatically, in such a way that user participation can be reduced or even avoided. In this sense, two approaches can be distinguished: the static and the dynamic.
E. Cesar et al. / Dynamic performance tuning supported by program specification
In the static approach, the objective is to analyse application performance and then modify the source code, recompile it and re-run the application. Usually, this approach is based on a post-mortem analysis performed on a trace file obtained during the execution of the application. On the other hand, the dynamic approach tries to tune the application during its execution, without stopping, recompiling or even re-running it. To accomplish this objective, it is necessary to use dynamic instrumentation techniques that allow the modification of the application code on the fly. These two approaches might appear to be opposed, but can actually be considered as complementary, since they cover different application ranges and there are several techniques and methodologies common to both. Each has its advantages and disadvantages, depending on the features of the application. The static approach has the advantage that, when applications have a regular and stable behaviour, they can be tuned once; after the tuning process has been completed, the application can be executed as many times as necessary without introducing any monitoring intrusion during execution. However, there are many applications that do not have such a stable behaviour: they change from run to run according to the input data, or even change their behaviour during a single run due to the evolution of the data. In this situation, the dynamic approach makes it possible to follow the application behaviour on the fly. This requires a continuous intrusion into the program that is not necessary when application behaviour is stable. Moreover, if the analysis is carried out on the fly during the execution of the application, the information available and the time that can be spent on the analysis are considerably restricted, due to the need to modify the application in this particular run. In the following sections of this paper, new tools covering both approaches are presented.
In Section 2, we describe an automatic performance analysis tool based on a static approach. Section 3 introduces the principles of a dynamic tuning tool supported by a pattern-based design environment. Section 4 describes the pattern-based design environment; Section 5 introduces the dynamic instrumentation techniques required to carry out the tuning on the fly; and, finally, Section 6 presents the conclusions of this work.

2. A static automatic performance analysis tool: KappaPi

KappaPi (Knowledge-based Automatic Parallel Program Analyser for Performance Improvement) [4] is a
static automatic performance analysis tool that helps users in the performance improvement process by detecting the main performance bottlenecks, analysing the causes of those problems, relating the causes to the source code of the application, and providing suggestions about the bottlenecks detected and ways of avoiding them. KappaPi was designed and developed at the Computer Architecture and Operating Systems Group of the Universitat Autonoma de Barcelona. The tool is based on a post-mortem analysis of a trace file and on a knowledge base that includes the main bottlenecks found in message-passing applications. The goal of KappaPi is to provide users with hints that allow them to modify the application in order to improve performance.

2.1. KappaPi operation cycle

The first step is to execute the application with a monitoring tool in order to obtain the trace file that will be analysed by the KappaPi analyser. The trace file includes all the events that occurred during the execution of the application, related to the communication actions undertaken by the different processes. There are several tools that provide this kind of trace file, but for our purpose each event must include additional information that will be useful during the analysis phase. Besides the kind of event that has occurred (the source process, the destination process in a communication action, and so on), it is very important that the monitoring tool inserts the time stamp of each event and the source code line responsible for that particular event. TapePVM is a monitoring tool for PVM applications that includes these features, and VampirTrace for MPI applications also includes the required information. Once the trace has been generated, the KappaPi tool can be invoked. As a first step, KappaPi makes a general overview of application performance by measuring the efficiency of the different processors of the system.
KappaPi considers as performance inefficiencies those intervals in which processors are not doing any useful work; they are simply blocked, waiting for a message. Thus, the efficiency of a processor is the percentage of time during which it is doing useful work. Idle time intervals should be avoided in order to improve application performance; the best situation would be to have all the processors completely busy doing useful work during the execution of the application. In this first step, users
get some information about the overall behaviour of the application, but have no idea about the bottlenecks and their causes. After this initial classification, KappaPi starts the deep analysis by looking for performance bottlenecks. KappaPi takes chunks from the trace file and classifies the performance inefficiencies detected in each chunk. It must be pointed out that several inefficiencies can correspond to the same performance bottleneck because, in many cases, the inefficiencies are repeated throughout the execution of the application. The detected bottlenecks are classified in a table according to the inefficiency time incurred. After the first chunk has been analysed, the second chunk is analysed and a new table is built and joined to the initial one, in such a way that the new inefficiency time of the same bottleneck is added to the previous value. The process is repeated for all the chunks, and finally KappaPi provides a sorted table indicating the worst performance bottlenecks. The next stage in the KappaPi analysis is the classification of the most important inefficiencies. For this purpose, it relates these inefficiencies to certain existing categories of behaviour using a rule-based knowledge system. From this point, inefficiencies are transformed into specific performance problems that must be studied in order to build hints for users. To carry out this classification, the KappaPi tool takes the trace file events as input and applies the set of rules, deducing a list of facts. The deduced facts are kept in a list so that, in the next iteration of the algorithm, higher-order rules can be applied to them. The process terminates when no more facts are deduced. The query process finishes after the performance problem has been identified (when it fits one of the categories of the rule-based system).
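The chunk-by-chunk accumulation described above can be sketched as follows. The table representation (a map from bottleneck name to accumulated inefficiency time) is an invented illustration, not KappaPi's actual data structure.

```java
import java.util.*;
import java.util.stream.Collectors;

// Sketch of KappaPi-style accumulation: each trace chunk yields
// per-bottleneck inefficiency times, which are merged into a running
// table and finally sorted worst-first.
final class BottleneckTable {
    private final Map<String, Double> inefficiency = new HashMap<>();

    // Merge the per-bottleneck inefficiency times found in one chunk;
    // times for a bottleneck already in the table are summed.
    public void addChunk(Map<String, Double> chunkTable) {
        chunkTable.forEach((b, t) -> inefficiency.merge(b, t, Double::sum));
    }

    // Bottlenecks sorted by accumulated inefficiency time, worst first.
    public List<String> worstFirst() {
        return inefficiency.entrySet().stream()
            .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }
}
```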
The next step in the analysis is to take advantage of the problem-type information to carry out a deeper analysis that determines the causes of the performance bottleneck, with the objective of building an explanation of this problem for users.

2.2. KappaPi knowledge base

Three main types of problems are differentiated [5]: communication-related, synchronisation, and program structure problems. This classification does not claim to be a complete taxonomy of the performance problems in message-passing programs. It only reflects different types of scenarios that commonly appear when analysing the performance of message-passing applications. These reflect very different situations that require
special care when trying to improve the performance of an application. Given an execution interval of low efficiency, the description of a problem makes it possible to engage a searching process that looks for the existence of the problem in the interval of interest. Consequently, the ultimate objective of performance-problem descriptions is to automate the process of searching for problems in the performance data. As with any search process, it can be viewed as a two-step process of query formulation and execution of this query on the performance data space. The queries define the high-level constructs of the application programming model. In this way, the system recognises a programming structure that is close to users, with the subsequent objective of finding its performance limitations and suggesting possible improvements. Therefore, we must build a language to express these queries and a system with which to execute them. Additionally, the need for automation also requires the creation of the queries that will implement the search process. For this purpose, we have built a simple rule-based system that carries out a process of deduction using the trace file events and the deduced facts of the system. Rules are divided into different levels. The deduction process applies all rules in the first level to the trace events until no further facts are deduced. Then, these recently deduced facts serve as input to the next level of rules, and the deduction process is applied again. This process continues until the last level of rules is finished. In this way, higher-order facts can be deduced from lower-level events. For example, a matching pair of send and receive events allows the deduction of a communication between two processes. Building facts on others previously deduced allows the system to detect higher-order execution situations. In principle, this system can allow the detection of any high-level construction that decomposes into smaller, lower-level operations.
Rules encapsulate special program execution configurations that commonly represent performance problems. These configurations range from detailed low-level situations, such as the behaviour of the communication receives of certain processes, to global collaboration schemes of the application, such as the master/worker scheme. The actual classification used contains the following situations:

Communication problems

- Blocked Sender
Communication problem caused by two linked blocked receives: one process is blocked waiting for a message from a second process, which is itself blocked waiting for a message from a third process. The rule defined for this problem is:

(blocked sender, process p1, process p2, process p3) =
(receive at process p3 from process p2) &
(receive at process p2 from process p1) &
(send from process p1 to process p2)

- SPMD unbalance problems

The data partition between a group of processes results in a loss of performance. The rules needed refer to the detection of a task link graph in which all tasks are connected to each other (rel stands for relationship):

(complete subgraph, p1, p2, p3, ..., pn) =
(rel p1, p2) & (rel p1, p3) & ... & (rel p1, pn) &
(complete subgraph, p2, p3, ..., pn)
(rel p1, p2) = (communication, p1, p2) & (communication, p2, p1)

- Multiple output

Communication problem caused by the serialisation of the output messages of a process. The rule needed to detect this problem is:

(multiple output, from process p1, to process p2, process p3, ...) =
(receive at process p2 from process p1) &
(receive at process p3 from process p1) &
(send from process p1 to process p2) &
(send from process p1 to process p3)

Synchronisation problems

- Barrier synchronisation

Barrier waiting times create a delay in the execution of the application. The rule needed for detection is:

(barrier problem, process p1 blocked time xx, process p2 blocked time yy, ...) =
(barrier call, process p1, blocked time) &
(barrier call, process p2, blocked time)

Program structure problems

- Master/worker

The master/worker collaboration scheme generates idle intervals. The rules defined for this problem are:

(master/worker, p1 and p2) = (dependence, p1, p2) & (relationship, p1, p2)
(dependence, p1, p2) = (communication, p1, p2) & (blocked, p2, from p1)
(relationship, p1, p2) = (communication, p1, p2) & (communication, p2, p1)
(communication, p1, p2) = (send, from p1, to p2) & (receive, at p2, from p1)
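As an illustration of how such rules might be mechanised, the following sketch matches the blocked-sender rule against a list of deduced facts. The encoding of facts as simple (kind, from, to) triples is an invented simplification for this sketch, not KappaPi's actual implementation.

```java
import java.util.List;
import java.util.Optional;

// Sketch of matching the blocked-sender rule against deduced facts.
final class BlockedSenderRule {
    // A fact triple: kind ("send" or "receive"), source process, and
    // destination process; ("receive", "p1", "p2") means a receive at
    // p2 waiting for a message from p1.
    public static final class Fact {
        public final String kind, from, to;
        public Fact(String kind, String from, String to) {
            this.kind = kind; this.from = from; this.to = to;
        }
    }

    // Deduce (blocked sender, p1, p2, p3) when a receive at p3 waits
    // on p2, a receive at p2 waits on p1, and p1 has sent to p2.
    public static Optional<String> deduce(List<Fact> facts) {
        for (Fact r1 : facts) {
            if (!r1.kind.equals("receive")) continue;       // receive at p3 from p2
            for (Fact r2 : facts) {
                if (!r2.kind.equals("receive") || !r2.to.equals(r1.from))
                    continue;                               // receive at p2 from p1
                for (Fact s : facts) {
                    if (s.kind.equals("send") && s.from.equals(r2.from)
                            && s.to.equals(r2.to))          // send from p1 to p2
                        return Optional.of("blocked sender " + r2.from
                                + " " + r2.to + " " + r1.to);
                }
            }
        }
        return Optional.empty();
    }
}
```

The nested loops play the role of the "&" conjunctions in the rule: each loop binds one fact, and the conditions enforce that the process variables shared between the rule's terms coincide.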
2.3. Building recommendations for users

The process of building a recommendation for users starts with the simpler objective of building an expression of the highest-level deduced fact, which includes the situation found, the importance of the problem and the program elements involved in it. In some cases, it is possible to evaluate the impact of a certain change in the part of the program that created the problem. Only then is it possible to calculate the impact of a different solution and build a suggestion for users. The creation of this description strongly depends on the nature of the problem found, but in the majority of cases there is a need to collect more specific information to complete the analysis. In these cases, it is necessary to access the source code of the application and to look for specific primitive sequences or data references. Therefore, some specialised pieces of code (or "quick parsers"), which look for specific source information, must be called to complete the performance analysis description. This last stage of the performance analysis can be thought of as an information gathering process. Its objective is to use the identification of the performance problems found in the analysis to build a description of these problems for users. This description represents the feedback that the tool gives to users. Therefore, the given information includes a description of the performance problems found, the importance of each problem relative to the global execution, and the program elements that are involved in the problem. Sometimes, this gathering can trigger a new, deeper analysis of the problem to describe the causes of its generation. In such cases, it is useful to look for specific details in the application or in the trace file under analysis.
E. Cesar et al. /Dynamic performance tuning supported by program specification
3. Dynamic performance tuning supported by program specification

As mentioned in the introduction, a different approach from static automatic performance analysis is that of dynamic performance tuning. This approach fits applications that can behave differently in different executions. It requires neither developer intervention nor even access to the source code of the application. The running parallel application is automatically monitored, analysed and tuned without the need to re-compile, re-link and restart. Dynamic performance tuning of parallel applications is a task that must be carried out during application execution and, therefore, there are certain points to be considered:
- It is necessary to minimise the intrusion of the tool. Besides the classical monitoring intrusion, in dynamic performance tuning there are certain additional overheads due to monitor communication, performance analysis and program modifications.
- The analysis must be quite simple, because decisions must be taken in a short time to be effective in the execution of the program.
- The modifications must not involve a high degree of complexity, because it is not realistic to assume that any modification can be made on the fly.
For all these reasons, the analysis and modifications cannot be very complex. Since monitoring, evaluation and modification must be done at execution time, it is very difficult to carry this out without previous knowledge of the structure and functionality of the application. The programmer can develop any kind of program; hence, the generated bottlenecks can be extremely complicated. In such a situation, the analysis and the modifications might be extremely difficult. If knowledge about the application is not available, the applicability and effectiveness of our approach is significantly reduced. Therefore, an effective solution is to extract as much information from the application development framework as possible.
We therefore propose an environment that covers all of the aspects mentioned above. The environment consists of two main parts: an application development framework and a dynamic performance tuning tool. The first part supports programmers in the design and development of their application. Users are constrained to use a set of programming patterns, but by using them they skip the details related to low-level parallel programming. The main goal of the second part - the dynamic performance tuning tool - is to improve performance by modifying the program during its execution without recompiling and rerunning it. This task is achieved by monitoring the application, analysing the performance behaviour and finally tuning selected parts of the running program. The whole environment (the application design tool together with the dynamic performance tuning tool) allows the programmer to concentrate on the application design without taking into account low-level details and without having to worry about program performance. When developers build the application in our environment, they use the patterns provided by the framework. Hence, the kind of structures and paradigms that a developer has chosen is a known entity. On the other hand, the patterns provided by the framework are well-known structures that may present certain well-known performance bottlenecks. We can therefore define information about the application that can be used by the tuning environment. This information allows the tuning tool to know what must be monitored (measure points), what the performance model is and what can be changed to obtain better performance (tuning points). Using this knowledge, the dynamic performance tuning tool is simplified, because the set of performance bottlenecks to be analysed and tuned are only those related to the programming patterns offered to users.

3.1. Environment modules

Our pattern-based programming and dynamic performance tuning environment consists of several modules:
1. Application framework - this tool is based on common parallel patterns and offers support to users in developing their parallel application. The framework provides specific information to the dynamic performance tuning environment that can simplify the tuning of the pattern-based application on the fly.
2.
Monitor - this tool collects events produced during the execution of the parallel application. To collect them, the monitor dynamically inserts instrumentation into the original program execution, taking into account all the running processes of the application. Generally, the instrumentation is specified by the framework (measure points), but it can also be specified interactively by a user
before program execution. If the analyser requires more or less information, it can notify the monitor to change the instrumentation dynamically during run-time.
3. Performance Analyser - this module is responsible for the automatic performance analysis of a parallel application "on the fly". During the execution, the analysis tool receives selected events that occur in the application's processes. Using the received events and the knowledge given by the framework (performance model), the analyser detects the performance bottlenecks, determines the causes and decides what should be tuned in the application to improve performance. Detected problems and recommended solutions are also reported to users.
4. Tuner - this module automatically modifies a parallel application. It utilises the solutions given by the analyser as well as the information provided by the framework (tuning points). The tuner manipulates the running process and improves the program performance by inserting appropriate modifications. It needs neither access to the source code nor a program restart.
Figure 1 shows how the described modules dynamically interact among themselves and with the application (in the run-time phase), also indicating the information that they obtain from the framework used in building the application (development phase). On the one hand, our approach requires an application framework, which includes knowledge about the patterns and the behaviour of their implementation in the parallel application. On the other hand, to accomplish the goals of our dynamic tuning approach, we need to use a dynamic instrumentation technique. Only this technique allows the insertion of new code into a running program without accessing the source code. The following sections describe the two main parts of our environment in further detail: the application framework and dynamic performance tuning supported by dynamic instrumentation.
4. Application framework

The solution to most concurrent problems can be obtained from the application of a finite set of design patterns [6,7]. Moreover, it is possible to offer a finite set of pattern implementations (frameworks), which depend on the design pattern used and also on the implementation paradigm (message passing, shared memory). We have focused our work on frameworks devoted to message passing systems, with two main objectives in mind:
- To allow programmers to concentrate on codifying application-related issues, concealing low-level details of the communication library from them.
- To facilitate the dynamic performance tuning of the application, defining a performance model for each framework.
To attain the first objective, we provide a library that offers users the possibility of developing an application based on parallel programming paradigms such as Master-Worker, Pipeline, SPMD, and Divide & Conquer. Object-oriented programming techniques are the natural way of implementing patterns, not only due to their capacity for encapsulating behaviour, but also because they offer a well-defined interface to users. For these reasons, we have designed a class hierarchy in C++ which encapsulates the pattern behaviour, and also a class to encapsulate the communication library. In this sense, using our API the programmer simply has to fill in those methods related to the particular application being implemented, indicating the computation that each process has to perform and the data that must be communicated with other processes. A configuration tool complements this library, in which users indicate the general structure of the application and the data structures that will be communicated by each process. The tool uses this configuration information to generate the adequate object structure and the communication classes. To fulfil the second objective, it is necessary to develop a performance model for each framework that makes it possible to know what the ideal behaviour of the pattern should be, how far the real behaviour is from this ideal and how this ideal behaviour could be reached.
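To give a flavour of the "fill in the methods" style of API described above, the following sketch shows the idea. The real framework is a C++ class hierarchy wrapping a message-passing library; this Python version and every name in it are invented purely for illustration:

```python
# Hypothetical sketch of a pattern-based framework API; all names are
# invented. The real framework is a C++ class hierarchy that also hides
# the message-passing communication behind generated classes.

class MasterWorker:
    """Framework-provided pattern: users override only the three hooks."""
    def __init__(self, num_workers):
        self.num_workers = num_workers  # a typical tuning point

    # --- hooks the application programmer fills in ---
    def generate_tasks(self):
        raise NotImplementedError

    def compute(self, task):
        raise NotImplementedError

    def combine(self, results):
        raise NotImplementedError

    # --- communication details hidden by the framework ---
    def run(self):
        tasks = self.generate_tasks()
        # stand-in for distributing tasks to worker processes
        results = [self.compute(t) for t in tasks]
        return self.combine(results)

class SumOfSquares(MasterWorker):
    def generate_tasks(self):
        return list(range(5))

    def compute(self, task):
        return task * task

    def combine(self, results):
        return sum(results)

print(SumOfSquares(num_workers=4).run())  # 30
```

The point of the design is that `num_workers` is a parameter the framework owns, so a tuning tool can change it without touching user code.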
Consequently, we have to analyse possible performance bottlenecks for each framework, which measures have to be taken to detect these bottlenecks, and what actions might be taken to overcome them. The objective is to detect from outside the application that there is some (possibly undetermined) performance problem using very little information, and then try to isolate the specific problem by gathering new data from the application. The measures that have to be obtained to detect and isolate the performance problems are defined by the performance model of the
Fig. 1. Design of the dynamic tuning environment supported by program specification.
used framework, which in turn is used by the tuning tool to decide where, and when, to introduce corrective actions. We have adopted a methodology that allows a unified approach to the definition, analysis and implementation of each framework, but which also defines a way to add new frameworks in the future (flexibility). The methodology includes:
- A general description of the framework.
- Establishing the elements needed to specify the user application in terms of the selected framework (interface). This includes initialisation and ending, functional description, and communication management.
- Characterising the associated framework bottlenecks.
- Determining the parameters needed to detect these bottlenecks (measure points).
- Determining the parameters that could be changed to overcome these bottlenecks and the actions that could be taken on them (tuning points).
The frameworks that have been included up to this point are the following:
- Master-Worker:
* Description: this framework consists of a master process that generates requests to other processes called workers. These workers perform a computation on the requests and then send the results back to the master.
* Interface: how the master generates tasks, the actual computation that must be undertaken by each worker and the processing that the master must carry out on the received results.
* Bottlenecks: performance differences among workers, too few workers, too many workers, computational differences in request processing (due, for example, to the task granularity).
* Measure points: communication times, workers' computation times.
* Tuning points: task distribution (which includes message size, i.e., the number of tasks sent at one time to a worker) and the number of workers.
- Pipeline:
* Description: this pattern represents those algorithms that can be divided into an ordered chain of processes, where the output of a process is forwarded to the input of the next process in the chain.
* Interface: the work that must be carried out at each stage of the pipe, the input and output data, and the connections among stages.
* Bottlenecks: significant performance differences among stages, bad communication/computation ratio.
* Measure points: computing time and data load for each stage, stage waiting time.
* Tuning points: the number of consecutive stages per node, the number of parallel instances of a stage.
- SPMD (Single Program Multiple Data):
* Description: this represents those algorithms where the same processing is applied on different data portions, with some communication patterns among processing elements.
* Interface: this specifies the task that must be carried out by all the processes, including the
data communication (send-receive) pattern and protocol (all-to-all, 2D mesh, 2D torus, 3D cube, and so on).
* Bottlenecks: performance differences among processes.
* Measure points: computing time and data load of each process, waiting time for other processes.
* Tuning points: number of intercommunicating processes per node, number of instances of a process, and, in certain cases, data distribution.
- Divide and Conquer:
* Description: each node receives some data and decides either to process it or to create new processes with the same code and distribute the received data among them. The results generated at each level are gathered to the upper level. Each process receives partial results, carries out some computation based on them and passes the result to the upper level.
* Interface: the processing of each node, the amount of data to be distributed and the initial configuration of nodes.
* Measure points: completion time for each branch, computation time for each process, branch depth.
* Tuning points: number of branches generated in each division, data distribution among branches, and branch depth.
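As a concrete illustration of how measure points feed a performance model, consider the Master-Worker bottleneck "performance differences among workers". A minimal check (names and the 25% threshold are invented for this sketch; the real model is more elaborate) might compare per-worker computation times against their mean:

```python
# Illustrative sketch: flagging worker imbalance from measured computation
# times (a Master-Worker "measure point"); threshold and names are invented.

def find_slow_workers(worker_times, tolerance=0.25):
    """Return workers whose computation time exceeds the mean by more
    than `tolerance` (relative), suggesting a load imbalance."""
    mean = sum(worker_times.values()) / len(worker_times)
    return sorted(w for w, t in worker_times.items()
                  if t > mean * (1.0 + tolerance))

times = {"w1": 10.2, "w2": 9.8, "w3": 19.5, "w4": 10.1}
print(find_slow_workers(times))  # ['w3']
```

A flagged worker would then point the tuning tool at the corresponding tuning point, for example shifting the task distribution away from that worker.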
5. Dynamic performance tuning by dynamic instrumentation

The main goal of our work is to provide an environment that automatically improves the performance of parallel programs during run-time. To be able to achieve this objective, we make use of a special dynamic instrumentation technique. The implementation of the technique is provided by a library called DynInst, which is presented in the first subsection. The second subsection describes further details of our dynamic tuning tool, which is based on DynInst and the application framework.

5.1. Dynamic instrumentation: DynInst

The principle of dynamic instrumentation is to defer program instrumentation until the program is in execution and to insert, alter and delete this instrumentation dynamically
during program execution. This approach was first used in the Paradyn tool developed at the University of Wisconsin and the University of Maryland. In order to build an efficient automatic analysis tool, the Paradyn group developed a special API that supports dynamic instrumentation. The result of their work was called the DynInst API [8]. DynInst is an API for runtime code patching. It provides a C++ class library for machine-independent program instrumentation during application execution. The DynInst API allows attaching to an already-running process or starting a new process, creating a new piece of code and finally inserting the created code into the running process. The next time the instrumented program executes the block of code that has been modified, the new code is executed. Moreover, the program being modified is able to continue its execution and does not need to be re-compiled, re-linked, or restarted. DynInst manipulates the address-space image of the running program and, thus, this library only needs to access a running program, not its source code. However, DynInst requires an instrumented program to contain debug information. The process to be instrumented is simply called the application or mutatee. A separate process that modifies an application process via DynInst is called the mutator. The DynInst API is based on the following abstractions:
- point - a location in a program where new code can be inserted, e.g. function entry, function exit.
- snippet - a representation of a piece of executable code to be inserted into a program at a given point; a snippet must be built as an AST (Abstract Syntax Tree). It can include conditionals, function calls, loops, etc.
- thread - a thread of execution (this means a process or a lightweight thread).
- image - refers to the static representation of a program on disk. Each thread is associated with exactly one image.
Taking into account the possibilities offered by the DynInst library, it is possible to insert code into the running application. Our dynamic tuning tool uses this library for two main objectives:
- Insert code for monitoring purposes to collect information on the behaviour of the application. The module supporting this function will be called the "monitor".
Fig. 2. Dynamic tuning system design.
- Insert code for performance tuning. The main goal of dynamic tuning is to improve the performance on the fly. Therefore, it is necessary to change the code of the application. The module supporting this function will be called the "tuner".

5.2. Dynamic performance tuning architecture

The current version of our dynamic tuning tool is implemented in C++ and is dedicated to PVM-based applications. However, the system architecture is open and could easily be extended to support applications that use other message passing communication libraries. In fact, the communication library details are hidden in our approach, since the low-level code is generated automatically by the application framework. In general, a parallel application environment usually comprises several computers. A parallel application consists of several intercommunicating processes that solve a common problem. Processes are mapped on a set of computers and hence each process may be physically executed on a different machine. This situation means that it is not enough to improve processes separately without considering the global application view. To improve the performance of the entire application, we need access to global information about all processes on all machines. To achieve this, we need to distribute the modules of our dynamic tuning tool (monitors and tuners) to all machines where application processes are running. In Fig. 2, we present a scheme of the dynamic tuning tool indicating the distribution of the modules. The figure also illustrates the interactions between all the modules of the dynamic tuning system, which we describe in the following paragraphs in more detail. To collect the events that happen during the execution of each process, the monitor makes use of the DynInst library and dynamically inserts the instrumentation
into the original process execution. Using information given by the framework, this module can instrument each process at points that are highly specific to the monitored application (it knows the points where a bottleneck can occur), thus minimising intrusiveness. From the implementation point of view, we insert a piece of monitoring code (snippet) into the running program at all points that are needed to discover performance problems. Such a snippet logs events that happen during program execution. The logging snippet may be inserted at arbitrary points defined by the used patterns, for instance at the entry and/or exit of the pvm_send and pvm_recv functions if communication is a potential bottleneck. When the function is executed, the snippet code logs the timestamp, all function parameters and the execution time, and sends them as events to the analyser. The monitor distribution means that events from different tasks are collected on different machines. However, our approach to dynamic analysis requires a global and ordered set of events, and thus we have to send all events to a central location. The analyser module can reside on a dedicated machine collecting events from all distributed monitors. For faster and easier analysis, this module also uses information given by the framework - especially the performance model that is the consequence of the pattern or patterns chosen to develop an application. Although this module has such knowledge from the framework, the analysis is still assumed to be time-consuming. It can significantly increase application execution time if both the analyser and the application are running on the same machine. In order to reduce intrusion, the analysis should be executed on a dedicated and distinct machine (the performance "optimiser" machine). The analysis must be carried out globally by taking the behaviour of the entire application into consideration. The collected events are used to detect potential problems.
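The requirement for a global and ordered set of events can be met by merging the per-machine event streams by timestamp at the analyser. A minimal sketch (assuming each monitor already delivers its own events time-sorted; the event-tuple layout is invented):

```python
import heapq

# Sketch: the analyser merges per-machine, locally time-sorted event
# streams into one globally ordered stream. The (timestamp, process, kind)
# tuples are invented for illustration.

def merge_event_streams(*streams):
    return list(heapq.merge(*streams, key=lambda e: e[0]))

machine_a = [(1.0, "P1", "pvm_send"), (4.0, "P1", "pvm_recv")]
machine_b = [(2.0, "P2", "pvm_recv"), (3.0, "P2", "pvm_send")]
for event in merge_event_streams(machine_a, machine_b):
    print(event)
```

In practice clock skew between machines would also have to be handled before such a merge is meaningful.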
Obviously, during the analyser computation,
monitor modules can still trace the application. In certain situations, the analyser may need more information about program execution to detect a problem or determine the action to be taken. Therefore, it can request the monitor to change the instrumentation dynamically in order to provide more detailed information about specific program behaviour. Consequently, the monitor must be able to modify the program instrumentation - add more or remove whatever is redundant - depending on the needs of the performance analysis. To detect problems, find their causes and provide a solution, we take advantage of the knowledge and experience gained from work undertaken on the KAPPA-PI tool. The last module - the tuner - receives the decision from the analyser and automatically modifies the application during run-time using the DynInst library. It is based on knowledge of mapping problem solutions to code changes (tuning points). Therefore, when a problem has been detected and the solution has been given, the tuner must find appropriate modifications and apply them dynamically to the running process. Here our framework is also very useful, because it provides information about the parameters that can be changed and the actions that can be taken to overcome the bottleneck. Therefore, the tuner knows what it must modify in order to improve performance. The inclusion of new code into a process must be done during run-time without recompiling and re-running it. Applying modifications requires access to the appropriate process; hence the tuner must be distributed on the different machines. For example, when an application is based on the master/worker pattern, the parameter that is important for good performance is the number of workers. Therefore, if there are insufficient workers doing the work, the application might need far more time to finish. The analyser discovers this problem and recommends increasing the number of workers.
The tuner receives this information and, by using the knowledge provided by the application framework, finds out which parameter in the application represents this number. To change the code, it finds the variable in the running process via DynInst and modifies its value. The rest of the work is carried out by the framework runtime. The framework detects the change of the variable and adjusts the number of workers accordingly. In our example, the next time that the application distributes the data, there will be more workers to do the work.
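The worker-number example amounts to a small decision rule in the analyser. An invented sketch of what such a rule could look like (the idle-fraction thresholds and all names are made up; the paper does not specify the actual heuristic):

```python
# Invented sketch of the analyser's decision in the master/worker example:
# if workers are kept busy and tasks still queue at the master, recommend
# more workers; if workers sit mostly idle, recommend fewer.

def recommend_workers(current, avg_worker_idle_fraction, tasks_queued):
    if tasks_queued > 0 and avg_worker_idle_fraction < 0.05:
        return current + 1     # too few workers: grow the pool
    if avg_worker_idle_fraction > 0.5 and current > 1:
        return current - 1     # workers mostly idle: shrink the pool
    return current             # leave the tuning point unchanged

print(recommend_workers(4, avg_worker_idle_fraction=0.01, tasks_queued=12))  # 5
print(recommend_workers(4, avg_worker_idle_fraction=0.7, tasks_queued=0))    # 3
```

The tuner then only has to write the recommended value into the framework's worker-count variable; the framework runtime does the rest, as described above.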
6. Conclusions

We have presented two kinds of tools for automatic performance analysis. KappaPi is a knowledge-based static automatic performance analysis tool that analyses a trace file looking for bottlenecks and provides certain hints to users. Users can take advantage of these hints in order to modify the application to improve performance. The second approach to parallel/distributed performance analysis and tuning includes a pattern-based application design tool and a dynamic performance tuning tool. The sets of patterns included in the pattern-based application design tool have been selected to cover a wide range of applications. They offer well-defined behaviour, and the bottlenecks that can occur are also very well determined. In this sense, both the analysis of the application and performance tuning on the fly can be carried out successfully. Using this environment, programmers can design their applications in a fairly simple way, and then have no need to concern themselves with performance analysis or tuning, as dynamic performance tuning automatically takes care of these tasks.

Acknowledgement

This work was supported by the Comision Interministerial de Ciencia y Tecnologia (CICYT) under contract number TIC 98-0433.

References

[1] D.A. Reed, P.C. Roth, R.A. Aydt, K.A. Shields, L.F. Tavera, R.J. Noe and B.W. Schwartz, Scalable Performance Analysis: The Pablo Performance Analysis Environment, Proceedings of the Scalable Parallel Libraries Conference, IEEE Computer Society, 1993, pp. 104-113.
[2] W. Nagel, A. Arnold, M. Weber and H. Hoppe, VAMPIR: Visualization and Analysis of MPI Resources, Supercomputer 1 (1996), 69-80.
[3] Y.C. Yan and S.R. Sarukhai, Analyzing parallel program performance using normalized performance indices and trace transformation techniques, Parallel Computing 22 (1996), 1215-1237.
[4] A. Espinosa, T. Margalef and E. Luque, Integrating Automatic Techniques in a Performance Analysis Session, Lecture Notes in Computer Science (Vol. 1900), Euro-Par 2000, Springer-Verlag, 2000, pp. 173-177.
[5] A. Espinosa, Automatic performance analysis of parallel programs, PhD thesis, Universitat Autonoma de Barcelona, September 2000.
[6] J. Schaffer, D. Szafron, G. Lobe and I. Parsons, The Enterprise model for developing distributed applications, IEEE Parallel and Distributed Technology 1(3) (1993), 85-96.
[7] J.C. Browne, S. Hyder, J. Dongarra, K. Moore and P. Newton, Visual Programming and Debugging for parallel computing, IEEE Parallel and Distributed Technology 3(1) (1995), 75-83.
[8] J.K. Hollingsworth and B. Buck, Paradyn Parallel Performance Tools, DynInst API Programmer's Guide, Release 2.0, University of Maryland, Computer Science Department, April 2000.
Scientific Programming 10 (2002) 45-53 IOS Press
Memory access behavior analysis of NUMA-based shared memory programs
Jie Tao*, Wolfgang Karl and Martin Schulz
LRR-TUM, Institut für Informatik, Technische Universität München, 80290 München, Germany
Tel: +49-89-289-{28397,28278,28399}; E-mail: {tao,karlw,schulzm}@in.tum.de
Abstract: Shared memory applications running transparently on top of NUMA architectures often face severe performance problems due to bad data locality and excessive remote memory accesses. Optimizations with respect to data locality are therefore necessary, but require a fundamental understanding of an application's memory access behavior. The information necessary for this cannot be obtained using simple code instrumentation due to the implicit nature of the communication handled by the NUMA hardware, the large amount of traffic produced at runtime, and the fine access granularity in shared memory codes. In this paper an approach to overcome these problems and thereby to enable an easy and efficient optimization process is presented. Based on a low-level hardware monitoring facility in coordination with a comprehensive visualization tool, it enables the generation of memory access histograms capable of showing all memory accesses across the complete address space of an application's working set. This information can be used to identify access hot spots, to understand the dynamic behavior of shared memory applications, and to optimize applications using an application specific data layout resulting in significant performance improvements.
1. Motivation

The development of parallel programs which run efficiently on parallel machines is a difficult task and takes much more effort than the development of sequential codes. A programmer has to consider communication and synchronization requirements, the complexity of data accesses, and the problem of partitioning work and data. Even after a program has been validated and produces correct results, a considerable amount of work has to be done in order to tune the parallel program to efficiently exploit the resources of the parallel machine. This task becomes even more complicated on parallel machines with NUMA characteristics (Non-Uniform Memory Access). Shared memory programs running on top of such machines often face severe performance problems due to bad data locality and excessive remote memory accesses. In this case, optimizations with respect to data locality are necessary for a better performance. The information required for data locality optimizations cannot be acquired easily as communication events are potentially very frequent, relatively short, fine-grained, and implicit.

In this paper, a comprehensive approach for an easy and efficient data locality optimization process is presented. This approach is based on a hardware monitoring concept which allows the acquisition of very fine-grained communication events. The information is delivered to a visualization tool which enables the generation of memory access histograms capable of showing all memory accesses across the complete virtual address space of an application's working set. Using this graphical representation, the programmer can identify access hot spots, understand the dynamic behavior of shared memory applications, and optimize the program with an application specific data layout resulting in significant performance improvements. The approach has been developed and evaluated on PC clusters with an SCI interconnection technology (Scalable Coherent Interface [3,5]). SCI supports memory-oriented transactions over a ringlet-based network topology, effectively supporting a distributed shared memory in hardware. In order to support shared memory programming on top of such a NUMA architecture, a software framework has been developed within the SMiLE project (Shared Memory in a LAN-like Environment) which closes the semantic gap between the global view of the distributed physical memories in NUMA architectures and the global virtual memory abstraction required by shared memory programming models [8,17]. This framework supports, in principle, almost arbitrary shared memory programming models on top of the PC cluster [13] and thereby creates a flexible target platform for the presented monitoring approach.

*Jie Tao is a staff member of Jilin University, China and is currently pursuing her Ph.D. at the Technische Universität München, Germany.

The paper is organized as follows. The next section presents the SMiLE approach supporting shared memory programming on SCI-based PC clusters. The SMiLE monitoring approach is covered in Section 3. Section 4 describes the tool environment supporting the data locality optimization process. The paper concludes with a brief overview of related work in Section 5 and some final remarks in Section 6.
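A memory access histogram of the kind described above can be thought of as a count of accesses per page of the virtual address space, split into local and remote accesses. The following sketch (page size, record layout and all sample numbers are invented for illustration; the real data comes from the hardware monitor) shows the idea:

```python
from collections import Counter

PAGE_SIZE = 4096

# Sketch: building a per-page memory access histogram from monitored
# accesses recorded as (address, accessing_node, home_node) triples.

def access_histogram(accesses):
    local, remote = Counter(), Counter()
    for addr, node, home in accesses:
        page = addr // PAGE_SIZE
        (local if node == home else remote)[page] += 1
    return local, remote

accesses = [
    (0x0000, 0, 0), (0x0040, 0, 0),   # page 0: local accesses
    (0x1008, 0, 1), (0x1010, 0, 1),   # page 1: remote (home is node 1)
    (0x1020, 1, 1),                   # page 1: local
]
local, remote = access_histogram(accesses)
hot = max(remote, key=remote.get)     # remote hot spot: migration candidate
print(hot, remote[hot])               # 1 2
```

A page dominated by remote accesses from one node is a natural candidate for migrating to that node, which is exactly the kind of data-layout decision the visualization is meant to support.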
2. Shared memory in NUMA clusters

Cluster systems in combination with NUMA (Non-Uniform Memory Access) networks work on the principle of a global physical address space and enable each node to transparently access the memories on all other nodes within the system. They thereby form a favorable architectural tradeoff by combining the scalability and cost-effectiveness of standard clusters with a shared memory support close to CC-NUMA and SMP systems. However, in order to exploit this hardware support for shared memory environments, a software framework is required which closes the semantic gap between the distributed physical memories of NUMA machines and the global virtual memory abstraction required by shared memory programming models. Such a shared memory framework based on SCI (Scalable Coherent Interface [3]), an IEEE-standardized [5] cluster interconnect with NUMA capabilities, has been developed within the SMiLE [7] (Shared Memory in a LAN-like Environment) project, which broadly investigates SCI-based cluster computing. In addition, this framework, called HAMSTER [13] (Hybrid-dsm based Adaptive and Modular Shared memory archiTEctuRe), is capable of supporting in principle almost arbitrary shared memory programming models on top of a common hybrid-DSM core, the SCI Virtual Memory or SCI-VM [8,17]. This new type of DSM system establishes the global virtual address space required for applications by combining the present NUMA HW-DSM with lean software management. On top of this foundation, HAMSTER provides a large range of shared memory and cluster resource services, including an efficient synchronization module called SyncMod [18]. These services are designed in a way that allows the implementation of programming models in a low-complexity fashion, enabling the creation of as many different models as required or necessary for the intended target users and/or applications. In summary, HAMSTER enables the efficient exploitation of loosely coupled NUMA clusters for shared memory programming without binding users to a new, custom API.

Despite the efficient and direct utilization of the underlying HW-DSM and the lean implementation of the corresponding software components, it can be observed that applications often suffer from significant performance deficiencies. The reason for this can be found in excessive remote memory accesses which, despite SCI's extremely low latency, can still be up to an order of magnitude slower than local ones. It is therefore of great importance to study and optimize the locality behavior of shared memory applications in such NUMA scenarios, since this will have a significant impact on the overall execution time and parallel efficiency of these codes.
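To make the impact of this latency gap concrete, a simple back-of-the-envelope cost model helps (an illustrative sketch, not taken from the paper): even a modest fraction of remote accesses dominates the average memory access time when remote accesses cost roughly ten times a local access.

```python
def avg_access_time(remote_fraction, t_local=1.0, remote_factor=10.0):
    """Average memory access cost, assuming remote accesses cost
    remote_factor times a local access (illustrative model only)."""
    return (1.0 - remote_fraction) * t_local \
        + remote_fraction * remote_factor * t_local

# With only 20% remote accesses, the average access already costs
# 2.8x a purely local access.
slowdown = avg_access_time(0.2) / avg_access_time(0.0)
```

This is why pushing the remote-access fraction down toward the unavoidable communication minimum, as the locality optimizations described here aim to do, pays off so strongly.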
3. Observing shared memory accesses

In order to optimize the runtime behavior of shared memory applications on top of such loosely coupled NUMA architectures, it is necessary to enable users to understand the memory access patterns of their applications and their impact with respect to the actual memory distribution. Depending on the application and its complexity, this can be a quite difficult and tedious task. It is therefore important to support this process with appropriate tools. The basis for any of these, however, is the ability to monitor the memory access behavior on-line during the runtime of the application.

3.1. Challenges

The most severe problems connected with such a monitoring component stem from the fact that shared memory traffic is implicit by default and performed at runtime through transparently issued loads and stores to remote data locations. Unlike in message passing systems, where explicit communication points are known and hence can be instrumented, in a
shared memory environment other ways to track remote communication have to be found. In addition, shared memory communication is very fine-grained (normally at word level). This also renders code instrumentation recording each global memory operation infeasible, since it would slow down the execution significantly and thereby distort the final monitoring result to a point where it is unusable for an accurate performance analysis. Furthermore, such an approach would require a complete cache and consistency model emulation, since these two components play a large role in filtering the actual memory references which need to be served from main and/or remote memory. Therefore, the only viable alternative is to deploy a hardware monitoring device observing the actual link traffic of the NUMA interconnection network. Only this guarantees fine-grained information about the actual communication, after any access filtering by caches, without influencing the actual execution behavior. Such a device, the SMiLE hardware monitor, has been developed within the SMiLE project for the observation of SCI network traffic [6].

3.2. The SMiLE monitoring approach

The SMiLE hardware monitor is designed to be attached to an internal link on current SCI adapters, the so-called B-Link. This link connects the SCI link chip to the PCI side of the adapter card and is hence traversed by all SCI transactions intended for or originating from the local node. Due to the bus-like implementation of the B-Link, these transactions can be snooped without influencing or even changing the target system and can then be transparently recorded by the SMiLE hardware monitor. To avoid having to store the complete transaction information for later processing, and to enable an efficient on-line analysis of the observed data, the hardware monitor performs a sophisticated real-time analysis of the acquired data.
The results are so-called memory access histograms, which show the number of memory accesses across the complete virtual address space of an application's working set, separated with respect to target node IDs. These histograms give the user a first and direct overview of the real memory access behavior of the target application. In order to save valuable hardware resources, the SMiLE hardware monitor implements a swapping mechanism for its counter components. Whenever all counters are filled or one counter is about to overflow, a counter is evicted from the hardware monitor
and stored in a ring buffer in main memory. The free counter is then reclaimed by the monitoring hardware for the further monitoring process. The resulting monitoring information in the main memory ring buffer is then collected by corresponding driver software and combined into the complete access histogram. As the information, by the time it is evicted from the monitor, is relatively coarse-grained, this combination step has rather low computational demands and therefore only a minimal impact on the application execution behavior [6].

In addition to this histogram mode, also referred to as dynamic mode due to its adaptability to the memory access behavior of the target application, the SMiLE monitor also contains a static monitoring component which allows watching predefined events or accesses to special, user-definable memory regions. In contrast to the former method, which is intended for a first performance overview of the complete application, this static mode is very useful for the detailed analysis of specific bottlenecks. Together, the two modes enable the complete analysis of the memory access behavior of shared memory applications on top of SCI.

Currently, the SMiLE hardware monitor is still under development, with a first experimental prototype expected within the next six months. In order to be able to start with software development nevertheless, and to prove the feasibility and usability of the presented approach, a realistic simulator for NUMA architectures has been developed [20] and is also being used within this work. It is designed in a way which allows a clean and easy migration from the simulation platform to the real hardware, guaranteeing the validity of the presented approach.

3.3. The SMiLE tool infrastructure

In order to make the information gathered by the hardware monitor available to the user, it first has to be transformed to a higher level of abstraction.
This is necessary since all acquired data is based on individual memory accesses observed at the SCI network adapter and is hence by nature based purely on physical addresses. It needs to be transformed into the virtual address space and then related to source code information. For this task, a comprehensive SMiLE software infrastructure is under development [9] (see also Fig. 1). It is based on the information acquired from the underlying monitoring component and transforms and enhances it using additional data collected from the various components of the overall system, including the SCI Virtual Memory and its synchronization module (see also Section 2) as well as various interfaces in the programming models. This information is then aggregated and made globally accessible through the OMIS interface [1], an established on-line monitoring specification, and OCM (the OMIS Compliant Monitor) [23], the corresponding reference implementation. As a result, this monitoring infrastructure offers tools a standardized and highly structured way to access the distributed information. Currently, the main focus lies on a sophisticated visualization tool, called DLV [21], which will be discussed below in more detail. In the future, further shared memory tools will complete the monitoring infrastructure; especially an integrated and automatic data migration and load balancing component is envisioned.
4. Access behavior analysis

As mentioned in Section 2, the latency of remote memory accesses is one of the most important performance issues on NUMA systems. Optimizing programs with respect to data locality can minimize the accesses to remote memory modules and improve memory access performance. It requires, however, an understanding of a program's memory access behavior at run-time. The SMiLE tool infrastructure described in Section 3 provides a monitoring facility for observing the interconnection traffic and enables a comprehensive analysis of the runtime memory access behavior of shared memory applications based on the observed data.

4.1. The visualization tool

As already mentioned, the current focus of the SMiLE tool infrastructure lies on the Data Layout Visualizer (DLV) [21], a comprehensive visualization tool for shared memory traffic on NUMA architectures. It is capable of transforming the fine-grained data acquired by snooping hardware monitors like the SMiLE hardware monitor into a human-readable and easy-to-use form, enabling an exact understanding of an application's memory access behavior and the detection of communication hot spots. It provides a set of display windows showing the memory access histograms in various views and projecting the memory addresses back to data structures within the source code (see Fig. 2). This allows the programmer to analyze an application's access pattern and thereby forms the basis for any optimization of the physical data layout and distribution.

For this purpose, the DLV includes several different views on the acquired data, each presented by a specific window within the GUI. The most basic one among them is the "Run-time transfer" window (shown top left in Fig. 2). It is designed to give a global overview of the actual data transfer performed on the interconnection fabric and visualizes the number of network packets between all nodes in the system. In addition, communication bottlenecks are highlighted based on user-defined thresholds (relative to a system-wide average). Going further into detail and looking at accesses related to their destination within the whole shared virtual address space, the "Histogram table" (shown top right in Fig. 2) shows exact numbers of accesses at page granularity. As above, access hot spots are highlighted using user-defined thresholds and are also extracted into a further table shown in the "Most frequently accessed" window. The same information, though with less detail, can also be shown graphically in the "Access diagram" (shown at the bottom of Fig. 2). It presents the memory access histogram at page granularity using colored columns to show the relative frequency of accesses performed by all nodes. In this diagram, inappropriate data allocation can easily be detected via the different heights of the columns. In addition, the corresponding data structure of a selected page can be shown in a small combined window by clicking a mouse button within the graphic area. These diagrams can therefore direct users to place data on the node that requires it most.

The "Histogram table" and the "Access diagram" described above serve as the basis for a correct allocation of pages accessed predominantly by one node. For pages accessed by multiple processors, however, it is more difficult to determine their location.
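The accumulation and highlighting logic behind these views can be sketched as follows (an illustrative reimplementation, not the DLV's actual code): accesses are counted per page and per accessing node, and a page is flagged as a hot spot when its total access count exceeds a user-defined multiple of the system-wide average.

```python
PAGE_SIZE = 4096  # bytes; assumed page granularity for this sketch

def access_histogram(accesses):
    """Count accesses per (page, node); accesses = iterable of (vaddr, node)."""
    hist = {}
    for vaddr, node in accesses:
        key = (vaddr // PAGE_SIZE, node)
        hist[key] = hist.get(key, 0) + 1
    return hist

def hot_pages(hist, threshold=2.0):
    """Pages whose total access count exceeds threshold * system-wide average."""
    per_page = {}
    for (page, _node), count in hist.items():
        per_page[page] = per_page.get(page, 0) + count
    avg = sum(per_page.values()) / len(per_page)
    return sorted(p for p, c in per_page.items() if c > threshold * avg)
```

The per-node breakdown kept in the histogram is what allows the "Access diagram" to show, for each page, which node performs most of the accesses.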
To support such multiply-accessed pages, the DLV provides a "Page analysis" window to illustrate the access behavior within a page, directing programmers to partition such a page and distribute it among the nodes in the system. An example of this is given in the next section.

Besides the direct information based on virtual addresses discussed so far, the DLV is also capable of relating the presented data to user data structures within the source code. This is done with the help of the "Data structure and location" window, which can be activated as a main window alongside other windows or as a subwindow within them, showing the mappings between
the virtual address currently under investigation and the corresponding data structure within the source code. It therefore enables the user to relate the information acquired and visualized within the DLV to the data structures exhibiting the observed behavior, and thereby to precisely modify the physical layout and distribution of the structures causing a problematic execution behavior.

Fig. 2. GUI of the Data Layout Visualizer (DLV) with a few sample windows.
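The address-to-data-structure mapping behind the "Data structure and location" window can be sketched as a simple interval lookup (a hypothetical helper, assuming the tool has access to symbol or allocation information giving each structure's start address and size):

```python
def structures_on_page(symbols, page, page_size=4096):
    """Return names of data structures overlapping the given page.

    symbols: list of (name, start_vaddr, size) from debug/allocation info.
    """
    lo, hi = page * page_size, (page + 1) * page_size
    return [name for name, start, size in symbols
            if start < hi and start + size > lo]

# Hypothetical layout: a few globals at the front of the first page,
# followed by a large matrix (names and addresses are invented).
symbols = [("matrix_size", 0, 4),
           ("node_ids", 4, 64),
           ("matrix", 128, 512 * 512 * 8)]
```

With such a lookup, a hot page reported by the histogram views can be traced back to the variables that occupy it, mirroring the SOR analysis in the next section where page 0 turns out to hold global variables in front of the matrix.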
4.2. Analyzing a sample code

The various DLV windows can efficiently be used to characterize the memory access behavior of shared memory codes. This will be demonstrated in the following using Successive Over-Relaxation (SOR) as a basic and easy-to-follow example. SOR is a numerical kernel that is used to iteratively solve partial differential
equations. Its main working set is a large dense matrix. During each iteration, a four-point stencil is applied to all points of this matrix. Due to the fixed and uniform work distribution across all matrix points, and due to the numerical stability of the approach, which allows a reordering of the stencil updates, the matrix is split into blocks of equal size during the parallelization process. Each block is assigned to one processor of the cluster, and each processor subsequently only updates the part of the matrix assigned to it. The elements of the matrix, on the other hand, are stored transparently over the whole cluster corresponding to the default allocation scheme of the SCI-VM, a round-robin distribution at page granularity. Processors therefore communicate on most matrix/memory accesses to reach the data included in the individual blocks, leading to poor speedup.

Fig. 3. Memory accesses of some pages on node 1 (SOR code).

In order to understand its memory access behavior, the execution of the SOR code, running on a 4-node cluster using a 512 by 512 grid (about 1 MB memory footprint), is simulated for 10 iterations and the monitored data is visualized. Based on this information, the access pattern of the SOR code can be exactly analyzed and in the next step optimized. First, we use the "Run-time transfer" window to get a first overview of the complete application and to detect simple communication bottlenecks. In this case, however, this display only shows that every node is accessed frequently by others, without the existence of a single dominating bottleneck. This indicates that the overall working set of the SOR code is not allocated correctly. In order to find the hot spots, the next step
is to analyze the memory access histogram using the "Access diagram" window. Figure 3 shows three different sections of the complete node diagram from the view of node 1 as the local node (incoming memory transactions). The memory access behavior of pages on other nodes is quite similar due to the symmetric work distribution and parallelization concept within SOR. The figure first shows that page 0 behaves differently from the rest, as it is accessed by all nodes. By examining the "Data structure and location" information, it can be seen that this is caused by accesses to global variables located at the beginning of this page, before the actual matrix part. These include information about the matrix size as well as IDs for nodes, locks, and barriers, which are required by all parallel threads. Past this initial information, it can be seen that all pages up to 64 are only accessed by node 1 (the local node), pages 68 to 124 by node 2, and so on. The data structure information provided by the DLV (shown in the small top window in Fig. 3) additionally shows that these access blocks correspond to matrix blocks of the main SOR matrix.

While the "Access diagram" offers a general understanding of a program's access behavior, a more detailed analysis can be performed using the "Page analysis" window of the visualizer. This window provides facilities to exhibit the memory accesses within a page and can be used for border pages with accesses from more than one node. Figure 4 shows the information about one such page: the "Section" subwindow shows the total references at the finest possible granularity (mostly L2 cache line size), and the "Read/write"
subwindow presents the concrete numbers of reads and writes. These windows thereby give exact information about the memory accesses within a page and can be used to clearly identify sharing properties. In the concrete example, Fig. 4 shows that the first few sections are only accessed by node 1, followed by sections with overlapping accesses from both nodes. This indicates that such a page can be partitioned and distributed, with the first sections located together with the whole block on pages 4 to 64 on the corresponding node. For the following sections with true sharing, however, a correct distribution is more difficult. In this case, the read/write curve can be used to determine the optimal placement: since blocking read operations are more expensive than writes, which are non-blocking, the node with the highest read frequency has priority for owning them.

In summary, the information acquired from the DLV windows clearly shows the blocked memory access and matrix distribution strategy present in the SOR code. This observation is also likely to hold for all other working set sizes. Hence, it is possible to deduce an application's memory access behavior based on the analysis of a single working set size and thereby to optimize the application in general.

4.3. Using the information for easy optimization

Based on the analysis described above, programs can be optimized by specifying a data layout fitting their memory access pattern. This is done by placing data which the DLV indicates as incorrectly allocated on the correct nodes, using annotations available in the programming models. For the SOR code, this can be done by specifying a blocked memory distribution corresponding to the matrix subdivision into blocks and their assignment to processors.
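What such an annotation expresses can be illustrated by contrasting two page-to-node mappings (a sketch only; the concrete annotation syntax of the SCI-VM programming models is not reproduced here): the default round-robin placement stripes pages across nodes, while a blocked placement gives each node the contiguous page range holding its matrix block.

```python
def round_robin_home(page, num_nodes):
    """Default SCI-VM-style allocation: pages striped across nodes."""
    return page % num_nodes

def blocked_home(page, num_pages, num_nodes):
    """Blocked distribution: contiguous page ranges per node, matching
    the blockwise assignment of matrix rows to processors."""
    block = (num_pages + num_nodes - 1) // num_nodes
    return min(page // block, num_nodes - 1)

num_pages, num_nodes = 512, 4  # illustrative sizes
# Under round-robin, node 0 owns only every 4th page of its own block:
local_rr = sum(1 for p in range(128) if round_robin_home(p, num_nodes) == 0)
# Under the blocked placement, node 0 owns its whole block:
local_blk = sum(1 for p in range(128) if blocked_home(p, num_pages, num_nodes) == 0)
```

In this model, the blocked placement turns three quarters of node 0's accesses to its own block from remote into local accesses, which is the effect behind the speedups reported below.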
In order to verify the efficiency of this optimization, the modified version of the SOR code has been executed on a real NUMA cluster, along with two further codes, a Gaussian elimination (GAUSS) and a particle simulation (WATER-NSQUARED from SPLASH-2 [24]), which have been optimized in a similar way. Both the optimized and the transparent version of each code have been executed on a 4-node Xeon 450 MHz cluster interconnected using D320 SCI adapter cards by Dolphin ICS and the software setup described in Section 2. The results of these experiments are shown in Table 1 in terms of speedup in comparison to a sequential execution. They show a rather poor performance when executed on a fully transparent memory layout, with a significant slow-down. This picture changes drastically after the modification of the memory layout. Now a significant speedup can be observed, especially for the SOR code with a speedup of over 3.1 on 4 processors. All other codes also benefit significantly, proving both the importance and the feasibility of the presented approach.

5. Related work

In recent years, monitoring approaches have been increasingly investigated, and some hardware monitors [10,12,14] have been developed for tracing interconnection traffic. None of them, however, is applied to improving the data locality of running programs. An example is the trace instrument at Trinity College Dublin [14]. It has been built for monitoring the SCI packets transferred across the network, as our hardware monitor does, but is intended only for the analysis of the interconnect traffic, with the goal of improving the modeling accuracy for network simulation systems. Data locality, on the other hand, has been addressed intensively, since it has a severe influence on the performance of NUMA systems. Among these efforts [2,4,11,15,16,19,22], which are primarily based on compiler analysis and page migration, one is especially closely related to the approach presented here: the Dprof profiling tool [2] developed by SGI. Dprof samples a program during its execution and records the program's memory access information in a histogram file. This histogram can be manually plotted using gnuplot to analyze which regions of virtual memory each thread of a program accesses, and further directs the explicit placement of specific ranges of virtual memory on a particular node. In comparison with the approach presented in this paper, the Dprof report is based on statistical sampling and therefore does not record all references; in addition, the numbers of memory accesses are shown at page granularity, allowing no detailed understanding of accesses within a single page.
This restricts the accuracy of the memory behavior analysis and prohibits a proper specification of an optimal data layout.

Fig. 4. The detailed access character of page 65 on node 2 (SOR code).

Table 1
Execution on a real cluster with 4 nodes, with and without optimizations

                        SOR                GAUSS              WATER-N.
                        Time      Speedup  Time      Speedup  Time       Speedup
Sequential execution    1.36 s    1        4.83 s    1        15.83 s    1
Transparent parallel    78.16 s   0.0174   44.95 s   0.1075   116.12 s   0.14
Optimized parallel      0.43 s    3.16     2.11 s    2.289    5.20 s     3.04

6. Conclusions

The successful deployment of NUMA architectures using the shared memory paradigm depends greatly on the efficient use of memory locality. Otherwise, applications will be penalized by excessive remote memory accesses and their significantly higher latencies. The tuning of applications towards this goal, however, is a difficult and complex task, since all communication is executed implicitly by read and write accesses and cannot directly be observed using simple software instrumentation without incurring a high probe overhead. Therefore, a low-level hardware monitoring facility in coordination with a comprehensive toolset has to be provided, enabling users to perform the required optimizations.

Such an environment has been presented in this work. It consists of a low-level hardware monitor capable of observing the complete inter-node memory access traffic across the interconnection network, and a tool infrastructure transforming the gathered information about the runtime behavior of the application into a human-readable form and enhancing it with additional information acquired through the various layers of the runtime environment. The current focus of this tool environment is a comprehensive visualization tool, the Data Layout Visualizer (DLV), which presents the acquired information in a graphical and easy-to-use way. This includes the creation of memory access histograms which give a complete overview of the application execution across the whole address space without requiring any previous knowledge about the application. The information can then be used to analyze the memory access behavior of the target application and to optimize its memory distribution, leading to a performance improvement that is significant in most cases. This has been proven by a set of numerical kernels, for which the optimization has enabled good speedup values on a 4-node SCI cluster. It is expected that a similar benefit can also be achieved for larger codes and systems as well as for more complex access patterns.

Even though this work was based on a single specific NUMA architecture, PC-based clusters interconnected with SCI, the approach is principally applicable to any other NUMA system that allows memory traffic to be snooped on any node. It therefore presents a general approach for the optimization of shared memory applications in such systems and can potentially be used far beyond the context of the SMiLE project.
References

[1] M. Bubak, W. Funika, R. Gembarowski and R. Wismüller, OMIS-compliant monitoring system for MPI applications, Proc. 3rd International Conference on Parallel Processing and Applied Mathematics (PPAM'99), Kazimierz Dolny, Poland, Sept. 1999, pp. 378-386.
[2] D. Cortesi, Origin2000 and Onyx2 Performance Tuning and Optimization Guide, chapter 4, Silicon Graphics Inc., 1998. Available from: http://techpubs.sgi.com:80/library/manuals/3000/007-3430-002/pdf/007-3430-002.pdf.
[3] H. Hellwagner and A. Reinefeld, eds, SCI: Scalable Coherent Interface: Architecture and Software for High-Performance Computer Clusters, volume 1734 of Lecture Notes in Computer Science, Springer Verlag, 1999.
[4] G. Howard and D. Lowenthal, An Integrated Compiler/Run-Time System for Global Data Distribution in Distributed Shared Memory Systems, Proceedings of the Second Workshop on Software Distributed Shared Memory Systems, 2000.
[5] IEEE Computer Society, IEEE Std 1596-1992: IEEE Standard for Scalable Coherent Interface, The Institute of Electrical and Electronics Engineers, Inc., 345 East 47th Street, New York, NY 10017, USA, August 1993.
[6] W. Karl, M. Leberecht and M. Schulz, Optimizing Data Locality for SCI-based PC-Clusters with the SMiLE Monitoring Approach, Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT), Oct. 1999, pp. 169-176.
[7] W. Karl, M. Leberecht and M. Schulz, Supporting Shared Memory and Message Passing on Clusters of PCs with a SMiLE, in: Proceedings of the Workshop on Communication and Architectural Support for Network-based Parallel Computing (CANPC) (held in conjunction with HPCA), volume 1602 of LNCS, A. Sivasubramaniam and M. Lauria, eds, Springer Verlag, Berlin, 1999, pp. 196-210.
[8] W. Karl and M. Schulz, Hybrid-DSM: An Efficient Alternative to Pure Software DSM Systems on NUMA Architectures, Proceedings of the 2nd International Workshop on Software DSM (held together with ICS 2000), May 2000.
[9] W. Karl, M. Schulz and J. Trinitis, Multilayer Online-Monitoring for Hybrid DSM systems on top of PC clusters with a SMiLE, Proceedings of the 11th Int. Conference on Modelling Techniques and Tools for Computer Performance Evaluation, volume 1786 of LNCS, Springer Verlag, Berlin, Mar. 2000, pp. 294-308.
[10] S. Karlin, D. Clark and M. Martonosi, SurfBoard: A Hardware Performance Monitor for SHRIMP, Technical Report TR-596-99, Princeton University, Mar. 1999.
[11] A. Krishnamurthy and K. Yelick, Analyses and Optimization for Shared Space Programs, Journal of Parallel and Distributed Computing 38(2) (1996), 130-144.
[12] D. Lenoski, J. Laudon, T. Joe, D. Nakahira, L. Stevens, A. Gupta and J. Hennessy, The DASH Prototype: Logic Overhead and Performance, IEEE Transactions on Parallel and Distributed Systems 4(1) (Jan. 1993), 41-61.
[13] M. Schulz, Efficient deployment of shared memory models on clusters of PCs using the SMiLEing HAMSTER approach, in: Proceedings of the 4th International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP), A. Goscinski, H. Ip, W. Jia and W. Zhou, eds, World Scientific Publishing, Dec. 2000, pp. 2-14.
[14] M. Manzke and B. Coghlan, Non-intrusive deep tracing of SCI interconnect traffic, Conference Proceedings of SCI Europe'99, a conference stream of Euro-Par'99, Toulouse, France, Sept. 1999, pp. 53-58.
[15] A. Navarro and E. Zapata, An Automatic Iteration/Data Distribution Method based on Access Descriptors for DSMM, Proceedings of the 12th International Workshop on Languages and Compilers for Parallel Computing (LCPC'99), La Jolla, San Diego, CA, USA, 1999.
[16] D. Nikolopoulos, T. Papatheodorou et al., User-Level Dynamic Page Migration for Multiprogrammed Shared-Memory Multiprocessors, Proceedings of the 29th International Conference on Parallel Processing, Toronto, Canada, Aug. 2000, pp. 95-103.
[17] M. Schulz, True shared memory programming on SCI-based clusters, chapter 17 in Hellwagner and Reinefeld [3], volume 1734 of LNCS, Oct. 1999, pp. 291-311.
[18] M. Schulz, Efficient Coherency and Synchronization Management in SCI based DSM systems, in: Proceedings of SCI-Europe 2000, the 3rd international conference on SCI-based technology and research, G. Horn and W. Karl, eds, ISBN 82-595-9964-3, also available at http://wwwbode.in.tum.de/events/, Aug. 2000, pp. 31-36.
[19] S. Tandri and T. Abdelrahman, Automatic Partitioning of Data and Computations on Scalable Shared Memory Multiprocessors, Proceedings of the 1997 International Conference on Parallel Processing (ICPP '97), Washington/Brussels/Tokyo, Aug. 1997, pp. 64-73.
[20] J. Tao, W. Karl and M. Schulz, Using Simulation to Understand the Data Layout of Programs, Proceedings of the IASTED International Conference on Applied Simulation and Modelling (ASM 2001), Marbella, Spain, Sept. 2001, to appear.
[21] J. Tao, W. Karl and M. Schulz, Visualizing the memory access behavior of shared memory applications on NUMA architectures, Proceedings of the 2001 International Conference on Computational Science (ICCS), volume 2074 of LNCS, San Francisco, CA, USA, May 2001, pp. 861-870.
[22] B. Verghese, S. Devine, A. Gupta and M. Rosenblum, OS support for improving data locality on CC-NUMA compute servers, Technical Report CSL-TR-96-688, Computer Systems Laboratory, Stanford University, Feb. 1996.
[23] R. Wismüller, Interoperability Support in the Distributed Monitoring System OCM, Proc. 3rd International Conference on Parallel Processing and Applied Mathematics (PPAM'99), Kazimierz Dolny, Poland, Sept. 1999, pp. 77-91.
[24] S. Woo, M. Ohara, E. Torrie, J. Singh and A. Gupta, The SPLASH-2 Programs: Characterization and Methodological Considerations, Proceedings of the 22nd International Symposium on Computer Architecture (ISCA), June 1995, pp. 24-36.
Scientific Programming 10 (2002) 55-65 IOS Press
SKaMPI: a comprehensive benchmark for public benchmarking of MPI

Ralf Reussner a,*, Peter Sanders b and Jesper Larsson Träff c

a Distributed Systems Technology Center (DSTC), Monash University Caulfield Campus, 900 Dandenong Road, Caulfield East, VIC 3145, Australia
E-mail: [email protected]
b Max-Planck-Institut für Informatik, Stuhlsatzenhausweg 85, D-66123 Saarbrücken, Germany
E-mail: [email protected]
c C&C Research Laboratories, NEC Europe, Rathausallee 10, D-53757 Sankt Augustin, Germany
E-mail: [email protected]
Abstract: The main objective of the MPI communication library is to enable portable parallel programming with high performance within the message-passing paradigm. Since the MPI standard has no associated performance model and makes no performance guarantees, comprehensive, detailed and accurate performance figures for different hardware platforms and MPI implementations are important for the application programmer, both for understanding and possibly improving the behavior of a given program on a given platform, and for assuring a degree of predictable behavior when switching to another hardware platform and/or MPI implementation. We term this latter goal performance portability, and address the problem of attaining performance portability by benchmarking. We describe the SKaMPI benchmark, which covers a large fraction of MPI and incorporates well-accepted mechanisms for ensuring accuracy and reliability. SKaMPI is distinguished from other MPI benchmarks by an effort to maintain a public performance database with performance data from different hardware platforms and MPI implementations.
1. Introduction

The Message-Passing Interface (MPI) [6,15,29] is arguably the most widespread communication interface for writing dedicated parallel applications on (primarily) distributed memory machines.1 The programming model of MPI is a distributed memory model with explicit message-passing communication among processes, coupled with powerful collective operations over sets of processes. MPI ensures portability of application programs to the same extent that the supported application programming languages C, C++, and Fortran are portable, and is carefully designed to be efficiently implementable on a wide variety of hardware platforms. Indeed, many high-quality vendor implementations achieve MPI communication performance close to that of their native communication subsystem.

Apart from basic semantic properties (liveness etc.) there is no performance model and there are no performance guarantees associated with MPI, and the MPI standard stipulates no performance requirements for a valid MPI implementation. Without empirical performance data it is therefore not possible to predict or analyze the performance of a parallel application using MPI, much less to predict and obtain good performance when moving to another platform and/or MPI implementation. Reliable figures for the performance characteristics of MPI implementations on as many different platforms as possible are indispensable to guide the design of efficient and performance-portable parallel applications with MPI.

*The work described in this paper was done while this author was at Lehrstuhl Informatik für Ingenieure und Naturwissenschaftler, Universität Karlsruhe, Germany.
1 Throughout this paper MPI denotes the message-passing core of the interface, MPI-1 [29]. We expressly say so when addressing MPI-2 extensions.

ISSN 1058-9244/02/$8.00 © 2002 - IOS Press. All rights reserved

Performance characteristics include the "raw" performance of MPI communication primitives, both for message-passing and collective communication for
varying parameters (message lengths, number of processes), performance under "load" (e.g. bisection bandwidth) or with typical communication patterns (e.g. master-slave, ring), as well as comparative measurements of different realizations of collective operations. Such information allows the application programmer to tune his application for a specific platform by choosing the appropriate communication primitives, and to tune for good performance across different platforms.

There are several benchmarks for MPI which partly address these issues. In this paper we describe the Special Karlsruher MPI benchmark, SKaMPI, which in particular addresses the issue of cross-platform performance portability by maintaining a public database of performance measurements for different platforms. Some main features of SKaMPI are:

- Coverage of (almost) all of the MPI standard, including collective operations and user-defined datatypes.
- Assessment of performance under different communication patterns, e.g. ping-pong and master-slave.
- Automatic parameter refinement for accuracy, reliability and speed of benchmarking.
- Operation controlled by configuration files, which allow for detailed and flexible planning of experiments; the benchmark comes with a default set of measurement suites.
- A report generator, which allows for automatic preparation of measurements into a readable form.
- Last but not least, a public performance database available on the WWW, which allows for interactive comparison of MPI performance characteristics across different implementations and platforms.

The SKaMPI project was initiated by Ralf Reussner, Peter Sanders, Lutz Prechelt and Matthias Müller at the University of Karlsruhe in 1996-97 [23,25,28], and has since been developed with new features and broader MPI coverage [26]. The interactive WWW database was implemented by Gunnar Hunzelmann [11,24]. URL of the SKaMPI project: http://liinwww.ira.uka.de/skampi/

1.1. Related work

Benchmarking has always played an important role in high-performance computing. For MPI, several benchmarks exist which differ in philosophy, goals, and
level of ambition. In this section we briefly review some other well-known MPI benchmarks in relation to SKaMPI; the discussion is not meant to be exhaustive. A general discussion of problems and pitfalls in (MPI) benchmarking can be found in [7] and [10]; SKaMPI adheres to the sound advice of these papers.

Benchmarking of application kernels [1,2,18] is traditionally used to get an idea of the overall performance of a given machine, but such benchmarks measure communication in a specific, complex context and can only indirectly be used to guide the development of efficient programs. A widely used MPI benchmark is the mpptest shipped with the MPICH implementation of MPI [8,16]; it measures nearly all MPI operations, but is less flexible than SKaMPI and has limited coverage of user-defined datatypes. The low-level part of the PARKBENCH benchmarks [18] measures communication performance and provides a result database, but does not give much information about the performance of individual MPI operations. The MPI part of P.J. Mucci's Low-Level Characterization Benchmarks (LLCbench) [13], mpbench, pursues similar goals to SKaMPI, but it covers only a part of MPI and makes rather rough measurements assuming a "dead" machine. The Pallas MPI Benchmark (PMB) [17] is easy to use and has a simple, well-defined measurement procedure, but covers relatively few functions and offers no graphical evaluation. PMB is one of the few MPI benchmarks that cover MPI-2 functionality (one-sided communication, some MPI-I/O). Rolf Rabenseifner's effective bandwidth benchmark [4] attempts to give a realistic picture of the achievable communication bandwidth. Bandwidth is measured by a ring pattern over all processes, which is implemented both with simple send and receive operations and with a collective MPI_Alltoallv operation. Results from a number of high-performance platforms are publicly available.
The effective bandwidth benchmark has recently been complemented by a similar I/O benchmark [20,21]. A comparison of the SKaMPI, PMB, mpptest and mpbench benchmarks is given in [14] for benchmarking MPI on an SGI Origin 2000. The benchmarks give roughly similar results, but differ in finer details due to different assumptions on the use of cache, and on the placement and size of communication buffers. The mpbench benchmark is confirmed to be sensitive to other activities on the machine. Many studies measure selected functions in more detail [5,19,22], but the codes are often not publicly available, not user configurable, and not designed for ease of use, portability, and robust measurements.
2. Performance considerations

MPI is an extensive interface, and communication can be expressed in many different ways. MPI offers two basic types of communication: point-to-point message-passing, where information is passed explicitly between a sending and a receiving process, and collective communication, where a set of processes jointly performs a communication operation, possibly involving computation as in MPI_Reduce. For the application programmer this raises a number of questions that must be answered in order to get the best possible performance on a given platform/MPI implementation, as well as to obtain and/or predict performance when moving to a different platform/MPI implementation.

Selection of communication mode: MPI differentiates between blocking and non-blocking point-to-point communication, which can be further adapted by different communication modes: standard, synchronous, ready and buffered. There also exist specialized compound operations like MPI_Sendrecv for simultaneous sending and receiving of data. It is possible to receive non-deterministically by using wildcards like MPI_ANY_TAG and/or MPI_ANY_SOURCE. The performance of point-to-point communication thus depends on the application context, the MPI implementation, and hardware capabilities that may allow especially efficient implementations of some of these primitives in special contexts.

Use of collective operations: A number of collective operations are available, but they are not always used, either because they are not sufficiently well known, or because their implementation is distrusted by some users. The question is whether an available MPI library provides good implementations. Do the implementations compare favorably to simple(r), ad hoc point-to-point based implementations?

Use of compound collectives: The MPI standard offers certain compound collective operations (MPI_Allreduce, MPI_Allgather, and others) that can easily be expressed in terms of more primitive collectives (e.g. MPI_Reduce, MPI_Gather, and MPI_Bcast). These compound operations are included in MPI because better algorithms than the simple concatenation of more primitive collectives exist. Is this exploited in a given MPI implementation?
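Why a compound collective can beat the naive concatenation may be seen from a toy round-count model (our own illustration, not from the paper; the function names are invented): a recursive-doubling allreduce finishes in about log2 p communication rounds, while a binomial-tree reduce followed by a broadcast needs roughly twice as many.

```python
import math

def tree_steps(p):
    """Communication rounds of a binomial-tree reduce or bcast on p processes."""
    return math.ceil(math.log2(p))

def compound_allreduce_steps(p):
    """Rounds of a butterfly (recursive-doubling) allreduce."""
    return math.ceil(math.log2(p))

def concatenated_allreduce_steps(p):
    """Rounds of the naive MPI_Reduce followed by MPI_Bcast."""
    return tree_steps(p) + tree_steps(p)

# For p = 16: the butterfly needs 4 rounds, reduce + bcast needs 8.
```

Whether a given MPI library actually exploits such algorithms is exactly what the SKaMPI comparisons are meant to reveal.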
MPI user-defined datatypes: MPI has a powerful mechanism for working with user-structured, possibly non-consecutive data, but not all MPI implementations support user-defined datatypes equally well [9,26,30]. Is the best performance achieved by maintaining non-consecutive data "manually" or by relying on the MPI mechanism?
3. The SKaMPI benchmark

The SKaMPI benchmark package consists of three parts: the skampi.c benchmarking program itself, an optional post-processing program (also a C program), and a report generation tool (a Perl script). It is complemented by an interactive public database of benchmark results, accessible through the WWW.

A run of SKaMPI is controlled by a configuration file, .skampi, which can be modified for more selective or detailed benchmarking. A default configuration file defines a standard run of the benchmark, and the results from such a run can be reported to the SKaMPI database. The configuration file starts with a preamble identifying the benchmarker, the MPI implementation, and the machine and network used. It defines output and log files, and sets various default values. For the standard configuration file only the @MACHINE, @NODE, @NETWORK, @MPIVERSION and @USER fields have to be modified. The @MEMORY field controls the total size of communication buffers per processor (in KBytes). Figure 1 shows a sample configuration file. We refer to this example in the following.

The benchmark program produces an ASCII text file skampi.out (selected by @OUTFILE) in a documented format [27]; it can be further processed for various purposes. The post-processing program is only needed when the benchmark is run several times (see Section 4.3). Post-processing can also be done by SKaMPI itself by setting @POSTPROC to yes. The report generator reads the output file and generates a postscript file containing a graphical representation of the results. This includes comparisons of selected measurements. The report generator can also be customized via a parameter file. Reports (actually: output files) are collected in the SKaMPI result database in Karlsruhe, which can be queried for both textual and graphical presentation of results (including downloadable encapsulated postscript figures).
The @MEASUREMENTS keyword starts the description of the actual experiments to be performed. This is
@MACHINE IBM SP
@NODE thin
@NETWORK hpf-switch3
@MPIVERSION aix-mpi library
@USER R. Reussner
@MEMORY 8192
@OUTFILE skampi.out
@LOGFILE skampi.log
@POSTPROC no
@CACHEWARMUP 5
@BASETYPE1 MPI_INT
@MEASUREMENTS
MPI_Send-MPI_Recv-dynamicVector1
Type = 1; Basetype_Number = 1;
Send_Datatype_Number = 50;
Receive_Datatype_Number = 50;
Variation = Length; Scale = Dynamic_log;
Max_Repetition = Default_Value;
Min_Repetition = Default_Value;
Multiple_of = Default_Value;
Time_Measurement = Invalid_Value;
Time_Suite = Invalid_Value;
Node_Times = yes;
Cut_Quantile = Default_Value;
Default_Chunks = 0;
Default_Message_length = 256;
Start_Argument = 0, End_Argument = Max_Value;
Stepwidth = 1.414213562;
Max_Steps = Default_Value;
Min_Distance = 2; Max_Distance = 512;
Standard_error = Default_Value;
Fig. 1. A .skampi configuration file with a one-measurement suite.
a list of named measurement suites, each of which controls a set of measurements to be performed. The configuration file in Fig. 1 lists only the single suite named MPI_Send-MPI_Recv-dynamicVector1. Each suite has a Type which identifies the pattern used to control and time the execution of the MPI operations to be benchmarked. More precisely, a type is an instantiation of one of the four SKaMPI patterns with a specific combination of MPI functions. Individual measurements with given parameters are repeated a number of times determined by default value settings (Max_Repetition, Min_Repetition) and by SKaMPI's adaptive parameter refinement mechanism. The set of measurements to be performed by a suite is furthermore determined by the selected dimension to be varied along, which can be either message length, number of processes, or number of chunks (for the master-worker
Fig. 2. The instances (column Type) of the SKaMPI patterns.
pattern). For variation along message length, an interval and a stepwidth must be given (Start_Argument, End_Argument, and Stepwidth). The overall time that should be spent measuring a suite can be set in Time_Suite. Fig. 2 lists the currently existing types of suites and the patterns to which they belong. A SKaMPI run comprises the results for the whole list of suites.

3.1. The patterns

The way execution times are measured and reported, and the way a set of measurements is coordinated, is determined by a so-called pattern, of which SKaMPI currently has four. These predefined measurement strategies make it easy to extend SKaMPI with new pattern instances to cover new MPI functions and/or MPI functions in new contexts. Only a small core function with the proper MPI calls has to be written; the measurement infrastructure of the pattern is automatically reused. All current pattern instances are listed in Fig. 2.

The ping-pong pattern coordinates point-to-point communication between a pair of processors. The ping-pong exchange is between two processors with maximum ping-pong latency; this is in order to avoid misleading results on clusters of SMP machines. SKaMPI automatically selects such a pair based on measurements. Time is measured for one of the processes. Parameter variation is on message Length. There are currently 9 instances of the ping-pong pattern, corresponding to (some of) the possible combinations of blocking and non-blocking communication calls under different modalities.

The collective pattern measures operations that are collective in the sense that all processors play a symmetric role. Processors are synchronized with MPI_Barrier. Execution time is measured on process 0 (the root), and the running time of the barrier synchronization is subtracted. Parameter variation can be either on message Length or on number of Nodes. There are instances of the collective pattern for all collective MPI communication operations, for collective bookkeeping operations like MPI_Comm_split, and for some collectives implemented with point-to-point communication (e.g. gather functions). There is also an instance of the collective pattern for measuring the performance of the memcpy function. This can be used to compare memory bandwidth with communication performance.

Master-worker pattern: Certain performance-relevant aspects, such as the contention arising when one processor simultaneously communicates with several other processors, cannot be captured by the ping-pong pattern. To compensate for this, a master-worker pattern is introduced.
A master process partitions a problem into smaller chunks and dispatches them to several worker processes. These workers send their results back to the master which assembles them into a complete solution. Time is measured at the master process. Variation on message Length, number of Chunks, and number of Nodes is possible. Currently 7 instances of this pattern are implemented.
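The maximum-latency pair selection described for the ping-pong pattern can be pictured with a small sketch (our own toy model, not SKaMPI code; measure_latency is an invented stand-in for a real timed ping-pong, here modeling two 4-way SMP nodes):

```python
from itertools import combinations

def measure_latency(a, b):
    """Stand-in for a timed ping-pong between processes a and b.
    In this toy model, processes 0-3 share an SMP node, as do 4-7;
    intra-node messages are fast, cross-node messages are slow."""
    return 1.0 if a // 4 == b // 4 else 10.0

def select_pingpong_pair(nprocs):
    # Pick the pair with MAXIMUM latency, so that fast intra-node
    # shortcuts on SMP clusters do not yield misleading results.
    return max(combinations(range(nprocs), 2),
               key=lambda pair: measure_latency(*pair))

pair = select_pingpong_pair(8)  # a cross-node pair
```

Picking the slowest pair deliberately reports the pessimistic, cluster-wide latency rather than the optimistic shared-memory one.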
The simple pattern measures MPI operations with local completion semantics, such as MPI_Wtime, MPI_Comm_rank, and an unsuccessful MPI_Iprobe. No parameter variation is possible.

3.2. User-defined datatypes

MPI user-defined datatypes are a mechanism for describing the layout of user data in memory to the MPI implementation. All MPI communication operations can operate with complex data described by a user-defined datatype. Communication performance is usually dependent on the datatype, and the extent to which different MPI implementations work well with user-defined datatypes is known to vary. To be able to assess the quality of the datatype handling, SKaMPI incorporates a set of datatype patterns that is orthogonal to the communication pattern instances.

In SKaMPI the data used in a measurement suite are structured according to either a base type or a datatype pattern over a base type. Base types are defined in the preamble (@BASETYPE1, ...), and can be either a built-in MPI type (e.g. MPI_INT, MPI_CHAR, MPI_DOUBLE), or a simple structure given by a sequence of triples (c_i, o_i, t_i), each consisting of a repetition count, an offset, and a built-in MPI type. As the unit of communication either a base type or a type pattern over a base type is selected (Basetype_Number, Send_Datatype_Number, and Receive_Datatype_Number). SKaMPI contains a number of fixed type patterns, including instances of all MPI type constructors, as well as various nested types. Type patterns can be further customized in the preamble.

All type patterns are constructed to have the same size, i.e. to encompass the same amount of data. Therefore send and receive types can be chosen independently and can be different type patterns. This gives rich possibilities for gauging the handling of user-defined datatypes by a given MPI implementation. The datatype patterns are described in more detail in [26].
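The equal-size invariant of the type patterns can be illustrated with a hypothetical helper (TYPE_SIZE and pattern_size are our inventions, not SKaMPI identifiers; type sizes are typical values, not mandated by the MPI standard):

```python
# Illustrative sizes (in bytes) of a few built-in MPI types.
TYPE_SIZE = {"MPI_INT": 4, "MPI_CHAR": 1, "MPI_DOUBLE": 8}

def pattern_size(triples):
    """Total payload of a simple structure given as (count, offset, type)
    triples: repetition count, byte offset, built-in MPI type."""
    return sum(count * TYPE_SIZE[t] for count, _offset, t in triples)

# Two different layouts over the same base data ...
contiguous = [(4, 0, "MPI_INT")]
strided    = [(1, 0, "MPI_INT"), (1, 8, "MPI_INT"),
              (1, 16, "MPI_INT"), (1, 24, "MPI_INT")]
# ... encompass the same amount of data, so either can serve as the
# send or the receive type of a measurement.
assert pattern_size(contiguous) == pattern_size(strided) == 16
```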
4. Measurement mechanisms

We now describe SKaMPI's approach to efficiently measuring execution times to a given relative accuracy ε. The standard error bound is set in each suite by the Standard_error parameter.
4.1. A single parameter setting

Each SKaMPI result is eventually derived from multiple measurements of single calls to particular MPI functions, as determined by the pattern instances; e.g., in the corresponding ping-pong pattern instance, measurements of an MPI_Send followed by an MPI_Recv call for given message lengths. For each measurement, the number n of repetitions needed to achieve the required accuracy with minimum effort is determined individually.

We need to control both the systematic and the statistical error. A systematic error occurs due to the measurement overhead, including the call of MPI_Wtime. It is usually small and can be corrected by subtracting the time for an empty measurement. Additionally, the user can choose to "warm up" the cache by setting the number of dummy @CACHEWARMUP calls to the MPI functions before actual measuring is started.

Individual measurements are repeated in order to control three sources of statistical error: finite clock resolution, execution time fluctuations from various sources, and outliers. The total time for all repetitions must be at least MPI_Wtick()/ε in order to adapt to the finite resolution of the clock. Execution time fluctuations are controlled by monitoring the standard error s_x̄ := s/√n, where n is the number of measurements, s = √((1/n) Σ_{i=1..n} (x_i − x̄)²) is the measured standard deviation, and x̄ = (1/n) Σ_{i=1..n} x_i is the average execution time. The repetition is stopped as soon as s_x̄/x̄ < ε. Additionally, upper and lower bounds on the number of repetitions are imposed (Min_Repetition and Max_Repetition).

Under some operating conditions one will observe huge outliers due to external delays such as operating system interrupts or other jobs. These can make x̄ highly inaccurate. Therefore, we ignore the slowest and fastest run times before computing the average, as determined by Cut_Quantile. Note that we cannot just use the median of the measured values, because its accuracy is limited by the resolution of the clock.
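The repetition control can be condensed into a short Python sketch (our own rendering with invented names; SKaMPI itself is written in C and additionally enforces the MPI_Wtick-based lower bound on total measurement time):

```python
import statistics

def measure(run_once, eps, min_rep=8, max_rep=1000, cut=0.1):
    """Repeat a timed experiment until the standard error of the mean,
    s/sqrt(n), drops below eps relative to the mean, within
    [min_rep, max_rep] repetitions. A fraction `cut` of the slowest
    and fastest times is dropped before averaging (cf. Cut_Quantile)."""
    times = []
    while len(times) < max_rep:
        times.append(run_once())
        n = len(times)
        if n >= min_rep:
            mean = statistics.mean(times)
            s = statistics.pstdev(times)      # sqrt(sum((x - mean)^2) / n)
            if s / (n ** 0.5) < eps * mean:   # s_xbar / xbar < eps
                break
    times.sort()
    k = int(len(times) * cut)                 # trim outliers at both ends
    trimmed = times[k:len(times) - k] if k else times
    return statistics.mean(trimmed)
```

For a perfectly stable timer the loop stops at min_rep repetitions; noisy timings keep it running until the relative standard error falls below eps or max_rep is hit.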
4.2. Adaptive parameter refinement

In general we would like to know the behavior of a communication routine over a range of possible values for the message length m and the number P of processors involved. SKaMPI varies only one of these parameters at a time. Two-dimensional measurements must be written as an explicit sequence of one-dimensional measurements.
Fig. 3. Deciding on the refinement of a segment (m_b, t_b) - (m_c, t_c).
Let us focus on the case where we want to find the execution time t_P(m) for a fixed P and message lengths in [m_min, m_max]. First, we measure at m_max and at m_min·γ^k for all k such that m_min·γ^k < m_max; on a logarithmic scale these values are equidistant. Now the idea is to adaptively subdivide those segments where a linear interpolation would be most inaccurate. Since nonlinear behavior of t_P(m) between two measurements can be overlooked, the initial stepwidth γ should not be too large (γ = √2 or γ = 2 are typical values).

Figure 3 shows a line segment between measured points (m_b, t_b) and (m_c, t_c) and its two surrounding segments. Either of the surrounding segments can be extrapolated to "predict" the opposite point of the middle segment. Let ε₁ and ε₂ denote the prediction errors. We use min(|ε₁|/t_b, |ε₂|/t_c, (m_c − m_b)/m_b) as an estimate for the error incurred by not subdividing the middle segment. The reason for the last term in the minimum is to avoid superfluous measurements near sharp jumps in running times, which often occur where an MPI implementation switches to a different communication protocol.

We keep all segments in a priority queue. If m_b and m_c are the abscissae of the segment with largest error, we subdivide it at √(m_b·m_c). We stop when the maximum error drops below ε or the upper bound on the number of measurements is exceeded. In the latter case, the priority queue ensures that the maximum error is minimized given the available computational resources.

4.3. Multiple runs

If a measurement run crashed, the user can simply start the benchmark again. SKaMPI will identify the measurement which caused the crash, try all suites not measured yet, and finally retry the suite which led to the crash. This process can be repeated.
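The adaptive refinement loop of Section 4.2 can be sketched in Python as follows (our own simplified rendering, not SKaMPI's implementation; for brevity the priority queue is replaced by a recomputed maximum, and t stands for a timed measurement):

```python
import math

def adaptive_refine(t, m_min, m_max, eps, gamma=math.sqrt(2), max_points=64):
    """Adaptively sample t(m) on [m_min, m_max]: start with a geometric
    grid of stepwidth gamma, then repeatedly bisect (geometrically) the
    segment whose linear interpolation is judged least accurate."""
    xs = []
    m = m_min
    while m < m_max:
        xs.append(m)
        m *= gamma
    xs.append(m_max)
    pts = sorted((x, t(x)) for x in xs)

    def seg_error(i):
        # Extrapolate each neighboring segment across the middle segment
        # (mb, tb)-(mc, tc) and compare with the measured values.
        (ma, ta), (mb, tb), (mc, tc), (md, td) = pts[i-1], pts[i], pts[i+1], pts[i+2]
        e1 = abs(tb + (tb - ta) * (mc - mb) / (mb - ma) - tc)   # predicts tc
        e2 = abs(tc - (td - tc) * (mc - mb) / (md - mc) - tb)   # predicts tb
        return min(e1 / tb, e2 / tc, (mc - mb) / mb)

    while len(pts) < max_points:
        # The segment with the largest error estimate (inner segments only).
        i, err = max(((i, seg_error(i)) for i in range(1, len(pts) - 2)),
                     key=lambda p: p[1])
        if err < eps:
            break
        mb, mc = pts[i][0], pts[i + 1][0]
        mid = math.sqrt(mb * mc)          # subdivide at the geometric mean
        pts.insert(i + 1, (mid, t(mid)))
    return pts
```

On a function that is linear in m the initial grid already interpolates perfectly and no segment is refined; jumps and knees in t(m) attract extra sample points.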
Fig. 4. The WWW interface for querying the result database.
If no crash occurred, all measurements are repeated, yielding another output file. Multiple output files can be fed to a post-processor which generates an output file containing the medians of the individual measurements. In this way remaining outliers can be filtered out which may have been caused by jobs competing for resources or by system interrupts taking exceptionally long.

4.4. Cache behavior

Whether data is sent from cache or not can make a large difference in communication performance. A benchmark that is used for program design should therefore state its assumptions regarding the cache content during communication. In SKaMPI the assumption is that code, MPI internal data, and user data will
as far as possible reside in cache. SKaMPI therefore provides the simple means of "warming up" the cache outlined above. The main reason for "warming up" is that it avoids the effect that the first single measurement takes much longer than later repetitions, which would increase the variance of the measured times. Communication of data outside the cache can be measured by choosing message lengths that are larger than the size of the cache. A different caching assumption is made by mpptest, which makes an effort to ensure that the user data sent is not in the cache [7].

5. The result database

The SKaMPI result database has two WWW user interfaces. One is for downloading detailed reports of the various runs on all machines. The other interface
Fig. 5. Performance of compound send-receive for the Hitachi SR 8000, Cray T3E and NEC SX-5.
enables interactive comparison of measurements for different pattern instances/machines. A snapshot of this interface is shown in Fig. 4. Querying the database takes three steps:

1. Choosing a run. The user selects one or more machine(s) of interest. On some machines several runs may have been performed, for example with different numbers of MPI processes, and it is possible to choose among these. Selection of runs is done in the upper left part of the browser window.
2. Choosing the suites. After selecting the runs, the database is queried for the suites of these runs. The available suites for each selected run are presented in the upper right part of the browser window. For each run the suites of interest are selected; results from different runs can be combined.
3. Finally, the database is queried for the selected measurements, and a single plot for all selected suites is created. The plot is shown in the lower half of the browser window, and can also be downloaded as an encapsulated postscript file. It is also possible to zoom into the plot.

The detailed design of the database is described in [11].
6. Examples

The public database currently has results for the default set of suites for the Fujitsu VPP 300, Hitachi SR 8000, IBM RS 6000, NEC SX-5, SGI Origin 2000, and Cray T3E. All results are supplied by users who have run SKaMPI on their machine. We give three examples of the use of the benchmark database.

In Fig. 5 the performance of the MPI_Sendrecv primitive as measured with ping-pong pattern instance 8 on the Hitachi SR 8000, the Cray T3E, and the NEC SX-5 is given. The suite varies on message length, and shows many cases of adaptive parameter refinement (high concentration of measurements). The shapes of the curves are sufficiently similar that MPI_Sendrecv will probably not pose problems when porting applications among these machines.

In Fig. 6 we compare four different ping-pong pattern instances on the Hitachi SR 8000, namely blocking send-receive, non-blocking send or receive, and compound send-receive. It is worth noting that the compound send-receive is clearly better than the other alternatives, by about a factor of 2 for short messages up to 1 KByte. For this machine it thus seems advisable to use MPI_Sendrecv wherever possible. Similar, or even more complicated, pictures appear for the other machines; in particular, the advice to always use MPI_Sendrecv is not universally true. The user may
Fig. 6. Performance of different combinations of blocking and non-blocking send and receive operations on the Hitachi SR 8000.
Fig. 7. Two implementations of a gather operation on the IBM RS 6000 SP. Variation over number of processors, message length 256 bytes. Measurements performed on the machine installed in Karlsruhe, February 2000.
gain performance on a particular machine by looking more closely in this direction. Our last example investigates the implementation of the MPI_Gather collective on the IBM RS 6000 SP [24,28]. SKaMPI has a corresponding collective
pattern instance, as well as two pattern instances which measure "hand-written" implementations of the gather functionality using point-to-point communication (see Fig. 2). Figure 7 compares the vendor-implemented MPI_Gather operation to an implementation with
blocking send and receive operations for fixed, short messages of 256 bytes, varied over number of processors. For short messages the hand-written, naive implementation performs significantly better than the MPI_Gather implementation of the vendor MPI library. Findings like this may discourage users from relying on the MPI collectives, but should rather incite vendors to improve their library.
7. Future extensions

Although many aspects of MPI are covered by SKaMPI, there is still room for improvement in various directions. Of immediate concern are more collective pattern instances, e.g. for ring communication (as in the effective bandwidth benchmark [4]), for "bisection bandwidth", where half the processes simultaneously communicate with the other half, and for more alternative implementations of collective operations, either in terms of point-to-point communication or in terms of other MPI collectives.

The benchmarking of the "irregular" (or vector) variants of MPI collectives like MPI_Alltoallv is rather rudimentary; in particular, there is no provision for individually varying the amount of data communicated between each pair of processes. In future versions of SKaMPI it should be made possible to select among different distributions and to vary more flexibly over message lengths. Also, more accurate measurement mechanisms for collective operations should be considered, see [3].

Another natural extension is towards MPI-2 functionality [6]. Particularly relevant, and also easy to incorporate in the existing patterns, is one-sided communication, but perhaps I/O should also be covered by the benchmark; for some thoughts on MPI-IO benchmarking, see [12,17,20,21]. More flexible control over cache utilization may be required for realistic I/O benchmarking.
8. Summary

In the absence of an analytical performance model for MPI, accurate, reliable, and realistic benchmark data are necessary to guide the development and tuning of application programs. We described the SKaMPI benchmark, which performs such detailed benchmarking of a large fraction of MPI, both in isolation and in more complex (master-worker) patterns. Not covered
are the construction of datatypes and the construction and use of user-defined topologies. Complemented by application kernels benchmark, SKaMPl can give a realistic picture of the performance of a given MPI implementation on a given machine. The SKaMPl project is distinguished from other MPI benchmarks by a directed effort to collect results from a standard run into a public performance database. This public database can be a powerful aid to users who want to port their application to a machine to which they do not yet have access. The most obvious problem with maintaining a public result database is that results get obsolete; this is especially so since benchmark results are supplied voluntarily by users. To keep up with the technical progress in MPI implementations and changes of machine details, support by vendors would be welcome. Acknowledgments We would like to thank Lutz Prechelt for contributions to the basic design and to the original SKaMPIconference paper [25], Matthias Muller for the design of the web pages, Gunnar Hunzelmann for implementing the result database and enhancing the reportgenerator significantly, Thomas Worsch and Werner Augustin for their current redesign of collective measurements, and Roland Vollmar for his ongoing administrative and financial support; all of the University of Karlsruhe, Germany. Furthermore, we would like to thank the following people for their extensive comments, detailed suggestions and submitted measurements: Ed Benson (DEC, USA), Julien Bourgeois (Univ. de Besancon, France), Bronis R. de Supinski (Lawrence Livermore National Laboratory, USA). Dieter Kranzlmuller (Univ. of Linz, Austria), John May (Lawrence Livermore National Laboratory, USA). Hermann Mierendorff (FhG-GMD, Germany), JeanPhilippe Proux (CNRS/IDRIS, France), Rolf Rabenseifner (HLRS Stuttgart, Germany). Umesh Kumar V. Rajasekaran (Univ. of Cincinnaiti, USA). Christian Schaubschlager (Univ. 
of Linz, Austria), and Scott Taylor (Lawrence Livermore National Laboratory, USA). References [1]
D. Bailey, E. Barszcz, J. Barton, D. Browning, R. Carter, L. Dagum, R. Fatoohi, S. Fineberg, P. Frederickson, T. Lasinski, R. Schreiber, H. Simon, V. Venkatakrishnan and S. Weeratunga, The NAS parallel benchmarks, Report RNR-94-007, NASA Ames Research Center, 1994.
R. Reussner et al. / SKaMPI: a comprehensive benchmark for public benchmarking of MPI

[2] D. Bailey, T. Harris, W. Saphir, R. van der Wijngaart, A. Woo and M. Yarrow, The NAS parallel benchmarks 2.0, Report NAS-95-020, Numerical Aerodynamic Simulation Facility, NASA Ames Research Center, 1995.
[3] B.R. de Supinski and N.T. Karonis, Accurately measuring MPI broadcasts in a computational grid, Proc. 8th IEEE Symp. on High Performance Distributed Computing (HPDC-8), 2000.
[4] Effective Bandwidth Benchmark, http://www.hlrs.de/organization/par/services/models/mpi/b.eff/.
[5] V. Getov, E. Hernandez and T. Hey, Message-passing performance of parallel computers, Euro-Par'97 Parallel Processing, volume 1300 of Lecture Notes in Computer Science, 1997, pp. 1009-1016.
[6] W. Gropp, S. Huss-Lederman, A. Lumsdaine, E. Lusk, B. Nitzberg, W. Saphir and M. Snir, MPI - The Complete Reference, volume 2, The MPI Extensions, MIT Press, 1998.
[7] W. Gropp and E. Lusk, Reproducible measurements of MPI performance characteristics, Recent Advances in Parallel Virtual Machine and Message Passing Interface, 6th European PVM/MPI Users' Group Meeting, volume 1697 of Lecture Notes in Computer Science, 1999, pp. 11-18.
[8] W. Gropp, E. Lusk, N. Doss and A. Skjellum, A high-performance, portable implementation of the MPI message passing interface standard, Parallel Computing 22(6) (1996), 789-828.
[9] W.D. Gropp, E. Lusk and D. Swider, Improving the performance of MPI derived datatypes, Third MPI Developer's and User's Conference (MPIDC'99), 1999, pp. 25-30.
[10] R. Hempel, Basic message passing benchmarks, methodology and pitfalls, SPEC Workshop on Benchmarking Parallel and High-Performance Computing Systems, Wuppertal, Germany, http://www.hks.de/mpi/b-eff/hempel.wuppertal.ppt, 1999.
[11] G. Hunzelmann, Entwurf und Realisierung einer Datenbank zur Speicherung von Leistungsdaten paralleler Rechner, Studienarbeit, Department of Informatics, University of Karlsruhe, Am Fasanengarten 5, D-76128 Karlsruhe, Germany, 1999.
[12] D. Lancaster, C. Addison and T. Oliver, A parallel I/O test suite, Recent Advances in Parallel Virtual Machine and Message Passing Interface, 5th European PVM/MPI Users' Group Meeting, volume 1497 of Lecture Notes in Computer Science, 1998, pp. 36-43.
[13] LLCBench Home Page, http://icl.cs.utk.edu/projects/llcbench/.
[14] H. Mierendorff, K. Cassirer and H. Schwamborn, Working with MPI benchmarking suites on ccNUMA architectures, Recent Advances in Parallel Virtual Machine and Message Passing Interface, 7th European PVM/MPI Users' Group Meeting, volume 1908 of Lecture Notes in Computer Science, 2000, pp. 18-26.
[15] The MPI Forum, http://www.mpi-forum.org/.
[16] MPICH - A Portable Implementation of MPI, http://www.mcs.anl.gov/Projects/mpi/mpich/.
[17] The Pallas MPI Benchmark, http://www.pallas.de/pages/pmbd.htm.
[18] PARKBENCH Committee and R. Hockney (chair), Public international benchmarks for parallel computers, Scientific Programming 3(2) (1994), iii-126.
[19] J. Piernas, A. Flores and J.M. Garcia, Analyzing the performance of MPI in a cluster of workstations based on Fast Ethernet, Recent Advances in Parallel Virtual Machine and Message Passing Interface, 4th European PVM/MPI Users' Group Meeting, volume 1332 of Lecture Notes in Computer Science, 1997, pp. 17-24.
[20] R. Rabenseifner and A.E. Koniges, Effective file-I/O bandwidth benchmark, Euro-Par 2000 Parallel Processing, volume 1900 of Lecture Notes in Computer Science, 2000, pp. 1273-1283.
[21] R. Rabenseifner and A.E. Koniges, Effective communication and file-I/O bandwidth benchmarks, Recent Advances in Parallel Virtual Machine and Message Passing Interface, 8th European PVM/MPI Users' Group Meeting, volume 2131 of Lecture Notes in Computer Science, 2001, pp. 24-35.
[22] M. Resch, H. Berger and T. Boenisch, A comparison of MPI performance on different MPPs, Recent Advances in Parallel Virtual Machine and Message Passing Interface, 4th European PVM/MPI Users' Group Meeting, volume 1332 of Lecture Notes in Computer Science, 1997, pp. 25-32.
[23] R. Reussner, Portable Leistungsmessung des Message Passing Interfaces, Diplomarbeit, Department of Informatics, University of Karlsruhe, Am Fasanengarten 5, D-76128 Karlsruhe, Germany, 1997.
[24] R. Reussner and G. Hunzelmann, Achieving performance portability with SKaMPI for high-performance MPI programs, Computational Science - ICCS 2001, Proc. of ICCS 2001, International Conference on Computational Science, Part II, Special Session on Tools and Environments for Parallel and Distributed Programming, volume 2074 of Lecture Notes in Computer Science, 2001, pp. 841-850.
[25] R. Reussner, P. Sanders, L. Prechelt and M. Müller, SKaMPI: A detailed, accurate MPI benchmark, Recent Advances in Parallel Virtual Machine and Message Passing Interface, 5th European PVM/MPI Users' Group Meeting, volume 1497 of Lecture Notes in Computer Science, 1998, pp. 52-59.
[26] R. Reussner, J.L. Träff and G. Hunzelmann, A benchmark for MPI derived datatypes, Recent Advances in Parallel Virtual Machine and Message Passing Interface, 7th European PVM/MPI Users' Group Meeting, volume 1908 of Lecture Notes in Computer Science, 2000, pp. 10-17.
[27] R.H. Reussner, SKaMPI: The Special Karlsruher MPI-benchmark - user manual, Technical Report 02/99, Department of Informatics, University of Karlsruhe, Am Fasanengarten 5, D-76128 Karlsruhe, Germany, 1999.
[28] R.H. Reussner, Recent Advances in SKaMPI, in: High Performance Computing in Science and Engineering 2000, E. Krause and W. Jäger, eds, Transactions of the High Performance Computing Center Stuttgart (HLRS), Springer-Verlag, 2001, pp. 520-530.
[29] M. Snir, S. Otto, S. Huss-Lederman, D. Walker and J. Dongarra, MPI - The Complete Reference, volume 1, The MPI Core, MIT Press, second edition, 1998.
[30] J.L. Träff, R. Hempel, H. Ritzdorf and F. Zimmermann, Flattening on the fly: efficient handling of MPI derived datatypes, Recent Advances in Parallel Virtual Machine and Message Passing Interface, 6th European PVM/MPI Users' Group Meeting, volume 1697 of Lecture Notes in Computer Science, 1999, pp. 109-116.
Scientific Programming 10 (2002) 67-74 IOS Press
Efficiently building on-line tools for distributed heterogeneous environments
Günther Rackl*, Thomas Ludwig, Markus Lindermeier and Alexandros Stamatakis
Lehrstuhl für Rechnertechnik und Rechnerorganisation (LRR), Institut für Informatik, Technische Universität München (TUM), 80290 München, Germany
E-mail: {rackl, ludwig, lnderme, stamatak}@in.tum.de
Abstract: Software development is becoming more and more complex, especially within distributed middleware-based environments. A major drawback during the overall software development process is the lack of on-line tools, i.e. tools applied as soon as there is a running prototype of an application. The MIMO Middleware MOnitor provides a solution to this problem by implementing a framework for the efficient development of on-line tools. This paper presents a methodology for developing on-line tools with MIMO. As an example scenario, we choose a distributed medical image reconstruction application, which represents a test case with high performance requirements. Our distributed, CORBA-based application is instrumented so that it can be observed with MIMO and related tools. Additionally, load balancing mechanisms are integrated for further performance improvements. As a result, we obtain an integrated tool environment for observing and steering the image reconstruction application. Using our rapid tool development process, the integration of on-line tools proves to be very convenient and enables efficient tool deployment.
1. Introduction

Supporting the software development process within complex distributed environments is still a problem due to the lack of on-line tool infrastructures. In particular, application-oriented approaches for developing integrated and tool-supported applications are rare. In this paper, we present the MIMO approach, which proposes an infrastructure and methodology for developing on-line tools for distributed heterogeneous environments. As the MIMO approach is application-oriented, we illustrate it by means of a real-world application scenario: we show how to integrate on-line tool functionality into a medical image-processing application. The characteristics of this application are the high performance requirements that make it necessary to build a parallel and distributed version in order to limit processing times. Furthermore, an automatic load

* Present address: Günther Rackl, Holzstraße 20, 80469 München, Germany. E-mail:
[email protected].
ISSN 1058-9244/02/$8.00 © 2002 - IOS Press. All rights reserved
balancer is applied to the application for performance reasons, so that the application can be executed in parallel on a cluster of distributed workstations. As the on-line tool support should not be limited to the pure development of applications, we provide a tool environment that supports both the development and the subsequent deployment of the application. During development, observation of the distributed application can be used for debugging and performance tuning, while during deployment the steering facilities can be used for maintenance tasks. For example, for management purposes, all computation objects might have to be migrated away from a specific node, which can easily be carried out using our tool environment without interfering with the running computation. In the following, we first introduce the components participating in our environment: the medical image-processing application, the load balancer, and the MIMO system. Subsequently, we describe the composition of these components into an integrated, tool-supported real-world application. The evaluation with a genuine test case proves the applicability of our approach. Altogether, this paper therefore contributes to an enhanced tool development and usage process for complex distributed applications. The MIMO system provides the basis for this approach, while the tool development methodology proves to be an appropriate procedure for its efficient deployment.
2. The medical image-processing application

We explore our concepts by means of the parallel medical image-processing application described in [1,2]. In this application, a realignment process forms part of the Statistical Parametric Mapping (SPM) application developed by the Wellcome Department of Cognitive Neurology [3]. SPM is used for processing and analysing tomographic image sequences, as obtained for example by functional Magnetic Resonance Imaging (fMRI) or Positron Emission Tomography (PET). Such image sequences are used in the field of neuroscience for the analysis of activity in different regions of the human brain during cognitive and motor exercises. Realignment is a cost-intensive computation performed during the preparation of raw image data for the subsequent statistical evaluation. It computes a 4 x 4 transformation matrix for each image of the sequence, to compensate for the effect of small movements of the patient, caused e.g. by breathing. The images are realigned relative to the first image of the sequence. The realignment algorithm for image sequences as obtained by fMRI is briefly presented below. One has to distinguish two cases:

1. Realignment of one sequence of images: The reference data set and the first matrix are obtained by performing a number of preparatory computations on the image data of the first image. The matrices for all remaining images are calculated using the reference data set.

2. Realignment of multiple sequences of images: The reference data set and the first matrix of the first sequence are calculated. Thereafter, the first images of all remaining sequences are realigned relative to the first image of the first sequence and its reference data set. Finally, the realignment algorithm as described in the first case is applied to all sequences independently.

At this point the only precondition for the calculation of the transformation matrix is the availability of
the reference data set, which is calculated only once for each sequence. Once the reference data set(s) are available, the matrices of the sequence(s) can be computed independently. As the realignment process has high performance requirements, it is parallelised using Java and CORBA as the programming language and communication platform. Computational parts written in C++ are integrated using the Java Native Interface (JNI). The following sections describe the load management system used to tune the performance of the realignment application, and the monitoring approach used to observe and steer the running application.
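The two realignment cases can be summarised in a short control-flow sketch. The helpers `prepare_reference` and `realign` are placeholders for SPM's actual computations; the point is only that once a sequence's reference data exists, the per-image matrices can be computed independently:

```python
def prepare_reference(first_image):
    # Placeholder for the preparatory computations on the first image.
    return {"ref_of": first_image}

def realign(image, reference):
    # Placeholder: would return the 4x4 transformation matrix.
    return ("matrix", image, reference["ref_of"])

def realign_sequences(sequences):
    """Case 2 generalises case 1: build reference data per sequence
    (the first sequence anchors all later ones), then realign the
    remaining images of every sequence independently."""
    references = [prepare_reference(sequences[0][0])]
    for seq in sequences[1:]:
        # First image of each later sequence is realigned against
        # sequence 0; then the sequence gets its own reference data.
        realign(seq[0], references[0])
        references.append(prepare_reference(seq[0]))
    # Embarrassingly parallel part: one matrix per remaining image.
    return [[realign(img, ref) for img in seq[1:]]
            for seq, ref in zip(sequences, references)]

matrices = realign_sequences([["a0", "a1", "a2"], ["b0", "b1"]])
print(len(matrices[0]), len(matrices[1]))  # 2 1
```

The inner list comprehension marks exactly the work that the CORBA-based parallelisation distributes across server objects.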
3. The load management system

In order to improve the performance of the parallel realignment application, the load management approach described in [4] is applied. In general, load management systems can be split into three components: load monitoring, load distribution, and load evaluation. They fulfil different tasks at different abstraction levels, which eases the design and implementation of the overall system. Figure 1 shows the components of a load management system and a runtime environment containing application objects. The load monitoring component provides both information on available computing resources and their utilisation, and information on application objects and their resource usage. This information has to be provided dynamically, i.e. at runtime, in order to obtain knowledge about the current state of the runtime environment. Load distribution provides the functionality for distributing workload. The load distribution mechanisms for system-level load management are initial placement, migration, and replication. Initial placement means creating an object on a host that has enough computing resources to execute it efficiently. Migration means moving an existing object to another host that promises more efficient execution. As migration is applied to existing objects, the object state has to be considered: the object's communication has to be stopped, its state has to be transferred to the new object, and all communication has to be redirected to the new object.
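The three migration steps just described (stop communication, transfer state, redirect) can be sketched as follows; this is our own illustrative code, not the paper's CORBA implementation:

```python
class Servant:
    """Toy application object with the state a migration must preserve."""
    def __init__(self, name, host):
        self.name, self.host = name, host
        self.state = {}
        self.accepting = True

def migrate(obj, registry, dest_host):
    """Migration protocol sketch:
    1. stop the object's communication,
    2. transfer its state to a new instance on the destination host,
    3. redirect all further requests to the new object."""
    obj.accepting = False                  # 1. stop communication
    new_obj = Servant(obj.name, dest_host)
    new_obj.state = obj.state              # 2. transfer the state
    registry[obj.name] = new_obj           # 3. redirect requests
    return new_obj

registry = {"s1": Servant("s1", "hostA")}
registry["s1"].state["images"] = 3
moved = migrate(registry["s1"], registry, "hostB")
print(moved.host, moved.state)  # hostB {'images': 3}
```

In the real system, step 3 is performed by the CORBA Location Forward mechanism described later in this section rather than by a shared registry.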
Fig. 1. The components of a load management system.
Replication is similar to migration, but the original object is not removed, so that identical objects called replicas are created. Further requests to the object are divided among its replicas in order to distribute the workload (requests) among them. Replication is restricted to replication-safe objects, i.e. objects that can be replicated without applying a consistency protocol to the replicas. Finally, the load evaluation component makes decisions about load distribution based on the information provided by load monitoring. The decisions can be reached by a variety of strategies, all of which aim to improve the overall performance of the distributed application by compensating load imbalance. There are two main reasons for load imbalance in distributed systems. First, background load can substantially decrease the performance of a distributed application. Second, request overload, caused by too many simultaneously requesting clients, increases the request processing time and thus decreases the performance of the overall application. Both sources of load imbalance have to be considered by a load manager. Distributed object-oriented environments like CORBA [5] or DCOM [6] are based on some kind of object model. CORBA objects are connected to the middleware by the POA (Portable Object Adapter). The object adapter provides functionality for creating and destroying objects, and for assigning requests to them. The POA is configured by the developer via policies. The ORB (Object Request Broker) provides functionality for creating object adapters and for request handling. A request to an object arrives at the ORB, which transmits it to the appropriate POA. Subsequently, the object adapter starts the processing of the request by an implementation of the object (servant).
Because we decided on a system-level implementation, the load management functionality, especially load monitoring and load distribution, has to be integrated into the ORB and the POA. Therefore, we added
some policies and interfaces to the POA in order to enable state transfer and the creation of replicas. The migration and replication of objects is realised using new policies that determine the creation of objects by means of factories, and a persistence policy that allows the state of an object to be migrated. Request redirection is performed by the CORBA Location Forward mechanism [7]. It enables object references to be handed over to clients by raising a ForwardRequest exception; the client runtime transparently reconnects to the forwarded reference. This guarantees migration and replication transparency.
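The overall decision logic of such a load evaluation component might look as follows. The thresholds and metric names are invented for illustration; the actual strategies of [4] may differ:

```python
def evaluate_load(obj, hosts, idle_host):
    """Toy load-evaluation step for one object:
    - request overload on a replication-safe object -> replicate,
    - background load on the object's host          -> migrate,
    - otherwise leave the object where it is."""
    if obj["pending_requests"] > 10 and obj["replication_safe"]:
        return ("replicate", idle_host)
    if hosts[obj["host"]]["background_load"] > 0.8:
        return ("migrate", idle_host)
    return ("stay", obj["host"])

hosts = {"h1": {"background_load": 0.1}, "h2": {"background_load": 0.9}}
overloaded = {"host": "h1", "pending_requests": 25, "replication_safe": True}
disturbed = {"host": "h2", "pending_requests": 3, "replication_safe": True}
print(evaluate_load(overloaded, hosts, "h3"))  # ('replicate', 'h3')
print(evaluate_load(disturbed, hosts, "h3"))   # ('migrate', 'h3')
```

This is exactly the distinction the evaluation in Section 6 exercises: request overload triggers replication, background load triggers migration.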
4. MIMO

This section introduces the MIMO Middleware MOnitoring system, an infrastructure for monitoring and managing distributed, heterogeneous middleware [8]. To handle heterogeneity, MIMO is based on a multi-layer monitoring approach which classifies collected information using several abstraction levels and therefore serves as a foundation for integrating diverse middleware. In order to enable rapid and flexible tool design and construction, MIMO's structure relies on a three-tier model separating tools, the monitoring system, and the instrumented application through generic interfaces. These interfaces make it possible to use the core MIMO infrastructure for building tools and application instrumentation in a fashion appropriate for the monitored middleware. Finally, to build GUI tools conveniently, the component-based MIVIS tool framework can be used to integrate new tool functionality easily by means of JavaBeans.

4.1. Multi-layer monitoring

Figure 2 shows an illustration of a typical distributed middleware environment as we consider it.

Fig. 2. Layer model of the distributed environment.

Fig. 3. 3-Tier model of the monitoring architecture.

The observed system consists of six abstraction layers from which the monitor collects information and provides it to tools. The highest abstraction level within the system is the application level. Here, only complete applications are of interest for the monitoring system. Within an application, the whole functionality exported by the components is described by interfaces. These interfaces are defined in an abstract way in the interface layer. The behaviour described by these interfaces is implemented by objects within the distributed object layer. These objects may still be considered abstract entities residing in a global object space. In order to enable communication between the distributed objects, some type of middleware is required; in particular, a mechanism to define and uniquely identify objects within the object space is needed. As objects on the distributed object level are still abstract entities, they need to be implemented in a concrete programming language. This implementation of the objects is considered in the subsequent implementation layer. Finally, the implementation objects are executed within a run-time environment, which can be an operating system or a virtual machine on top of an operating system, executed by the underlying hardware nodes. For various middleware platforms or applications, this abstract model can be mapped to concrete entity types related to the respective environments. We will show the mapping of the realignment application to the MLM model in the following section.

4.2. MIMO design and architecture

The MIMO Middleware MOnitor provides a framework for online monitoring and management tools
which is compliant with the multi-layer monitoring approach. The fundamental architecture relies on the separation of the tools from the monitoring system and the observed applications [9]. Figure 3 illustrates the resulting 3-tier model: tools make use of MIMO by means of a tool-monitor interface, while MIMO collects information from the monitored applications by means of intruders or adapters which communicate with MIMO through an intruder-monitor interface. The difference between intruders and adapters is that intruders are transparently integrated into the application, while adapters might be built by inserting code into the application. Tools interact with the monitor system by means of a standardised tool-monitor interface, while instrumented components make use of a generic intruder-monitor interface that allows any kind of event-based information to be exchanged.

4.3. MIVIS tool framework

MIVIS (MImo VISualizer) is a general-purpose framework for GUI-based tools. It contains the basic visualisation functionality needed for sophisticated observation of any middleware, and allows for easy extension with other advanced tool functionality. The generic MIVIS framework takes care of the interaction with MIMO and presents the data it receives to the user in a convenient way. To fulfil the requirement of uncomplicated extensibility, the visualisation tool is split into a main program and several JavaBeans software components. The main program takes care of the communication with MIMO and the processing of the data, while the JavaBeans do the graphical display. All JavaBeans are discovered by MIVIS at startup time and dynamically integrated into the GUI. If a different type of display is needed, a user can program that display type in Java and turn it into a JavaBean. This component is placed into a specific directory so that MIVIS
can find and use it. The main program does not have to be changed at all; the only requirement is that the JavaBean implements a minimal interface that enables the main program to communicate with the bean. The bean-specific properties can be set by the user. MIVIS learns about these properties by means of the introspection mechanism and provides editors to change their settings. Additional editors for properties of a special data type can be placed inside the JavaBean and used instead of the standard editors. Hence, this approach offers a very dynamic and flexible way to configure the behaviour of various display types. More details about MIVIS can be found in [10]. In the following, we make use of the MIMO capabilities to monitor and steer our load-balanced realignment application.
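The bean-discovery mechanism just described follows a common plugin pattern: the main program only knows a minimal interface, and anything implementing it gets picked up at startup. A Python sketch of the idea (MIVIS itself is written in Java and uses JavaBeans introspection; the names below are our own):

```python
class DisplayBean:
    """Minimal interface every display component must implement,
    analogous to the JavaBean contract described above."""
    def render(self, event):
        raise NotImplementedError

class CounterDisplay(DisplayBean):
    """A user-written display: shows an event plus a running counter."""
    def __init__(self):
        self.count = 0
    def render(self, event):
        self.count += 1
        return f"{event} (#{self.count})"

def discover_displays(namespace):
    """Discover display components at startup: anything in the given
    namespace that subclasses DisplayBean is instantiated, so the main
    program never changes when a new display type is added."""
    return [cls() for cls in namespace.values()
            if isinstance(cls, type)
            and issubclass(cls, DisplayBean)
            and cls is not DisplayBean]

displays = discover_displays(dict(globals()))
print([d.render("compute() called") for d in displays])
```

In MIVIS the "namespace" is a specific directory scanned for bean classes, and property editors are generated via introspection rather than coded by hand.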
5. Integrating the load-balanced realignment environment

In this section, we describe how to monitor and steer the load-balanced realignment application. We begin with a general overview of the tool development process proposed by MIMO, before showing the integration of the realignment application. As a result, we obtain a visualisation tool showing the activities of the realignment application, and a steering facility that allows objects to be migrated or replicated using interactive drag-and-drop mechanisms.

5.1. Tool development with MIMO

To enable monitoring of a new middleware with MIMO, a general methodology consisting of three major steps exists:

- Define relevant middleware entities and map them to the MLM model: The first step is to define entities within the middleware which are relevant for being monitored with MIMO. These can be either application-specific entities like business objects, or middleware-specific entities like, e.g., CORBA objects. The choice of these entities depends on the focus of interest and strongly influences the further activities. After defining the relevant entities, they need to be mapped to MIMO's multi-layer monitoring model described before. Here, a certain degree of freedom exists and can be exploited for the respective goals. The result is a middleware-specific layer model with a mapping to the general MIMO MLM.
- Define relevant events: After defining the entities, the relevant events that may occur within the middleware or application have to be determined. For example, these events may include the generation or deletion of entities, or interactions between entities. As before, the definition of relevant events highly depends on the monitoring goals and can be completely user-defined. In any case, the result is a list of event names and the corresponding parameters that have to be passed with them. It is also possible to pass events from the application to the tools, as well as to pass commands back from tools to the intruders or adapters residing in the application in order to manipulate the running application.
- Implement instrumentation code and tools: The last step in our methodology is the actual implementation of intruders/adapters and the tools based on the previous entity and event definitions. Here, MIMO serves as a common monitoring framework, and the MIMO core can be used as an intelligent communication infrastructure between tools and intruders/adapters. Moreover, the MIVIS tool framework can be used for easy development of GUI tools.

With this procedure, new platforms can easily be integrated into the MIMO system by following a fixed set of rules. This general approach therefore enables rapid and easy tool development which is highly application- and middleware-oriented, such that developers can concentrate on tool development without worrying about general monitoring issues. The steps of the tool development methodology are summarised in Fig. 4.

5.2. Integrating the realignment application

Based on the general tool development process, we now show the integration of the realignment application into the tool environment. Figure 5 depicts the structure of the realignment application. The service offered by the server object is the compute() service, which calculates the transformation matrix for an image.
The state of a server object consists of a reference data queue (cache). It is therefore replication-safe, since it can be replicated without applying a consistency protocol to its replicas, i.e. the required cache data can easily be re-established. A getReferenceData() service is offered by each client and provides the specific reference data to the server if it is not already cached.
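The servant/client interaction just described can be sketched as follows; the Python classes are illustrative stand-ins for the CORBA objects, with method names mirroring compute() and getReferenceData():

```python
class Client:
    def __init__(self, sequence_id, reference_data):
        self.sequence_id = sequence_id
        self._reference_data = reference_data
    def get_reference_data(self):
        # Offered by each client; the server calls back on a cache miss.
        return self._reference_data

class ComputeServer:
    """Replication-safe servant: its only state is a reference-data
    cache, which a fresh replica re-establishes by calling back to
    the clients, so no consistency protocol is needed."""
    def __init__(self):
        self.cache = {}
    def compute(self, client, image):
        ref = self.cache.get(client.sequence_id)
        if ref is None:  # cache miss -> fetch via getReferenceData()
            ref = client.get_reference_data()
            self.cache[client.sequence_id] = ref
        return ("matrix", image, ref)

client = Client("seq0", "refdata0")
server = ComputeServer()
replica = ComputeServer()  # empty cache, refills itself on first request
assert server.compute(client, "img1") == replica.compute(client, "img1")
print(server.cache)  # {'seq0': 'refdata0'}
```

The final assertion is the replication-safety property in miniature: an empty replica produces the same results as the original because the cache is reconstructible state.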
Fig. 4. MIMO tool development methodology.

Table 1
Mapping of the realignment application

Realignment application entity    MIMO MLM entity
--------------------------------  ------------------
Realignment application           Application
Compute-interface                 Interface
Compute-object IOR                Distributed object
Java realignment class            Implementation
Realignment process ID            Runtime
Node                              Hardware
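In a monitor implementation, the mapping of Table 1 amounts to a lookup from concrete entity kinds to MLM layers, and the information exchanged between intruders and tools can be encoded as plain event records. A minimal sketch; the dictionary keys and event field names are our own choices, not MIMO's actual API:

```python
from dataclasses import dataclass

# Realignment entity kind -> MIMO multi-layer-monitoring (MLM) layer (Table 1)
MLM_LAYER = {
    "realignment_application": "application",
    "compute_interface": "interface",
    "compute_object_ior": "distributed object",
    "java_realignment_class": "implementation",
    "realignment_process_id": "runtime",
    "node": "hardware",
}

def layer_of(entity_kind):
    return MLM_LAYER[entity_kind]

@dataclass
class Event:
    name: str      # e.g. "object_created", "node_load", "migrate_object"
    source: str    # entity the event refers to or originates from
    params: dict   # event-specific parameters

# Observation event flowing from an intruder/adapter to a tool:
load_report = Event("node_load", "node3", {"cpu": 0.72, "objects": 2})
# Steering command flowing back from a tool into the application:
migrate_cmd = Event("migrate_object", "tool",
                    {"target_object": "servant-1",
                     "destination_node": "node2"})

print(layer_of("compute_object_ior"), migrate_cmd.params["destination_node"])
```

Classifying every event's source by its MLM layer is what lets one monitor core serve tools at very different abstraction levels.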
For integrating these components within MIMO, the relevant entities have to be determined at first. The resulting mapping of realignment entities to the MIMO MLM model is illustrated in Table 1. The next step is to define appropriate events of interest. These include both events for observing the progress of the realignment process, as well as steering events allowing to manipulate the running application. The main events being tracked by MIMO are as follows: - Creation and deletion events for new client and servant objects - Replication of objects - Migration of objects - Information events reporting the load of nodes and objects For steering the realignment process, two events are used: - Migrate object - Replicate object The parameters for those events include the target object and the respective destination node. More exact details about the events and their parameters can be found in [1]. 5.3. Example scenario The final step of our integration is the development of a visualisation bean making use of the MIVIS frame-
work. We developed a new display that is used for the visualisation of the processes described before. Figure 6 presents the basic layout of the graphical online tool. Client and server objects are located within the respective rectangles representing the client and server hosts. In addition, server object load (numerical representation) and server host load values are depicted (numerical and graphical representation). The CORBA method compute() is represented as black arrow with a counter and getRef erenceData() as offset turquoise arrow. Replications and Migrations are represented as white and red arrows respectively. Replication and Migration actions can be initiated manually with a drag-and-drop functionality; the user can therefore easily migrate objects to other nodes, or replicate them, if adequate. Consequently, the combination of MIMO and MIVIS provides a flexible and extensible infrastructure for the development and the maintenance of large scale distributed applications. 6. Evaluation In order to prove the efficiency of the presented load management concept and its implementation, we present a test case for our integrated realignment application [11]. The hardware consists of three machines with equal configuration. There is no background load on the machines. The examined CORBA application is the medical image-processing application described in Section 2 with two simultaneously requesting clients. The application is replication safe as already mentioned in Section 3. Thus, migration and replication can be applied to this application. Figure 7 shows the processing time per image against the number of the processed image for both clients. At the beginning, one server object is created and placed on a machine (initial placement) and the clients start
G. Rackl et al. / Efficiently building on-line tools for distributed heterogeneous environments
Fig. 6. Visualization of a replication and of object interactions.
requesting the server. The image processing time is initially equivalent for both clients because the server alternately processes their requests. After a while, the load management system recognises that the server is overloaded because both clients permanently request it. Accordingly, replication is performed, i.e. a second server object (replica) is created and each client is assigned its own replica. As a consequence of the replication, the image processing time of each client decreases by about 50%. Some time later, background processor load is generated on the machine used by the second client's replica. Hence, the image processing time of the second client increases substantially. Again, the load management system recognises the processor overload and migrates the affected replica to the third machine, which had not been used so far. As a result, the image processing time returns to its normal level. The test case shows how the load management system deals with different kinds of overload. Request overload is compensated by replication, whereas background load is compensated by migrating an object to a less loaded host. Consequently, the load management system improves the performance and scalability of the medical image-processing application. Additionally, the integrated tool environment makes it possible to visualise and manually steer the load-balanced application, which is helpful for performance tuning as well as for maintenance tasks.

7. Conclusion and future work

In this paper, we have presented an approach for developing on-line tools for distributed middleware-based applications. By means of our load-balanced medical realignment application, we have shown how to integrate monitoring functions and how to implement a visualisation and steering tool that can be used for both development and deployment tasks.
Fig. 7. The load-managed medical image-processing application: processing time per image (in seconds) plotted against the image number for Client 1 and Client 2, with the replication and migration events marked.
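The realignment policy exercised in this test case can be sketched as a simple decision rule: replicate under request overload, migrate under node (background) overload. The following Python sketch is purely illustrative; the thresholds and parameter names are our assumptions, not part of the MIMO implementation.

```python
# Hypothetical sketch of the realignment decisions illustrated by Fig. 7.
# Thresholds and parameter names are invented for illustration; the actual
# load management system uses monitoring events delivered through MIMO.

def decide(node_cpu_load, pending_requests, max_cpu=0.8, max_queue=4):
    """Return the realignment action for one server object."""
    if pending_requests > max_queue:
        return "replicate"   # request overload: create a replica for a client
    if node_cpu_load > max_cpu:
        return "migrate"     # background load: move the object to a freer node
    return "none"
```

In the test case, the first decision corresponds to both clients permanently requesting one server (replicate), and the second to the background load injected on the replica's node (migrate).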
MIMO provides an infrastructure for implementing integrated tool environments, and the tool development methodology ensures a systematic and concerted development process for tools and instrumentation components. Finally, we have presented a real-world application scenario that demonstrates the applicability of our approach. Future work includes the elaboration of further tool interoperability concepts, which enable a coordinated and simultaneous application of several tools to a single application. The integration of our load balancer with the visualisation and steering tool is a first step in this direction. The overall goal of our endeavours is to contribute to an enhanced software development and deployment process by improving the tool support for the "on-line phases" of the software lifecycle. We aim at an increased acceptance of on-line tools by demonstrating an approach for their rapid development and efficient usage.
References
[1] A. Stamatakis, Interoperability of Tools for Distributed Object-Oriented Environments, Diploma thesis, Technische Universität München, 2001 (in German).
[2] M. May, Vergleich von PVM und CORBA bei der verteilten Berechnung medizinischer Bilddaten (Comparison of PVM and CORBA for the distributed processing of medical image data), Master's thesis, Technische Universität München, 2000.
[3] K. Friston, SPM, Technical report, The Wellcome Department of Cognitive Neurology, University College London, 1999.
[4] M. Lindermeier, Load Management for Distributed Object-Oriented Environments, in: International Symposium on Distributed Objects and Applications (DOA'2000), IEEE Press, Antwerp, Belgium, 2000.
[5] OMG (Object Management Group), The Common Object Request Broker: Architecture and Specification - Revision 2.3.1, Technical report, http://www.omg.org, 1999.
[6] G. Eddon and H. Eddon, Inside Distributed COM, Microsoft Press, 1998.
[7] M. Henning, Binding, Migration, and Scalability in CORBA, Communications of the ACM, 1998.
[8] G. Rackl, Monitoring and Managing Heterogeneous Middleware, Dissertation, Technische Universität München, February 2001. http://tumbl.biblio.tu-muenchen.de/publ/diss/in/2001/rackl.html.
[9] T. Ludwig, R. Wismüller, V. Sunderam and A. Bode, OMIS - On-Line Monitoring Interface Specification (Version 2.0), Vol. 9 of Research Report Series, Lehrstuhl für Rechnertechnik und Rechnerorganisation (LRR-TUM), Technische Universität München, Shaker, Aachen, 1997.
[10] M. Rudorfer, Visualisierung des dynamischen Verhaltens verteilter objektorientierter Anwendungen (Visualisation of the dynamic behaviour of distributed object-oriented applications), Master's thesis, Technische Universität München, 1999.
[11] T. Ludwig, M. Lindermeier, A. Stamatakis and G. Rackl, Tool Environments in CORBA-based Medical High Performance Computing, in: Proc. of the PACT 2001 Conference, Novosibirsk, Russia, September 2001. To appear.
Scientific Programming 10 (2002) 75-89
IOS Press
Architecting Web sites for high performance
Arun Iyengar* and Daniela Rosu
IBM Research, T.J. Watson Research Center, P.O. Box 704, Yorktown Heights, NY 10598, USA
Abstract: Web site applications are some of the most challenging high-performance applications currently being developed and deployed. The challenges emerge from the specific combination of highly variable workload characteristics and high performance demands regarding service levels, scalability, availability, and cost. In recent years, a large body of research has addressed the Web site application domain, and a host of innovative software and hardware solutions have been proposed and deployed. This paper is an overview of recent solutions concerning the architectures and the software infrastructures used in building Web site applications. The presentation emphasizes three of the main functions of a complex Web site: the processing of client requests, the control of service levels, and the interaction with remote network caches.
1. Introduction

Web site applications are some of the most challenging high-performance applications currently being developed and deployed. This class of applications spans a wide range of activities including commercial sites, like on-line retailing and auctions, financial services, like on-line banking and security trading, information sites, like news and sport events, and educational sites, like digital libraries.
The challenges that high-performance Web site applications must address emerge from the characteristics of workloads and the complexity of performance constraints that these applications have to support. In general, a Web site application provides one or more types of services developed on top of an HTTP-based infrastructure. These services may span a large range of functionalities, from delivery of static content, to execution of site-specific computations and database queries, and to streaming content. Consequently, the services offered by a site may differ with respect to request rates, response time constraints, and computation and bandwidth needs.
Numerous studies demonstrate that Web traffic is increasing at a high rate. Moreover, the throughput that Web sites need to sustain continues to increase dramatically. Figure 1 illustrates this trend by presenting the increase of (a) daily Web traffic and of (b) peak hits per minute at major sporting and event Web sites hosted by IBM from February 1998 through September 2000. This trend raises important capacity planning problems, given that having sufficient capacity to handle traffic at a given point in time might not be sufficient several months in the future. Another challenging characteristic of Web traffic is its burstiness: request rates during peak intervals may be several times larger than the average rates [37]. Cost-effective solutions to this challenge multiplex several independent Web applications over the same computing and networking infrastructure in order to appropriately balance performance and cost parameters [5]. However, this raises the need for complex mechanisms for service level control.
Practically anyone who has used the Web is aware that requests can incur long response delays. On the path of a client's interaction with a Web site application, there are many components of the Web infrastructure where performance problems can occur. The Web site itself can be a major bottleneck. Particular attention is required for Web sites generating significant dynamic or encrypted content. Previous research has shown that generating dynamic content can consume significant CPU cycles [35,36], while serving encrypted content via SSL can be more than an order of magnitude more expensive than serving unencrypted content [4].
* Corresponding author: Arun Iyengar, IBM T.J. Watson Research Center, 19 Skyline Drive, Hawthorne, NY 10532, USA. Tel.: +1 914 784 6468; Fax: +1 914 784 7455; E-mail: [email protected].
ISSN 1058-9244/02/$8.00 © 2002 - IOS Press. All rights reserved
Fig. 1. Maximum number of hits during (a) a single day (in millions) and (b) a single minute (in thousands), at major sporting and event Web sites hosted by IBM from February 1998 through September 2000. Each bar represents a different quarter of a year.
Even though predictable response time may not be the most critical performance demand a Web site is designed for, failing to address it can have significant business consequences. Long waits can deter users who are not willing to spend significant amounts of time accessing a Web site in order to retrieve content of interest. Moreover, unpredictable response times combined with availability gaps can make a Web site unsuitable for mission-critical applications. These problems might be due to failures of systems along the path between clients and Web sites, but often their cause lies in the Web site itself: the site's architecture and software infrastructure may not scale well with high peak request rates or highly variable per-request resource needs.
Overall, the high growth rate of the client population and the increased complexity of interactions and content models result in demands for high throughput, high availability, low response times, and reduced costs. To address these demands, innovative software and hardware solutions are required. Recently, a large body of research has addressed Web site-related problems. This paper is an overview of recent solutions concerning the architectures, software infrastructure services, and operating system services used in building high performance Web site applications.
The remainder of this paper is organized as follows. Section 2 presents the architecture of a high performance Web site, identifying the main components and briefly discussing implementation challenges and recent solutions. Sections 3-5 review the challenges and solutions proposed for several Web site functions which have a significant impact on performance - the processing of client requests (Section 3), the control of service levels (Section 4), and the interactions with remote network caches (Section 5). Finally, Section 6
concludes the paper by highlighting several problems for future research.
2. Web site architecture

In this section, we introduce the main elements of a high-volume Web site architecture. Focusing on their role in processing client requests, we categorize the components of a Web site's architecture into three layers: the request distribution layer, the content delivery layer, and the content generation layer.
The request distribution layer performs the routing of requests to the Web site's nodes or subsystems that can appropriately process them. The goal of request distribution is to ensure the targeted levels of throughput, response time, and availability. This task is particularly challenging when the site offers several services, each with specific resource needs, request patterns, throughput goals, and business values. The request distribution layer includes components such as Domain Name System (DNS) servers, access switches and routers, and infrastructure for performance monitoring and request (re)routing.
The content delivery layer replies to client requests by transmitting content available in disk and memory-based stores, or acquired by interaction with content generation components. A goal of the content delivery layer is to maximize the ratio of requests that can be serviced from local stores, thus minimizing response times and server loads. This layer includes components such as caching proxies, HTTP servers, application servers, and content transformation engines.
Finally, the content generation layer handles content materialization triggered by client requests and
database updates. This layer includes database servers, caches of query results, engines for tracking updates and triggering off-line re-computation, and infrastructure for Web site management.
The three layers of a Web site application may be mapped to a variety of physical configurations ranging from a single node to a worldwide distributed infrastructure. Figure 2 illustrates the architecture of a complex Web site which would be similar to those at sites such as IBM's sporting and event Web sites [34]. The content generation layer is at the central point of the site configuration and typically spans a relatively small geographical area. The content delivery and request routing layers may be widely distributed. Multiple Web servers would typically be needed to handle high request rates. In Fig. 2, multiple Web servers satisfy requests dispatched by the connection router according to some load balancing policy.
Web sites may be mirrored in geographically distributed locations. A client can then obtain content with less latency by accessing the Web site replica that is closest to it. For some mirrored Web sites, clients must pick the closest site themselves, while for other sites, requests are routed by the network to the closest site [16]. Mirrored Web sites also provide higher availability: if one replica site is down, others are likely to be available.
In the remainder of this section, we briefly discuss the role of the main components in a Web site architecture, including DNS servers, caching proxies, connection routers, HTTP servers, query engines, and dynamic content management frameworks. The presentation order follows the flow of a client's interaction with the Web site application.

DNS servers
DNS servers provide client hosts with the IP address of one of the site's content delivery nodes. When a request is made to a Web site such as http://www.research.ibm.com/hvws, "www.research.ibm.com" must be translated to an IP address.
DNS servers perform this translation. Besides enabling clients to reach the targeted Web sites, DNS servers provide a means for scaling Web sites to handle high request rates. Namely, a name associated with a Web site might map to several IP addresses, each associated with a different set of server nodes. DNS servers select one of these addresses based on a site-specific request distribution policy. This policy may be as simple as Round Robin, but typically it falls into the 'Least Loaded' paradigm, attempting to minimize maximum node utilization [15,19]. In the
latter case, load estimations are based on observations local to the DNS server (e.g., number of forwarded clients), and on monitoring information provided by each of the nodes (e.g., number of serviced requests, CPU utilization). In either case, the load estimators are likely to have low accuracy, and this is due to several factors. First, not all requests trigger identical server loads, with load variability likely to increase as the Web site complexity increases. Second, DNS servers are not on the path of all of the requests that reach the site, as the results of DNS lookups can be cached. Cached name-to-IP address mappings have lifetimes, called "Time-To-Live" (TTL) attributes, which are provided by the DNS servers and which indicate when these mappings are no longer valid. If a mapping is cached, a subsequent request to the same Web site can obtain the IP address from a cache, obviating the need to contact the DNS server. Thus, the caching of name-to-IP address mappings allows requests to reach the site's nodes without being load balanced by the DNS server; this increases the risk of load imbalance and limits the effectiveness of DNS-based load balancing [22]. Overcoming these drawbacks by increasing the rate of monitoring reports may be prohibitively expensive. A typical solution is to limit the number of requests that reach the site without the control of the DNS server. This is achieved by setting very short, possibly zero, TTLs, causing most of the requests to trigger DNS lookups. However, recent research provides evidence that this approach has negative effects on client performance, as it may result in a significant increase of per-page response times [51]. Consequently, for a scalable Web site, DNS-based request distribution should be coupled with local solutions for load redistribution [8].
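The 'Least Loaded' selection and the TTL trade-off described above can be sketched as follows. This is an illustrative model, not an actual DNS server implementation; the data layout and default TTL are assumptions.

```python
# Sketch of 'Least Loaded' DNS-based request distribution: the DNS server
# picks the server IP with the lowest load estimate, built from local
# observations and per-node monitoring reports. A short (even zero) TTL
# forces clients back to the DNS server, improving balance at the cost of
# extra lookups per page.

def resolve(load_estimates, ttl_seconds=0):
    """Return (ip, ttl) for the least-loaded node.

    load_estimates maps an IP address to an estimated utilization in [0, 1].
    """
    ip = min(load_estimates, key=load_estimates.get)
    return ip, ttl_seconds
```

With a non-zero TTL, requests arriving while the mapping is cached bypass this selection entirely, which is exactly the source of imbalance discussed in the text.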
Fig. 2. Components of a Web site architecture.

Connection routers
Clusters dedicated to content delivery typically have a connection-routing front end, a component of the request distribution layer which routes requests to multiple back-end servers. The connection router hides the IP addresses of the back-end servers. This contrasts with typical DNS-based routing, in which clients can obtain the addresses of individual back-end servers. However, for sites with multiple connection routers, DNS lookups provide the address of one of these routers. A connection router includes a router or a switch that may be stand-alone [22,31,46,61], may work with dedicated resource management nodes [54,62], or may be backed by request re-routing systems distributed across the back-end nodes [7,8,63]. The request distribution policies implemented by the connection router are based on system load parameters and rules for mapping content and service types to resources.
The performance enabled by a connection router is better than that achieved with DNS-based routing [22]. This is mainly because, in comparison to a DNS server, the connection router handles all requests addressed to the site, and thus requests do not reach back-end nodes without first going through the router. In addition, a connection router can have more accurate system load information (e.g., number of active connections, transferred payload), can monitor resource utilization with a much finer granularity, and can determine and exploit the type of the requested service and content. Besides enabling effective load distribution, connection routers simplify system management. Namely, by hiding the identities of back-end servers, a connection router enables transparent additions or removals of servers. In addition, when a server fails, a connection router can quickly identify the event and stop directing new requests to the node.

Reverse proxies
A Web site's reverse proxies, also called "Web server accelerators", are components in the content delivery layer that attempt to service client requests from a high-performance cache. When this is not possible, requests are forwarded to the Web site. A key difference between a reverse proxy and a proxy cache that serves an institution or a region is the mix of cached content. An institution or region cache, also called a forward proxy, includes content from a large set of Web sites
and therefore has a large working-set size, requiring an extensive amount of storage to obtain reasonable hit rates. In contrast, a reverse proxy cache includes content from a single site. Therefore, its working set is relatively small, and less storage is required for a reverse proxy to achieve good hit rates.
A reverse proxy may be located either within or outside the physical confines of the Web site. Its physical configuration varies from a cache running on a high-performance connection router [41], to a stand-alone node [32,39], to a cluster [26,30,52,53]. In cluster-based reverse proxies, nodes operate independently or as cooperative caches [30,52]. In typical implementations, reverse proxies do not have disk stores, relying only on main memory caches. Reverse proxies can include the functionality of a content-based router [32], a particular type of connection router in which the target node is selected based on the type of the requested content.
By offloading many of the requests addressed to the regular Web servers at a site, reverse proxies contribute to an increase in the site's overall throughput. Moreover, their restricted functionality is amenable to efficient implementations, which can boost throughput and reduce response times. In order to achieve consistency and prevent stale data from being served, a reverse proxy cache should have mechanisms that allow the server to invalidate and possibly push data. The API used in [41] allows dynamic data to be cached in addition to static data. More recently, a significant body of work has addressed the problem of consistency protocols for proxy caches.
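A memory-only reverse-proxy cache with a server-driven invalidation and push interface, of the kind just described, might be sketched as follows. The class and method names are illustrative assumptions, not the API of any cited system.

```python
# Minimal sketch of a memory-only reverse-proxy cache with a server-driven
# consistency interface: the origin server can invalidate stale entries or
# push new versions, so dynamic content can be cached safely.

class ReverseProxyCache:
    def __init__(self, fetch_from_origin):
        self._store = {}                  # url -> cached response body
        self._fetch = fetch_from_origin   # called on a cache miss

    def get(self, url):
        if url not in self._store:        # miss: forward to the Web site
            self._store[url] = self._fetch(url)
        return self._store[url]

    # --- consistency API used by the origin server ---
    def invalidate(self, url):
        self._store.pop(url, None)        # next get() refetches

    def push(self, url, body):
        self._store[url] = body           # server pushes a new version
```

The push path corresponds to the push-based extensions of pull-based consistency protocols discussed next, where the site tracks caching proxies and sends them invalidations or updates.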
Much of this work is relevant to reverse proxies. New proposals aim to reduce Web site overhead and client-observed response times by extending traditional pull-based protocols with push-based methods [21,35,57]. Namely, the Web site keeps track of the proxies caching its content and pushes invalidation messages when new versions are created.

HTTP servers
HTTP servers process requests for static and precomputed dynamic content [48] and provide the underlying framework for the execution of application servers, such as IBM's WebSphere [33], in response to client requests. The main characteristics that distinguish HTTP servers are their software architectures and execution modes. The software architectures of HTTP servers vary from event-based with a single thread handling all client connections (e.g., Zeus [60], NWSA [32], kHTTPd [55]), to a combination of event-based processing and thread-based handling of disk I/O operations (TUX [44]), to pure thread-based processing with connection-to-thread mappings that last for the entire connection lifetime (e.g., Apache [3]). With respect to execution mode, HTTP servers may execute in user mode (e.g., Zeus, Apache) or in kernel mode (e.g., kHTTPd, TUX, NWSA). Both software architecture and execution mode influence the achievable performance levels; significant performance advantages result from event-based kernel-mode servers.
Recent research has focused on optimizing the operating system functions on the critical path of request processing. For instance, [11] proposes scalable solutions to the UNIX methods for select and allocation of file descriptors, which are critical for user-mode, event-based servers. [9] proposes a network subsystem architecture that allows incoming network traffic to be processed at the priority of the receiving process, thus achieving stability under overload and integrating the network subsystem into the node's resource management system.
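The contrast between event-based and thread-per-connection architectures can be illustrated with a toy event loop. This is a deliberately simplified model: connection handlers are represented as Python generators that yield whenever they would block, standing in for the select-driven state machines of real event-based servers.

```python
# Toy illustration of the event-based, single-threaded server architecture
# (cf. Zeus, kHTTPd): one loop multiplexes many connection state machines,
# instead of dedicating a thread to each connection as in Apache 1.x.
# Connections are modeled as generators that yield when they would block.

def event_loop(connections):
    """Drive all connection handlers to completion on a single thread."""
    pending = list(connections)
    served = []
    while pending:
        conn = pending.pop(0)
        try:
            next(conn)            # run the handler until it would block
            pending.append(conn)  # not finished: re-queue it
        except StopIteration as done:
            served.append(done.value)
    return served

def handle(request):
    yield                         # e.g., wait for the request to be read
    yield                         # e.g., wait for the response to drain
    return f"200 OK {request}"
```

All connections make progress interleaved on one thread, which avoids per-connection thread overhead at the cost of requiring non-blocking handlers.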
[47] proposes a unified I/O buffering and caching system that eliminates all data copies involved in serving client requests and eliminates multiple buffering of I/O data, thus benefiting servers with numerous WAN connections, for which TCP retransmission buffers have to be maintained for long time intervals.

Caching query results
In the process of serving requests for dynamic content, application servers often query databases. These queries can have significant overhead. Recent research indicates that significant benefits can result from
caching query results. Cached results can be used as a whole [20,48] or as a base for sub-queries [42]. Proposed solutions allow query caching to be controlled by the application [20,48] or by the database itself [25]. Application-controlled caching may benefit from exploiting application semantics but may be less effective for complex applications with many relations and a variety of queries. In contrast, query engine-controlled caching can exploit the engine's detailed understanding of the complete site schema. For instance, a query engine can automatically implement query reformulation in order to produce content that is likely to have higher cache utility.

Management of dynamic content
The framework for dynamic content management is a Web site component that tracks information updates and controls the pro-active recomputation of the related dynamic content. Recent research has demonstrated the significant benefits that can result from pro-active content recomputation (and caching) versus the traditional per-request content generation [17,64]. For instance, the approach proposed in [17] is to monitor underlying data which affect Web pages, such as database tables or files. When changes are detected, new Web pages are generated to reflect these changes. The underlying mechanism, called "trigger monitor", uses a graph to maintain relationships between the underlying data and the Web pages affected by the data. The trigger monitor uses graph traversal algorithms to determine which Web pages need to be updated as a result of changes to underlying data.
To summarize, this section has reviewed the major components of a Web site's architecture, highlighting related challenges and proposed solutions. In the remainder of this paper, we focus on several functions of a Web site application that have strong implications on the overall performance, namely the processing of client requests, the control of service levels, and the interaction with network caches.

3. Processing client requests

Highly accessed Web sites may need to handle peak request rates of over a million hits per minute. Web serving lends itself well to concurrency because transactions from different clients can be handled in parallel. A single Web server can achieve parallelism by multithreading or multitasking among different requests. Additional parallelism and higher throughputs can be
achieved by using multiple servers and load balancing requests among the servers. Sophisticated restructuring by the programmer or compiler to achieve high degrees of parallelism is not necessary. Web servers satisfy two types of requests, static and dynamic. Static requests are for files that exist at the time a request is made. Dynamic requests are for content that has to be generated by a server program executed at request time. A key difference between satisfying static versus dynamic requests is the processing overhead. The overhead of serving static pages is relatively low. A Web server running on a uniprocessor can typically serve several hundred static requests per second. This number is highly dependent on the data being served; for large files, the throughput is lower. The overhead for satisfying a dynamic request may be orders of magnitude more than the overhead for satisfying a static request. Dynamic requests often involve extensive back-end processing. Many Web sites make use of databases, and a dynamic request may invoke several database accesses; these database accesses can consume significant CPU cycles. The back-end software for creating dynamic pages may be complex. While the functionality performed by such software may not appear to be compute-intensive, such middleware systems are often not designed efficiently. Many commercial products for generating dynamic data are highly inefficient. One source of overhead in accessing databases is connecting to the database. In order to perform a transaction on many databases, a client must first establish a connection with a database in which it typically provides authentication information. Establishing a connection is often quite expensive. A naive implementation of a Web site would establish a new connection for each database access. This approach could overload the database with relatively low traffic levels. 
A significantly more efficient approach is to maintain one or more long-running processes with open connections to the database. Accesses to the database are then made with one of these long-running processes. That way, multiple accesses to the database can be made over a single connection. Another source of overhead is the interface for invoking a server program in order to generate dynamic data. The traditional method for invoking server programs for Web requests is via the Common Gateway Interface (CGI). CGI works by forking off a new process to handle each dynamic request; this incurs significant overhead. There are a number of faster interfaces available for invoking server programs [34].
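The long-running-process approach to database access described above amounts to connection pooling, which can be sketched as follows. The `Connection` class here is a stand-in for a real database driver; class and method names are illustrative.

```python
# Sketch of the long-running-process approach to database access: keep a
# small pool of open connections and reuse them across requests, instead of
# paying the expensive connect-and-authenticate cost for every request.

import queue

class Connection:
    opened = 0                       # counts expensive connect operations
    def __init__(self):
        Connection.opened += 1       # a real driver would authenticate here
    def execute(self, sql):
        return f"result of {sql!r}"

class ConnectionPool:
    def __init__(self, size=2):
        self._free = queue.Queue()
        for _ in range(size):        # open all connections once, up front
            self._free.put(Connection())

    def run(self, sql):
        conn = self._free.get()      # block until a connection is free
        try:
            return conn.execute(sql)
        finally:
            self._free.put(conn)     # return it for the next request
```

With a pool of two connections, any number of requests incur only two connect operations, rather than one per access as in the naive implementation.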
These faster interfaces use one of two approaches. The first approach is for the Web server to provide an interface that allows a program for generating dynamic data to be invoked as part of the Web server process itself. IBM's GO Web server API (GWAPI) is an example of such an interface. The second approach is to establish long-running processes to which a Web server passes requests. While this approach incurs some interprocess communication overhead, the overhead is considerably less than that incurred by CGI. FastCGI is an example of the second approach [45].
In order to reduce the overhead of generating dynamic data, it is often feasible to generate the data corresponding to a dynamic page once, store the page in a cache, and serve subsequent requests to the page from the cache instead of invoking the server program again [35,48,64]. Using this approach, dynamic data can be served at the same rate as static data. However, some types of dynamic data cannot be pre-computed and serviced from the cache. For instance, dynamic requests that cause a side effect at the server, such as a database update, cannot be satisfied merely by returning a cached page. For example, consider a Web site that allows clients to purchase items using credit cards. At the point at which a client commits to buying something, that information has to be recorded at the Web site; the request cannot be serviced solely from the cache.
Personalized Web pages can also present problems for caches. A personalized Web page contains content specific to a client, such as the client's name. Such a Web page cannot be used for another client. Therefore, caching the page is of limited utility since only a single client can use it; each client would need a different version of the page. One method that can reduce the overhead of generating dynamic pages and enable caching of some parts of personalized pages is to define these pages as a collection of fragments [18].
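A minimal sketch of fragment-based page composition, with a shared cached fragment and a per-request personalized fragment, might look like this. The fragment names and markup are illustrative assumptions.

```python
# Sketch of fragment-based page composition: a page is assembled from named
# fragments; shared fragments are cached once, while non-cacheable
# (personalized) fragments are generated for each request.

fragment_cache = {"header": "<h1>Site News</h1>"}   # shared, cached once

def personalized_greeting(user):
    """A non-cacheable fragment, regenerated per client."""
    return f"<p>Hello, {user}!</p>"

def compose_page(user):
    """Assemble a page from cached and per-request fragments."""
    parts = [
        fragment_cache["header"],        # served from the fragment cache
        personalized_greeting(user),     # generated for this client only
    ]
    return "\n".join(parts)
```

Only the small personalized fragment must be produced per request; the shared header is stored once no matter how many pages embed it.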
In this approach, a complex Web page is constructed from several simpler fragments that may be recursively embedded. This is efficient because the overhead for composing an object from simpler fragments is usually minor compared to the overhead for constructing the object from scratch, which can be quite high. The fragment-based approach also makes it easier to design Web sites. Common information that needs to be included on multiple Web pages can be created as a fragment. In order to change the information on all pages, only the fragment needs to be changed. In order to use fragments to allow partial caching of personalized pages, the personalized information on a
Web page is encapsulated by one or more fragments that are not cacheable, but the other fragments in the page are. When serving a request, a cache composes pages from its constituent fragments, many of which can be locally available. Only personalized fragments have to be created by the server. As personalized fragments typically constitute a small fraction of the entire page, generating them would require lower overhead than generating all of the fragments in the page. Generating Web pages from fragments provides other caching benefits as well. Fragments can be constructed to represent entities that have similar lifetimes. When a particular fragment changes but the rest of the Web page stays the same, only the fragment needs to be invalidated or updated in the cache, not the entire page. Fragments can also reduce the amount of cache space taken by a collection of pages. Suppose that a particular fragment f1 is contained in 2000 popular Web pages which should be cached. Using the conventional approach, the cache would contain a separate version of f1 for each page resulting in as many as 2000 copies. By contrast, if the fragment-based method of page composition is used, only a single copy of f1 needs to be maintained. A key problem with caching dynamic content is maintaining consistent caches. The cache requires a mechanism, such as an API, allowing the server to explicitly invalidate or update cached objects that have become obsolete. Web objects may be assigned expiration times that indicate when they should be considered obsolete. Such expiration times are generally not sufficient for allowing dynamic data to be cached properly because it is often not possible to predict accurately when a dynamic page will change. This is why a mechanism is needed to allow the server to explicitly keep the cache updated. Server performance can also be adversely affected by encryption. Many Web sites need to provide secure data to their clients via encryption. 
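The fragment-based page composition described earlier can be sketched in a few lines; the fragment names, cache layout, and helper functions below are illustrative assumptions rather than details from the systems cited:

```python
# Sketch of fragment-based page assembly: cacheable fragments are shared
# across clients, while personalized fragments are generated per request.
# All names here are hypothetical.

fragment_cache = {
    "header": "<div>Site header</div>",
    "nav": "<div>Navigation bar</div>",
    "story-123": "<div>Top story of the day...</div>",
}

def generate_personalized(fragment_id, client):
    # Only non-cacheable fragments reach the server program.
    return f"<div>Hello, {client}!</div>"

def compose_page(fragment_ids, client):
    parts = []
    for fid in fragment_ids:
        if fid in fragment_cache:      # shared copy, one per fragment
            parts.append(fragment_cache[fid])
        else:                          # personalized, built per request
            parts.append(generate_personalized(fid, client))
    return "\n".join(parts)

page = compose_page(["header", "nav", "greeting", "story-123"], client="alice")
```

Note that if "story-123" changes, only its single cache entry needs to be invalidated, and that entry is shared by every page embedding the fragment.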
The Secure Sockets Layer protocol (SSL) [27] is the most common method for passing information securely over the Web. SSL causes serious performance degradation [4]. In addition, the overhead of providing secured interactions may be unnecessarily increased if embedded images that do not include private content are specified by the Web content designer as requiring secure transmission. While objects encrypted via SSL generally cannot be cached within the network, they can be cached within browsers if they have expiration times. Web sites can thus considerably reduce their encryption workloads by properly setting expiration times for objects that need to be encrypted.
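As a rough illustration of the last point, a server can attach expiration metadata to encrypted objects so that browsers may cache them; the helper function and the one-day lifetime below are assumptions for the sketch, not values from the paper:

```python
# Build HTTP caching headers for an object served over SSL so that a
# browser may reuse its local copy until the expiration time passes.
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime

def expiration_headers(lifetime_seconds):
    expires = datetime.now(timezone.utc) + timedelta(seconds=lifetime_seconds)
    return {
        "Cache-Control": f"max-age={lifetime_seconds}",
        "Expires": format_datetime(expires, usegmt=True),
    }

# An embedded image with no private content can be given a long lifetime.
headers = expiration_headers(24 * 3600)
```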
To summarize, this section has addressed the elements of HTTP request processing that incur significant overhead. Among these, the generation of dynamic content is one of the most critical, particularly for Web sites that serve large amounts of dynamically generated data.
4. Control of service levels

The control of service levels is a major concern for many deployed Web site applications. This concern is motivated by business reasons, including the need to maximize resource utilization and the need to keep clients motivated to access the site by delivering responses within a reasonable time. Challenges stem from the characteristics of Web site workloads, which typically exhibit high burstiness of arrivals and high variance of per-request resource usage. The problem of controlling the service levels of a Web site has been addressed in different formulations by a large body of research. The most frequently considered formulations include:

- Single service - maximal performance [8,22,29,46,61,63]. In the simplest formulation, a Web site provides a single type of service. The goal of service-level control is to maximize throughput and minimize response times across all of the requests received by the system.

- Multiple services - differentiated performance [2,62]. In a more general formulation, a Web site provides several types of services, each characterized by a relative importance coefficient. The goal of service-level control is to ensure that requests for more important services receive better service quality while less important services do not starve.

- Multiple services - isolated performance [1,5,6,10,54]. In an alternative formulation for Web sites providing multiple services, each service is associated with a minimum performance level, known as a Service Level Agreement (SLA) and defined by bounds on performance metrics such as sustainable request rate, response time, and availability. The goal of service-level control is to ensure that each service achieves the SLA levels while the system resources are effectively exploited.

Research has addressed the problem of service-level control for Web sites based on host and cluster systems. For host-based sites, solutions are defined by methods for dispatching requests to the available execution
threads [1,2] and for enforcing that these threads do not use more than service-specific shares of the CPU and disk resources [10]. For cluster-based sites, solutions are defined by methods for dispatching client requests to content delivery hosts and for performance monitoring. For both types of sites, request-dispatching solutions may include rules for discarding requests in order to prevent the system from reaching overload. In the remainder of this section, we discuss the solutions proposed for each of the three formulations of the service-level control problem, focusing on the solutions for cluster-based sites, the most relevant configuration for high performance Web sites.

Single service - maximum performance

In the most frequently addressed model of service-level control, all requests reaching the Web site have equal importance. The goal is to maximize the site's throughput by ensuring that the load is well balanced across all of the site's processing nodes. A secondary goal is to minimize the average response time. Sample approaches include increasing per-node hit ratios in the memory cache and reducing wait times in node service queues. All solutions are based on the existence of a control component that is invoked for each new connection or request to decide to which node it should be dispatched or whether it should be dropped. The decision is based on per-node information, such as estimates of current CPU and disk loads, accessible content, and memory cache content. In addition, the decision might consider request characteristics, such as target content group, actual URL, and expected resource needs. When the identity of the requested content is a parameter of the dispatching decision, the controller includes a request analysis component. The feature of the control mechanism that has the strongest impact on performance and scalability is the placement of its modules for dispatching decision and request analysis.
The proposed solutions include the following: (1) both the dispatching decision and the request analysis are performed by the connection router; (2) the dispatching decision is made by a specialized node, while the analysis is done by the processing nodes; and (3) both the dispatching decision and the analysis are performed by each of the processing nodes. Solutions based on connection-router decisions and focusing only on throughput maximization were the first to appear (see [50] for a survey and taxonomy of router-based solutions). In this group, the most common
request distribution policy is Weighted Round-Robin (WRR), in which the number of requests directed to a node is inversely proportional to its load [22]. Sample per-node load estimators include the number of established connections, response times to probe requests sent by the controller, CPU utilization, and a combination of CPU and disk utilizations. Alternative policies include "Least Loaded First" and weighted "least connections", with weights reflecting the relative capacities of the nodes [50]. While characterized by a relatively low overhead, decisions based only on load information cannot exploit content distribution across the disk stores and memory caches of the individual servers. To address this problem, content-based routing is used. A content-based router examines each request and dispatches it to a node that has access to the necessary content and is likely to deliver it with the least overhead. This method can also be used to route certain types of requests to designated servers. For example, a content-based router could send all dynamic requests to a particular set of servers. The drawback of content-based routing is that it introduces considerable overhead. Besides the actual overhead of reading and parsing the request, a connection management overhead is incurred. Namely, in order to examine the request content, the router must terminate the connection with the client, establish a connection to the server, and transfer messages between client and server. The transfer overhead is higher than for a regular (i.e., layer 4) router, which forwards messages between client and server without terminating the connection. In a straightforward implementation of content-based routing, the router is involved in the transfer of all of the data exchanged by client and server [61]. Better performance is achieved by using a TCP handoff protocol, which transfers the client connection from the router to the server in a client-transparent manner [22,46].
In this case, the router is only involved in the transfer of content sent by the client (mostly ACKs), which results in a much lower overhead than transferring all of the content sent by both server and client. For content-based dispatching, a straightforward method is to use a fixed partitioning of content across nodes. However, this may lead to low throughput because of the high propensity for load imbalances. A viable solution considered in [46,61] is to have the controller direct each request to a node that has recently serviced the same content, if that node is not heavily loaded relative to other nodes in the system. This
leads to higher locality and hit rates in server caches. Load can be estimated by the number of active connections [46] or by a weighted sum of CPU and disk utilizations, with fixed weights defined according to site characteristics [61]. The experimental evaluation presented in [46] demonstrates that a connection router using this locality-aware content-based redirection can achieve more than twice the performance of a router using WRR. However, recent studies provide experimental evidence that content-based routing implemented by the connection router does not scale well with cluster size and processing power and is limited by the capacity of the router. Addressing this drawback, [7] proposes a distributed architecture in which the request analysis is performed by the processing nodes and the dispatching decisions are made by a dedicated node. Given the relatively low overhead of decision making and load monitoring, a dispatcher node based on a 300 MHz PIII machine can sustain a peak of 50,000 connections/sec, which is about an order of magnitude more than can be achieved with a content-based front-end router that performs both request analysis and dispatching decisions. An alternative solution to the access-router bottleneck is to have the request distribution decisions made by the processing nodes. In this paradigm, after analyzing a request, a node decides by itself whether to process it locally or to forward it to a peer [8,63]. Typically, the selected peer is the one that is the least loaded, according to the current node's knowledge of its peers' loads. For instance, the proposal in [8] uses periodic broadcast to disseminate load information such as CPU load or the number of locally opened and/or redirected connections. The solution proposed in [63] uses more complex load models and acquisition protocols targeted at maximizing the accuracy of the information used in the dispatching decision.
Namely, a request is forwarded to the node with the least load, provided that load is significantly lower than that of the current node. Load is expressed by the stretch factor of the average response time, i.e., the expected increase in response time with respect to execution on an idle system. The load information may be requested at decision time, or it may be derived from possibly outdated information received previously through periodic multicast; the choice of method depends on the node's load index (i.e., a weighted sum of CPU and disk utilization). This method is suitable for heterogeneous clusters, particularly for workloads with large and highly variable per-request overheads (about 10 sec average and one order of magnitude variance) and relatively low arrival rates (1-3 sec between requests), such as for a digital library. Both [8] and [63] present experimental results demonstrating that inter-node request redirection not only solves the scalability problem but also enables a Web site application to effectively accommodate inappropriate load distributions that may result from DNS-based routing or from a high variance of per-request overheads. For instance, [8] presents experiments run on a 3-node server, with two thirds of the requests going to one node and one sixth to each of the other two. The request-forwarding mechanism is based on connection transfer [12]. The study shows that redirection based on load information (sampled at 1-sec intervals) results in significant improvements relative to the no-redirection approach: mean response time was reduced by about 70%, variance was reduced by more than an order of magnitude, and throughput increased by about 25%.

To conclude this section, we mention an important theoretical result. [29] demonstrates that a size-based request dispatching policy is optimal with respect to average response time. The size-based policy ensures that the requests directed to the same node have comparable sizes. The optimality result is due to the heavy-tailed size distribution of Web content. [29] proposes a method for partitioning the content among the processing nodes and demonstrates by simulation that this scheme results in waiting times at least one order of magnitude lower than round-robin or least-loaded-first dispatching. Unfortunately, for most Web site applications, the applicability of this result is restricted by the difficulty of determining the (approximate) request sizes at the dispatcher.

Multiple services - differentiated performance

After extensively addressing the problem of differentiated Web services in the context of a single host, research has recently addressed this problem in a cluster context.
The method proposed in [62] aims at ensuring that the various classes of service that a Web site provides receive prioritized access to resources in accordance with their relative importance to the system. More important services should be guaranteed better service quality, in particular, under heavy load situations. In addition, under light loads, requests for less important services should not starve. Service quality is quantified by the stretch factor defined earlier in this section. Using a simple queuing theory model, the authors derive basic relationships among per-class stretch factors, node assignments, resource utilization, and importance coefficients. These formulas are evaluated periodically
using information on the current per-class arrival rates and CPU utilizations to determine the number of nodes to be allocated to a class and the associated request admission rates. This information is used by an access router to limit the number of requests serviced in each class and to appropriately distribute these requests among the nodes assigned to the class. The request-to-node mapping is based on a Least Loaded policy in which the relative CPU and I/O loads of the available nodes are considered.

Multiple services - isolated performance

Numerous solutions for Web site service-level control enable each service in the system to perform at the levels specified by the SLA, as predictably as if the service were running on an isolated system. Toward this end, the performance parameters specified in the SLA are translated into resource reservations (e.g., CPU, disk, and network) that are enforced through system-specific mechanisms. Typical solutions are based on two-layer resource infrastructures. One layer performs system-level resource management, deciding how the system-level resource reservation of a service is split into per-node reservations. The other layer performs node-level resource management, enforcing the per-node service reservations. For node-level resource managers, typical solutions are based on proportional-share resource allocation methods, like the SFQ CPU scheduler considered in [54] and the Lottery Scheduling-based "resource containers" [10] considered in [6]. Each service is assigned a CPU reservation as a percentage of the node's capacity. The resource manager schedules the threads processing each service and appropriately charges the resource usage to the corresponding service reservations. Unallocated or unused resources on a node are fairly shared among the active services with reservations on that node.
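A minimal sketch of such node-level proportional sharing follows. This is not the SFQ or resource-container implementation, just an illustration of granting reservations and fairly redistributing unused capacity; the function and variable names are assumptions:

```python
# Grant each service min(reservation, demand), then redistribute leftover
# capacity among still-demanding services in proportion to reservations.

def allocate(reservations, demands, capacity=100.0):
    grants = {s: min(reservations[s], demands.get(s, 0.0)) for s in reservations}
    leftover = capacity - sum(grants.values())
    while leftover > 1e-9:
        # Services whose demand is not yet satisfied share the surplus.
        hungry = {s for s in grants if demands.get(s, 0.0) > grants[s]}
        if not hungry:
            break
        total_w = sum(reservations[s] for s in hungry)
        progress = 0.0
        for s in hungry:
            extra = min(leftover * reservations[s] / total_w,
                        demands[s] - grants[s])
            grants[s] += extra
            progress += extra
        leftover -= progress
        if progress < 1e-9:
            break
    return grants

# Service B is idle, so its 30% reservation is redistributed to A and C.
grants = allocate({"A": 50, "B": 30, "C": 20},
                  {"A": 70, "B": 0, "C": 40})
```

With these sample inputs, A and C absorb B's unused share in proportion to their own reservations, up to their demands.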
For the system-level resource manager, the goal is to ensure that each node has enough resources to process the assigned requests. The existence of this component is necessary when client requests directly hit the server nodes based on DNS routing or when the connection router policy is based on criteria other than system load, such as content locality [46]. An effective solution for the system-level component is to start with a default distribution of each service reservation to the available nodes and to periodically adjust the per-node service reservations to accommodate the actual load distribution. A per-node service reservation is increased when the service performs at
the maximum of its allocation on the node. However, adjustments do not change the system-level reservation of a service; an increase in reservation on one node is accompanied by an equivalent decrease on other nodes [6,54]. The adjustment decision is based on monitoring information regarding current per-node, per-service resource usage. In [6], the reservation is increased by a small amount proportional to the relative difference between the default per-node allocation and the actual usage, up to a bound (e.g., min{5, 500 |D - u| / D}, where D is the default allocation, equal on all nodes, u is the current utilization, and 5 is a sample bound on the reallocated amount). The actual decision is made by solving an optimization problem that minimizes the distance between the solution and the target reservation levels. The solution in [54], based on similar principles, explicitly accounts for unallocated resources: the reservation increment is a percentage of the current usage, and the default per-node reservation is not assumed to be equal on all of the nodes in the system. Both studies provide evidence that in the presence of bursty arrivals, which are typical of many Web sites, a system with dynamically adjustable per-node service reservations can achieve better resource utilization and performance than a system with fixed per-node service reservations. In addition, these solutions enable high scalability. Experimental results in [54] demonstrate that the control infrastructure scales up to 50,000 (node, service) pairs for decision periods of 15 sec. The proposal in [5] is similar but addresses a more complex, two-tier Web site architecture. The first tier includes the content delivery servers, which can be dynamically assigned to particular services depending on load variations. The second tier includes the content generation servers, which have a static assignment to services. In addition, the site has network links shared by all the services.
A connection router implements request throttling and load-based dispatching. The solution is based on an elaborate and flexible infrastructure for performance control and system management. A high-level resource manager analyzes the monitoring events. For each service, it adjusts the limits of request throttling and the allocation of first-tier servers. The limits of request throttling are determined by the load of the second-tier servers and by the utilization of network resources. To summarize, this section has presented the main system models considered in service-level control and reviewed relevant solutions in each category of models.
Despite the variety of solutions for enforcing system-specific performance goals, all of the related studies provide experimental evidence that dynamic resource allocation decisions significantly outperform static solutions in the context of typical Web site workloads. Consequently, the run-time mechanism for service-level control is a critical component of a highly efficient, high performance Web site application.
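The bounded per-node reservation adjustment attributed to [6] earlier in this section can be sketched as follows; since the formula in the original text is partly garbled, the exact form and the constants here are assumptions for illustration:

```python
BOUND = 5.0    # sample cap on the reallocated amount (the "5" in the text)
SCALE = 500.0  # sensitivity to the relative usage gap

def adjustment(default_alloc, usage):
    # Bounded increment: proportional to the relative difference between
    # the default per-node allocation and the actual usage, capped at BOUND.
    return min(BOUND, SCALE * abs(default_alloc - usage) / default_alloc)

# With these sample constants, a 1% relative gap already hits the cap.
delta = adjustment(default_alloc=100.0, usage=99.0)
```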
5. Caching within the network

Web site performance can be further improved by caching data within the network. Cached data are moved closer to clients, reducing the latency of obtaining the data. Network bandwidth is saved because cached data travel over fewer links, and since aggregate network consumption is reduced, latency can be lowered for uncached as well as cached data. Web site server CPU cycles are saved because fewer requests reach the servers at the Web site. A recent trend in architecting high-volume Web site applications is the outsourcing of static content distribution. The enabling mechanism is called a content distribution network (CDN) and consists of an infrastructure of caches within the network that replicates content as needed to meet the demands of clients across a wide area (Fig. 3). CDNs are provided by companies such as Akamai and Digital Island and are maintained in sites belonging to commercial hosting services such as Exodus. In practice, the interaction between a Web site, its clients, and a CDN occurs as follows. A Web site pushes (or the CDN prefetches) some of its content into the CDN's caches; for instance, the content selected for replication in a CDN includes frequently requested image files. When a client accesses the Web site for an HTML page, it receives a customized version of the page in which some of the links to embedded objects point to object replicas in the CDN's caches rather than to the primary copies of these objects at the Web site. The replicas are selected such that the client achieves the fastest response times. The decision is based on static information, such as geographic locations and network connectivity, and on dynamic information, such as network and cache loads. The cache infrastructure of CDNs is distributed across wide areas to enable the delivery of Web content from locations closer to clients. Therefore, CDNs contribute to reductions in client-observed response times
Fig. 3. Internet infrastructure with proxy caches and CDNs.
and in network traffic. Furthermore, CDNs improve the service quality delivered to clients by dynamically adjusting content placement and per-site resource allocation in response to demand and network conditions. In addition, a CDN service releases internal Web site resources that can then be used to provide better service for requests for non-cacheable content. Moreover, CDNs protect a Web site from unpredictable request bursts, obviating at reasonable cost the need for excess capacity at the Web site; this is because CDNs maintain the necessary network and computation resources and use them to service several Web sites. CDNs represent the Web site-biased alternative to caching within the network, while Web proxy caches represent the client-biased alternative (Fig. 3). For Web proxy servers, which are typically deployed to service clients within administrative and regional domains, the performance benefits of caching result from the overlap in Web content accesses across the client population. Documents accessed by multiple clients can be cached at the proxy server, reducing the need to repeatedly fetch the documents from a remote server. The performance of two commercial CDNs (Akamai and Digital Island) is discussed in [38], and an overview of Web proxy caching is contained in [56]. The remote caching provided by CDNs has several advantages over Web proxy caching. First, content is prefetched into CDN caches directly, whereas proxy caches are populated in response to client requests. Therefore, the Web site's overhead of offloading the content is incurred only once, rather than repeatedly for each proxy cache that accesses the site. Second, a Web site has direct control of the content cached by CDNs. It can easily keep track of cached
content and can use appropriately suited cache consistency protocols. As proxy caches are transparent to Web sites, Web sites can rely only on object expiration times to prevent proxies from serving stale data; there is no standard protocol for a server to indicate to a proxy cache that an object has become obsolete. Third, a CDN provides a Web site with accurate statistics about the access patterns (e.g., number of hits) for the content serviced from its caches. In contrast, the Web site has no information about the content serviced by Web proxy caches. One technique for getting around this problem is for each cacheable page to include a small uncacheable image file. Since Web pages typically embed several image files, the relative Web site overhead of serving the uncacheable image is low. It is then possible to determine the total number of requests for content from the site, including requests for data cached by proxy servers, by analyzing the log requests for such uncacheable image files. One advantage that proxy caching has over CDNs is that any Web site can take advantage of proxy caching at no extra cost, simply by setting expiration times and HTTP headers appropriately. By contrast, a Web site has to pay a CDN service in order to use one. A possible concern for a Web site in its selection of a CDN service is the CDN's ability to cope with unexpected request rates. The quality that a CDN can provide under overwhelming request bursts or changing network conditions depends on the amount and the distribution of its own resources. Recent research and standardization efforts have focused on enabling CDNs to interoperate in order to improve their scalability, fault tolerance, and performance. The solution proposed in [13] is based on DNS-level routing among CDNs. The DNS router, more elaborate than in the case of a single site, maps client DNS server IP addresses to a geographical region and then returns the address of the CDN serving the region.
The selection is based on CDN load. Upon receiving a client request, a CDN without a direct agreement with the target Web site acts as a regular Web proxy, retrieving content from the site and appropriately billing the site's CDN.

Consistency of network caches

A major problem with both CDN and proxy caches is maintaining cache consistency. The ideal is to achieve strong consistency, meaning that a cache consistency mechanism always returns the result of the latest write at the server. Due to network delays, it is generally not possible to achieve this literal notion
of strong consistency in Web caches. Therefore, a more relaxed notion of strong consistency is usually used with regard to the Web: e.g., a cache consistency method is strongly consistent if it never returns data that is outdated by more than t time units, where t is the delay for sending a message from a server to a cache [23]. With current standards, strong consistency can be achieved by polling. Namely, the cache polls the server on each client request to see if the cached data is current. This approach results in significant message traffic between caches and servers. It also increases response times at caches because a request cannot be satisfied from a cache until the cache has polled the server to ensure that it is sending updated content. Alternatively, strong consistency can be achieved by server-driven methods in which servers send update messages to caches when changes occur. This approach can minimize update traffic because update messages are sent only when data actually change. However, servers need to maintain state for cached objects indicating which caches store the objects. In addition, problems arise if a server cannot contact a cache due to network failure or the cache being down. One approach that can reduce the overhead of server-driven cache consistency is to use leases [14,23,28,57-59]. In this approach, a cache obtains a lease from a server in order to cache an object. The lease duration is the length of time for which the server must inform the cache of updates to the object. After the lease expires, the server no longer needs to inform the cache of updates to the object. If the cache wants to continue to receive update messages, it must renew its lease on the object. Lease durations can be adjusted to balance server and network overheads. Shorter leases require less storage at the server but larger numbers of lease renewal messages.
In the asymptotic cases, a lease length of zero degenerates into polling, while an infinite lease length degenerates into the conventional server-driven consistency approach. In the worst case when a server is unable to communicate with a cache, the lease length bounds the maximum amount by which a cached object may be obsolete. If the server and cache are always in communication, the cached object will never be obsolete (modulo communication delays). A variation on just using leases is to also use volume leases [58]. Volume leases are granted to a collection of objects known as a volume. In order to store an object, a cache must obtain both an object and volume lease for the object. Object leases are relatively long. Volume leases are relatively short. In the worst case
when a server is unable to communicate with a cache, short volume leases bound the maximum amount by which any cached object in the volume is obsolete. The cost of maintaining the leases is low because volume leases amortize the cost of lease renewal over a large number of objects. Techniques for efficiently implementing volume leases for caching dynamic objects are presented in [57]. To summarize, this section has addressed the problem of reducing requests to Web sites by caching data within the network. We focused on CDN services, which have several advantages over traditional Web proxy caches, and techniques for maintaining cache consistency.
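The object-plus-volume lease check described above can be sketched compactly; the class, method names, and lease durations are hypothetical sample values, not taken from [57,58]:

```python
# A cache entry is servable only while both its (long) object lease and
# the (short) volume lease covering all objects are still valid.
import time

class LeasedCache:
    def __init__(self):
        self.objects = {}         # url -> (data, object_lease_expiry)
        self.volume_expiry = 0.0  # one short lease amortized over all objects

    def grant(self, url, data, object_lease=3600.0, volume_lease=30.0):
        now = time.time()
        self.objects[url] = (data, now + object_lease)
        self.volume_expiry = now + volume_lease

    def get(self, url):
        now = time.time()
        entry = self.objects.get(url)
        if entry is None or now >= entry[1] or now >= self.volume_expiry:
            return None           # must renew leases / contact the server
        return entry[0]

cache = LeasedCache()
cache.grant("/a.html", "<html>...</html>")
```

In the worst case of a network partition, the server waits at most the short volume-lease duration before it can be sure that no cache is still serving the affected objects.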
6. Conclusion

There are a number of open research problems concerning the improvement of Web site performance. Additional research is needed in the area of caching dynamic Web data at remote points in the network. While several methods have been developed for maintaining consistency, more work needs to be done to demonstrate the feasibility of these methods for large numbers of Web sites and remote caches. One problem that hinders work in this area is the absence of a standard protocol, agreed upon by multiple vendors, for maintaining cache consistency. Another problem is the difficulty researchers have in obtaining realistic data from Web sites that generate significant amounts of dynamic data. Another important area of research concerns the techniques that Web content designers can use to facilitate performance. For example, Section 3 discussed how Web pages can be constructed from fragments to improve performance and to allow portions of personalized Web pages to be cached by localizing personalized information to specific fragments. More research is needed to determine optimal strategies for breaking up Web pages into fragments, both for improving performance and for making it easier to design Web content. Several aspects of the service-level control problem also need to be further explored. For instance, one research topic relates to the service quality model. Current solutions take conservative approaches to ensuring that SLAs are guaranteed in all circumstances. An important open question is whether combining minimal performance-level guarantees with best-effort reallocation of unused resources across services leads to better performance under highly bursty request arrivals.
Another topic of importance is the implications of cooperative CDNs for Web site performance. A Web site interacts with a CDN other than the one it has explicitly selected as if that CDN were a regular Web proxy cache. Therefore, the load observed by the server is larger when its CDN directs requests to other CDNs. More elaborate methods for CDN cooperation are required in order to guarantee that Web sites observe the offloading negotiated with their selected CDN.
References

[1] T. Abdelzaher, K. Shin and N. Bhatti, Performance Guarantees for Web Server End-Systems: A Control-Theoretical Approach, accepted to IEEE Transactions on Parallel and Distributed Systems, 2001.
[2] J. Almeida, M. Dabu, A. Manikutty and P. Cao, Providing Differentiated Levels of Service in Web Content Hosting, Workshop on Internet Server Performance, 1998.
[3] Apache Software Foundation, Apache HTTP Server Project, http://www.apache.org/.
[4] G. Apostolopoulos, V. Peris and D. Saha, Transport Layer Security: How much does it really cost?, Proceedings of IEEE INFOCOM, 1999.
[5] K. Appleby, S. Fakhouri, L. Fong, G. Goldszmidt, M. Kalantar, S. Krishnakumar, D.P. Pazel, J. Pershing and B. Rochwerger, Oceano - SLA Based Management of a Computing Utility, Proceedings of the IFIP/IEEE International Symposium on Integrated Network Management, 2001.
[6] M. Aron, P. Druschel and W. Zwaenepoel, Cluster Reserves: A Mechanism for Resource Management in Cluster-based Network Servers, Proceedings of ACM SIGMETRICS, 2000.
[7] M. Aron, D. Sanders, P. Druschel and W. Zwaenepoel, Scalable Content-aware Request Distribution in Cluster-based Network Servers, Proceedings of the Annual USENIX Technical Conference, 2000.
[8] L. Aversa and A. Bestavros, Load Balancing a Cluster of Web Servers Using Distributed Packet Rewriting, Proceedings of the International Performance, Computing, and Communications Conference, 2000.
[9] G. Banga and P. Druschel, Lazy Receiver Processing (LRP): A Network Subsystem Architecture for Server Systems, Proceedings of the USENIX Symposium on Operating Systems Design and Implementation, 1996.
[10] G. Banga, P. Druschel and J. Mogul, Resource containers: A new facility for resource management in server systems, Proceedings of the USENIX Symposium on Operating Systems Design and Implementation, 1999.
[11] G. Banga and J. Mogul, Scalable kernel performance for Internet servers under realistic loads, USENIX Symposium on Operating Systems Design and Implementation, 1998.
[12] A. Bestavros, M. Crovella, J. Liu and D. Martin, Distributed Packet Rewriting and its Application to Scalable Server Architectures, Proceedings of the International Conference on Network Protocols, 1998.
[13] A. Biliris, C. Cranor, F. Douglis, M. Rabinovich, S. Sibal, O. Spatscheck and W. Sturm, CDN Brokering, Proceedings of the International Workshop on Web Caching and Content Distribution, 2001.
A. Iyengar and D. Rosu / Architecting Web sites for high performance

[14] P. Cao and C. Liu, Maintaining Strong Cache Consistency in the World Wide Web, Proceedings of the 17th International Conference on Distributed Computing Systems, 1997.
[15] V. Cardellini, M. Colajanni and P. Yu, DNS dispatching algorithms with state estimators for scalable Web-server clusters, World Wide Web Journal 2(3) (1999).
[16] J. Challenger, P. Dantzig and A. Iyengar, A Scalable and Highly Available System for Serving Dynamic Data at Frequently Accessed Web Sites, Proceedings of SC98, 1998.
[17] J. Challenger, A. Iyengar and P. Dantzig, A Scalable System for Consistently Caching Dynamic Web Data, Proceedings of IEEE INFOCOM'99.
[18] J. Challenger, A. Iyengar, K. Witting, C. Ferstat and P. Reed, A Publishing System for Efficiently Creating Dynamic Web Content, Proceedings of IEEE INFOCOM, 2000.
[19] M. Colajanni, P. Yu and D. Dias, Scheduling Algorithms for Distributed Web Servers, Proceedings of the International Conference on Distributed Computing Systems, 1997.
[20] L. Degenaro, A. Iyengar, I. Lipkind and I. Rouvellou, A Middleware System Which Intelligently Caches Query Results, Proceedings of Middleware, 2000.
[21] P. Deolasee, A. Katkar, A. Panchbudhe, K. Ramamritham and P. Shenoy, Adaptive Push-Pull of Dynamic Web Data: Better Resiliency, Scalability and Coherency, Proceedings of the International World Wide Web Conference, 2001.
[22] D. Dias, W. Kish, R. Mukherjee and R. Tewari, A Scalable and Highly Available Web Server, Proceedings of the IEEE Computer Conference, 1996.
[23] V. Duvvuri, P. Shenoy and R. Tewari, Adaptive Leases: A Strong Consistency Mechanism for the World Wide Web, Proceedings of IEEE INFOCOM, 2000.
[24] A. Feldmann, R. Caceres, F. Douglis, G. Glass and M. Rabinovich, Performance of Web Proxy Caching in Heterogeneous Bandwidth Environments, Proceedings of IEEE INFOCOM, 1999.
[25] D. Florescu, A. Levy, D. Suciu and K. Yagoub, Optimization of Run-time Management of Data Intensive Web Sites, Proceedings of the VLDB Conference, 1999.
[26] A. Fox, S. Gribble, Y. Chawathe, E. Brewer and P. Gauthier, Cluster-Based Scalable Network Services, Proceedings of the ACM Symposium on Operating Systems Principles, 1997.
[27] A. Freier, P. Karlton and P. Kocher, The SSL Protocol, http://home.netscape.com/eng/ssl3/draft302.txt.
[28] C. Gray and D. Cheriton, Leases: An Efficient Fault-Tolerant Mechanism for Distributed File Cache Consistency, Proceedings of the 12th ACM Symposium on Operating Systems Principles, 1989.
[29] M. Harchol-Balter, M. Crovella and C. Murta, On Choosing a Task Assignment Policy for a Distributed Server System, Proceedings of Performance Tools, 1998.
[30] V. Holmedahl, B. Smith and T. Yang, Cooperative Caching of Dynamic Content on a Distributed Web Server, IEEE International Symposium on High-Performance Distributed Computing, 1998.
[31] G. Hunt, G. Goldszmidt, R. King and R. Mukherjee, Network Dispatcher: A Connection Router for Scalable Internet Services, Proceedings of the 7th International World Wide Web Conference, 1998.
[32] IBM Corporation, IBM Netfinity Web Server Accelerator V2.0, http://www.pc.ibm.com/us/solutions/netfinity/server_accelerator.html.
[33] IBM Corporation, WebSphere Application Server, http://www-4.ibm.com/software/webservers/appserv/pr.version4.html.
[34] A. Iyengar, J. Challenger, D. Dias and P. Dantzig, High-Performance Web Site Design Techniques, IEEE Internet Computing 4(2) (2000).
[35] A. Iyengar and J. Challenger, Improving Web Server Performance by Caching Dynamic Data, Proceedings of the USENIX Symposium on Internet Technologies and Systems, 1997.
[36] A. Iyengar, E. MacNair and T. Nguyen, An Analysis of Web Server Performance, Proceedings of GLOBECOM, 1997.
[37] A. Iyengar, M. Squillante and L. Zhang, Analysis and characterization of large-scale Web server access patterns and performance, World Wide Web 2(1,2) (1999).
[38] K. Johnson, J. Carr, M. Day and F. Kaashoek, The Measured Performance of Content Distribution Networks, Proceedings of the International Web Caching and Content Delivery Workshop, 2000.
[39] P. Joubert, R. King, R. Neves, M. Russinovich and J. Tracey, High-Performance Memory-Based Web Servers: Kernel and User-Space Performance, Proceedings of the USENIX Annual Technical Conference, 2001.
[40] B. Krishnamurthy, J. Mogul and D. Kristol, Key differences between HTTP/1.0 and HTTP/1.1, Proceedings of the 8th International World Wide Web Conference, 1999.
[41] E. Levy-Abegnoli, A. Iyengar, J. Song and D. Dias, Design and Performance of a Web Server Accelerator, Proceedings of IEEE INFOCOM, 1999.
[42] Q. Luo, J. Naughton, R. Krishnamurthy, P. Cao and Y. Li, Active Query Caching for Database Web Servers, WebDB (Informal Proceedings), 2000.
[43] J. Mogul, The case for persistent-connection HTTP, Proceedings of SIGCOMM, 1995.
[44] I. Molnar, Answers from planet TUX: Ingo Molnar responds, Slashdot, http://slashdot.org/articles/00/07/20/1440204.shtml.
[45] Open Market, FastCGI, http://www.fastcgi.com/.
[46] V. Pai, M. Aron, G. Banga, M. Svendsen, P. Druschel, W. Zwaenepoel and E. Nahum, Locality-Aware Request Distribution in Cluster-based Network Servers, Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems, 1998.
[47] V. Pai, P. Druschel and W. Zwaenepoel, IO-Lite: A Unified I/O Buffering and Caching System, Proceedings of the USENIX Symposium on Operating Systems Design and Implementation, 1999.
[48] K. Rajamani and A. Cox, A Simple and Effective Caching Scheme for Dynamic Content, Rice University CS Technical Report TR 00-371, 2000.
[49] P. Rodriguez, C. Spanner and E. Biersack, Web Caching Architectures: Hierarchical and Distributed Caching, Proceedings of the International Web Caching Workshop, 1999.
[50] T. Schroeder, S. Goddard and B. Ramamurthy, Scalable Web Server Clustering Technologies, IEEE Network (May/June 2000).
[51] A. Shaikh, R. Tewari and M. Agrawal, On the Effectiveness of DNS-based Server Selection, Proceedings of IEEE INFOCOM, 2001.
[52] J. Song, E. Levy, A. Iyengar and D. Dias, Design Alternatives for Scalable Web Server Accelerators, Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, 2000.
[53] J. Song, A. Iyengar, E. Levy and D. Dias, Architecture of a Web Server Accelerator, Computer Networks 38(1), Jan. 2002.
[54] B. Urgaonkar and P. Shenoy, Sharc: Managing Resources in Shared Clusters, Technical Report TR01-08, Department of Computer Science, University of Massachusetts, 2001.
[55] A. van de Ven, kHTTPd Linux http accelerator, http://www.fenrus.demon.nl.
[56] J. Wang, A Survey of Web Caching Schemes for the Internet, ACM Computer Communication Review 29(5) (1999).
[57] J. Yin, L. Alvisi, M. Dahlin and A. Iyengar, Engineering server-driven consistency for large scale dynamic web services, Proceedings of the 10th International World Wide Web Conference, 2001.
[58] J. Yin, L. Alvisi, M. Dahlin and C. Lin, Volume Leases for Consistency in Large-Scale Systems, IEEE Transactions on Knowledge and Data Engineering 11(4) (1999).
[59] H. Yu, L. Breslau and S. Shenker, A Scalable Web Cache Consistency Architecture, Proceedings of ACM SIGCOMM'99.
[60] Zeus Technology Limited, Zeus Web Server, http://www.zeus.co.uk.
[61] X. Zhang, M. Barrientos, B. Chen and M. Seltzer, HACC: An Architecture for Cluster-Based Web Servers, Proceedings of the USENIX Windows NT Symposium, 1999.
[62] H. Zhu, H. Tang and T. Yang, Demand-driven Service Differentiation in Cluster-based Network Services, Proceedings of IEEE INFOCOM, 2001.
[63] H. Zhu, T. Yang, D. Watson, O. Ibarra and T. Smith, Adaptive Load Sharing for Clustered Digital Library Servers, International Journal on Digital Libraries 2(4) (2000).
[64] H. Zhu and T. Yang, Class-based Cache Management for Dynamic Web Content, Proceedings of IEEE INFOCOM, 2001.
Scientific Programming 10 (2002) 91-100, IOS Press
Mobile objects in Java

Luc Moreau (a) and Daniel Ribbens (b)

(a) Electronics and Computer Science, University of Southampton, SO17 1BJ Southampton, UK. Tel.: +44 23 8059 4487; Fax: +44 23 8059 2865; E-mail: [email protected]
(b) Service d'Informatique, University of Liege, 4000 Liege, Belgium. Tel.: +32 4 366 2640; Fax: +32 4 366 2984; E-mail: [email protected]
Abstract: Mobile Objects in Java provides support for object mobility in Java. Similarly to the RMI technique, a notion of client-side stub, called startpoint, is used to communicate transparently with a server-side stub, called endpoint. Objects and associated endpoints are allowed to migrate. Our approach takes care of routing method calls using an algorithm that we studied in [22]. The purpose of this paper is to present and evaluate the implementation of this algorithm in Java. In particular, two different strategies for routing method invocations are investigated, namely call forwarding and referrals. The result of our experimentation shows that the latter can be more efficient by up to 19%.
1. Introduction

Over the last few years, mobile agents have emerged as a powerful paradigm for structuring complex distributed applications. Intuitively, a mobile agent can be seen as running software that may decide to suspend its execution on a host and transfer its state to another host, where it can resume its activity. Cardelli [5] argues that mobile agents are the right abstraction to develop applications distributed across "network barriers", e.g. in the presence of firewalls or when connectivity is intermittent. In Telescript [17], software migration is presented as an alternative to communications over a wide-area network, in which clients move to servers to perform computations. Lange [16] sees mobile agents as an evolution of the client-server paradigm, and enumerates several reasons for using software mobility.

It is a challenge to design and implement mobile agent applications because numerous problems, such as security, resource discovery and communications, need to be addressed. Therefore, we introduce Mobile Objects in Java, a middleware that helps implement mobile agent systems by providing a concept of mobile object. Its specific contribution is a communication mechanism consisting of the invocation of methods on objects that may be mobile.

Our motivation has been driven by developments in distributed computing over the last couple of decades.
Successive paradigms such as remote procedure calls (RPC) [3], method invocation in Network Objects [4] and remote method invocation (RMI) in Java [12], amongst others, abstract away from the reality of distribution. They successively provided programmers with new and more sophisticated abstractions. RPC provides homogeneity, by its marshalling and unmarshalling of data structures using data representation suitable for heterogeneous platforms. Network Objects offers memory uniformity because remote method invocations are syntactically identical to local ones and garbage collection takes care of local and distributed objects. Java RMI provides code propagation because the programmer no longer needs to replicate code to remote machines, but instead Java RMI is able to load code dynamically. The next logical step is to hide the location and movement of objects. A similar approach has also been adopted by the network community, which devised the next generation of the IP protocol (IPv6) with support for mobile addresses [14]. There exists an incremental approach to introduce mobility into an infrastructure that is unaware of mobility [7,16]. It consists of associating each mobile entity with a stationary home agent, which acts as an intermediary for all communications. While this approach preserves compatibility with an existing infrastructure, introducing an indirection to a home agent for every
communication puts a burden on the infrastructure; this may hamper the scalability of the approach, in particular in massively distributed systems, such as the amorphous computer [27] or the ubiquitous/pervasive computing environment [1].

Free from any compatibility constraint, we adopted an algorithm to route messages to mobile agents that does not require any static location: the theoretical definition of this algorithm, based on forwarding pointers, and the proof of its correctness have been investigated in a previous publication [22]. The purpose of this paper is to present Mobile Objects in Java, an implementation of the algorithm, which offers transparent method invocation and distributed garbage collection for mobile objects. By transparent, we mean that mobile and non-mobile objects present the same interface, which is independent of the object's location and its migratory status. Distributed garbage collection ensures that an object, whether mobile or not, can be reclaimed once it is no longer referenced. While implementing our algorithm, it became clear that two strategies could be adopted, which we named call forwarding and referrals; we present these strategies and evaluate their performance through a benchmark.

This paper is organised as follows. First, we provide more motivation for mobile agents by presenting two promising application domains in Section 2. Then, in Section 3, we summarise the algorithm we have investigated in [22]. In Section 4, we describe its implementation in Java, providing a transparent interface to mobile objects. We then discuss two different methods for routing method invocations, namely forwarding and referrals, in Section 5. In order to compare these techniques, we devise a synthetic benchmark, and analyse results in Section 6. Finally, we compare our approach with related work in Section 7.
2. Motivation

In this section, we provide further motivation for mobile agents. We describe two promising applications where mobile agents act semi-autonomously on behalf of users. The reasons for doing so, however, differ substantially in the two applications.

Digital library

Yan and Rana [30] present a high-level service for a digital library of radar images of the Earth. The library is composed of a set of confidential images and associated annotations with attached ownership. They
extend a Web-based client-server architecture with mobile agents that perform tasks on behalf of users and that are able to migrate along a predefined itinerary of hosts. After being dispatched, agents migrate securely, with data, code and state, to an itinerary of servers that may have relevant data and services. Agents become independent of the user who created them: they can survive intermittent or unreliable network connections. Mobile agents are beneficial for several reasons:
(i) they avoid the delivery of the large volumes of scientific data required for data mining of images;
(ii) they help maintain confidentiality and ownership of data, by being run through security checks ensuring that they have the rights to access the data;
(iii) they are allowed specific queries on the library according to the "security level" they were granted.
Fig. 1. Architecture.
Mobile users

The context of the Magnitude project [24] is the "ubiquitous computing environment" [27], where embedded devices and artifacts abound in buildings and homes, and have the ability to sense and interact with devices carried by people in their vicinity. Applications running on mobile devices interact with the infrastructure, and find and exploit services to fulfill the user's needs. However, communications between mobile devices and the infrastructure have some limitations, in the form of intermittent connectivity and low bandwidth. Furthermore, the processing power and memory capacity of compact mobile devices remain relatively small. As a result, such an environment would prevent the large-scale deployment of advanced services that are communication and computation intensive. We adopt mobile agents as proxies for mobile users. As illustrated by Fig. 1, we utilise mobile agents, as
semi-autonomous entities, which can migrate from mobile devices to infrastructure locations to take advantage of the resources their specific tasks require; mobile agents perform their tasks on the infrastructure, possibly involving further migration, and then return results to mobile users.

Summary

Both scenarios use the idea of a mobile agent as a semi-autonomous proxy for a user. If granted the right to do so, mobile agents may migrate to new locations, where they can take advantage of local resources.
3. Message routing algorithm

In this section, we summarise the message routing algorithm for mobile agents that we formalised in [22]. We consider a set of mobile objects and a set of sites (in our case JVMs) taking part in a computation. Each mobile object is associated with a timestamp, a counter incremented every time the object changes location. Each site keeps a record of the location where every mobile object known to the site is thought to be, and of the timestamp the object had at that time. Therefore, in a system composed of several sites, sites may hold different information about the same mobile object (depending on how fast location information is propagated between sites).

The algorithm proceeds as follows. When a mobile object decides to migrate from a site A to another site B, it informs A of its intention to migrate; a transportation service is used to transport the object to B. When the mobile object arrives at B, its safe arrival is acknowledged by informing its previous site A of its new location and of its new timestamp; site A can then update its local table with the mobile object's new position and timestamp.

Mobile objects delegate to sites the task of sending messages to other objects. When a site receives a request for sending a message to a mobile object, it searches its table in order to find the object's location. If the object is local, the message is passed on to the object. If the object is not local, but known to be at a remote location, the message is forwarded to the remote location. As migration is not atomic, a mobile object may have left a site before the acknowledgement of its safe arrival has been received by that site. In such a case, the site temporarily has to enqueue messages
aimed at the object; as soon as the acknowledgement arrives, delayed messages may be forwarded.

Timestamps are used to guarantee that sites always update their knowledge about mobile objects with more recent information than the information they currently hold. If a site receives information with a timestamp that is smaller than the timestamp in its table, the received information is discarded. Such a timestamp mechanism is mandatory to avoid cyclic routing of messages [22].

In the algorithm described so far, a mobile object leaves a trail of forwarding pointers during its migration. In order to reduce the length of the chain of forwarding pointers, routing information and the associated timestamp may be propagated by any site to any other site; timestamps are again used to guarantee that the most recent information is stored in routing tables. In the rest of the paper, we discuss an implementation of this abstract algorithm.

4. Implementation in Java

In Java RMI [12], an object whose methods can be invoked from another JVM is implemented by a remote object. Such a remote object is described by one or more remote interfaces. Remote method invocation is the action of invoking a method of a remote interface on a remote object. In practice, a stub acts as a client's local representative or proxy for a remote object. The stub of a remote object implements the same interface as the remote object: when a method is called on the stub, arguments are serialised and communicated to the remote object,1 where the method can be called; its result is transmitted back to the stub and becomes the method call result. A very desirable feature of this approach is that local and remote method invocation share an identical syntax.

Now, we present an approach in which remote objects are allowed to be mobile, but clients still use the same stub-based method invocation mechanism, making them unaware of the location and movement of the mobile object.

4.1. Startpoints and endpoints

Figure 2 displays the different entities of our implementation. The right-hand side of the picture represents the "server-side" on JVM2, composed of a mobile object; the left-hand side is concerned with the "client-side" on JVM1. We adopt Nexus terminology [8], and we respectively name startpoint and endpoint the client-side and server-side representatives of a remote object.

1 Before Java 1.2, there was a notion of skeleton, which was a server-side representative of the object, responsible for deserialising the arguments.

Fig. 2. Startpoint and endpoint.

A mobile object is specified by an interface, which must also be implemented by its startpoints. Startpoints contain an RMI stub representing the current location of a mobile object, and permit direct communication with the endpoint; the endpoint passes messages to the mobile object. Additionally, the startpoint contains the mobile object's timestamp t. (The endpoint holds the same timestamp t.)

Figure 3 displays the new configuration after the mobile object has migrated to JVM3. There exists a new endpoint acting as a server-side representative at the new location. Its timestamp is t + 1, following its increase after migration. The endpoint is referred to by a startpoint with timestamp t + 1, which is sent to JVM2 as an acknowledgement of the safe arrival at JVM3. This startpoint is used by the endpoint on JVM2 as a forwarding pointer to the new object location. When a method is activated on the startpoint on JVM1, the call is still transmitted to JVM2, where the endpoint is aware that the object has moved to JVM3 and uses the same mechanism to forward the call.

As opposed to simple message passing, a remote method invocation is expected to return a result.2 In a first instance, our implementation is based on call forwarding and the result is propagated back along the chain to the initial startpoint where the method call was initiated. In such an algorithm, it is important to reduce any chain of forwarding pointers in order to reduce the cost

2 In the particular case of a procedure returning the type void, no result is returned, but the method invocation terminates after the call has been completed remotely; for the sake of presentation, we will no longer distinguish this case from the normal return of values.
of method invocation, but also to make the system more resilient to failures.

In Fig. 3, when JVM2 has to forward a call to JVM3, JVM2 knows that the information on JVM1 is out of date. Therefore, when the result is returned to JVM1, we can also return updated information about the mobile object's location. To this end, we made the remote interface implemented by endpoints different from the interface implemented by mobile objects: we return not only the "usual result", but also the new object location.

Fig. 3. Mobile object migration.

Returning updated location information at the same time as returning results may not propagate information soon enough, because processing on the server may be long. Therefore, independently, we might like to inform previous JVMs in the chain about the location of the mobile object. Since regular remote method invocation does not give any information about the method caller, we provide, as extra arguments, the stubs pointing to the JVMs involved in the chain.

In summary, a startpoint implements the same interface as a mobile object. An endpoint has a derived interface passing extra routing information, both during the forwarded call and during the return of a result. The RMI stub encapsulated in the startpoint implements the same interface as the endpoint.

For the sake of illustration, let us consider the method talk specified in the interface Talker implemented by a mobile object:

    interface Talker {
        int talk(int v, String s);
    }

A startpoint associated with such a mobile object also implements the interface Talker. The endpoint of such a mobile object implements the _Endpoint_TalkerI interface containing a method talk:

    interface _Endpoint_TalkerI {
        _int_Result talk(List __from, int __v, String __s);
    }

The extra argument __from is a list of RMI stubs to the JVMs that were involved in the passing of the current method call. The type _int_Result encapsulates an int as well as new routing information. We have implemented a stub compiler which takes care of generating such interfaces. It also creates the definitions of the startpoint and endpoint classes.

4.2. Object migration

We provide a new abstract class UnicastMobileObject, which encapsulates the behaviour common to all mobile objects. A mobile object must be defined as a subclass of UnicastMobileObject, from which two methods can be inherited:

    protected void migrate(String url, Serializable state)
    protected void install(Object state)
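For instance, a minimal mobile object might be sketched as follows. This is our own illustration, not code from the paper: the superclass here is a self-contained stand-in for UnicastMobileObject (the real class serialises the object and transports it to the destination platform before invoking install):

```java
import java.io.Serializable;

// Stand-in for the paper's UnicastMobileObject, so the sketch compiles
// locally: a real implementation moves the object to the target platform
// before activating install with the migrated state.
abstract class UnicastMobileObjectSketch {
    protected void migrate(String url, Serializable state) {
        install(state);   // in reality: serialise, transport, then install remotely
    }
    protected void install(Object state) { }
}

// Hypothetical agent that counts the platforms it has visited,
// carrying the count along as its migrated state.
class CounterAgent extends UnicastMobileObjectSketch {
    int visits;

    void visit(String platformUrl) {
        migrate(platformUrl, Integer.valueOf(visits)); // state travels with the object
    }

    @Override
    protected void install(Object state) {
        visits = (Integer) state + 1;  // resume from the migrated state
    }
}
```

The pattern illustrates the division of labour: the object chooses when to move and what state to carry; the middleware decides how the move happens.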
A mobile object can initiate its migration to another JVM, identified by an RMI-style URL, using the method migrate. The current object content will be serialized in conjunction with an extra argument. Upon the object's arrival, the method install is activated with the state argument passed to migrate. Both methods are defined as "protected" to guarantee that they are invoked only under the object's control.

The Java object model does not make any guarantee regarding which thread executes remote methods. Therefore, for a single object, there may be several threads executing in parallel when a request for migrating is issued by one of them. As Java does not support thread migration, it is not possible to suspend the execution of all threads in order to resume them at the destination. Instead, we allow an object to migrate only when there is a single thread executing a method of this object. It is therefore the programmer's responsibility to synchronize and terminate threads currently
executing in parallel, and, if necessary, to save their state in a serializable field of the object.

Mobile Objects in Java also introduces the concept of a "platform", a JVM that runs mobile objects securely. A platform is an RMI UnicastRemoteObject which advertises its presence by binding itself with an RMI-style URL (specified at construction time) in an RMI registry. Such a URL is expected as the first argument of migrate. Hooks are provided to perform security checks before executing objects in their sandbox [20].

4.3. Startpoint deserialisation

In our system, on a given platform, there is at most one instance of a startpoint that refers to a given mobile object. In order to preserve this invariant, each platform maintains a table of all the startpoints it knows, which is updated when startpoints are deserialised. (We use the Java method readResolve [13] to override the object returned from the stream.) A desirable consequence of this implementation is that all objects using a specific startpoint share the benefit of the most recent routing information for that startpoint.

The table of startpoints is a hash table, using the unique name given to mobile objects as a hashing key. This table uses weak references [11] to guarantee that startpoints do not remain accessible longer than necessary. As a result, we ensure that mobile objects may be properly garbage collected.

4.4. Clearing routing information

Routing information has to be cleared when it is no longer needed. Indeed, platforms run for a long period of time and host many visiting mobile objects, which leave forwarding pointers as they migrate to their next destination. We need to ensure that routing tables do not become filled with unnecessary routing information. We have observed [22] that the task of clearing routing tables is equivalent to the distributed termination
problem [25]. A forwarding endpoint is allowed to be cleared if it can be proved that no other platform will ever forward method calls to it. This may be implemented using a distributed reference counting algorithm [23,25]. In particular, RMI provides the Unreferenced interface for remote objects, whose unreferenced method is called when no remote reference to the object remains [12]. When this method is called on an endpoint, the endpoint may be unexported, and the reference to the next startpoint in the chain may be dropped. Note that this mechanism can only work if tables of startpoints contain weak references to them. Otherwise, if startpoints remain live, the RMI stubs they contain will also remain live, which will prevent the call of the unreferenced method on the associated endpoints.
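The per-platform startpoint table of Section 4.3 can be sketched as follows. The class and method names are our own illustration, not the paper's actual API; the key property is that deserialisation always resolves to the one canonical startpoint per mobile object, held only weakly so that unused startpoints (and hence their RMI stubs) can be collected:

```java
import java.lang.ref.WeakReference;
import java.util.HashMap;
import java.util.Map;

// Sketch: at most one startpoint per mobile object on a platform.
// Weak references let unused startpoints be garbage collected, so that
// the endpoint's unreferenced notification can eventually fire.
class StartpointTable {
    static class Startpoint {
        final String name;                 // unique name of the mobile object
        Startpoint(String n) { name = n; }
    }

    private final Map<String, WeakReference<Startpoint>> table = new HashMap<>();

    // Called during deserialisation (cf. readResolve): return the canonical
    // startpoint for this name, registering the candidate if none is live.
    synchronized Startpoint resolve(Startpoint candidate) {
        WeakReference<Startpoint> ref = table.get(candidate.name);
        Startpoint existing = (ref == null) ? null : ref.get();
        if (existing != null) return existing;
        table.put(candidate.name, new WeakReference<>(candidate));
        return candidate;
    }
}
```

Because every deserialised copy collapses onto the canonical instance, all holders of a startpoint automatically share its freshest routing information.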
5. Forwarding vs referrals In our theoretical algorithm [22], messages are routed individually; a reply is regarded as a separate message to be routed independently. The view that we have adopted for Mobile Objects in Java differs slightly because it is based on the remote method invocation paradigm: methods are invoked and are expected to produce a result. In the previous section, we showed that the result could be propagated backwards along the chain of forwarding pointers left by the mobile object. However, long chains of remote method invocations offer little resilience to failures of intermediary nodes. Instead of forwarding a method call, an endpoint can throw an exception indicating that the mobile object has migrated; the exception contains the new startpoint pointing at the mobile object's location. Throwing an exception containing a new startpoint, instead of forwarding a call, is similar to the referral mechanism [9] used in distributed search systems such as Whois++ [26] and LDAP [28]. It puts the onus on the method invoker to re-try the invocation at the next location of the object; once the object has been reached, the result may be returned to the caller directly. In our implementation, the startpoint is in charge of re-trying a method invocation until it succeeds. Therefore, from the programmer's viewpoint, there is no syntactic difference between the two approaches. An option passed to the stub compiler specifies whether code is generated for referrals or for call forwarding. In the rest of the paper, we compare the performance of the two approaches.
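As an illustration of the referral approach, the caller-side retry loop could look like the following sketch. MovedException, Platform and all names here are our assumptions, not the paper's API, and a plain in-process map stands in for the network of RMI platforms.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the referral technique of Section 5: instead of forwarding
// a call, a platform throws an exception carrying the object's new
// location, and the caller-side startpoint retries until it succeeds.
public class Referrals {

    // Thrown by a platform when the target object has migrated away.
    static class MovedException extends Exception {
        final String newLocation;   // referral: where to retry
        MovedException(String loc) { this.newLocation = loc; }
    }

    // Minimal stand-in for a remote platform hosting mobile objects.
    interface Platform {
        String invoke(String objectName) throws MovedException;
    }

    // Caller-side retry loop: follow referrals until the call succeeds,
    // so the result returns to the caller directly from the final host.
    static String invokeWithReferrals(Map<String, Platform> network,
                                      String location, String objectName) {
        while (true) {
            try {
                return network.get(location).invoke(objectName);
            } catch (MovedException e) {
                location = e.newLocation;  // re-try at the referred location
            }
        }
    }
}
```

Because the retry loop lives inside the startpoint, code generated for referrals and code generated for call forwarding are indistinguishable at the call site, matching the paper's claim that there is no syntactic difference between the two approaches.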
6. Benchmark The scientific programming community has a tradition of adopting benchmarks to evaluate the performance of computers; for instance, the Linpack Benchmark is a numerically intensive test used to measure floating-point performance. Unfortunately, we lack benchmarks specifically suited to evaluating routing algorithms for mobile objects. This may be explained by the relative novelty of the concept of a mobile object, and the absence of widely accepted applications for mobile agents. In a previous paper [23], we observed that there was no recognised benchmark for evaluating distributed garbage collectors; we therefore designed synthetic benchmarks for that type of distributed algorithm. We propose to adopt a similar approach here. A synthetic benchmark is an abstraction of a real program in which the routing of messages may have an impact on the performance of the computation. In our benchmark, we measure the cost of invoking a method on a mobile object that has changed location since the last time the method was invoked on it. In the context of the Magnitude architecture of Section 2, such a benchmark is reminiscent of the communications one may have with a mobile agent visiting several locations to perform a task. Figure 4 summarises the "Itinerary Benchmark". An itinerary consists of N platforms P0, ..., PN-1 to be visited by a mobile object. A platform P, not part of the itinerary, is used to initiate invocations of a method m on the remote mobile object. Every method invocation takes as argument a list of J platform identifiers that the mobile object has to migrate to successively; an itinerary is completed when the mobile object returns to the first platform P0. As method m is invoked on the mobile object, it spawns a thread responsible for migrating the mobile object to the J platforms, while method m terminates in parallel. On platform P, we measure the time taken to perform all the method calls necessary to complete an itinerary.
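The itinerary-construction step (splitting the platform indices 1, ..., N-1 into successive jump lists of at most J platforms, optionally shuffled for random itineraries) can be sketched as follows; the class and method names are ours, not the paper's.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch of the partitioning step of the Itinerary Benchmark: the
// indices 1..N-1 are split into successive jump lists of (at most) J
// platform identifiers, each passed to one invocation of method m.
public class Itineraries {

    static List<List<Integer>> partition(int n, int j, boolean random) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 1; i <= n - 1; i++) indices.add(i);
        if (random) Collections.shuffle(indices);   // "Random" variant
        List<List<Integer>> subsets = new ArrayList<>();
        for (int start = 0; start < indices.size(); start += j) {
            // Last subset may be shorter: N-1 = J(k-1) + r with r <= J.
            subsets.add(new ArrayList<>(
                indices.subList(start, Math.min(start + j, indices.size()))));
        }
        return subsets;
    }
}
```

For N = 10 and J = 3 this yields three jump lists of three platforms each; the benchmark then issues one invocation of m per list, plus a final invocation with [0] to bring the object back to P0.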
Set of platforms: P0, P1, ..., PN-1
Benchmark platform: P, with P ≠ Pi
Number of jumps: J
Number of itineraries: C

Initial configuration:
• On P0, create a mobile object o that knows of all platforms P0, ..., PN-1.

On the benchmark platform:
• Repeat C times:
  - Create a partition {[n11, ..., n1J], [n21, ..., n2J], ..., [nk1, ..., nkr]} of the integers in the range [1, N-1], with N-1 = J(k-1) + r, r ≤ J;
  - For each subset [ni1, ..., niJ]: invoke method m on object o with arguments [ni1, ..., niJ];
  - Invoke method m on object o with arguments [0].

Method m of object o:
• When m is activated with argument [ni1, ..., niJ]:
  - In a separate thread, migrate object o successively to platforms Pni1, ..., PniJ;
  - Return from method m.

Fig. 4. Itinerary Benchmark.

Figure 5 illustrates the execution of the Itinerary benchmark over 10 platforms (5 rather heavily loaded workstations/servers, each running 2 JVMs), connected by a local area network. Each method call forced the object to migrate to one new location. We ran the same benchmark using both the call forwarding and the referrals techniques. We can see that, in this specific instance, referrals are on average 9% faster than call forwarding over 200 itineraries. Note the abnormal duration of the first itinerary in Fig. 5: indeed, it can be up to an order of magnitude slower than the others, since it forces object byte-code to be loaded dynamically as the mobile object visits each platform for the first time.

In Fig. 6, we summarise our results, which we now discuss. Several variants of the Itinerary benchmark were considered.

(i) We always ran the Itinerary benchmark on 10 platforms. In one case, the platforms executed on 5 rather heavily loaded workstations/servers, each running 2 JVMs, connected by a 100 Mb local area network (Notation: LAN). In the other case, the platforms executed on 5 nodes of a cluster (Linux 2.2, 450 MHz) with a dedicated 100 Mb network, each node running 2 JVMs (Notation: Cluster).

(ii) The partitioning of the platforms may be deterministic or non-deterministic. In the former case, the object systematically visits the platforms in the same order (Notation: Sequential). In the latter case, the order of the platforms is decided randomly for each itinerary (Notation: Random).

(iii) We ran the Itinerary benchmark using both the call forwarding (Notation: CF) and the referrals (Notation: Ref) techniques.

(iv) When a mobile object migrates to successive locations, its new position can be acknowledged to all its previous locations (Notation: Eager Acknowledgement), or to its directly previous
location only (Notation: One Acknowledgement).

In order to reduce some of the non-deterministic nature of the benchmark, we introduced a delay between each method call to the mobile object, which gave the object time to migrate to its new location. This delay is not included in the results.

In the first table of Fig. 6, eager acknowledgement of object migration resulted in method calls being forwarded at most once. This is confirmed by the average duration of a method call, which does not vary significantly as J, the number of migrations associated with a method call, increases. We also observe that there is no significant difference between sequential and random itineraries. Finally, the referrals technique appears to be marginally more efficient than call forwarding.

In the second table of Fig. 6, the acknowledgement of the object's position is back-propagated to the object's previous location only. Therefore, as we increase J, the number of platforms that the mobile object has to migrate to for each method call, method calls have to be forwarded further. Again, we do not observe any significant difference between sequential and random itineraries. However, the referrals technique becomes significantly more efficient than call forwarding: its efficiency gain is in the range [11%-19%] for a LAN, whereas it is in the range [6%-11%] for a cluster.

Instrumenting the Itinerary benchmark turned out to be more difficult than anticipated. Indeed, many elements not in our control interact with our implementation. In particular, platform-to-platform communications were implemented with Java RMI, which uses Birrell's distributed garbage collector [4]. Such a distributed GC introduces synchronisations every time a stub is communicated by a remote method invocation; in particular, such synchronisations occur in the benchmark when an object migration is acknowledged, or when stubs are piggybacked. An alternative would be to use another algorithm [23] which does not introduce such synchronisations. Our rationale for comparing sequential and random itineraries was to test whether a cost was incurred because new connections needed to be opened. Java RMI hides its implementation details in a totally opaque manner, and we have no control over the management of these resources in our implementation.

Fig. 5. An illustration of call forwarding vs referrals (LAN).

Eager Acknowledgement (N = 10, C = 200)

         Method calls/  LAN Sequential    LAN Random        Cluster Random    Cluster Sequential
    J    Itinerary      CF   Ref   %      CF   Ref   %      CF   Ref   %      CF   Ref   %
    1    10             266  243   9%     263  245   7%     254  243   5%     255  244   5%
    2    5              260  244   7%     260  245   6%     258  246   5%     259  246   5%
    3    4              262  244   7%     261  245   7%     257  246   4%     257  251   2%
    4    3              257  245   5%     257  249   3%     257  256   7%     260  255   2%

One Acknowledgement (N = 10, C = 200)

         Method calls/  LAN Sequential    LAN Random        Cluster Random    Cluster Sequential
    J    Itinerary      CF   Ref   %      CF   Ref   %      CF   Ref   %      CF   Ref   %
    1    10             (identical to Eager Acknowledgement)
    2    5              299  264  11%     299  263  12%     275  258   6%     278  258   7%
    3    4              323  278  14%     330  280  15%     295  271   8%     298  269  10%
    4    3              343  289  16%     348  293  16%     312  279  11%     312  280  10%
    5    2              329  287  13%     364  295  19%     315  288   9%     312  288   8%

Fig. 6. Average duration of a method call to a mobile object.
Discussion

Call forwarding requires two interventions of each intermediary platform, to forward the call and the result, whereas referrals require only one such intervention. We believe that this is the principal explanation for the superior performance of referrals in the presence of heavily loaded platforms (as in our LAN). We anticipate that such a configuration is similar to the environment in which mobile agents are likely to be deployed (cf. Section 2).

At the beginning of our investigation, we debated whether referrals would be penalised by having to open new connections between the benchmark platform and the itinerary platforms. In all likelihood, such connections had to be opened for distributed GC purposes in both variants of the algorithm, and therefore no significant performance difference can be attributed to this aspect. Tools to instrument the resources used within the JVM would be extremely valuable in this context.
7. Related work and conclusion We have presented Mobile Objects in Java, a library able to route method invocations to mobile objects. We have discussed two ways of handling calls, namely call forwarding and referrals; the latter turned out to be more efficient in our benchmark. There is a third method, in which the caller explicitly passes a reference to itself that the callee uses to return the result. Such a method, discussed in [8,21], allows the result to "short-cut" the chain of forwarding calls. A more extensive study is required to investigate the performance of these three methods (as well as the home agent approach) in various scenarios. Mobile Objects in Java is an integral part of a mobile agent system that we use to support mobile users in the Magnitude project [24]. From a software engineering viewpoint, such a library provides a separation of concerns between higher-level interactions and message routing. We are adopting such a communication model in three different circumstances. (i) User-driven communication with their mobile agents; (ii) Return of results from a mobile agent to a mobile personal digital assistant; (iii) Communications between mobile agents. There are a number of other systems that support mobile computations, but they adopt a different philosophy. Emerald [15] supports migration of an object, including its threads running in parallel. In Kali Scheme [6], continuations may be migrated between address spaces. Neither provides the transparent routing of messages described in this paper. Other approaches rely on a stationary entity to support communications between mobile objects, including Aglets [16], Nomadic Pict [29], April [18] and the InterAgent Communication Library [19]. Jumping Beans [2] is a commercial product offering support for mobile applications, but it requires a server to be visited by mobile agents during each agent migration.
Stationary and central locations put an extra burden on the infrastructure, which we wanted to avoid in our implementation. Our investigation has highlighted a number of difficulties concerning the evaluation of algorithms for mobile agents. (i) In high-level implementations such as ours, in particular those built above Java RMI, the lack of tools to instrument low-level resources (connections, distributed garbage collection) makes it difficult to explain observed behaviours.
(ii) The absence of widely recognised benchmarks does not ease comparison with other authors. (iii) In mobile computing, social human behaviours dictate the patterns of physical mobility, and these patterns can be used extensively in simulations. Because we lack widely accepted applications of mobile agents, we also lack accepted models of their mobility. It is this specific problem that Huet [10] addresses by modelling routing algorithms formally as stochastic processes. In particular, he compares a centralised forwarder with distributed forwarding pointers. From the slides that were accessible to us, we were unable to establish the patterns of mobility he adopted, or whether call forwarding or referrals were considered. A challenging issue is to define simulations that are refined enough to take into account other activities such as distributed garbage collection, which itself also lacks recognised benchmarks. In the future, we wish to investigate strategies for propagating information about objects' locations independently of remote method invocation. Such a study will have to consider new benchmarks, ideally derived from real applications, and should also include alternative routing algorithms. Furthermore, other requirements and their implications on performance need to be investigated, such as security and the robustness of directory services.
Acknowledgements David De Roure pointed out an analogy between the directory service described in this paper and the notion of referrals used in distributed search systems and in "query routing". Thanks to Omer Rana for discussions on their agent-based Digital Library system, and to Danius Michaelides for his comments on the paper. This research is funded in part by the EPSRC and QinetiQ project "Magnitude" (reference GR/N35816).
References

[1] S. Adams and D. De Roure, A Simulator for an Amorphous Computer, in: Proceedings of the 12th European Simulation Multiconference (ESM'98), Manchester, UK, June 1998.
[2] Ad Astra, Jumping Beans, Technical report, White Paper, 1999, http://www.JumpingBeans.com/.
[3] A.D. Birrell and B.J. Nelson, Implementing Remote Procedure Calls, ACM Transactions on Computer Systems 2(1) (February 1984), 39-59.
[4] A. Birrell, G. Nelson, S. Owicki and E. Wobber, Network Objects, Technical Report 115, Digital Systems Research Center, February 1994.
[5] L. Cardelli, Abstractions for Mobile Computation, in: Secure Internet Programming: Security Issues for Distributed and Mobile Objects, J. Vitek and C. Jensen, eds, Vol. 1603 of Lecture Notes in Computer Science, 1999.
[6] H. Cejtin, S. Jagannathan and R. Kelsey, Higher-order distributed objects, ACM Transactions on Programming Languages and Systems 17(5) (September 1995), 704-739.
[7] J. Dale and F.G. McCabe, Agent Management Support for Mobility, Fipa'98 draft specification, Fujitsu Laboratories of America, 1998.
[8] I. Foster, C. Kesselman and S. Tuecke, The Nexus Approach to Integrating Multithreading and Communication, Journal of Parallel and Distributed Computing 37 (1996), 70-82.
[9] N. Gibbins and W. Hall, Scalability issues for query routing service discovery, in: Proceedings of the Second Workshop on Infrastructure for Agents, MAS and Scalable MAS, May 2001.
[10] F. Huet, Distribution and localisation, http://www.irit.fr/ACTIVITES/PLASMA/PRO-Toulouse2001/LesPropositions/LesTransparents/pro-huet.pdf.
[11] Java Reference Objects, http://java.sun.com/j2se/1.3/docs/guide/refobs/.
[12] Java Remote Method Invocation Specification, November 1996.
[13] Java Object Serialization Specification, November 1998.
[14] D.B. Johnson and C. Perkins, Mobility Support in IPv6, Internet draft, IETF Mobile IP Working Group, 1999, draft-ietf-mobileip-ipv6-09.txt.
[15] E. Jul, Migration of light-weight processes in Emerald, Operating Systems Technical Committee Newsletter 3(1) (1989), 25-30.
[16] D.B. Lange and M. Oshima, Programming and Deploying Java Mobile Agents with Aglets, Addison-Wesley, 1998.
[17] General Magic, Telescript Technology: Mobile Agents, 1996.
[18] F.G. McCabe and K.L. Clark, APRIL - Agent Process Interaction Language, in: Proc. of ECAI'94 Workshop on Agent Theories, Architectures and Languages, Springer-Verlag, 1995.
[19] F.G. McCabe, InterAgent Communications Reference Manual, Technical report, Fujitsu Laboratories of America, 1999.
[20] G. McGraw and E.W. Felten, Securing Java, Wiley, 1999.
[21] D. Michaelides, L. Moreau and D. De Roure, A Uniform Approach to Programming the World Wide Web, Computer Systems Science and Engineering 14(2) (1999), 69-91.
[22] L. Moreau, Distributed Directory Service and Message Router for Mobile Agents, Science of Computer Programming 39(2-3) (2001), 249-272.
[23] L. Moreau, Tree Rerooting in Distributed Garbage Collection: Implementation and Performance Evaluation, Higher-Order and Symbolic Computation, to appear.
[24] L. Moreau, D. De Roure, W. Hall and N. Jennings, MAGNITUDE: Mobile AGents Negotiating for ITinerant Users in the Distributed Enterprise, http://www.ecs.soton.ac.uk/lavm/magnitude/, 2001.
[25] G. Tel and F. Mattern, The Derivation of Distributed Termination Detection Algorithms from Garbage Collection Schemes, ACM Transactions on Programming Languages and Systems 15(1) (January 1993), 1-35.
[26] C. Weider, J. Fullton and S. Spero, Architecture of the Whois++ Index Service, Request for Comments 1913, Internet Engineering Task Force, 1996.
[27] M. Weiser, Some Computer Science Problems in Ubiquitous Computing, Communications of the ACM 36(7) (July 1993), 74-84.
[28] M. Wahl, T. Howes and S. Kille, Lightweight Directory Access Protocol (v3), Request for Comments 2251, Internet Engineering Task Force, 1997.
[29] P. Wojciechowski and P. Sewell, Nomadic Pict: Language and Infrastructure Design for Mobile Agents, in: First International Symposium on Agent Systems and Applications / Third International Symposium on Mobile Agents (ASA/MA'99), October 1999.
[30] Y. Yang, O.F. Rana, C. Georgousopoulos, D.W. Walker and R.D. Williams, Mobile agents and the SARA digital library, in: Proceedings of IEEE Advances in Digital Libraries 2000, Washington, DC, May 2000, pp. 71-77.