COMMUNICATING PROCESS ARCHITECTURES 2005
Concurrent Systems Engineering Series

Series Editors: M.R. Jane, J. Hulskamp, P.H. Welch, D. Stiles and T.L. Kunii
Volume 63

Previously published in this series:
Volume 62, Communicating Process Architectures 2004 (WoTUG-27), I.R. East, J. Martin, P.H. Welch, D. Duce and M. Green
Volume 61, Communicating Process Architectures 2003 (WoTUG-26), J.F. Broenink and G.H. Hilderink
Volume 60, Communicating Process Architectures 2002 (WoTUG-25), J.S. Pascoe, P.H. Welch, R.J. Loader and V.S. Sunderam
Volume 59, Communicating Process Architectures 2001 (WoTUG-24), A. Chalmers, M. Mirmehdi and H. Muller
Volume 58, Communicating Process Architectures 2000 (WoTUG-23), P.H. Welch and A.W.P. Bakkers
Volume 57, Architectures, Languages and Techniques for Concurrent Systems (WoTUG-22), B.M. Cook
Volumes 54–56, Computational Intelligence for Modelling, Control & Automation, M. Mohammadian
Volume 53, Advances in Computer and Information Sciences '98, U. Güdükbay, T. Dayar, A. Gürsoy and E. Gelenbe
Volume 52, Architectures, Languages and Patterns for Parallel and Distributed Applications (WoTUG-21), P.H. Welch and A.W.P. Bakkers
Volume 51, The Network Designer's Handbook, A.M. Jones, N.J. Davies, M.A. Firth and C.J. Wright
Volume 50, Parallel Programming and JAVA (WoTUG-20), A. Bakkers
Volume 49, Correct Models of Parallel Computing, S. Noguchi and M. Ota
Volume 48, Abstract Machine Models for Parallel and Distributed Computing, M. Kara, J.R. Davy, D. Goodeve and J. Nash
Volume 47, Parallel Processing Developments (WoTUG-19), B. O'Neill
Volume 46, Transputer Applications and Systems '95, B.M. Cook, M.R. Jane, P. Nixon and P.H. Welch

Transputer and OCCAM Engineering Series
Volume 45, Parallel Programming and Applications, P. Fritzson and L. Finmo
Volume 44, Transputer and Occam Developments (WoTUG-18), P. Nixon
Volume 43, Parallel Computing: Technology and Practice (PCAT-94), J.P. Gray and F. Naghdy
Volume 42, Transputer Research and Applications 7 (NATUG-7), H. Arabnia
Volume 41, Transputer Applications and Systems '94, A. de Gloria, M.R. Jane and D. Marini
Volume 40, Transputers '94, M. Becker, L. Litzler and M. Tréhel

ISSN 1383-7575
Communicating Process Architectures 2005 WoTUG-28
Edited by
Jan F. Broenink University of Twente, The Netherlands
Herman W. Roebbers Philips TASS, The Netherlands
Johan P.E. Sunter Philips Semiconductors, The Netherlands
Peter H. Welch University of Kent, United Kingdom
and
David C. Wood University of Kent, United Kingdom
Proceedings of the 28th WoTUG Technical Meeting, 18–21 September 2005, Technische Universiteit Eindhoven, The Netherlands
Amsterdam • Berlin • Oxford • Tokyo • Washington, DC
© 2005 The authors. All rights reserved.

No part of this book may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, without prior written permission from the publisher.

ISBN 1-58603-561-4
Library of Congress Control Number: 2005932067

Publisher
IOS Press
Nieuwe Hemweg 6B
1013 BG Amsterdam
Netherlands
fax: +31 20 687 0019
e-mail: [email protected]

Distributor in the UK and Ireland
IOS Press/Lavis Marketing
73 Lime Walk
Headington
Oxford OX3 7AD
England
fax: +44 1865 750079

Distributor in the USA and Canada
IOS Press, Inc.
4502 Rachael Manor Drive
Fairfax, VA 22032
USA
fax: +1 703 323 3668
e-mail: [email protected]

LEGAL NOTICE
The publisher is not responsible for the use which might be made of the following information.

PRINTED IN THE NETHERLANDS
Preface

We are at the start of a new CPA conference. Communicating Process Architectures 2005 marks the first time that this conference has been organized by an industrial company (Philips) in co-operation with a university (Technische Universiteit Eindhoven). This also marks the growing awareness of the ideas characterized by 'Communicating Process Architectures' and their growing adoption by industry beyond their traditional base in safety-critical systems and security.

The complexity of modern computing systems has become so great that no one person – maybe not even a small team – can understand all aspects and all interactions. The only hope of making such systems work is to ensure that all components are correct by design and that the components can be combined to achieve scalability. A crucial property is that the cost of making a change to a system depends linearly on the size of that change – not on the size of the system being changed. Of course, this must be true whether that change is a matter of maintenance (e.g. to take advantage of upcoming multiprocessor hardware) or the addition of new functionality.

One key is that system composition (and disassembly) introduces no surprises. A component must behave consistently, no matter the context in which it is used – which means that component interfaces must be explicit, published and free from hidden side-effects. Our view is that concurrency, underpinned by the formal process algebras of Hoare's Communicating Sequential Processes and Milner's π-calculus, provides the strongest basis for the development of technology that can make this happen.

Once again we offer strongly refereed, high-quality papers covering many differing aspects: system design and implementation (for both hardware and software), tools (concurrent programming languages, libraries and run-time kernels), formal methods and applications. These papers are presented in a single stream, so you won't have to miss out on anything. As always, we have plenty of space for informal contact, and we don't have to worry about the bar closing at half ten!

We are pleased to have keynote speakers such as Ad Peeters of Handshake Solutions and Guy Broadfoot of Verum, proving that you can actually build a profitable business using CSP as your guiding principle in the design of concurrent systems, be they hardware or software. The third keynote, by IBM Chief Architect Peter Hofstee, assures us that CSP was also used in the design of the communication system of the recent Cell processor, jointly developed by IBM, Sony and Toshiba. The fourth keynote talk is by Paul Stravers of Philips Semiconductors, on the Wasabi multiprocessor architecture.

We anticipate that you will have a very fruitful get-together and hope that it will provide you with as much inspiration and motivation as we have always experienced. We thank the authors for their submissions, the Programme Committee for their hard work in reviewing the papers, and Harold Weffers and Maggy de Wert (of TUE) for making the arrangements for this meeting. Finally, we are especially grateful to Fred Barnes (of the University of Kent) for his essential technical expertise and time in the preparation of these proceedings.

Herman Roebbers (Philips TASS)
Peter Welch and David Wood (University of Kent)
Johan Sunter (Philips Semiconductors)
Jan Broenink (University of Twente)
Programme Committee

Prof. Peter Welch, University of Kent, UK (Chair)
Dr. Alastair Allen, Aberdeen University, UK
Prof. Hamid Arabnia, University of Georgia, USA
Dr. Fred Barnes, University of Kent, UK
Dr. Richard Beton, Roke Manor Research Ltd, UK
Dr. John Bjorndalen, University of Tromso, Norway
Dr. Marcel Boosten, Philips Medical Systems, The Netherlands
Dr. Jan Broenink, University of Twente, The Netherlands
Dr. Alan Chalmers, University of Bristol, UK
Prof. Peter Clayton, Rhodes University, South Africa
Dr. Barry Cook, 4Links Ltd., UK
Ms. Ruth Ivimey-Cook, Stuga Ltd., UK
Dr. Ian East, Oxford Brookes University, UK
Dr. Mark Green, Oxford Brookes University, UK
Mr. Marcel Groothuis, University of Twente, The Netherlands
Dr. Michael Goldsmith, Formal Systems (Europe) Ltd., Oxford, UK
Dr. Kees Goossens, Philips Research, The Netherlands
Dr. Gerald Hilderink, Enschede, The Netherlands
Mr. Christopher Jones, British Aerospace, UK
Prof. Jon Kerridge, Napier University, UK
Dr. Tom Lake, InterGlossa, UK
Dr. Adrian Lawrence, Loughborough University, UK
Dr. Roger Loader, Reading, UK
Dr. Jeremy Martin, GSK Ltd., UK
Dr. Stephen Maudsley, Bristol, UK
Mr. Alistair McEwan, University of Surrey, UK
Prof. Brian O'Neill, Nottingham Trent University, UK
Prof. Chris Nevison, Colgate University, New York, USA
Dr. Denis Nicole, University of Southampton, UK
Prof. Patrick Nixon, University College Dublin, Ireland
Dr. James Pascoe, Bristol, UK
Dr. Jan Pedersen, University of Nevada, Las Vegas
Dr. Roger Peel, University of Surrey, UK
Ir. Herman Roebbers, Philips TASS, The Netherlands
Prof. Nan Schaller, Rochester Institute of Technology, New York, USA
Dr. Marc Smith, Colby College, Maine, USA
Prof. Dyke Stiles, Utah State University, USA
Dr. Johan Sunter, Philips Semiconductors, The Netherlands
Mr. Oyvind Teig, Autronica Fire and Security, Norway
Prof. Rod Tosten, Gettysburg University, USA
Dr. Stephen Turner, Nanyang Technological University, Singapore
Prof. Paul Tynman, Rochester Institute of Technology, New York, USA
Dr. Brian Vinter, University of Southern Denmark, Denmark
Prof. Alan Wagner, University of British Columbia, Canada
Dr. Paul Walker, 4Links Ltd., UK
Mr. David Wood, University of Kent, UK
Prof. Jim Woodcock, University of York, UK
Ir. Peter Visser, University of Twente, The Netherlands
Contents

Preface, Herman Roebbers, Peter Welch, David Wood, Johan Sunter and Jan Broenink (v)
Programme Committee (vi)
Interfacing with Honeysuckle by Formal Contract, Ian East (1)
Groovy Parallel! A Return to the Spirit of occam?, Jon Kerridge, Ken Barclay and John Savage (13)
On Issues of Constructing an Exception Handling Mechanism for CSP-Based Process-Oriented Concurrent Software, Dusko S. Jovanovic, Bojan E. Orlic and Jan F. Broenink (29)
Automatic Handel-C Generation from MATLAB® and Simulink® for Motion Control with an FPGA, Bart Rem, Ajeesh Gopalakrishnan, Tom J.H. Geelen and Herman Roebbers (43)
JCSP-Poison: Safe Termination of CSP Process Networks, Bernhard H.C. Sputh and Alastair R. Allen (71)
jcsp.mobile: A Package Enabling Mobile Processes and Channels, Kevin Chalmers and Jon Kerridge (109)
CSP++: How Faithful to CSPm?, W.B. Gardner (129)
Fast Data Sharing within a Distributed, Multithreaded Control Framework for Robot Teams, Albert Schoute, Remco Seesink, Werner Dierssen and Niek Kooij (147)
Improving TCP/IP Multicasting with Message Segmentation, Hans Henrik Happe and Brian Vinter (155)
Lazy Cellular Automata with Communicating Processes, Adam Sampson, Peter Welch and Fred Barnes (165)
A Unifying Theory of True Concurrency Based on CSP and Lazy Observation, Marc L. Smith (177)
The Architecture of the Minimum intrusion Grid (MiG), Brian Vinter (189)
Verification of JCSP Programs, Vladimir Klebanov, Philipp Rümmer, Steffen Schlager and Peter H. Schmitt (203)
Architecture Design Space Exploration for Streaming Applications through Timing Analysis, Maarten H. Wiggers, Nikolay Kavaldjiev, Gerard J.M. Smit and Pierre G. Jansen (219)
A Foreign-Function Interface Generator for occam-pi, Damian J. Dimmich and Christian L. Jacobsen (235)
Interfacing C and occam-pi, Fred Barnes (249)
Interactive Computing with the Minimum intrusion Grid (MiG), John Markus Bjørndalen, Otto J. Anshus and Brian Vinter (261)
High Level Modeling of Channel-Based Asynchronous Circuits Using Verilog, Arash Saifhashemi and Peter A. Beerel (275)
Mobile Barriers for occam-pi: Semantics, Implementation and Application, Peter Welch and Fred Barnes (289)
Exception Handling Mechanism in Communicating Threads for Java, Gerald H. Hilderink (317)
R16: A New Transputer Design for FPGAs, John Jakson (335)
Towards Strong Mobility in the Shared Source CLI, Johnston Stewart, Paddy Nixon, Tim Walsh and Ian Ferguson (363)
gCSP occam Code Generation for RMoX, Marcel A. Groothuis, Geert K. Liet and Jan F. Broenink (375)
Assessing Application Performance in Degraded Network Environments: An FPGA-Based Approach, Mihai Ivanovici, Razvan Beuran and Neil Davies (385)
Communication and Synchronization in the Cell Processor (Invited Talk), H. Peter Hofstee (397)
Homogeneous Multiprocessing for Consumer Electronics (Invited Talk), Paul Stravers (399)
Handshake Technology: High Way to Low Power (Invited Talk), Ad Peeters (401)
If Concurrency in Software Is So Simple, Why Is It So Hard? (Invited Talk), Guy Broadfoot (403)
Author Index (405)
Interfacing with Honeysuckle by Formal Contract

Ian EAST
Dept. for Computing, Oxford Brookes University, Oxford OX33 1HX, England
[email protected]

Abstract. Honeysuckle [1] is a new programming language that allows systems to be constructed from processes which communicate under service (client-server or master-servant) protocol [2]. The model for abstraction includes a formal definition of both service and service-network (system or component) [3]. Any interface between two components thus forms a binding contract which will be statically verified by the compiler. An account is given of how such an interface is constructed and expressed in Honeysuckle, including how it may encapsulate state, and how access may be shared and distributed. Implementation is also briefly discussed.

Keywords. Client-server protocol, compositionality, interfacing, component-based software development, deadlock-freedom, programming language.
Introduction

The Honeysuckle project has two motivations. First is the need for a method by which to design and construct reactive (event-driven) and concurrent systems free of pathological behaviour, such as deadlock. Second is the desire to design a new programming language that builds on the success of occam [4] and profits from all that has been learned in two decades of its use [5].

occam already has one worthy successor in occam-π, which extends the original language to support the development of distributed applications [6]. Both processes and channels thus become mobile. Honeysuckle is more conservative and allows mobility only of objects. Emphasis has instead been placed on securing integrity within the embedded application domain.

Multiple offspring are testimony to the innovative vigour of occam. Any successor must preserve its salient features. occam facilitates the natural expression of concurrency without semaphore or monitor. It possesses transparent, and mostly formal, semantics, based upon the theory of Communicating Sequential Processes (CSP) [7,8]. It is also compositional, in that it is rendered inherently free of side-effects by the strict separation of value and action (the changing of value).

occam also had its weaknesses, which limited its commercial potential. It offered poor support for the expression of data structure and none for dynamic (abstract) data types. While processes afford encapsulation and allow effective system modularity, there is no support for project (source code) modularity: one cannot collect related definitions in any kind of reusable package. Also, the ability only to copy a value, and not pass access to an object, to a parallel process caused inefficiency, and lay in contrast with the passing of parameters to a sequential procedure.

Perhaps the most significant factor limiting the take-up of occam has been the additional threats to security against error that come with concurrency; most notably, deadlock. Jeremy Martin successfully brought together theoretical work on deadlock-avoidance using CSP with the effective design patterns for process-oriented systems introduced by Peter Welch et al. [9,10,11,12].
The result was a set of formal design rules, each proven to guarantee deadlock-freedom within a CSP framework. By far the most widely applicable design rule relies on a formal service (client-server) protocol to define a model for system architecture. This idea originated with Per Brinch Hansen [2] in the study of operating systems. Service architecture has a wide domain of application because it can abstract a large variety of systems, including any that can be expressed using channels, as employed by occam. However, architecture is limited to hierarchical structure because of a design rule that requires the absence of any directed circuit in service provision, in order to guarantee freedom from deadlock.

A formal model for the abstraction of systems with service architecture has been given previously [3], based upon the rules employed by Martin. This separates the abstraction of service protocol and service network component, and shows how the definition of system and component can be unified (a point to be revisited in the next section). Furthermore, the model incorporates prioritisation, which not only offers support for reactive systems (which typically prioritise event response), but also liberates system architecture from the constraint of hierarchical (tree) structure. Finally, a further proof of the absence of deadlock was given, subject to a new design rule.

Prioritised service architecture (PSA) presents the opportunity to build a wide range of reactive/concurrent systems, guaranteed free of deadlock. However, it is too much to expect any designer to take responsibility for the static verification of many formal design rules. Specialist skills would be required. Even then, mistakes would be made. In order to ease design and implementation, a new programming language is required. The compiler can then automate all verification.

Honeysuckle seeks to combine the ambition for such a language with that for a successor to occam. It renders systems with PSA simple to derive and express, while retaining a formal guarantee of deadlock-freedom, without resort to any specialist skill or tool beyond the compiler. Its design is now complete and stable. A compiler is under construction and will be made available free of charge.

This paper presents a detailed account of the programming of service protocol and the construction of an interface for system or component in Honeysuckle, continuing from the previous language overview [1].
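As an aside, the acyclicity design rule mentioned above lends itself to mechanical checking. The following is a hedged sketch, not taken from the Honeysuckle compiler; all names are illustrative, and Groovy (used elsewhere in this volume) serves as a convenient notation. A depth-first search over the client-of relation reports any directed circuit in service provision.

  // Hypothetical check: map each component to the components it is a client
  // of; a back edge found during depth-first search is a directed circuit.
  def hasCycle(Map graph) {
      def visiting = [], done = []
      def dfs
      dfs = { node ->
          if (node in visiting) return true        // back edge: circuit found
          if (node in done) return false
          visiting << node
          def found = (graph[node] ?: []).any { dfs(it) }
          visiting.remove(node)
          done << node
          return found
      }
      graph.keySet().any { dfs(it) }
  }

  assert !hasCycle([app : ['console'], console : []])   // hierarchical: safe
  assert hasCycle([a : ['b'], b : ['a']])               // circuit: rejected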
We begin by considering the problem of modular software composition and the limitations of existing object- and process-oriented languages.

1. The Problem of Composition

While occam is compositional in the construction of a monolithic program, it is not so with regard to system modularity. In order to recursively compose or decompose a system, we require:

• some components that are indivisible;
• that compositions of components are themselves valid components;
• that the behaviour of any component is manifest in its interface, without reference to any internal structure.

Components whose definition complies with all the above conditions may be termed compositional with regard to some operator or set of operators. As alluded to earlier, it has been shown how service network components (SNCs) may be defined in such a way as to satisfy the first two requirements when subject to parallel composition [3]. A corollary is that any system forms a valid component, since it is (by definition) a composition. Another corollary, vital to all forms of engineering, is that it is then possible to substitute any component with another possessing the same interface, without affecting either design or compliance with specification.
Software engineering now aspires to this principle [13]. Clearly, listing a series of procedures, with given parameters, or a series of channels, with associated data types, does little to describe an object or process. To substitute one process with another that simply sports the same channels would obviously be asking for trouble. A much richer language is called for in which to describe an interface.

One possibility is to resort to Floyd-Hoare logic [14,15,16] and impose formal pre- and post-conditions on each procedure ('method') or channel, and maintain invariants associated with each component (process or object class). However, this would effectively require the development of a language to suit each individual application, and is somewhat cumbersome and expensive. It also requires special skill. Perhaps for that reason, such an explicitly formal approach has not found favour in much of industry. Furthermore, no other branch of engineering resorts to such powerful methods.

Meyer introduced the expression design by contract [17], to which he devotes an entire chapter of his textbook on object-oriented programming [18]. This would seem to be just a particular usage of invariants and pre- and post-conditions, but it does render clear the principle that some protocol must precede composition and be verifiable.

The difficulty that is peculiar to software, and that does not (often) apply to, say, mechanical engineering, is, of course, that a component is likely to be capable of complex behaviour, responding in a unique and perhaps extended manner to each possible input combination. Not many mechanical systems possess memory and the ability to change their response in a perhaps highly non-linear fashion. However, many electronic systems do possess significantly complex behaviour, yet have interfaces specified without resort to full first-order predicate calculus. Electronic engineers expect to be able to substitute components according to a somewhat more specific interface description.

One possibility for software component interface description, shared with hardware, is a formal communication protocol detailing the order in which messages are exchanged, together with their type and structure. In this way, a binding and meaningful contract is espoused. Verification can be performed via the execution of an appropriate "state-machine" (finite-state automaton (FSA)). Marcel Boosten proposed just such a mechanism to resolve problems encountered upon integration under component-based software development [19]. These included race conditions, re-entrant call-backs, and inconsistency between component states. He interposed an object between components that would simulate an appropriate FSA. Communication protocol can provide an interface that is both verifiable and sufficiently rich to at least reduce the amount of logic necessary for an adequate definition, if not eliminate it altogether.

In Honeysuckle, an interface comprises a list of ports, each of which corresponds to one end (client or provider) of a service and forms an attribute of the component. Each service defines a communication protocol that is translated by the compiler into an appropriate FSA. Conformance to that protocol is statically verifiable by the compiler. Static verification is to be preferred wherever possible, for the obvious reason that errors can be safely corrected. Dynamic verification can be compared to checking your boat after setting out to sea.
Should you discover a hole, there is little you can then do but sink. Discovering an error in software that is deployed and running rarely leaves an opportunity for effective counter-measures, still less rectification. Furthermore, dynamic verification imposes a performance overhead that may well prove significant, especially for low-latency reactive applications. It is thus claimed here that (prioritised) service architecture is an ideal candidate for secure component-based software development (CBSD).
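To make the interposed-FSA idea concrete, here is a minimal sketch of a run-time protocol checker of the kind Boosten describes. It is illustrative only and not drawn from [19]; the message tags and the Groovy notation are invented for the example. Each message drives a transition; any unexpected message signals a protocol violation.

  // Hypothetical checker: transitions map a message tag combined with the
  // current state to the next state; anything else is a violation.
  class ProtocolChecker {
      def state = 0
      def transitions = ['request@0' : 1, 'response@1' : 0]
      void accept(String tag) {
          def next = transitions[tag + '@' + state]
          if (next == null)
              throw new IllegalStateException("unexpected ${tag} in state ${state}")
          state = next
      }
  }

  checker = new ProtocolChecker()
  checker.accept('request')     // legal: moves to state 1
  checker.accept('response')    // legal: back to state 0
  // a second accept('response') here would throw: response before request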
Honeysuckle also provides balanced abstraction between object and process. Both static and dynamic object composition may be transparently expressed, without recourse to any explicit reference (pointer). Distributed applications are supported, with objects mobile between processes. Together, object and service abstraction affords a rich language in which to express the interface between processes composed in either sequence or parallel.

2. Parallel Composition and Interfacing in Honeysuckle

2.1. Composition and Definition

Honeysuckle interposes "clear blue water" between system and project modularity. Each definition of a process, object, or service is termed an item. Items may be gathered into a collection. Items and collections serve the needs of separated development and reuse. Processes and objects are the components from which systems are composed, and together serve the needs of system abstraction, design, and maintenance. Every object is owned by a single process, though ownership may be transferred between processes at run-time. Here, we are concerned only with the programming of processes and their service interface.

A program consists of one or more item definitions, including at least one of a process. For example:

  definition of process greet
  imports
    service console from Environment
  process greet :
  { interface
      client of console
    defines
      String value greeting : "Hello world!\n"
    send greeting to console
  }
This defines a unique process greet whose interface is a single port consuming a service named console. The console service is assumed to be provided by the system environment, which is effectively another process composed in parallel (and which must include "provider of console" within its interface description). Figure 1 shows how both project and system modularity may be visualized or drawn.
Figure 1. Visualizing both project and system modularity.
The left-hand drawing shows the item defining process greet importing the definition of service console. On the right, the process is shown running as a client of that service. Braces (curly brackets) denote the boundary of block scope, not sequential construction, as in C or Java. They may be omitted where no context is given, and thus no indication of scope required.
A process may be defined inline or offline in Honeysuckle, with identical semantics. When defined inline, any further (offline) definitions must be imported above the description of the parent process.

  ...
  { interface
      client of console
    defines
      String greeting : "Hello world!\n"
    send greeting to console
  }
  ...
An inline definition is achieved simultaneously with command issue (greet!). A process thus defined can still be named, facilitating recursion. For example, a procedure to create a new document in, say, a word processor might include the means by which a user can create a further document:

  ...
  process new_document :
  { ... context ...
    ...
    new_document
  }
  ...
2.2. Simple Services

If all the console service does is eat the strings it is sent, it could be defined very simply:

  definition of service console
  imports
    object class String from StandardTypes
  service console :
    receive String
This is the sort of thing a channel can do — simply define the type of value that can be transmitted. Any such simple protocol can be achieved using a single service primitive. This is termed a simple service. Note that it is expressed from the provider perspective; the client must send a string.

One further definition is imported: that of a string data type from a standard library — part of the program environment. It was not necessary for the definition of process greet to import that of String directly. Definitions in Honeysuckle are transparent: since that of greet can see that of console, it can also see that of String. For this reason, no standard data type need be imported to an application program.

If more than one instance of a console service is required, then one must define a class of service, perhaps called Console:

  definition of service class Console
  ...
It is often very useful to communicate a “null datum” — a signal:
  definition of service class Sentinel
  service class Sentinel :
    send signal
This example makes an important point. A service definition says nothing about when the signal is sent; that will depend on the definition of the process that provides it. Any service simply acts as a template governing the communication undertaken between two (or more) processes.

Signal protocol illustrates a second point, also of some importance. The rules governing the behaviour of every service network component (SNC) [3] do not require any service to become available immediately. This allows signal protocol to be used to synchronize two processes, where either may arrive first.

2.3. Service Construction and Context

Service protocol can provide a much richer interface, and thus a tighter component specification, by constraining the order in which communications occur. Perhaps the simplest example is handshaking, where a response is always made to any request:

  definition of service class Console
  imports
    object class String from Standard_Types
  service class Console :
    sequence
      receive String
      send String
Any process implementing a compound service, like the one above, is more tightly constrained than with a simple service. A rather more sophisticated console might be subject to a small command set and would behave accordingly:

  service class Console :
  { defines
      Byte write : #01
      Byte read : #02
    names
      Byte command
    sequence
      receive command
      if command
        write
          acquire String
        read
          sequence
            receive Cardinal
            transfer String
  ...
Now something strange has happened: a service has acquired state. Strange though it may seem, there is no cause for alarm. Naming within a service is ignored within any process that implements it (either as client or provider). It simply allows identification between references within a service definition, and so allows a decision to be taken according to the intended object or value. This leaves control over all naming with the definition of process context.
One peculiarity to watch out for is illustrated by the following:

  service class Business :
  { ...
    sequence
      acquire Order
      send Invoice
      if
        acquire Payment
        transfer Item
      otherwise
        skip
  }
It might at first appear that payment will never be required and that the service will always terminate after the dispatch of (a copy of) the invoice. Such is not the case. The above definition allows either payment to be acquired and an item then transferred, or no further transaction between client and provider; it simply endorses either as legitimate. Perhaps the business makes use of a timer service and decides according to elapsed time whether to accept or refuse payment if/when offered. Although it makes sense, any such protocol is not legitimate, because it does not conform to the formal conditions defining service protocol [3]. The sequence in which communications take place must be agreed between client and provider. Agreement can be made as late as desired, but it must be made. Here, at the point of selection (if), there is no agreement. Selection and repetition must be undertaken according to mutually recorded values, which is why a service may require state.

A compound service may also be constructed via repetition. This might seem unnecessary, given that a service protocol is inherently repeatable anyway, but account must be taken of other associated structure. For example, the following might be a useful protocol for copying each week between two diaries:

  service diary :
  { ...
    sequence
      repeat for each WeekDay
        send day
      send week
  }
It also serves as a nice illustration of the Honeysuckle use of an enumeration as both data type and range.

2.4. Implementation and Verification

Any service could be implemented in occam, using at most two channels — one in each direction of data flow. Like a channel, a service is implemented using a rendezvous. Because, within a service, communications are undertaken strictly in sequence, only a single rendezvous is required. As with occam, the rendezvous must be initially empty and is then occupied by the first party to become ready, which must render apparent the location of, or for, any message, and then wait.

Each service can be verified via a finite-state automaton (FSA) augmented with a loop iteration counter. At process start, each service begins in an initial state and moves to its successor every time a communication is encountered matching that expected.
Upon process termination, each automaton must be in a final "accepting" state. A single state marks any repetition underway. Transition from that state awaits completion of the required number of iterations, which may depend upon a previous communication (within the same service). Selection is marked by multiple transitions leaving the state adopted on seeing the preceding communication; a separate state-chain follows each option. Static verification can be complete except for repetition terminated according to state incorporated within the service. The compiler must take account of this and generate an appropriate warning. Partial verification is still possible at compile-time, though the final iteration count must be checked at run-time.
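As a hedged illustration of the repetition state just described — this is not the Honeysuckle compiler's code; the names and the Groovy notation are invented — a looping automaton state can carry an iteration counter, and the transition out of the loop is enabled only once the required count is reached:

  // Sketch: exit from a repetition state is permitted only after 'bound'
  // iterations; 'bound' may have been fixed by an earlier communication.
  class LoopState {
      int bound          // required number of iterations
      int count = 0
      def body           // state chain re-entered on each iteration
      def next           // state adopted once the repetition completes
      def advance() {
          count += 1
          return (count < bound) ? body : next
      }
  }

  week = new LoopState(bound : 7, body : 'send day', next : 'send week')
  7.times { state = week.advance() }
  assert state == 'send week'   // exit enabled only after the final day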
3. Shared and Distributed Services

3.1. Sharing

By definition, a service represents a contract between two parties only. However, the question of which two can be resolved dynamically. In the use of occam, it became apparent that a significant number of applications required the same superstructure to allow services to be shared in this way. occam 3 [20] sought to address both the need to establish a protocol governing more than one communication at a time and the need for shared access. Remote call channels effected a remote procedure call (RPC), and thus afforded a protocol specifying a list of parameters received by a subroutine, followed by a result returned. Once defined, RPCs could be shared in a simple and transparent manner. occam 3 also added shared groups of simple channels via yet another mechanism, somewhat less simple and transparent.

The RPC is less flexible than service protocol, which allows communications to be specified in either direction in any order. Furthermore, multiple services may be interleaved; multiple calls to a remote procedure cannot be, any more than they can to a local one. Lastly, the RPC is added to the existing channel abstraction of communication, complicating the model significantly. In Honeysuckle, services are all that is needed to abstract communication, all the way from the simplest to the most complex protocol.

Honeysuckle allows services to be shared by multiple clients at the point of declaration. No service need be explicitly designed for sharing or defined as shared.

  { ...
    network
      shared console
    parallel
    { interface
        provider of console
      ...
    }
    ... console clients
  }
Any client of a shared service will be delayed while another is served. Multiple clients form an implicit queue.
3.2. Synchronized Sharing

Experience with occam and the success of bulk-synchronous parallel processing strongly suggest the need for barrier synchronisation. Honeysuckle obliges with the notion of synchronized sharing, where every client must consume the service before any can reinitiate consumption and the cycle begin again.

  ...
  network
    synchronized shared console
  ...
Like the sharing in occam 3, synchronized sharing in Honeysuckle is superstructure. It could be implemented directly via the use of an additional co-ordinating process, but it is believed useful and intuitive enough to warrant its own syntax. The degree of system abstraction possible is thus raised.

3.3. Distribution

Sharing provides a many-to-one configuration between clients and a single provider. It is also possible, in Honeysuckle, to describe both one-to-many and many-to-many configurations. A service is said to be distributed when it is provided by more than one process.

  ...
  network
    distributed validation
  ...
Note that the service thus described may remain unique and should be defined accordingly; definition of an entire class of service is not required. (By now, the convention may be apparent whereby a lower-case initial indicates uniqueness and an upper-case one a class, with regard to any item — object, process, or service.) The utility of this is to simplify the design of many systems and reduce the code required for their implementation. Again, the degree of system abstraction possible is raised.

A many-to-many configuration may be expressed by combining two qualifiers:

  ...
  network
    distributed shared validation
  ...
When distributed, a shared service cannot be synchronized. This would make no sense, as providers possess no intrinsic way of knowing when a cycle of service, around all clients, is complete.

3.4. Design and Implementation

Neither sharing nor distribution influences the abstract interface of a component. Consideration is only necessary when combining components. For example, the designer may choose to replicate a number of components, each of which provides service A, and declare provision distributed between them. Similarly, they may choose a component providing service B and declare provision shared between a number of clients.

A shared service requires little more in implementation than an unshared one. Two rendezvous (locations) are required: one is used to synchronize access to the service, and the other each communication within it. Any client finding the provider both free and ready (both rendezvous occupied) may simply proceed and complete the initial communication. After this, it must clear both rendezvous. It may subsequently ignore the service rendezvous until completion.
Any other client arriving while service is in progress will find the provider unready (service rendezvous empty). It then joins a queue, at the head of which is the service rendezvous. The maximum length of the queue is just the total number of clients, defined at compile-time.

Synchronized sharing requires a secondary queue, from which elements are prevented from joining the primary one until a cycle is complete. A shared distributed service requires multiple primary queues. The physical interface that implements sharing and shared distribution is thus a small process, encapsulating one or more queues.

4. Conclusion

Honeysuckle affords powerful and fully component-wise compositional system design and programming, yet with a simple and intuitive model for abstraction. It inherits and continues the simplicity of occam, but adds the ability to express the component (or system) interface in much greater detail, so that integration and substitution should be more easily achieved. Support is also included for distributed and bulk-synchronous application design, with mobile objects and synchronized sharing of services.

Service (client-server) architecture is proving extremely popular in the design of distributed applications, but it currently lacks an established formal basis, a simple consistent model for abstraction, and a programming language. Honeysuckle and PSA would seem timely and well-placed. Though no formal semantics for prioritisation yet appears to have gained both stability and wide acceptance, this looks set to change [21].

A complete programming language manual is in preparation, as is a working compiler. These will be completed and published as soon as possible.

Acknowledgements

The author is grateful for enlightening conversation with Peter Welch, Jeremy Martin, Sharon Curtis, and David Lightfoot. He is particularly grateful to Jeremy Martin, whose earlier work formed the foundation for the Honeysuckle project. That, in turn, was strongly reliant on deadlock analysis by, and the failure-divergence-refinement (FDR) model of, Bill Roscoe, Steve Brookes, and Tony Hoare.

References

[1] Ian R. East. The Honeysuckle programming language: An overview. IEE Software, 150(2):95–107, 2003.
[2] Per Brinch Hansen. Operating System Principles. Automatic Computation. Prentice Hall, 1973.
[3] Ian R. East. Prioritised service architecture. In I. R. East and J. M. R. Martin et al., editors, Communicating Process Architectures 2004, Series in Concurrent Systems Engineering, pages 55–69. IOS Press, 2004.
[4] Inmos. occam 2 Reference Manual. Series in Computer Science. Prentice Hall International, 1988.
[5] Ian R. East. Towards a successor to occam. In A. Chalmers, M. Mirmehdi, and H. Muller, editors, Proceedings of Communicating Process Architecture 2001, pages 231–241, University of Bristol, UK, 2001. IOS Press.
[6] Fred R. M. Barnes and Peter H. Welch. Communicating mobile processes. In I. R. East and J. M. R. Martin et al., editors, Communicating Process Architectures 2004, pages 201–218. IOS Press, 2004.
[7] C. A. R. Hoare. Communicating Sequential Processes. Series in Computer Science. Prentice Hall International, 1985.
[8] A. W. Roscoe. The Theory and Practice of Concurrency. Series in Computer Science. Prentice-Hall, 1998.
[9] Peter H. Welch. Emulating digital logic using transputer networks. In Parallel Architectures and Languages – Europe, volume 258 of LNCS, pages 357–373. Springer Verlag, 1987.
[10] Peter H. Welch, G. Justo, and Colin Willcock. High-level paradigms for deadlock-free high performance systems. In R. Grebe et al., editors, Transputer Applications and Systems '93, pages 981–1004. IOS Press, 1993.
[11] Jeremy M. R. Martin. The Design and Construction of Deadlock-Free Concurrent Systems. PhD thesis, University of Buckingham, Hunter Street, Buckingham, MK18 1EG, UK, 1996.
[12] Jeremy M. R. Martin and Peter H. Welch. A design strategy for deadlock-free concurrent systems. Transputer Communications, 3(3):1–18, 1997.
[13] Clemens Szyperski. Component Software: Beyond Object-Oriented Programming. Component Software Series. Addison-Wesley, second edition, 2002.
[14] R. W. Floyd. Assigning meanings to programs. In American Mathematical Society Symp. in Applied Mathematics, volume 19, pages 19–31, 1967.
[15] C. A. R. Hoare. An axiomatic basis for computer programming. Communications of the ACM, 12(10):576–580, 1969.
[16] C. A. R. Hoare. Proof of correctness of data representations. Acta Informatica, 1:271–281, 1972.
[17] Bertrand Meyer. Design by contract. Technical Report TR-EI-12/CO, ISE Inc., 270 Storke Road, Suite 7, Santa Barbara, CA 93117, USA, 1987.
[18] Bertrand Meyer. Object-Oriented Software Construction. Prentice Hall, second edition, 1997.
[19] Marcel Boosten. Formal contracts: Enabling component composition. In J. F. Broenink and G. H. Hilderink, editors, Proceedings of Communicating Process Architecture 2003, pages 185–197, University of Twente, Netherlands, 2003. IOS Press.
[20] Geoff Barrett. occam 3 Reference Manual. Inmos Ltd., 1992.
[21] Adrian E. Lawrence. Triples. In I. R. East and J. M. R. Martin et al., editors, Proceedings of Communicating Process Architectures 2004, Series in Concurrent Systems Engineering, pages 157–184. IOS Press, 2004.
Groovy Parallel! A Return to the Spirit of occam?

Jon KERRIDGE, Ken BARCLAY and John SAVAGE
The School of Computing, Napier University, Edinburgh EH10 5DT
{j.kerridge, k.barclay, j.savage} @ napier.ac.uk

Abstract. For some years there has been much activity in developing CSP-like extensions to a number of common programming languages. In particular, a number of groups have looked at extensions to Java. Recent developments in the Java platform have resulted in groups proposing more expressive problem-solving environments. Groovy is one of these developments. Four constructs are proposed that support the writing of parallel systems using the JCSP package. The use of these constructs is then demonstrated in a number of examples, both concurrent and parallel. A mechanism for writing XML descriptions of concurrent systems is described, and it is shown how this is integrated into the Groovy environment. Finally, conclusions are drawn relating to the use of the constructs, particularly in a teaching and learning environment.

Keywords. Groovy, JCSP, Parallel and Concurrent Systems, Teaching and Learning
Introduction

The occam programming language [1] provided a concise, simple and elegant means of describing computing systems comprising multiple processes running on one or more processors. Its theoretical foundations lay in the Communicating Sequential Processes algebra of Hoare [2]. A practical realization of occam was the Inmos Transputer; with the demise of that technology, the utility of occam as a generally available language was lost. The Communicating Process Architecture community kept the underlying principles of occam alive through a number of developments, such as Welch's JCSP package [3] and Hilderink's CTJ [4]. Both of these developments captured the concepts of CSP in a Java environment. The former is supported by an extensive package that also permits the creation of systems that operate over a TCP/IP network. The problem with the Java environment is that it requires a great deal of support code to create what is, in essence, a simple idea.

Groovy [5] is a new scripting language being developed for the Java platform. Groovy is compatible with Java at the bytecode level; this means that Groovy is Java. It has a Java-friendly syntax that makes the Java APIs easier to use. As a scripting language it offers an ideal way in which to glue components. Groovy provides native syntactic support for many constructs such as lists, maps and regular expressions. It provides dynamic typing, which can immediately reduce the code bulk. The Groovy framework removes the heavy lifting otherwise found in Java.

Thus the goal of the activity reported in this paper was to create a number of simple constructs that permit the construction of parallel systems more easily, without the somewhat heavyweight requirements imposed by Java. This was seen as particularly important when the concepts are being taught. By reducing the amount that has to be written, students may be able to grasp the underlying principles more easily.
1. The Spirit of Groovy

In August 2003 the Groovy project was initiated at codehaus [5], an open-source project repository focussed on practical Java applications. The main architects of the language are two consultants, James Strachan and Bob McWhirter. In its short life, Groovy has stimulated a great deal of interest in the Java community — so much so that it is likely to be accepted as a standard language for the Java platform.

Groovy is a scripting language based on several languages, including Java, Ruby, Python and Smalltalk. Although the Java programming language is a very good systems programming language, it is rather verbose and clumsy when used for systems integration. Groovy, with a friendly Java-based syntax, makes it much easier to use the Java Application Programming Interface. It is ideal for the rapid development of small to medium sized applications. Groovy offers native syntax support for various abstractions. These and other language features make Groovy a viable alternative to Java.

For example, the Java programmer wishing to construct a list of bank accounts would first have to create an object of the class ArrayList, then send it repeated add messages to populate it with Account objects. In Groovy, it is much easier:

  accounts = [ new Account(number : 123, balance : 1200),
               new Account(number : 456, balance : 400) ]

Here, the subscript brackets [ and ] denote a Groovy List. Observe also the construction of the Account objects: this is an example of a named property map, in which each property of the Account object is named along with its initial value.
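For comparison, the Java-style equivalent alluded to above might look as follows. This is a sketch only, assuming an Account class with number and balance properties (and hence generated setter methods); it is also legal Groovy, since Groovy accepts Java syntax.

  // Java-style construction: create the list, then populate it step by step.
  List accounts = new ArrayList();
  Account first = new Account();
  first.setNumber(123);
  first.setBalance(1200);
  accounts.add(first);
  Account second = new Account();
  second.setNumber(456);
  second.setBalance(400);
  accounts.add(second);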
Maps (dictionaries) are also directly supported in Groovy. A Map is a collection of key/value pairs, presented as a comma-separated list of key : value pairs, as in:

  divisors = [4 : [2], 6 : [2, 3], 12 : [2, 3, 4, 6]]
This Map is keyed by an integer, and the value is a List of integers that are divisors of the key.

Closures, in Groovy, are a powerful way of representing blocks of executable code. Since closures are objects, they can be passed around as, for example, method parameters. Because closures are code blocks, they can also be executed when required. Like methods, closures can be defined in terms of one or more parameters. One of the most common uses for closures is to process a collection: we can iterate across the elements of a collection and apply the closure to them. A simple parameterized closure is:

  greeting = { name -> println "Hello ${name}" }
The code block identified by greeting can be executed with the call message, as in:

  greeting.call ("Jon")   // explicit call
  greeting ("Ken")        // implicit call
Several List and Map methods accept closures as an actual parameter. This combination of closures and collections provides Groovy with some very neat solutions to common problems. The each method, for example, can be used to iterate across the elements of a collection and apply the closure, as in:

  [1, 2, 3, 4].each { element -> print "${element}; " }
which will print:

  1; 2; 3; 4;

Similarly:

  ["Ken" : 21, "John" : 22, "Jon" : 25].each { entry ->
      if (entry.value > 21) print "${entry.key}, "
  }

will print:

  John, Jon,
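Another collection method, used later in this paper to build process lists, is collect, which applies a closure to each element and gathers the results into a new list:

  squares = [1, 2, 3, 4].collect { it * it }   // yields [1, 4, 9, 16]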
2. The Groovy Parallel Constructs

Groovy constructs are required that follow the explicit requirements of CSP-based systems: direct support for parallel, for alternative, and for the construction of guards, reflecting that Groovy is a list-based environment whereas JCSP is an array-based system [5].

2.1 The PAR Construct

The PAR construct is simply an extension of the existing JCSP Parallel class that accepts a list of processes. The class comprises a constructor that takes a list of processes (processList) and casts them into an array of CSProcess, as required by JCSP.

  class PAR extends Parallel {
    PAR (processList) {
      super( processList.toArray(new CSProcess[0]) )
    }
  }
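A brief, hedged illustration of what the construct saves (p and q stand for any CSProcess instances):

  new Parallel( [p, q].toArray(new CSProcess[0]) ).run()   // raw JCSP
  new PAR( [p, q] ).run()                                  // with the construct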
2.2 The ALT Construct

The ALT construct extends the existing JCSP Alternative class with a list of guards. The class comprises a constructor that takes a list of guards (guardList) and casts them into an array of Guard, as required by JCSP. The main advantage of this constructor in use is that the channels that form the guards of the ALT are passed to a process as a list of channel inputs, so it is not necessary to create the Guard structure in the process definition. The list of guards can also include CSTimer and Skip.

  class ALT extends Alternative {
    ALT (guardList) {
      super( guardList.toArray(new Guard[0]) )
    }
  }
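Because the guard list is an ordinary Groovy list, a timeout guard can simply be appended to a list of channel input ends. The following sketch assumes cins is such a list; the one-second timeout and the handling code are illustrative only.

  timer = new CSTimer()
  alt = new ALT( cins + [timer] )          // channel guards first, timer last
  timer.setAlarm( timer.read() + 1000 )    // alarm one second from now
  index = alt.select()
  if (index == cins.size()) {
      // the timer guard fired: handle the timeout
  } else {
      data = cins[index].read()            // a channel guard fired
  }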
2.3 The CHANNEL_INPUT_LIST Construct

The CHANNEL_INPUT_LIST is used to create a list of channel input ends from an array of channels. This list can then be passed as a guardList to an ALT. This construct only needs to be used for channel arrays used between processes on a single processor. Channels that connect processes running on different processors (NetChannels) can be passed as a list without the need for this construct.

  class CHANNEL_INPUT_LIST extends ArrayList {
    CHANNEL_INPUT_LIST (array) {
      super( Arrays.asList(Channel.getInputArray(array)) )
    }
  }
2.4 The CHANNEL_OUTPUT_LIST Construct

The CHANNEL_OUTPUT_LIST is used to construct a list of channel output ends from an array of such channels, and provides the converse capability to a CHANNEL_INPUT_LIST. It should be noted that all the channel output ends have to be accessed by the same process.

  class CHANNEL_OUTPUT_LIST extends ArrayList {
    CHANNEL_OUTPUT_LIST (array) {
      super( Arrays.asList(Channel.getOutputArray(array)) )
    }
  }
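As a sketch of typical use (the Delta process here is invented for illustration and is not part of the package), a process holding a CHANNEL_OUTPUT_LIST can copy each input it receives to every output end:

  class Delta implements CSProcess {
      def cin     // a single channel input end
      def couts   // a CHANNEL_OUTPUT_LIST of output ends
      void run() {
          while (true) {
              def d = cin.read()
              couts.each { ch -> ch.write(d) }   // copy to every output end
          }
      }
  }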
3. Using the Constructs

In this section we demonstrate the use of these constructs, first in a typical student learning example based upon a number of sender processes having their outputs multiplexed into a single reading process. The second example is a little more complex: it shows a system that runs over a network of workstations and provides the basic control for a tournament in which a number of players of different capabilities play the same game (draughts) against each other; this is then used in an evolutionary system to develop a better draughts player.

3.1 A Multiplexing System

3.1.1 The Send Process

The specification of the class SendProcess is brief and contains only the information required. This aids teaching and learning, and also understanding of the purpose of the process. The properties of the class are defined as cout and id (lines 2 and 3) without any type information. The property cout will be passed the channel used to output data from this process, and id is an identifier for this process. The method run is then defined.

01 class SendProcess implements CSProcess {
02   cout                   // the channel used to output the data stream
03   id                     // the identifier of this process
04   void run() {
05     i = 0
06     1.upto(10) {         // loop 10 times
07       i = i + 1
08       cout.write(i + id) // write the value of id + i to cout
09     }
10   }
11 }
There is no necessity for a constructor for the class, nor for setter and getter methods, as these are all created automatically by the Groovy system. The run method simply loops 10 times, outputting the value of id plus the loop index variable i (lines 4 to 8). The explanation of its operation can thus focus on the communication aspects of the process.

3.1.2 The Read Process

The ReadProcess is similarly brief; this version extracts the SendProcess identification (s) and the value (v) from each datum sent to it. Note that types may be explicitly declared, as in the case of s (line 18), in order to achieve the desired effect. It is assumed that identification values are expressed in thousands.

12 class ReadProcess implements CSProcess {
13   cin                                       // the input channel
14   void run() {
15     while (true) {
16       d = cin.read()                        // read from cin
17       v = d % 1000                          // v the value read
18       int s = d / 1000                      // from sender s
19       println "Read: ${v} from sender ${s}" // print v and s
20     }
21   }
22 }
3.1.3 The Plex Process

The Plex process is a classic example of a multiplexing process that alternates over its input channels (cin) and then reads a selected input, which is immediately written to the output channel (cout) (line 31). The input channels are passed to the process as a list, and this list is passed to the ALT construct (line 27) to create the JCSP Alternative.

23 class Plex implements CSProcess {
24   cin    // channel input list
25   cout   // output channel onto which inputs are multiplexed
26   void run () {
27     alt = new ALT(cin)
28     running = true
29     while (running) {
30       index = alt.select ()
31       cout.write (cin[index].read())
32     }
33   }
34 }
3.1.4 Running the System on a Single Processor

Figure 1 shows a system comprising any number of SendProcesses together with a Plex process and a ReadProcess.
Figure 1. The Multiplex Process Structure
In a single-processor invocation, five channels, a, connect the SendProcesses to the Plex process; they are declared using the normal call to the Channel class of JCSP (line 35). Similarly, the channel b connects the Plex process to the ReadProcess (line 36). A CHANNEL_INPUT_LIST construct is used to create the list of channel inputs that will be passed to the Plex process and ALTed over (line 37).
The Groovy map abstraction is used (line 38) to create idMap, which relates the instance number of each SendProcess to the value that will be passed as its id property. A list (sendList) of SendProcesses is then created (lines 39–41) using the collect method on a list: the list comprises five instances of SendProcess, with the cout and id properties set to the values indicated, built by applying a closure to each member of the list [0,1,2,3,4]. A processList is then created (lines 42–45) comprising the sendList plus instances of Plex and ReadProcess that have their properties initialized as indicated. The flatten() method has to be applied because sendList is itself a List, and the nesting must be removed for the PAR constructor to work. Finally, a PAR construct is created (line 46) and run. In section 4 a formulation that removes the need for flatten() is presented.

35 a = Channel.createOne2One (5)
36 b = Channel.createOne2One ()
37 channelList = new CHANNEL_INPUT_LIST (a)
38 idMap = [0 : 1000, 1 : 2000, 2 : 3000, 3 : 4000, 4 : 5000]
39 sendList = [0,1,2,3,4].collect { i ->
40   return new SendProcess ( cout : a[i].out(), id : idMap[i] )
41 }
42 processList = [ sendList,
43                 new Plex (cin : channelList, cout : b.out()),
44                 new ReadProcess (cin : b.in() )
45               ].flatten()
46 new PAR (processList).run()
3.1.5 Running the System in Parallel on a Network To run the same system shown in Figure 1, on a network, with each process being run on a separate processor, a Main program for each process is required. 3.1.5.1 SendMain SendMain is passed the numeric identifier (sendId) for this process (line 47) as the zero’th command line argument. A network node is then created (line 48) and connected to a default CNSServer process running on the network. From the sendId, a string is created that is the name of the channel that this SendProcess will output its data on and a One2Net channel is accordingly created (line 51). A list containing just one process is created (line 52) that is the invocation of the SendProcess with its properties initialized and this is passed to a PAR constructor to be run (line 53). 47 48 49 50 51 52 53
47  sendId = Integer.parseInt( args[0] )
48  Node.getInstance().init(new TCPIPNodeFactory ())
49  int sendInstance = sendId / 1000
50  channelInstance = sendInstance - 1
51  outChan = CNS.createOne2Net ( "A" + channelInstance )
52  pList = [ new SendProcess ( id : sendId, cout : outChan ) ]
53  new PAR(pList).run()
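As a worked example of the arithmetic on lines 49-51: a SendProcess invoked with sendId 3000 computes sendInstance 3 and channelInstance 2, and therefore writes to the network channel named "A2"; the five processes with ids 1000 to 5000 thus claim the channel names "A0" to "A4".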
3.1.5.2 PlexMain

PlexMain is passed the number of SendProcesses as a command line argument (line 54), as there will be this number of input channels to the Plex process. These input channels are created as a list of Net2One channels (lines 57-59) with the same names as were created for each of the SendProcesses. As this is already a list, there is no need to obtain the input ends of the channels: this is implicit in the creation of Net2One channels. The Plex outChan is created as a One2Net channel with the name B (line 60), and the Plex process is then run in a similar manner to each of the SendProcesses (lines 61, 62).
54  inputs = Integer.parseInt( args[0] )
55  Node.getInstance().init(new TCPIPNodeFactory ())
56  inChans = []                              // an empty list of net channels
57  for (i in 0 ... inputs ) {
58    inChans << CNS.createNet2One ( "A" + i )  // append the channels
59  }
60  outChan = CNS.createOne2Net ( "B" )
61  pList = [ new Plex ( cin : inChans, cout : outChan ) ]
62  new PAR (pList).run()
3.1.5.3 ReadMain

ReadMain requires no command line arguments. It simply creates a network node (line 63), followed by a Net2One channel with the same name as was created for PlexMain's output channel (line 64), and the ReadProcess is then invoked in the usual manner.
63  Node.getInstance().init(new TCPIPNodeFactory ())
64  inChan = CNS.createNet2One ( "B" )
65  pList = [ new ReadProcess ( cin : inChan ) ]
66  new PAR (pList).run()
3.1.6 Summary

In the single-processor case, the processes are interleaved on one processor. In the multiprocessor case each process runs on a separate processor, and it is assumed that a CNSServer [6] is executing somewhere on the network.

3.2 A Tournament Manager

The Tournament System (see Figure 2) is organized as a set of Board processes that each run a game in the tournament on a different processor. The Board processes receive information about the game they are to play from an Organiser process. The results from the Board processes are returned via a ResultMux process running on the same processor as the Organiser process. In order that the system operates in a Client-Server [6] mode, each Board process is considered to be a client process and the combination of the Organiser and ResultMux processes is considered to be the server.
Figure 2. The Tournament System
The system requires that data be communicated as a set of GameData and ResultData objects. The system, as defined, cannot be executed on a single-processor system, because the design takes due account of the copying of network-communicated objects, which have to implement Serializable. More importantly, the use of an internal channel between two processes has to be considered: a reply channel is utilized to overcome the fact that an object reference is passed between the ResultMux and Organiser processes.
3.2.1 The Data Objects

Two data objects are used within the system. GameData holds information concerning the player identities and the playing weights associated with each player. A state property (line 72) is used to indicate whether the object holds playing data or is being used to indicate the end of the Tournament.
67  class GameData implements Serializable {
68    p1     // id of player 1
69    p2     // id of player 2
70    w1     // list of weights for player 1
71    w2     // list of weights for player 2
72    state  // string containing data or end
73  }
The ResultData object is used to communicate results from the Board processes back to the Organiser process. The use of each property of the object is identified in the corresponding comments. The board on which the game is played is required (line 79) so that the Organiser process can immediately send another game to that Board process. The state property (line 80) is used to indicate one of three states: the board has been initialized and is waiting for a game, the object contains the results of a game, or the tournament is finishing.
74  class ResultData implements Serializable {
75    p1          // player 1 identifier
76    p2          // player 2 identifier
77    result1V2   // result of game for p1 V p2
78    result2V1   // result of game for p2 V p1
79    board       // board used
80    state       // String containing init or result or end
81  }
3.2.2 The Board Process

The Board process is a client process and has been constructed so that an output to the Organiser in the form of a result.write() communication (lines 96, 103, 119) is always followed immediately by a work.read() (line 98). The initialization code, with its output, is immediately followed, in the main loop, by the required input operation. The main loop comprises the two sections of an if-statement, which finish with either the outputting of a result or a termination message. The latter does not need to receive an input from the Organiser process, because the Board process will by then have been terminated itself. In the normal case, the outputting of a result at the end of the loop is immediately followed by an input at the start of the loop. These lines (96, 98, 103, 119) are marked with comments in the code listing. A consequence of this design approach is that only one ResultData and one GameData object are required, thereby minimizing the use of the very expensive new operator.

The most interesting aspect of the code is that access to the properties of the data classes is made simply using the dot notation. This results from Groovy automatically generating the required setters, getters and class constructors. This has the immediate benefit of making the code more accessible, so that key points such as the structure of client and server processes are more obvious.
 82  class Board implements CSProcess {
 83
 84    bId      // the id for this Board process
 85    result   // One2One channel connecting the Board to the ResultMux
 86    work     // One2One channel used to send work to this Board
 87
 88    void run() {
 89      println "Board ${bId} has started"
 90      tim = new CSTimer()             // used to simulate game time
 91      gameData = new GameData()       // the weights and player ids
 92      resultData = new ResultData()   // the result of this game
 93      resultData.state = "init"
 94      resultData.board = bId
 95      running = true
 96      result.write(resultData)        // send init to Organiser
 97      while (running) {
 98        gameData = work.read()        // always follows a result.write
 99        if ( gameData.state == "end" ) {   // end of processing
100          println "Board ${bId} has terminated"
101          running = false
102          resultData.state = "end"
103          result.write(resultData)    // send termination to ResultMux
104        }
105        else {
106          // run the game twice with P1 v P2 and then P2 v P1
107          // simulated by a timeout
108          tim.after ( tim.read() + 100 + gameData.p2 )
109          println "Board ${bId} playing games for " +
110                  "${gameData.p1} and ${gameData.p2}"
111          outcome1V2 = bId            // return the bId of the board playing game
112          outcome2V1 = -bId           // instead of the actual outcomes
113          resultData.state = "result"
114          resultData.p1 = gameData.p1
115          resultData.p2 = gameData.p2
116          resultData.board = bId
117          resultData.result1V2 = outcome1V2
118          resultData.result2V1 = outcome2V1
119          result.write(resultData)    // send result to ResultMux
120        }
121      } } }
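The generated accessors and named-property constructor can be demonstrated in isolation; this is standard Groovy behaviour using the ResultData class above:

    r = new ResultData(p1: 0, p2: 1, state: "init")   // named-property constructor
    r.board = 3                // dot notation invokes the generated setter
    assert r.getBoard() == 3   // the explicit getter that Groovy generated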
3.2.3 The ResultMux Process

This process forms part of the tournament system and is used to multiplex results from the Board processes to the Organiser. The ResultMux process runs on the same processor as the Organiser, and thus access to any data objects shared by both processes has to be carefully managed. If this is not done, there is a chance that one process may overwrite data that has already been communicated to the other process, because only an object reference is passed during such communications. In this case, the resultData object is read in the ResultMux process and manipulated within the Organiser.

Yet again the aim is to reduce the number of new operations undertaken: new is expensive and also leads to repeated invocations of the Java garbage collector. In the version presented here, only one instance of a ResultData object is created, outside the main loop of the process; no new operation occurs within the loop (lines 129-144).

The only other problem to be overcome is that of terminating the ResultMux process. One of the properties of the process (boards) is the number of parallel Board processes invoked by the system. When a Board process receives a GameData object that has its state set to "end", it communicates this to the ResultMux process as well. Once the ResultMux process has received the required number of such messages it can terminate itself (lines 137-140).
The other aspect of note is that the property resultsIn is a list of network channels, and that this can be used as a parameter to the ALT construct without any modification, because ALT (line 132) expects a list of input channel ends, which is precisely the type of a Net2One channel (see 3.2.6). Any ResultData that is read in on the resultsIn channels is immediately written to the resultOut channel (line 143). The use of the reply property will be explained in the next section.

122  class ResultMux implements CSProcess {
123    boards      // number of boards; used for process termination
124    resultOut   // output channel from Mux to Organiser
125    reply       // channel indicating result processed by Organiser
126    resultsIn   // list of result channels from each of the boards
127
128    void run () {
129      resultData = new ResultData()   // holds data from boards
130      endCount = 0
131      println "ResultMux has started"
132      alt = new ALT (resultsIn)
133      running = true
134      while (running) {
135        index = alt.select()
136        resultData = resultsIn[index].read()
137        if ( resultData.state == "end" ) {
138          endCount = endCount + 1
139          if ( endCount == boards ) {
140            running = false
141          }
142        } else {
143          resultOut.write(resultData)
144          b = reply.read()
145        }
146      } } }
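The hazard that the reply channel guards against can be sketched in a few lines of plain Groovy: an internal channel communication transfers only a reference, so both processes alias the same object:

    d1 = new ResultData(state: "result")
    d2 = d1                   // what an internal channel write actually transfers
    d2.state = "end"          // premature reuse by the writer...
    assert d1.state == "end"  // ...is visible to the reader as well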
3.2.4 The Organiser Process

This is the most complex process, but it breaks down into a number of distinct sections that facilitate its explanation. Yet again the use of the new operation has been limited to those structures that are required, and none occur within the main loop of the process.

The outcomes structure is a list of lists that will contain the result of each game. The access mechanism is similar to array access, but Groovy permits other styles of access that are more list oriented. Initially, each element of the structure is set to a sentinel value of 100 (lines 159-166). The result of each pair of games, pi plays pj and pj plays pi for all i ≠ j, is recorded in the outcomes structure such that pi v pj is stored in the upper triangle of outcomes and pj v pi in the lower part. Games such as draughts and chess have different outcomes for the same players depending upon which is white or black, and hence which is the starting player.

The main loop has been organized so that the Organiser receives a result from the ResultMux; saving the game's results in the outcomes structure and then sending another game to the now idle Board process achieves this (lines 171-178). However, before another game is sent to the Board process, a reply (line 178) is sent to the ResultMux process to indicate that the ResultData has been processed. What is passed from the ResultMux to the Organiser is an object reference, not a copy. JCSP requires that once a process has written an object it should not access that object again until it is safe to do so. Thus, once the outcomes structure has been updated the object is no longer required, and hence the reply can be sent to the ResultMux process immediately. This happens on two occasions: first when the resultData contains the state "init" (line 180) and, more commonly, when a result is returned and the state is "result" (line 178).
147  class Organiser implements CSProcess {
148    boards    // the number of boards that are being used in parallel
149    players   // number of players
150    work      // channels on which work is sent to boards
151    result    // channel on which results are received from ResultMux
152    reply     // reply to ResultMux from Organiser
153
154    void run () {
155      resultData = new ResultData()   // create the data structures
156      gameData = new GameData()
157      println "Organiser has started"
158      // set up the outcomes
159      outcomes = [ ]
160      for ( r in 0 ..< players ) {    // cycle through the rows
161        row = [ ]                     // 0 ..< n gives 0 to n - 1
162        for ( c in 0 ..< players ) {  // cycle through the columns
163          row << 100                  // 100 acts as sentinel
164        }
165        outcomes << row
166      }
167      // the main loop
168      for ( r in 0 ..< players ) {
169        c = r + 1
170        for ( c in c ..< players ) {  // play each pair [r,c], c > r, once
171          resultData = result.read()  // an object reference not a copy
172          b = resultData.board
173          if ( resultData.state == "result" ) {
174            p1 = resultData.p1
175            p2 = resultData.p2
176            outcomes [ p1 ] [ p2 ] = resultData.result1V2
177            outcomes [ p2 ] [ p1 ] = resultData.result2V1
178            reply.write(true)         // outcomes processed
179          } else {
180            reply.write(true)         // init received
181          }
182          // send the game [r,c] to Board process b
183          gameData.p1 = r
184          gameData.p2 = c
185          gameData.state = "data"
186          // set w1 to the weights for p1
187          // set w2 to the weights for p2
188          work[b].write(gameData)
189        }
190      }
191      // now terminate the Board processes
192      println "Organiser: Started termination process"
193      gameData.state = "end"
194      for ( i in 0 ... boards ) {
195        resultData = result.read()
196        bd = resultData.board
197        p1 = resultData.p1
198        p2 = resultData.p2
199        outcomes [ p1 ] [ p2 ] = resultData.result1V2
200        outcomes [ p2 ] [ p1 ] = resultData.result2V1
201        reply.write(true)
202        work[bd].write(gameData)
203      }
204      println "Organiser: Outcomes are:"
205      for ( r in 0 ... players ) {
206        for ( c in 0 ... players ) {
207          print "[${r},${c}]:${outcomes[r][c]}; "
208        }
209        println " "
210      }
211      println "Organiser: Tournament has finished"
212    }
213  }
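As a worked example of the outcomes structure just listed: when the board reports the games between players 1 and 2, outcomes[1][2] receives result1V2 (the upper triangle) and outcomes[2][1] receives result2V1 (the lower triangle); any entry still holding the sentinel 100 denotes a pairing for which no result has arrived.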
Initially, the loop will receive as many "init" messages as there are Board processes. Thus, once all the games have been sent to the Board processes, each of the Board processes will still be processing a game. Hence, another loop has to be used to input the last game result from each of these processes (lines 194-203). In this case the gameData that is output contains the state "end"; this will cause the Board process that receives it to terminate, but not before it has also sent the message on to the ResultMux process. Finally, the outcomes can be printed (lines 204-211) or, in the real tournament system, evaluated to determine the best players so that they can be mutated in an evolutionary development scheme.

3.2.5 Invoking a Board Process

Each Board process has to be invoked on its own processor. The network channels are created using CNS static methods (lines 216, 217). It is vital that the channel names used in one process invocation are the same as those of the corresponding channels in the other processors.
214  Node.getInstance().init(new TCPIPNodeFactory ());
215  boardId = Integer.parseInt(args[0])    // the number of this Board
216  w = CNS.createNet2One("W" + boardId)   // the Net2One work channel
217  r = CNS.createOne2Net("R" + boardId)   // the One2Net result channel
218  println " Board ${boardId} has created its Net channels "
219  pList = [ new Board ( bId:boardId , result:r , work:w ) ]
220  new PAR (pList).run()
3.2.6 Invoking the Tournament

This code is similar, except that the lists of network channels are created by appending channels of the correct type to list structures (lines 224-230). Two internal channels between ResultMux and Organiser are created, M2O and O2M (lines 231, 232), and these are used to implement the resultOut and reply connections respectively between these processes. An advantage of the Groovy approach to constructors is that the constructor identifies each property by name, rather than relying on the order of arguments in the constructor call. It also increases the readability of the resulting code.
221  Node.getInstance().init(new TCPIPNodeFactory ());
222  nPlayers = Integer.parseInt(args[0])   // the number of players
223  nBoards = Integer.parseInt(args[1])    // the number of boards
224  w = []                                 // the list of One2Net work channels
225  r = []                                 // the list of Net2One result channels
226  for ( i in 0 ..< nBoards) {
227    i = i+1
228    w << CNS.createOne2Net("W" + i)
229    r << CNS.createNet2One("R" + i)
230  }
231  M2O = Channel.createOne2One()
232  O2M = Channel.createOne2One()
233  pList = [ new Organiser ( boards:nBoards , players:nPlayers ,
234                            work:w , result:M2O.in(), reply:O2M.out() ),
235            new ResultMux ( boards:nBoards , resultOut:M2O.out(),
236                            resultsIn:r, reply:O2M.in() )
237          ]
238  new PAR ( pList ).run()
4. The XML Specification of Systems

Groovy includes tree-based builders that can be sub-classed to produce a variety of tree-structured object representations. These specialized builders can then be used to represent, for example, XML markup or GUI user interfaces. Whichever kind of builder object is used, the Groovy markup syntax is always the same. This gives Groovy native syntactic support for such constructs.

The following lines, 239 to 248, demonstrate how we might generate some XML [7] to represent a book with its author, title, etc. The non-existent method call Author("Ken Barclay") delivers the element <Author>Ken Barclay</Author>, while the method call ISBN(number : "1234567890") produces the empty XML element <ISBN number="1234567890"/>.
239  // Create a builder
240  mB = new MarkupBuilder()
241  // Compose the builder
242  bk = mB.Book() {                  // <Book>
243      Author("Ken Barclay")         //   <Author>Ken Barclay</Author>
244      Title("Groovy")               //   <Title>Groovy</Title>
245      Publisher("Elsevier")         //   <Publisher>Elsevier</Publisher>
246      ISBN(number : "1234567890")   //   <ISBN number="1234567890"/>
247  }                                 // </Book>
It is also important to recognize that, since all this is native Groovy syntax being used to represent arbitrarily nested markup, we can also mix in any other Groovy constructs, such as variables, control flow (looping and branching), or true method calls. In keeping with the spirit of Groovy, manipulating XML structures is made particularly easy.

Associated with XML structures is the need to navigate through the content and extract various items. Having, say, parsed a data file of XML, traversing its structures is directly supported in Groovy with XPath-like [7] expressions. For example, a data file comprising a set of Book elements might be structured as:

249  <Library>
250    <Book> … </Book>
251    <Book> … </Book>
252    <Book> … </Book>
253    <Book> … </Book>
254  </Library>
If the variable doc represents the root of this XML document, then the navigation expression doc.Book[0].Title[0] obtains the first Title of the first Book. Equally, doc.Book delivers a List that represents all the Book elements in the Library. With a suitable iterator we immediately have the code to print the title of every book in the library:
255  parser = new XmlParser()
256  doc = parser.parse("library.xml")
257
258  doc.Book.each { bk ->
259      println "${bk.Title[0].text()}"
260  }
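Because builder markup is ordinary Groovy, the same library document could equally well be generated with a loop; a minimal sketch, assuming a hypothetical books list of maps:

    books = [ [author: "Ken Barclay", title: "Groovy"],
              [author: "C.A.R. Hoare", title: "Communicating Sequential Processes"] ]
    mB = new MarkupBuilder()
    mB.Library() {
        books.each { b ->        // control flow mixed into the markup
            Book() {
                Author(b.author)
                Title(b.title)
            }
        }
    }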
The ease with which Groovy can manipulate XML structures encourages us to consider representing JCSP networks as XML markup. Groovy can then manipulate that information, configure the processes and channels, and then execute the model.
For example, we might arrive at the following markup (lines 261-274) for the classical producer–consumer system built from the SendProcess and the ReadProcess described in 3.1.1 and 3.1.2. The libraries to be imported are specified on lines 262 and 263.
261  <!-- document root element -->
262  <!-- import: com.quickstone.jcsp.lang.* -->
263  <!-- import: uk.ac.napier.groovy.parallel.* -->
264  <!-- declaration of the channel chan -->
265  <processlist>
266    <process class="SendProcess">
267      <arg name="cout" value="chan.out()"/>
268      <arg name="id" value="1000"/>
269    </process>
270    <process class="ReadProcess">
271      <arg name="cin" value="chan.in()"/>
272    </process>
273  </processlist>
274  <!-- close of document root -->
To ensure the consistency of the information contained in these network configurations, we could define an XML schema [7] for this purpose. A richer schema defines how nested structures can be described. Extending the preceding example, we also permit a recursive definition whereby a simple <process> may itself be another <processlist>. Hence we can define the XML for the plexing system described in 3.1.4 as follows.
275  <!-- document root element -->
276  <!-- import: com.quickstone.jcsp.lang.* -->
277  <!-- import: uk.ac.napier.groovy.parallel.* -->
278  <!-- declaration of the channels a -->
279  <!-- declaration of the channel b -->
280  <!-- declaration of channelList -->
281  <processlist>
282    <processlist>
283      <process class="SendProcess">
284        <arg name="cout" value="a[0].out()"/>
285        <arg name="id" value="1000"/>
286      </process>
287      <process class="SendProcess">
288        <arg name="cout" value="a[1].out()"/>
289        <arg name="id" value="2000"/>
290      </process>
291      <process class="SendProcess">
292        <arg name="cout" value="a[2].out()"/>
293        <arg name="id" value="3000"/>
294      </process>
295      <process class="SendProcess">
296        <arg name="cout" value="a[3].out()"/>
297        <arg name="id" value="4000"/>
298      </process>
299      <process class="SendProcess">
300        <arg name="cout" value="a[4].out()"/>
301        <arg name="id" value="5000"/>
302      </process>
303    </processlist>
304    <process class="Plex">
305      <arg name="cout" value="b.out()"/>
306      <arg name="cin" value="channelList"/>
307    </process>
308    <process class="ReadProcess">
309      <arg name="cin" value="b.in()"/>
310    </process>
311  </processlist>
312  <!-- close of document root -->
By inspection we can see that the XML presented in lines 275 to 312 captures the Groovy specification of the system given in lines 35 to 46. The main difference is that the list of SendProcesses generated in lines 39 to 41 has been explicitly defined as a sequence of SendProcess definitions. A Groovy program can parse this XML and the system will then be invoked automatically on a single processor. The output generated automatically from the above XML script is shown in lines 314 to 330. As can be seen, it generates two PAR constructs, nested one in the other: the inner one contains the list of SendProcesses and is included within the outer one, which also runs the Plex and ReadProcess processes. Lines 314 and 315 show the jar files that have to be imported. The Groovy Parallel constructs described in section 2 have been placed in a jar file, emphasizing that Groovy is just Java.
314  import com.quickstone.jcsp.lang.*
315  import uk.ac.napier.groovy.parallel.*
316  a = Channel.createOne2One(5)
317  b = Channel.createOne2One()
318  channelList = new CHANNEL_INPUT_LIST(a)
319  new PAR([
320    new PAR([
321      new SendProcess(cout : a[0].out(), id : 1000),
322      new SendProcess(cout : a[1].out(), id : 2000),
323      new SendProcess(cout : a[2].out(), id : 3000),
324      new SendProcess(cout : a[3].out(), id : 4000),
325      new SendProcess(cout : a[4].out(), id : 5000)
326    ]),
327    new Plex(cout : b.out(), cin : channelList),
328    new ReadProcess(cin : b.in())
329  ])
330  .run()
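The program that performs this translation is not listed in the paper; purely as an illustration, and assuming the hypothetical element names used above, a Groovy script along the following lines could walk the parsed tree and build the nested PAR structure:

    // Illustrative sketch only: element names are assumed, and the channels
    // a, b and channelList are assumed already created from the declarations.
    factories = [ SendProcess: { m -> new SendProcess(m) },
                  Plex:        { m -> new Plex(m) },
                  ReadProcess: { m -> new ReadProcess(m) } ]
    shell = new GroovyShell(new Binding(a: a, b: b, channelList: channelList))
    buildList = { node ->
        node.children().collect { child ->
            if (child.name() == "processlist") {
                new PAR(buildList(child))    // a nested <processlist> becomes a nested PAR
            } else {                         // a <process> element
                args = [:]
                child.arg.each { args[it.'@name'] = shell.evaluate(it.'@value') }
                factories[child.'@class'].call(args)   // named-property construction
            }
        }
    }
    doc = new XmlParser().parse("plex.xml")  // hypothetical file name
    new PAR(buildList(doc.processlist[0])).run()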
5. Conclusions and Future Work

The paper has shown that it is possible to create problem solutions in a clear and accessible manner, such that the essence of the CSP-style primitives and operations is more easily understood. A special lecture was given to a set of students who were being taught Groovy as an optional module in their second year. This lecture covered the concepts of CSP and their implementation in Groovy. There was consensus that the approach had worked and that the students were able to assimilate the ideas. This does, however, need to be tested further in a more formal setting.

Currently, Groovy uses dynamic binding, and it can be argued that this is not appropriate for a proper software engineering language. If this checking could be done at compile time, say enabled by a switch, we could design, implement and test systems more robustly.

Work is being undertaken to develop a diagramming tool that outputs the XML required by the system builder. This would mean that the whole system could be seamlessly incorporated into existing design and development tools such as ROME [8]. This could be extended to develop techniques for distributing a parallel system over a network of workstations or a Beowulf cluster.

Further consideration could also be given to the XML specifications. An XML vocabulary might be developed that is richer than that presented. Such a vocabulary might provide a compact way to express, for example, the channels used as inputs to processes where they become the Guards of an ALT construct.

Can we answer the question posed by the title of this paper in the affirmative? We suggest that sufficient evidence has been presented and that this provides a real way forward for promoting the design of systems involving concurrent and parallel components.
Acknowledgements

A colleague, Ken Chisholm, provided the requirement for the draughts tournament. The helpful comments of the referees were gratefully accepted.

References

[1] Inmos Ltd, occam2 Programming Reference Manual, Prentice-Hall, 1988.
[2] C.A.R. Hoare, Communicating Sequential Processes, Prentice-Hall, New Jersey, 1985; available electronically from http://www.usingcsp.com/cspbook.pdf.
[3] P.H. Welch, Process Oriented Design for Java – Concurrency for All, http://www.cs.kent.ac.uk/projects/ofa/jcsp/jcsp.ppt, web site accessed 4/5/2005.
[4] G. Hilderink, A. Bakkers and J. Broenink, A Distributed Real-Time Java System Based on CSP, The Third IEEE International Symposium on Object-Oriented Real-Time Distributed Computing, ISORC 2000, Newport Beach, California, pp. 400-407, March 15-17, 2000.
[5] Groovy Developer's Web Site, groovy.codehaus.org, accessed 4/5/2005.
[6] Quickstone Ltd, www.quickstone.com, web site accessed 4/5/2005.
[7] http://www.w3.org/TR/REC-xml/; http://www.w3.org/TR/xpath.
[8] K. Barclay and J. Savage, Object Oriented Design with UML and Java, Elsevier, 2004; supporting tool available from http://www.dcs.napier.ac.uk/~kab/jeRome/jeRome.html.
Communicating Process Architectures 2005 Jan Broenink, Herman Roebbers, Johan Sunter, Peter Welch, and David Wood (Eds.) IOS Press, 2005 © 2005 The authors. All rights reserved.
On Issues of Constructing an Exception Handling Mechanism for CSP-Based Process-Oriented Concurrent Software†

Dusko S. JOVANOVIC, Bojan E. ORLIC, Jan F. BROENINK

Twente Embedded Systems Initiative, Drebbel Institute for Mechatronics and Control Engineering, Faculty of EE-Math-CS, University of Twente, P.O. Box 217, 7500 AE Enschede, the Netherlands
[email protected]

Abstract. This paper discusses issues, possibilities and existing approaches for fitting an exception handling mechanism (EHM) into CSP-based process-oriented software architectures. After giving a survey of the properties desired for a concurrent EHM, specific problems and a few principal ideas for including exception handling facilities in CSP designs are discussed. As one of the CSP-based frameworks for concurrent software, we extend the CT (Communicating Threads) library with exception handling facilities. The extensions result in two different EHM models, whose compliance with the most important demands of concurrent EHMs (handling simultaneous exceptions, formalization of the mechanism and efficient implementation) is observed.
Introduction

Under process-oriented architectures we assume, in principle, that a program's algorithms are confined within processes that exchange data via channels. When based on CSP [1], channels (communication relationships) are synchronous, following the rendezvous principle; executional compositions among processes are ruled by the CSP constructs, possibly represented as compositional relationships [2]. Today's successors of the programming language occam, which was the first to implement this programming model, are occam-like libraries for Java, C and C++ (the best known are the University of Twente variants CTJ [3], CTC and CTC++ [2, 4] and the University of Kent variants JCSP [5], CCSP [6] and C++CSP [7]). The "Twente" variants are together referred to as CT (Communicating Threads), and for this paper all experiments are worked out within that framework. The general Twente CSP-based framework for concurrent embedded control software is referred to as CSP/CT, which implies the use of those concepts of CSP that are implemented in CT and the accompanying tools [8] in order to provide this particular process-oriented software environment.

Recent work [9] is concerned with dependability aspects of CSP/CT, which revives interest in fault tolerance mechanisms for CSP/CT, among them the exception handling mechanism (EHM). Exception handling is considered "as the most powerful software fault-tolerance mechanism" [10]. An exception is an indication that something out of the ordinary has occurred which must be brought to the attention of the program which raised it [11].
† This research is supported by PROGRESS, the embedded system research program of the Dutch organization for Scientific Research NWO, the Dutch Ministry of Economic Affairs and the Technology Foundation STW.
Practical results over thirty years of research history ([12]) have appeared as sophisticated EHMs in modern mainstream languages used for programming mission-critical systems, like C++, Java and Ada.

This paper considers the exception handling concept on a methodological level of designing concurrent, CSP/CT process-oriented software. An EHM allows system designers to distribute dedicated corrective or alternative code components at places within the software composition that maximize the effectiveness of error recovery. The principles of an EHM are based on the provision of separate code segments or components to which the execution flow is transferred upon an error occurrence in the ordinary execution. Code segments or components that attempt error recovery (exception handling) are called exception handlers. The main virtue of this way of handling errors in software execution is a clear separation between the normal (ordinary) program flow and the parts of the software dedicated to correcting errors.

Because of the alterations of a program's execution flow due to exceptional operations, EHMs additionally complicate the understanding of concurrent software. In [13], issues of exception handling in sequential systems are contrasted with those in concurrent systems, especially the problems of resolving concurrently raised exceptions and of simultaneous error recovery.

Despite its favourable properties in structuring error handling, and the fact that the EHM is the only structured fault tolerance concept directly supported at the level of languages, exception handling is not so readily used in mission- or life-critical systems. The lack of tractable methods for testing or, even more desirable, formally verifying programs with exception handling is to blame for the hesitant use of this powerful concept. As clearly stated in [14], "since exceptions are expected to occur rarely, the exception handling code of a system is in general the least documented, tested, and understood part. Most of the design faults existing in a system seem to be located in the code that handles exceptional situations."

1 Properties of Exceptions and Exception Handling Mechanisms (EHMs)

1.1 EHM Requirements

1.1.1 General EHM Properties

The following list combines some general properties for evaluating the quality and completeness of an Exception Handling Mechanism (EHM) [13, 15, 16]. It should:

1. be simple to understand and use.
2. provide a clear separation of the ordinary program code flow and the code intended for handling possible exceptions.
3. prevent an incomplete operation from continuing.
4. allow exceptions to contain all information about the error occurrence that may be useful for a proper handling, i.e. recovery action.
5. incur overhead in the execution of exception handling code only in the presence of an exception – the burden of exception handling on the error-free execution flow should be negligible.
6. allow a uniform treatment of exceptions raised both by the environment and by the program.
7. be flexible, to allow adding, changing and refining exceptions.
8. impose declaring the exceptions that a component may raise.
9. allow nesting of exception handling facilities.
1.1.2 Properties of a Concurrent EHM

The main difficulty in extending well-understood sequential EHMs for use in concurrent systems is that the occurrence of an exception in one of the collaborating processes certainly has consequences for the other (parallel composed) processes. For instance, exceptional interruption of one process before a rendezvous communication certainly causes blocking of the other party in the communication, producing a deadlock-like situation [17]. It is likely that an exceptional occurrence detected in one process is of concern to the other processes.

In large parallel systems it may easily happen that independent exceptions occur simultaneously: more than one exception has been raised before the first one has been handled. The EHM, actually the exception handlers, should detect these so-called concurrent exception occurrences [13]. Also, the same error may affect different processes in different scenarios, causing different but related exceptions. Such concurrent (and possibly related) exceptions need to be treated in a holistic way. In these situations handling exceptions one by one may be wrong – therefore in [13] the notion of exception hierarchy has been introduced. The term "exception hierarchy" should be distinguished from the hierarchy of exception handlers (which determines exception propagation, as addressed in the remainder). Neither has it anything to do with a possible inheritance hierarchy of exception types. The concept of exception hierarchy helps reasoning and acting in the case of multiple simultaneously occurring exceptions: "if several exceptions are concurrently raised, the exception used to activate the fault tolerance measures is the exception that is the root of the smallest subtree containing all of the exceptions" [13].

For coping with the mentioned problems, a concurrent EHM should make sure that:

10. upon an exception occurrence in a process communicating in a parallel execution with other processes, all processes dependent on that process get informed that the exception has occurred.
11. all participating processes simultaneously enter recovery activities specific to the exception that occurred.
12. in case of concurrent exception occurrences in different parallel composed processes, a handler is chosen that treats the compound exceptional situation rather than the isolated exceptions.

1.1.3 Formal Verifiability and Real-time Requirements

In order to use any variant of the EHM models proposed in section 3 for high-integrity real-time systems (and to benefit from the CSP foundation for such a mechanism), the proposal should ensure that:

13. the mechanism can be formally described and verified. The system as a whole, including both normal and exception handling operating modes, should be amenable to formal checking analysis.
14. the temporal behaviour of the EHM implementation is as predictable and controlled as possible. In real-time systems, the execution time of the EHM part of an application should be taken into account when calculating the temporal properties of execution scenarios.
1. run-time environment:
   a. run-time libraries and OS – illegal memory address, memory allocation problems, division by zero, overflow, etc.;
   b. CT library components can raise exceptions (e.g. network device drivers or remote link drivers on an expired timeout; an array index outside its range, dereferencing a null pointer).
2. invalid(ated) channels (i.e. a broken communication link, a malfunctioning device or "poisoned" channels).
3. consistency checks inserted at certain places in a program can fail (e.g. a variable can go outside a permitted range).
4. exceptions induced by exceptions raised in some of the processes important to the execution of the process at hand.

1.3 Mechanism of Exception Propagation

After being thrown, an exception propagates to the place where it can eventually be caught (and handled). A crucial part of an exception handling facility is its propagation mechanism, which determines how to find a proper exception handler for the type of exception that has been thrown. Exception propagation always follows a hierarchical path, and different languages make different choices [15, 16, 18]: dynamically along the function call chain or object creation chain, or statically along the lexical hierarchy [19]. The exception propagation mechanism is crucial to understanding the execution flow in the presence of exceptions, and its complexity directly influences acceptance of the concept in practice.

1.4 Termination and Resumption EHM Models

The occurrence of an exception causes interruption of the ordinary program flow and a transfer of control to an exception handler. The state of the exceptionally interrupted processes is also a concern. Depending on the flow of execution between the ordinary and exceptional operation of software (in the presence of an exception), the so-called handling models [15] can be predominantly divided into two groups: termination and resumption EHM models.

In the termination model, further execution of an "exception-guarded" process, function or code block interrupted by an exceptional occurrence is aborted and never resumed. Instead, it is the responsibility of the exception handler to bring the system into such a state that it can continue providing the originally specified (or gracefully degraded) service. If the exception handler is not capable of providing such a service, it will throw the exception further. Adopting the termination model therefore has an intrinsically unwelcome feature: the functionality of the interrupted process after the exceptional occurrence (termination) point has to be repeated in the handler. It may easily happen that the entire job done before the exception occurrence has to be repeated. Therefore, the idea of allowing (also) the resumption mechanism within an EHM does not lose any of its attractions.

In the resumption model, an exception handler will also be executed following the exception occurrence; however, the context of the exceptionally interrupted process is preserved, and after the exception is handled (i.e. the handler has terminated), the process continues its execution at the point where it was interrupted.

Both exception handling models initially gained equal attention, but practice made the termination model prevail for sequential EHMs, as it is much simpler to implement. It is adopted in all mainstream languages, such as C++, Java and Ada.
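In sequential Groovy/Java syntax, the termination model is exactly what try/catch provides; a minimal sketch with hypothetical operation names:

    try {
        guardedOperation()   // aborted at the exception point, never resumed
    } catch (Exception e) {
        recover(e)           // the handler supplies the original or a degraded service
    }
    // A resumption model would instead continue guardedOperation() at the
    // interruption point after recover(e) – something try/catch cannot express.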
2 Exception Handling Facilities in CSP-based Architectures

The EHM models discussed in the next section address the concurrency-specific issues and are therefore aimed at use at the level of processes in a process-oriented concurrent environment. They should be implementable in any language suitable for implementing the CSP principles themselves. A further wish is that the mechanism does not restrict the use of the sequential exception handling facilities (if any) present in a chosen implementation language. If a process encapsulates a complex algorithm that was originally developed using some native exception handling facilities, there should be no need to modify the original code. As long as the use of a native EHM is confined to internal use within a process, it does not clash with the EHM at the process level. Practically, this means that internally used exceptions must all be handled within the process. However, as a last resort, a component should submit all unhandled exceptions to the process-level EHM, complying with the process-level exception handling mechanism.

The principal difficulty in concerting error recovery in concurrent systems is posed by the fact that an exception occurrence in one process is an asynchronous event with respect to other processes. In a system designed as a parallel composition of many processes, proper handling of an exception occurrence that takes place in one of the participating processes might require that other dependent processes are interrupted as well.

Propagation of unhandled exceptions is performed according to the hierarchical structure of exception handlers. In occam and the CSP/CT framework, the system is structured as a tree-like hierarchy made of constructs as branches and custom user processes, containing only channel communications and pure computation blocks, as leaves. A natural choice is to reuse the existing hierarchical construct/process structure and to use processes and constructs as basic exception handling units. This choice can be implemented in a few ways:

• every process/construct can be associated with an exception handler,
• extended, exception-aware versions of processes/constructs can be used instead of ordinary processes and constructs,
• a particular exception handling construct may be introduced.

Regardless of the particular implementation, upon unsuccessful exception handling at the process level, the exception will be thrown further to the scope of a construct. Due to implementation issues, the termination model is preferred at the leaf-process level in an application. The termination model applied at the construct level would mean that, prior to the execution of a construct-level exception handler, all the subprocesses of the construct would have to terminate. This can happen in several ways: one can choose to wait till all subprocesses terminate (regularly or exceptionally), or to force abortion of the further execution of all subprocesses. In real-time systems, where timely reaction to unexpected events is very important, the latter may be the appropriate choice. Abandoning the termination model (at the construct level) and implementing the resumption model is a better option when an exception does not influence some subprocesses at all, or influences them in a way that can be handled without aborting the subprocesses. Using the resumption model at the construct level would not imply that a whole construct has to be aborted in order to handle an exception that propagated to the construct level.
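As an illustration of the third option above, a minimal Groovy sketch (a hypothetical class, not the actual CT library API) of a dedicated exception handling construct guarding a process with the termination model:

    class ExceptionConstruct {
        Closure guarded   // ordinary operation of the guarded process P
        Closure handler   // the exception handling process Q
        void run() {
            try {
                guarded()     // ordinary program flow
            } catch (Exception e) {
                handler(e)    // P is terminated; Q attempts recovery,
            }                 // rethrowing if it cannot handle the exception
        }
    }

    // usage: new ExceptionConstruct(guarded: { /* P */ }, handler: { e -> /* Q */ }).run()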
2.1 Asynchronous Transfer of Control (ATC)

One way to implement the termination model is by an internal mechanism, related to the constructs, that can force the execution environment to abort all subprocesses and release all the resources they might be holding. This approach resembles Ada's Asynchronous Transfer of Control (ATC) and asynchronous notification in Real-Time Java.
However, forcing exceptional termination of all communicating, parallel composed processes poses a higher risk of corrupting process states by an asynchronous abortion (therefore the ATC is disabled in the Ada Ravenscar Profile [20] for high-integrity systems). It is important that such a mechanism is made in a way that gives all aborted subprocesses a chance to finish in a proper state. This can be done by executing the associated exception handlers for each subprocess.

2.2 Channel Poisoning

The other, more graceful, termination model is channel poisoning: sending a poison (or reset) token along the channels of a CSP network is proposed in [21] as a mechanism for terminating (or resetting) an occam network of processes. Processes that receive the poison spread it further via all the channels they are connected to. Eventually all processes interconnected via channels will receive the poison token and terminate. The method can be used for implementing the termination model of constructs. In the CSP/CT framework this approach is slightly modified, as proposed in [2]: instead of passing the poison via the channels, the idea is to poison (invalidate) the channels themselves. Furthermore, in [9] it is proposed that any attempt to access a poisoned channel by invoking its read/write operations results in an exception being thrown in the context of the invoking process. Consequently, the exception handler associated with the process can handle the situation and/or poison other channels.

3 Architectures of EHM Models

Having in mind all the challenges of constructing a usable EHM for concurrent software, the CSP architecture can be viewed as offering an interesting environment for doing so. In this part a few concepts are discussed with one eye on all the listed requirements, among which special concern is given to: the handling of simultaneous exceptions, the formalization of the mechanism, and a (timely) efficient implementation.

3.1 Formal Backgrounds of EHM

The first CSP construction capturing the behaviour in which a process Q takes over after another process P signals a failure was conceived by Hoare as early as 1973 [22], as P otherwise Q.
Association of a process and its handler can be modelled as in Figure 1.
Figure 1. Exception relation between a process P and its exception handling process Q
In the graphical notation implemented in the gCSP tool [8], the exception handling process Q (exception handling processes are represented as ellipses) is associated with the exception-guarded process P (ordinary processes are rectangles) by a compositional exception relationship [2], actually following Hoare's "otherwise" principle. On similar grounds, there have been several attempts to use CSP to formalize exception handling [2, 17, 23, 24]. However, all these attempts have been limited to formalizing the basic flow of activity upon exceptional termination of one process for the benefit of another
(thus without building a comprehensive mechanism that fulfils the aforesaid requirements for a concurrent EHM). Also, they did not work out an implementation in a practical programming language (with the exception of [2]). Common to all is that both ordinary operation and exceptional operation are encapsulated in processes. The compositionality of the design is preserved by combining these processes by a construct.

Hoare eventually also catered for the basic termination principle with his interrupt operator (△) in [25], in a follow-up work [26] annotated with an exception event i (△i). Despite its name, the semantics of the △ operator is much closer to the termination model of exception handling than to what is today usually referred to as "interrupt handling", since it implies termination of the left hand-side operand by an unconditional preemption by the right hand-side one. A true "interrupt" operator would be useful for modelling the resumption model of exception handling (as it actually was in the original proposal of the interrupt operator in [23]). In [25], yet another operator (alternation) may be used to describe resuming a process execution after the execution of another process (however, this operator is not supported by the FDR model checker, while △ is).

In a recent work [27] a CSP-based algebra (with another variant of Hoare's exception-interrupt constructs) is developed for long transactions threatened by exceptional events. The handling of interrupts (exceptions) relies on the assumption that compensation for a wrongly taken act is always possible. This assumption is too strong in the context of controlling mechanical systems (with ever-present real-time demands). Moreover, the concept focuses on undoing wrong steps and not directly on fault tolerance.

Termination semantics is captured, besides Hoare's △, also by the (virtually the same) "except" operator proposed in [23] and by the exception operator that appears in [2]. Whichever version is used for modelling the exceptional termination of a process P that gets preempted by the handler Q, it can be represented by a compositional hierarchy (Figure 2) that corresponds to Figure 1 as:
Figure 2. Compositional hierarchy of an exception construction
By "compositional hierarchy" we mean the way occam networks are built of processes and constructs (which are also processes). We find that the tree structure captures this kind of executional composition excellently [8].
3.2 The Exception Construct
In the semantics of this exception operator [2], the composition in Figure 2 is interpreted as follows: upon an exception occurrence in process P, an exception is thrown and P terminates; the exception is caught by the exception construct (ExC1) and forwarded to Q, which begins its execution (handling the exception). The concept of using a construct for modelling exception handling has a favourable consequence for the mechanism of propagating (unhandled) exceptions: in a CSP network with exception constructs, from the moment an exception is created and thrown by a process, it propagates upwards along the compositional hierarchy until a proper handler is found. The propagation mechanism is therefore clear and simple, since it follows the compositional structure of the CSP/CT concurrent design.

Instead of the process P in Figure 2, there may be a construct with multiple processes. If the construct is an Alternative or a Sequential one, the situation is the same as with a single
process: upon exceptional termination of one of the alternatively or sequentially composed processes, the exception is caught by the exception construct and handled by the process Q. However, in the case of the Parallel construct, there is a possibility that more than one process ends up in an exceptional situation (and therefore terminates by throwing a different exception). Consider the situation in Figure 3.
Figure 3. Parallel construct under exception construct
Handler Q handles exceptions that may arise during execution of the parallel composition of the processes P1, P2 and the exception construct ExC1 (actually, the exceptions thrown by P3 and not handled by Q3). Here, the question is at which moment the exceptions from P1 should be handled (provided that the exception occurrence happens before P2 finishes). Moreover, what if P2 exceptionally terminates as well?

In the current implementation of the exception construct [2, 28], the exceptions occurring in parallel composed processes are handled when the Parallel construct is terminated (i.e. when all parallel composed processes have terminated, successfully or exceptionally); for catching and handling all possible exceptions occurring in a parallel composition, a concept of exception set (a collection of exceptions) is introduced. After termination of Par1, handler Q gets an exception set object with all exceptions thrown by the child processes (P1, P2 and ExC1 – all exceptions possibly unhandled by Q3 are rethrown). The concept of exception set has another useful role: from its contents a handler can reconstruct the exception hierarchy in case of simultaneous (concurrent) exceptions.

3.2.1 Channel Poisoning and the Exception Construct

Sending a poison along channels as proposed in [21] is a mechanism for terminating a network or subnetwork. In the discussed EHM model proposal, the poisoning mechanism assumes that channels can be turned into a poisoned state, in which they respond to attempts at writing or reading by throwing back exceptions. In this way two problems are solved. The first problem is the blocking of a rendezvous partner when the other one has exceptionally terminated. Consider the following situation (Figure 4, Figure 5):
Figure 4. Rendezvous (potential blocking)
Figure 5. Hierarchical representation of Figure 4
Processes P1 and P2 are both "exception-guarded" by the exception constructs ExC1 and ExC2 (i.e. by the handlers Q1 and Q2 respectively), which are then parallel composed. Processes P1 and P2 communicate over channel c. Should it happen that one of the processes exceptionally terminates (before the rendezvous point), the other process stays blocked on channel c. For that reason, handlers Q1 and Q2 should in principle turn the channel into the poisoned state, so that the other party terminates with the same exception that caused the first process to terminate. To recall, this exception is thrown on an attempt at reading or writing. Moreover, an already poisoned channel responds to further poisoning attempts (which are function calls) by returning the poisoning exception, for a reason that will be explained soon. If, however, the other rendezvous partner is already blocked on the channel, it should be released at the act of poisoning (and then end up with the exception).

For this scheme to work, it is clear that all communicating parallel composed processes should be "exception-guarded", i.e. sheltered behind exception constructs. In that case, an elegant possibility for concerted simultaneous exception handling comes automatically. On an exceptional occurrence in one of the communicating processes, provided all are accompanied by handlers that poison all channels connected to "their" processes, information about the exception spreads within the parallel composition. In case of simultaneous exceptions, the spread of different exceptions progresses from different places (processes) under a parallel construct. It will then inevitably happen that a handler tries to poison a channel that is already poisoned (with another exception). When channels respond to the attempt at poisoning by returning the exception that poisoned them initially, the handlers get information about the occurrence of simultaneous exceptions. The handler of the parallel construct will ultimately be able to reconstruct the complete exception hierarchy.

However, this mechanism suffers from two major problems. The first one is the possible (unbounded) delay between the occurrence of a (first) exception and handling it at the level of the parallel construct. Remember that all parallel composed processes must terminate before the handler of the parallel construct gets a chance to analyse and handle the exception (set). Some processes may spend a lot of time before coming to the rendezvous on a (poisoned) channel and consequently being terminated! The possibility of Asynchronous Transfer of Control has already been noted as unwelcome in high-integrity systems. An additional penalty is that, for the mechanism of rethrowing exceptions from poisoned channels to work, it is necessary to clone exceptions (so that every handler can consider the total exceptional situation) or at least keep a rigorous administration of (pointers to) the occurred exceptions.

The other problem, inherent to the mechanism of channel poisoning, is that a poison spread is naturally bounded by the interconnection network of channels and not by the boundaries of constructs. Namely, some channels may run to processes that belong to other constructs; ultimately this may lead to termination of the whole application, which contradicts the idea of exception handling as the most powerful fault tolerance mechanism.
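The channel behaviour assumed throughout this subsection can be summarized in a Groovy sketch (hypothetical classes; the CT implementation is not shown in the paper): a poisoned channel releases a blocked partner and answers every subsequent read, write or poison attempt with the exception that poisoned it first:

    class PoisonException extends RuntimeException { }

    class PoisonableChannel {
        private Object buffer                 // rendezvous data slot (details elided)
        private PoisonException poison = null

        synchronized void poison(PoisonException p) {
            if (poison != null) throw poison  // an already poisoned channel returns
            poison = p                        // the exception that poisoned it first
            notifyAll()                       // release a partner blocked in a rendezvous
        }

        synchronized Object read() {
            if (poison != null) throw poison  // refuse communication once poisoned
            // ... the normal rendezvous protocol is elided in this sketch ...
            return buffer
        }
    }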
In [21] the possibility of inserting special processes on the boundaries of the subnetworks subject to poisoning is proposed, but that means introducing completely non-functional components into the system. The other option is model-based (tool-based) control of the poison spreading.

3.3 Interrupt Operator △i, Environmental Exception Manager and Exception Channels

In the channel poisoning concept, propagation of an exception event was based on the existing communication channels. More apt to formal modelling would be a termination model based on a concept that considers exceptions as explicit events communicated among
exception handlers via explicit exception channels. This change in paradigm makes formal modelling and checking more straightforward.

Let us consider a Parallel construct Par containing three subprocesses: P1, P2 and P3 (see Figure 6). The process-level exception handlers associated with these processes are Q1, Q2 and Q3 respectively. In the scope of the exception construct ExC, the exception handler associated with the construct Par is the process Q.
Figure 6. Design rule for fault-tolerant parallel composition with environmental care
Using the interrupt operator, this could be written in CSP as (i1, i2 and i3 being explicit exception events):

Par = (P1 △i1 Q1) || (P2 △i2 Q2) || (P3 △i3 Q3).
In turn, the relation between the processes Par and Q is modelled in the same way:

Par △i Q = ((P1 △i1 Q1) || (P2 △i2 Q2) || (P3 △i3 Q3)) △i Q,
where i is the Par-level exception event.

If an exception occurs during execution of the process P1, the process will be aborted and the associated exception handler Q1 will be invoked. This can actually be seen as an implicit occurrence of the exception event i1. If the exception cannot be handled by Q1, it should be communicated to the higher EHM level. Since such higher-level EHM facilities are represented by some process from the environment playing the role of a higher-level exception handler, this can be implemented as communication via channels. One can imagine that, following the premature termination of a process, a higher EHM component can throw exceptions in the contexts of the other affected processes. In the sense of CSP, this is equivalent to interrupting those processes by inducing an event i2 that will cause, say, process P2 to be aborted and wake up its exception handler (Q2). In this way, graceful termination (giving a chance for a process state clean-up) can be modelled by the CSP-standard interrupt operator △i.

Thus, from the point of view of the interrupt operator △i, aborting a process (P2) is nothing more than communicating the exception event (i2) to the exception handling process (Q2). And indeed, the termination mechanism can really be implemented in this way. Special exception channels can be dedicated to this purpose. The communication via an exception channel is actually an encapsulating mechanism used to throw an exception in the context of the affected processes (P2 and P3), forcing them to abort further execution and forcing the execution of the associated exception handlers (Q2 and/or Q3) instead. Although their implementation is more complicated, from a synchronization point of view these channels are real rendezvous channels. This is the case because the processes Q1, Q2 and Q3 are, during the ordinary operation mode, always ready to accept the events i1, i2 and i3 produced by the environment.
Writing to an exception channel would pass data about the cause of the exception to the process-level exception handler. In addition, the process must be unblocked if it is waiting on a channel or semaphore. Afterwards, when the scheduler grants CPU time to that process, instead of a regular context switch to the stack of the process, it would switch to the stack unwound to the proper point for the execution of the exception handler.

When all process-level handlers Q1, Q2 and Q3 terminate, the construct (Par) will terminate unsuccessfully by throwing an exception to its parent exception construct (ExC). As a consequence, the exception handler Q will be executed. But who will produce the events i1, i2 and i3? The exception handler Q cannot do that, because it can be executed only after the construct and all of its subprocesses have already terminated. It is possible to imagine an additional environment process (let us name it the environmental exception manager, EEM) that does this. This process would have to run in parallel with the guarded construct or with the whole application. Furthermore, because the exception handling response time is important, this newly introduced process should have a higher priority than the top application construct. For the running example: PriPar(EEM, Par △i Q).
Every process is by default equipped with an exception handler which, if not redefined by the user, merely forwards all exceptions to the environmental exception manager. While in the previous concept it was not necessary for all processes to have associated handlers in order for their exceptions to be handled at the construct level, in this proposal the rule is that all processes must have attached handlers (as in Figure 6). One side-effect of this decision is that it becomes possible to define both a process and its exception handler as two functions of one object. Normally, in the occam-like libraries, processes are implemented as objects, but this was merely a design choice, since from the CSP point of view there is no obstacle to realizing a process as merely a function. As long as a process and its exception handler are defined in separate objects, the process has to pack all the data needed for exception handling into an exception object in order to pass it to its exception handler. Having them inside the same object is more convenient in the light of real-time systems: besides reducing memory usage, dynamic memory allocation can be avoided, since an exception handler can directly inspect the data members defining the state of the process.

The concept of the exception manager opens up yet another possibility: thanks to the careful management of exception events, the resumption feature becomes viable. The manager can encode application-specific exception handling rules, and these rules need not necessarily terminate all subprocesses. The termination and the resumption model can thus be combined in one application.

3.3.1 Treating Complex Exceptional Situations

In order to make an appropriate handling decision about the occurrence of simultaneous exceptions in multiple processes (sometimes caused by the same physical fault), it is often necessary to check the state of certain resources internal to concurrently executing constructs. Obviously, handling such complex exception events requires some kind of exception hierarchy check, together with application-specific rules encoded for all possible combinations. If the number of these rules and combinations is very large, which is the case in complex systems, the environmental exception manager can be implemented as a complex process containing several environmental exception handlers covering different functional views of the system or different classes of exceptional scenarios. It is also possible to create one environmental exception handler for every construct in the system.
All environmental exception handlers are processes. As such, they can communicate with each other and use the knowledge obtained in this way to decide whether they should safely terminate certain subprocesses by producing exception events. Each (Parallel) construct can have an associated higher-level environmental exception handler, and one environmental exception handler can handle one or more constructs. If the termination model is used, this exception handler throws an exception to every subprocess of the construct in order to terminate them all in a safe way. The termination model should be the default behaviour; but if the resumption model is needed, the behaviour of such a higher-level environmental exception handler can be adapted, and a set of application-specific rules can be coded to specify how to deal with the exceptions present in that case. In this way, by interpreting all exceptions as events, the whole application, including its exceptional situations, can be formalized and formally checked via a CSP model.

The time from raising an exception inside a process-level exception handler until the reaction by the environmental exception manager is bounded, and there is no need to terminate processes not influenced by exceptions. Even if resumption is not used, exception handling takes place immediately, without delay, and termination can be accomplished much more efficiently. Another benefit is that simultaneous and related exceptions that originate in different constructs can be treated properly. Collecting exceptions in exception sets is not used in this concept, since the environmental exception manager maps a subtree of the exception hierarchy to an appropriate exception event used for activating the △i-mechanism at the construct level. The problem of unwanted, uncontrolled exception propagation beyond construct boundaries does not exist at all.

4 Conclusions

Exception handling facilities are a crucial tool for increasing the dependability of software in an elegant manner. However, "exception handling and the provision of error recovery are extremely difficult in concurrent and distributed systems" [18]. Nevertheless, we find process-oriented architectures based on the CSP model very promising for facilitating a concurrent exception handling mechanism. In subsections 3.2 and 3.3, two concurrent EHMs based on constructs (the exception construct from [2] and the interrupt operator △i from [25] and [26]) are proposed. The first one has been successfully implemented and demonstrated in robotic applications [28]; the other is currently under development. While the first one facilitates the termination model only, the second proposal holds the promise of providing the resumption possibility as well.

Especially attractive is the potential to have programs with exception handling facilities formally verified. Both the exception operator and the interrupt operator are useful starting points for designing a formally checkable EHM; for both proposals in this paper, formal description and verification have yet to be carried out. Both EHMs are based on the introduction of constructs that fit nicely into CSP hierarchical compositions, which simplifies reasoning and offers excellent possibilities for graphical specification in the graphical language and tool gCSP [8]. Extending the tool support towards automatic code generation and consistency administration is ongoing research [9].
References

[1] C. A. R. Hoare, Communicating Sequential Processes, Communications of the ACM, vol. 21, pp. 666-677, 1978.
[2] G. H. Hilderink, Managing Complexity of Control Software through Concurrency, PhD thesis, University of Twente, Netherlands, 2005, ISBN: 90-365-2204-8.
[3] G. H. Hilderink, JavaPP project at UT: http://www.ce.utwente.nl/JavaPP, 2002.
[4] B. Orlic and J. F. Broenink, Real-time and fault tolerance in distributed control software, in Communicating Process Architectures 2003, J. F. Broenink and G. H. Hilderink, Eds. Enschede, Netherlands: IOS Press, 2003, pp. 235-250, ISBN: 1 58603 381 6.
[5] P. H. Welch, Process Oriented Design for Java: Concurrency for All, in ICCS 2002, volume 2330 of Lecture Notes in Computer Science. Amsterdam: Springer-Verlag, 2002, pp. 687-687, ISBN: 3-540-43593-X.
[6] J. Moores, CCSP – A portable CSP-based run-time system supporting C and occam, in Architectures, Languages and Techniques – WoTUG-22, B. M. Cook, Ed. Keele, UK: IOS Press, 1999, pp. 147-168, ISBN: 90 5199 480 X.
[7] N. C. C. Brown and P. H. Welch, An Introduction to the Kent C++CSP Library, in Communicating Process Architectures 2003, J. F. Broenink and G. H. Hilderink, Eds. Enschede, Netherlands: IOS Press, 2003, pp. 139-156, ISBN: 1 58603 381 6.
[8] D. S. Jovanovic, B. Orlic, G. K. Liet, and J. F. Broenink, gCSP: A Graphical Tool for Designing CSP systems, in Communicating Process Architectures 2004, I. East, J. Martin, P. H. Welch, D. Duce, and M. Green, Eds. Oxford, UK: IOS Press, 2004, pp. 233-251, ISBN: 1586034588.
[9] D. S. Jovanovic, Designing dependable process-oriented software, a CSP approach, PhD thesis, University of Twente, Netherlands, to appear in 2005.
[10] A. Romanovsky, Ed., Looking ahead in atomic actions with exception handling.
[11] T. Anderson and P. A. Lee, Fault Tolerance, Principles and Practice. Englewood Cliffs, NJ: Prentice-Hall International, 1981.
[12] J. B. Goodenough, Exception Handling: Issues and A Proposed Notation, Communications of the ACM, vol. 18, pp. 683-696, 1975.
[13] R. H. Campbell and B. Randell, Error Recovery in Asynchronous Systems, IEEE Trans. Software Eng., vol. 12, pp. 811-826, 1986.
[14] F. Cristian, Exception Handling and Tolerance of Software Faults, in Software Fault Tolerance, vol. 3, Trends in Software, M. R. Lyu, Ed., 1st ed. Chichester: John Wiley & Sons Ltd., 1995, pp. 81-107.
[15] P. A. Buhr and W. Y. R. Mok, Advanced Exception Handling Mechanisms, IEEE Trans. Software Eng., vol. 26, pp. 820-836, 2000.
[16] A. Burns and A. Wellings, Real-Time Systems and Programming Languages, 3rd ed.: Pearson Education, 2001.
[17] J.-P. Banatre and V. Issarny, Exception handling in communicating sequential processes: design, verification and implementation, INRIA, Rennes, France, RR-1710, June 1992.
[18] J. Xu, A. Romanovsky, and B. Randell, Concurrent exception handling and resolution in distributed object systems, IEEE Trans. on Parallel and Distributed Systems, vol. 11, pp. 1019-1032, 2000.
[19] J. L. Knudsen, Exception Handling – A Static Approach, Software – Practice and Experience, vol. 14, pp. 429-449, 1984.
[20] A. Burns, The Ravenscar Profile, ACM Ada Letters, vol. XIX, pp. 49-52, 1999.
[21] P. H. Welch, Graceful Termination – Graceful Resetting, in occam User Group X. Enschede, Netherlands: IOS Press, 1989, pp. 310-317, ISBN: 90 5199 007 3.
[22] C. A. R. Hoare, Parallel programming: an axiomatic approach, Stanford University, Stanford, CA, USA, CS-TR-73-394, October 1973.
[23] T. I. Dix, Exceptions and Interrupts in CSP, Science of Computer Programming, vol. 3, pp. 189-204, 1983.
[24] P. Jalote and R. H. Campbell, Fault tolerance using Communicating Sequential Processes, in International Symposium on Fault-Tolerant Computing, FTCS-14, 1984, pp. 347-352.
[25] C. A. R. Hoare, Communicating Sequential Processes: Prentice Hall, 1985.
[26] A. W. Roscoe, The Theory and Practice of Concurrency: Prentice Hall, 1997.
[27] M. Butler, C. A. R. Hoare, and C. Ferreira, A Trace Semantics for Long-Running Transactions, in Communicating Sequential Processes – The First 25 Years, vol. 3525, LNCS, A. E. Abdallah, C. B. Jones, and J. W. Sanders, Eds. Berlin Heidelberg: Springer-Verlag, 2005, pp. 133-150.
[28] T. H. van Engelen, CTC++ enhancements towards fault tolerance and RTAI, MSc thesis 022CE2004, Control Laboratory, University of Twente, Enschede, 2004.
Automatic Handel-C Generation from MATLAB® and Simulink® for Motion Control with an FPGA

Bart REM a,1, Ajeesh GOPALAKRISHNAN a, Tom J.H. GEELEN a and Herman ROEBBERS b,2

a Eindhoven University of Technology (TU/e), Eindhoven, The Netherlands
b Philips TASS, Eindhoven, The Netherlands

Abstract. In this paper, we demonstrate a structured approach for proceeding from development in a high-level modeling environment to testing on the real hardware. The concept is introduced by means of an example scenario involving automatic generation of Handel-C code for FPGAs. The entire process is substantiated with a prototype that generates Handel-C code from MATLAB®/Simulink® for the most common Simulink® blocks. Furthermore, we demonstrate the potential of the approach by generating Handel-C for an FPGA that controls the flow of paper through the scanning section of a printer/copier. Additionally, we present another method to generate Handel-C from a state-based specification. Finally, to verify and validate the behavior of the generated code, we execute several levels of simulation, including software-in-the-loop and hardware-in-the-loop simulations.

Keywords. FPGA, embedded systems, motion control, Hardware Software Co-Design, MATLAB®, Hardware-in-the-Loop, Software-in-the-Loop, Co-Simulation, Electronic Design Automation (EDA), Automatic Document Feeder (ADF)
Introduction

This paper is a result of the 2004 edition of the Workshop System and Software Engineering (WSSE04), organized by OOTI (Dutch: "Ontwerpers Opleiding Technische Informatica") [1]. OOTI is the Software Technology training program at Eindhoven University of Technology. The starting point of the workshop was a project case formulated by OOTI together with Philips TASS B.V. and Océ-Technologies B.V., which involved a typical "Proof of Concept by means of a Prototype". In this case, the customer was Philips TASS B.V., with Océ as one of the many stakeholders; Océ facilitated the project with an initial development test case. The core idea behind the project case was to demonstrate the feasibility of a design method in which designs created in high-level modeling environments such as MATLAB®/Simulink® and Rational Rose can run directly on low-level targets, in our case a Field Programmable Gate Array (FPGA). The actual FPGA used was a Xilinx Virtex-II XC2V3000-4.
1 Software Technology program (OOTI), Department of Mathematics and Computer Science, Eindhoven University of Technology (TU/e), P.O. Box 513, 5600 MB, Eindhoven, The Netherlands. E-mail: {B.Rem, A.Gopalakrishnan, T.J.H.Geelen, ooti}@tue.nl.
2 E-mail: [email protected]
FPGAs are programmable hardware devices that can execute tasks that a dedicated piece of hardware or a microcontroller would normally execute. One of the main motivations behind the choice of an FPGA is its flexibility, resulting from its programmability. Another motivation is the possibility of large-scale parallelism, which results in high performance. High performance is important in motion control systems, where real-time tasks have to run with very short latencies. To exploit this parallelism, any language used to program the FPGA needs appropriate constructs for specifying parallelism. The language chosen in our case was Handel-C, which provides the possibility to execute multiple instructions in parallel (i.e. within the same clock cycle). Additionally, Handel-C has a short learning curve because of its similarity to the C language. Furthermore, Electronic Design Automation tools, namely Celoxica's DK3 [2] and the RC203 development board [3], were selected to speed up the development process.

In our case, the chosen design and development strategy introduced several levels of simulation, including software-in-the-loop (SIL) and hardware-in-the-loop (HIL) simulations, which were used to verify the design and implementation. One additional requirement in the project was that the code for the FPGA needed to be automatically generated from models in MATLAB®/Simulink®. This was a challenge, given that the available automatic code generation tools could not be used, since they do not support the generation of Handel-C.

To validate the design method, Océ provided a non-trivial test case, including the test hardware. The case involved a scanning section, namely the Automatic Document Feeder (ADF), which is essentially a part of a multifunctional printer/scanner combination from Océ. The main function of the ADF is to transport paper through the machine by reading out relevant sensors and activating the appropriate motors and clutches. Currently, the flow of paper through the ADF is controlled by a microcontroller. Eventually, the idea is to have all tasks currently handled by the microcontroller handled by an FPGA instead. Roughly, these tasks can be split into supervisory control tasks and motor control tasks. Since both kinds of control task must run on the FPGA, the code for both should be automatically generated. Because the supervisory controller was specified and captured in the form of an execution sheet in Excel, an additional utility was needed to generate Handel-C code for the supervisory controller from this specification.

To summarize, the three main goals of the workshop were:
1. To use an FPGA to control the flow of paper through a prototype of the ADF.
2. To follow a design method where models created in MATLAB®/Simulink® [4] can be automatically compiled to an FPGA image.
3. To generate Handel-C automatically for the motor controllers from MATLAB®/Simulink®, and for the supervisory controller from a state-based specification in Excel.

The outline of the article is as follows. The design method per se is introduced in Section 1. Section 2 describes the approach for automatically generating Handel-C from MATLAB®/Simulink®, resulting in a tool component called MDL2HC. A simple example demonstrates the key issues, design decisions, and the solutions. Section 3 explains the ADF test case from Océ. In this test case, the usage of MDL2HC is highlighted and another code generation tool component, called UtilX, is introduced.
In addition, several levels of design verification are defined, namely software-in-the-loop and hardware-in-the-loop simulations. The details of the simulation-based design method are presented next. Finally, we end with our conclusions and recommendations for future work.
Figure 1: Simulation steps in the design method
Figure 2: Starting point of simulation method
1. Approach to the Workshop: Simulation-Based Design

In the chosen design method, which includes several simulation levels (Figure 1), each level eliminates uncertainties about the developed software. Several steps in this method are automated, which conceptually leads to executable models. Each level, or step, is detailed in the subsequent subsections. The steps are best described by means of the Océ ADF test case, which is introduced in the later part of the paper. Nevertheless, the framework is applicable to any prototype where motion control is needed.

1.1 MATLAB®/Simulink® Simulation: the Starting Point

As a starting point, the prototype of the ADF and the controllers for the prototype are all simulated in MATLAB®/Simulink®. We obtained the models of the ADF, the motor controllers, and the supervisory controller from the Boderc project, a project at the Embedded Systems Institute, Eindhoven, the Netherlands, which focuses on multidisciplinary modeling of distributed embedded real-time controllers of complex systems [5]. The supervisory controller, on the other hand, is based on a specification in the form of a state-based execution sheet in Excel. Figure 2 visualizes the starting scenario.

1.2 Software in the Loop Co-Simulation: One Step Closer

In the second step, as seen in Figure 3, the controllers are taken from MATLAB® and moved to Handel-C. Now, both controllers should be automatically generated from the environment in which they are specified. In the case of the motor controllers, the environment is MATLAB®, whereas in the case of the supervisory controller, the environment is an execution sheet in Excel. Consequently, we didn't generate code for the supervisory controller from MATLAB®, but from its specification in Excel. This specification also represents our reference point for the description of correct behavior.
Figure 3: Software in the loop co-simulation

Once both controllers are automatically generated (Sections 2 and 3 describe the process), they are co-simulated with the ADF model using the Celoxica co-simulation manager [6,7]. Naturally, the behavior of the system should not change and should remain in line with the expected behavior. The verification of the behavior is described in Section 3.3.

1.3 Hardware in the Loop: Almost There

Until this step, all simulations were carried out in simulated time, without an FPGA. When a hardware-in-the-loop (HIL) simulation is performed, it is still a simulation in the sense that the real prototype is not yet used. However, this simulation comes as close to the real prototype as possible: first of all, the simulation runs in real time; second, during hardware in the loop the real FPGA is used. The Celoxica toolset provides an out-of-the-box solution to create and upload the image to the FPGA.

The crux of this step is to set up the HIL simulation in such a way that the interface between the FPGA and the ADF model is the same as the interface between the FPGA and the real prototype (the interface is detailed in Section 3.5). This means, for instance, that pin connections have to match and the correct voltages have to be supplied. Ultimately, the HIL setup results in a plug-and-play solution, where in the last stage the FPGA can be connected to the real prototype.

Besides the controllers, the ADF model has to be prepared to run in real time. For instance, the Simulink® solvers have to run in fixed-step mode instead of variable-step mode. The hardware target on which the model is intended to run is xPC Target: a PC-based box with (in our case) an Intel Pentium 4, running a real-time operating system from The MathWorks [8].
Figure 4: Hardware in the loop real-time simulation using the FPGA and xPC Target from The MathWorks
Figure 5: The final step: connection to the real prototype (no more models are used at this point)

One extra feature needed is the possibility of logging dynamic behavior information. We implemented a UDP stack on the FPGA that sends data to the network. This stack was built on top of the Ethernet functionality provided by Celoxica's Platform Abstraction Library (PAL) [9]. Since the FPGA is fast enough, logging doesn't cost much in terms of performance, only a little space on the FPGA.

1.4 Connection to Prototype: The Real Thing

Once the hardware-in-the-loop step is completed, the FPGA can be disconnected from the xPC Target and connected to the prototype (Figure 5). The same logging functions can be used as before, which makes the comparison between hardware in the loop and the real prototype convenient. Now that the design method has been explained, the following section focuses on how Handel-C code can be generated automatically from MATLAB®/Simulink®.

2. Automatic Generation of Handel-C Code from MATLAB® and Simulink® Models

2.1 Introduction

As pointed out in Section 1, it is desirable to have a clear mapping from high-level models to their implementation. When this mapping is known, one can automate the step of translating the model into an implementation. This section describes how the automatic generation of Handel-C code from MATLAB®/Simulink® models was realized in the WSSE04. We will do this by taking an example model and iterating through all the steps of code generation.

2.2 Short Introduction to Handel-C

Handel-C [10, 11], as introduced earlier, is a language for creating hardware configurations for FPGAs in a way that is similar to software development. The Handel-C language is based on ANSI-C, but with some elements removed (e.g. floating-point arithmetic) and some FPGA-specific elements added (e.g. a construct for parallelism). The data types in Handel-C are very flexible, given that you can define them with any number of bits. For example, you can declare a channel carrying unsigned 2-bit numbers and a variable as a signed 349-bit number. The language provides semaphores and synchronous communication through channels.
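As a small, hand-written illustration of these features (our own sketch, not generated by any tool; the names are arbitrary), consider:

    set clock = external with {rate = 20.0};

    chan unsigned 2 request;   // a channel carrying 2-bit values
    signed 349 huge;           // arbitrary widths are legal

    void main(void)
    {
        unsigned 2 r;
        par                    // both branches start in the same clock cycle
        {
            request ! 3;       // the writer blocks until the reader is ready
            request ? r;       // synchronous (rendezvous) communication
        }
    }

Each assignment and each completed channel communication takes a single clock cycle, which is what makes the timing of the par construct predictable.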
2.3 Approach to Code Generation

For MATLAB®/Simulink®, a code generation package already exists: the Real-Time Workshop (RTW) [12, 13]. The part of the RTW that does the actual code generation is the Target Language Compiler (TLC), which comes with a large number of files that tell the TLC how to generate ANSI/ISO C code for all Simulink® blocks. In theory, it is possible to modify these files to generate Handel-C code instead. However, there are several drawbacks:

• The TLC is large and complex, not least because there are many blocks in Simulink®, each with numerous options. Getting acquainted with the TLC system already consumes some time, and making modifications to it consumes even more.

• The supplied TLC files are focused on generating sequential code. As one of the most essential advantages of an FPGA over a general CPU is its true parallelism, the generated code needed to exploit this parallelism in a reasonable way. This means that not only the implementations of the individual blocks would need to be modified, but also the complete encapsulating framework.

Since the project was bound by strict deadlines, the option of using the RTW with the TLC was rejected. Instead, the decision was taken to build a separate stand-alone application to do the code generation. The mdl-files in which Simulink® models are saved are easy-to-parse, structured plain-text files. Hence, they were chosen as the input to the application. In a Simulink® model, all blocks are supposed to autonomously calculate their outputs based on their inputs, without interference from other calculations. This is perfectly suited for parallel execution on the FPGA. Consequently, the CSP (Communicating Sequential Processes) principle was used to implement the blocks.

2.4 Generator Design

2.4.1 Introduction

The code-generation process, involving the code generator component called MDL2HC, consists of three stages, as shown by the numbered steps in Figure 6. The first stage involves parsing the model file (and the other files it depends on) and storing it in a data structure in memory. As the types of 'wires' in the Simulink® model are not explicitly specified in the mdl-file, another stage is needed to derive these types; this type information is added to the data structure in the second stage. The third and final stage is the actual code generation from the data collected in the data structure.

It was decided to treat all blocks as uniformly as possible in the MDL2HC tool. The differences are introduced mainly in the Handel-C implementations of the blocks, which are placed together as macros in a separate file. The use of macros makes sure that no FPGA hardware is used for blocks that are not used, and that every instance of a block gets its own copy of the hardware, so that all instances can run in parallel.

2.4.2 Example

The Simulink® model shown in Figure 7 and Figure 8 will be used as an example to explain some of the details of MDL2HC. Figure 7 shows the top-level view and Figure 8 shows the content of the subsystem that is used.
[Figure 6: the three MDL2HC stages, (1) Parsing, (2) Type Checking and (3) Code Generation, take the model (.mdl) and constants (.m) as input, build a model representation, and produce generated code (.hcc), which is compiled together with the block library (.hch) by the Handel-C compiler.]

Figure 6: Code generation process using the MDL2HC tool
Figure 7: Example Simulink® model, top view
Figure 8: Example Simulink® model, Subsystem content
2.4.3 The Parse Stage

Since variables from the MATLAB® workspace can be used as parameter values, optionally one or more m-files are read first. These files contain MATLAB® scripts; the MDL2HC program understands only a subset of the script language: comments and assignments to symbols (e.g. p0 = 3.5). Next, the mdl-file is parsed. First, the default parameters for blocks are read, then the blocks themselves are read, including their parameters, and finally the connections between blocks are read from the mdl-file. The parts of the file that MDL2HC doesn't use are only checked for correct syntax and thereafter ignored. Subsystem blocks in the file have a structure very similar to a complete mdl-file; recursion is used for parsing the subsystems, which can be nested arbitrarily.
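For a flavour of the input format, an mdl-file is a tree of brace-delimited sections; the following hand-abbreviated fragment is only in the spirit of the mdl syntax (many mandatory fields are omitted, and the names are illustrative):

    Block {
      BlockType    Gain
      Name         "Gain1"
      Gain         "factor"
    }
    Line {
      SrcBlock     "Constant1"
      SrcPort      1
      DstBlock     "Gain1"
      DstPort      1
    }

Blocks carry their parameters as key-value pairs, and Line sections describe the connections between block ports.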
In the example, the subsystem is masked, and it has one parameter, named 'factor'. This is used as the 'Gain' parameter of the gain block in the subsystem. The value written in the mask parameter dialog is 'p0', which in turn is defined in an m-file as being '3.5'. When the m-file is parsed, MDL2HC learns the mapping 'p0 → 3.5'. While MDL2HC is parsing the subsystem, the mapping 'factor → p0' is added to the list. Now, when the parser sees the gain block with parameter value 'factor', it tries to apply as many mappings to it as possible, resulting in the value '3.5' being stored in the data structure.

There is one more thing to parsing: references. Models can include subsystems that are defined in other mdl-files (a library); Simulink® already provides many of its blocks as part of a library. If such a reference is encountered in an mdl-file, MATLAB®'s search path is searched for the library file in question. Next, the library file is searched for the desired subsystem.

2.4.4 The Type-Check Stage

After the parsing stage is complete, the data structure in memory contains the following modeling elements:
1. the building blocks of the model;
2. the parameters of the blocks;
3. the connections between the blocks.

However, what we still need in order to generate Handel-C code are the types of the connections between the blocks. These types are determined by the block that generates the signal for the wire in question; in Simulink®, a wire can be connected to at most one output port. The output type of each block is specified by just another parameter. However, this parameter can also take values like 'Inherit via back propagation' and 'Inherit via internal rule'. The first idea was to simply rely on the automatic width inference feature of the Celoxica Handel-C compiler. However, the compiler only determines the width of a channel, not whether it carries signed or unsigned data. In addition, when using fixed-point types, additional information about the scaling is needed; this information is also needed for parameters. As a result, MDL2HC has to determine all types by itself. The types are determined in an iterative way at the level of ports (the connection points of a block). When there is a type conflict, or when not all types can be determined, an error message is returned that specifies the blocks with problems.

2.4.4.1 Type-Check Example

A set of rules is used to find the types of a port. The simplest one is an explicitly specified output port type, which is just taken as it is. For example, the constant in the example model (see Figure 7) has type 'sfix(16)', with a scaling of 2^-8. This means that it is a signed fixed-point number: a 16-bit value that needs to be multiplied by 2^-8 to obtain the value it represents, so 8 bits are used for the integer part and another 8 bits for the fractional part. The constant is connected to the Sum, which has the 'Require all inputs to have the same data type' option checked. From this information, first the Outport and then the Gain in the subsystem, which have their output type set to 'Inherit via back propagation', can derive that their type must also be sfix(16) with scaling 2^-8. They do this by looking at possible known types at the other end of the connection. The Outport could also try to look at the input type that the Scope block expects, but the Scope has no preference for a specific type.
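To make the scaling concrete: with a scaling of 2^-8, the gain factor 3.5 from the parse-stage example is stored as the raw integer 3.5 × 2^8 = 896, which is exactly the literal that appears in the generated Gain call in Listing 1 below (together with the type parameters SIGNED, 16, 8: a signed 16-bit value with 8 fractional bits).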
A tricky one is the 'Inherit via internal rule' option that a few blocks have. This generally means the output will be of the smallest possible type that can still represent the result of the operation without losing information. The Sum block in the example is set to this option. Since it has to add two sfix(16) / 2^-8 numbers, and the sum of two 16-bit values may need one extra bit, its output should be of type sfix(17) / 2^-8. Here there is a subtle difference with Simulink®: as Simulink® is more oriented towards general processor systems, it immediately jumps to an sfix(32) type, not knowing that 17-bit numbers are perfectly usable in our case.

2.4.5 The Code-Generation Stage
#include "blocklib.hch" #define CLOCKRATE 20000 /*kHz*/
set clock = external with {rate = 20.0}; chan chan chan chan chan chan
unsigned 16 channel_2; signed 16 channel_3; signed 16 channel_3Fork[2]; signed 16 channel_4; signed 17 channel_5; unsigned 1 channel_6;
macro proc SubSystem_1(inputs, outputs, trigger_channel) { chan signed 16 channel_0; chan unsigned 16 channel_1; chan unsigned 1 trigger; chan unsigned 1 triggerFork[2];
par {
Fork(trigger, uvalue, triggerFork, 2); TriggerPort(trigger_channel, trigger, RISING); SubSystemInPort(*inputs[0], uvalue, channel_1, {&triggerFork[0]}, 1); Gain(channel_1, uvalue, UNSIGNED, 0, channel_0, SIGNED, 8, 896, SIGNED, 16, 8); SubSystemOutPort(channel_0, svalue, *outputs[0], {&triggerFork[1]}, 1);
} }
void main(void) { par { Fork(channel_3, svalue, channel_3Fork, 2); Inport(channel_2, UNSIGNED, 0,"InPort"); Constant(channel_4, SIGNED, 8, 538); DiscretePulseGenerator(channel_6, UNSIGNED, 0, 1, 2000, 1, 0, CLOCKRATE); Scope(channel_3Fork[1], svalue, SIGNED, 8); //Subsystem SubSystem_1 call SubSystem_1({&channel_2}, {&channel_3}, channel_6 /* trigger channel */ ); Sum({&channel_3Fork[0], &channel_4}, {SIGNED, SIGNED}, {8, 8}, channel_5, SIGNED, 8,2,{1,1}); Outport(channel_5, svalue, SIGNED, 8,"OutPort"); } }
Listing 1: Generated Handel-C code for the example
When the type-check stage is successfully completed, there is enough information in the internal data structure to generate the actual Handel-C code. The generated code is shown in Listing 1. The first line includes the library with the block implementations. The next lines (2-4) contain information about the clock speed of the FPGA. This is used in blocks dealing with time, such as the Pulse Generator. The speed to use (20 MHz in the example) was passed as an argument to MDL2HC. Lines 6-11 contain the declarations for the channels used in the top-level system. Next, all subsystems are listed (lines 13-31), and the main function (lines 33-52) concludes the code. This main function consists of just a par (parallel) block that calls the macros for all blocks in the system. The correct channels are passed as parameters to the macros. The subsystems are also macro definitions and are called like any other block. The subsystem definitions themselves start again with a list of channel declarations (lines 15-18), followed by a par block (lines 20-30) with all the blocks they contain. By keeping track of the nesting level of subsystems, nicely indented code is generated.

The generator also has a useful debugging feature: it can automatically generate code as if a Simulink® scope block were attached to every connection. The Handel-C implementation of the Scope block logs all data into a file during (co-)simulation and acts just as a sink when placed on an FPGA.

An important problem the code generator has to solve is that a Handel-C channel can have at most one reader, while a wire in Simulink® can be connected to multiple blocks. To solve this problem, Fork blocks are generated in the Handel-C code. These blocks read a value on their input and then write it to multiple outputs. For example, the Fork on line 37 of the generated code takes care of supplying the output of the subsystem (channel_3 in the code) to both the Scope and Sum blocks.

2.4.6 The Handel-C Block Library

The library blocks are modeled as Communicating Sequential Processes. They first read all their inputs, then do their calculations, and finally write the result to the output channel. The only exception is the UnitDelay block: since its output does not depend on the current input, it writes its output in parallel with reading its input. This is necessary to prevent deadlocks due to feedback loops.

As the generated code shows (for example, the 'svalue, SIGNED, 8' after 'channel_5' on line 50), after most channel parameters, additional parameters are passed to blocks to indicate the type of the channels. Only signed/unsigned information and the fixed-point scaling (the number of bits after the binary point) are passed, as the total number of bits can already be retrieved using the built-in sizeof() macro. The library considers all numbers to be fixed-point numbers; in this sense, integer numbers simply have no fractional part.

One of the most important elements of the library is the type conversion macro. This complex macro expression can convert a given data type to any other, which is done by padding, sign-extending, or chopping off bits where necessary. The complexity of this macro has no effect on the FPGA performance: the compiler expands the macro until simple padding and chopping actions remain. Almost every block uses this macro for producing its output in the correct type.

Some blocks, such as Sum, have a variable number of inputs. These inputs can also be of mixed types.
Having just one generic macro that you can call with an arbitrary set of inputs is not completely trivial. After some experimenting, it turned out to be possible to mix types by passing the channels as an array of pointers to the channels.
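To make the read-calculate-write pattern described at the start of this subsection concrete, the following hand-written sketches are roughly what such library macros look like (the real macros carry the extra type parameters discussed above; the names and widths here are illustrative):

    // A Gain-like block: read, calculate, write - repeated forever.
    macro proc SketchGain(in, out)
    {
        signed 16 x;
        while (1)
        {
            in ? x;             // read the input channel
            out ! (x * 4);      // calculate and write the result
        }
    }

    // The UnitDelay exception: reading and writing happen in parallel,
    // which breaks the wait-for cycle in feedback loops.
    macro proc SketchUnitDelay(in, out)
    {
        signed 16 state, next;
        state = 0;
        while (1)
        {
            par
            {
                in ? next;      // accept the new input...
                out ! state;    // ...while emitting the previous one
            }
            state = next;
        }
    }

    // A two-way Fork: a channel has at most one reader, so copies are
    // written to several output channels in parallel.
    macro proc SketchFork2(in, out0, out1)
    {
        signed 16 v;
        while (1)
        {
            in ? v;
            par
            {
                out0 ! v;
                out1 ! v;
            }
        }
    }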
Another problem is to find the right type for the intermediate result, before it is converted to the required output type. For the Gain block this is quite simple, as it multiplies just two numbers. For the Sum block it is more complicated, as arbitrarily many inputs and several overflow conditions need to be taken into account (adding k n-bit numbers can require up to n + ⌈log2 k⌉ bits). This was realized using a set of recursive macros that calculate the needed type.

In implementing the subsystems, we had to make another choice: how are the blocks in the subsystems connected to the outer model? The most efficient way is to connect them directly using a channel. However, such a channel would have to be global, which is undesirable, and MDL2HC would have to figure out how everything is connected (both the connections from/to the subsystem and the connections from the ports inside the subsystem might fork). It was decided simply to pass the channels in question as arguments to the subsystem macro. In the macro, these channels are passed to special Inport and Outport implementations.

This choice is also convenient for solving another challenge: triggered and/or enabled subsystems. Triggered or enabled subsystems perform calculations only conditionally: only when a signal changes appropriately (trigger) or is larger than zero (enable) does the subsystem perform a calculation step. The enable and/or trigger signals are simply fed to EnablePort and TriggerPort blocks, which just produce a boolean output indicating whether a step may be done or not. These signals in turn are fed to all Inports and Outports in the subsystem. When a subsystem is not active, it discards the input it reads and outputs the values from the previous step again.

2.5 Realized Functionality

In the project, we had one month for the design and implementation of MDL2HC. Therefore, as a proof of concept, we implemented only a subset of the Simulink® blocks and parameter settings [14], which were enough for the Simulink® models we were working with. Vectors and bus objects were not implemented. Also the sample time, rounding, and saturation options that are present in some blocks are not supported (rounding is always down and overflow behavior is always wrap-around). Table 1 shows a list of the implemented basic blocks. Besides these basic blocks, there are several additional blocks in various libraries that are built using basic blocks. These library blocks are also supported, as long as they are built on blocks from the list above. It is also possible to directly use Handel-C implementations for these library blocks. This was done for the CompareToConstant block (which internally uses an S-Function).

2.6 Conclusion

In less than a month, we developed the MDL2HC tool, which can generate Handel-C code automatically from Simulink® models. Not all blocks are supported at the moment, but it is not difficult to add more blocks to MDL2HC. We successfully tested the MDL2HC application with a non-trivial real model, as described in the next section.

3. Case: Paper-Path of an Océ Automatic Document Feeder

3.1 Introduction to the Océ Case

Now that the approaches for simulation and for automatic Handel-C code generation from MATLAB®/Simulink® have been introduced in the previous sections, we take a
look at the project case: the test platform we used to establish the practicality of these approaches. We introduce a project case from Océ that was used to demonstrate feasibility. First, we introduce the test hardware used and a high-level view of the domain. We then present the envisioned scenario, the problem, the approach, and the envisaged solution.

Table 1. The Simulink® blocks supported by MDL2HC
Simulink® Block          Comments
Constant                 Inherit from 'Constant Value' not supported as output type
DataTypeConversion       -
DiscretePulseGenerator   Only time-based (internal), and fixed amplitude of 1
EnablePort               States when enabling not supported, it must be held
Gain                     For the parameter scaling mode, only specified is supported
Ground                   -
Inport                   -
Logic                    -
Outport                  -
Reference                -
RelationalOperator       -
Saturate                 -
Scope                    Only usable in Handel-C simulation, logs received data into a file
Signum                   -
Step                     -
Subsystem                -
Sum                      -
Switch                   -
Terminator               -
ToFile                   Only usable in Handel-C simulation, logs received data into a file
ToWorkspace              Only usable in Handel-C simulation, logs received data into a file
TriggerPort              Function-call trigger type not supported
UnitDelay                -
3.1.1 Test Hardware

The test hardware we used is a prototype of the Automatic Document Feeder (ADF). Figure 9 depicts the location of the ADF in one of Océ's multifunctional printer/scanner combinations. The ADF of the Océ multifunctional represents a medium-sized embedded device, which forms ideal test hardware owing to the various real-time constraints to be met in the overall functioning of the device. The area in Figure 9 denoted by the larger oval represents the scanning section, which is responsible for scanning documents; the area marked by the smaller oval is the ADF, which is responsible for transporting the originals. The motors are controlled by motor controllers using feed-forward and feedback control. These motor controllers let the motors follow a desired speed profile. Since the
interface between the motor controllers (on the FPGA) and the motors (on the ADF) is digital, the motor controllers must provide a Pulse Width Modulated (PWM) signal corresponding to the desired analog voltage. The motor controllers receive position updates from the motors.
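As an indication of what such a digital interface involves, a minimal counter-compare PWM sketch in Handel-C could look as follows (our own illustration, not the actual Océ controller; duty stands for an 8-bit duty-cycle value and pin for a 1-bit output):

    // Output is high while the free-running counter is below the
    // duty-cycle threshold; the counter wraps every 256 clock cycles,
    // so one PWM period is 256 clocks.
    macro proc SketchPWM(duty, pin)
    {
        unsigned 8 count;
        count = 0;
        while (1)
        {
            par
            {
                pin = (count < duty) ? 1 : 0;
                count++;   // wraps around at 256
            }
        }
    }

The average voltage seen by the motor is then duty/256 times the supply voltage.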
Figure 9: The location of the ADF depicted in one of Océ's multifunctionals

3.1.2 Paper Scanning Process

In order to have the original document scanned and copied, the originals have to go through a paper scanning process, which is essentially a part of the document handling process: the procedure for transporting and handling paper documents for processing and scanning. In this section, we take a look at a typical paper path in a paper scan process.

The paper scan process is characterized by the concept of a paper path: the path followed by the paper during transport in any scanning process. Figure 10 describes the typical simplex paper path used in our case. In Figure 10(a), a single sheet is separated from a stack of paper and the transport of the paper begins. The direction of motion of the paper is captured in Figure 10(b). In Figure 10(c), the paper travels halfway through the loop and stops; then the direction is reversed. This reverse motion is stopped near the glass plate, as shown in Figure 10(d). As illustrated in Figure 10(e), the paper is sucked up by the perforated suction belt and scanned, and it is pushed out as shown in Figure 10(f).

3.1.3 System Overview

Figure 11 captures the high-level functionality and the modules needed to realize the FPGA-based control of the ADF. At the lowest level, a model is needed for the paper path. This model is captured in MATLAB® as an S-Function [15]. Along this paper path are the sensors and clutches, which are also included in the model. Further, the ADF model contains a collection of motors, which realize the transport of the paper. These motors are modeled in Simulink®.
The supervisory controller, which also resides on the FPGA, determines the set points for the motors. These set points are communicated in the form of a desired speed and the maximum acceleration that may be used to reach that speed. Additionally, the supervisory controller is responsible for setting the clutches and getting the sensor status from the paper path.
Figure 10: Schematic cross section of the ADF showing how a piece of paper is transported through the ADF via a simplex paper path
[Figure 11: the supervisory controller and the motor controllers (to be implemented in Handel-C) are connected to the ADF model of motors and paper path (to be modeled in MATLAB®); the supervisory controller sends setpoints to the motor controllers and exchanges clutch and sensor signals with the paper path, while the motor controllers drive the motors with PWM signals and receive position feedback.]
Figure 11: Overall flow and identification of components for implementation and modeling
3.1.4 The FPGA Development Board

The case is about using an FPGA to control the ADF. The chosen FPGA was a Xilinx Virtex-II (XC2V3000-4), and the development board used was the RC203 Development Board from Celoxica [3]. The FPGA sends control commands and signals to the ADF and receives feedback from the ADF, which is used to issue subsequent control commands.
Figure 12: The RC203 development board with the Xilinx Virtex-II XC2V3000-4 FPGA
Figure 12 shows the RC203 board used for development in our case. The circle in the center demarcates the Virtex-II FPGA that was used to control the Océ ADF.

3.2 Alternative Approach for Handel-C Generation – Supervisory Controller

In this section, we present a global overview of the supervisory controller, followed by the top-level design and the Handel-C generation process. We explain the Handel-C generation process with a simple example.

3.2.1 Global Concept

Figure 13 illustrates the global concept and a breakdown of the various components within the code generator, which is named UtilX. The supervisory controller is specified in an execution sheet, which is a file with comma-separated values. The main challenge is to parse this execution sheet and create Handel-C code that is in one-to-one conformance with the logic captured in the execution sheet.

The input to UtilX is the execution sheet in Comma Separated Values (CSV) format. The Excelerator (parser) component of UtilX parses the CSV file and converts it into an intermediate data structure, the Supervisory Controller Data Structure (SCDS). In particular, the CSV Parser component reads the CSV file, checks the validity of the keywords in the file, and parses the file into a set of CSV constructs used for building the data structure. The SCDS Data Structure Builder component takes the parsed CSV constructs and builds up the SCDS.
The SCDS is visualized as a tree-like structure, with the SCDS Reference Object pointing to the root of this tree. This format-independent intermediate storage has its advantages: it makes the Handel-C generation mechanism pluggable, in the sense that a new mechanism introduced in the future can use the same interface to obtain the logic from the execution sheet.
Figure 13: Supervisory Controller code generator – top-level design of UtilX

The input to the Handel-C generator is an in-memory Supervisory Controller Data Structure (SCDS) Reference Object. The main components in the Handel-C generator are the SCDS Reader and the Code Generator. The SCDS Reader traverses the SCDS Reference Object; the result is a set of data structure constructs (states etc.) in a format usable for the generation of Handel-C. The Code Generator takes these data structure constructs and Handel-C templates as input. The Handel-C templates capture the static Handel-C code snippets. The Code Generator edits these templates, and the process results in Handel-C code.

3.2.2 Handel-C Generation Process

In this section, we elaborate on the process of Handel-C code generation by zooming in on the Handel-C generator block (see Figure 14). We first introduce the concept, followed by a simple example.

The Handel-C generator block starts from a set of Handel-C templates. Within the Handel-C generator block, the Code Generator component triggers the right actions based on the inputs from the SCDS Reader. The entire scenario works as follows. The Handel-C generator starts with the Handel-C template and uses the Reader component to read the template from an external file. Once read, the templates are stored in an internal memory representation. This memory representation makes it possible to change and customize the template so as to create the Handel-C code based on our requirements. On a trigger from the Code Generator, the Editor component carries out reads and writes on the internal representation in a number of stages; at the end of each stage, the internal representation is a step closer to the final code. After a number of read and write actions have been performed on the internal representation, the Writer component dumps it to an external file, which represents the finished Handel-C code. Figure 14 also shows a Logger component, which logs all actions carried out in the Handel-C generation process.
Figure 14: Handel-C generation process

3.2.3 Handel-C Generation Example

3.2.3.1 Sample State Chart

The supervisory controller is modeled using state charts. A sample state chart, used here for the purpose of code generation, is depicted in Figure 15. We intend neither to cover all possible cases nor to demonstrate the full set of domain-specific features; for the sake of comprehensibility, we keep the example simpler than a real supervisory controller. Figure 15 shows a state chart with two states, State_0 and State_1. The transition from State_0 to State_1 occurs only if the condition is met; in this case, the value of conditionVal must be TRUE.

3.2.3.2 Sample Templates

As mentioned in Section 3.2.2, the entire code generation proceeds by modification of appropriate templates. In our case, the entire supervisory controller has to be represented as two Handel-C files: the supervisory controller source file, called "supercontroller.hcc", and the supervisory controller header file, called "supercontroller.hch". The template for the supervisory controller source file is listed in Listing 2.
Figure 15: Sample supervisory controller state chart
#include "supercontroller.hch" //***other Header Files***
//***Macros*** //***Variables*** //***Semaphores***
macro proc supercontroller(state, instanceID) { boolean doTransition;
switch(state) { //***states*** default: delay; }
}
Listing 2: Supervisory controller source file template

The entire supervisory controller is represented in Handel-C as a macro with two parameters: the current state and the instance ID of the running instance of the supervisory controller. The number of instances is set in the supervisory controller header file (see Listing 3). The switch on the current state selects the logic to be executed; in the default case, a single clock cycle delay is introduced. The semaphores are left out of this example. The template for populating the states, as indicated by "//***states***" in Listing 2, is listed in Listing 4. The "doTransition" variable gets its value from an external condition; based on the value of this variable, the decision to switch states is made. If no transition is intended, no state switch is carried out and the logic is a single clock cycle delay. The template of the end (final) state is listed in Listing 5.
#ifndef SUPERCONTROLLER_HCH
#define SUPERCONTROLLER_HCH

#define instances <%instance_number%>    // number of SC instances

extern macro proc supercontroller(state, instanceID);

typedef enum
{
    <%first_state%> = (unsigned <%state_size%>)0
    <%next_state%>
} States;

#endif // SUPERCONTROLLER_HCH
Listing 3: Supervisory controller header file template
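For illustration, an application could drive one instance of the generated macro as follows (a hand-written sketch, not UtilX output; conditionVal is assumed to be set elsewhere):

    void main(void)
    {
        States state;                  // state of SC instance 0
        state = State_0;
        while (1)
        {
            supercontroller(state, 0); // one transition attempt per call
        }
    }

With instances greater than one, several such loops would run under a par, each with its own state variable and instance ID.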
case <%state_name%>:
{
    <%additional_statements%>

    doTransition = (<%condition%>);
    if (doTransition == TRUE)
    {
        <%statements%>
        state = <%next_state%>;
    }
    else
    {
        delay;
    }
}
break;
Listing 4: Sample state template
case <%state_name%>:
{
    <%statements%>
    state = <%next_state%>;
}
break;
Listing 5: Sample state (final state) template

3.2.3.3 Sample Generated Code

Once we have the templates in place, it is a matter of populating them with the right snippets, based on the Code Generator's (see Figure 13) interpretation of the logic in the Supervisory Controller Data Structure. The generated header file for the sample state chart of Figure 15 is shown in Listing 6; the generated source code for the same example can be seen in Listing 7. This is the final generated code obtained after populating the templates. The input states are correctly identified on lines 15 and 32. Transitions to subsequent states are populated on lines 24 and 34. Lines 19 and 20 handle the transition condition.
#ifndef SUPERCONTROLLER_HCH
#define SUPERCONTROLLER_HCH

#define instances 1    // number of SC instances

extern macro proc supercontroller(state, instanceID);

typedef enum
{
    State_0 = (unsigned 7)0,
    State_1
} States;

#endif // SUPERCONTROLLER_HCH
Listing 6: Generated supervisory controller header file ("supercontroller.hch")
 1  #include "supercontroller.hch"
 2
 3  //***other Header Files*** (NOT SHOWN FOR SAKE OF BREVITY)
 4  //***Macros*** (NOT SHOWN FOR SAKE OF BREVITY)
 5  //***Variables*** (NOT SHOWN FOR SAKE OF BREVITY)
 6  //***Semaphores*** (NOT SHOWN FOR SAKE OF BREVITY)
 7
 8  macro proc supercontroller(state, instanceID)
 9  {
10      boolean doTransition;
11
12      switch (state)
13      {
14          //***states***
15          case State_0:
16          {
17              //***<%additional_statements%>*** (NOT SHOWN FOR SAKE OF BREVITY)
18
19              doTransition = conditionVal;
20              if (doTransition == TRUE)
21              {
22                  //***<%statements%>*** (NOT SHOWN FOR SAKE OF BREVITY)
23
24                  state = State_1;
25              }
26              else
27              {
28                  delay;
29              }
30          } break;
31
32          case State_1:
33          {
34              state = State_1;
35          } break;
36      }
37  }
Listing 7: Generated supervisory controller source file ("supercontroller.hcc")

3.3 Validation of Generated Controllers

Following the simulation method (Figure 1) explained in Section 1, we check that the behavior of the controllers is the same in each step of the method; in other words, we check that the automatically generated controllers in step n behave the same as the ones in step n – 1. Consequently, we have to assume that the starting point, the MATLAB® model of the ADF, is correct. Via co-simulations and hardware-in-the-loop simulations, we show that the generated controllers behave the same as the controllers in MATLAB®; we do not show that the generated controllers behave according to the specifications in the execution sheet. Checking that the supervisory controller behaves according to its specifications might be a challenge and could be considered for future work. For the supervisory controller an additional risk was identified: the given supervisory controller in MATLAB® may contain discrepancies with respect to the execution sheet, and since the idea was to generate the Handel-C supervisory controller from the execution sheet, these discrepancies may show up when the Handel-C supervisory controller is compared with the original MATLAB® simulation.

3.3.1 Validate MDL2HC with the Automatically Generated Motor Controller

The automatically generated motor controllers created by MDL2HC can be validated by comparing the behavior of the motor controller in Handel-C with the motor controller in MATLAB® (see Figure 16). The behavior of both controllers was the same, demonstrating that MDL2HC produces code with the same result as the MATLAB® models.
Figure 16: The Handel-C motor controller generated by MDL2HC displays exactly the same behavior as the motor controller in MATLAB®. The motor controller controls a motor model.
Figure 17: Tracking behavior of the motor controller on one motor in a hardware-in-the-loop simulation [left: desired and actual position; right: error]

3.3.2 Check Tracking Behaviour of Motor Controller

Figure 17 shows how well a simulated motor follows a given set point when controlled by the FPGA in a hardware-in-the-loop test. The left graph shows the actual and desired motor positions, which are very close to each other; the right graph shows the difference between the two.

3.3.3 Compare Behaviour of Motor & Supervisory Controller through All Simulation Steps

Figure 18 displays all steps of the simulation method for the model that combines the supervisory controller and the motor controllers. A comparison between these graphs is not trivial; some observations are noted next. Until 0.5 seconds all graphs display the same behavior. After 0.5 seconds, one observation is that the HIL simulation and the Real ADF
are delayed by approximately 0.1 seconds compared to the pure MATLAB® simulation and the Co-Simulation. No satisfactory explanation for this delay exists yet; however, the delay is related to the switch from simulated time to real time and to how the supervisory controller handles timeouts. Another observation is that at 1.6 seconds the HIL simulation curve goes up while the Real ADF curve continues straight. This difference occurs because in the Real ADF the second motor controller, which should take the paper and transport it further, was not yet implemented; in the HIL simulation that motor controller was implemented and the paper continues normally. Overall, more investigation is needed to truly validate the supervisory controller and make a comparison over all simulation steps. To help this investigation, an easy means to log and analyze data from the real ADF is needed.
Figure 18: A comparison of the supervisory controller combined with one motor controller through all steps of the simulation method.

3.4 A Few Notes on Co-Simulation

This section details a few of the challenges we had with the Co-Simulation. The Co-Simulation was performed using Celoxica's Co-Simulation Manager [6, 7]. The tool is an integration environment, which we used to connect DK with Simulink®. Setting up the Co-Simulation is relatively simple, but in the version we used some things are non-trivial:

• To set up the Simulink® modules, define the ports for these modules in alphabetical order; otherwise the integration in Simulink® will not work properly (the CoManSimulink S-function cannot be integrated in Simulink®).
• Not all values can be passed to Simulink®: everything worked well with signed 32-bit values as input to Simulink® ports, but values smaller than 32 bits are sometimes passed incorrectly.
• For some frequencies (too high or too low), it looks like the Co-Simulator fails to synchronize the Handel-C module with the Simulink® one, so the values received/transmitted are unreliable.
• The simulation step size needs to be fixed.
3.5 Hardware in the Loop Description and Challenges

During the HIL simulation, we experienced several challenges. These challenges and our approach to them are presented in this section. Figure 19 depicts, on a logical level, how the RC203 can be connected to the ADF or the xPC. Remember that the xPC↔RC203 interface and the ADF↔RC203 interface are the same, so that the xPC and the ADF are interchangeable.

3.5.1 Voltage Level Converter

The RC203 can only handle 3.3V input and output signals, whereas the ADF and the xPC work on 5V. To resolve this voltage difference, we created a voltage level converter, placed between the ADF or xPC and the RC203, which "transforms" 3.3V to 5V and vice versa. It has 18 input pins at 3.3V that are converted to 5V (from RC203 to ADF) and 10 input pins at 5V that are converted to 3.3V (from ADF to RC203).
Figure 19: Hardware in the Loop overview of component decomposition and interfaces

3.5.2 xPC

The xPC is a PC-based box with an Intel Pentium 4, which runs a real-time operating system from The MathWorks [16]. It is used to emulate the ADF; i.e. it emulates the motors and the paper path logic of the ADF. The reason for having this xPC is that the controlling FPGA can be tested against the xPC simulation instead of the real ADF: if the ADF is incorrectly controlled, this might result in physical damage to the ADF, whereas with the xPC there is no such risk. A drawback of the xPC is that, due to the PC hardware architecture, the maximum sample rate (resolution) is 20 kHz (a sample time of 0.00005 seconds). The 0.00005 second value comes from the time the interrupt controller needs to identify the source of an interrupt and report it to the CPU. The limits of the xPC are experienced with the high-frequency PWM signal discussed next.
3.5.3 PWM Signal

As seen in Figure 19, the xPC receives a PWM signal from the RC203. This PWM signal tells the motor how fast it should run. Because the motors of the ADF are very sensitive, the resolution of the PWM signal has to be high enough (for example 250 levels) to let the motor controller follow the set-points accurately. Also, the PWM frequency must be (close to) 20 kHz, because this frequency is barely audible and it is fast enough to smoothly control the motor. If we combine the required resolution and PWM frequency, we see that we need to sample with a frequency of 20 kHz * 250 = 5 MHz in the xPC. Since the xPC has a maximum sample frequency of 20 kHz, it clearly cannot process the PWM signal using simple (slow) I/O cards; additional hardware is needed. The desired hardware needs to fulfill the following requirements:

• It must be able to handle a PWM frequency of 20 kHz.
• It must be supported (out-of-the-box) by MATLAB®/Simulink® and xPC.
• Ideally, it should be able to process the PWM signal on its own.
The PCI cards of the PCI-CTRxx family from Measurement Computing meet all these criteria. Using two of its available timers, such a card can capture the PWM signal and retrieve the duty cycle. We did encounter a problem with passing a 0% duty cycle to the PCI-CTR05 capture card. Therefore, we always pass at least a 1% duty cycle, which is not enough to overcome the friction of the motors, but does solve the 0% duty cycle problem.

3.5.4 Quadrature Signals

The quadrature signals are used to obtain the position of a motor. A quadrature signal is transmitted via two wires. Since the xPC does not have a real motor with encoders inside, it has to generate from software the same quadrature signals as the ones generated by the ADF. Each motor on the ADF has two sensors that indicate whether a slit is detected or not; the sensor locations are depicted in Figure 20. We had to perform two conversions:

• translate the desired position to quadrature signals in the xPC, in order to emulate the ADF's quadrature sensors;
• translate the quadrature signals back to a position in the RC203.

Both conversions are realized by using a state machine; a sketch of such a decoder is given below. Figure 21 shows the four possible states that the two sensors together can have.
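To make the decoding direction concrete, here is an illustrative Java sketch (ours, not the workshop code, which ran in Handel-C on the RC203 and in Simulink® on the xPC) of decoding the two sensor bits into a position counter, following the four-state cycle of Figure 21:

// Illustrative quadrature decoder: the two sensor bits (A, B) step through
// the Gray-code cycle 00 -> 10 -> 11 -> 01 -> 00 in one direction of
// rotation and through the reverse cycle in the other direction.
public class QuadratureDecoder {

    private static final int[] CYCLE = { 0, 2, 3, 1 }; // 00, 10, 11, 01
    private int lastState = 0;  // 2-bit state: (A << 1) | B
    private long position = 0;  // counted slit edges

    public long update(boolean a, boolean b) {
        int state = ((a ? 1 : 0) << 1) | (b ? 1 : 0);
        int delta = (indexOf(state) - indexOf(lastState) + 4) % 4;
        if (delta == 1) {
            position++;            // one step forward in the cycle
        } else if (delta == 3) {
            position--;            // one step backward in the cycle
        }                          // delta == 2 means a missed edge; ignored here
        lastState = state;
        return position;
    }

    private static int indexOf(int state) {
        for (int i = 0; i < CYCLE.length; i++) {
            if (CYCLE[i] == state) {
                return i;
            }
        }
        throw new IllegalArgumentException("not a 2-bit state: " + state);
    }
}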
3.5.5 Sensor Input / Actuator Output

The sensors in the Paper Path model and the ADF tell the controller where the paper is located in the copier. These sensors have a very low frequency compared to the quadrature and PWM signals; therefore, only their count matters when selecting hardware that supports communication with all sensors and actuators on the xPC. Because we need at least 28 pins to control the ADF, and we want the controller to be plug and play, we need a digital I/O card that can handle 28 lines.

3.5.6 Performance

The FPGA should be able to operate at 30+ MHz, because all its operations execute in parallel. As mentioned earlier, the xPC is limited to 20 kHz due to the interrupt controller implementation.
Figure 20: A quadrature signal is used to determine the motor speeds. It is generated from two sensors (A and B) placed just after each other over the slits. Essentially, these sensors count the edges of the slits that have passed.
Figure 21: The four possible states that a quadrature signal with two sensors can have: A and B not triggered; A triggered, B not triggered; A and B triggered; A not triggered, B triggered.
4. Conclusions and Future work

In this paper, we detailed the results from WSSE04, which involved the automatic generation of Handel-C code for an FPGA, and using the FPGA to control the flow of paper through the scanning section of a printer. The generated code was tested using various levels of simulation, and the effort also demonstrated how these simulation levels define a structured approach for proceeding with the development from a high-level modeling environment down to the actual hardware.

4.1 Summary of the Results

In the three months available for the workshop we realized several goals. First, we created two unique, in-house solutions for automatically generating Handel-C for an FPGA. One solution generated Handel-C from MATLAB®/Simulink®; it generated code for twenty-three (23) standard Simulink® blocks and provided the flexibility to extend this to other blocks as well. The other solution, specific to the project case we considered, generated Handel-C from a state-based specification in Excel. Both solutions were combined in a motion control case study provided by Océ.
Second, we realized the case study provided by Océ. We can control the flow of paper through a prototype – namely the Automatic Document Feeder – of the scanning section inside an Océ multifunctional. We replaced the existing microcontroller that handled all control-related tasks (i.e. controlling motors, reading position sensors, and opening/closing clutches at the appropriate times) with an FPGA. The FPGA was programmed using Handel-C, which was generated automatically.

Third, in order to validate our solutions and to increase our confidence in the generated controllers, we followed an incremental design and test method consisting of several levels of simulation. Using a software-in-the-loop simulation, we observed that the behavior of the Handel-C code generated for the motor controllers was exactly the same as that of the original motor controllers in MATLAB®/Simulink®. By virtue of a hardware-in-the-loop simulation, we verified our hardware interfaces. As a final result of the workshop we completed the system tests and managed to transport a ream of paper through the ADF from start to finish of the paper path using our automatically generated code and the FPGA. In total, we used 84% of the FPGA's (Xilinx Virtex-II XC2V3000) resources, which also included the logging functionality.

4.2 Future work

We think that, as part of future work, more attention could be paid to code optimization, since the sizes of the generated FPGA designs are quite large. Our initial analysis indicates that in MDL2HC (Section 2), the large design size is mainly due to the extensive use of channels. We anticipate two independent optimization possibilities. One is to use Handel-C signals instead of channels, but this introduces the challenge that everything has to be done within one clock cycle in order to work correctly; in addition, a new implementation for enabled and triggered subsystems would be needed in that case (since the current implementation relies on the blocking behavior of channels). The second suggestion is to take a closer look at the model. The model we used was not created with the intention of generating a hardware design from it; therefore, most data types were chosen to be 32-bit, while in a lot of cases fewer bits would probably suffice.

Besides code optimization, more work is needed to verify that the supervisory controller combined with the motor controllers behaves as specified. A means is also needed to easily verify the real-time behavior of the controllers in the early stages of the design. Since the FPGA is extremely fast, we think that it would be beneficial to use the FPGA as a real-time target platform during the hardware-in-the-loop simulations. This would help to mitigate some speed limitations we experienced while capturing signals using the PC platform (xPC target) for hardware-in-the-loop. The code for the FPGA (used as a real-time platform) could also be automatically generated using the tools demonstrated in this paper. In any case, the unique cooperation between the parties involved is expected to continue for WSSE05.

Acknowledgements

This work would not have been possible without the unique and enthusiastic cooperation between OOTI, Philips TASS, Océ Technologies BV, Celoxica Limited, The MathWorks Benelux BV, the Boderc project at ESI and Océ; and of course all of the OOTI fellows of generation 2003.
Also thanks to Johan Sunter from Philips Semiconductors who explained how he generated Handel-C and donated his program to serve as a starting point for this code generator.
References

[1] Weffers H.T.G., Workshop System/Software Engineering, WSSE04, OOTI (Ontwerpers Opleiding Technische Informatica), Eindhoven University of Technology (TU/e), August 30, 2004. http://wwwooti.win.tue.nl/
[2] SB et al., DK Design Suite User Guide, DK3, Celoxica Limited, Document Number UM-2005-4.2.
[3] Datasheet, RC203 Development & Evaluation Board Datasheet, Celoxica Limited. http://www.celoxica.com/techlib/files/CEL-W0504181A26-52.pdf
[4] User Guide, Using Simulink, Simulink 6, The MathWorks Inc. http://www.mathworks.com/access/helpdesk/help/pdf_doc/simulink/sl_using.pdf
[5] Hamberg, Roelof and Beenker, Frans, Summary Project Plan for the Boderc project on Multi-disciplinary system-controller design, Document Number 2002-10057, Version 02, Embedded Systems Institute. http://www.esi.nl/site/projects/boderc/publications/summaryprojectplanv2.pdf
[6] SB et al., Co-simulation Manager User Manual, Platform Developer's Kit, Celoxica Limited, Document Number 1.
[7] SB et al., Co-simulation support for MATLAB, Platform Developer's Kit, Celoxica Limited, Document Number UM-2006-1.1.
[8] User Guide, xPC Target, Real Time Workshop, The MathWorks Inc., Version 2. http://www.mathworks.com/access/helpdesk/help/pdf_doc/xpc/xpc_target_ug.pdf
[9] SB et al., PAL User Guide and API Reference Manual, Platform Developer's Kit, Celoxica Limited, Document Number 1.
[10] SB et al., Handel-C Language Reference Manual, Celoxica Limited, Document Number RM-1003-4.2, 2004. http://www.celoxica.com/techlib/files/CEL-W0410251JJ4-60.pdf
[11] Programming in Handel-C, Frequently Asked Questions, Celoxica Limited. http://www.celoxica.com/techlib/files/CEL-W0307171KW9-58.pdf
[12] User Guide, Real-Time Workshop, Simulink 6, The MathWorks Inc. http://www.mathworks.com/access/helpdesk/help/pdf_doc/rtw/rtw_ug.pdf
[13] User Guide, Real-Time Workshop Embedded Coder, Simulink 6, The MathWorks Inc., Version 4. http://www.mathworks.com/access/helpdesk/help/pdf_doc/ecoder/ecoder_ug.pdf
[14] User Guide, Simulink Reference, Simulink 6, The MathWorks Inc. http://www.mathworks.com/access/helpdesk/help/pdf_doc/simulink/slref.pdf
[15] User Guide, Writing S-Functions, Simulink 6, The MathWorks Inc. http://www.mathworks.com/access/helpdesk/help/pdf_doc/simulink/sfunctions.pdf
[16] User Guide, xPC TargetBox, Real Time Workshop, The MathWorks Inc., Version 1. http://www.mathworks.com/access/helpdesk/help/pdf_doc/xpc_targetbox/xpc_targetbox_ug.pdf
JCSP-Poison: Safe Termination of CSP Process Networks

Bernhard H.C. SPUTH and Alastair R. ALLEN
Department of Engineering, University of Aberdeen, Aberdeen AB24 3UE, UK
[email protected], [email protected]

Abstract. This paper presents a novel technique for safe partial or complete process network termination. The idea is to have two types of termination messages (poison): LocalPoison and GlobalPoison. Injecting GlobalPoison into a process network results in a safe termination of the whole process network. In contrast, injected LocalPoison terminates only the processes it reaches before being filtered out by Poison-Filtering-Channels; this allows the creation of termination domains inside a process network. To make handling of a termination message easy, it is delivered as an exception and not as a normal message. The necessary Poisonable- and Poison-Filtering-Channels have been modelled in CSP and checked using FDR. A proof-of-concept implementation for Communicating Sequential Processes for Java (JCSP) has been developed and refined. Previously, JCSP offered no safe way to terminate a process network: when the user terminated the program, the Java Virtual Machine (JVM) simply stopped all threads (processes), without giving the processes the chance to perform clean-up operations. A similar technique is used to perform partial termination of process networks in JCSP, making it unsafe as well. The technique presented in this paper is not limited to JCSP, but can easily be ported to other CSP environments. Partial process network termination can be applied in the area of Software Defined Radio (SDR), because SDR systems need to be able to change their signal processing algorithms during runtime.

Keywords. JCSP, SDR, Partial Process Network Termination, Poisoning, Poisonable-Channel, Poison-Filtering-Channel, Termination Domains
Introduction

In CSP [1,2] applications consist of processes. A process is a sequence of instructions. In complex applications, multiple processes execute concurrently. To avoid race conditions when accessing global resources, processes are not allowed to share global resources without synchronisation. Processes communicate with each other by using unidirectional channels. A communication over a channel only takes place when the receiver and sender processes are cooperating; this rendezvous of processes is used for synchronisation in CSP. The combination of processes and channels forms a process network. Process networks can be visualised in the form of block diagrams: processes are represented by blocks, and channels by arrows, with the arrow head indicating the direction of communication. Figure 1 shows a simple process network, where the PRODUCER process sends messages, over a channel named messenger, to the CONSUMER process.

Figure 1. Simple process network diagram

To terminate a process network, all of its processes need to terminate; in the example given, this means that both PRODUCER and CONSUMER have to terminate. In CSP a process only terminates once it has fulfilled its task. A problem occurs when a process does not know when it has fulfilled its task. This is, for instance, the case for processes that are part of a signal processing chain. A signal processing chain consists of three parts: a data source, a signal
processing part and a data sink, see figure 2. In such a chain, only the data source is able to determine that all signal processing has been performed, and even then only if the data processing is done off-line, meaning that the signal data was provided in the form of a file. If the signal processing is performed online, not even the data source knows when it has reached the end of its usefulness; this decision is made by the user, who then has to close the application to stop the signal processing. But even if the data source is able to determine that the task is fulfilled, it has no way of telling the other processes of the chain. This is caused by the way signal processing processes operate: a constant loop of inputting a frame of signal data, processing the frame and then outputting it. If no more signal data arrives, these processes will wait indefinitely for new signal data and thus never terminate. The following text will discuss different methods to enable safe termination of process networks. Furthermore, in Software Defined Radios (SDR) [3,4] it is necessary to exchange software modules without affecting their execution environment; thus partial process network termination is necessary, and a technique for it will be developed and discussed in this paper as well. Due to our previous work in the field of signal processing and CSP using JCSP [5], the implementation of the techniques shown in this paper is based upon JCSP.

Figure 2. Principle Signal Processing Chain [data source → signal processing → data sink]
1. Terminating Networks in JCSP

In this section the various available techniques of process network termination in JCSP [6] are introduced, and their shortcomings are discussed.

1.1. Process Network Termination in JCSP

JCSP [6] is an environment allowing the development of Java applications following CSP principles. In JCSP, processes are represented by Java threads. The Java Virtual Machine (JVM) differentiates between two types of threads: user threads and daemon threads. The definition of a daemon thread, according to [7, Page 26], is: "A daemon is simply a thread that has no other role in life than to serve others", with others meaning other threads. By contrast, a user thread serves the user of the program. To the JVM, a program which consists only of daemon threads serves no purpose and is terminated. As this can lead to unwanted program termination, all threads are created as user threads by default and can be converted to daemons afterwards. JCSP creates all threads as daemons; only the thread created by the JVM itself, when starting the program, is a user thread. Termination of a complete JCSP process network can therefore be done by terminating the thread created by the JVM. This type of process network termination is simple to use but in most cases utterly unsafe, because individual processes cannot perform clean-up operations prior to their termination.
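The daemon behaviour itself is plain Java. A minimal sketch (ours, independent of JCSP) of how a program dies together with its daemon threads:

// Minimal sketch: the JVM exits as soon as only daemon threads remain,
// killing them without running any clean-up code.
public class DaemonDemo {

    public static void main(String[] args) {
        Thread worker = new Thread(new Runnable() {
            public void run() {
                while (true) {
                    // work forever, like a process in a signal processing chain
                }
            }
        });
        worker.setDaemon(true); // mark as daemon, as JCSP does for its processes
        worker.start();
        // main() returns here: the only user thread ends, so the JVM
        // terminates and the daemon worker is stopped abruptly, with no
        // chance to run finally blocks or other clean-up.
    }
}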
1.1.1. Partial Process Network Termination in JCSP

In JCSP it is possible to spin off process networks using the ProcessManager class. This class also provides the ability to stop the process network that it spun off but, according to the documentation, this function should not be used. JCSP therefore offers no special support for network termination; instead the user must arrange for network termination himself.

1.1.2. Possible Solution

One possible way to terminate a process network is by broadcasting the termination message to all processes of the process network; upon reception of this message, the processes terminate. This can be achieved in JCSP by sharing a message variable among the group of processes to be terminated. There are two types of processes: one broadcaster process and multiple receiver processes. Access to this variable is synchronised on a JCSP Barrier shared among all processes. The JCSP Barrier splits each processing cycle into two phases, to comply with the CREW (Concurrent-Read, Exclusive-Write) concept:

• In the first phase all receiver processes check the shared message variable. The broadcaster process will not change the message variable during that time.
• In the second phase the broadcaster may change the shared message variable. The receiver processes must not look at the shared message variable.

Once the broadcaster decides to terminate the system, it changes the message variable, and all processes, including the broadcaster, will see the termination message the next time they check the variable. This way all processes terminate in the same cycle; a sketch of the receiver side is given after the list of drawbacks below. This approach has a number of drawbacks:

• Broadcaster and receiver processes need to synchronise twice per cycle on the JCSP Barrier. This is a computing-intensive task.
• It only works for systems where all processes have the same cycle length, which in signal processing is not necessarily the case. There, a combiner process may require multiple signal data frames from the upstream process to construct a single frame for the later stages of processing. For the upstream process a single cycle is the creation of one frame; for the combiner process the cycle length is likewise the creation of one frame, but it requires multiple frames from the upstream process to do so. If the upstream process now tries to synchronise on the JCSP Barrier, it will have to wait for the combiner process to synchronise as well. But the combiner process, not having finished its cycle, will not synchronise; instead it will wait for a frame to arrive from the upstream process. The result is a deadlock. Therefore, if this broadcasting technique is used in such a system, the developer has to make sure that all processes have the same cycle length; in the example given, the upstream process would have to produce multiple output frames during one cycle.
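As an illustration of the two-phase scheme just described, here is a hedged receiver-side sketch. It assumes the JCSP Barrier class with a sync() method (package and constructor details vary between JCSP releases); the TerminationFlag holder class and the phase ordering are our own illustration:

import org.jcsp.lang.Barrier;   // package name differs in older JCSP releases
import org.jcsp.lang.CSProcess;

// Shared holder for the broadcast termination message.
class TerminationFlag {
    volatile boolean terminate = false;
}

// Receiver side of the barrier-based broadcast: phase 1 reads the flag,
// phase 2 is reserved for the broadcaster to write it.
class Receiver implements CSProcess {

    private final TerminationFlag flag;
    private final Barrier barrier;

    Receiver(TerminationFlag flag, Barrier barrier) {
        this.flag = flag;
        this.barrier = barrier;
    }

    public void run() {
        while (true) {
            boolean stop = flag.terminate; // phase 1: all receivers read
            barrier.sync();                // end of read phase
            // phase 2: only the broadcaster may write the flag here
            barrier.sync();                // end of write phase
            if (stop) {
                return;                    // everyone saw the same value,
            }                              // so all stop in the same cycle
            // ... one cycle of application work ...
        }
    }
}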
1.2. Introduction to Graceful Termination

Not being able to perform clean-up operations may lead to data corruption or unnecessary resource consumption. In [8] P.H. Welch discusses different ways to perform safe termination of process networks, and introduces a technique called graceful termination. In this technique a special message, the poison, is sent to one process of the network; this is the reason why this technique is also referred to as poisoning. A process receiving poison:

1. performs all necessary clean-up operations;
2. places processes that black-hole messages until poison arrives (known as black-hole processes) at all its channel inputs; such processes swallow any arriving message but terminate when poison is received;
3. sends poison over all its channel outputs, to poison all processes it sends data to;
4. does items 2 and 3 above in parallel, waits for all installed processes and the poison-sends to terminate, and then terminates itself.

While this technique looks good at first glance, there are several problems when trying to use it in practice, as discussed in the following.

1.3. JCSP and Graceful Termination

The implementation of graceful termination as proposed in [8] relies on the ability of processes to differentiate between incoming normal messages and poison. But how can a poisonous message be distinguished from a normal one? For the Object-channels available in JCSP, it can easily be done by defining a class representing poison, so that the receiver of a message only has to perform a type-checking operation (see the sketch below). For integer-channels this is not possible, as only integer values are passed. One possible solution is to send two integer values for a normal message: the first integer indicating poison or not (for instance 0 for non-poisoned and 1 for poisoned), the second the value to be transferred; in the case of poison, only the first integer is transmitted. This will work, but requires double the bandwidth and increases the processor drain during normal operation. As this is undesirable, a different delivery mechanism for poison has to be found.
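A hedged sketch of the type-checking approach on a JCSP Object channel (the PoisonMessage class and the process body are ours, for illustration only, assuming the classic JCSP ChannelInput with an Object read() method):

import org.jcsp.lang.CSProcess;
import org.jcsp.lang.ChannelInput;

// A class whose only purpose is to be recognisable as poison.
class PoisonMessage {
}

// Every received message must be type-checked: this is the per-message
// overhead that message-based poisoning imposes.
class CheckingConsumer implements CSProcess {

    private final ChannelInput in;

    CheckingConsumer(ChannelInput in) {
        this.in = in;
    }

    public void run() {
        while (true) {
            Object msg = in.read();
            if (msg instanceof PoisonMessage) {
                // clean up, relay the poison to neighbours, then terminate
                return;
            }
            // ... normal processing of msg ...
        }
    }
}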
Figure 3. Complex process network before and after termination of process B [(a) prior to termination; (b) after termination of process B, with black-hole processes at the inputs from A1 and A2]
A terminated process no longer fetches messages from its channel inputs. If processes then try to send messages over the associated channels, they will wait indefinitely for the receiving process to fetch the message: these processes deadlock. To avoid this situation, the terminating process needs to somehow service its channel inputs. This is done by connecting each channel input to a process that simply swallows any incoming message; these processes are called black-hole processes. This effectively prevents the process network from deadlocking. If, in the process network shown in Figure 3(a), process B terminates, it creates two black-hole processes for its two channel inputs and sends poison to its channel output. Process C terminates without creating a black-hole process, as it is only connected to one channel, over which it received the poison. As illustrated in Figure 3(b), the net result of the termination operation is that there are now two black-hole processes, swallowing any messages arriving from A1 and A2. These processes will cease to exist once processes A1 and A2 have each
decided to terminate and send poison down their channels. The result of applying the graceful termination technique to this network, with process B determining when to terminate, is thus an incomplete process network termination. Let us investigate how a complete network termination can be achieved using graceful termination, with the possibility of every process of the network being able to initiate a complete termination.
Figure 4. Completely terminatable complex process network for different poison originators [(a) process network ready for complete termination initiated by B; (b) process network modified to allow termination initiation by any process]
With the poison represented as a message, every incoming message needs to be checked for poison. This constant checking not only results in an overhead, but also makes the resulting code more complex, due to the additional code necessary for checking and handling. Also, messages can only travel in the direction of the channel. For processes which act as data sources, this can result in extra channels just to transport the termination message; this camouflages the data flow and makes the use of these processes more complex. To be able to terminate the complete process network of figure 3(a), assuming that process B injects the poison, it would be necessary to introduce two additional channels; the resulting process network is shown in figure 4(a). When the decision to inject poison can only be made by process C, a total of three additional channels are required. The data flow of the process network becomes totally hidden if we have the goal that any process should be able to initiate a complete network termination, see figure 4(b). Of course, the complexity of every process increases, as it has to check all channel inputs for incoming messages.

Figure 5. One2Any-Channel with multiple receiver processes [one sender A, receivers B1 and B2]
JCSP allows channels with multiple channel inputs and outputs, such as the One2Any-, Any2One- or Any2Any-Channels. These channel types are necessary to implement master-worker environments, where multiple identical workers operate upon requests issued by one master. Another area is that of environments where multiple processes want to submit similar requests to one process; this is, for instance, the case when a process is used to guard a resource from concurrent access. An example of a process network not terminating correctly using this approach is given in Figure 5. To terminate the process network shown, it is necessary to relay a termination message to all processes that are listening on this channel. But how does process A know how many processes are listening on the channel?
1.4. Summary of Problems of Graceful Termination and JCSP

The previous sections detailed the problems poisoning poses in general, and especially for use in the JCSP environment. In the following, our JCSP-Poison approach, which tackles these problems, will be introduced:

• Every message received by a process has to be checked as to whether it is poisonous. For channels that are heavily used, this may waste a lot of computing resources.
• Termination of a process only having an outgoing channel is not possible from outside this process, unless a dedicated termination channel is provided. This is caused by delivering the poison as a message, which can only travel in one direction over the channel.
• Partial process network termination is handled by the graceful resetting technique, but this requires adjusting all processes connecting to the sub-network.
• JCSP channels can have multiple writer and reader processes, but only one writer and one reader can exchange messages at any one time. To deliver poison in such a system, the sending process would need to know how many processes are connected to each channel over which it sends poison. For a process this is impossible to know, as the decision is made by the developer of the overlying process network.

The following section will detail how these problems have been resolved for JCSP. Section 3 will cover the actual implementation and how to use it. A CSP model for the JCSP implementation will be developed in section 4. The paper closes by drawing conclusions and giving an outlook on further work in the area.
2. JCSP-Poison

The first part of this introduction to JCSP-Poison covers improvements for complete process network termination; the second part details the changes made to support partial process network termination. JCSP-Poison tries to provide clean process network termination for JCSP using the graceful termination technique introduced in section 1.2. In order to make this technique feasible in JCSP, the previously mentioned problems need to be overcome. The core problem of graceful termination in conjunction with JCSP is that poison is defined as a message. A message can only be sent to one recipient: other processes listening on the channel will be unaware of the situation. The poison also only affects the processes that receive it, but not the channels that carry it. In JCSP-Poison, poison is not a message but an object that is passed from processes to channels and vice versa, with a special mechanism for propagation. The propagation of poison is done by injecting it into a channel instead of using the normal channel-output methods. A channel which has been injected with poison is considered to be poisoned. The delivery of the poison to a process is done by throwing an exception whenever a process tries to interact with a poisoned channel. This is similar to the poisoned channel of Gerald Hilderink, mentioned in [9], where an ArrrghException is thrown at anyone who tries to use a poisoned channel; it is unclear there whether only the writer end is allowed to poison the channel, or the reader end as well. In the channels of C++CSP [10], both channel ends are able to receive and deliver poison. The direction independence of the injection method makes it possible for the poison to travel in the reverse direction of the data flow of a channel, thus avoiding the need to add extra channels for this purpose. Once a channel gets poisoned, it wakes up any processes currently waiting for a rendezvous and throws a poison exception at them.
This technique solves the following of the earlier mentioned problems:

1. There is no need to check every incoming message for poison, giving clearer handling code and lower computing resource usage.
2. The poison can propagate backwards over the channels, making correct termination of a process network easier to implement.
3. Because the channel rejects any interaction with it, no black-hole processes are required to avoid deadlocks. This once again saves computing resources.
4. Poison spreads to all processes using a channel, even when channels with multiple inputs and outputs are used.
5. Handling of poison messages is enforced, due to the delivery as a PoisonException and the Java compiler enforcing exception handling.

This technique is optimised to terminate a complete process network. Its working is similar to a bulldozer destroying everything in its path: it terminates every process that crosses its path. In this state it is unsuitable for terminating only parts of a process network. Terminating sub-networks is a requirement for exchanging software modules in an SDR platform. An SDR platform is basically split into a signal processing part and a data acquisition part. The signal processing part is represented by exchangeable software modules. The data acquisition part runs constantly; if it terminates, it terminates the software module as well.

2.1. Poisoning of Sub-Networks

To poison sub-networks it is necessary to limit the propagation of poison in the process network. To do so, one could install channels that do not let poison pass at the borders of the sub-network. This approach has the flaw that it effectively prevents the complete termination of the process network. To avoid this, it is necessary for the channels at the borders to be able to differentiate between a complete and a sub-network termination message. This can be achieved by having multiple types of poison; in the case of JCSP-Poison two types are defined:

• GlobalPoison: GlobalPoison is distributed throughout the complete process network and is never filtered out. This type of poison is used to terminate the complete process network.
• LocalPoison: LocalPoison is used to terminate sub-networks and is filtered out by the channels at the borders.

Both types of poison are derived from a common base class. With these two types of poison comes the need to ensure that the type of poison does not change during propagation. Therefore, the PoisonException carries a reference to the original poison, which can be retrieved in the exception handler and injected into the other channel ends of the process. The two types of poison and the different channels provided are the main difference between JCSP-Poison and its predecessors.

2.2. Poisonable-Channel

A Poisonable-Channel can be in two states. The first is normal operation, in which the channel acts as a normal channel. However, once the channel is in the poisoned state, all requests to it, whether reading or writing, will result in the channel issuing an exception. To change from the normal state into the poisoned state, the channel offers the void injectPoison(Poison) method. The name injectPoison was chosen because the process that poisons the channel has a choice between two types of poison; another reason was that it seemed a nice connection to the real world, where poison is injected into an organism.
2.3. Poison-Filtering-Channel

A Poison-Filtering-Channel acts as a normal channel, but reacts differently depending on the type of poison received. When receiving a GlobalPoison, the channel will act like a poisoned Poisonable-Channel, thus allowing a complete process network termination. The situation is different if a LocalPoison is received: in this case the poison will not be relayed at all but silently disposed of, and the Poison-Filtering-Channel will not change into the poisoned state. It is the responsibility of the developer not to operate on a channel end previously poisoned. These channels are used at the borders of sub-networks, to avoid LocalPoison leaving the sub-network. They have to be used with care, as they can easily result in systems that deadlock. This is, for instance, the case when two sub-networks are connected over a Poison-Filtering-Channel and one of these networks terminates using a LocalPoison. The other sub-network, not knowing about this termination, might now try to communicate with the terminated network over the channel, but will wait for a reply indefinitely. Poison-Filtering-Channels should only be used in cases where the sub-network is meant to be exchanged; in this case they ensure that no message is lost during the exchange of the sub-networks. One feature of Poison-Filtering-Channels arises when they have multiple channel inputs or outputs: as they effectively block any arriving LocalPoison and do not relay it at all, it is possible to terminate only one connected sub-network. This allows the number of worker-process sub-networks in a master-worker system to be adjusted dynamically.

2.3.1. Poison-Injector-Channel

A special case of the Poison-Filtering-Channel is the so-called Poison-Injector-Channel. This channel normally behaves like a normal One2One Poison-Filtering-Channel, i.e. no LocalPoison will affect it. But it is possible for a process to poison one channel end with LocalPoison, using either the injectLocalPoisonIntoWriter() or the injectLocalPoisonIntoReader() method. This channel makes it possible to inject a LocalPoison into a sub-network to terminate it. At the same time, a complete process network termination is not hindered, i.e. no special precautions have to be made. An implementation of this channel exists, but so far no CSP model has been created.

3. JCSP Implementation

In the previous sections the ideas behind JCSP-Poison were discussed. The following section discusses the implementation of poisoning in JCSP-Poison. It starts with the definition of poison in JCSP, followed by a definition of the interface exposed by the poisonable channels.

3.1. Representation of Poison

In JCSP-Poison, poison is represented by two classes: GlobalPoison and LocalPoison. Both classes are derived from the interface Poison. This allows the use of functions which are indifferent to the type of poison they handle. The classes representing poison have no further functionality. The class diagram in figure 6 shows the relationship of Poison and its children.

3.2. Delivery of Poison to a Process

In JCSP-Poison, poison is delivered to a process not as a message, but as an exception. The class PoisonException is used for this task. The exception has a member which can hold an instance of Poison; the method Poison getPoison() is used to retrieve the value of this member. This new property requires a redefinition of the JCSP channel ends, detailed in section 3.3.
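A hedged, self-contained sketch of these pieces (the type hierarchy of figure 6, a checked PoisonException with getPoison(), and the filtering rule of section 2.3); all method bodies are illustrative and not the actual JCSP-Poison source:

// Marker hierarchy, as in figure 6: two poison types behind one interface.
interface Poison {
}

class GlobalPoison implements Poison {
}

class LocalPoison implements Poison {
}

// A checked exception, so the Java compiler enforces poison handling.
class PoisonException extends Exception {

    private final Poison poison;

    PoisonException(Poison poison) {
        this.poison = poison;
    }

    // Retrieved in the handler and re-injected into the other channel ends.
    Poison getPoison() {
        return poison;
    }
}

// Filtering rule of a Poison-Filtering-Channel: only GlobalPoison poisons
// the channel; LocalPoison is silently discarded at the sub-network border.
class SketchPoisonFilteringChannel {

    private Poison poison = null; // null means "not poisoned"

    public synchronized void injectPoison(Poison p) {
        if (p instanceof GlobalPoison) {
            poison = p;
            notifyAll(); // wake any process waiting for a rendezvous
        }
        // LocalPoison: disposed of silently, channel state unchanged
    }

    public synchronized void write(Object o) throws PoisonException {
        if (poison != null) {
            throw new PoisonException(poison);
        }
        // ... normal rendezvous logic elided ...
    }
}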
Figure 6. Class Diagram of the different types of Poison [GlobalPoison and LocalPoison implementing the Poison interface]
public interface Poisonable {
    /**
     * Injects Poison into a Channel.
     * @param poison Poison to be injected.
     */
    public void injectPoison(Poison poison);
}
Listing 1: Definition of the Poisonable interface
3.3. Poisonable-Channel Interface

A JCSP-Poison channel provides three methods to the processes using it:

1. poisoning the channel;
2. sending a message;
3. receiving a message.

In the following, these three methods will be detailed.

3.3.1. Poisoning a Channel

For JCSP-Poison to work, it is important to be able to poison a channel. The question is how to do that. With poison not being a message in JCSP-Poison, and given its need to travel backwards over a channel, it was not possible to use the normal Object read() and void write(Object) methods. For this reason a new method, void injectPoison(Poison), has been added to the JCSP channel interface. This method is available to both channel ends, accepting any type of poison. Once this method is called, and it is decided that the channel should be poisoned, the channel will wake up any process currently waiting for a channel transaction, and a poison exception will be thrown. The implementation of void injectPoison(Poison) differs between the Poisonable- and Poison-Filtering-Channels: Poison-Filtering-Channels only become poisoned by a GlobalPoison. This is the only difference between these two JCSP channel types.

Poisonable Interface: As both channel ends allow for injecting poison into the channel, an interface called Poisonable has been defined which contains the definition of the injectPoison(Poison poison) method. Its source is given in listing 1.

3.3.2. Sending a Message

In JCSP, sending a message is done using the void write(Object object) method. To be able to deliver the poison exception, this method had to be enabled to throw the PoisonException.
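To make the wakeup behaviour concrete, here is a hedged sketch of a write() that blocks in the rendezvous and is released by injectPoison(). This is our illustration only, with minimal stand-in types (named SketchPoison etc. to keep the sketch self-contained); the real channel logic is more involved:

interface SketchPoison {
}

class SketchPoisonException extends Exception {
    final SketchPoison poison;
    SketchPoisonException(SketchPoison poison) { this.poison = poison; }
}

// A blocked writer waits inside wait(); injectPoison() calls notifyAll(),
// and the writer re-checks the poison field and throws on wakeup.
class SketchPoisonableChannel {

    private Object slot = null;         // single message slot of the channel
    private SketchPoison poison = null; // non-null once poisoned

    public synchronized void injectPoison(SketchPoison p) {
        poison = p;
        notifyAll();                    // release any process waiting below
    }

    public synchronized void write(Object o) throws SketchPoisonException {
        if (poison != null) {
            throw new SketchPoisonException(poison);
        }
        slot = o;
        notifyAll();                    // offer the rendezvous to a reader
        while (slot != null) {          // wait until a reader has taken it
            try {
                wait();
            } catch (InterruptedException e) {
                // ignored in this sketch
            }
            if (poison != null) {
                throw new SketchPoisonException(poison);
            }
        }
    }
}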
public interface PoisonableChannelOutput extends Poisonable {
    /**
     * This method sends a message over a channel.
     * @param object Reference to the message to send.
     * @throws PoisonException Is thrown when the channel has been poisoned.
     */
    public void write(Object object) throws PoisonException;
}
Listing 2: The PoisonableChannelOutput interface
public interface PoisonableChannelInput extends Poisonable {
    /**
     * This method reads a message from the channel.
     * @return The message read from the channel.
     * @throws PoisonException Is thrown when the channel has been poisoned.
     */
    public Object read() throws PoisonException;
}
Listing 3: The PoisonableChannelInput interface
3.3.3. Receiving a Message

To receive messages from a channel in JCSP, the method Object read() is used. Like the write method previously, this method had to be extended to be able to throw exceptions of type PoisonException.

3.3.4. Channel Ends

A channel in CSP has two ends, a writer and a reader channel end. In the following paragraphs the definition of these channel ends in JCSP-Poison is detailed.

Writer Channel End: The complete writer channel end for the Poisonable-Channel is given in listing 2. To allow the writer process to poison the channel, this interface is derived from the Poisonable interface of listing 1.

Reader Channel End: The reader channel end of a JCSP-Poison channel is a combination of the Poisonable interface and the modified read method. Its definition is shown in listing 3.

Relationship of Channels and Channel Ends: The class tree of figure 7 shows the relationship between the three interfaces defined previously. This concludes the definition of the interface exposed by the JCSP-Poison channels. Deriving the poisonable channels from the standard JCSP channels would remove the distinction between the normal channel ends and the poisonable channel ends; it is therefore not advisable. It might seem a good idea from the perspective of code reuse, but this is actually not the case, as all methods of the original JCSP channel implementation had to be modified, so no code could be reused by derivation. The JCSP channel implementation code was instead reused by means of copy and paste, so the system did not start from scratch. Only the implementations of the Poisonable-Channel and Poison-Filtering-Channel constructs were added to the system, leaving the rest of JCSP untouched.
Figure 7. Class diagram of the Poisonable- and the Poison-Filtering-Channel [the interfaces Poisonable, PoisonableChannelOutput and PoisonableChannelInput, the Guard and PoisonableAltingChannelInput classes, and the PoisonableOne2OneChannel and PoisonFilteringOne2OneChannel implementations]
3.3.5. Provided Channels

JCSP-Poison currently provides three channel implementations:

• The class PoisonableOne2OneChannel provides an implementation of the Poisonable-Channel defined earlier.
• The Poison-Filtering-Channel is provided by the class PoisonFilteringOne2OneChannel.
• The Poison-Injector-Channel is provided by the class PoisonInjectorOne2OneChannel.

The class diagram of the first two channel types is given in figure 7; we have modelled these implementations in CSPM. These models are not shown here, due to their high complexity. Both channel models have been tested for equivalence with the models shown in section 4, using FDR 2.80 [11]. There are also unchecked versions of the One2Any-, Any2One- and Any2AnyChannels available, which should work fine, since they are derived from the One2OneChannel versions. At least during the test phase we could not find any anomalies, but we admit that this is no proof of the correctness of the implementation.

3.4. An Example for Handling of Poison

To give a simple example of how JCSP-Poison is used in practice, one of the test processes is discussed. This process does not do anything useful, but shows how a process should react upon catching a PoisonException. The process of listing 4 has two PoisonableChannelOutputs: out1 and out2. While running, this process sends integers of increasing value over the two channels connected to it. If one of these channels gets poisoned, it reports this to the process by throwing an exception of type PoisonException once the process tries to access the channel. The process catches this exception and injects the received poison into both channels, using the injectPoison(Poison) method. The poison that has been transferred over the channel is retrieved from the exception using the method Poison getPoison().
public class Producer implements CSProcess {

    PoisonableChannelOutput out1;
    PoisonableChannelOutput out2;

    public Producer(PoisonableChannelOutput out1, PoisonableChannelOutput out2) {
        this.out1 = out1;
        this.out2 = out2;
    }

    public void run() {
        int i = 0;
        while (true) {
            try {
                i++;
                out1.write(new Integer(i));
                out2.write(new Integer(i));
            } catch (PoisonException e) {
                out1.injectPoison(e.getPoison());
                out2.injectPoison(e.getPoison());
                return;
            }
        }
    }
}
Listing 4: Example of a JCSP Process handling a PoisonException
It is important to mention that no process should create its own instance of a poison while handling a PoisonException. The example given shows that JCSP-Poison is easy to apply in JCSP-based programs; altering existing JCSP process networks should also be easy, without disrupting existing structures.

4. CSP Model for the Poisonable Channels

In the following, the CSP model for the poisonable channels is developed. It models the functionality of the Java implementation of the poisonable channels. It is not a direct mapping of the Java code, but only behaves like its Java counterpart. That counterpart, a direct mapping of the Java implementation of the PoisonableOne2OneChannel and PoisonFilteringOne2OneChannel classes to CSP, has been developed, but is not shown in this paper; the reason is the high complexity of that model, which results from trying to stay as close to Java as possible. The Java implementation model is equivalent to the model shown here. The section starts with defining the interface used to communicate with the channel, followed by the development of the CSP models of the Poisonable-Channel and the Poison-Filtering-Channel. We know that the model shown has a high complexity; this is caused by the fact that it has been developed to match its Java counterpart. Plans to develop a simplified model exist, but have not yet been implemented.

4.1. Interface of the Poisonable Channels

Channels in JCSP, due to their unidirectional nature, have a writer end and a reader end. This results in two different protocols, one for each end. As discussed earlier, the goal of
the poisonable channels is to allow poison to travel backwards over channels, to obtain an easy and therefore safe-to-use poison mechanism. Furthermore, the poison should not be encoded in standard messages, but should be reported by using the Java exception mechanism. The protocols used for these channels are oriented on the way Java method calls are modelled in [12]; the reason for this is that the model shown here is the counterpart of the CSP version of the Java implementation of the poisonable channels. The model given in [12] has been extended by us to support exceptions and also to standardise the names of the channels and events used to perform function calls. It is not included in this paper, but whenever this paper drifts away from the original model a short explanation is given. The protocols shown here are meant for interaction with the model given for the poisonable channels; that is the reason why these processes terminate after having sent a message. Before defining the protocols for reader and writer in detail, the data types for the message and the poison have to be defined; this is done in equation 1. This equation defines three sets: Poison, InternalPoison and Data. The set Poison defines the types of poison used outside the channel and consists of LocalPoison and GlobalPoison. The channel itself can be in three possible states: not poisoned (None), poisoned with LocalPoison, and poisoned with GlobalPoison. These states are stored in a variable of type InternalPoison, which incorporates Poison and None. The Data set defines the possible messages that can be transferred over the channels.

Poison = {LocalPoison, GlobalPoison}
InternalPoison = Poison ∪ {None}     (1)
Data = {True, False}

Because the model is meant to provide a functional model of a Java implementation, it is necessary to include the notions of object and thread in the model. This is done by appending the object-id and thread-id to the channel and event names. Equation 2 defines the sets Objects for the possible object-ids and Threads for the thread-ids. In the present model there is always only one object of a channel being created; therefore, Objects containing only one object-id is sufficient. If at a later time the number of possible objects should be increased, the set Objects has to be modified. To be able to check a channel it is necessary to have at least two processes accessing it; in JCSP, processes are represented by threads, therefore at least two thread-ids have to be available.

Objects = {0}
Threads = {1, 2}     (2)
4.1.1. Modelling Method Calls and Exceptions

In the model used in this paper, a method without parameters is called via an event methodname_start.o.t, with o being the object-id and t the thread-id used. If the method has one parameter, it is transferred over a channel named methodname_start.o.t; transmission of the parameter then also starts the method. Once the method has completed its task, it acknowledges the successful transaction using an event named methodname_ack.o.t. If the method provides a return value, this value is transferred over a channel of the same name. Exceptions are transferred over a channel named methodname_ex.o.t. An exception can be thrown by a method during the whole time the caller is waiting for the methodname_ack.o.t event; to cater for this, the reception of the exception message is modelled as an interrupt.
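Schematically (this is an illustrative sketch, not a definition belonging to the model), a call by thread t to a method named m on object o, passing a parameter x and receiving a result r, follows the pattern

    (m_start.o.t!x → m_ack.o.t?r → . . .) △ (m_ex.o.t?e → . . .)

where △ denotes the CSP interrupt operator: at any point while the caller waits for the acknowledgement, the arrival of an exception e over m_ex.o.t takes over.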
4.1.2. Inject Poison Interface

Both the reader and the writer side can inject poison into a channel. Injecting poison is a time-consuming operation, and no other operations should take place at the same time. To cater for this, it is modelled as a method call using these two events:

• inject_poison_start.o.t.p – thread t : Threads wants to inject poison p : Poison into the channel with object-id o : Objects;
• inject_poison_ack.o.t – injecting poison into the channel with object-id o : Objects by thread t : Threads completed successfully.
The resulting alphabet αINJECT_POISON(o : Objects, t : Threads) is given in equation 3. Processes cannot know whether they will receive a LocalPoison or a GlobalPoison; therefore this alphabet includes both types of poison.

αINJECT_POISON(o : Objects, t : Threads) =
    {inject_poison_start.o.t.p, inject_poison_ack.o.t | p ∈ Poison}          (3)
The snippet of equation 4 shows how a thread t : Threads injects a poison p : Poison into a Poisonable-Channel o : Objects. Developers wanting to use this functionality can simply adjust it to their needs and insert it into their own code.

inject_poison_start.o.t!p → inject_poison_ack.o.t → . . .          (4)
4.1.3. Write Interface

The write method takes the message m : Data to be transferred as a parameter (write_start.o.t!m), and successful completion of the transaction is signalled by the event write_ack.o.t. If the channel is poisoned, the poison is delivered over the channel write_ex.o.t. The CSP model of a write method call is given in equation 6.
• write_start.o.t.m – thread t : Threads wants to write a message m : Data to channel o : Objects;
• write_ack.o.t – the write operation by thread t : Threads on channel o : Objects completed successfully;
• write_ex.o.t.p – the write operation of thread t : Threads on channel o : Objects was interrupted by a poison exception, delivering poison p : Poison.
In equation 5 the alphabet αWRITE(o : Objects, t : Threads) is given. It should be incorporated by processes that want to write to poisonable channels.

αWRITE(o : Objects, t : Threads) =
    {write_start.o.t.m, write_ack.o.t, write_ex.o.t.p | m ∈ Data, p ∈ Poison}          (5)
The code snippet of equation 6 can be used when writing to poisonable channels. It is important that the user of the poisonable channels offers the ability to receive exceptions delivered over the write_ex.o.t channel.

(write_start.o.t!m → write_ack.o.t → . . .) △ (write_ex.o.t?p : Poison → . . .)          (6)
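The Java counterpart of equation 6 is simply a write wrapped in a try/catch block. In the sketch below, the channel-end variable out and the message msg are illustrative assumptions; PoisonException and its Poison getPoison() method are part of JCSP-Poison as described above.

    try {
        out.write(msg);                 // write_start ... write_ack
    } catch (PoisonException e) {
        Poison p = e.getPoison();       // the poison delivered over write_ex
        // handle the poison, e.g. start poisoning the other channel ends
    }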
4.1.4. Read Interface

The protocol to be used on the reader side is given in equation 8. As with the writer protocol, the reader protocol can be interrupted by an exception if the channel is poisoned.
• read_start.o.t – thread t : Threads wants to read a message from the Poisonable-Channel o : Objects;
• read_ack.o.t.m – the read operation of thread t : Threads on channel o : Objects completed successfully, returning message m : Data;
• read_ex.o.t.p – the read operation of thread t : Threads on channel o : Objects caused a poison exception to be thrown, delivering poison p : Poison.
The alphabet of the READ process, given in equation 7, contains all events that are possible when reading from a Poisonable-Channel o : Objects with a thread t : Threads.

αREAD(o : Objects, t : Threads) =
    {read_start.o.t, read_ack.o.t.m, read_ex.o.t.p | m ∈ Data, p ∈ Poison}          (7)
Reading from a Poisonable-Channel requires similar precautions to writing to it. The code snippet given in equation 8 shows how to do it correctly.

(read_start.o.t → read_ack.o.t?m : Data → . . .) △ (read_ex.o.t?p : Poison → . . .)          (8)
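On the Java side the reader follows the same pattern. The sketch below also illustrates the rule stated earlier that a process handling a PoisonException must propagate the poison obtained via getPoison() rather than create its own instance. The channel-end names in and out are illustrative assumptions.

    try {
        Object m = in.read();           // read_start ... read_ack
        // process the message m
    } catch (PoisonException e) {
        Poison p = e.getPoison();       // the poison delivered over read_ex
        out.injectPoison(p);            // forward the received poison object,
                                        // never a freshly created one
        // release resources and terminate
    }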
4.1.5. The Reader and Writer Channel Ends

Traditionally the writer end of a channel is considered to be the left side, while the reader end is the right side. In our model a Poisonable-Channel allows poison to be injected at both its writer and its reader end. This has to be reflected in the interface of the poisonable channels. The interface for the left side of a Poisonable-Channel is given in equation 9, while the interface for the right side is given in equation 10.

αLEFT(o : Objects, t : Threads) = αWRITE(o, t) ∪ αINJECT_POISON(o, t)          (9)

αRIGHT(o : Objects, t : Threads) = αREAD(o, t) ∪ αINJECT_POISON(o, t)          (10)

4.1.6. The Complete Poisonable Channels Interface
The complete interface, valid for all poisonable channels, αPC(o, t1, t2), is obtained by combining αLEFT(o, t) and αRIGHT(o, t). The resulting interface αPC is given in equation 11 and expanded in equation 12. Figure 8 gives a graphical representation of the interface that poisonable channels, represented by the process PC(...), expose.

Figure 8. Interface for all Poisonable Channels
αPC(o : Objects, t1 : Threads, t2 : Threads) = αLEFT(o, t1) ∪ αRIGHT(o, t2)          (11)

αPC(o, t1, t2) =
    {write_start.o.t1.d, write_ack.o.t1, write_ex.o.t1.p,
     inject_poison_start.o.t1.p, inject_poison_ack.o.t1,
     read_start.o.t2, read_ack.o.t2.d, read_ex.o.t2.p,
     inject_poison_start.o.t2.p, inject_poison_ack.o.t2
     | d ∈ Data, p ∈ Poison}          (12)
4.2. Test-Bench for Poisonable Channels

In the following, the desired behaviour of the poisonable channels is defined. This is necessary to test whether the models developed later comply with our expectations. For this purpose the test-bench given in equations 13 to 19 has been created. The test-bench covers the different scenarios the Poisonable-Channel and the Poison-Filtering-Channel should cater for. In section 4.5 the channel models are verified against this test-bench.

4.2.1. Sending a Message

The poisonable channels should operate just like normal channels. Equation 13 defines the process TB_MESSAGE_TRANSFER(...), which specifies a normal channel transaction using the interface specified above.
TB_MESSAGE_TRANSFER(o : Objects, t1 : Threads, t2 : Threads, m : Data) =
    ( TB_WRITE_MESSAGE(o, t1, m)
      |[ αTB_WRITE_MESSAGE(o, t1, m) | αTB_READ_MESSAGE(o, t2) ]|
      TB_READ_MESSAGE(o, t2)
    ) \ {transmit.o.d | d ∈ Data}          (13)
TB_WRITE_MESSAGE(o : Objects, t : Threads, m : Data) =
    write_start.o.t.m → transmit.o!m → write_ack.o.t → TB_WRITE_MESSAGE(o, t, m)

αTB_WRITE_MESSAGE(o : Objects, t : Threads, m : Data) =
    {write_start.o.t.m, write_ack.o.t, transmit.o.d | d ∈ Data}          (14)

The TB_WRITE_MESSAGE(...) process of equation 14 performs a successful write transaction on a channel, complying with the interface specification of the poisonable channels.
TB_READ_MESSAGE(o : Objects, t : Threads) =
    read_start.o.t → transmit.o?x → read_ack.o.t.x → TB_READ_MESSAGE(o, t)

αTB_READ_MESSAGE(o : Objects, t : Threads) =
    {read_start.o.t, read_ack.o.t.d, transmit.o.d | d ∈ Data}          (15)

Equation 15 models the successful reception of messages, according to the interface specification of the poisonable channels.
4.2.2. Sending a Message, Reader Injects Poison

The process of equation 16 simulates the case that the writer end of the channel tries to write a message while the reader end injects poison into the channel. In this case the writer should receive an exception.

TB_READER_INJECTS_POISON(o : Objects, t1 : Threads, t2 : Threads, p : Poison) =
    write_start.o.t1.True → inject_poison_start.o.t2.p →
    inject_poison_ack.o.t2 → write_ex.o.t1.p → STOP          (16)

4.2.3. Trying to Receive a Message, Writer Injects Poison

In equation 17 the roles are reversed: now the reader end tries to receive a message, while the writer end injects poison into the channel. The reader should receive an exception.

TB_WRITER_INJECTS_POISON(o : Objects, t1 : Threads, t2 : Threads, p : Poison) =
    read_start.o.t2 → inject_poison_start.o.t1.p →
    inject_poison_ack.o.t1 → read_ex.o.t2.p → STOP          (17)

4.2.4. Access to a Poisoned Channel

The process TB_POISONED_CHANNEL(...), given in equation 18, first poisons the channel and then lets both the reader and the writer end access it. It is expected that both receive an exception.
TB_POISONED_CHANNEL(o : Objects, t1 : Threads, t2 : Threads, p : Poison) =
    inject_poison_start.o.t1.p → inject_poison_ack.o.t1 →
    read_start.o.t2 → read_ex.o.t2.p →
    write_start.o.t1.True → write_ex.o.t1.p → STOP          (18)
4.2.5. A LocalPoisoned Channel is Used

TB_LOCAL_POISONED_CHANNEL(...) of equation 19 demonstrates that a Poison-Filtering-Channel really filters out a LocalPoison and operates normally afterwards. This is done by first injecting LocalPoison into the channel and then performing normal channel operations using the TB_MESSAGE_TRANSFER(...) process.

TB_LOCAL_POISONED_CHANNEL(o : Objects, t1 : Threads, t2 : Threads, m : Data) =
    inject_poison_start.o.t1.LocalPoison → inject_poison_ack.o.t1 →
    TB_MESSAGE_TRANSFER(o, t1, t2, m)          (19)
4.3. The Poisonable-Channel

The Poisonable-Channel's behaviour depends upon the state it is in. It can be in any of these three states:

1. Normal: In this state it operates as a normal CSP channel. This is the default state after creation of the channel. When injected with LocalPoison the channel changes into the LPoison state; injecting the channel with GlobalPoison changes the state to GPoison.
2. LPoison: The channel changes into this state after it is injected with LocalPoison in the Normal state. Every invocation of either the read or the write function is answered by an exception carrying LocalPoison. The channel stays in this state until it is injected with GlobalPoison, upon which it changes into the GPoison state.
3. GPoison: In this state the channel answers any read or write request by throwing an exception carrying GlobalPoison. There is no state change out of this state.

4.3.1. The POISON(o : Objects) Process

The POISON(o : Objects) process is responsible for storing the state of a poisonable channel. With poison being injectable from both sides of a channel, both sides need access to this process. The set PoisonAccess of equation 20 defines the possible accessors of the process.

PoisonAccess = {left, right}          (20)
The state of the channel has to be consistent for both sides of the channel: once one side of the channel receives a poison, both channel ends deliver it to their users. To keep the channel state consistent, only one side of the channel is allowed to change the poison state at a time, following the CREW rules. Read access must therefore be blocked during a change of the channel state. In the POISON(o : Objects) process this is achieved by going into a locked state, using the poison_lock.o.a (a : PoisonAccess) events, before altering the state of the channel. Once the state change is done, the process is unlocked with the poison_unlock.o.a events. Equations 21 - 25 give the definition of the POISON(o : Objects) process. Figure 9 illustrates the interface the POISON(o : Objects) process offers to its environment.

Figure 9. Interface of the POISON(o : Objects) process
POISON(o : Objects) =
    (□ a : PoisonAccess • poison_lock.o.a → POISON_LOCKED(o, a, POISON(o)))
    □
    (□ a : PoisonAccess • poison_get.o.a!None → POISON(o))          (21)

LPOISON(o : Objects) =
    (□ a : PoisonAccess • poison_lock.o.a → POISON_LOCKED(o, a, LPOISON(o)))
    □
    (□ a : PoisonAccess • poison_get.o.a!LocalPoison → LPOISON(o))          (22)

GPOISON(o : Objects) =
    (□ a : PoisonAccess • poison_lock.o.a → POISON_LOCKED(o, a, GPOISON(o)))
    □
    (□ a : PoisonAccess • poison_get.o.a!GlobalPoison → GPOISON(o))          (23)

POISON_LOCKED(o : Objects, a : PoisonAccess, callingProcess) =
    poison_set.o.a.None → poison_unlock.o.a → callingProcess
    □ poison_set.o.a.LocalPoison → poison_unlock.o.a → LPOISON(o)
    □ poison_set.o.a.GlobalPoison → poison_unlock.o.a → GPOISON(o)
    □ poison_unlock.o.a → callingProcess          (24)

αPOISON(o : Objects) =
    {poison_get.o.a.p, poison_set.o.a.p, poison_lock.o.a, poison_unlock.o.a
     | a ∈ PoisonAccess, p ∈ InternalPoison}          (25)
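As a worked example of how the lock keeps the state consistent: a left-side injection of GlobalPoison produces the trace ⟨poison_lock.o.left, poison_set.o.left.GlobalPoison, poison_unlock.o.left⟩, after which the process behaves as GPOISON(o) and both sides can only ever observe poison_get.o.a.GlobalPoison. While the lock is held, all poison_get events are refused, so neither channel end can read a half-updated state.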
4.3.2. The POISON_VALVE(...) Process

The previous section detailed the states of the poisonable channels and their handling. So far it is possible to handle poison correctly while both channel ends are idle. But what if one end is trying to send or receive a message? In that case the waiting channel end should be woken up and throw an exception with the received poison at its user. This requires the channel end injected with poison to send a wake-up call to the waiting channel end; this call is only needed when the other channel end is actually waiting. The channel ends therefore need to be able to filter out such events when they are not appropriate. This task is performed by the POISON_VALVE(...) process, which is detailed in equations 27 to 30; a graphical representation is given in figure 10. The valve is controlled over a channel carrying messages of type ValveControl, defined in equation 26.

Figure 10. Interface of the POISON_VALVE(...) process

ValveControl = {open, close}          (26)

POISON_VALVE(in, out, ctrl) = POISON_VALVE_CLOSED(in, out, ctrl)          (27)
POISON_VALVE_OPEN(in, out, ctrl) =
    ctrl.close → POISON_VALVE_CLOSED(in, out, ctrl)
    □ in?m → ( out!m → POISON_VALVE_OPEN(in, out, ctrl)
               □ ctrl.close → POISON_VALVE_CLOSED(in, out, ctrl) )
    □ ctrl.open → POISON_VALVE_OPEN(in, out, ctrl)          (28)

POISON_VALVE_CLOSED(in, out, ctrl) =
    in?m → POISON_VALVE_CLOSED(in, out, ctrl)
    □ ctrl.open → POISON_VALVE_OPEN(in, out, ctrl)
    □ ctrl.close → POISON_VALVE_CLOSED(in, out, ctrl)          (29)

αPOISON_VALVE(in, out, ctrl) = {in.p, out.p, ctrl.vc | p ∈ Poison, vc ∈ ValveControl}          (30)
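As a worked example of the two states: if the reader end is waiting inside a channel transaction, its valve is open and a poison message passes straight through, giving the trace ⟨ctrl.open, in.LocalPoison, out.LocalPoison⟩. If nobody is waiting, the valve is closed and the same wake-up call is consumed silently, producing only ⟨in.LocalPoison⟩. The poisoning side can therefore always send its wake-up call without risking deadlock against an idle channel end.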
4.3.3. The PCM_INJECT_POISON(...) Process

Both channel ends offer their environment the ability to inject poison into the channel. As they behave similarly in this respect, both use the PCM_INJECT_POISON(...) process for this task. The prefix PCM indicates that this process is part of the Poisonable-Channel Model. The key task of this process is to pass the received poison to the POISON(...) process and to signal the other channel end about the arrival of new poison, by sending it over the channel given in the parameter pc. The definition of the process is given in equations 31 and 32, while its interface is illustrated in figure 11.

Figure 11. Interface of the PCM_INJECT_POISON process
PCM_INJECT_POISON(o : Objects, t : Threads, a : PoisonAccess, pc) =
    inject_poison_start.o.t?p : Poison → poison_lock.o.a →
    poison_set.o.a!p → poison_unlock.o.a → pc!p →
    inject_poison_ack.o.t → SKIP          (31)

αPCM_INJECT_POISON(o : Objects, t : Threads, a : PoisonAccess, pc) =
    {inject_poison_start.o.t.p, inject_poison_ack.o.t, poison_lock.o.a,
     poison_unlock.o.a, poison_set.o.a.p, pc.p | p ∈ Poison}          (32)
4.3.4. The PCM_TRANSMIT(...) Process

A channel transaction requires a transmitter and a receiver. The PCM_TRANSMIT(...) process, defined in equations 33 and 34, complies with the write interface defined earlier; its complete interface is shown in figure 12. Before the PCM_TRANSMIT(...) process sends the message passed to it over the write_start.o.t channel, it checks whether the channel is poisoned. If that is the case, the poison is delivered to the caller over the write_ex.o.t channel. If the reader channel end is poisoned while the write operation on the channel transmit.o is still pending, the writer channel end has to be alerted. This is done by offering an external choice between writing on the transmit.o channel and receiving poison from the rp2.o channel (rp stands for right-side poison). The rp2.o channel is the output of a POISON_VALVE(...) process, discussed earlier. The status of the valve is controlled using the rp_ctrl.o channel (right-poison control).

Figure 12. Interface of the PCM_TRANSMIT process
PCM_TRANSMIT(o : Objects, t : Threads) =
    write_start.o.t?m : Data → rp_ctrl.o.open → poison_get.o.left?p →
    ( if p = None then
          ( transmit.o!m → rp_ctrl.o.close → write_ack.o.t → SKIP
            □ rp2.o?p → rp_ctrl.o.close → write_ex.o.t!p → SKIP )
      else
          rp_ctrl.o.close → write_ex.o.t!p → SKIP )          (33)

αPCM_TRANSMIT(o : Objects, t : Threads) =
    {write_start.o.t.m, write_ack.o.t, write_ex.o.t.p,
     rp_ctrl.o.open, rp_ctrl.o.close, rp2.o.p,
     poison_get.o.left.ip, transmit.o.m
     | m ∈ Data, ip ∈ InternalPoison, p ∈ Poison}          (34)
4.3.5. Preventing Concurrent Writing and Poisoning

The left channel side offers its user two possible operations: transmitting a message or injecting poison into the channel. These two operations cannot happen concurrently. To represent this, the process PCM_LEFT_CHOOSER(o, t) has been created as an external choice between the PCM_TRANSMIT(o, t) and PCM_INJECT_POISON(...) processes; its definition is given in equations 35 and 36. A compositional block diagram of the resulting process is given in figure 13.

Figure 13. The PCM_LEFT_CHOOSER process composition
PCM_LEFT_CHOOSER(o, t) =
    (PCM_TRANSMIT(o, t) □ PCM_INJECT_POISON(o, t, left, lp1.o)) ;
    PCM_LEFT_CHOOSER(o, t)          (35)

αPCM_LEFT_CHOOSER(o : Objects, t : Threads) =
    αPCM_TRANSMIT(o, t) ∪ αPCM_INJECT_POISON(o, t, left, lp1.o)          (36)
4.3.6. Writer Channel End

To complete the writer channel end, the POISON_VALVE(...) process required by the PCM_TRANSMIT(...) process has to be included. The valve cannot be brought in earlier, because it has to run constantly, whereas the external choice selects whether the PCM_TRANSMIT(...) or the PCM_INJECT_POISON(...) process is executed. The PCM_LEFT(...) process of equations 37 and 38 is the result of these considerations; its structure is also shown in figure 14.

PCM_LEFT(o : Objects, t : Threads) =
    PCM_LEFT_CHOOSER(o, t)
    |[ αPCM_LEFT_CHOOSER(o, t) | αPOISON_VALVE(rp1.o, rp2.o, rp_ctrl.o) ]|
    POISON_VALVE(rp1.o, rp2.o, rp_ctrl.o)          (37)

αPCM_LEFT(o : Objects, t : Threads) =
    αPCM_LEFT_CHOOSER(o, t) ∪ αPOISON_VALVE(rp1.o, rp2.o, rp_ctrl.o)          (38)

The writer channel end of the Poisonable-Channel is now completely defined.
Figure 14. Interface of the PCM_LEFT process
4.3.7. The PCM_RECEIVE(...) Process

The PCM_RECEIVE(...) process, defined in equations 39 and 40, allows a message to be read from the channel. Once started with the read_start.o.t event, it opens the valve, enabling it to receive incoming poison. This is followed by checking whether the channel is already in a poisoned state. If that is the case, the valve is closed again and the poison is delivered to the calling process over the read_ex.o.t channel, before the process terminates. If the channel is not poisoned, the process tries to read a message from the transmit.o channel; while waiting for the message to arrive, it also waits for potential poison to arrive over the lp2.o channel. Once the transmit.o channel has delivered a message, the valve is closed and the process delivers the received message over the read_ack.o.t channel, then terminates. If instead poison arrives over the lp2.o channel, the process closes the valve and delivers the poison over the read_ex.o.t channel, then terminates. The interface of the process is shown in figure 15.
Figure 15. Interface of the PCM_RECEIVE process
PCM_RECEIVE(o : Objects, t : Threads) =
    read_start.o.t → lp_ctrl.o.open → poison_get.o.right?p →
    ( if p = None then
          ( transmit.o?m : Data → lp_ctrl.o.close → read_ack.o.t!m → SKIP
            □ lp2.o?p → lp_ctrl.o.close → read_ex.o.t!p → SKIP )
      else
          lp_ctrl.o.close → read_ex.o.t!p → SKIP )          (39)

αPCM_RECEIVE(o : Objects, t : Threads) =
    {read_start.o.t, read_ack.o.t.m, read_ex.o.t.p,
     lp_ctrl.o.open, lp_ctrl.o.close, lp2.o.p,
     poison_get.o.right.ip, transmit.o.m
     | m ∈ Data, ip ∈ InternalPoison, p ∈ Poison}          (40)
4.3.8. Preventing Concurrent Reading and Poisoning

The PCM_RIGHT_CHOOSER(...) process prevents the reading process from simultaneously performing a channel transaction and injecting poison into the channel. The definition of the process is given in equations 41 and 42; a graphical representation is given in figure 16.

Figure 16. Interface of the PCM_RIGHT_CHOOSER process
PCM_RIGHT_CHOOSER(o, t) =
    (PCM_RECEIVE(o, t) □ PCM_INJECT_POISON(o, t, right, rp1.o)) ;
    PCM_RIGHT_CHOOSER(o, t)          (41)
αPCM_RIGHT_CHOOSER(o : Objects, t : Threads) =
    αPCM_RECEIVE(o, t) ∪ αPCM_INJECT_POISON(o, t, right, rp1.o)          (42)
4.3.9. Reader Channel End

Like the writer channel end, the reader channel end needs to incorporate a POISON_VALVE(...) process for its PCM_RECEIVE(...) process. This is done in the PCM_RIGHT(...) process, which is a combination of the PCM_RIGHT_CHOOSER(...) and POISON_VALVE(...) processes. The CSP model of the PCM_RIGHT(...) process is given in equations 43 and 44. Figure 17 illustrates the interconnections of the combined processes.

Figure 17. Interface of the PCM_RIGHT process
PCM_RIGHT(o : Objects, t : Threads) =
    PCM_RIGHT_CHOOSER(o, t)
    |[ αPCM_RIGHT_CHOOSER(o, t) | αPOISON_VALVE(lp1.o, lp2.o, lp_ctrl.o) ]|
    POISON_VALVE(lp1.o, lp2.o, lp_ctrl.o)          (43)

αPCM_RIGHT(o : Objects, t : Threads) =
    αPCM_RIGHT_CHOOSER(o, t) ∪ αPOISON_VALVE(lp1.o, lp2.o, lp_ctrl.o)          (44)
4.3.10. The POISONABLE_CHANNEL(...) Process

The Poisonable-Channel is the combination of the writer channel end (left side) and the reader channel end (right side), represented by the PCM_LEFT(...) and PCM_RIGHT(...) processes. To avoid exposing internal events, the alphabet of the POISONABLE_CHANNEL(...) process is defined to be identical to the interface αPC of equation 11. The CSP model of the channel is given in equation 45, its alphabet in equation 46. The structure of the Poisonable-Channel is represented graphically in figure 18.
POISONABLE_CHANNEL(o : Objects, t1 : Threads, t2 : Threads) =
    ( PCM_LEFT(o, t1)
      |[ αPCM_LEFT(o, t1) | αPCM_RIGHT(o, t2) ]|
      PCM_RIGHT(o, t2)
    )
    |[ αPCM_LEFT(o, t1) ∪ αPCM_RIGHT(o, t2) | αPOISON(o) ]|
    POISON(o)          (45)

αPOISONABLE_CHANNEL(o : Objects, t1 : Threads, t2 : Threads) = αPC(o, t1, t2)          (46)
4.4. Poison-Filtering-Channel

The Poison-Filtering-Channel is used to segment a process network into two parts, which can then be poisoned independently using LocalPoison. This is, for instance, desired behaviour in a design with a fixed and a reconfigurable part, as is the case on an SDR platform: there the signal-processing module needs to be exchanged at runtime without affecting the rest of the platform. But even a segmented design should be able to terminate completely, without regard to the segments. This was the reason for introducing GlobalPoison, which is distributed into every corner of a process network.

4.4.1. The PFCM_INJECT_POISON(...) Process
The previous paragraph gave an indirect description of the behaviour of the Poison-Filtering-Channel: in short, it should filter out any LocalPoison injected into it, but let GlobalPoison pass. The injection process therefore needs to differentiate between LocalPoison, which is ignored, and GlobalPoison, which poisons the channel. This changes the PCM_INJECT_POISON(...) process into the PFCM_INJECT_POISON(...) process given in equations 47 and 48, where the prefix PFCM indicates that the process is part of the Poison-Filtering-Channel Model.
PFCM_INJECT_POISON(o : Objects, t : Threads, a : PoisonAccess, pc) =
    inject_poison_start.o.t?p →
    ( if p = GlobalPoison then
          poison_lock.o.a → poison_set.o.a!p → poison_unlock.o.a →
          pc!p → inject_poison_ack.o.t → SKIP
      else
          inject_poison_ack.o.t → SKIP )          (47)

αPFCM_INJECT_POISON(o : Objects, t : Threads, a : PoisonAccess, pc) =
    {inject_poison_start.o.t.p, inject_poison_ack.o.t, poison_lock.o.a,
     poison_unlock.o.a, poison_set.o.a.p, pc.p | p ∈ Poison}          (48)
Figure 18. Interface of the PCM_CHANNEL process
4.4.2. Writer Channel End

The writer channel end of the Poison-Filtering-Channel is represented by the process PFCM_LEFT(...) of equations 49 and 50. This process offers two kinds of operation: transmitting a message to the other channel end or injecting poison into the channel.

PFCM_LEFT(o : Objects, t : Threads) =
    PFCM_LEFT_CHOOSER(o, t)
    |[ αPFCM_LEFT_CHOOSER(o, t) | αPOISON_VALVE(rp1.o, rp2.o, rp_ctrl.o) ]|
    POISON_VALVE(rp1.o, rp2.o, rp_ctrl.o)          (49)

αPFCM_LEFT(o : Objects, t : Threads) =
    αPFCM_LEFT_CHOOSER(o, t) ∪ αPOISON_VALVE(rp1.o, rp2.o, rp_ctrl.o)          (50)
The definition of the PFCM_LEFT_CHOOSER(...) process is given in equations 51 and 52.

PFCM_LEFT_CHOOSER(o : Objects, t : Threads) =
    (PCM_TRANSMIT(o, t) □ PFCM_INJECT_POISON(o, t, left, lp1.o)) ;
    PFCM_LEFT_CHOOSER(o, t)          (51)

αPFCM_LEFT_CHOOSER(o : Objects, t : Threads) =
    αPCM_TRANSMIT(o, t) ∪ αPFCM_INJECT_POISON(o, t, left, lp1.o)          (52)
4.4.3. Reader Channel End

The reader channel end offers a choice between reading from the channel and injecting poison into it. Its CSP model is given in equations 53 and 54.

PFCM_RIGHT(o : Objects, t : Threads) =
    PFCM_RIGHT_CHOOSER(o, t)
    |[ αPFCM_RIGHT_CHOOSER(o, t) | αPOISON_VALVE(lp1.o, lp2.o, lp_ctrl.o) ]|
    POISON_VALVE(lp1.o, lp2.o, lp_ctrl.o)          (53)

αPFCM_RIGHT(o : Objects, t : Threads) =
    αPFCM_RIGHT_CHOOSER(o, t) ∪ αPOISON_VALVE(lp1.o, lp2.o, lp_ctrl.o)          (54)
The definition of the PFCM_RIGHT_CHOOSER(...) process is given in equations 55 and 56.

PFCM_RIGHT_CHOOSER(o : Objects, t : Threads) =
    (PCM_RECEIVE(o, t) □ PFCM_INJECT_POISON(o, t, right, rp1.o)) ;
    PFCM_RIGHT_CHOOSER(o, t)          (55)

αPFCM_RIGHT_CHOOSER(o : Objects, t : Threads) =
    αPCM_RECEIVE(o, t) ∪ αPFCM_INJECT_POISON(o, t, right, rp1.o)          (56)
4.4.4. The POISON_FILTERING_CHANNEL(...) Process

The Poison-Filtering-Channel consists of the writer channel end (left side) and the reader channel end (right side), represented by the PFCM_LEFT(...) and PFCM_RIGHT(...) processes. The CSP model of the channel is given in equation 57, its alphabet in equation 58.
POISON_FILTERING_CHANNEL(o : Objects, t1 : Threads, t2 : Threads) =
    ( PFCM_LEFT(o, t1)
      |[ αPFCM_LEFT(o, t1) | αPFCM_RIGHT(o, t2) ]|
      PFCM_RIGHT(o, t2)
    )
    |[ αPFCM_LEFT(o, t1) ∪ αPFCM_RIGHT(o, t2) | αPOISON(o) ]|
    POISON(o)          (57)

αPOISON_FILTERING_CHANNEL(o : Objects, t1 : Threads, t2 : Threads) = αPC(o, t1, t2)          (58)
4.5. Applying the Test-Bench

After developing the models of the Poisonable-Channel and the Poison-Filtering-Channel, it is necessary to check whether they comply with the test-bench defined in section 4.2.

4.5.1. Testing the Poisonable-Channel

Equation 59 lists the refinement checks of the Poisonable-Channel against the test-bench (⊑T denotes trace refinement: every trace of the right-hand process must be a trace of the left-hand one). All checks except the one against the TB_LOCAL_POISONED_CHANNEL(...) process should succeed. Figure 19 shows a screen-shot of FDR with the results of the checks.

assert POISONABLE_CHANNEL(0, 1, 2) ⊑T TB_MESSAGE_TRANSFER(0, 1, 2, True)
assert POISONABLE_CHANNEL(0, 1, 2) ⊑T TB_READER_INJECTS_POISON(0, 1, 2, LocalPoison)
assert POISONABLE_CHANNEL(0, 1, 2) ⊑T TB_READER_INJECTS_POISON(0, 1, 2, GlobalPoison)
assert POISONABLE_CHANNEL(0, 1, 2) ⊑T TB_WRITER_INJECTS_POISON(0, 1, 2, LocalPoison)
assert POISONABLE_CHANNEL(0, 1, 2) ⊑T TB_WRITER_INJECTS_POISON(0, 1, 2, GlobalPoison)
assert POISONABLE_CHANNEL(0, 1, 2) ⊑T TB_POISONED_CHANNEL(0, 1, 2, LocalPoison)
assert POISONABLE_CHANNEL(0, 1, 2) ⊑T TB_POISONED_CHANNEL(0, 1, 2, GlobalPoison)
assert POISONABLE_CHANNEL(0, 1, 2) ⊑T TB_LOCAL_POISONED_CHANNEL(0, 1, 2, True)          (59)
Figure 19. FDR Screen Shot showing the results of the refinement operations
4.5.2. Testing the Poison-Filtering-Channel

Testing the Poison-Filtering-Channel against the test-bench must result in the checks involving LocalPoison failing, because the Poison-Filtering-Channel is not transparent to LocalPoison, which is of course its purpose. Instead, the check against the TB_LOCAL_POISONED_CHANNEL(...) process must succeed, showing that the Poison-Filtering-Channel works properly. Checking was done using FDR; a screen-shot showing the results is given in figure 19.
assert POISON_FILTERING_CHANNEL(0, 1, 2) ⊑T TB_MESSAGE_TRANSFER(0, 1, 2, True)
assert POISON_FILTERING_CHANNEL(0, 1, 2) ⊑T TB_READER_INJECTS_POISON(0, 1, 2, LocalPoison)
assert POISON_FILTERING_CHANNEL(0, 1, 2) ⊑T TB_READER_INJECTS_POISON(0, 1, 2, GlobalPoison)
assert POISON_FILTERING_CHANNEL(0, 1, 2) ⊑T TB_WRITER_INJECTS_POISON(0, 1, 2, LocalPoison)
assert POISON_FILTERING_CHANNEL(0, 1, 2) ⊑T TB_WRITER_INJECTS_POISON(0, 1, 2, GlobalPoison)
assert POISON_FILTERING_CHANNEL(0, 1, 2) ⊑T TB_POISONED_CHANNEL(0, 1, 2, LocalPoison)
assert POISON_FILTERING_CHANNEL(0, 1, 2) ⊑T TB_POISONED_CHANNEL(0, 1, 2, GlobalPoison)
assert POISON_FILTERING_CHANNEL(0, 1, 2) ⊑T TB_LOCAL_POISONED_CHANNEL(0, 1, 2, True)          (60)
Figure 20. Process network of the Streamer

Figure 21. Termination Domains created by a Poison-Filtering-Channel
5. Applications / Examples
The example shown in this section was the initial motivation to look into the area of terminating sub-networks. The project we were working on was the development of a small server application delivering signal data streams to multiple clients over the network, similar to the SHOUTcast [13] or icecast [14] servers used for mp3 streaming. The functioning of this server was simple: a client connects to a TCP socket provided by the service, and a small Protocol Handler process is spun off to handle that client. The streaming of the data is handled by the Server Back-end process. Each Protocol Handler is connected to the Server Back-end using an Any2OneChannel for data exchange. The resulting structure of the process network for two simultaneously connected clients is shown in figure 20.

Spinning off the Protocol Handler processes was no problem, thanks to the ProcessManager class in JCSP. When a client disconnected from the server, the Protocol Handler process for this client detected this and informed the Server Back-end process, so that the client could be removed from all streams. After this the Protocol Handler becomes obsolete and should terminate; naturally the Server Back-end should stay online, to serve current and future clients. The solution at the time was to use the deprecated void ProcessManager.stop() method to terminate the Protocol Handler process, despite the JCSP documentation warning about possible deadlocks. It worked for the streamer project, but it was clear that a clean and safe solution had to be found. The result of this effort is presented in this paper.

So how can this termination be performed in JCSP-Poison, without using the deprecated void ProcessManager.stop() method? It is actually quite simple. The Any2OneChannel used to connect the Protocol Handler and the Server Back-end has to be made a Poison-Filtering-Channel. Upon detecting a client disconnect, the Protocol Handler process starts to poison its channel ends using LocalPoison. If instead the server back-end is terminated by the user, it starts to inject GlobalPoison, which then also affects the protocol handler process networks. The resulting system is segmented into multiple termination domains, one for each protocol handler process network plus one for the server back-end; these of course only take effect when LocalPoison is injected. An additional termination domain exists when GlobalPoison is injected: then the complete network terminates. The different termination domains (t-domains) are illustrated in figure 21.
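In code, the spin-off and the poisoning on disconnect described above might look as follows. This is a sketch only: ProcessManager is the standard JCSP class, but the ProtocolHandler class, the toBackEnd channel-end name and the LocalPoison constructor are illustrative assumptions rather than code from the streamer project.

    // server loop: one Protocol Handler per accepted client
    PoisonableChannelOutput toBackEnd = ...; // writer end of the Poison-Filtering-Channel
    new ProcessManager(new ProtocolHandler(clientSocket, toBackEnd)).start();

    // inside the Protocol Handler, once the client disconnect is detected:
    toBackEnd.injectPoison(new LocalPoison()); // stays inside this t-domain; only
                                               // GlobalPoison from the back-end
                                               // crosses it and kills everything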
6. Conclusions

This paper has presented a channel model supporting an easy way to terminate process networks, and has shown how poisoning can be enhanced to support partial process network termination. The technique has further applications in the area of partial process network reconfiguration, where one process network is replaced by another; this is, for instance, necessary to create a CSP-based Software Defined Radio.

The technique shown in this paper resolves some of the problems of the original graceful-termination technique proposed by P.H. Welch [8]. Most importantly, processes no longer need to spin off black-hole processes to swallow arriving messages. The possibility of poisoning data-source processes, which only have outgoing channels, is a further advantage of the technique demonstrated here. Other applications include the KCSP [15] environment, which currently provides only complete process network termination. Unfortunately, that environment does not support the generation of exceptions, so the developer has to check the return value of each channel operation; yet in the kernel environment it is even more important to perform proper clean-up operations than it is in user-mode applications.
7. Further Work

Obviously, there are many channel types not handled in this paper. To name just one open point: there is still no model of how to deal with poison in the call channels provided by JCSP. Further investigations also include modelling the poisonable Any2One, One2Any and Any2Any channel types, to see whether they behave as expected.

The model shown in this paper was used to verify the correctness of the JCSP-Poison implementation model (which was not shown). This makes the model presented here more complex than necessary, as it had to behave externally like a Java class. The next iteration of the model should be completely independent of JCSP and therefore become less complex. This should be done before modelling the other channel types available in JCSP.

In C++CSP Networked [16], channel ends can be restricted in their ability to poison a channel. Depending on the actual application of this feature, it is similar to using a Poison-Filtering-Channel and injecting only LocalPoison in the sub-network that should not affect the other networks. From a security standpoint it is more powerful, as it allows a sub-network to be prevented from poisoning its environment by accidentally using GlobalPoison instead of LocalPoison. Therefore, the ability to prevent a channel end from injecting poison into a channel should be added to the poisonable channels available in JCSP-Poison.

Partial process network reconfiguration in JCSP is also an area to look into; some ideas have already been drafted, but a proof-of-concept implementation is still missing. A JCSP implementation of mobile processes [17], which are already available for occam [18,19], will have to broadcast a message inside a sub-network. JCSP-Poison had similar problems when trying to poison a sub-network, so an extension of JCSP-Poison should be able to provide the desired sub-network broadcast functionality. Developing a concept to bring general message broadcasting, without using the barrier technique, to JCSP is a possible area in which to improve JCSP-Poison.
Acknowledgements

The authors would like to thank P.H. Welch for providing the JCSP source code. We also thank the reviewers for their careful reading.
References

[1] C.A.R. Hoare. Communicating sequential processes. Communications of the ACM, 21(8):666–677, August 1978.
[2] C.A.R. Hoare. Communicating Sequential Processes. Prentice-Hall, 1985.
[3] Joseph Mitola. The software radio architecture. IEEE Communications Magazine, (5):26–38, May 1995.
[4] Joseph Mitola. Software radio architecture: A mathematical perspective. IEEE Journal on Selected Areas in Communications, 17(4):514–538, April 1999.
[5] Oliver Faust, Bernhard Sputh, and David Endler. Chaining Communications Algorithms with CSP. In Ian R. East, David Duce, Mark Green, Jeremy M. R. Martin, and Peter H. Welch, editors, Communicating Process Architectures 2004, pages 325–338, 2004.
[6] P.H. Welch and P.D. Austin. JCSP home page. URL: http://www.cs.kent.ac.uk/projects/ofa/jcsp/.
[7] Cay S. Horstmann and Gary Cornell. Core Java 2: Volume II – Advanced Features, volume 2 of The Sun Microsystems Press Java Series. Sun Microsystems Press / Prentice Hall, Palo Alto, California, 2002. ISBN 0-13-092738-4.
[8] Peter H. Welch. Graceful termination – graceful resetting. In André W. P. Bakkers, editor, OUG-10: Applying Transputer Based Parallel Machines, pages 310–317, 1989.
[9] P.H. Welch. Concurrency, exceptions and poison. Mailing list, September 2001. URL: http://www.wotug.org/lists/occam/1076.shtml.
[10] Neil C. Brown and Peter H. Welch. An Introduction to the Kent C++CSP Library. In Jan F. Broenink and Gerald H. Hilderink, editors, Communicating Process Architectures 2003, pages 139–156, 2003.
[11] Formal Systems (Europe) Ltd. FDR2 User Manual, sixth edition, May 2003. URL: http://www.fsel.com/fdr2_manual.html.
[12] Peter H. Welch and Jeremy M. R. Martin. Formal Analysis of Concurrent Java Systems. In Peter H. Welch and André W. P. Bakkers, editors, Communicating Process Architectures 2000, pages 275–301, 2000.
[13] Nullsoft / AOLmusic. SHOUTcast home page. URL: http://www.shoutcast.com.
[14] Xiph.org Foundation. icecast home page. URL: http://www.icecast.org.
[15] Bernhard Sputh. K-CSP: Component Based Development of Kernel Extensions. In Ian R. East, David Duce, Mark Green, Jeremy M. R. Martin, and Peter H. Welch, editors, Communicating Process Architectures 2004, pages 311–324, 2004.
[16] Neil C. Brown. C++CSP Networked. In Ian R. East, David Duce, Mark Green, Jeremy M. R. Martin, and Peter H. Welch, editors, Communicating Process Architectures 2004, pages 185–200, 2004.
[17] Fred Barnes and Peter H. Welch. Communicating Mobile Processes. In Ian R. East, David Duce, Mark Green, Jeremy M. R. Martin, and Peter H. Welch, editors, Communicating Process Architectures 2004, pages 201–218, 2004.
[18] Mario Schweigler, Frederick R. M. Barnes, and Peter H. Welch. Flexible, Transparent and Dynamic occam Networking with KRoC.net. In Jan F. Broenink and Gerald H. Hilderink, editors, Communicating Process Architectures 2003, pages 199–224, 2003.
[19] Mario Schweigler. Adding Mobility to Networked Channel-Types. In Ian R. East, David Duce, Mark Green, Jeremy M. R. Martin, and Peter H. Welch, editors, Communicating Process Architectures 2004, pages 107–126, 2004.
Appendix. PoisonableOne2OneChannel Source Code

Listing 5 shows the actual implementation of the PoisonableOne2OneChannel in JCSP-Poison. The code is based upon the One2OneChannel provided by JCSP. The original listing is quite long due to extensive commenting (it originally ran to seven pages). Some of the comments have been removed where appropriate, but the comments concerned with the poisoning aspect of the system have been left in. The code shown below is the basis for all poisonable channels. One of them, the Poison-Injector-Channel, has to be able to differentiate between the two channel ends in terms of poison; this is the reason for the methods isReaderPoisoned(), getReaderPoison(), isWriterPoisoned() and getWriterPoison(). In this channel no differentiation between reader-end and writer-end poison is performed.
package jcsp.lang;

public class PoisonableOne2OneChannel
        extends PoisonableAltingChannelInput
        implements PoisonableChannelOutput {

    /** The monitor synchronising reader and writer on this channel. */
    protected Object rwMonitor = new Object();

    /** The (invisible-to-users) buffer used to store the data for the channel. */
    protected Object hold;

    /** The synchronisation flag. */
    protected boolean empty = true;

    /**
     * This flag indicates that the last transfer went OK. The purpose is to
     * not throw a PoisonException to the writer side when the last transfer
     * went OK, but the reader side injected poison before the writer side
     * finished processing of the last write transfer.
     */
    protected boolean done = false;

    /** The Alternative class that controls the selection. */
    protected Alternative alt;

    /** This member holds the poison that was injected into the channel. */
    protected Poison poison = null;

    /**
     * This method is used to check whether the reader side of the channel
     * is poisoned.
     *
     * @return true if the reader side is poisoned, else false.
     */
    protected boolean isReaderPoisoned() {
        return (null != poison);
    }

    /**
     * This method is used to retrieve the type of poison that the reader
     * side is poisoned with.
     *
     * @return the poison that should be reported to the reader side.
     */
    protected Poison getReaderPoison() {
        return poison;
    }

    /**
     * This method is used to check whether the writer side of the channel
     * is poisoned.
     *
     * @return true if the writer side is poisoned, else false.
     */
    protected boolean isWriterPoisoned() {
        return (null != poison);
    }

    /**
     * This method is used to retrieve the type of poison that the writer
     * side is poisoned with.
     *
     * @return the poison that should be reported to the writer side.
     */
    protected Poison getWriterPoison() {
        return poison;
    }

    /**
     * Reads an Object from the channel.
     *
     * @return the object read from the channel.
     * @throws PoisonException in case the channel is poisoned.
     */
    public Object read() throws PoisonException {
        synchronized (rwMonitor) {
            if (isReaderPoisoned()) {
                throw new PoisonException(getReaderPoison());
            }
            if (empty) {
                empty = false;
                try {
                    rwMonitor.wait();
                } catch (InterruptedException e) {
                    throw new ProcessInterruptedError(
                        "*** Thrown from PoisonableOne2OneChannel.read()\n"
                        + e.toString());
                }
            } else {
                empty = true;
                rwMonitor.notify();
            }
            if (isReaderPoisoned()) {
                throw new PoisonException(getReaderPoison());
            } else {
                done = true;
                rwMonitor.notify();
                return hold;
            }
        }
    }

    /**
     * Writes an Object to the channel.
     *
     * @param value the object to write to the channel.
     * @throws PoisonException in case the channel is poisoned.
     */
    public void write(Object value) throws PoisonException {
        synchronized (rwMonitor) {
            if (isWriterPoisoned()) {
                throw new PoisonException(getWriterPoison());
            }
            hold = value;
            if (empty) {
                empty = false;
                if (alt != null) {
                    alt.schedule();
                }
            } else {
                empty = true;
                rwMonitor.notify();
            }
            try {
                rwMonitor.wait();
            } catch (InterruptedException e) {
                throw new ProcessInterruptedError(
                    "*** Thrown from PoisonableOne2OneChannel.write(Object)\n"
                    + e.toString());
            }
            if (true == done) {
                done = false;
            } else {
                if (isWriterPoisoned()) {
                    hold = null;
                    throw new PoisonException(getWriterPoison());
                } else {
                    done = true;
                }
            }
        }
    }

    /**
     * Turns on Alternative selection for the channel.
     *
     * @param alt the Alternative class which will control the selection.
     * @return true if the channel has data that can be read or is poisoned,
     *         else false.
     */
    boolean enable(Alternative alt) {
        synchronized (rwMonitor) {
            // When the channel is poisoned then make it seem
            // as if a message has arrived.
            if (isReaderPoisoned()) {
                return true;
            }
            if (empty) {
                this.alt = alt;
                return false;
            } else {
                return true;
            }
        }
    }

    /**
     * Turns off Alternative selection for the channel.
     *
     * @return true if the channel has data that can be read or is poisoned,
     *         else false.
     */
    boolean disable() {
        synchronized (rwMonitor) {
            alt = null;
            // Either data or poison available for pickup.
            return (!empty || isReaderPoisoned());
        }
    }

    /**
     * Returns whether there is data pending on this channel.
     *
     * @return true if the channel has data that can be read or is poisoned,
     *         else false.
     */
    public boolean pending() {
        synchronized (rwMonitor) {
            // Data or poison waiting for pickup.
            return (!empty || isReaderPoisoned());
        }
    }

    /**
     * Injects poison into the channel.
     *
     * @param poison the poison to inject into the channel.
     */
    public void injectPoison(Poison poison) {
        synchronized (rwMonitor) {
            // Did we get poison passed to the function?
            if (null == poison) {
                return;
            }
            // Channel currently not poisoned.
            if (null == this.poison) {
                this.poison = poison;
                // Wake up a possible sleeper.
                rwMonitor.notifyAll();
                // If alternation is used at the reader side, alarm the reader.
                if (null != alt) {
                    alt.schedule();
                }
                // Done.
                return;
            }
            // Is this channel only poisoned with LocalPoison?
            if (LocalPoison.class.isInstance(this.poison)) {
                // GlobalPoison overwrites LocalPoison.
                if (GlobalPoison.class.isInstance(poison)) {
                    this.poison = poison;
                    // Wake up a possible sleeper.
                    rwMonitor.notifyAll();
                    // If alternation is used at the reader side, alarm the reader.
                    if (null != alt) {
                        alt.schedule();
                    }
                    return;
                }
            }
        }
    }
}
Listing 5: Implementation of the PoisonableOne2OneChannel
jcsp.mobile: A Package Enabling Mobile Processes and Channels

Kevin CHALMERS and Jon KERRIDGE
School of Computing, Napier University, Edinburgh EH10 5DT
{k.chalmers, j.kerridge}@napier.ac.uk

Abstract. The JCSPNet package from Quickstone provides the capability of transparently creating a network of processes that run across a TCP/IP network. The package also contains mechanisms for creating mobile processes and channels through the use of filters and the DynamicClassLoader class, though their precise use is not well documented. The package jcsp.mobile rectifies this situation and provides a set of classes and interfaces that facilitates the implementation of systems that use mobile processes and channels. In addition, the ability to migrate processes and channels from one processor to another is also implemented. The capability is demonstrated using a multi-user game running on an ad-hoc wireless network using a workstation and four PDAs.

Keywords. JCSP, JCSPNet, Mobile Processes and Channels, Dynamic Class Loading, Migratable Processes and Channels, Location Aware Computing
Introduction

The motivation for the work reported in this paper came from the desire to modify an existing multi-player maze game that ran on a network of PC workstations into a version that used PDAs for the player processes and a wireless-enabled laptop for the display process. The initial reason for building the game was to demonstrate the use of JCSPNet [1] in a manner that would motivate students to better understand the concepts of concurrency, parallelism and the benefits of the JCSP approach. In the event, the project went far beyond this initial goal and resulted in the creation of a package that enables the design and implementation of systems that use mobile processes and channels. The JCSPNet package already contained the basic mechanisms for achieving this, but was somewhat tortuous in the way that such processes were created and incorporated into a complete system. The jcsp.mobile package brings together a number of the existing JCSPNet features and capabilities into a framework that makes such systems easier to design, implement and use.

In the next section we shall present the justification for the design of jcsp.mobile and describe the main components from which it is constructed. Section 2 will describe the relevant mobile parts of the maze game and also show how the concepts can be utilized in the implementation of a home consumer electronics remote control that requires just one controller for all such devices. Section 3 will show how migratable processes and channels can be implemented using a simple producer-consumer based example. Finally, the package is evaluated, some conclusions are drawn and areas for future work are identified.
1. Overview of jcsp.mobile

JCSPNet does include classes and interfaces associated with Dynamic Class Loading; however, the API documentation provided does not give any indication of how these classes and interfaces should be used. Initial attempts to understand the problem involved sending objects that implemented the CSProcess interface across channels, and though this initially gave the impression of working, it soon became apparent that the problem was far more complicated than such a simple approach permits. It is in fact necessary to understand how Java loads classes (and any other classes contained within a class) that are executed on a processor, and furthermore to understand the problems that can occur if the designer does not fully appreciate the intricacies of the mechanism. Given that the ability to transmit Applets around the web was one of the design principles of Java, it is somewhat surprising that this capability is so opaque.

The jcsp.mobile package creates a new concept called a Process Channel, which hides both the DynamicClassLoader class and the Filter class [1] that has to be added to a channel so that it can communicate class objects. The various types of Process Channel are defined as interfaces within Mobile, and instances of these Process Channels are created in a manner analogous to ordinary channel creation in a Channel Name Server (CNS) [1]. An abstract class MobileProcess is provided, which the user simply extends to create their own mobile process. A mobile device has to run a process called MobileProcessClient, which initializes itself as a node in a mobile network with a call to Mobile.init(); this gives it access to all the facilities of the jcsp.mobile package. The initialization mechanism mimics that used to initialize a node of an ordinary CNS-style network.

Mobile processes within CSP were first described by Barnes as a proposal for the occam language [2], after developing some of the necessary functionality required for the language. This proposal, although aimed at a different platform, can be tailored for use within JCSP, particularly the concept of one process (a server) transferring another process (the mobile process) to a third process (the client). Within the Java language, the functionality that required development for occam is not necessary, both because Java uses reference semantics for objects (if an object is said to equal another, only one such object exists) as opposed to occam's copy semantics (when an object is said to equal another, two unique objects are in existence) [3], and because of the Java language's interpreted nature.

1.1 Dynamic Class Loading

Due to the interpreted nature of the Java language, it is possible to load new classes into the runtime environment as required by the system. Basically, this means that it is possible to send a class description to an executing process (e.g. read from a file, or sent over a network) and for that process to create new instances of the class as required. How this works is not clearly described; there are some attempts [4] that claim to give a clear definition, while mentioning the poor documentation, but these do not give an accurate description either.

Figure 1 illustrates how class loading operates within Java. In this system, three class loaders are defined: the System Class Loader (started when the JRE starts) and two developed class loaders, Class Loader 1 and Class Loader 2. These both have the System Class Loader as a parent.
These developed class loaders have information relating to loading specific classes, and if they cannot load a class, they call the System Class Loader to load it for them. The System Class Loader has no knowledge of its child class loaders.
Looking at each of the class loaders in turn, the System Class Loader knows about Class A and Class B, and can only use these classes within its context. For example, if Class A were to try to create an instance of Class C or Class D, an exception would be raised because the class definition could not be found.

Figure 1. Class Loading in Java. (Diagram: the System Class Loader, knowing Class A and Class B, is the parent of two developed class loaders: Class Loader 1, which adds Class C and its own version of Class B, and Class Loader 2, which adds Class D.)
For Class Loader 1, as well as the classes available from the System Class Loader and Class C, a further version of Class B is available. This can lead to versioning problems between classes: Class C may expect the system-defined version of Class B yet receive the one defined by Class Loader 1. The same problem can arise with Class A. Class A will use the System Class Loader to load itself and any classes within itself, so a Class Loader 1 version of Class B passed into Class A causes the same versioning problem. Class Loader 2 only knows how to define Class D and the system classes, and therefore any attempt to create an instance of Class C will fail with the class-not-defined exception. If Class D should need instances of Class A or Class B, it would call upon the System Class Loader to load them.

Classes are generally defined using an array of bytes equivalent to the data normally stored within the class file. These bytes can obviously be sent across a communications medium such as the channels being used within JCSP, and initially this was the route taken when examining how to send class data with an object. It is hinted [5], however, that it is possible to set channels to send this data automatically, although how is not well documented. This led to an investigation of how to achieve class loading over channels.

The first point of investigation was the documentation available for the package. The two overview documents provide little in-depth discussion of anything but the most basic features, and suggest the API specification as a point of reference for more advanced features. The problem here is that API specifications describe all the classes available within the package, with only one starting access point; it is a matter of guesswork where to begin the search. Examination of the factory methods associated with channel creation, and the specialization required to create these class-loading channels, was fruitless. In the end, a text search for "class loading" in the documentation located a starting point. Even this did not clear up how to implement class loading easily, and still assumed some further knowledge before it could be used.
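Some of that assumed knowledge is the basic Java mechanism itself: a custom class loader can turn an array of bytes into a usable class via ClassLoader.defineClass(). The following minimal sketch (the class and member names are ours, not JCSP's) shows the idea, including the parent delegation of Figure 1:

import java.util.HashMap;
import java.util.Map;

public class ByteArrayClassLoader extends ClassLoader {

  // class definitions received, e.g. over a network channel
  private final Map classBytes = new HashMap();

  public ByteArrayClassLoader(ClassLoader parent) {
    super(parent);   // e.g. the System Class Loader
  }

  public void addClass(String name, byte[] bytes) {
    classBytes.put(name, bytes);
  }

  // the inherited loadClass() consults the parent loader first;
  // findClass() is only reached when the parent cannot resolve the name
  protected Class findClass(String name) throws ClassNotFoundException {
    byte[] bytes = (byte[]) classBytes.get(name);
    if (bytes == null) {
      throw new ClassNotFoundException(name);
    }
    return defineClass(name, bytes, 0, bytes.length);
  }
}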
What the documentation describes is the creation of a service, an instance of DynamicClassLoader, that needs to be installed onto a Node using its ServiceManager; this simply runs services available on the Node and is of no further concern here. When the class loading service is started, it creates two new processes: a JFTP server (the documentation does not say what JFTP stands for) and a Class Manager. Exactly how these operate is not described. It was deduced that when an object is sent across the channel, the Class Manager attempts to load the class data from its available class definitions. If it fails, it requests the class definition from the sending process's JFTP and, when this is received, uses it to load the class data for the object just received.

Although the creation and running of the DynamicClassLoader service may appear to solve the problem, the little documentation available does mention attaching filters to the channel ends to allow the sending and receiving of class data down the declared channels. This led to an investigation of exactly how filtered channels operate; fortunately, the documentation for this feature is clearer. JCSP allows the creation of filtered channel ends (FilteredChannelInput and FilteredChannelOutput respectively) from other channel ends, and then the addition of filters to these filtered ends as required. A filtered channel has the same external functionality as the channel used to create it; internally, when the relevant operation (read or write) is called, it uses the filters added to it to modify the message in some way. What happens with the Dynamic Class Loading filter is that the location of the sending Node's JFTP is attached to the message. When the receiving Node reads the object from the channel, the Class Manager is passed this location; then, if class definitions are required for the object received, the Class Manager can request them from the sending location's JFTP. This leads to the breakdown of Dynamic Class Loading given in Figure 2.
Figure 2. Dynamic Class Loading Using Filters. (Diagram: the SEND process writes through a FILTER into the receiving FILTER and on to the RECEIVE process; the receiving filter passes the extracted JFTP location to the CLASS MANAGER, which can fetch class definitions from the sender's JFTP.)
This model shows how the sending Filter adds the JFTP location to the object being written by the Send process before it goes down the channel. At the receiving end, the Filter takes the message, extracts the JFTP location and passes it on to the Class Manager. The original object is passed to the Receive process, which can then use the object even if it had no prior knowledge of what the object was. Although this may seem quite a trivial recipe when described here (start a service, add a filter to the channel, then read from the channel as normal), the lack of good documentation meant that much searching and experimentation was involved; this could have been avoided. However, the simplicity of the approach does mean it can be adopted easily, with little modification to an existing system. One of the primary reasons for this complexity is to deal with the case where a process uses other classes within its definition and it would not be immediately obvious that these other classes would be required; the mechanism implemented ensures that such additional class definitions are obtained automatically. The result of this experimentation is a package, jcsp.mobile, the components of which are now described.
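Pulled together, the recipe amounts to only a few calls. The following sketch is assembled from the calls used in the jcsp.mobile listings below; the channel name "service" is illustrative:

// start the class loading service on this Node (see section 1.2)
DynamicClassLoader loader = new DynamicClassLoader();
// ... install and start it via the Node's ServiceManager ...

// input side: attach the read filter to a net channel end
NetAltingChannelInput netIn = CNS.createNet2One("service");
FilteredAltingChannelInput in = FilteredChannelEnd.createFiltered(netIn);
in.addReadFilter(loader.getChannelRxFilter());

// output side: attach the write filter
NetChannelOutput netOut = CNS.createOne2Net("service");
FilteredChannelOutput out = FilteredChannelEnd.createFiltered(netOut);
out.addWriteFilter(loader.getChannelTxFilter());

// thereafter, objects of previously unknown classes can be read as normal
Object received = in.read();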
1.2 Mobile
The class Mobile captures the basic functionality of the mobile capability. It defines a DynamicClassLoader {2} (the notation {n} in the text refers to a line number in a program listing), and it is presumed that only one instance of Mobile is initialized per Node, in a similar manner to the creation of the CNS in JCSPNet. A ServiceManager {8, 9} is created, into which theClassLoader is installed {10}, named, and then started, all within the method init {5-17}. The NodeKey required by init is generated by a call to the TCPIPNodeFactory, in a similar manner to that used by the CNS.
01 public class Mobile {
02   protected static DynamicClassLoader theClassLoader;
03   private static boolean isInitialised = false;
04
05   public static void init(NodeKey thisNode) {
06     if (!isInitialised) {
07       theClassLoader = new DynamicClassLoader();
08       ServiceManager theServiceManager =
09         Node.getInstance().getServiceManager(thisNode);
10       theServiceManager.installService(theClassLoader, "Class Loader");
11       theServiceManager.startService("Class Loader");
12       isInitialised = true;
13     }
14     else {
15       System.out.println("Mobile already initialised");
16     }
17   }
18
19   public static ProcessChannelOutput createOne2Net(
20       NetChannelLocation channelLocation) {
21     return ProcessChannelEndFactory.
22       createOne2Net(channelLocation);
23   }
24
25   public static ProcessChannelOutput createOne2Net(
26       String channelName) {
27     return ProcessChannelEndFactory.createOne2Net(channelName);
28   }
29
30   public static ProcessChannelOutput createOne2Net(
31       NetChannelOutput netOut) {
32     return ProcessChannelEndFactory.createOne2Net(netOut);
33   }
34
35   // ... plus other shared output and input variants
36
37   public static AltingProcessChannelInput createNet2One() {
38     return ProcessChannelEndFactory.createNet2One();
39   }
40
41   public static AltingProcessChannelInput createNet2One(
42       String channelName) {
43     return ProcessChannelEndFactory.createNet2One(channelName);
44   }
45
46   public static AltingProcessChannelInput createNet2One(
47       NetAltingChannelInput netIn) {
48     return ProcessChannelEndFactory.createNet2One(netIn);
49   }
50 }
The static methods of Mobile can then be called in a manner analogous to the CNS to create the various types of Process Channel. Each of these static calls results in the calling of a similar method in ProcessChannelEndFactory. Some of these static methods are shown {19-49}; those omitted simply create the shared versions of the channels. Three versions of the channel-creation method are provided. The first takes a NetChannelLocation parameter, used in the creation of anonymous channels or when a channel end is communicated over a channel {22}. The second takes a String parameter that is used in the same manner as creating a named network channel using the CNS {28}. The third creates a Process Channel from an existing Net Channel {33}, which is necessary to allow the simple creation of Migratable Channels with class-loading capabilities (see section 3).

1.3 Process Channels

The Process Channels are simply an extension of the existing NetChannel interfaces provided within JCSPNet. So that ProcessChannelInputs can be used in an alternative, an abstract class AltingProcessChannelInput is created {58-69} that provides the required functionality.
51 public interface ProcessChannelInput extends NetChannelInput {
52 }
53
54 public interface ProcessChannelOutput extends NetChannelOutput {
55 }
56
57
58 public abstract class AltingProcessChannelInput
59     extends NetAltingChannelInput
60     implements ProcessChannelInput {
61
62   public AltingProcessChannelInput() {
63   }
64
65   public AltingProcessChannelInput(AltingChannelInput alt) {
66     super(alt);
67   }
68
69 }
The ProcessChannelEndFactory provides factory methods for creating instances of the Process Channels, using the same method names as those used in the CNS. These are the methods called from the singleton class Mobile described previously. Variants are provided that take a NetChannelLocation {72}, invoke a call to the corresponding CNS method using a character-string identifier for the channel being created {76}, or create a Process Channel from an existing Net Channel {79}.
70 class ProcessChannelEndFactory {
71   protected static One2Process createOne2Net(
72       NetChannelLocation channelLocation) {
73     return new One2Process(channelLocation);
74   }
75   protected static One2Process createOne2Net(
76       String channelName) {
77     return new One2Process(channelName);
78   }
79   protected static One2Process createOne2Net(
80       NetChannelOutput netOut) {
81     return new One2Process(netOut);
82   }
83   protected static Process2One createNet2One() {
84     NetAltingChannelInput netIn = NetChannelEnd.createNet2One();
85     return new Process2One(netIn);
86   }
87   protected static Process2One createNet2One(String channelName) {
88     NetAltingChannelInput netIn = CNS.createNet2One(channelName);
89     return new Process2One(netIn);
90   }
91   protected static Process2One createNet2One(
92       NetAltingChannelInput netIn) {
93     return new Process2One(netIn);
94   }
95   // ... plus other variants
96 }
The class Process2One references the DynamicClassLoader (theClassLoader) created in the singleton class Mobile. From theClassLoader it obtains the read filter {105}, which it attaches to the process channel {106}.

97  class Process2One extends AltingProcessChannelInput {
98    private FilteredAltingChannelInput filterIn;
99    private NetAltingChannelInput netIn;
100   private static DynamicClassLoader theClassLoader =
101     Mobile.theClassLoader;
102   protected Process2One(NetAltingChannelInput netIn) {
103     super(netIn);
104     this.netIn = netIn;
105     filterIn = FilteredChannelEnd.createFiltered(netIn);
106     filterIn.addReadFilter(theClassLoader.getChannelRxFilter());
107   }
108   // ... plus some other methods not required for the discussion
109 }
The class One2Process undertakes the complementary operation to that just described, except that it adds a write filter to an output channel. An output channel can be created using a NetChannelLocation {114}, a string reference to a Net Channel {119}, or an existing Net Channel {124}, as shown below.

110 class One2Process implements ProcessChannelOutput {
111   private NetChannelOutput netOut;
112   private FilteredChannelOutput filterOut;
113   private DynamicClassLoader theClassLoader = Mobile.theClassLoader;
114   protected One2Process(NetChannelLocation channelLocation) {
115     netOut = NetChannelEnd.createOne2Net(channelLocation);
116     filterOut = FilteredChannelEnd.createFiltered(netOut);
117     filterOut.addWriteFilter(theClassLoader.getChannelTxFilter());
118   }
119   protected One2Process(String channelName) {
120     netOut = CNS.createOne2Net(channelName);
121     filterOut = FilteredChannelEnd.createFiltered(netOut);
122     filterOut.addWriteFilter(theClassLoader.getChannelTxFilter());
123   }
124   protected One2Process(NetChannelOutput netOut) {
125     this.netOut = netOut;
126     filterOut = FilteredChannelEnd.createFiltered(netOut);
127     filterOut.addWriteFilter(theClassLoader.getChannelTxFilter());
128   }
129   // ... plus some other methods not required for the discussion
130 }
1.4 MobileProcess

MobileProcess is an abstract class {131-156} that is extended by a concrete implementation of a class that is to be made mobile. The class has a number of methods that permit the connection of arrays of input and output channels for the mobile process to use. These methods permit a variety of operations on the channels, enabling channels to be attached and detached dynamically.

131 public abstract class MobileProcess implements
132     CSProcess, Serializable {
133   protected ChannelInput[] in = null;
134   protected ChannelOutput[] out = null;
135
136   public final void init(ChannelInput[] in, ChannelOutput[] out) {
137     this.in = in;
138     this.out = out;
139   }
140   public final void remove() {
141     this.in = null;
142     this.out = null;
143   }
144   public final void attachInputChannels(ChannelInput[] in) {
145     this.in = in;
146   }
147   public final void attachOutputChannels(ChannelOutput[] out) {
148     this.out = out;
149   }
150   public final void detachInputChannels() {
151     in = null;
152   }
153   public final void detachOutputChannels() {
154     out = null;
155   }
156 }
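As an illustration (ours, not taken from the package), a concrete mobile process can be as small as the following; its channels are supplied by whichever host process attaches them via init():

// hypothetical example: echoes one object from the first attached
// input channel to the first attached output channel
public class EchoProcess extends MobileProcess {
  public void run() {
    Object item = in[0].read();   // channels were attached via init()
    out[0].write(item);
  }
}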
1.5 Mobile Process Server

MobileProcessServer is the process used by an application's server to send mobile processes to the client processors. The application's server will in fact be the parallel composition of the MobileProcessServer and a sender process. The sender process creates the mobile processes and, when required, outputs them to the MobileProcessServer, which then sends them over the network to a client process. The MobileProcessServer has no knowledge of the process it is sending.

MobileProcessServer inputs the mobile process from the sender on the internal channel processIn {180}. The mobile process is output on the ProcessChannelOutput processOut {181}, which will have all the required filters and access to theClassLoader automatically provided. The NetChannelInput processRequest {160} is used by the MobileProcessServer to receive requests from client processes for a mobile process.

The MobileProcessServer is implemented as a singleton with an empty constructor. The init method {169-172} is passed the name of the network channel, serviceName, upon which it will receive requests from client processes; it uses this name to create the only named network channel in the system. The init method is also passed the internal channel by which it is connected to the sender process. In due course, the process's run method is invoked by the application's server (see 2.1.1 for an example).
The run method reads, on the previously created processRequest net channel, the location {177} of a network input channel on which the requesting client will receive the mobile process. The MobileProcessServer then creates {179} an anonymous network output channel using a call to the Mobile class, which connects the MobileProcessServer to the MobileProcessClient that requested the mobile process. The server then reads {180} the MobileProcess from the sender using the internal channel processIn, and writes theProcess to the network channel connected to the client processor {181}.

157 public class MobileProcessServer implements CSProcess {
158   private ChannelInput processIn;
159   private ProcessChannelOutput processOut;
160   private NetChannelInput processRequest;
161   private static MobileProcessServer theServer =
162     new MobileProcessServer();
163   private MobileProcessServer()
164   {
165   }
166   public static MobileProcessServer getServer() {
167     return theServer;
168   }
169   public void init(String serviceName, ChannelInput processIn) {
170     processRequest = CNS.createNet2One(serviceName);
171     this.processIn = processIn;
172   }
173   // ... some other access methods
174
175   public void run() {
176     while (true) {
177       NetChannelLocation clientLocation =
178         (NetChannelLocation)processRequest.read();
179       processOut = Mobile.createOne2Net(clientLocation);
180       MobileProcess theProcess = (MobileProcess)processIn.read();
181       processOut.write(theProcess);
182     }
183   }
184 }
1.6 Mobile Process Client

The MobileProcessClient is the only process that has to run on a client processor, and is typically invoked by a call to main(). An extension does allow the process to be invoked as part of a parallel composition, if the client side needs to receive more than one process or is itself part of another network. The process creates its singleton instance and then enters the main() method.

The user is asked for the IP address of the CNS applicable to the network {195} and the name {196} of the network channel by which the client process will communicate with the server process. This must be exactly the same character string as used by the server process in its init() method. These values could be passed as arguments to the main() method if the code were modified appropriately. The client process initializes itself as part of a mobile system {200}, connecting itself to the network by means of the IP address of the CNS server. The client then uses CNS.resolve {202} to obtain the location of the named channel; resolution is used so that many clients can use the same network channel in sequence. Once the network channel has been resolved, the connection to the server can be created as a net output channel {203}.

The client then creates a net input channel (processReceive) by means of a call {205} to the static method Mobile.createNet2One. The location of this channel is then communicated to the server {206} by writing the channel location down the only named
network channel, which was previously resolved. The mobile process is then read {207} from the anonymous network channel processReceive. The received mobile process is then executed by constructing a ProcessManager, which invokes the mobile process's run() method {208}.

185 public class MobileProcessClient implements CSProcess {
186   private static SharedChannelInput serviceNameInput;
187   private static ProcessChannelInput processReceive;
188   private static SharedChannelOutput processOut;
189   private static MobileProcessClient theClient =
190     new MobileProcessClient();
191   private MobileProcessClient()
192   {
193   }
194   public static void main(String[] args) {
195     String CNS_IP = Ask.string("Enter IP address of CNS: ");
196     String processService =
197       Ask.string("Enter name of process service: ");
198     try
199     {
200       Mobile.init(Node.getInstance().init(
201         new TCPIPNodeFactory(CNS_IP)));
202       NetChannelLocation serverLoc = CNS.resolve(processService);
203       NetChannelOutput toServer =
204         NetChannelEnd.createOne2Net(serverLoc);
205       processReceive = Mobile.createNet2One();
206       toServer.write(processReceive.getChannelLocation());
207       MobileProcess theProcess = (MobileProcess)processReceive.read();
208       new ProcessManager(theProcess).run();
209     }
210     catch (NodeInitFailedException e)
211     {
212       System.out.println("Failed to connect to CNS");
213       System.exit(-1);
214     }
215   }
216   // ... plus some other access methods and a run()
217 }
2. Using jcsp.mobile

2.1 Maze Game

The maze game itself is quite simple and can be played by up to four players. Each player receives a copy of the same random maze. The goal of the game is to move the player's token from one corner of the maze to the opposite corner. The effects of each player's moves are shown on a central display; the players can see only their own token on their own display. Because the maze is generated randomly, it is possible that a player could be completely blocked in by walls; thus each player is given the ability to clear a limited number of walls in order to make progress towards their goal. As players move, they can erect walls in their wake to block their opponents. The user interface provides buttons for moving in all four directions, plus two further buttons that permit the clearing of a wall or the erection of a new blocking wall. As players achieve their goal, they are informed of their finishing position.

The sequence of interactions between the processes is as follows. The display process acts as the server. It creates the random maze and then sends this maze to each of the
players, indicating their starting position and hence, implicitly, their goal. Thereafter, the player process sends the moves the player has requested to the display process. The display process obtains a move from each player at every step, regardless of whether the player has made a move or not. The display process then updates its maze accordingly and communicates the new state of the maze to each player process. If the display process detects that a player has reached their goal, an appropriate indication is sent to that player only. This set of channels is captured in Figure 3.

In constructing the system as a set of mobile processes, the input ends of channels are created first, so that the input-end locations can be sent to the processes that output on a network channel. Hence the Display process will create all the input ends of the fromPlayer channels, which will be communicated as part of the mobile Player process to each of the players. The Player process will then create the input ends for the init and toPlayer channels, which it will send back to the Display process using the fromPlayer channel. All these channels are created anonymously, which removes the complexity of creating a universal naming convention for mobile systems.

Figure 3. Network Channel Structure for the Maze Game. (Diagram: each Player process is connected to the central Display process by a fromPlayer channel, and the Display is connected to each Player by init and toPlayer channels.)
2.1.1 MazeServer

The MazeServer process is the main entry point on the server side of the system; it is used to implement the process Display in Figure 3. The user is asked for the CNS server's IP address {226} and the name of the service required {227}, which must be the same as that specified in the MobileProcessClient processes that request and then run each of the Player processes shown in Figure 3. Additionally, parameters that specify the number of players and the size of the maze are requested {228-230}. The server process then initializes itself as a Mobile node connected to the identified CNS server {231}. The channel (toServer) that connects the server to the sender is then declared {233}. The sender process is declared {234} as an instance of MazeServer, and its constructor {247-253} is passed the output end of the toServer channel and the parameters of the game. The server process is then declared {236} as an instance of MobileProcessServer and initialized {237} with the input end of the toServer channel and the name of the service provided. Recall that this serviceName identifies the only network channel created using a character string, providing the initial connection between any client and the server. A Parallel is then constructed {238} comprising theServer and sender, and their run methods are invoked.

The run method of the sender process is defined as part of the MazeServer class {255}. An array of ProcessChannelInputs is declared to implement the fromPlayer channels {256}, and an array of MobileMazePlayers (see the next section) is declared {258}. For each of the players, a MobileMazePlayer is created with the required parameters: the identifier of the player and the channel location of the fromPlayer net input
channel {259-263}. As requests are received from the client processes, the next mobile player process is sent to the requesting client; thus a client does not know which player process it will receive. Once all the player processes have been sent, an instance of the MazeDisplayMain process is declared {265} and executed by passing it as a parameter to the ProcessManager constructor {267}. We shall not describe the detail of the MazeDisplayMain process, but its first stage is to read in the net channel locations of the input ends of the init and toPlayer channels.

218 public class MazeServer implements CSProcess {
219   private int noPlayers;
220   private ChannelOutput out;
221   private int rows, columns;
222   public static void main(String[] args)
223   {
224     try
225     {
226       String CNS_IP = Ask.string("Enter IP of CNS: ");
227       String serviceName = Ask.string("Enter unique service name: ");
228       int noPlayers = Ask.Int("Enter number of players: ", 1, 4);
229       int rows = Ask.Int("Rows: ", 5, 20);
230       int columns = Ask.Int("Columns: ", 6, 20);
231       Mobile.init(Node.getInstance().init(
232         new TCPIPNodeFactory(CNS_IP)));
233       One2AnyChannel toServer = Channel.createOne2Any();
234       MazeServer sender = new
235         MazeServer(toServer.out(), noPlayers, rows, columns);
236       MobileProcessServer theServer = MobileProcessServer.getServer();
237       theServer.init(serviceName, toServer.in());
238       new Parallel(new CSProcess[] {theServer, sender}).run();
239     }
240     catch (NodeInitFailedException e)
241     {
242       System.out.println("Error connecting to CNS");
243       System.exit(-1);
244     }
245   }
246
247   private MazeServer(ChannelOutput out, int noPlayers,
248       int rows, int columns) {
249     this.out = out;
250     this.noPlayers = noPlayers;
251     this.rows = rows;
252     this.columns = columns;
253   }
254
255   public void run() {
256     ProcessChannelInput[] fromPlayers = new
257       ProcessChannelInput[noPlayers];
258     MobileMazePlayer[] players = new MobileMazePlayer[noPlayers];
259     for (int i = 0; i < noPlayers; i++) {
260       fromPlayers[i] = Mobile.createNet2One();
261       players[i] = new MobileMazePlayer(i,
262         fromPlayers[i].getChannelLocation());
263       out.write(players[i]);
264     }
265     MazeDisplayMain theMaze = new MazeDisplayMain(noPlayers,
266       fromPlayers, rows, columns);
267     new ProcessManager(theMaze).run();
268   }
269 }
2.1.2 MobileMazePlayer

The MobileMazePlayer class extends MobileProcess and contains a constructor {273} that is invoked when the instances of the mobile process are created on the server side. The MobileProcessClient that inputs this process invokes the run method {278-287}. The run method declares the init and toPlayer channels {279, 282} using calls to Mobile.createNet2One. The location of the fromPlayer channel was passed as a parameter of the constructor and is used to create a Mobile One2Net channel {280}. The mobile player process then writes the locations of the init and toPlayer channels to the Display process, as described above, using the fromPlayer channel {283, 284}. An instance of a PlayerFrame process is then run {285}; this process has not been modified in any way from the version that ran in a non-mobile manner.

270 public class MobileMazePlayer extends MobileProcess {
271   private int playerId;
272   private NetChannelLocation fromPlayerLocation;
273   public MobileMazePlayer(int playerNumber,
274       NetChannelLocation fromPlayerLocation) {
275     this.playerId = playerNumber;
276     this.fromPlayerLocation = fromPlayerLocation;
277   }
278   public void run() {
279     NetChannelInput init = Mobile.createNet2One();
280     NetChannelOutput fromPlayer =
281       Mobile.createOne2Net(fromPlayerLocation);
282     NetChannelInput toPlayer = Mobile.createNet2One();
283     fromPlayer.write(init.getChannelLocation());
284     fromPlayer.write(toPlayer.getChannelLocation());
285     new PlayerFrame(playerId, 100, init, fromPlayer, toPlayer).run();
286   }
287 }
2.2 Home Systems Remote Control

In a modern home, with its many different consumer electrical products (TV, DVD player, stereo), there may be many different remote controls in use. The user may find it hard to remember which control operates which product, may lose a control, or may simply have too many controls. Using the mobile process technology, it is possible for an electrical product to contain its own remote control internally, and to send it to a universal touch-screen device when requested; in other words, one control for all devices. This would effectively remove the requirement for product manufacturers to develop a physical remote control for each device they create: only a software system is required. This could be implemented using TCP/IP over Bluetooth, or JCSP could be expanded to use serial connections and thereby allow the use of an infra-red port.

Looking at digital television technology, set-top boxes currently download new system software sent via the communication medium. Applying the concept of internally stored remote control systems means that they can also update their remote controls. This allows systems to be updated in ways not normally possible; coupling this with the ease of obtaining a new remote control when one is lost demonstrates the possibilities of these systems.

To demonstrate, a simple remote control device system was developed. Three controllable devices are created, demonstrated virtually as three different windows on a PC. The remote control for each device is transferred, as a Mobile Process, to a
PDA when requested. To simplify the concept, the PDA has an initial interface consisting of three buttons named TV, DVD and STEREO. When one of these buttons is pressed, the corresponding control is received from the corresponding device server. Figure 4 illustrates the basic process-network design and shows how it relates back to the pattern presented for mobile processes.
Figure 4. Remote Control System Process Network. (Diagram: 1, the user's "TV" selection reaches the MOBILE PROCESS CLIENT; 2, the client sends a request to the MOBILE PROCESS SERVER; 3-4, the TV SENDER passes the TV remote process via the server to the client; 5, the remote process is installed in the REMOTE CONTROL RECEIVER; 6, the received control communicates with the TV SCREEN.)
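A device server in Figure 4 can be assembled directly from the section 1 components, following the MazeServer pattern of section 2.1.1. In the sketch below, TVRemoteProcess and TVDeviceServer are hypothetical names, not part of jcsp.mobile:

// sketch only: a hypothetical MobileProcess carrying the TV's control interface
class TVRemoteProcess extends MobileProcess {
  public void run() {
    // build the touch-screen control interface and talk to the TV
  }
}

public class TVDeviceServer {
  public static void main(String[] args) throws NodeInitFailedException {
    // join the mobile network via the CNS (address from the command line)
    Mobile.init(Node.getInstance().init(new TCPIPNodeFactory(args[0])));
    final One2AnyChannel toServer = Channel.createOne2Any();
    MobileProcessServer theServer = MobileProcessServer.getServer();
    theServer.init("TV", toServer.in());
    CSProcess sender = new CSProcess() {
      public void run() {
        while (true) {
          // one remote-control process handed out per client request
          toServer.out().write(new TVRemoteProcess());
        }
      }
    };
    new Parallel(new CSProcess[] {theServer, sender}).run();
  }
}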
3. Migrating Processes and Channels

The first step here is to modify the MobileProcess so that it reflects more closely the model defined by Barnes [2], by adding a channel that can be used to signal the process to stop. Originally this was called a kill channel, but reset is probably a more easily interpreted name, as the process is not necessarily killed. This channel can be used to send a signal to stop the process, or to receive a channel location so that the process can be redirected to a new Node.

3.1 Migratable Channels

A Migratable Channel is one that can move from one Node to another and continue communication with its opposite channel end almost seamlessly. At a local level, this type of operation is easy in Java, as a reference to a channel end can quite easily be passed from process to process. Doing this at a distributed level is much more difficult, as it requires the opposite channel end to be redirected to the new location.

A Networked Channel operates using a client-server style architecture, with the input end acting as a server and the output end as a client. Because of this, it is possible to have multiple output ends (clients) connected to a single input end (server), and all an output end needs to know is the location of its corresponding input end. Viewed in the context of a Migratable Channel, this means that as long as a migrating output channel retains the location of the input channel it is using, it can quite easily be sent across a network.

Input channels are more difficult. If one is moved to another location, each of its corresponding output ends must be made aware of this. Because the input channel has no knowledge of its output channels, the obvious solution of sending the new location to those channels is not possible. Quickstone has, however, developed a solution to this problem. What appears to happen is that when a Migratable Input Channel is moved, it leaves a small process listening at its original location, under a temporary channel name that it registers with the CNS. When the input channel arrives at its new location, it declares this temporarily named channel with the CNS. When an output channel then tries to send to the old location, it is told the name of the channel to resolve with the CNS, which it then uses to reconnect to the channel. Figure 5 illustrates this concept.
Figure 5. Migrating Channels. (Diagram: when the Input Channel End moves, a temp channel remains at the old location; the Output Channel End is redirected via the temp name registered with the CNS, and reconnects over a new channel to the Input Channel End at its new location.)
This method works quite well, as long as both channel ends are declared as Migratable; this ensures that the output end knows it may, in some cases, receive a name with which to connect to a new channel location. In both cases (input and output), when the channel is about to move, a prepareToMove() method is called on the channel so that any necessary actions can be performed before movement. This idea can be taken further, into the concept of a Migratable Process.

3.2 Migratable Processes

If the Mobile Process class is extended to create a Migratable Process class, it can be seen that implementing prepareToMove() on the process is also a good idea. This allows any necessary actions to be performed to prepare the process to be moved; a prime example is the removal of non-serializable objects, as these cannot be sent across networked channels. The method is declared abstract, as each specific process will require different actions to prepare it for migration; a sketch of such a class is given at the end of this subsection.

It is possible to test a basic Migratable Process, and the simplest system to use is a Producer-Consumer. Due to the simplicity of the system itself, it is not necessary to use the prepareToMove() function, as no real preparation is needed to move the process. It is also easier to have the basic processes wrapped inside a parent Migratable Process, which will start them when it arrives at its client location. This parent class will then read from its reset channel; when it receives a channel location, it will stop the Producer or Consumer, create a channel from the received location, and write itself down the created channel.
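The extension itself is small. A sketch of what the class described here might look like follows; the actual class in jcsp.mobile may differ in detail:

public abstract class MigratableProcess extends MobileProcess {
  // the reset channel: receives either a stop signal or a
  // NetChannelLocation to which the process should migrate
  protected NetChannelInput reset;

  // called just before the process is serialized and moved, e.g. to
  // discard non-serializable state; abstract because each process
  // requires different preparation
  public abstract void prepareToMove();
}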
3.3 A Producer-Consumer Example

When this version of the Consumer is created, it is passed a MigratableChannelInput end, which can be passed in when the server process creates it. This is similar to the idea of passing in a channel location, except that a channel does not need to be created by the process, as it is already present. Due to the nature of the Migratable Channel, this is perfectly viable code.

When the process is run, it first declares the reset channel with the CNS {294}, which allows other processes to find it. A new Consumer process is then created {296} and started within its own ProcessManager {297}; a ProcessManager allows a single process to be started, with the main process continuing, when the start() method is called instead of run() {298}. Next, a channel location is read from the declared reset channel {299}. When a location is received, the Consumer is stopped and the MigratableChannelInput is informed that it is about to move {300-301}. A ProcessChannelOutput is then created {302}, and the MigratableConsumer is sent down it {303}. When another client receives the process, this run method is invoked again, with the Producer having no knowledge that the process has relocated; the Consumer is thus truly mobile. The Consumer's input channel can also be made a Process Channel {295}, allowing unknown objects to be sent from the Producer to the Consumer.

288 public class MigratableConsumer extends MigratableProcess {
289   MigratableChannelInput in;
290   public MigratableConsumer(MigratableChannelInput in) {
291     this.in = in;
292   }
293   public void run() {
294     reset = Mobile.createNet2One("consumer.reset");
295     ProcessChannelInput processIn = Mobile.createNet2One(in);
296     Consumer consumer = new Consumer(in);
297     ProcessManager pm = new ProcessManager(consumer);
298     pm.start();
299     NetChannelLocation toLocation = (NetChannelLocation)reset.read();
300     pm.stop();
301     in.prepareToMove();
302     ProcessChannelOutput toNext = Mobile.createOne2Net(toLocation);
303     toNext.write(this);
304   }
305 }
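For completeness, a controller on a destination Node could trigger the migration along these lines (an illustrative fragment only; the channel name matches the one declared at {294}):

// resolve the consumer's reset channel and send it the location
// of a fresh input channel on this Node
NetChannelOutput resetOut = CNS.createOne2Net("consumer.reset");
ProcessChannelInput arrival = Mobile.createNet2One();
resetOut.write(arrival.getChannelLocation());

// the MigratableConsumer writes itself here; run it locally
MigratableProcess consumer = (MigratableProcess) arrival.read();
new ProcessManager(consumer).run();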
4. Evaluation

4.1 Embedded Systems

Some of the main issues for embedded systems are memory constraints and the ability to update the systems easily. In general, an embedded system would require all the necessary class files and resource files before it could be started, and this can be a strain on storage requirements, especially if some of the class files are rarely used. If the framework for mobile processes is used, together with a scheme that removes unnecessary class definitions from the Class Manager, the embedded system requires only the basic classes described here, which may lead to a smaller memory footprint. Of course, if time constraints are placed on the embedded system, then this method is inefficient.

The other problem associated with embedded systems, that of updating multiple systems easily, can also be addressed with this framework, even without stopping the systems: by using the reset channel, it is possible to make the systems update themselves as necessary by forcing them to access the server process responsible for the clients.

4.2 Distributed Systems and Mobile Agents

The concept of a mobile agent was first discussed by White as a successor to Remote Procedure Calling (RPC) [6]. White points out that the main limitation of RPC for distributed
systems is that the client requires some knowledge of the service provided by the server, and describes the concept of mobile agents as overcoming this limitation by allowing a system to inform a client-type application exactly what procedures it provides. Looking at the framework developed here for mobile processes, it is apparent that a system exhibiting this kind of functionality has been created: the client-side process does not need any prior knowledge of what functionality it will have; in effect, it is told exactly what procedures it can use when it requests a service.

Another aspect of mobile agents is the ability to send a process from one system to another so that it can execute some functionality and then return. Again, the mobile process framework provides this functionality quite concisely. It is quite possible to send processes between systems easily, without either system knowing exactly what the process needs to do. Of course, security constraints need to be considered to stop malicious processes being sent into a system, but the basic principle has been demonstrated.

4.3 Location Aware Computing

Currently, location awareness within systems relates mainly to different information being made available to the user depending on their current location, coupled with the ability for users to know each other's locations. Basically, the system adapts to offer information and services relevant to wherever the user happens to be. An example of this type of environment can be seen in a museum, where exhibit information is currently contained on a display card next to the exhibit in question. This may require the onlooker to move away from the exhibit to the card to read information pertaining to the exhibit, and then back. If a location-aware system were implemented in this situation, these cards could be replaced by information displayed on a PDA, relevant to the exhibit the user is near. Furthermore, it would be possible for the system to provide links to other relevant information about the exhibit, within the PDA or externally on a server, with these links possibly being locations of other exhibits, or audio, video or textual information.

This kind of system is being positioned as the next generation of application for mobile devices, but its limitation, providing only information relevant to the location, is apparent; it is really only an extension of current browser technology. What the mobile process technology has shown is that it is possible to change the system itself, giving a Location Adaptive System or Location Context System, where the system itself adapts to the location of the user.

5. Conclusion and Future Work

The first point to consider when evaluating JCSP is how well it reflects the underlying principles of Communicating Sequential Processes. Although no thorough investigation of this has taken place here, the resources used suggest that JCSP does provide the mechanisms expected [7,8]. Using JCSP is relatively easy, and can be related back to the basic premise of individual processes sharing resources by the use of communication channels. JCSP, or more importantly Java, does present some issues relating to the implementation of a CSP system at a local level, due to the reference semantics used: even though an object is passed from one process to another over a channel, the sending process can still change the shared object [3].
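The hazard can be shown in a few lines (a hypothetical fragment; out is assumed to be the writing end of a channel shared with a reader process):

// reference semantics: writer and reader end up sharing one object
int[] data = new int[] {1, 2, 3};
out.write(data);   // the reader now holds a reference to this same array
data[0] = 99;      // the writer can still mutate the "sent" object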
A more accurate evaluation can be made of the networked aspects made available by JCSP Network Edition, especially in reference to the robustness of the Channel Name
Server and the underlying architecture used to implement the networked aspects. During this work, the CNS has been tested on a number of levels.

Firstly, the ease of creating networked channels, given only the IP address of the CNS and a channel name, should be mentioned. This feature provides an almost invisible method of networked-channel creation: the locations of individual processes are not required, just the location of the CNS they are using. This functionality can be attributed to the implementation of the Broker design pattern [9].

The second feature of the Network Edition of JCSP is anonymous channels. This functionality is one of the main reasons that the concept of mobile processes within JCSP is possible. Using anonymous channels, by passing channel locations within a process, another layer of transparency is achieved, mainly because the CNS does not have to be used to forward messages; this is where the CNS advances beyond the original Broker pattern. In theory, it is quite possible to have a system in which, after an initial transaction, the CNS is never used again. This is also the reason the Dynamic Class Loading service can be made available, as a process can be given the location from which to retrieve class data, again transparently. The transparency and ease of use of these features demonstrate how well the networked facilities have been developed.

The final feature for consideration is how the channels have been developed. There were two choices for implementing a networked channel: having the underlying server socket at the input channel end, receiving messages, or having the output channel end act as the server socket. Given that servers generally send information to clients, it would seem most natural for the output channel to act as a server socket. However, had that architecture been implemented, it would not have been possible to send channel locations between distributed processes, as the initially declared output channel would only be able to send locations, not receive them. The designers of JCSP Network Edition took this into consideration and developed the input channels as server sockets, enabling the development of mobile processes as a whole.

The quality of the implementation also showed in the ease with which JCSP could be transferred onto a mobile platform, although this was mainly because of the underlying Java Runtime Environment used. A viable version of the J2ME Connected Device Configuration Personal Profile, which acts as Java Standard Edition 1.3.1, was made available by IBM [10]. It was fairly easy to transfer a basic JCSP system onto a PDA device, and it is even possible to use a PDA device as the CNS.

Currently, we are developing a system for context- and location-aware computing that will allow a user to roam around an environment, their PDA downloading location-specific clients to interact with a server depending upon the particular wireless access point being used to access the network.

Acknowledgements

Kevin Chalmers acknowledges the support of the Student Awards Agency for Scotland, which contributed to his tuition fees for the undergraduate degree for which the work reported in this paper formed part of his final-year project. We are grateful for the helpful comments of the referees.
References

[1] Quickstone Ltd. Web site, accessed 4/5/2005. http://www.quickstone.com/.
[2] F.R.M. Barnes and P.H. Welch. Prioritised Dynamic Communicating and Mobile Processes. IEE Proceedings - Software, 150(2), 121-136, April 2003.
[3] F.R.M. Barnes and P.H. Welch. Mobile Data, Dynamic Allocation and Zero Aliasing: an occam Experiment. In A. Chalmers, M. Mirmehdi and H. Muller (Eds.), Communicating Process Architectures 2001. IOS Press, 2001.
[4] Z. Qian, A. Goldberg and A. Coglio. A Formal Specification of Java Class Loading. ACM SIGPLAN Notices, 35(10), 325-336, October 2000.
[5] P.H. Welch. CSP Networking for Java (JCSP.Net). Retrieved January 14, 2005 from http://www.cs.kent.ac.uk/projects/ofa/jcsp/jcsp.ppt. 2002.
[6] J. White. Mobile Agents White Paper. General Magic. Retrieved March 22, 2005 from http://citeseer.ist.psu.edu/white96mobile.html. 1996.
[7] P.H. Welch. Java Threads in the Light of occam/CSP. In P.H. Welch and A.W.P. Bakkers (Eds.), Architectures, Languages and Patterns for Parallel and Distributed Applications (WoTUG-21), pp. 259-284. IOS Press, 1998.
[8] P.H. Welch and J.M.R. Martin. Formal Analysis of Concurrent Java Systems. In P.H. Welch and A.W.P. Bakkers (Eds.), Communicating Process Architectures 2000, pp. 275-301. IOS Press, 2000.
[9] F. Buschmann, R. Meunier, H. Rohnert, P. Sommerlad and M. Stal. A System of Patterns: Pattern-Oriented Software Architecture. Wiley, Chichester, 1996.
[10] IBM J2ME. Web site, accessed 4/5/2005. http://www-106.ibm.com/developerworks/wireless/library/wi-j2me/.
Communicating Process Architectures 2005 Jan Broenink, Herman Roebbers, Johan Sunter, Peter Welch, and David Wood (Eds.) IOS Press, 2005 © 2005 The authors. All rights reserved.
CSP++: How Faithful to CSPm?

W. B. GARDNER¹
Dept. of Computing & Information Science, University of Guelph, Canada
Abstract. CSP++ is a tool that makes specifications written in CSPm executable and extensible. It is the basis for a technique called selective formalism, which allows part of a system to be designed in verifiable CSPm statements, automatically translated into C++, and linked with functions coded in C++. This paper describes in detail the subset of CSPm that can be accurately translated by CSP++, and how the CSP semantics are achieved by the runtime framework. It also explains restrictions that apply to coding in CSPm for software synthesis, and the rationale for those restrictions.

Keywords. CSPm, C++, selective formalism
Introduction

CSP++ is a pair of tools, a translator and an object-oriented application framework (OOAF), that together make CSP specifications both executable and extensible. The initial development goes back to 1999 [1], and was based on a local dialect of CSP called csp12 [2] that was supported by an in-house verification tool of fairly limited capabilities. In order to bring CSP++ to a wider community, and build a straight-line design flow for software synthesis starting from robust commercial verification tools, CSP++ has recently been redeveloped to accept input in CSPm, the same machine-readable dialect used by FDR2 and ProBE from Formal Systems (Europe) Ltd. [3].

There is more to CSP++ besides translation and execution of CSPm specifications. It also includes a strategy for practicing formal methods in software engineering, dubbed selective formalism [4]. This strategy provides a logical way to combine formal specifications written in CSPm with source code written in the popular programming language C++, without ruining verified properties. It is an unabashed attempt to breach the resistance of software developers to adopting pure formal methods, by offering a sort of pragmatic compromise.

The purpose of this paper is to expose the specific design choices that went into the selection and implementation of the CSPm subset. It begins with an overview of CSP++, first introducing the approach of selective formalism, and then going through the steps of the design flow carried out by the automated tools. The heart of the paper unfolds, in four sections, the answer to the title's question: How faithful is CSP++ to CSPm? If one wants to synthesize software from a CSPm specification, what can one expect, and what is one limited from doing? Do the execution semantics match the traces of the specification, and what happens when non-formal C++ code is linked in? There is a clear rationale for what has and has not been implemented. Convergence with CSPm is described in terms of the operators and constructs that are synthesizeable (as of version 4.1). Divergence from CSPm is also discussed in detail.

¹ Assistant Professor, Modeling & Design Automation Group, Dept. of Computing & Information Science, University of Guelph, ON, Canada, N1G 2W1. E-mail:
[email protected]. CSP++ is available for download from the author’s website: http://www.cis.uoguelph.ca/~wgardner, Research link.
Note that this order of presentation mixes together constructs, philosophy, implementation, and limitations. The main part of the paper ends with a list of platforms that CSP++ is known to run on, and several case studies from which performance measurements have been gleaned. The paper then concludes with a brief review of related work, other CSP frameworks and translators, and plans for future work. Beyond the obvious issue of what features to add to CSP++, we also muse on what it would take to popularize the practice of selective formalism based on CSP.

1. Overview of CSP++

The overview below assumes that the reader is familiar with Communicating Sequential Processes [5] and needs no special justification for choosing CSP as a modeling tool for concurrent systems. The main contribution of CSP++ is software synthesis based on CSP. The first subsection sets the context for this, by explaining what is meant by "selective formalism" and how it applies to CSP++. The next subsection goes through the steps of the CSP++ design flow. Finally, the synthesis tools, the translator and the OOAF, are described chiefly from a user's standpoint. Their internal operation is not described in detail, in order to avoid duplicating documentation available from other sources [6][7].

1.1 Selective Formalism

The notion of selective formalism is based on the oft-observed fact that resistance to the adoption of formal methods in the software industry runs high. This state of affairs is not without rational basis. For example, three practical drawbacks of formal methods are:

1. A company will not likely have on hand many designers who can utilize a formal notation, and will not be eager to retrain its programmers who are already skilled in conventional programming languages.
2. Even if a specification is produced in a formal notation and subjected to verification, the notation will have to be translated, presumably by hand, into a programming language suitable for implementing on the target platform. Aside from the time-consuming and error-prone nature of this manual step, it may not be clear that properties verified in the formal specification could be retained in translated form.
3. Formal notations are more convenient for expressing abstractions above the level of a detailed implementation. Therefore, specifications written in a formal notation will likely need to be supplemented by conventional program code in any case.

It must be acknowledged that the first point above does not match the profile of companies whose clients have forced them to take a "high road" vis-à-vis formal methods, e.g. for the sake of highly safety-critical products. Such companies have their own solutions, and selective formalism could even represent a backward step for them.

The essential compromise of selective formalism is that many benefits of using a formal notation can be obtained without committing to it for building an entire system. It is proposed that the control backbone of a system be specified using a formalism that is well-suited to expressing interprocess synchronization and communication, i.e. CSP, and that this specification be automatically translated to a conventional language, C++. Provision is made for supplementing the translated formal backbone with additional C++ code in a way that does not invalidate the specification's formal properties. The "selective" aspect refers to the designer's decision to describe more or less of the system in CSP, according to the system's characteristics.
This approach goes far to overcoming the three drawbacks:

1. The company will need only a small number of trained "CSP gurus" who can write CSP and run the verification tools. Much of the coding, integration, and testing can be carried out by programmers skilled in C++.
2. Automatic translation is used to render the verified CSP control backbone into compilable C++. The semantics of the resulting executable code match the CSP specification. Thus, the specification is not destined to become an "orphan" in the development process; it can be modified and retranslated on demand.
3. Programmers need not attempt awkwardly to express every algorithm or calculation in CSP, but can use C++ where formal properties are not an issue. Interfacing the system with its actual environment via I/O can be carried out conveniently in C++.

Selective formalism is therefore based on software synthesis, particularly on the ability to make CSP specifications both executable and extensible. The steps of the design flow based on the automated tools are described in the next subsection. These steps are depicted in Figure 1.
Figure 1. CSP++ Design Flow. (Diagram: CSP specs are checked by verification tools and translated by the cspt translator into the CSP++ control layer; this is combined with user-coded C++ functions, utilities and an RTOS to form the target system.)
1.2 Design Flow

A designer will start by creating a specification in CSPm, and using the tools from Formal Systems (checker, ProBE, and FDR2) to simulate it and verify its properties. Experience shows that CSP tends to feature in four roles in such a specification, constituting four complementary models:

1. Functional Model: These statements capture the desired system behavior in terms of CSP processes engaging in named events.
2. Environment Model: These statements simulate the behavior of entities in the system's target environment, in terms of processes engaging in events. The functional model can be simulated by synchronizing it with the environment model.
3. Constraint Model: Other processes may optionally be added alongside the functional model to limit or constrain the event sequences that can occur. A
constraint model is used to focus on critical event sequences in the functional model that must (or must not) occur in order for the system to be "safe." If verification shows that the constraint is violated, the functional model must be improved.
4. Implementation Model: Since the functional model will likely be fairly high-level, it will normally need to be refined to an implementation model, still in CSP, but with more detailed processes and events added. Verification will confirm whether the implementation is a legitimate refinement of the original functional model.

After verification is satisfactory, the CSPm specification can be sent to the synthesis tools (described next), and the resulting C++ source code compiled and linked. This program can be run with tracing enabled for simulation purposes, in which case it will print out a trace of every event executed, identifying the process in control at that moment. (For synchronizing events, only the process that arrived last at the rendezvous will be identified.)

In order to complete the implementation, the designer returns to the CSPm specification and removes (or comments out) the environment model, since the idea is for the translated CSPm to interact with the system's real environment. At this point, named CSPm channels that were previously synchronized with the environment model are now free to be linked with C++ user-coded functions (UCFs). These functions can perform system calls, carry out I/O, and utilize third-party packages such as a database management system, under control of the translated CSPm backbone.

For debugging purposes, the translated C++ can be run with a conventional debugger (e.g. gdb). Since the original CSPm is inserted as comments in the translated source file, interleaved with the resulting C++ translation, it is easy to relate the two and, in effect, set breakpoints in the CSPm and inspect local variables. Execution can also conveniently be stepped out of the CSPm into the user-coded functions.

1.3 Synthesis Tools

Since the semantic "distance" between CSP and executable machine code is large, an intermediate code translation target was created in the form of an OOAF. The framework, called CSP++, is architected in terms of C++ classes that mirror the objects in the world of CSP, chiefly processes and channels, and supply their proper semantics. The job of the translator is to convert a CSPm specification into a particular customization of the framework which, when compiled and run, emulates the original specification. The translator and framework form a tool chain (Fig. 2) and are described in the following subsections.
Figure 2. CSP++ Tool Chain (CSPm spec → cspt translator → C++ → C++ compiler → object files → linker → executable; the framework headers feed the compiler, and the framework library feeds the linker)
1.3.1 cspt Translator

The translator, called cspt, originally supported the local dialect csp12. For version 4.0, it has been refitted with a front end that accepts a carefully chosen subset of CSPm. The particulars of this subset are discussed in detail in section 2. The aim is that any text file conforming to the subset which is syntactically acceptable to checker, ProBE, and FDR2,
will be translated accurately into C++ source code which, when compiled with the CSP++ class library headers and linked with the CSP++ object library, will execute with the same semantics as simulated by ProBE.

Looking at the output source code, the user will observe that each CSPm process definition has been translated into one or more C++ functions, and that the function bodies contain instantiations and method invocations of CSP++ classes. Some CPP preprocessor macros are used for the translator's convenience, and the readability of the source code is quite high. The process called SYS is taken as the starting point for execution, and a main function is generated to process command line options (e.g. enable tracing) and then launch SYS. Execution stops and the main function returns in any of these circumstances:

1. SYS terminates by executing SKIP.
2. Any process executes STOP.
3. All processes are waiting for a synchronization event and the command line option of idle checking was enabled.

In cases 2 and 3, a dump is printed showing the status of all active processes, identifying which events they are waiting on for synchronization.

1.3.2 Execution Framework

CSPm processes are mapped into threads. The current version of CSP++ is based on GNU Pth [8], a portable package for nonpreemptible threads. The translator is smart enough to avoid consuming resources with gratuitous thread creation: in the two common cases of a process turning into another process (e.g. P = a -> Q) and tail recursion (P = b -> P), the current thread simply carries on, in the one case changing its identity to Q, and in the recursive case by looping back to P. This is called "chaining." Compositional cases spawn new threads as required. For example, P = A||B would spawn threads for A and B. The sequence Q = R;S would spawn a thread for R, wait for it to finish, and then chain to S without spawning. Now suppose R were written inline, as say, Q = e->SKIP;S. In this case no thread would be spawned; e would just be executed by Q's thread. Complex expressions incorporating composition are handled by extracting unnamed subprocesses. In the example Z = (P||Q);R, (P||Q) would be extracted by the translator as a subprocess. Z would spawn it to perform the parallel composition and wait for it to finish, after which Z would chain to R.

Changing the underlying thread model is not difficult, and has been done several times already. The base class for CSPm process objects is called task, and all the thread-aware code is localized in its methods for easy portability.

In order to fully emulate the dynamics of a CSPm specification, the runtime system maintains a branching environment stack (i.e. tree structure). Whenever the CSPm elements of synchronization sets, renaming, and hiding are encountered, corresponding environment objects are pushed onto the current process's branch of the stack. All CSPm events are interpreted in light of their process's current environment context, which necessitates a good deal of stack searching.

User-coded functions are integrated as follows: when an event is to be executed, the framework will check whether a user-coded function was supplied at link time. If so, the UCF will be called, and if channel I/O is involved, data will be transferred to/from the UCF. If no UCF is linked to the event/channel name, the event can be used for synchronization with another CSPm process as usual.
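Pulling these pieces together, here is a purely hypothetical sketch of the style of translated code described at the start of this section; the names Agent, Atomic, and CSP_CHAIN are invented for illustration and do not reflect the actual CSP++ API:

    // CSPm source, kept as a comment by the translator:
    //   P = a -> Q
    void P_body(Agent& self) {
        Atomic a("a");            // event resolved in the current environment
        self.execute(a);          // engage in event a (synchronize or call UCF)
        CSP_CHAIN(self, Q_body);  // chaining: the same thread carries on as Q
    }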
The most challenging feature of CSPm to implement is multiparty synchronization in the presence of external choice. This is handled by trying each alternative in turn until one succeeds, or if none succeeds, then suspending the thread on a condition variable. The last party to arrive at a synchronization is called the "active" party. It is responsible for canceling the other choice alternatives (if applicable), transferring any channel data (if applicable), and waking up all remaining "passive" parties.

For simulation purposes, any events that are not synchronized in the specification get some default treatment at run time: plain events and channel output are printed, and integer input is obtained for channel input. This means that, for example, if P = ch!10, the framework will output 10. But if another process is put in parallel with P, say, Q = ch?x, then nothing will be printed because the event will be absorbed internally. In addition, as mentioned above, the framework can have trace printing enabled. In that case, each successful synchronization and channel data transfer will be logged on the cerr (stderr) stream, and will reflect any renaming and/or hiding that is in effect.

2. Convergence with CSPm

Appendix A of the FDR2 User's Manual [9] is taken as the "bible" for CSPm syntax. The same presentation is also available from Appendix B of [9]. The basic principles behind decisions concerning which features of CSPm to support in CSP++ for translation can be stated as follows:

• We want to implement for synthesis a rich, useful subset of CSPm with as few restrictions as possible. Anything one writes in that subset, and verifies, should be synthesizeable without modification and hand-tinkering, since those activities can be fertile sources of bugs.
• The above principle implies that we don't offer "extensions," since those would not be verifiable. Extensions for synthesis' sake that could be camouflaged from FDR2, say as comments, might be entertained in the future.
• We assume that users have access to the Formal Systems tools, so there are some things, such as channel statements, that cspt does not validate. Bypassing at least a run of checker before translating invites unnecessary problems.
The idea of a "synthesizeable subset" is also found in hardware synthesis. For example, VHDL was originally conceived as a specification language, and then became adapted for simulation. In recent years, CAD vendors have created synthesis products that generate digital circuits from structural or behavioural descriptions input in VHDL. There is no attempt to synthesize each and every VHDL construct, since the language was never created with that intention. Therefore, the vendors define their own synthesizeable subsets of VHDL. Similarly, CSP++ supports a subset of CSPm for software synthesis. Descriptions of supported constructs are divided below into four areas: events, processes, operators, and other language constructs.

2.1 Supported Events

In CSPm, the events collected into trace sequences are compound symbols made up of components separated by dots. The leftmost symbol is a channel name, and the components to its right (if any) are considered the channel's subscripts and/or data. In CSP++, we dub an event having no data—i.e. a bare occurrence of a channel name—an atomic event.
However, an atomic event may have subscripts. The distinction between subscripts and data in CSPm is blurry; we attempt to clarify it in CSP++ usage (see section 3.3 for full discussion). The designer's intent in using subscripts is likely to define a group of channels or events that have the same base name. CSP++ supports alphanumeric channel names that are accepted by the C++ compiler as valid variable names. Subscripts and data may comprise from 1 to n dotted components, where n is currently set at 10. The contents of subscripts and data components are determined by the datatypes supported by the translator. Currently, CSP++ supports only integer data.

2.2 Supported Processes

In CSPm, a powerful feature is the ability to write parameterized process definitions, including multiple definitions of the same-named process. CSP++ supports such overloaded definitions with 0 to n parameters, where n is currently set at 10. There are two restrictions regarding overloaded process definitions in CSP++:

• All definitions must have the same number of parameters.
• To work as expected, the most general definition should be coded last.
The first restriction means that the set of definitions P(1), P(2), and P(n) would be valid in the same specification, but P, P(i), and P(1,n) would not. The second restriction means that coding P(n) before P(1) and P(2) would result in the P(n) definition always being invoked, even by explicit statements such as a -> P(1), which would be contrary to the designer's intent. The cspt translator tells when a process invocation can be resolved at translation time, and when binding must be deferred to run time. In the latter case, a parameter table is generated for any sets of process definitions that require runtime binding.

Process definitions can be recursive, with tail recursion being handled very efficiently. Even infinite tail recursion results in no stack growth. In terms of special "built-in" process names, SKIP and STOP are supported. STOP aborts execution with a process status dump.
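As an illustration of these rules (our own example, not from the original), a counter process might be overloaded as follows, with the most general case coded last so that the specific case for 0 takes precedence:

    COUNT(0) = done -> SKIP          -- specific definition first
    COUNT(n) = tick -> COUNT(n-1)    -- most general definition last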
2.3 Supported Operators

CSP++ supports these operators:

• Prefix: event -> proc
• Conditional: if expr then proc1 else proc2, where expr is a relational expression
• Event renaming: proc[[oldname <- newname]]
• Event hiding: proc\{name}
CSPm’s relational operators (==, !=, <, >, <=, >=) and arithmetic operators (+, -, *, /, %) are recognized. Renaming and hiding can be inserted anywhere, using parentheses to designate their scope of application. All styles of composition are supported, including parallel ([| |]), interleaving (|||), and sequential (;). The one flavour of parallel syntax supported at present is interface parallel, where the set of synchronizing events is explicitly listed. Within that set, only bare channel names are permitted. The implication is that any event starting with a listed channel name will be a synchronizing event. The production syntax {| names |} is handled properly. Linked parallel and alphabetized parallel composition are not supported.
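For concreteness, the following fragment (our own illustration, not from the paper) uses only the supported interface-parallel form, with the synchronization set written in closure notation:

    P   = mid!1 -> P
    Q   = mid?y -> out!y -> Q
    SYS = P [| {| mid |} |] Q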
External choice ([]) is supported, but not internal (nondeterministic) choice (|~|). An important restriction is that the first event of alternative processes must be explicitly exposed using prefix notation. For example, let:

    P = a -> A
    Q = b -> B
Suppose my intention is to choose between P and Q. Simply writing (P [] Q) is not allowed. Instead, I must write (a -> A [] b -> B), thereby exposing the initial events of each alternative. This is to make it easy for the translator to identify the events that the choice depends on. In fact, it is equivalent to writing (a -> A | b -> B), which is valid CSP but not part of the CSPm dialect. Multiple alternatives can be written as (a -> A [] b -> B [] c -> C), and so on.

2.4 Other Constructs

cspt recognizes both single-line (--) and block ({- … -}) style comments. All declarative statements are ignored: nametype, datatype, subtype, and channel. Presently, these are treated as equivalent to single-line comments, so declarations stretching over multiple lines will regrettably result in syntax errors. At the current time, cspt does not need to interpret these declarations, but instead infers channel names from operations. Furthermore, all data is assumed to be of integer type. Assert statements (used by FDR2) are also ignored.

In summary, the restrictions detailed above do yield a valid subset of CSPm that can be input to checker, ProBE, and FDR2 without complaint from those tools.

3. Divergence from CSPm

In this section, the features of CSPm that are not fully supported by CSP++ are detailed. They are broken into subsections of unimplemented operators, process parameters, and channel I/O.

3.1 Unimplemented Operators

Some valid CSPm operators are not supported, due either to the translator's not handling the syntax, or to the framework's lack of a mechanism to implement the semantics. These are listed in three separate categories to help illuminate their current status and future prospects. The categories are arranged in order of increasing reluctance to tackle them.

3.1.1 Category: Planned for Later

Since data in CSP++ is handled via OO classes and polymorphism, adding support for additional datatypes into the runtime framework is not difficult. Expanded support will be targeted as the need is demonstrated by future case studies. Candidates from CSPm include sets, sequences, and simple enumerated datatypes. Character strings might be introduced as sequences of integer values. The Boolean guard (&) will be added; it is similar to the if … then construct already supported.
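As a sketch of the planned guard (our own illustration in standard CSPm syntax, not taken from the paper):

    GATE(n) = (n > 0) & enter -> GATE(n-1)
           [] (n == 0) & full -> SKIP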
Implementation of the interrupt (/\) operator is planned. It would be very useful, but the framework currently does not contain a mechanism to support it. For example, P/\Q would put P into a mode whereby prior to executing each event, it would check whether the first event of Q has occurred, and if so, terminate itself (as if P executed SKIP). Regarding UCFs linked from P's events, it will have to be decided whether a blocked UCF should be interruptible, perhaps with some optional cleanup feature.

3.1.2 Category: Low Benefit Cost Ratio

These include constructs that would admittedly be desirable to support, but whose benefits do not presently appear to justify the effort entailed. There are satisfactory workarounds for these cases. Other flavours of parallel composition, linked and alphabetized, could be added, but interface parallel is already satisfactory. Similarly, the lack of replicated operators can be worked around by writing out all the cases.

P [] Q is problematic to translate in the general case. If P and Q are defined so that their initial events are stated, well and good. But if not, locating the initial events requires considerable manipulation so as to rearrange the process definitions into head normal form [10]. That technique has not yet been pursued in the translator. To some extent, this is a result of the decision to make the C++ output of the translator closely correspond with the CSPm source input.

3.1.3 Category: Questionable in Synthesis Context

Nondeterminism, including internal choice (|~|) and "untimed" timeout ([>), falls into this category. While nondeterminism can be useful in specifications, it is difficult to think of a clearly appropriate treatment when synthesizing source code. Some constructs that are not inherently nondeterministic can become such in practice. For example, external choice, where the alternative events are the same, becomes nondeterministic: e->P [] e->Q. cspt does not detect such cases, and would handle this example by trying event e twice. If event e succeeds, P will be chosen. If the process has to wait on event e, then when e eventually occurs, P will still be chosen.

3.2 Process Parameters

For now, only integer values are allowed for process parameters. As datatypes are expanded, process parameters will accept non-numeric data. CSPm allows channel names as parameters, and this may also be implemented in CSP++.

3.3 Channel I/O

If any area of CSPm could be described as a quagmire for software synthesis, this is it. The problem of channel I/O, i.e. transferring data from one process to another, is that from the trace semantics viewpoint of CSP, there is honestly no such thing as "I/O," and ProBE and FDR2 reflect this well. To be specific, if a trace is observed to contain the event foo.1.2.3, there are many ways it could have got there:

• One process executed foo.1.2.3
• Two processes synchronized on foo.1.2.3
• One process output foo!1.2.3, and another input foo?x, or foo?1.2.y, or even foo.1!2?z
• Two processes synchronized on foo.1.2.3, and a third input foo?x
Many other combinations are possible, including what could be called "mixed mode" transfers where operators ostensibly calling for output (!) appear alongside input (?) operators in the same event expression. Furthermore, in interpreting a compound (dotted) event, one cannot say by inspection whether some or all components are intended to function as 1- or n-dimensional subscripts of the channel name, or whether some or all components are to be considered as data values. It is not difficult to write obscure-looking specifications using these capabilities.

This free-for-all should be contrasted with the original straightforward meaning of "channel" in CSP: a channel was intended to be a primitive structural component in the design of a system, dedicated to one-way, unbuffered, point-to-point data transfer between a particular pair of processes. This kind of definition is extremely easy for system designers to understand and utilize; therefore, it is attractive to implement for the purpose of software synthesis. The key problem is that channel I/O is, in effect, a metaconcept layered on top of pure event synchronization, and when one looks solely at traces, I/O is found to have dissolved and disappeared. Since ProBE and FDR2 are engaged in state exploration, and since states are represented by traces, it is natural that those tools focus on events, and thus treat I/O in a highly generalized fashion that can barely be recognized as such by programmers. The result is that in ProBE and FDR2, "I/O" operations are treated as pattern matching on events, where "output" (!) asserts components that must match, and "input" (?) designates wildcards that always match, provided any accompanying input datatype constraints are satisfied. After a match has been identified among multiple processes, the full compound event goes into the trace, and any wildcarded components (variables) are bound to copies of the corresponding event components.

From the synthesis standpoint, it was judged that implementing ProBE/FDR2 style pattern matching for events would burden the runtime mechanism with high overhead. Furthermore, it was doubted that such generality was needed or even desirable in practical systems. Instead, CSP++ for the most part reverts to the original meaning of channel I/O, which is a valid subset of CSPm in any case. The following restrictions have been adopted:

• cspt distinguishes between "atomic" events meant only for synchronization, and "channel" events meant for either input or output.
• The general form of an atomic event is: chan[.s]*, where s is a numeric subscript and []* represents zero or more instances.
• An output event is: chan[.s]*!d[.d]*, where s is as above, and the d's are data values—numeric expressions or bound variables.
• An input event is: chan[.s]*?v[.v]*, where s is as above, and the v's are unbound variables.
• An output event can transfer multiple data components into a single variable and vice versa. In this skeletal example (which does not work exactly as written), (cc!1.2.3 || (cc?x -> dd!x) || dd?a.b.c), x would receive 1.2.3, and then a, b, and c would receive 1, 2, and 3, respectively.
• For synchronization and communication purposes, the channel name and all subscripts must match. The synchronization set for interface parallel composition should contain either the bare (unsubscripted) name of an atomic event {foo}, or else the channel name within the closure set (production) notation {|chan|}, which will cover all variants of subscripts and data values.
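To make the adopted conventions concrete, here is a small fragment of our own (not from the paper). It uses one subscripted channel, a single transfer direction per event, and closure notation in the synchronization set:

    SENSOR = temp.1!42 -> SENSOR            -- channel temp, subscript 1, outputs 42
    LOGGER = temp.1?v -> log!v -> LOGGER    -- matching subscript, unbound variable v
    SYS    = SENSOR [| {| temp |} |] LOGGER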
Thus it will be seen that subscripts, if any, must appear before an I/O operator, and that only a single operator, and therefore transfer direction, is allowed. The number of subscripts that appear with a given atomic or channel name must be consistent, or a translation error will result.

These restrictions impose considerable clarity on the usage of channels in a specification. While it may be advisable to use a given channel only for unidirectional communication between a particular pair of processes, the translator does not enforce this. Indeed, broadcast I/O is easy to arrange by means of one outputting process and multiple inputting processes. However, multiple outputters of the same event are not allowed and will result in a runtime error.

4. Extension of CSPm via User-coded Functions

The ability to link CSPm events with UCFs is an essential ingredient of selective formalism. The basic idea is easy to explain: when CSP statements are used to model the behaviour of a system, the executions of named events in CSP are intended for two purposes: (1) to synchronize and communicate with other CSP processes; and (2) to mirror what the system does in reality. We could say that purpose (1) is for internal use within the specification, but purpose (2) is for external use. Thus, in the classic vending machine example, a coin.25 event corresponds to the customer inserting a quarter, a choc event to pressing the chocolate candy button, and so on. The concept of user-coded functions is essentially to provide some C++ code to bridge the gap between the named CSP events and, in this case, the electronic switch inputs.

Just as two purposes for using events were identified in the previous paragraph, CSP++ makes the restriction that events can be used either for internal synchronization and communication, or for linking to UCFs. Actually, the step in the design flow where the environment model is removed frees up events that were synchronizing with the simulated environment to be used externally with the real environment. To put it another way, removing the environment model converts the events that were synchronizing it with the implementation model from purpose (1) events into purpose (2) events that are now candidates for linking with UCFs. At first glance, this restriction may seem purely arbitrary. This question will be revisited below, along with other issues raised by UCFs, after first looking more closely at what UCFs can be used for.

4.1 Nature of UCFs

From the beginning of CSP++ development, it was intended that UCFs be put to practical use in two primary roles, I/O and computation. The first role extends CSPm by providing an interface to external hardware and software. The second role is an escape hatch from CSPm – which was never intended to be a full-featured programming language – allowing programmers to switch into C++ for tasks that would be too awkward to express in CSPm, or too inefficient for execution in translated form.

Under the first role, three flavours of UCFs can be recognized, according to the three types of events that invoke them. This is how their UCFs are invoked by CSP++:

1. Atomic event: call UCF, which returns when its processing is "done"
2. Channel input: call UCF, which returns when input has been obtained; input data is bound to channel's variables
3. Channel output: call UCF with output values as arguments; UCF returns when output has been accomplished
Case 2 of channel input may involve blocking the process (thread) that is executing the event, but other processes will continue to execute. Timeouts and interrupts are not currently implemented in CSP++, but when they are, this raises the issue of applying them to blocked UCFs.

In case studies to date, this first role has worked well, but plans for UCFs in the second role proved to be too simplistic. The basic problem is illustrated by the following example: suppose my e-commerce system needs to calculate the sales tax for a purchase based on the price of the goods and the country they will be shipped to. This calculation would be nicely implemented by looking up the tax rate in a table and doing a multiplication. To represent the lookup table in CSPm would be annoying, and there are no safety or deadlock properties at stake, so this should be a perfect opportunity to drop out of CSPm into a C++ UCF. But how do we write the UCF-linked events in CSPm? The two tools at our disposal are atomic events and channel I/O. The way to make channel I/O work is by visualizing a black-box "ComputeSalesTax" process that has an input channel (for the price and country code) and an output channel (for the tax). Then we might code the following to link to the two UCFs:

    MARKUP(price, destination) = putprice!price.destination -> gettax?tax -> ...
The problem here is that the mythical ComputeSalesTax process has to keep track of internal state between the calls to the two UCFs linked to putprice and gettax. In the current version of CSP++, this is left for the programmer to accomplish by means of static storage shared by the two UCFs. This is not very satisfactory, since in the general case the UCFs could be invoked at any time from multiple processes. Probably what is needed is a secure mechanism for the framework to furnish storage to such UCFs on a per-process basis, perhaps by extending the member data of the object that represents the process executing the event.

The above illustrates the case where the UCFs are successively invoked from the same process (i.e. the ends of the channels to and from the "black box" reside in the same CSP process). There is another case, though. Suppose we wish to use UCFs to implement a queue data structure. Then the ends of the enqueue and dequeue channels will very likely be in different CSP processes. What we're proposing here is to replace an entire CSPm process with C++ code. This makes sense under two conditions: 1) the replaced process doesn't need its own thread of control; and 2) it was earlier represented as a CSPm process that was subjected to verification, and we are convinced that the C++ replacement is equivalent. It may be worth building up a library of tested UCFs, for example, of data structures, that are known to be equivalent to given CSPm processes.
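To illustrate the static-storage workaround for the sales-tax example described above, the two hypothetical functions below share a file-scope variable; the names and signatures are our own assumptions, not the actual CSP++ UCF linkage API, and the data is kept integer-only as the translator requires:

    // Shared static state bridging the two UCFs; as noted above, this is
    // unsafe if several processes invoke them concurrently.
    static int pending_tax;

    // Hypothetical table lookup; rates scaled by 100 to stay integer-valued.
    static int tax_rate_percent(int destination) {
        return (destination == 1) ? 5 : 13;
    }

    void putprice_ucf(int price, int destination) {  // linked to putprice!p.d
        pending_tax = price * tax_rate_percent(destination) / 100;
    }

    int gettax_ucf() {                               // linked to gettax?tax
        return pending_tax;
    }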
4.2 Issues Raised by UCFs

This subsection is organized as a series of four questions and answers.

1. How can we be sure that UCFs are not breaking the formalism, or giving us a mere veneer of verification?

Since UCFs are replacing abstract named CSPm events that have no intrinsic meaning, it does not really matter what UCFs do, with one exception: they must not go "behind the back," so to speak, of the CSPm control backbone by engaging in interprocess synchronization or communication. As long as that principle is not violated, any formal properties verified on the CSPm specification should still apply to the synthesized system.

2. For input-linked UCFs, which party is responsible for validating input, the C++ or the CSPm?

Validation can be done at either level. As an example, suppose we code the following specification:

    datatype Num = {1,2,3,4}
    channel button : Num
    GETINP = button?x:Num -> PROCESS(x)
When running ProBE or FDR2, if the environment of GETINP were to offer to engage in button!5, no synchronization would take place. But the cspt translator ignores channel declarations and datatypes, so if a UCF were linked to button?x, could it return 5 in x? It could, but it should not. To obey the spirit of CSP, the UCF should validate its input, ensuring that a value outside the legal range is never returned to the control backbone. Alternatively, validation code can be written at the CSPm level, and UCF-linked events can be used to reflect error conditions to the environment.

3. Events linked to UCFs currently cannot participate directly in choice. Why is that? Can this restriction be overcome?

The reason for this restriction is that choice is implemented by "trying" an event (i.e. offering to engage in it), and if it succeeds (meaning the offer is accepted), the successor process is executed. If it does not succeed, each alternative is tried in turn. If none are found to succeed, the process is blocked with all alternatives remaining on offer until one is accepted. This kind of try-and-back-out protocol is difficult to coordinate with UCFs, since their current calling sequence is designed to be exercised on a one-shot basis. A more complex calling sequence, which allows direct participation in choice, may be provided in a future version of CSP++. For example, this would be compatible with the programming of polled input.

4. Events linked to UCFs cannot also be used internally for interprocess synchronization. Can this restriction be overcome?

It is likely that the main circumstance where this need would arise is when a constraint model is involved. Removing the environment model would normally take away the internal use, but if a constraint model is present, the event may still be needed to synchronize with those processes as well as to communicate with the environment. If we allowed UCF-linked events to also synchronize with other CSPm processes, what would be the implications? To answer this, we must start by identifying the precise time when a UCF involved in synchronization should be called. The only sensible plan is to call the UCF after the (two or more) parties arrive at the rendezvous, and of course it must be called exactly once, in order to properly reflect CSP trace semantics. Now let's look at the possible participating events and decide what useful interpretations could be played out:

• Atomic events: After recognizing that it is the last party to arrive at the rendezvous, the active party would call the UCF, and then complete synchronization processing (including waking up the other parties).
• All parties are doing input (?): This is the broadcast case, from the outside environment to multiple internal processes. The active party would call the UCF and transfer the returned input to all parties, and then complete synchronization processing.
• Multiple parties are doing output (!): This is not allowed in CSP++ (see section 3.3 above).
• One party is doing output, other parties are doing input: This is also a broadcast case. The active party (who, as the last one to arrive at the rendezvous, knows the output values) would call the UCF to perform the output externally, and then transfer the output to the inputting parties prior to completing synchronization processing.
The above analysis shows that lifting the restriction could be worthwhile. But the programmer would need to understand clearly, on a case by case basis, exactly what a linked UCF was expected to do.

5. Tested Platforms and Performance

By now, CSP++ has been ported to and tested on several different Unix variants, several case studies have been created, and some performance measurements have been taken. These three topics are presented below.

5.1 Platforms

Since CSP++ is currently based on GNU Pth threads, in principle it should be able to run on any platform that Pth supports. So far it has been confirmed to work on Solaris 9 (i86), Redhat Linux 9, Fedora Core 3, and Gentoo Linux, coupled with Pth-2 and the gcc-3 C++ compiler. It is available from the author's website in a zip archive including:

• cspt compiler (binary executable)
• CSP++ framework (C++ header files and object library for classes)
5.2 Case Studies

Three case studies have been created. Each one features an initial design made in StateCharts and the derived CSPm statements. In fairness, these are still at the level of "toy" systems, chiefly for proof-of-concept purposes. They demonstrate CSP++ translating and executing the full range of CSPm operators, and the integration of user-coded functions. The referenced papers all have samples of CSPm and translated C++ code.

• DSS, Disk Server Subsystem—The implementation model includes a disk scheduler and request buffer, with simulated disk driver and simulated clients [7][1][4]. It was originally coded using csp12, but has been recoded in CSPm.
• ATM, Automated Teller Machine—The CSPm includes some verification assertions, and the user-coded functions communicate with a MySQL database [11][12].
• POS, Point-of-Sale Cash Register—This system (in progress) is based on porting CSP++ to uClinux for the Xilinx MicroBlaze embedded processor core implemented on a Virtex-II FPGA [13].
The CSPm and C++ source code for DSS and ATM are available for downloading from the author’s website.
5.3 Performance

The DSS case study has been useful for performance metrics, being easy to exercise in a loop (e.g. 20,000 simulated disk requests). In order to make a comparison with a similar-purpose commercial synthesis tool, the DSS system, going back to its StateCharts model, was input to Rational Rose RealTime (RRT, now called Rational Technical Developer). RRT accepts StateCharts as part of a UML model, and generates C++ source code that compiles and links with its own message-driven runtime framework. The comparison is not ideal, since the operating systems differed (Linux vs. Windows) and also the compilers (g++ vs. Microsoft Visual C++), but tests were performed on the same hardware platforms. The timings are shown in Table 1.

Table 1. Timing for 20,000 Repetitions of DSS

    Tool        Run Time   Operating System, Threads      Compiler, Optimization
    CSP++ 2.1   1.60 s     Redhat Linux 9, LinuxThreads   gcc 2.96 -O2
    RRT         1.47 s     Windows XP                     MS VC++ 6.0
    CSP++ 4.0   27.03 s    Redhat Linux 9, Pth            gcc 3.2.2 -O2
In measurements with an earlier version 2.1 of CSP++ based on LinuxThreads, the CSP++ implementation of DSS was comparable to the RRT implementation in run time. After porting to Pth, performance deteriorated alarmingly; the cause is under investigation. If Pth is the culprit, another portable thread package will be sought.

6. Related Work

One category of related work is based not on coding in CSP directly, but on providing a library of classes or functions for conventional programming languages that obey CSP's semantics. Rather than promoting direct verification of specifications, this is more an attempt to give software practitioners reliable, well-understood components to build with. Examples of libraries inspired by CSP communication semantics include, for Java, CTJ (formerly called CJT) [14], JCSP [15], and JACK [16]; for C, CCSP [17] and libcsp [18]; and for C++, C++CSP [19] and CTC++ [20]. JCSP and CCSP are a related tool family, as are CJT and CTC++.

Another category features a "straight line" route to verification, like CSP++'s approach, starting with CSP that can be directly verified, and carrying out automatic translation to an executable program. An older tool called CCSP [21] translated a small subset of CSP to C. Recently, the emergence of first-category libraries has facilitated this strategy, and there is now direct translation of CSPm into Java (based on CTJ and JCSP) and C (based on the newer CCSP) [22].

7. Future Work

A good deal of future work has already been implied above in the listing of "divergences." Another potentially fruitful area is performance optimization. Currently, the runtime framework always carries out full environment searching for every event. This allows for dynamic process creation, recursion, and application of renaming and hiding. However, this capability represents overkill for many applications, since CSPm is often used to initially construct a static process structure which is subsequently maintained throughout execution. In that typical system architecture, the translator would be capable of identifying and
binding synchronizing events to one another at translation time, rather than letting the framework search for them over and over again. This would result in significant savings at run time.

CSP++ has always been aimed at embedded systems, but application to real-time systems will require introducing some notion of time. CSP++ is based on the original CSP notation, which does not explicitly model time. While it is already possible to synthesize specifications based on "tock" timing [9], the constant synchronizations on a periodic tock event throughout the specification would be grossly inefficient. Instead, it is probably preferable to implement operators from Timed CSP [23]. However, this raises the question of verification, since the Formal Systems tools do not recognize those operators. Adding timed operators to CSP++ would likely suit it for building "soft" real-time systems, but it will probably not be possible to offer the latency guarantees required for "hard" real-time applications.

Further on the theme of targeting embedded systems, porting of CSP++ to an SoPD (system on programmable device) platform is underway [13]. If Pth proves too difficult to port to this platform, there is the option of porting the framework's thread model to a suitable RTOS. This can be accomplished by changing only the task class.

Finally, some work has been reported in synthesizing hardware circuits from CSP via Handel-C, an algorithmic hardware description language that has CSP-like constructs [24]. We would like to partition a CSPm specification into software- and hardware-based processes, and synthesize the channel communication between them. This falls under the heading of hardware/software codesign [25]. The aim is to make CSP++ useful for building embedded systems with both hardware and software components, and for SoC (system on chip).

8. Conclusion

To return to the question posed by the title, how faithful is CSP++ to CSPm? The short answer is, faithful enough to be useful. The longer answer is, it doesn't do everything CSPm does, but results suggest that the subset it does do replicates the semantics of CSP. Admittedly, this has not been formally proven.

The development of CSP++ has shown that selective formalism based on software synthesis can be a viable software development technique. Furthermore, the recent commercialization of some CSP-based toolkits indicates that some in industry are seeing practical value in CSP-based approaches. But how can more acceptance of such approaches be achieved? The rest of this conclusion speculates on this topic.

First, we could point out that even without carrying out verification, which admittedly takes training to do well, the CSP++ approach is attractive in its own right. Here are several reasons:

1. Some verification is "automatic" anyway, particularly checking for deadlocks, so if one uses CSPm and FDR2, that will come as a beneficial side effect.
2. Software synthesis is a productivity tool and a way of maturing the software engineering process by putting more emphasis on the specification as the primary design artifact.
3. CSP is a natural, disciplined way to organize the design of concurrent systems, and should make them more reliable, even without verification.
4. CSP is not one of the more obscure formal notations, therefore portions of CSPm specifications can be shown to clients as a way of getting to the bottom of what they really mean by prose requirements.
5. StateCharts are also a nice way to design systems and are useful to show people, and it is easy to convert StateCharts to CSPm for the purpose of software synthesis via CSP++.

Undoubtedly, using CSP with verification is much better than without. While the paradigm of selective formalism means that a company would not have to train every software developer in CSP, some CSP gurus would necessarily be required. What human organizational elements are needed to facilitate this? First of all, it's easy to speculate that sending people for one or two complete university courses in formal methods and CSP is not going to have wide appeal to many managers. Therefore, we would like to find effective ways to bring a typical college-trained programmer up to a level of competency in CSP sufficient to understand, write, and verify CSPm specifications. For this purpose, it is unnecessary to understand deeply the theory of CSP or be able to do proofs. One does have to learn the operators, see and write samples of code according to the "four roles of CSPm" (section 1.2), plug them into ProBE and play with them. In terms of CSP++-specific training, they must learn how to use the synthesis tools and how to link in user-coded functions that obey the restrictions. The concepts behind formal verification are more abstract, but minimal competency using FDR2 is also important. This includes learning how to make simplifications for the sake of verification. Even if the subset of gurus who handle the verification is small, the under-guru level of CSPm practitioners should at least understand what formal verification is about.

From the standpoint of training a cadre of CSPm practitioners, we feel that existing literature on CSP is largely missing a "cookbook" aspect comparable to the popular "Gang of Four" design patterns book [26]. The purpose of that book was to enlighten programmers who already knew the basics of object-oriented programming that "to accomplish common task X, with which you're likely familiar, you code up your classes thusly." This kind of cookbook approach spares programmers from "reinventing the wheel" and, more important, enlightens them on different useful models of "wheels" they would not have imagined for themselves. Can a similar kind of "CSP design pattern cookbook" be provided for would-be CSPm programmers? This would be a great help in popularizing CSP-based techniques, such as CSP++.

Acknowledgments

This research was supported by NSERC (Natural Science and Engineering Research Council) of Canada.

References
[1] W.B. Gardner and Micaela Serra. CSP++: A Framework for Executable Specifications, chapter 9. In Fayad, M., Schmidt, D., and Johnson, R., editors, Implementing Application Frameworks: Object-Oriented Frameworks at Work. John Wiley & Sons, 1999.
[2] Mantis H.M. Cheng. Communicating Sequential Processes: a Synopsis. Dept. of Computer Science, Univ. of Victoria, Canada, April 1994.
[3] FDR2 web site, Formal Systems (Europe) Limited. http://www.fsel.com [as of 5/16/05].
[4] W.B. Gardner. Bridging CSP and C++ with Selective Formalism and Executable Specifications. In First ACM & IEEE International Conference on Formal Methods and Models for Co-design (MEMOCODE '03), Mont St-Michel, France, June 2003, pp. 237-245.
[5] C.A.R. Hoare. Communicating Sequential Processes. Prentice Hall, 1985.
[6] W.B. Gardner. Converging CSP Specifications and C++ Programming via Selective Formalism. ACM Transactions on Embedded Computing Systems (TECS), Vol. 4, No. 2, May 2005, pp. 1-29. Special Issue on Models & Methodologies for Co-Design of Embedded Systems.
[7] W.B. Gardner. CSP++: An Object-Oriented Application Framework for Software Synthesis from CSP Specifications. Ph.D. dissertation, Dept. of Computer Science, Univ. of Victoria, Canada, 2000. http://www.cis.uoguelph.ca/~wgardner/, Research link.
[8] GNU Pth – The GNU Portable Threads. http://www.gnu.org/software/pth/.
[9] Failures-Divergence Refinement: FDR2 User Manual, May 2, 2003, Formal Systems (Europe) Ltd.
[10] A.W. Roscoe. The Theory and Practice of Concurrency. Prentice Hall, 1998.
[11] S. Doxsee and W.B. Gardner. Synthesis of C++ Software from Verifiable CSPm Specifications. In 12th Annual IEEE International Conference and Workshop on the Engineering of Computer Based Systems (ECBS 2005), Greenbelt, MD, Apr. 4-5, pp. 193-201.
[12] S. Doxsee and W.B. Gardner. Synthesis of C++ Software for Automated Teller from CSPm Specifications. 20th Annual ACM Symposium on Applied Computing (SAC '05), Track: Software Engineering: Applications, Practices, and Tools, poster paper, Santa Fe, NM, Mar. 2005, pp. 1565-1566.
[13] J. Carter, M. Xu, and W.B. Gardner. Rapid Prototyping of Embedded Software Using Selective Formalism. In 16th IEEE International Workshop on Rapid System Prototyping (RSP 2005), Montréal, June 8-10, pp. 99-104.
[14] G. Hilderink, J. Broenink, W. Vervoort, and A. Bakkers. Communicating Java Threads. Proc. of the 20th World occam and Transputer User Group Technical Meeting, Enschede, The Netherlands, 1997, pp. 48-76.
[15] P.H. Welch and J.M.R. Martin. A CSP Model for Java Multithreading. International Symposium on Software Engineering for Parallel and Distributed Systems (PDSE 2000), Limerick, Ireland, 2000, pp. 114-122.
[16] L. Freitas, A. Cavalcanti, and A. Sampaio. JACK: A Framework for Process Algebra Implementation in Java. Proceedings of XVI Simpósio Brasileiro de Engenharia de Software, Sociedade Brasileira de Computacao, Oct. 2002.
[17] J. Moores. CCSP—A Portable CSP-based Run-time System Supporting C and occam. In B.M. Cook, editor, Architectures, Languages and Techniques for Concurrent Systems, vol. 57 of Concurrent Systems Engineering Series, WoTUG, IOS Press, Amsterdam, The Netherlands, April 1999, pp. 147-168.
[18] R.D. Beton. libcsp—A Building mechanism for CSP Communication and Synchronisation in Multithreaded C Programs. In P.H. Welch and A.W.P. Bakkers, eds., Communicating Process Architectures 2000, vol. 58 of Concurrent Systems Engineering Series, IOS Press, Amsterdam, The Netherlands.
[19] N.C.C. Brown and P.H. Welch. An Introduction to the Kent C++CSP Library. In J.F. Broenink and G.H. Hilderink, eds., Communicating Process Architectures 2003, vol. 61 of Concurrent Systems Engineering Series, IOS Press, Amsterdam, The Netherlands, September 2003, pp. 139-156.
[20] J.F. Broenink, D. Jovanovic, and G.H. Hilderink. Controlling a Mechatronic Setup Using Real-time Linux and CTC++. In S. Stramigioli (Ed.), Proc. Mechatronics 2002, Enschede, The Netherlands, pp. 1323-1331.
[21] B. Arrowsmith and B. McMillin. How to Program in CCSP. Technical Report CSC 94-20, Department of Computer Science, University of Missouri-Rolla, August 1994.
[22] V. Raju, L. Rong, and G.S. Stiles. Automatic Conversion of CSP to CTJ, JCSP, and CCSP. Communicating Process Architectures 2003, vol. 61 of Concurrent Systems Engineering Series, IOS Press, 2003.
[23] Steve Schneider. Concurrent and Real Time Systems: The CSP Approach. John Wiley & Sons, Inc., New York, NY, 2000.
[24] Jonathan D. Phillips and G.S. Stiles. An Automatic Translation of CSP to Handel-C. Communicating Process Architectures 2004, vol. 62 of Concurrent Systems Engineering Series, IOS Press, pp. 19-37.
[25] Frank Vahid and Tony Givargis. Embedded System Design: A Unified Hardware/Software Introduction. John Wiley & Sons, 2002.
[26] Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides. Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley, 1995.
Communicating Process Architectures 2005 Jan Broenink, Herman Roebbers, Johan Sunter, Peter Welch, and David Wood (Eds.) IOS Press, 2005 © 2005 The authors. All rights reserved.
Fast Data Sharing within a Distributed, Multithreaded Control Framework for Robot Teams

Albert SCHOUTE a,1, Remco SEESINK b, Werner DIERSSEN c and Niek KOOIJ d

a University of Twente, b Atos Origin NL, c Triopsys NL, d Qmagic NL
Abstract. In this paper a data sharing framework for multi-threaded, distributed control programs is described that is realized in C++ by means of only a few, powerful classes and templates. Fast data exchange of entire data structures is supported using sockets as communication medium. Access methods are provided that preserve data consistency and synchronize the data exchange. The framework has been successfully used to build a distributed robot soccer control system running on as many computers as needed.

Keywords. robot soccer, control software, distributed design, data sharing, multithreading, sockets
Introduction

This paper describes the control software framework of the robot soccer team Mobile Intelligence Twenty (MI20). Many different types of robot soccer competitions are organized by international associations [1], [2] with varying game and hardware rules. Our team competes in the FIRA MiroSot league, in which small-sized, wheeled robots are controlled based on localization by a central camera system. The application is representative for control systems that heavily rely on globally shared sensor information.

In contrast to the centralized way of robot localization, the team control system is designed in a distributed way, where separate single- or multi-threaded programs control distinct parts of the system. The big advantage of this design is that we can run our system on as many computers as we think is necessary. So if some tasks are very computationally demanding, for instance robot tracking, path planning or playing strategy, we can run the programs on separate computers.

Distributed software design has many more advantages, but also one big disadvantage: it complicates data sharing. Because many threads have to share common data, they will communicate quite intensively. Therefore we need to find a very fast way of exchanging data. We chose to use sockets as a communication medium, because they can provide fast communication.

The second important issue in our design is that we exchange entire data structures. Because the layout of the data structure is known on both sides of the communication channel, we can address members of the structured data without using functions, which provides good speed performance. Functionality is added to automatically maintain data consistency between application threads that access the data structures and communication threads that exchange the data. The application programmer can use safe access methods without having to bother about thread interference. This way we have achieved a fast and reliable system that we can expand or change, without the need of redesigning the system.

1 Corresponding Author: University of Twente, Dept. of Computer Science, Postbox 217, 7500 AE Enschede, The Netherlands. Email: [email protected]

1. Application

1.1 The Robot Soccer Game Environment
Figure 1. MiroSot League competition set-up
In the MiroSot league the robot size is limited to cubes with maximum measure 7.5 cm. Competition categories differ with respect to robot team sizes (3, 5, 7 or 11 players) and the matching dimensions of the playground. Our robots have an onboard DSP-processor that takes care of wheel velocity control and wireless communication. A digital camera above the playground captures images that are processed by the team computer(s) that steer(s) the team of robots. Robots are recognized by means of colour patches on their top surface. The game is played with an orange golf ball. Wheel velocity set-points are sent to the robots by a radio link, each team using a different frequency.
Figure 2. Impression of the real game situation
1.2 Requirements

The design of the data sharing framework has been influenced to a large extent by the requirements of the robot soccer application. Let us consider the main aspects.

First of all, in a game situation it is important to continue under all circumstances, even if certain robots are not functioning properly, if disturbing events happen on the playground or processes deteriorate. Only a human arbiter may interrupt - according to the playing rules - the fully computer-controlled game. A distributed, concurrent design with independently operating components will contribute to the robustness of the system.

Furthermore, the application must be highly reactive and requires fast responsiveness to the actual situation. Image data need to be processed at the camera frame rate (30 frames per second). Due to many circumstantial influences, for example lighting conditions, data may be unreliable and must be filtered. State information should reflect the real-time situation as closely as possible. The rate at which robots receive control data depends on the team size and typically lies in between 10 to 20 set-points per second. For the application it is important that the most recent sensor data and updated state information is made available throughout the system as fast as possible.

State information in this kind of application has a permanent structure and is maintained in globally known data types. Sharing of state information in a flexible way implies that arbitrarily many concurrently running threads can access common data structures asynchronously. If the system is distributed over multiple programs, possibly running on different computers, we still want to be able to share common data structures. The data content has to be proliferated to "mirror" the same data at different places. Of course updating and exchanging shared data must be organized in a sensible way. So, the application programmer is responsible for defining data structures as being common and establishing communication processes that create a "refreshment chain" by which updates are proliferated. We require that the content of shared data structures is "near-time equivalent", which means that reading threads obtain a recently written data content. A reading process may also require getting the next refreshed data instance.

1.3 Solution Approach

The framework presented provides the tools to manage the shared data access and proliferation in an easy, efficient and safe way. Several practical implementation decisions are made to make the data sharing as fast as possible in the context of C++ based programming and Linux based multithreading. The main approach can be stated as a combination of:

1. a shared memory access model within a single multithreaded program
2. a socket communication model to exchange common data structures (in binary form) between program variables of compatible type

Ease of programming is reached by making the access to shared data structures transparent to whether a common data structure is accessed by threads within only one or within multiple programs. In the latter case the same data structure is defined in each program and the content is mirrored in some sense. But the access method of remotely operating threads remains the same.

We do not intend to introduce a new concurrency concept. Nor do we claim that our implementation presents a unique, novel solution. Similar distributed data sharing facilities can be provided by using other programming concepts and tools. In this respect, a comparison with other approaches has yet to be made. Our purpose is to offer a fast and practical solution in the given object-oriented context for distributed software development while preserving efficient ways of data sharing. In object-oriented programming environments like Java and .NET, so-called "object streams" are supported. Complete objects are "serialized" into a string representation to be transported over a network.
The associated overhead, however, does not comply with the needs of real-time control applications.
2. Framework Components

The distributed data sharing framework is realized in C++ by means of only a few, powerful classes and templates:

• a super-class Cthread that enables threads to start, stop, pause and resume
• a class Cmutex to exclusively lock data and wait on or signal data presence or renewal
• a template class Csafe usable for any type of shared, structured variable to enforce safe access (sketched below)
• a super-class Csocket to instantiate threads that operate on sockets
• template classes Ccommunication_sender and Ccommunication_receiver to instantiate communication threads that send or receive the content of a "safe variable" over a socket
• a super-class Cexception to keep error management simple while acting appropriately on different sorts of exceptions
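Based on the description that follows (Csafe derives from Cmutex and merely adds a typed "value"), a minimal sketch of the template might look like this; only the member name value is grounded in the usage examples below, and everything else about the real header is assumed:

    // Minimal sketch, not the actual two-page CSP++-style header.
    template <class T>
    class Csafe : public Cmutex {
    public:
        T value;  // the shared datum; access only between lock() and unlock()
    };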
Thread instances of Cthread are actually based on Posix compliant threads, known as pthreads [3]. Linux supports multithreading by running pthreads as kernel processes [4]. The pthread-package supplies synchronization functions for exclusive access to class objects according to the monitor concept [5].

The power of the framework doesn't result from each of the classes alone. It results from their combined use by fully exploiting all the nice features of C++ like function inheritance, type-independent template-programming and function overloading. For example the template declaration Csafe, being a derived class of Cmutex, creates exclusive access to arbitrarily typed variables. Basically it adds a "value" of any type to an instance of class Cmutex. This "mutex" instance functions as exclusion and synchronization monitor for the added "value". The template declaration of Csafe is contained in nothing more than a two-page header file. This provides the basic locking mechanism to preserve data consistency of shared variables accessed by multiple threads. Moreover, the wait and signal functions of class Cmutex (again based on the pthread-package) automatically take care of condition synchronization between asynchronously reading and writing threads. In the Cmutex class a single private condition variable is defined on which threads will block when calling the wait function. The solution resembles object synchronization as made implicitly available in Java [6]. By defining any variable as "Csafe" and obeying the usage protocol as shown in the next section, the programmer can rely on the mechanism to guarantee safe and synchronized access.

The safe access mechanism is applied to the framework itself to extend its power even more. Thread instances of class Cthread are represented by underlying pthreads that can be created, paused or stopped. Their state can be dynamically changed by other threads and hence the state variable is implemented as a Csafe object. Only if a thread of class Cthread is executing its "run-function" is the underlying pthread needed and actually present as a Linux process. The introduction of a function run() as "actual body" of a thread is borrowed from Java.

3. Framework Usage

When reading or writing a Csafe-variable X, exclusive access needs to be established by explicitly calling locking and unlocking functions as follows:
A. Schoute et al. / Fast Data Sharing in a Control Framework for Robot Teams
151
  X.lock();
  /* now X.value can be read or written safely */
  X.unlock();
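To make the mechanism concrete, the following is a minimal sketch of how such a pair of classes could be built directly on the pthread primitives. It is an illustration under assumptions, not the framework's actual source: only the operations used in this paper (lock, try_lock, unlock, wait, signal, broadcast) are shown, and the class layout is guessed from the description above.

  #include <pthread.h>

  // A monitor coupling one mutex with one private condition variable,
  // as described for class Cmutex above.
  class Cmutex {
  public:
      Cmutex()  { pthread_mutex_init(&m, 0); pthread_cond_init(&c, 0); }
      ~Cmutex() { pthread_cond_destroy(&c); pthread_mutex_destroy(&m); }
      void lock()      { pthread_mutex_lock(&m); }
      bool try_lock()  { return pthread_mutex_trylock(&m) == 0; }
      void unlock()    { pthread_mutex_unlock(&m); }
      void wait()      { pthread_cond_wait(&c, &m); }   // call with lock held
      void signal()    { pthread_cond_signal(&c); }     // wake one waiter
      void broadcast() { pthread_cond_broadcast(&c); }  // wake all waiters
  private:
      pthread_mutex_t m;
      pthread_cond_t  c;
  };

  // Csafe simply adds a "value" of any type to the monitor.
  template <class T>
  class Csafe : public Cmutex {
  public:
      T value;   // read or write only between lock() and unlock()
  };

With such declarations, Csafe<int> counter; gives exactly the usage protocol shown above: counter.lock(); counter.value++; counter.unlock();.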
It has been considered to perform locking implicitly and hide it from the programmer. However, this is rather a burden than an advantage when accesses are more complex, and a mixture of explicit and implicit locking would be even more confusing. Explicit locking has therefore been adopted as the most transparent, flexible and efficient solution, although it is not enforced automatically. If it is important to keep the locking period short, the programmer can make a local copy.

In the context of a dynamic application like robot soccer, fast asynchronous updating of state information is an important issue. The synchronization properties inherited from the Cmutex class make signaling of, and waiting on, data renewal very straightforward. The program sequences in Table 1 show how a reading thread waits for renewed data to become available and how a writing thread signals its renewal.

Table 1. Reader / Writer Synchronisation

  Reader                        Writer

  X.lock();                     X.lock();
  X.wait();                     /* writing of X.value */
  /* reading of X.value */      X.signal();
  X.unlock();                   X.unlock();
On this schema many variations are possible. If multiple threads may be waiting on the same variable X, the writer should issue X.broadcast() instead of X.signal(): broadcast wakes all waiting threads, whereas signal wakes only a single one. In fact, the most robust way of programming is always to use X.broadcast(). A thread may also read or write a new value only if the variable X is not locked, by using the function X.try_lock() instead of X.lock(). This can be desirable in order to avoid locking delays when data has to be captured and distributed in real time. Figure 3 reflects the case where a camera thread distributes images to multiple “subscriber threads” by writing a new image to each of their “safe” image variables. By using try_lock, a variable is refreshed only if its reading thread is not still busy processing an earlier image.

[Figure: a camera thread grabs frames into an image buffer and distributes new images to the image variables of several subscribers by “try_lock & signal”; the subscriber threads get recent images by “lock & wait”.]

Figure 3. Camera images are copied to multiple Csafe variables as an example of safe data distribution
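The distribution pattern of Figure 3 can be sketched in a few lines on top of the Csafe sketch given earlier. The Image type and the function names below are illustrative assumptions, not part of the framework:

  struct Image { /* pixel data elided */ };

  // Camera side: refresh each subscriber's variable, but never block on a
  // subscriber that is still holding its lock.
  void distribute(Csafe<Image>* subscriber[], int n, const Image& frame) {
      for (int i = 0; i < n; ++i) {
          if (subscriber[i]->try_lock()) {     // skip busy subscribers
              subscriber[i]->value = frame;    // refresh the image
              subscriber[i]->signal();         // wake a waiting reader
              subscriber[i]->unlock();
          }
      }
  }

  // Subscriber side: block until a new frame arrives, then copy it out so
  // the lock is held as briefly as possible.
  void subscriber_loop(Csafe<Image>& img) {
      for (;;) {
          img.lock();
          img.wait();
          Image local = img.value;
          img.unlock();
          /* ... process local ... */
      }
  }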
The simple data exchange concept provided by Csafe variables has been extended to a distributed environment by means of the communication classes Ccommunication_sender and Ccommunication_receiver of the framework. As these classes are derived from the classes Csocket and Cthread respectively, instances of the communication classes become sender and receiver threads capable of communicating through sockets. If a thread has modified a Csafe variable in a program on one computer, it only has to signal this variable to activate an associated chain of sender and receiver threads that transports the modified content to another computer. Finally, the receiver thread updates a similar variable in a program that runs on the other computer, and any thread waiting on this variable is notified. Dedicated sender and receiver threads have to be defined to couple a pair of distributed Csafe variables; an example related to the robot soccer application is given in the next section. Note that distributed Csafe variables are updated automatically by chains of sender and receiver threads. Updating on demand would avoid unnecessary traffic, but induces extra delay.

Due to the general nature of sockets, the framework allows for interoperability between Linux, Solaris or Windows. There is, however, a prerequisite with respect to the compatibility of the compilers used: apart from byte-order conversion (big/little-endian), which is automatically detected and corrected, the variables must be mapped onto memory identically on all machines.
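The sender half of such a refreshment chain can be sketched as follows, building on the Csafe sketch above. The loop structure and names are assumptions for illustration; the framework's real classes are not shown in the paper:

  #include <unistd.h>

  // Wait for renewals of a "safe" variable and push its raw bytes through a
  // connected TCP socket; a matching receiver thread on the other machine
  // reads the bytes into its own Csafe variable and signals it.
  template <class T>
  void sender_loop(Csafe<T>& var, int sock) {
      for (;;) {
          T copy;
          var.lock();
          var.wait();          // block until a writer signals renewal
          copy = var.value;    // copy, so the lock is held only briefly
          var.unlock();
          write(sock, &copy, sizeof copy);   // binary transfer; assumes
      }                                      // identical layout on both ends
  }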
4. MI20 Software Architecture

The framework facilities have been used extensively in the MI20 control software. Due to the distributed design there is no essential difference between controlling a single robot team and controlling both teams of a robot game. In the latter case the global vision system tracking the robots consists of a single program for image processing and two separate programs for the state estimation as viewed by each team. The image-processing program contains multiple threads interpreting the images: a vision thread for each team, together with threads that display images on the user interface. A camera thread distributes images to all of these image-processing threads in the way described in the previous section.

The “soccer playing intelligence” of the system is also distributed over multiple agent threads. Each team consists of player agents, one for each robot, and a single coach agent. When controlling two teams the system has the multi-agent architecture shown in Figure 4. Each of the robots is steered by its player agent; this agent actually sends control commands to a thread that drives the radio-frequency link.

[Figure: a single camera feeds the image processing; for each of Team A and Team B there is a world state estimator, a coach agent and player agents, which drive the robots over an RF link.]

Figure 4. Controlling a complete robot match with two teams using a single camera system
Let us take the player agent as an example to see how data is exchanged in the system. State information is maintained in several globally known data structures like “world data”,
“player data”, “coach data”, “wheel data”, etc. In the main program of the player agent, Csafe-variables are defined for all of the data structures needed – for example:

  Csafe world_data;
  Csafe player_data[PLAYERS];
The player agent will typically read the world data produced by the state estimator and write player data and wheel data. The interconnection structure for a player agent is established by defining its communication servers. For example, to receive world data and send player data to the coach agent:

  Ccommunication_receiver_thread Iworld(&world_data, P2W_PORT[robot_id]);
  Ccommunication_sender_thread   Iplay(&player_data, P2C_PORT[robot_id]);
Then the communication threads only have to be started by calling Iworld.start(), Iplay.start(), etc. Thereafter the distributed data exchange proceeds automatically through the locking and synchronization protocol described in the previous section.

The distributed approach forces the separate control parts to communicate through well-defined interfaces. This has the additional advantage of modular design, making independent development and testing easier. For example, the coach and playing agents can be tested against a simulator without changing any of the interfaces. The simulator used even runs on a Windows machine, whereas all the MI20 control software runs on Linux.

5. Implementation Features

5.1 Coupled Exclusion

In certain cases it is desirable to access multiple Csafe-variables within a single exclusion regime – for example, to read the speed values of both robot wheels consistently. This has been made possible by the option to supply a common Cmutex variable as an argument to the constructor of a Csafe-variable (a sketch is given at the end of this section). Without this argument Csafe-variables use their private mutex; with this argument an indirect link is made to the Cmutex-variable supplied.

5.2 Pausing and Resuming Threads

For efficiency reasons only thread instances that are actually running have underlying pthreads in operation. Non-running thread instances exist only as class instances and do not consume further system resources. The idea is that threads are started or resumed through the user interface only when necessary, and paused or stopped when not needed anymore. In this way, for instance, the actual number of running player threads can be configured dynamically to match the real world. A drawback is the requirement that threads have to poll their status regularly to see whether they should pause or stop.

5.3 Automatic Connection Recovery

Socket connections may become broken for several reasons. Any sender thread will try to re-establish the connection. It makes use of type-specific exception classes derived from the Cexception superclass to catch different exception causes and to take appropriate action.
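The coupled exclusion of section 5.1 can be sketched as follows, building on the Cmutex sketch given earlier. The constructor signature is an assumption; the paper only states that a common Cmutex can be supplied:

  // Two "safe" variables share one monitor, so locking either locks both.
  template <class T>
  class CsafeCoupled {
  public:
      explicit CsafeCoupled(Cmutex& shared) : mon(shared) {}
      void lock()   { mon.lock(); }      // delegate to the common monitor
      void unlock() { mon.unlock(); }
      T value;
  private:
      Cmutex& mon;                       // indirect link to the Cmutex supplied
  };

  Cmutex wheels;                         // one exclusion regime for both wheels
  CsafeCoupled<double> left_speed(wheels);
  CsafeCoupled<double> right_speed(wheels);

  void read_speeds(double& l, double& r) {
      left_speed.lock();                 // locks the common mutex...
      l = left_speed.value;
      r = right_speed.value;             // ...so the pair is read consistently
      left_speed.unlock();
  }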
6. Conclusion

In this paper we focused on the additional software “infrastructure” that supports the distributed design of the robot soccer system MI20. The MI20 system consists of three major parts that have been designed by master thesis students, namely the global vision system [7], the intelligent decision engine [8] and the motion planning subsystem [9]. These parts could not have been developed and glued together so easily without the distributed data sharing framework. This framework was designed and implemented at the beginning of the project to serve as a common starting environment, and has been extended gradually during the subsequent integration stages. The source code of a simple application example that uses the framework is available online at the author's home page [10].

The main objective of the framework was to make distributed system composition easy without suffering overhead; this objective has been realized successfully. The result proves that in a dedicated application like robot soccer, distributed processing and fast, easy data sharing can go together. Fast data communication is achieved by the exchange of complete, commonly known data structures using sockets. Easy data access is the result of fully exploiting today's software facilities as offered by the C++ (template) class concept, multithreading packages and socket communication.

The flexibility of the distributed control framework has resulted in many blessings not planned in advance. As mentioned, the system was easily expanded with a duplicate playing team (and duplicate user interface), allowing us to control a complete robot soccer match. Whereas the system was initially set up to play with teams of 5 robots, it is equally capable of handling larger teams, which made it possible to participate in the “large” Mirosot league with 7 against 7 robots.

References

[1] RoboCup: www.robocup.org
[2] Federation of International Robosoccer Association: www.fira.net
[3] D.R. Butenhof, Programming with POSIX Threads, Addison-Wesley, 1997.
[4] M. Beck et al., Linux Kernel Internals, 2nd Ed., Addison-Wesley, 1997.
[5] A. Silberschatz et al., Applied Operating System Concepts, John Wiley & Sons, 2000.
[6] S. Oaks and H. Wong, Java Threads, 2nd Edition, O'Reilly & Associates, 1999.
[7] N.S. Kooij, The development of a vision system for robotic soccer, Master's Thesis, University of Twente, 2003.
[8] R.A. Seesink, Artificial intelligence in multi-agent robot soccer domain, Master's Thesis, University of Twente, 2003.
[9] W.D.J. Dierssen, Motion planning in a robot soccer system, Master's Thesis, University of Twente, 2003.
[10] wwwhome.cs.utwente.nl/~schoute/ES_files/fc_esi_frame.tar
Improving TCP/IP Multicasting with Message Segmentation

Hans Henrik HAPPE and Brian VINTER
Dept. of Mathematics & Computer Science, University of Southern Denmark, DK-5230 Odense M, Denmark.
{hhh, vinter}@imada.sdu.dk

Abstract. Multicasting is a very important operation in high performance parallel applications. Making this operation efficient in supercomputers has been a topic of great concern, and much effort has gone into designing special interconnects to support it. Today's huge deployment of NoWs (Networks of Workstations) has created a high demand for efficient software-based multicast solutions. These systems are often based on low-cost Ethernet interconnects without direct support for group communication. Basically TCP/IP is the only widely supported method of fast reliable communication, though it is possible to improve Ethernet performance at many levels – e.g., by-passing the operating system or using physical broadcasting. Low-level improvements are not likely to be accepted in production environments, which leaves TCP/IP as the best overall choice for group communication. In this paper we describe a TCP/IP based multicasting algorithm that uses message segmentation in order to lower the propagation delay. Experiments have shown that TCP is very inefficient when a node has many active connections. With this in mind we have designed the algorithm so that it has a worst-case propagation path length of O(log n) with a minimum of connections per node. We compare our algorithm with the binomial tree algorithm often used in TCP/IP MPI implementations.

Keywords. Multicasting, NoW, HPC
1. Introduction

Message multicasting is a highly investigated topic [1], partly because of the importance of multicasting performance in parallel programming [2]. In [3] it is found that 9.7% of all calls to the MPI [4] layer in the NAS [5] benchmark suite are broadcast operations; in fact the only operations that are more frequent are the point-to-point send and receive operations. Most work on multicast algorithms is highly analytical and considers theoretical performance using quite simplified hardware models. Previous work has shown that, at the system level, the optimal topology for broadcast algorithms is quite different from the theoretical findings [6]. In this work we move the findings from research in wormhole-routed interconnects [7] into the software stack.

In section 2 we describe the existing algorithms for multicasting in computational clusters. In section 2.1 we introduce the concept of segmenting messages in the software stack, and in section 3 we show how segmentation may be used for multicasting. In this section we also introduce a multicasting tree that only requires each process to send, at most, the same amount of data as the size of the multicasted message. In section 4 the model is implemented and tested in two Ethernet based clusters.
Figure 1. Multicast trees. Node expressions give the time at which the node receives the message and edge numbers give the step in which communication takes place. a) Binary multicast tree. b) Binomial multicast tree.
2. Point-to-Point Multicasting

Multicasting a message by means of point-to-point communication must be done in accordance with some algorithm. Basically this algorithm must select which destination processes should aid in spreading the message by forwarding it to other destinations. The optimal message path, when multicasting using point-to-point communication, will yield a tree whose root is the source process. This is easily realized from the fact that all other processes only need to receive the message once. In [8] an optimal multicast solution in the logP model [9] has been given. The tree structure in this solution depends very much on the parameters of the model. In real systems these parameters can be very dynamic, and therefore a more practical approach is used to define a suitable tree structure.

In the following we will give a simple analysis of some of the classical tree structures often used in real systems. This analysis is based on a simple network model where the time t to send a message from one process to another is defined as:

  t = d + b                                (1)

where

  d = delay                                (2)
  b = message size / max bandwidth         (3)
d is the one-byte latency and b is the time imposed by the maximum point-to-point bandwidth. The letter m will be used to denote the time for the whole multicast, and n is the number of processes involved (source and destinations).

A simple way to implement multicasting is to have the source process send the message directly to all destination processes. This gives a multicast time of m = (n − 1)b + d, which scales very poorly for large messages. For small messages this seems to be a good solution, but the source process will be very busy sending, which means it cannot attend to other matters (i.e. calculation).

The next logical step is to make a higher tree with a constant fanout f, as shown in Figure 1.a. Here m ≤ h(d + fb), where h = log_f n is the height of the tree. Because the height of the tree is a logarithmic function, this tree gives O(log n) time complexity. Given d, b and n it is possible to find the optimal fanout. An advantage of this multicast tree is that the work of sending is shared among many processes (those with children in the tree).

A binomial multicast tree is an optimization of a binary multicast tree. When a process in a binary multicast tree is done sending to its children, it considers the multicast finished, while the children are still sending. Instead it could continue sending to other processes, which is the idea behind the binomial tree (Figure 1.b).
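The model is easy to evaluate numerically. The following sketch plugs illustrative parameter values (assumptions, not measurements from this paper) into equations (1)–(3) for the flat send and the fanout-f tree:

  #include <cmath>
  #include <cstdio>

  int main() {
      const double d   = 50e-6;          // one-byte latency (s), assumed
      const double bw  = 100e6;          // max point-to-point bandwidth (B/s)
      const double msg = 256 * 1024;     // message size (bytes)
      const double b   = msg / bw;       // bandwidth term, equation (3)
      const int    n   = 64;             // processes (source + destinations)

      double m_flat = (n - 1) * b + d;   // source sends directly to everyone

      const int f = 4;                   // constant fanout of the tree
      double h = std::ceil(std::log((double)n) / std::log((double)f));
      double m_tree = h * (d + f * b);   // m <= h(d + fb)

      std::printf("flat: %.4f s   fanout-%d tree: %.4f s\n", m_flat, f, m_tree);
  }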
Figure 2. Segmentation multicast graphs. Edge numbers are the segments transferred along the path. a) Sequential tree (segmented seqtree). b) Basic idea of the optimal algorithm.
The structure of this tree ensures that every process that has received the message participates in sending it to those processes that have not yet received it. As illustrated in the figure, this gives better results than a plain binary tree because more processes work in each step. Trees with a constant fanout f > 2 could give better results for small messages, because the height of a binomial tree is h = log_2 n. The uneven amount of work that each process has to do is a disadvantage of binomial multicast trees.

2.1. Message Segmentation

A major problem with point-to-point multicasting is that the maximum multicast bandwidth cannot be more than half the maximum point-to-point bandwidth when there are two or more destinations: either the root sends to both destinations, or one destination forwards to the next. This is only true when all of a message is received at a destination before it is forwarded. Message segmentation deals with this problem by splitting messages into smaller segments. Forwarding can then begin as soon as the first segment has been received, and together with the right multicast algorithm it is possible to multicast with a bandwidth that exceeds half the maximum point-to-point bandwidth.

The segment size s dictates the delay of relaying a message through a process. Theoretically s should be as small as possible in order to minimize this delay, but the actual splitting and forwarding imposes an overhead. These trade-offs have to be considered when choosing a proper segment size.

When using message segmentation the classical multicast trees are far from optimal. The problem is that some processes send the message more than once. This sets an upper bound of max bandwidth / f_max on the multicast bandwidth, where f_max is the maximum fanout in the multicast tree. In order to achieve multicast bandwidths near the maximum point-to-point bandwidth, the multicast algorithm has to ensure that the message size is an upper bound on how much data each process must transmit.

A sequential tree (a fanout of one) is the obvious choice for a multicast structure that utilizes segmentation (Figure 2.a). This structure ensures that all processes receive and send the message at most once. Theoretically the multicast time of this structure would be m = (n − 1)(d + b_s) + b, where b_s = s / max bandwidth is the time imposed by the maximum point-to-point bandwidth when transmitting a segment. This gives a time complexity of O(n), which might be negligible for large messages (b will dominate), but for small messages the propagation delay (n − 1)(d + b_s) will dominate m. In the following this algorithm will be called “segmented seqtree”.
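The forwarding loop of a segmented seqtree process is simple enough to sketch directly: every segment is relayed onwards as soon as it has arrived, so the transfers down the chain overlap in time. The helper functions are assumptions, written out here so that the sketch is self-contained:

  #include <unistd.h>
  #include <algorithm>
  #include <cstddef>
  #include <vector>

  static void recv_exact(int fd, char* buf, size_t n) {
      while (n > 0) {                       // loop: read() may return short
          ssize_t r = read(fd, buf, n);
          if (r <= 0) return;               // error handling elided
          buf += r; n -= (size_t)r;
      }
  }

  static void send_exact(int fd, const char* buf, size_t n) {
      while (n > 0) {                       // loop: write() may return short
          ssize_t w = write(fd, buf, n);
          if (w <= 0) return;
          buf += w; n -= (size_t)w;
      }
  }

  // Relay one multicast message, one segment at a time, from our parent
  // (in_fd) straight on to our child (out_fd) in the sequential tree.
  void relay(int in_fd, int out_fd, size_t msg_size, size_t seg_size) {
      std::vector<char> seg(seg_size);
      for (size_t done = 0; done < msg_size; done += seg_size) {
          size_t n = std::min(seg_size, msg_size - done);
          recv_exact(in_fd, seg.data(), n);
          send_exact(out_fd, seg.data(), n);
      }
  }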
In [8] an optimal solution for multicasting a number of segments in the logP model is presented. The basic idea of this solution is that the source process scatters the segments to the destination processes in a round-robin fashion. When a destination process receives a segment, it becomes the root of that segment and sends it to all other destination processes by means of an optimal single-segment multicast tree (Figure 2.b).
A final multicast structure for each segment will depend on the logP parameters and the number of segments. Also, there can be special cases where processes that have finished can aid in completing the multicast (end-game).

3. Segmented Multicasting in TCP/IP Networks

All the information needed to construct an optimal point-to-point multicast is hard to obtain at runtime. We have therefore tried a more general algorithm, inspired by the optimal solution described in [8]. Figure 2.b illustrates how it works: the source process spreads the segments evenly over the destination processes, and each destination process sends the segments it receives to all other destination processes. The message size is an upper bound on how much data each process must send in this algorithm, and each segment is only forwarded by one process.

We have evaluated the algorithm in an Ethernet-based cluster using TCP/IP. Results showed that the segmented seqtree algorithm performed much better for large messages. Without further investigation, we believe that TCP/IP does not handle multiple connections well. With this algorithm all processes have multiple connections that are constantly switching between activity and inactivity. This puts much pressure on the TCP subsystem in the form of buffer management, which in turn increases memory activity (i.e. cache misses). It might also be that TCP congestion control cannot adapt fast enough. This is a subject of further research.

TCP connections will always use extra memory, so limiting the number of connections is preferable. The segmented seqtree algorithm has this feature, but the path from the source to the last destination is O(n) long. This results in poor performance as the message size decreases. We have therefore devised an algorithm, which we call “segmented bintree”, that falls between the characteristics of the two above. This algorithm uses a minimum of connections given these constraints:

1. The path length must be O(log n).
2. The message size is an upper bound on how much data a process has to forward.

Figure 3 illustrates how it works. Two binary trees containing all destination processes are built, with the constraint that the total number of children of each process, summed over both trees, is at most two. When multicasting, the source process sends every other segment to the root of each of these two trees. The segments are forwarded all the way down to the leaves, with the result that all processes receive the message.

Figure 3. Segmented bintree algorithm. Process 0 is the source process.
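One way to build such a pair of trees is sketched below. This particular construction is an assumption for illustration (the paper does not spell out its layout): tree A is a heap-ordered binary tree over the destinations, and tree B has the same shape over the reversed order, so the internal nodes of one tree are leaves of the other and no process ends up with more than two children in total:

  #include <cstdio>
  #include <vector>

  int main() {
      const int n = 7;                       // number of destination processes
      std::vector<int> children(n, 0);

      auto add_tree = [&](bool reversed) {
          for (int pos = 0; pos < n; ++pos) {           // heap position
              int node = reversed ? n - 1 - pos : pos;  // process at this position
              for (int c = 2 * pos + 1; c <= 2 * pos + 2 && c < n; ++c) {
                  int child = reversed ? n - 1 - c : c;
                  std::printf("tree %c: %d -> %d\n", reversed ? 'B' : 'A',
                              node, child);
                  ++children[node];
              }
          }
      };
      add_tree(false);    // tree A carries the even-numbered segments
      add_tree(true);     // tree B carries the odd-numbered segments

      for (int i = 0; i < n; ++i)            // verify the fanout constraint
          std::printf("process %d: %d children in total\n", i, children[i]);
  }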
The first constraint obviously holds with this algorithm. Because of the manner in which the two trees are created, the second constraint also holds. Each process will have at most four active connections, which is a minimum given the constraints. In the general case, the first constraint dictates that the source process or one of its descendants must forward to at least two processes; we will call this process x. x also has to have a descendant y that has to forward to at least two processes. The second constraint dictates that x cannot forward the whole message to y, hence y has to receive the remaining part from some other process z. z will not always be a child of y, because some z will have two or more children of its own. Therefore, there will exist a process y that has at least four connections given the constraints.

4. Experiments

The binomial, segmented seqtree and segmented bintree algorithms have been implemented using TCP sockets for communication. The algorithms have then been compared in two different clusters.

4.1. Clusters

The clusters used have the following configurations:

• Gigabit: 64 nodes, Intel Pentium 4 3.2 GHz CPUs, switched gigabit Ethernet network. The network adapter is connected to the CPU via the 32bit PCI bus. Linux 2.6.
• CSA Gigabit: 13 nodes, Intel Pentium 4 2.6 GHz CPUs, switched gigabit Ethernet network. The network adapter is connected to the CPU via Intel's Communications Streaming Architecture (CSA). Linux 2.6.

Note that the plain 64-node gigabit cluster is a production cluster to which we had very limited access (shared with others by means of a job queue). Therefore it was not possible to investigate irregularities in the following results. Also, the 32bit PCI bus connection to the network adapter makes full-duplex gigabit communication impossible. The interconnect consisted of a set of 24-port switches with independent full-duplex 10 gigabit connections between them. We did not try to compensate for this heterogeneity when laying out the multicast trees, but all tests were run on the same set of nodes. This issue should only affect the segmented algorithms, because these utilize parallel communication extensively. The fact that the results do not show any evidence of this heterogeneity suggests that the 32bit PCI issue ensured that the inter-switch links were not saturated.

The small CSA gigabit cluster has been included to test the algorithms in a gigabit cluster where close to maximum full-duplex gigabit bandwidth is possible.

4.2. Results

Both segmented algorithms have been run with a segment size of 8KB throughout the tests. In general this size has proved to give good results, though it might be a subject of further study. Time measurements have been carried out by letting all destination nodes notify the source node when the full message had been received; we have not compensated for this extra notification delay in the following results. In all the results the multicast bandwidth is the bandwidth of the actual multicast and not the accumulated bandwidth of all communication lines.
[Plot: Gigabit cluster, message size 1B, segment size 8KB; multicast time (usec) against number of nodes (2–64) for the segmented bintree, segmented seqtree and binomial algorithms.]
Figure 4. Multicast time/number of nodes. Latency in the gigabit cluster for one byte messages.
Overall the segmented bintree algorithm performs just as well as or better than the binomial for 32KB or larger messages. The segmented seqtree algorithm needs even larger message sizes before it generally outperforms the binomial, which was expected.

Figure 4 shows the segmented seqtree algorithm's problem with small messages. The binomial algorithm performs slightly better than the segmented bintree. This was expected, because the segmented bintree algorithm has a slightly longer path to the last destinations, due to the final communication between the subtrees.

With 256KB messages, in Figure 5, the segmented bintree algorithm generally outperforms the others; with 64 nodes it is 196% faster than the binomial algorithm. Also, when comparing the two and four node runs, we start to see that the nodes cannot handle full-duplex gigabit communication.

In Figure 6, with 8MB messages, the segmented bintree algorithm scales very well. The bandwidth decrease is very small as the number of nodes increases, while additional nodes have a much greater impact on the binomial algorithm. With 64 nodes the segmented bintree algorithm is 320% faster than the binomial. The performance of the segmented seqtree algorithm should be close to that of the segmented bintree with 8MB messages, but this is not the case. It must be an issue with the specific cluster, because we do not see the same result in other, smaller clusters (Figure 8).

Figure 7 shows the results for different message sizes with 64 nodes. The segmented bintree algorithm follows the binomial up to a message size of 32KB. As the message size increases beyond 32KB we see the effect of segmentation, which makes it possible to increase the multicast bandwidth all the way up to the maximum, given the 32bit PCI issue (see the result for four nodes in Figure 6).

The importance of using nodes capable of full-duplex gigabit communication becomes very clear when looking at the results from the CSA gigabit cluster (Figure 8). Here the multicasting bandwidth reaches 82.6MB/s, which is 74.4% of the maximum point-to-point TCP bandwidth, measured at 111MB/s.
[Plot: Gigabit cluster, message size 256KB, segment size 8KB; multicast bandwidth (MB/s) against number of nodes (2–64) for the three algorithms.]
Figure 5. Multicast bandwidth/number of nodes. Multicast bandwidth in the gigabit cluster with 256KB messages.
[Plot: Gigabit cluster, message size 8MB, segment size 8KB; multicast bandwidth (MB/s) against number of nodes (2–64) for the three algorithms.]
Figure 6. Multicast bandwidth/number of nodes. Multicast bandwidth in the gigabit cluster with 8MB messages.
[Plot: Gigabit cluster, 64 nodes, segment size 8KB; multicast bandwidth (MB/s) against message size (1KB–16MB) for the three algorithms.]
Figure 7. Multicast bandwidth/message size. Multicast bandwidth in the gigabit cluster with 64 nodes.
5. Conclusion

The goal of this work was to improve software-based point-to-point multicasting by means of message segmentation. Tests have shown that minimizing the number of active connections reduces TCP/IP's communication overhead considerably. With this in mind, we have devised an algorithm that theoretically has O(log n) time complexity while using four or fewer connections per process. This algorithm utilizes message segmentation in order to achieve multicasting bandwidths close to the maximum point-to-point bandwidth. The algorithm can do this because no process sends more data than the size of the multicasted message; this also distributes the work evenly among the involved processes.

We have compared the algorithm with a more obvious segmentation algorithm (the sequential tree) and the widely used binomial tree algorithm. The results have shown that our algorithm generally outperforms the binomial algorithm with 32KB or larger messages, and in some tests it was up to 320% faster. For messages smaller than 32KB the binomial algorithm wins by a small margin; using another algorithm in that case could easily solve this problem.
[Plot: CSA Gigabit cluster, 13 nodes, segment size 8KB; multicast bandwidth (MB/s) against message size (8KB–16MB) for the three algorithms.]

Figure 8. Multicast bandwidth/message size. Multicast bandwidth in the CSA gigabit cluster with 13 nodes.

References

[1] X. Défago, A. Schiper, and P. Urbán. Total order broadcast and multicast algorithms: taxonomy and survey. ACM Computing Surveys, 36(4):372–421, 2004.
[2] Andrew S. Tanenbaum, M. Frans Kaashoek, and Henri E. Bal. Parallel programming using shared objects and broadcasting. IEEE Computer, 25(8):10–19, 1992.
[3] Ted Tabe and Quentin F. Stout. The use of the MPI communication library in the NAS parallel benchmarks. Technical Report CSE-TR-386-99, 17, 1999.
[4] Message Passing Interface Forum. MPI: A message-passing interface standard. Technical Report UT-CS-94-230, 1994.
[5] D. H. Bailey, L. Dagum, E. Barszcz, and H. D. Simon. NAS parallel benchmark results. In Supercomputing '92: Proceedings of the 1992 ACM/IEEE Conference on Supercomputing, pages 386–393. IEEE Computer Society Press, 1992.
[6] John Markus Bjørndalen, Otto J. Anshus, Tore Aarsen, and Brian Vinter. Configurable collective communication in LAM-MPI. In James Pascoe, Roger Loader, and Vaidy Sunderam, editors, Communicating Process Architectures 2002, pages 123–134, 2002.
[7] Lionel M. Ni and Philip K. McKinley. A survey of wormhole routing techniques in direct networks. Computer, 26(2):62–76, 1993.
[8] Richard M. Karp, Abhijit Sahay, Eunice E. Santos, and Klaus E. Schauser. Optimal broadcast and summation in the logP model. In ACM Symposium on Parallel Algorithms and Architectures, pages 142–153, 1993.
[9] David E. Culler, Richard M. Karp, David A. Patterson, Abhijit Sahay, Klaus E. Schauser, Eunice Santos, Ramesh Subramonian, and Thorsten von Eicken. LogP: Towards a realistic model of parallel computation. In Principles and Practice of Parallel Programming, pages 1–12, 1993.
Lazy Cellular Automata with Communicating Processes

Adam SAMPSON, Peter WELCH and Fred BARNES
Computing Laboratory, University of Kent, Canterbury, Kent, CT2 7NF, England.
{ats1, P.H.Welch, F.R.M.Barnes}@kent.ac.uk

Abstract. Cellular automata (CAs) are good examples of systems in which large numbers of autonomous entities exhibit emergent behaviour. Using the occam-π and JCSP communicating process systems, we show how to construct “lazy” and “just-in-time” models of cellular automata, which permit very efficient parallel simulation of sparse CA populations on shared-memory and distributed systems.

Keywords. CSP, occam-pi, JCSP, parallel, CA, Life, lazy, just-in-time, simulation
Introduction

The TUNA project is investigating ways to model nanite assemblers that allow their safety properties and emergent behaviour to be analysed. We are working with the occam-π language [1] and with the JCSP package for Java [2], both of which provide concurrency facilities based on the CSP process algebra and the π-calculus. The techniques described in this paper may be used in either environment; examples will be given in a pseudocode based on occam-π.

Autonomous devices with emergent behaviour will be familiar to anybody who has experimented with cellular automata; indeed, some of the first models constructed by the TUNA project are in the form of CAs. While CAs are significantly simpler than the sorts of devices we eventually want to model – for example, they have very simple state, usually operate upon a regular grid, and have a common clock – they provide a good starting point for modelling approaches.

We examine several sequential and parallel approaches to simulating cellular automata in occam-π and JCSP. The major desirable feature for a CA simulation is that very large scales can be achieved. This means that it should execute as fast as possible and use as little memory as possible. In particular, we would like to be able to take advantage of both distributed clusters of machines and new multi-core processor chips. We demonstrate approaches to CA modelling that satisfy these goals.

1. The Game of Life

The CA that we will use as an example is John Conway's Game of Life, usually referred to simply as “Life” [3]. First discovered in 1970, Life produces startling emergent behaviour using a simple rule to update the state of a rectangular grid, each cell of which may be either “alive” or “dead”. All cells in the grid are updated in a single time step (“generation”). To compute the new state of a cell, its live neighbours are counted, where the cell's neighbours are those cells that are horizontally, vertically or diagonally adjacent to it.
Figure 1. Five generations of a Life glider; black cells are alive.
If a cell was dead in the previous generation and has exactly three live neighbours, it will become alive; if it was alive in the previous generation and does not have either exactly two or exactly three live neighbours, it will die. (See Figure 1.)

Thirty-five years of research into Life have produced a vast collection of interesting patterns to try. Simple arrangements of cells may repeat a cyclic pattern (“blinkers”), move across the grid by moving through a cyclic pattern that ends up with the original arrangement in a different location (“gliders”), generate a constant stream of other patterns (“guns” and “puffer trains”), constantly expand to occupy more of the grid (“space-fillers”), or display many other emergent behaviours. Life is Turing-complete; it is possible to create logic gates and Turing machines [4].

Life has some features which allow it to be simulated very efficiently. The most important is that cells only change their state in response to changes in the neighbouring cells; this makes it easy to detect when a cell's state must be recalculated. The new-state rule is entirely symmetric: it does not make a difference which of a cell's neighbours are alive, just that a given number of them are, so the state that must be propagated between cells does not need to include cell locations. Finally, the new-state rule is based on a simple count of live neighbours, which can be incremented and decremented as state-change messages are received, without needing to compute it from scratch on each cycle. These features are not common to all CAs – and certainly will not hold for some of the models that TUNA will investigate – but are nonetheless worth investigating from the implementer's point of view; if such a feature makes a system especially easy to simulate or reason about, it may be worth modifying a TUNA design to include it.

Some simple variants on Life exist that can be simulated using near-identical code. The normal Life rule is that a cell must have three neighbours to be born and two or three neighbours to survive; many variations simply change these numbers. (For example, in the HighLife variant, a cell may also survive if it has six neighbours.) Other variations change the topology of the Life grid: HexLife uses a hexagonal grid, and 3D Life uses a three-dimensional grid where cells are cubes and have 26 neighbours. Many other CAs that run on regular grids, such as WireWorld [5], may also be implemented within a Life-simulating framework, although they may require cells to keep or transfer more state.

2. Framework

Input and output for most of these approaches can be handled using common code; during development we constructed an occam-π framework which can support several different simulation approaches. The input to a CA simulator consists of an initial state for all (or some) of the cells. For testing purposes, simple predictable patterns are the most useful, since correct behaviour may easily be recognised. However, some problems may be difficult to expose except under extreme load, so the ability to generate random patterns, or to load complex predefined patterns from disk, is also desirable. For CAs such as Life, learning to recognise correct and incorrect behaviour by eye is straightforward.

The output clearly must include the state of all of the cells; it is also helpful to display statistics such as the number of active cells. In order to obtain reasonable display performance, it is desirable to update the screen only once per generation (or even less often); this can be done by having a simulation process send a “tick” to the display once per generation.
Depending on how the display is implemented, it may be necessary for it to keep its own state array for all the cells; this can allow more interesting visualisations than simply showing the cells' states. For example, it is useful in Life to show how long each cell has been alive; the present framework uses the occam-π OpenGL bindings [6] to display a 3D projection of the Life grid where cells' ages are represented by their heights and colours.

3. Sequential Approach

The simplest approach to simulating Life is to walk over the entire grid for each generation, computing the new state of each cell (typically writing it into a second copy of the grid, which is exchanged with the first at the end of each step). This algorithm is O(number of cells in the grid).

As the majority of existing Life implementations are sequential, some techniques have been devised to speed up simulation. The most promising is Bill Gosper's HashLife algorithm [7], which computes hash functions over sections of the grid in order to spot repeating patterns; by caching the new state resulting from such patterns the first time they are computed, several generations of the new state for that region may simply be retrieved from the cache rather than computed again, provided no other patterns interact with it. HashLife is particularly useful for quickly computing the outcome of a long-running Life pattern when there is no need to show the intermediate steps. The performance depends on the type of pattern being simulated: patterns with many repeating elements will perform very well, but the worst-case behaviour (where the pattern progresses without repetition) is worse than the simple approach, since hash values are being computed for no gain.

The sequential algorithms typically have good cache locality, and can thus operate very efficiently on a single processor. (Life has even been implemented using image-manipulation operations on a graphics card processor.) However, in order to simulate very large Life grids – those with hundreds of millions of active cells – at an acceptable speed, we need to take advantage of multiple processors and hosts; we must investigate parallel algorithms.
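For concreteness, the simple sequential algorithm can be written down in a few lines. The sketch below (in C++ rather than the occam-π pseudocode used for the figures in this paper) is an illustration, not the framework's code:

  #include <utility>
  #include <vector>

  using Grid = std::vector<std::vector<bool>>;

  // One generation: walk the whole grid, writing new states into a second
  // grid, then exchange the two grids.  Edges wrap around, torus-fashion.
  void step(Grid& cur, Grid& next) {
      const int H = (int)cur.size(), W = (int)cur[0].size();
      for (int y = 0; y < H; ++y)
          for (int x = 0; x < W; ++x) {
              int live = 0;                      // count live neighbours
              for (int dy = -1; dy <= 1; ++dy)
                  for (int dx = -1; dx <= 1; ++dx)
                      if (dy != 0 || dx != 0)
                          live += cur[(y + dy + H) % H][(x + dx + W) % W];
              // Conway's rule: birth on three, survival on two or three
              next[y][x] = (live == 3) || (cur[y][x] && live == 2);
          }
      std::swap(cur, next);                      // constant-time exchange
  }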
4. Process-per-Cell Approaches

We examine a number of CSP-based parallel approaches to modelling Life in which each Life cell is represented by a process, starting with the simplest approach and demonstrating how incremental changes may be made to the model to improve performance.

4.1. Simple Concurrent Approach

The simplest parallel model of Life using a CSP approach is to have one process for each cell, connected using channels to form a grid (see Figure 2). Wiring up the channels correctly is the most complex part of this approach. One approach is to say that each cell “owns” its outgoing channels, which are numbered from 0 to 7 clockwise starting from the top; outgoing channel N then connects to channel (N + 4) mod 8 on its destination cell, which can be found by adding an appropriate offset to the current location (see Figure 3). The easiest way to deal with the connections at the edge of the grid is to wrap them around (making the grid topologically equivalent to a torus); alternatively, they may connect to “sink cells” which behave like regular cells but act as if they are always dead. None of the cells need to know their absolute locations in the grid.

On each step, each cell must find out the state of those around it. This is done with an I/O-PAR exchange [8] in which each cell, in parallel, outputs its state to its neighbours and to the display, and reads its neighbours' states.
Figure 2. Grid of cell processes with interconnecting channels.

  INITIAL [Height][Width]BOOL initial.state IS [...]:
  [Height][Width]CHAN BOOL changes:
  [Height][Width][8]CHAN BOOL links:
  VAL [8]INT y.off IS [-1, -1, -1, 0, 0, 1, 1, 1]:
  VAL [8]INT x.off IS [-1, 0, 1, -1, 1, -1, 0, 1]:
  INT FUNCTION wrap (VAL INT v, max) IS (v + max) \ max:
  PAR
    display (changes)
    PAR y = 0 FOR Height
      PAR x = 0 FOR Width
        [8]CHAN BOOL from.others IS
          [i = 0 FOR 8 | links[wrap(y + y.off[i], Height)]
                              [wrap(x + x.off[i], Width)]
                              [(i + 4) \ 8]]:
        cell (from.others, links[y][x], changes[y][x], initial.state[y][x])

Figure 3. Code to set up Life grid.
Once the cell knows its neighbours' states, it computes its own state for the next generation (see Figure 4). As each cell must do nine outputs and eight inputs for each generation, there is no need for an external clock: the entire grid stays synchronised. The I/O-PAR design rule guarantees that this implementation is free from deadlock. However, it runs very slowly – particularly when compared to a sequential implementation – because the majority of the time is spent doing communications, many of which are carrying state that has not changed. As we know that a Life cell's state will not change unless its neighbours' states have changed, this is wasteful, particularly for sparse patterns on a large grid.

4.2. Using a Barrier

We thus want to avoid communicating except upon state changes: a cell should only broadcast its state to its surrounding cells when it changes. This implies that we cannot use the I/O-PAR approach any more. Furthermore, it is possible that two groups of active cells may not be in contact with each other, so the inter-cell communications cannot provide the “generation tick”; another approach must be found.
  PROC cell ([8]CHAN BOOL inputs, outputs,
             CHAN BOOL changes!, VAL BOOL initial.state)
    INITIAL BOOL my.state IS initial.state:
    [8]BOOL neighbour.states:
    WHILE TRUE
      SEQ
        PAR
          changes ! my.state
          PAR i = 0 FOR 8
            PAR
              outputs[i] ! my.state
              inputs[i] ? neighbour.states[i]
        my.state := compute.new.state (neighbour.states)
  :

Figure 4. Code for one Life cell using the “simple” approach.
We could synchronise all the cells by having a central “clock” process with a channel leading to each cell, which outputs in parallel to all of them; however, we are trying to reduce the number of communications per generation! Fortunately, CSP provides a more efficient alternative in multiway events, which are available in occam-π and JCSP as barrier synchronisations. Barriers maintain an “enrolled” count of processes which may synchronise upon them; a process that attempts to synchronise will not proceed until all processes enrolled with the barrier are attempting to do so. We can provide generation synchronisation by making cell processes synchronise on a barrier shared with all the other cells in the grid.

Cells start by performing a single I/O-PAR exchange, as in the simple approach, in order to obtain the initial state of their neighbours; this could be avoided if all cells had access to a shared initial-state array. The state of the cells around them is now held as a simple count of live cells. For each generation, a cell first computes its new state; if it has changed, it broadcasts it to the cells around it and to the display. It then synchronises on the barrier, and finally polls its input channels to collect any changes that have been sent by its neighbours, adjusting the count of live neighbours appropriately (see Figure 5).

This approach would cause instant deadlock if regular unbuffered occam-π channels – which cause writes to block until a matching read comes along, and vice versa – were used to connect the processes, since all writes are done before the barrier synchronisation and all reads afterwards. Instead, the channels should be one-place buffered – that is, a process may write one message to the channel without blocking, and the reading end may asynchronously collect the message at some point in the future. Unfortunately, while JCSP provides N-buffered channels, occam-π does not; it is, however, possible to simulate them using an “id” buffer process running at high priority [9]. The high priority guarantees that all the buffer processes will run before the barrier synchronisation completes. (This is strictly an abuse of the priority system, which is meant to be used for advisory purposes; however, we have found priorities useful for prototyping new communications mechanisms like this.)

With this approach, we are now only communicating when a state change occurs. However, all the cells on the grid are still taking part in the barrier synchronisation on each cycle; it is faster, but we can do better.

4.3. Resigning from the Barrier – The Lazy Model

A process that is enrolled on a barrier may also resign from it. A resigned process acts as though it were not enrolled; the barrier does not wait for it to synchronise before allowing other processes to run. We can take advantage of this to make cells “sleep” whilst nothing around them is changing.
  PROC cell ([8]ONE-BUFFERED CHAN BOOL inputs, outputs,
             CHAN CHANGE changes!, BARRIER bar, VAL BOOL initial.state)
    INITIAL BOOL my.state IS initial.state:
    INT live.neighbours:
    SEQ
      ...  do one I/O-PAR exchange as before to count initially-alive neighbours
      WHILE TRUE
        BOOL new.state:
        SEQ
          ...  compute new.state based on live.neighbours
          IF
            new.state <> my.state
              PAR -- state changed
                my.state := new.state
                PAR i = 0 FOR 8
                  outputs[i] ! new.state
                changes ! new.state
            TRUE
              SKIP -- no change
          SYNC bar
          SEQ i = 0 FOR 8
            PRI ALT
              BOOL b:
              inputs[i] ? b
                ...  adjust live.neighbours
              SKIP
                SKIP -- just polling
  :

Figure 5. Code for one Life cell using the “barrier” approach.
This results in “lazy simulation”, where cells only execute when it is absolutely necessary.

  ...
  IF
    new.state <> my.state
      SEQ
        ...  broadcast new state as before
    TRUE
      SEQ -- no change, so go to sleep
        ...  set priority to high
        RESIGN bar
          ALT i = 0 FOR 8
            BOOL b:
            inputs[i] ? b
              ...  adjust live.neighbours
        SYNC bar
        ...  set priority to normal
  ...

Figure 6. Changes to the “barrier” approach to support resignation.
This requires some simple modifications to the “barrier” approach. The basic idea is that if the state has not changed, then the process resigns from the barrier and performs a regular ALT across its input channels; it will thus not run again until it receives a change message
from a neighbour, at which point it will rejoin the barrier, synchronise on it, and continue as it did with the previous approach (see Figure 6). However, we have also had to insert some priority changes. If all processes run at the same priority, then the barrier resignation introduces a race condition: between the ALT and the end of the RESIGN block, it is possible that all the other processes would synchronise on the barrier, meaning that when this process synchronises it must wait for the next generation. The priority changes are the simplest way to avoid this race, but other approaches are arguably more correct [10].

This optimisation causes a significant performance improvement, since only active cells occupy CPU time: a small glider moving across a huge grid will only require the cells that the glider touches to run. For typical patterns, performance is now rather better than a sequential simulation of the same grid, and much better than the first parallel approach described: after fifty generations on a randomly-initialised large grid, this approach was a factor of 15 faster than the original approach, and the relative performance increases further as the number of active cells decreases. However, it still uses far more memory, as there is a dormant process for each grid cell with a number of channels attached to it.

4.4. Using Shared Channels

Memory usage may be reduced significantly by cutting down on the number of channels. Since Life cells do not care about which neighbouring cell a change message was received from, we can take advantage of another occam-π and JCSP feature: shared channels. The approach is simply to replace the eight channels coming into each cell with a single shared channel; each of the eight neighbouring processes holds a reference to it. The code is much the same as in the previous approach: the only change is to the polling code, which must poll the shared channel repeatedly until it sees no data. It is also necessary for the one-place buffered channels to become eight-place buffered channels, since it is possible that all eight cells surrounding a cell may have changed. (To simulate this without real buffered channels, the approach is to make the buffers write to the eight neighbouring cells in parallel.)

We have thus reduced the number of channels by a factor of eight. In memory terms, this is not quite as good as it looks, since the buffer size in each channel has been increased by a factor of eight, and some overheads are caused by the channels being shared; nonetheless we have saved memory, and made the code a little more straightforward too. More importantly, we have freed the code from the constraints of a rectangular grid. It would now be easy to use the same cells for a grid with a different number of neighbours, or even on “grids” with non-regular topologies such as Penrose tiles [11]. While this implementation scales significantly better than the conventional sequential implementation – and even performs better in many cases – its memory usage is still high.

4.5. Using Forking – The Just-In-Time Model

The major problem with the previous approach is that there is still one dormant process per grid cell; while occam-π processes are extremely lightweight compared to OS threads, they still require space to hold their internal state variables. Fortunately, we can avoid dormant processes entirely using occam-π's “forking” mechanism.
Forking is a safer variant of thread-spawning, in which parameters are passed safely with the semantics of channel communication, and an enclosing FORKING block waits for all processes FORKed inside it to finish. It is commonly used to spawn worker processes to handle incoming requests, as a more efficient replacement for the “pool of workers” approach that is often found in classical occam code.
  REC PROC cell ([Height][Width]PORT BOOL state, running,
                 MOBILE BARRIER bar, VAL INT y, x)
    SEQ
      SYNC bar -- Phase 2 (cells are started from Phase 1)
      INITIAL BOOL me.running IS TRUE:
      WHILE me.running
        BOOL new.state:
        SEQ
          SYNC bar -- Phase 1: read state, atomic set running
          ...  compute new.state from neighbours
          IF
            new.state <> state[y][x]
              PAR i = 0 FOR 8
                ...  compute neighbour location (n.y, n.x)
                INITIAL BOOL b IS TRUE:
                SEQ
                  atomic.swap (running[n.y][n.x], b)
                  IF
                    b -- neighbour already running
                      SKIP
                    TRUE -- neighbour not running
                      FORK cell (state, running, changes!, bar, n.y, n.x)
            TRUE
              me.running := FALSE
          SYNC bar -- Phase 2: write state, clear running
          state[y][x] := new.state
          running[y][x] := FALSE
  :

Figure 7. Code for one Life cell using the “forking” approach.
For this example, we shall do away entirely with channels for inter-cell communication – a very nontraditional approach for occam! Instead, we use shared PORT data with phased access controlled by a barrier [10]. The framework starts the simulation by FORKing off a set of cell processes for the cells that are initially active. Each generation then consists of two phases. In Phase 1, the cell reads the states of the cells around it (directly from the shared state array), computes its new state, and ensures that any cells that need to change are running. In Phase 2, the cell writes its own state back to the shared array (see Figure 7).

The display update can now be done more efficiently: the display process shares the state array and the barrier with the cells, and follows the same phase discipline, reading the state array in Phase 1. It may even be possible to use the computer's display memory directly as the state array, doing away with the separate display process entirely.

The logic that ensures that cells are started correctly requires some explanation. Since a cell may become active for more than one reason – for example, if the cells above and below it both change state – it is necessary to prevent more than one cell process being FORKed for the same cell. A shared “running” array is used for this. In Phase 1, cells atomically swap a variable containing the value TRUE with the element in the array representing the cell they want to start; if the variable contains FALSE after the swap, the cell was not already running and needs to be started. In Phase 2, dying cells reset their slots in the “running” array to FALSE. As new cell processes are FORKed off from Phase 1, they must do an initial barrier synchronisation to get into Phase 2 for the top of the loop. (The only action that would normally be performed in Phase 2 is to write a changed cell's state into the array, and a newly-forked cell will not need to do that.)

The amortised cost of forking off new processes in occam-π is very low (of the order of 70 IA32 instructions), so the sample code will happily consider a cell “dead” if it has been inactive for a single generation.
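The atomic start-up idiom translates naturally into other environments. The following C++ transcription is an illustrative assumption (the paper's code is occam-π): an atomic exchange on a shared “running” flag guarantees that at most one process is ever spawned per cell, no matter how many neighbours try to start it:

  #include <atomic>
  #include <cstdio>
  #include <thread>
  #include <vector>

  constexpr int Width = 64, Height = 48;
  std::vector<std::atomic<bool>> running(Width * Height);

  void cell(int y, int x) {
      // body of a cell process; here it just reports that it was started
      std::printf("cell (%d,%d) started\n", y, x);
  }

  void ensure_running(int y, int x) {
      // Swap TRUE into the cell's slot; only the caller that saw FALSE
      // actually spawns the process.
      if (!running[y * Width + x].exchange(true))
          std::thread(cell, y, x).detach();   // the occam-π code FORKs here
  }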
70 IA32 instructions), so the sample code will happily consider a cell “dead” if it has been inactive for a single generation. In practice, this is rather pessimistic for most Life patterns; many cells will toggle on and off with a period greater than two generations. If we wished to reduce the rate at which processes are created and destroyed, a simple heuristic could be put into place: count the number of generations that the cell has been inactive, and only cause the cell process to die once it has been inactive for N generations. This may result in better performance with JCSP on a system that uses native threads. We now have a very efficient parallel Life implementation in which only as many processes as are needed are running at any one time – process creation is done “just in time”. However, it relies upon shared memory, and thus cannot be implemented (efficiently) across a cluster of machines. For a cluster solution, our approach needs further modification. 4.6. Dynamic Network Creation As occam programmers have known since the 1980s, CSP channels provide a convenient way of modelling network connections between discrete processors. We would therefore like to use channels to connect up our cells while keeping as many of the advantages of the “forking” approach as possible – in particular, only having as many processes in memory as are necessary for the level of activity on the grid. To do this, we will need to dynamically build channel connections between cells – which we can do using occam-π’s mobile channels [9].
Figure 8. Ether surrounding clumps of active processes.
As with the previous approach, problems are caused when two clusters of cells split apart then rejoin, causing the cells between them to be activated for multiple reasons. In this case, it is necessary to connect up the channels correctly between the groups of rejoining cells. Previously we solved this sort of problem using shared data and atomic operations; now we shall instead use a coordinating process which manages channel ends that are not connected to active processes. As, from the modelling perspective, this process occupies the space around and between the clusters of active cells, it is called the “ether” (see Figure 8). Cells now need to know their locations relative to an arbitrary reference point, in order that the ether can identify clusters of cells that drift apart and rejoin. For a non-regular topology, it may be possible to use unique identifiers rather than coordinates, and use external data structures to represent the relationships between cells; that scheme is rather less flexible than the shared-channels approach, but may be easier to manage under some circumstances. Each cell process has channels connecting it to the cells around it (either shared or unshared), much like our previous parallel approaches, except now they are mobile channels, the ends of which may be passed around between processes. Each process also has a connection to the ether (via a channel shared between all cells); when it goes inactive and exits, it
sends a message to the ether returning its channel ends. From the perspective of the cell, all channels are connected to other cells; however, they may actually connect to the ether. When the ether receives a change notification from a cell, it spawns a new cell in the appropriate location, checking its internal data structures to see whether it should be connected to any other cells in its vicinity using other channel ends that the ether is holding. If the ether can reuse existing channels it will; otherwise it will create new mobile channels, keep one end, and pass the other to the new process. (Since the search for existing channel ends is done purely on the basis of coordinates, it should be possible to do it very efficiently within the ether.) As well as cluster-friendliness, using this approach also has the advantage that there is no longer a need for a big array of state. Indeed, sections of the grid that are inactive can just disappear, provided their states are known to the coordinating process; if they consist of empty space then this is easy. This approach should therefore work very well for testing gliders, spaceships and other Life patterns that move across the grid leaving little or nothing behind them; a feature that it has in common with HashLife. Visualising the output from a Life simulation implemented this way could be done by automatically zooming the display to encompass the section of the field that is currently being simulated; this could produce a very compelling visualisation for space-fillers and patterns such as the R-pentomino that expand from a simple cluster. One final problem: the single ether process is a classic bottleneck; not a desirable feature for a parallel system, particularly if we want to make our cluster network topology mimic the connections in our Life grid. 4.7. Removing the Bottleneck The final change is to parallelise the ether. This may be done straightforwardly by dividing it up into sections by coordinates (wrapping around so that an infinitely large grid may be simulated). Adjacent ether processes would need to communicate in order to create new processes and channels within the appropriate ether; air traffic controllers in the real world provide an appropriate analogy. As processes that need to communicate with each other will most likely be registered with the same ether, this approach offers good locality for cluster implementations of Life. In environments which do not provide transparent network channels, the ether processes can also be made responsible for setting up appropriate adaptors at machine boundaries. 5. Process-per-Block Approaches While we have described several efficient ways of implementing Life using occam-π’s facilities, all of the approaches described use one CSP process per cell, and thus still have significantly higher per-cell overhead than the existing sequential approaches. However, this is relatively easy to fix: all of the above approaches may be applied equally well to situations where each “cell” process is actually simulating a group of cells using a sequential (or even internally parallel) approach. The only change is that the state to be exchanged between processes becomes the set of states of the cells on the adjoining edges or corners. Existing sequential approaches can be used virtually unmodified to obtain high performance. 
It may even be possible to switch between several different sequential approaches depending on the contents of the block; for example, the trade-off between HashLife and a “plain” sequential algorithm could be made on the fly depending upon the cache hit rate. To minimise communication costs when two chunks are on the same machine, mobile arrays of data could be swapped back and forth, or shared data could be used, protected by a barrier.
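To make the process-per-block idea concrete, here is a small Python sketch (our own illustration, not occam-π), using a one-dimensional automaton, rule 110, for brevity. Each block steps its cells sequentially; per generation, only the single boundary cell on each side needs to cross the “channel” to the neighbouring block:

    RULE = 110    # rule number encodes the new cell value for each
                  # 3-cell neighbourhood, read as a 3-bit index

    def step_block(block, left, right):
        # One generation of a CA block, given the one boundary cell
        # received from each neighbouring block process.
        padded = [left] + block + [right]
        return [(RULE >> (padded[i-1]*4 + padded[i]*2 + padded[i+1])) & 1
                for i in range(1, len(padded) - 1)]

    # Two adjacent blocks; only the edge cells are "communicated" each step.
    a, b = [0, 1, 0, 0], [1, 0, 1, 1]
    for _ in range(3):
        a, b = step_block(a, 0, b[0]), step_block(b, a[-1], 0)
    print(a, b)

The same shape carries over directly to two dimensions, where edge rows and columns (plus corners) are exchanged instead of single cells.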
6. Conclusion We have presented a number of approaches for simulating cellular automata in efficient ways in extended-CSP programming environments. It is to be hoped that some of these ideas could be used to implement highly-parallel CA simulators that can operate efficiently on extremely large grids. It should be possible to extend these ideas beyond CAs and into other cases where many autonomous entities need to be simulated – for example, finite element analysis or computational fluid dynamics. We have also presented a number of applications for new functionality in the occam-π environment: in particular, some of the first practical uses for barriers and safely-shared data. 7. Acknowledgements The authors would like to acknowledge EPSRC’s support for this work through both a research studentship (EP/P50029X/1) and the TUNA project (EP/C516966/1).

References
[1] F.R.M. Barnes. Dynamics and Pragmatics for High Performance Concurrency. PhD thesis, University of Kent at Canterbury, June 2003.
[2] P.H. Welch. Process Oriented Design for Java: Concurrency for All. In H.R. Arabnia, editor, Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA’2000), volume 1, pages 51–57. CSREA Press, June 2000.
[3] M. Gardner. The fantastic combinations of John Conway’s new solitaire game “life”. Sci. Amer., 223:120–123, October 1970.
[4] A. Adamatzky, editor. Collision-Based Computing. Springer Verlag, 2001.
[5] A.K. Dewdney. Computer Recreations. Sci. Amer., 262:146, January 1990.
[6] D.J. Dimmich and C.L. Jacobsen. A foreign function interface generator for occam-pi. In J. Broenink, H. Roebbers, J. Sunter, P.H. Welch, and D.C. Wood, editors, Communicating Process Architectures 2005, Concurrent Systems Engineering, pages 235–248, The Netherlands, September 2005. IOS Press.
[7] R.W. Gosper. Exploiting regularities in large cellular spaces. Physica D, 10:75–80, 1984.
[8] P.H. Welch, G.R.R. Justo, and C.J. Willcock. Higher-Level Paradigms for Deadlock-Free High-Performance Systems. In R. Grebe, J. Hektor, S.C. Hilton, M.R. Jane, and P.H. Welch, editors, Transputer Applications and Systems ’93, Proceedings of the 1993 World Transputer Congress, volume 2, pages 981–1004, Aachen, Germany, September 1993. IOS Press, The Netherlands. ISBN 90-5199-140-1.
[9] F.R.M. Barnes and P.H. Welch. Prioritised dynamic communicating processes: Part 1. In J. Pascoe, P.H. Welch, R. Loader, and V. Sunderam, editors, Communicating Process Architectures 2002, volume 60 of Concurrent Systems Engineering, pages 321–352, The Netherlands, September 2002. IOS Press.
[10] F.R.M. Barnes, P.H. Welch, and A.T. Sampson. Barrier synchronisations for occam-pi. In Proceedings of the 2005 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA’2005). CSREA Press, June 2005. To appear.
[11] R. Penrose. U.S. Patent #4,133,152: Set of tiles for covering a surface, 1979.
A Unifying Theory of True Concurrency Based on CSP and Lazy Observation Marc L. SMITH Department of Computer Science, Colby College, Waterville, Maine 04901-8858, USA [email protected] Abstract. What if the CSP observer were lazy? This paper considers the consequences of altering the behavior of the CSP observer. Specifically, what implications would this new behavior have on CSP’s traces? Laziness turns out to be a useful metaphor. We show how laziness permits transforming CSP into a model of true concurrency (i.e., non-interleaved trace semantics). Furthermore, the notion of a lazy observer supports tenets of view-centric reasoning (VCR): parallel events (i.e., true concurrency), multiple observers (i.e., different views), and the possibility of imperfect observation. We know from the study of programming languages that laziness is not necessarily a negative quality; it provides the possibility of greater expression and power in the programs we write. Similarly, within the context of the Unifying Theories of Programming, a model of true concurrency — VCR — becomes possible by permitting (even encouraging) the CSP observer to be lazy. Keywords. Unifying Theories of Programming, lazy observation, true concurrency
Introduction This paper presents and explores the interrelationship of four ideas: Unifying Theories of Programming (UTP), true concurrency, CSP, and lazy observation. UTP is a body of work, conceived of and initiated by Hoare and He [1], whose goal remains one of the grand challenges of computer science. True concurrency refers to computational models that provide abstractions for reasoning directly about simultaneity in computation. CSP, originally developed by Hoare [2] and more recently by Roscoe [3], models concurrency via multiple Communicating Sequential Processes. However, CSP abstracts away true concurrency through the nondeterministic sequential interleaving of simultaneously observed events by an Olympian observer. Finally, lazy observation refers to altering the behavior of the CSP observer in a manner to be described later in this section. The result of a lazy observer is support for view-centric reasoning (VCR) within CSP, and a place for VCR within UTP. Scientific theories serve many purposes, including the ability to describe, simulate, and reason about problems of interest, and to make predictions. The same purposes and goals exist within computer science; within a relatively short period of time, many computational abstractions have emerged to address the specification, implementation, and verification of systems. The Unifying Theories of Programming (UTP) [1] provides a framework that more closely aligns computer science with other, more traditional scientific disciplines. Specifically, UTP represents a grand challenge for computer science that is found in other mature scientific disciplines — that of achieving a unification of multiple, seemingly disparate theories. The notion of reasoning about a computation being equivalent to reasoning about its trace of observable events is central to the elegance – and utility – of CSP. CSP further exists as a theory within UTP. The metaphor of an observer recording events, one after another,
in a notebook supports CSP’s approach of observation-based reasoning. True concurrency is abstracted away, we are told, because the observer must record simultaneously occurring events in some sequential order. The argument follows that in the end, any such sequential interleaving is as good as any other. But there exist occasions when reasoning about true concurrency is either necessary or desirable (cf. Section 4). It should be noted that CSP, despite not being a model of true concurrency, has been a tremendously successful approach for designing and reasoning about properties of concurrent systems. The final interrelated idea presented in this paper is what the author has come to characterize as lazy observation, and refers to altering the assumed behavior of the CSP observer. The traditional CSP observer is perfect, and laziness would seem to be a departure from perfection, rather than a route toward true concurrency. To explain why this is not the case, consider first that CSP allows that the observer may witness simultaneously occurring events during a computation. Next, recognize that when forced to sequentially interleave simultaneous events, the observer must decide the order of interleaving. Such a decision takes work, and thus presents an opportunity for laziness: it is easier to record the events as witnessed, occurring in parallel, than to choose which event to record before another. Furthermore, laziness provides a plausible explanation for imperfect observation: the observer being too lazy to record every event. Lazy observation, and the potential for additional, possibly imperfect observers, makes view-centric reasoning (VCR) within CSP possible. The major contribution of this paper addresses one of the many remaining challenges identified by Hoare and He, that of including a theory of true concurrency within the unifying theories of programming. CSP is already described as a theory of programming within UTP; by incorporating laziness into the CSP observer’s behavior, we present VCR, a variant of CSP that supports true concurrency, within UTP. 1. Background Some background beyond a basic familiarity with CSP is required to frame this paper’s contributions within the unifying theories of programming. First, we give a brief overview of VCR and provide some motivation for true concurrency. Next, we discuss the unifying theories of programming; first broadly, then one part more specifically. The scope of UTP is vast and much work remains. The goal of the broad discussion is to introduce the uninitiated reader to some of the motivations for UTP. The more specific discussion is intended to help focus the reader on the particular area within UTP this paper endeavors to build upon. 1.1. Origins of View-Centric Reasoning View-centric reasoning was originally developed by the author as a meta-model for models of concurrency, in the form of a parameterized operational semantics [4]. The idea was to identify parameters whose specification would, in different combinations, collectively serve to instantiate VCR for reasoning about seemingly diverse concurrency paradigms. To identify such parameters requires distilling the essence of concurrency from its many possible forms. What would be the right abstractions to achieve the goal of a general model of concurrency? Fortunately, CSP soon provided the author with a tremendous head start.
While attempting to develop a taxonomy of concurrency paradigms, with varieties that ranged from sequential (as a degenerate form of parallelism) to shared memory, message passing, and generative communication (i.e., the Linda model), the author discovered CSP. What resonated was the idea of observation-based reasoning, and Hoare’s contention that reasoning about the trace of a computation’s observable events is equivalent to reasoning about the computation itself. Traces and the metaphor of an observer recording events as they occur provided the initial inspiration for VCR. The idea of accounting for views of traces arose due to simultaneously reading a book containing Einstein’s essays on relativity [5]! After reading about relativity, the observer’s behavior of interleaving simultaneous events in some arbitrary order wasn’t very satisfying (though CSP’s success in modeling concurrency is undeniable!). It seemed reasonable (i.e., possible in the real world) that if there could be one observer, there could be more; and due to the consequences of relativity, they may not all record events in the same sequence. VCR sought to account for multiple possible observers and their corresponding views. It was from this history that VCR’s parallel event traces emerged. Past work to develop a denotational semantics for VCR can be found in Smith, et al. [6,7]. One of VCR’s goals was to permit reasoning about properties of parallel and distributed computing that require knowledge of true concurrency. Where would such examples be found? One example involving the Linda predicate operations — previously known to be ambiguous in the case of failure — is discussed in Smith, et al. [8], and in the appendix of this paper. The perceived ambiguity of failed Linda predicates resulted from reasoning about their meaning based on interleaved trace semantics. Another example that proves easier to describe with true concurrency than interleaving is the I/O-PAR design pattern, previously presented in Smith [7], and also discussed briefly in the appendix of this paper. 1.2. Unifying Theories of Programming Hoare and He’s Unifying Theories of Programming [1] is a seminal body of work in theoretical computer science. The interested reader is encouraged to study UTP. The purpose of this section is to cover enough concepts and terminology of UTP to support our later discussions of true concurrency in Section 3. Section 1.2.1 introduces concepts and terminology relevant to theories of programming, and Section 1.2.2 considers the particular class of programming theories known as reactive processes. 1.2.1. The Science of Computer Programming The authors of UTP characterize the science of computer programming as a new branch of science. They introduce new language capable of describing observable phenomena, and a formal basis for devising, conducting, and learning from experiments in this realm. Since the scope of UTP includes trying to relate disparate computational models, the approach involves distilling existing models down to their essence, to facilitate comparison. In other words, UTP advocates an approach akin to finding the common denominator when dealing with fractions. In the case of theories of programming, the common basis for comparison includes alphabets, signatures, laws, and normal forms. Let us elaborate briefly on each of these abstractions. Since the science of programming is a science, it is a realm for experimentation where observations can be made. These observations are observable events; to name these events we use an alphabet. Elements of an alphabet are the primitive units of composition; for a given theory of programming (or programming language), the rules for composition are known as its signature. A normal form is a highly restricted subset of some programming language’s signature that has the special property of being able to implement the rules of that language’s complete signature. Intuitively, one could think of compilers that translate high-level languages to a common low-level language; such a low-level language (machine instructions) is a normal form.
It should be noted that for a given language, many normal forms are possible, and in practice, one normal form may be preferable to another depending on the task at hand. For a theory of programming to be useful, it must be capable of formulating statements that may be either true or false for a given program. Such statements are called predicates. Laws are statements involving programs and predicates. Just as not all predicates are true, not all laws are true for all predicates. For a given law, predicates that are true are called healthy, in which case the law is called a healthiness condition. In Section 3 we will discuss healthiness conditions for CSP and VCR.
1.2.2. Reactive Processes and Environment One class of programming theories presented in UTP is that of reactive processes. The notion of environment is elucidated early in this presentation, as environment is essential to theories of reactive processes, examples of which include CSP and its derivative models. Essentially, the environment is the medium within which processes compute. Equivalently, the environment is the medium within which processes may be observed. The behavior of a sequential process may be sufficiently described by making observations only of its input/output behavior. In contrast, the behavior of a reactive process may require additional intermediate observations. Regarding these observations, Hoare and He borrow insight from modern quantum physics. Namely, they view the act of observation to be an interaction between a process and one or more observers in the environment. Furthermore, the roles of observers in the environment may be (and often are) played by the processes themselves! As one would expect, an interaction between such processes often affects the behavior of the processes involved. A process, in its role as observer, may sequentially record the interactions in which it participates. Recall that participation includes the act of observation. Naturally, in an environment of multiple reactive processes, simultaneous interactions may be observed. CSP recording conventions require simultaneous events to be recorded in some sequence, even a random one. Hoare and He thus define a trace as the sequence of interactions recorded up to some given moment in time. 2. Related Work Lawrence has developed two significant CSP extensions, CSPP [9] and HCSP [10]. CSPP presents an acceptance semantics for CSP based on behaviors; HCSP extends CSPP with, among other abstractions, true concurrency. True concurrency in HCSP is represented with bags, similar in spirit to VCR’s parallel events: both abstractions may be recorded in a computation’s trace as an alternative to sequential interleaving. In addition, HCSP’s denotational semantics also provide for the explicit specification of processes participating in truly concurrent events; VCR merely supports the recording of such phenomena in the trace, should such true concurrency happen to be observed during computation. Finally, while the HCSP extensions include true concurrency, the goals of CSPP and HCSP differ from those stated for VCR in this paper. CSPP and HCSP were developed to address the challenges of hardware and software codesign; no reference to UTP appears. Sherif and He [11,12] develop a timed model for Circus, which extends the CSP model given in UTP with a definition of timed traces, and an expands relation over two timed traces to determine subsequence relationships. In this model, timed traces are sequences of observation elements (tuples), each element representing one time unit. Simultaneous events are the result of processes synchronizing both on a set of events, and on the time unit at which those events occur. This model appears to support true concurrency, but interestingly, defines parallel composition in terms of UTP’s merge parallel composition, which nondeterministically interleaves disjoint events. This work was mentioned by one of this paper’s anonymous referees, and warrants further study. It appears these timed traces may be similar to VCR’s views, though it is still not clear to the author whether the timed model for Circus supports multiple, possibly imperfect views.
3. VCR: CSP with True Concurrency This section contains the substance of this paper: our contribution to the Unifying Theories of Programming. VCR is a model of true concurrency, and an extension of CSP.
Table 1. What is observable in the CSP theory of programming

  Abstraction     Symbol   Meaning
  stable state    ok       boolean indicating whether process has started
                  ok′      boolean indicating whether process has terminated
  waiting state   wait     boolean which distinguishes a process’s quiescent states from its
                           terminated states; when true, process is initially quiescent
                  wait′    when true, all other dashed variables are intermediate observations;
                           final observations, otherwise
  trace           tr       sequence of actions that takes place before a process is started
                  tr′      sequence of all actions recorded so far
  refusal set     ref      the set of events initially refused by a process
                  ref′     the set of events refused by a process in its final state
To date, CSP has been drawn within the unifying theories of programming, but not VCR. Furthermore, this author is not aware of any other model of true concurrency (e.g., Petri nets) that has been drawn into UTP, making this paper’s contribution significant. In Section 3.1 we present and describe the healthiness conditions for CSP processes, as identified within UTP. Next, in Section 3.2, we discuss the differences between traditional CSP traces and VCR-compliant CSP traces. Finally, in Section 3.3 we consider the differences between CSP traces and VCR traces, and what impact these differences have on the healthiness conditions of CSP, as we wish to preserve CSP’s healthiness conditions for VCR. 3.1. Healthiness Conditions for CSP We briefly describe the meaning of the healthiness conditions for CSP processes given in Table 2. The alphabet symbols used to express the CSP healthiness conditions are introduced in Table 1. A more complete treatment of CSP healthiness conditions can be found in UTP [1]. Since CSP processes are a special case of reactive processes, Table 2 contains healthiness conditions for both reactive processes (R1–R3) and CSP (CSP1–CSP5). Condition R1 merely states that the current value of a process’s trace must be an extension of the trace’s initial value. This may be a little confusing until one considers that a reactive process may not be the only process within the computation being observed. For a process P, the difference between the current value of P’s trace, tr′, and that trace’s initial value, tr, represents the sequence of events that P has engaged in since it began execution. This is essentially what R2 states, by specifying that the behavior of P after any initial trace is no different from the behavior of P after the empty trace. The healthiness condition R3 is a little more complicated, but not terribly so. R3 is meant to support sequential composition. If we wish to compose P and Q sequentially, we wouldn’t expect to observe events from Q before P reaches its final state. Therefore, R3 states that if a process P is asked to start while its predecessor is in an intermediate state, the state of P remains unchanged. All reactive processes satisfy healthiness conditions R1–R3. CSP processes satisfy R1–R3, but in addition, must also satisfy CSP1 and CSP2. Conditions CSP3–CSP5 (and others not included in UTP) facilitate the proving of CSP laws that CSP1 and CSP2 alone do not support.
Table 2. Healthiness conditions for Reactive processes and CSP

  Process Type   Law     Predicate for program P
  Reactive       R1      P = P ∧ (tr ≤ tr′)
                 R2      P(tr, tr′) = P(⟨⟩, tr′ − tr)
                 R3      P = Π ◁ wait ▷ P,
                         where Π =df ¬ok ∧ (tr ≤ tr′) ∨ ok′ ∧ (tr′ = tr) ∧ · · · ∧ (wait′ = wait)
  CSP            R1–R3
                 CSP1    P = ¬ok ∧ (tr ≤ tr′) ∨ P
                 CSP2    P = P ; ((ok ⇒ ok′) ∧ (tr′ = tr) ∧ · · · ∧ (ref′ = ref))
                 CSP3    P = SKIP ; P
                 CSP4    P = P ; SKIP
                 CSP5    P = P ||| SKIP
Examples of laws include properties of composition, external choice, and interleaving. Again, for a more complete treatment of how these healthiness conditions may be used to prove such laws, see UTP [1]. At a high level, CSP1 states that we cannot predict the behavior of a process before it begins executing. CSP2 states that it is possible to sequentially compose any process P with another process Q, even if Q hides everything about its execution and does so for an indeterminate amount of time, so long as it eventually terminates. Such a process Q is an idempotent of sequential composition. While CSP3–CSP5 do not play a specific role in the remainder of this paper, a few more comments may help the intuition of readers less familiar with UTP. Healthiness conditions CSP3–CSP5 further describe process composition within CSP, and depend upon refusal sets of processes. Process SKIP is employed in the statements of CSP3–CSP5; recall SKIP refuses to engage in any observable event, but terminates immediately. Moreover, a process P satisfies CSP3 if its behavior is independent of the initial value of its refusal set. For example, a → P is CSP3; similarly, a → SKIP is CSP4. The meaning behind CSP5 is less obvious; it is the equivalent of the CSP axiom that states refusal sets are subset-closed. In other words, a process that is deadlocked refuses the events offered by its environment; it would still be deadlocked in an environment offering fewer events. 3.2. The Shape of the Trace From CSP to VCR, the only real change is one of bookkeeping, which in the end, changes the shape of the traces. Since reasoning about a computation reduces to reasoning about its trace, and the trace is the basis for CSP’s process calculus, it is the trace on which we focus. Furthermore, it is easy to confuse the desire for a specification of true concurrency with the ability to observe truly concurrent events during a computation, and preserve this information in the trace. Within UTP, traces of reactive processes range over sequences from alphabet A of observable events, which may be expressed via the Kleene closure A*. Then, to compare traces, UTP uses the standard relations: = to test equality, and ≤ to represent the prefix property. In addition, there is a quotient operator, −, defined over traces. For example, let tr, tr′ ∈ A*, where tr = abcde and tr′ = abcdefg. Then the following statements are true:
• tr = tr, since equality is reflexive,
• tr ≤ tr′, since tr is a prefix of tr′, and
• tr′ − tr = fg, since tr′ and tr have the common prefix abcde.
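These relations are easy to operationalise; the following few lines of Python (our own illustration) mirror this word-over-an-alphabet view of traces:

    tr, tr2 = "abcde", "abcdefg"          # tr2 plays the role of tr'

    def prefix(s, t):                     # s <= t
        return t[:len(s)] == s

    def quotient(t, s):                   # t - s, defined when s <= t
        assert prefix(s, t)
        return t[len(s):]

    assert tr == tr                       # equality is reflexive
    assert prefix(tr, tr2)                # tr <= tr'
    assert quotient(tr2, tr) == "fg"      # tr' - tr = fg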
The UTP representation of traces as words over an alphabet is elegant. In striving to augment the unifying theories with a theory of true concurrency, we must change the shape of the trace sufficiently to represent the parallel events of VCR, but not so much that we lose the ability to define the equality (=) and prefix (≤) relations, or the quotient (−) operator. Ideally, the new definition of a trace will not lose much elegance. We begin with a new definition of trace, one that supports view-centric reasoning.

Definition 1 (trace) A trace, tr, is a comma-delimited sequence of sequences over alphabet A, where , ∉ A. Formally: tr ∈ ,(A⁺,)*

The comma (,) delimiter provides the ability to index and parse individual subsequences, or words, from tr. Under this definition, traces begin and end with a comma; the empty trace — represented by a single comma — is a somewhat special case, where the beginning and ending comma are one and the same. We pause briefly to discuss view-centric reasoning in light of this new definition of trace. Each word in tr represents a multiset of observable events (parallel events in VCR terminology). In other words, each word could be rewritten as any permutation of its letters, since multisets are not ordered. This notation preserves VCR’s ability to distinguish a computation’s history from its corresponding views. Since we can still parse the multisets from a trace, we can consider all possible ROPEs (Randomly Ordered Parallel Events) for each multiset, and all possible views of a trace. A ROPE of a word is simply any permutation of any subset of that word (the subsets reflect the possibility of imperfect observation). So, just as words are the building blocks for traces, ROPEs are the building blocks for views. For a more comprehensive treatment of VCR, see Smith, et al. [8]. Given this new definition of trace, it remains to define equality, prefix, and quotient. To help, we first define notions of trace length and word indexing. We begin with length. Notice that the empty trace contains one comma, and all traces that are one-word sequences contain two commas, etc. In general, traces contain one more comma than the number of words in their sequence. Thus the length of a trace reduces to counting the number of commas, then subtracting one. In UTP notation, s ↓ E means “the subsequence of s omitting elements outside E”, and #s means “the length of s”. Composing these two notations, we define the length of a trace.

Definition 2 (length of trace) The length of a trace, tr, denoted |tr|, is the number of comma-delimited words in tr. Formally: |tr| = #(tr ↓ {,}) − 1

Next, we define word indexing within a trace — the ability to refer to the i-th word of a trace. In the following definition, the subword function returns the subsequence of symbols exclusively between the specified indices (that is, without the surrounding commas).

Definition 3 (i-th word of trace) Given nonempty trace tr, let tr[i] denote the i-th word of tr, where n = |tr| and 1 ≤ i ≤ n; and let cᵢ denote the index of the i-th comma in tr, where c₀ ≤ cᵢ ≤ cₙ. Formally, tr[i] = subword(tr, cᵢ₋₁, cᵢ)

In the preceding definition, cᵢ₋₁ and cᵢ refer, respectively, to the commas just before and just after the i-th word in tr. We may now easily define the notions of equality, prefix, and quotient over the new definition of traces. In the following definition of equality, the permutations function returns the set of all permutations of the given word.
Definition 4 (trace equality) Given two traces, tr and tr′, tr = tr′ iff 1. |tr| = |tr′|, and 2. ∀ i, 1 ≤ i ≤ |tr|, ∃ w ∈ permutations(tr′[i]) s.t. w = tr[i].
This definition states that two traces are equal if they are the same length, and for each corresponding pair of words from the two traces, one word must be equal to some permutation of the other. Next, we define the prefix relation for traces, which follows directly from the preceding definition of equality.

Definition 5 (trace prefix) Given two traces, tr and tr′, tr ≤ tr′ iff 1. |tr| = m and |tr′| = n and m ≤ n; and 2. ∀ i, 1 ≤ i ≤ m, ∃ w ∈ permutations(tr′[i]) s.t. w = tr[i].

This definition states that, given two traces, the first trace is a prefix of the second if the second trace is at least as long as the first, and for each corresponding pair of words, up to the number of words in the first trace, one word must be equal to some permutation of the other. Finally, with the preceding definition of prefix, we can define the quotient of two traces. In the following definition of quotient, the tail function returns the subsequence of the given trace from the given index, inclusive, to the end (that is, it includes the leading comma).

Definition 6 (trace quotient) Given two traces, tr and tr′, where tr ≤ tr′, m = |tr|, and n = |tr′|; let cₘ denote the index of the m-th comma in tr′, where c₀ ≤ cₘ ≤ cₙ. The quotient tr′ − tr = tail(tr′, cₘ).

Let’s consider some examples to further illustrate this new definition of trace, and its associated properties. Let A = {a, b, c, d, e, f, g}, tr₁ = ,ab,cd, , tr₂ = ,ba,cd,efg, , and tr₃ = ,ba,dc, . Then the following statements are true:
• tr₁ = tr₁ and tr₁ = tr₃
• tr₁ ≤ tr₂ and tr₃ ≤ tr₂
• tr₂ − tr₁ = ,efg, and tr₂ − tr₃ = ,efg,
• tr₁ − tr₁ = , and tr₁ − tr₃ = ,
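To make Definitions 1–6 concrete, here is a small Python model (our own illustration) that parses the comma-delimited form into words, compares corresponding words as multisets, and computes length, prefix and quotient; it reproduces the example results above:

    from collections import Counter

    def words(tr):                        # parse ",ab,cd," -> ["ab", "cd"]
        return [w for w in tr.split(",") if w]

    def eq(s, t):                         # Definition 4
        a, b = words(s), words(t)
        return len(a) == len(b) and all(
            Counter(x) == Counter(y) for x, y in zip(a, b))

    def prefix(s, t):                     # Definition 5
        a, b = words(s), words(t)
        return len(a) <= len(b) and all(
            Counter(x) == Counter(y) for x, y in zip(a, b))

    def quotient(t, s):                   # Definition 6: t - s, for s <= t
        assert prefix(s, t)
        rest = words(t)[len(words(s)):]
        return "," + "".join(w + "," for w in rest)

    tr1, tr2, tr3 = ",ab,cd,", ",ba,cd,efg,", ",ba,dc,"
    assert eq(tr1, tr3) and prefix(tr1, tr2) and prefix(tr3, tr2)
    assert quotient(tr2, tr1) == ",efg," and quotient(tr1, tr3) == ","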
3.3. Healthiness Conditions for VCR: Laziness Revisited We can think of healthiness conditions for VCR in at least two ways. First, we defined notions of trace equality, prefix, and quotient for VCR traces, and could substitute the new definitions within UTP’s existing healthiness conditions R1–R3 and CSP1–CSP2. The revised healthiness conditions for VCR traces hold, by definition. VCR traces are still traces of processes that conform to the healthiness conditions of CSP processes. This is not surprising, since initially, all we set out to do was change the CSP observer’s behavior, and the shape of the resulting traces she records. To this point, VCR hasn’t touched a single law pertaining to specification, only observation. The result is a newly-structured CSP trace that supports view-centric reasoning. Of course, the justification for this approach of preserving healthiness conditions stems from laziness on the part of the observer due to procrastinating the work of interleaving. There is another way to think of healthiness conditions for VCR, however. The key is to consider the VCR trace an intermediate trace; one that can be transformed (i.e., reduced) to a standard CSP trace by interleaving the elements of the event multisets, or words, as we defined them. Using our UTP notation, this involves removing the commas from the VCR trace, and replacing each word with some permutation of itself to simulate the arbitrary interleaving the CSP observer would have done. Notice that once the commas are removed, the individual words are essentially concatenated together, yielding a single word over A*. This is laziness in the same sense as above, stemming from the observer’s reluctance to interleave simultaneous events. Let’s take a moment to compare these two approaches to preserving CSP healthiness conditions. In both cases, a lazy observer has put off the work of interleaving simultaneous events while recording the trace of a computation. The processes being observed are the same CSP processes whose events a traditional observer would record, and therefore the CSP healthiness conditions should be preserved. The two approaches to preserving CSP healthiness have one thing in common: they both rely on a transformation. In the first case, the healthiness conditions themselves are transformed with new definitions of trace equality, prefix, and quotient. In the second case, the new trace definition is viewed as an intermediate state, and transformed into the form of a traditional CSP trace. In both cases, the laziness is resolved when we wish to reason about the computation. Notice that it is not always possible to go in the other direction; that is, to transform a CSP trace into a VCR trace. The context of which events were interleaved, as opposed to sequentially occurring and recorded, is not available. This suggests there may be properties of VCR traces that cannot be reasoned about with CSP traces. Indeed, there are such properties, and the interested reader can find more information in Smith, et al. [8].
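The second transformation, resolving the deferred interleaving, is equally short; in the same illustrative Python as above, each run may legitimately yield a different CSP trace for the same VCR trace:

    import random

    def to_csp(tr):
        # Reduce a VCR trace to one of its CSP interleavings: drop the
        # commas and replace each word by an arbitrary permutation.
        out = []
        for w in (x for x in tr.split(",") if x):
            letters = list(w)
            random.shuffle(letters)       # the procrastinated interleaving
            out.extend(letters)
        return "".join(out)

    print(to_csp(",ab,cd,efg,"))          # e.g. "bacdgef"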
4. Conclusions and Future Work This paper begins with a simple conjecture: what if the CSP observer were lazy? From this simple conjecture we explored the Unifying Theories of Programming, Communicating Sequential Processes, and View-Centric Reasoning. In the context of UTP, CSP is a theory of programming, but not a theory of true concurrency. The CSP process algebra allows simultaneous events to occur, but the traditional interleaved trace does not permit one to reason directly about simultaneity. The metaphor of lazy observation — deferring the work of interleaving — provides a bridge from traditional CSP to a CSP that supports view-centric reasoning, thanks to a change in bookkeeping. The CSP specification remains unchanged, but our ability to reason about properties that depend on knowledge of true concurrency benefits. Thanks to Hoare and He’s elegant yet powerful use of healthiness conditions to classify processes as CSP processes (and for other theories of programming), the work to describe a theory of true concurrency within UTP focused on the CSP healthiness conditions, rather than beginning from scratch with a denotational semantics for VCR. This was a surprisingly easy way to draw true concurrency into the Unifying Theories of Programming. More work remains with respect to true concurrency, UTP, and CSP. There are probably more healthiness conditions that need to be defined to reflect properties one can reason about in VCR that one cannot in CSP. Furthermore, there are many CSP models: Traces, Stable Failures, Failures/Divergences, and others. In this paper, we have considered the impact of VCR’s parallel event traces on the process calculus of the CSP model given in Hoare and He’s UTP. In addition, there is the challenge of specification regarding true concurrency. As mentioned earlier in Section 3.2, the focus of this paper has been on observation rather than specification of true concurrency. VCR to date has only permitted the possibility of simultaneous events in computation, and provided a means to capture simultaneity in its traces when it occurs. This has proven useful, to be sure. However, the specification of true concurrency would be even more useful (e.g., regarding I/O-PAR). In addition to Lawrence’s HCSP, and other non-CSP models of true concurrency, providing a theory of programming within UTP that permits the specification of true concurrency would be another important step forward in support of this grand challenge. The author is working on algebraic laws for parallel composition and interleaving that may lead, for example, to a simplified specification for I/O-PAR and I/O-SEQ. Unlike what was possible for the work presented in this paper, these new laws will require new theorems and proofs for VCR processes.
Acknowledgments The author wishes to thank the anonymous referees for suggesting clarifications, and for providing corrections and pointers to additional related work. Jim Woodcock and Alistair McEwan provided valuable interpretations of UTP during the early developmental stages of this research. Allan McInnes read the submitted draft of this paper and provided feedback and an important correction. VCR and related initial explorations into models of true concurrency date back to the author’s early collaboration with Rebecca Parsons and Charles Hughes.

References
[1] C.A.R. Hoare and Jifeng He. Unifying Theories of Programming. Prentice Hall Series in Computer Science. Prentice Hall Europe, 1998.
[2] C.A.R. Hoare. Communicating Sequential Processes. Prentice Hall International Series in Computer Science. Prentice-Hall International, UK, Ltd., UK, 1985.
[3] A.W. Roscoe. The Theory and Practice of Concurrency. Prentice Hall International Series in Computer Science. Prentice Hall Europe, 1998.
[4] Marc L. Smith. View-centric Reasoning about Parallel and Distributed Computation. PhD thesis, University of Central Florida, Orlando, Florida 32816-2362, December 2000.
[5] Albert Einstein. The Theory of Relativity and Other Essays. Barnes & Noble Books, 1997.
[6] Marc L. Smith, Charles E. Hughes, and Kyle W. Burke. The denotational semantics of view-centric reasoning. In J.F. Broenink and G.H. Hilderink, editors, Communicating Process Architectures 2003, volume 61 of Concurrent Systems Engineering Series, pages 91–96, Amsterdam, 2003. IOS Press.
[7] Marc L. Smith. Focusing on traces to link VCR and CSP. In I.R. East, D. Duce, M. Green, J.M.R. Martin, and P.H. Welch, editors, Communicating Process Architectures 2004, volume 62 of Concurrent Systems Engineering Series, pages 353–360, Amsterdam, 2004. IOS Press.
[8] Marc L. Smith, Rebecca J. Parsons, and Charles E. Hughes. View-centric reasoning for Linda and tuple space computation. IEE Proceedings–Software, 150(2):71–84, April 2003.
[9] Adrian E. Lawrence. Acceptances, behaviours, and infinite activity in CSPP. In J.S. Pascoe, P.H. Welch, R.J. Loader, and V.S. Sunderam, editors, Communicating Process Architectures 2002, Concurrent Systems Engineering, pages 17–38, Amsterdam, 2002. IOS Press.
[10] Adrian E. Lawrence. HCSP: Imperative state and true concurrency. In J.S. Pascoe, P.H. Welch, R.J. Loader, and V.S. Sunderam, editors, Communicating Process Architectures 2002, Concurrent Systems Engineering, pages 39–55, Amsterdam, 2002. IOS Press.
[11] Adnan Sherif and Jifeng He. A framework for the specification, verification and development of real time systems using Circus. Technical Report 270, UNU-IIST, P.O. Box 3058, Macau, November 2002.
[12] Adnan Sherif and Jifeng He. Towards a time model for Circus. In Proceedings of the 4th International Conference on Formal Engineering Methods, volume 2495 of LNCS. Springer-Verlag, October 2002.
[13] David Gelernter. Generative communication in Linda. ACM Transactions on Programming Languages and Systems, 7(1), January 1985.
[14] P.H. Welch. Emulating Digital Logic using Transputer Networks (Very High Parallelism = Simplicity = Performance). International Journal of Parallel Computing, 9, January 1989. North-Holland.
[15] P.H. Welch, G.R.R. Justo, and C.J. Willcock. Higher-Level Paradigms for Deadlock-Free High-Performance Systems. In R. Grebe, J. Hektor, S.C. Hilton, M.R. Jane, and P.H.
Welch, editors, Transputer Applications and Systems ’93, Proceedings of the 1993 World Transputer Congress, volume 2, pages 981–1004, Aachen, Germany, September 1993. IOS Press, Netherlands. ISBN 90-5199-140-1. [16] J.M.R. Martin, I. East, and S. Jassim. Design Rules for Deadlock Freedom. Transputer Communications, 3(2):121–133, September 1994. John Wiley and Sons. 1070-454X. [17] J.M.R. Martin and P.H. Welch. A Design Strategy for Deadlock-Free Concurrent Systems. Transputer Communications, 3(4):215–232, October 1996. John Wiley and Sons. 1070-454X.
Appendix: Utility of True Concurrency In this appendix we give two different examples of the utility of true concurrency. The first example concerns Linda predicate operations, which were known to be ambiguous in the case of failure. The ambiguity, however, was based on reasoning about their meaning using
an interleaving semantics. The second example concerns the I/O-PAR design pattern, whose proper use provides guarantees of deadlock freedom. In this case, true concurrency permits more descriptive trace expressions than possible via interleaving. In both cases, the true concurrency of VCR’s parallel event traces provides a valuable abstraction for reasoning about the problems at hand. Linda Predicates Ambiguity The Linda model of concurrency is due to Gelernter [13]. Linda processes are sequential processes that interact via a shared associative memory known as Tuple Space (TS). TS is a container of tuples; a tuple is a sequence of some combination of values and/or value-yielding computations (i.e., Linda processes). A tuple is either active or passive, depending on whether all its values have been computed. Since TS is an associative memory, tuples are matched, not addressed. Linda is a coordination language consisting of four basic operations: create a new active tuple (containing one or more Linda processes) in TS, eval(t); place a new passive tuple in TS, out(t); match an existing tuple in TS, rd(t′); and remove a tuple from TS, in(t′). In the case of matching or removing tuples, only passive tuples are considered; and furthermore, rd(t′) and in(t′) are blocking operations (in the case where no matching tuple exists). Because it is not always desirable to block, non-blocking predicate versions of rd() and in() were originally proposed by Gelernter, rdp() and inp(), but later removed from the Linda language specification due to the aforementioned ambiguity. We are now ready to illustrate the ambiguity. Suppose at the same moment in time, one process places a tuple in TS while two other processes attempt to match and remove that tuple, respectively. We represent this scenario notationally as follows: out(t).p1, rdp(t′).p2, and inp(t′).p3. This notation indicates that p1 is about to place a tuple, t, in TS before continuing its behavior as p1. Similarly for p2 and p3, which are both about to attempt to match t (where the specified template t′ would match tuple t in TS). Notice the outcome of this interaction point in TS is nondeterministic, and several possibilities exist. First, it is possible for both predicate operations to succeed, as well as fail, since the matching tuple is being placed in TS at the same instant as the attempts to match it. It is in some sense both present and not present in this instant, rather akin to a quantum state of superposition. Next, it is also possible that one predicate, but not both, succeeds in this instant. In this case, consider if it were the rdp(t′) that happened to fail. The failure could be due to the uncertainty properties that result from tuple t’s state of superposition; or it could also be due to the success of the inp(t′) operation removing it from TS in the same instant it was placed in TS by the out(t) operation, but “before” the rdp(t′) operation could match it. For such a simply stated scenario, there are certainly many possibilities! Such is the challenge of nondeterminism. Let’s focus on one possible outcome. Suppose the Linda operations were observable events, and both predicate operations failed while the matching tuple t was placed in TS. Let a predicate operation decorated with complement notation indicate a failure to match the desired tuple. In a VCR trace an observer could thus record: . . . , {out(t), r̄d̄p̄(t′), īn̄p̄(t′)}, . . .
The CSP observer, witnessing the same outcome, must decide an arbitrary interleaving of these three observable events. There are six possible interleavings, not accounting for imperfect observation. Not all of the interleavings make sense, however. Here are the possibilities:
1. . . . , out(t), r̄d̄p̄(t′), īn̄p̄(t′), . . .
2. . . . , out(t), īn̄p̄(t′), r̄d̄p̄(t′), . . .
3. . . . , r̄d̄p̄(t′), out(t), īn̄p̄(t′), . . .
4. . . . , īn̄p̄(t′), out(t), r̄d̄p̄(t′), . . .
5. . . . , r̄d̄p̄(t′), īn̄p̄(t′), out(t), . . .
6. . . . , īn̄p̄(t′), r̄d̄p̄(t′), out(t), . . .
In particular, the first four interleavings, where the out(t) operation is recorded before one or both of the failed predicates, would be especially concerning. When reasoning about these traces, there is no context of simultaneity preserved. It is not clear whether the events in question occurred sequentially, or simultaneously (and were interleaved by the observer). Only the last two interleavings would make sense in a CSP trace. When reasoning about the meaning of the failed predicates, it is natural to ask the question: “This predicate just failed, but is there a tuple in TS that matches the predicate’s template?” Put another way, one should be able to reason about the state of TS at any point along a trace following a Linda primitive operation. Following a failed predicate, one should be able to reason that no matching tuple exists in TS, but given the possibility of interleaving — an additional potential level of nondeterminism — one cannot discern from the possibilities whether a matching tuple indeed exists! What just happened? In the presence of interleaving semantics, there are two levels of nondeterminism that become entangled. The first level is the outcome of simultaneous operations at an interaction point in TS. The second level of nondeterminism is the order of interleaving, at which point the context of which events occurred concurrently is lost. However, given our scenario and chosen outcome, one can reason from the given VCR trace that, after the parallel event in which both Linda predicates failed, matching tuple t does indeed exist in TS. The meaning in this case of failure is no longer ambiguous, because the context of the failure occurred within the parallel event, not at any time after.

I/O-PAR Design Pattern Additionally, it has been pointed out to the author that support for true concurrency, while not required for reasoning about certain design patterns, has the potential to greatly enhance the behavioral description of such patterns. I/O-PAR (and I/O-SEQ) are design patterns described by Welch, Martin and others in [14,15,16,17]. This example was also discussed in Smith [7]. The reason these design patterns are appealing is that arbitrary-topology networks of I/O-PAR processes are guaranteed to be deadlock/livelock free, and thus they are desirable components for building systems (or parts of systems). Informally, a process P is considered I/O-PAR if it operates deterministically and cyclically, such that, once per cycle, it synchronizes in parallel on all the events in its alphabet. For example, processes P and Q, given by the following CSP equations, are I/O-PAR:

P = (a → SKIP ||| b → SKIP); P
Q = (b → SKIP ||| c → SKIP); Q

VCR traces of P and Q are, respectively, all prefixes of trP and trQ:

trP = {a, b}, {a, b}, {a, b}, . . .
trQ = {b, c}, {b, c}, {b, c}, . . .

Notice how elegantly these parallel event traces capture the essence of the behavior of processes P and Q. If one were to attempt to represent the behavior of P and Q using traditional CSP traces, the effort would be more tedious and cumbersome.
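Both the six interleavings above and VCR’s ROPEs are just permutation enumerations, which a short illustrative Python fragment makes explicit (the strings are ASCII stand-ins for the decorated operations):

    from itertools import combinations, permutations

    word = ["out(t)", "failed rdp(t')", "failed inp(t')"]

    for seq in permutations(word):        # the six CSP interleavings
        print(", ".join(seq))

    def ropes(word):
        # All ROPEs of a parallel event: any permutation of any subset
        # (the subsets model imperfect observation).
        for r in range(len(word) + 1):
            for subset in combinations(word, r):
                yield from permutations(subset)

    print(sum(1 for _ in ropes(word)))    # 16 ROPEs for this 3-event word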
The Architecture of the Minimum intrusion Grid (MiG) Brian VINTER University of Southern Denmark, Campusvej 55, DK-5230 Odense M, Denmark Abstract. This paper introduces the philosophy behind a new Grid model, the Minimum intrusion Grid, MiG. The idea behind MiG is to introduce a ‘fat’ Grid infrastructure which will allow much ‘slimmer’ Grid installations on both the user and resource side. This paper presents the ideas of MiG, some initial designs and finally a status report of the implementation.
1. Introduction Grid computing is just around the top of the hype-curve, and while large demonstrations of Grid middleware exist, including the Globus toolkit [8] and NorduGrid ARC [9], the tendency in Grid middleware these days is towards a less powerful model, Grid services, than what was available previously. This reduction in sophistication is driven by a desire to provide more stable and manageable Grid systems. While striving for stability and manageability is obviously right, doing so at the cost of features and flexibility is not so obviously correct. The Minimum intrusion Grid, MiG, is an attempt to design a new platform for Grid computing which is driven by a stand-alone approach to Grid, rather than integration with existing systems. The goal of the MiG project is to provide a Grid infrastructure where the requirements on users and resources alike, to join Grid, are as small as possible – thus the minimum intrusion part. While striving for minimum intrusion, MiG will still seek to provide a feature-rich and dependable Grid solution. 2. Grid Middleware The driving idea behind the Minimum intrusion Grid project is to develop a Grid [7] middleware that allows users and resources to install and maintain a minimum amount of software to join the Grid. MiG will seek to allow very dynamic scheduling and scale to a vast number of processors. As such MiG will close the gap between the existing Grid systems and popular “Screen Saver Science” systems, like SETI@Home. 2.1.1 Philosophy behind MiG “The Minimum intrusion Grid” – this really is the philosophy: we want to develop a Grid middleware that makes as few requirements as possible. The working idea is to ensure that a user needs only a signed x509 certificate, trusted by Grid, and a web browser capable of secure HTTP, HTTPS [10]. A resource, on the other hand, must also hold a trusted x509 certificate and in addition create a user – the Grid user – who can use secure shell, ssh, to enter the resource and, once logged on, can open HTTPS connections to the outside. The requirements then become:
190
Table 1. Requirements for using MiG

                          User    Resource
Must have certificate     Yes     Yes
Must have outbound HTTP   Yes     Yes
Must have inbound SSH     No      Yes
2.2 What's Wrong with the Existing Grid Systems?

While there are many Grid middleware systems available, most of them are based on, or are descendants of, the Globus toolkit. Thus the description below addresses what the author believes to be shortcomings in the Globus toolkit, and not all issues may be relevant to all Grid systems.

2.2.1 Single Point of Failure

Contrary to popular claim, all existing Grid middlewares hold a central component that, if it fails, requires the user to manually choose an alternative. While the single point of failure may not truly be a single point, and may provide some level of redundancy, none of the components scale with the size of the Grid.

2.2.2 Lack of Scheduling

All existing systems perform a job-to-resource mapping. However, actual scheduling with a metric of success is not available. Work is underway on this in the community scheduler[16], but for this scheduler to work, the resources need to be exclusively signed over to Grid, i.e. a machine cannot be accessed both through Grid and a local submission system.

2.2.3 Poor Scalability

The time taken to perform the job-to-resource mapping in the current systems scales linearly with the number of sites that are connected. This is already proving to be a problem in NorduGrid, one of the largest known Grids, though only 36 sites are connected. Scaling to tens of thousands of connected sites is therefore not plausible. In the Grid service model, scalability issues are more or less eliminated by the absence of a single system view from a user perspective.

2.2.4 No Means of Implementing Privacy

The job submission API at the user's machine communicates directly with all the potential sites; thus all sites know the full identity of all jobs on the Grid.

2.2.5 No Means of Utilizing 'Cycle-Scavenging'

Cycle-scavenging, or Screen Saver Science, utilizes spare CPU cycles when a machine is otherwise idle. This requires an estimate of how long the machine will be available, and all existing systems simply assume that a free resource will be available indefinitely. The model has been partly demonstrated in NorduGrid by connecting a network of workstations running Condor to NorduGrid, but Grid itself has no means of screen-saver science.
2.2.6 Requires a Very Large Installation on Each Resource and on the User Site

The middleware that must be installed on a resource to run NorduGrid (probably the best of the well-known Grid middlewares: NorduGrid is the world's largest multi-discipline Grid and is frequently used for arguing for new features in other Grid systems) is more than 367 MB, including hundreds of components, all of which must be maintained locally. This means that donating resources to Grid is associated with significant maintenance costs, which naturally limits the willingness to donate resources.

2.2.7 Firewall Dependency

To use the existing middlewares, special communication ports to the resource must be opened in any firewall that protects the resource. This is an obvious limitation for growing Grid, since many system administrators are reluctant to make such port openings. One project that seeks to address this problem is the centralized-gateway-machine project under the Nordic Data Grid Facility[17], which receives jobs and submits them to the actual resource using SSH.

2.2.8 Highly Bloated Middleware

The existing middleware solutions provide a very large set of functions that are placed on each site, making the software very large and significantly increasing the number of bugs, and thus the need for maintenance.

2.2.9 Complex Implementation Using Multiple Languages and Packages

The current Grid middlewares have reused a large amount of existing solutions for data transfer, authentication, authorization, queuing, etc. These existing solutions are written in various languages; as a result the Grid middleware uses more than 6 programming languages and several shell types, in effect raising the cost of maintaining the package further. The many languages and shells also limit portability to other platforms.

3. MiG Design Criteria

MiG should design and implement a functional Grid system with a minimal interface between the Grid, the users, and the resources. The successful MiG middleware implementation should hold the following properties.

3.1.1 Non-Intrusive

Resources and users should be able to join Grid with a minimum of effort and a minimum software installation. The set of requirements that must be met to join Grid should also be minimal. "Minimal" in this context should be interpreted rigidly, meaning that if any component or functionality in MiG can be removed from the resource or user end, this must be done, even if adding the component at the resource or user end would be easier.

3.1.2 Scalable

MiG should be able to contain tens of thousands, even millions, of resources and users without the size of the system impacting performance. Even individual PCs should be able to join as resources. For a distributed system, such as MiG, to be truly scalable it is
necessary that the efficiency and performance of the system are not reduced as the number of associated computers grows.

3.1.3 Autonomous

MiG should be able to perform an update of the Grid without changing the software on the user or resource end. Thus compatibility problems that arise from using different software versions should be eliminated by design. To obtain this feature it is necessary to derive a simple and well-defined protocol for interaction with the Grid middleware. Communication within the Grid can be arbitrarily complex, though, since an autonomous Grid architecture allows the Grid middleware to be upgraded without collaboration from users and resources.

3.1.4 Anonymous

Users and resources should not see the identity of each other if anonymity is desired. This is a highly desirable feature for industrial users that are concerned with revealing their intentions to competing companies. A highly speculative example could be two pharmaceutical companies A and B. Company A may have spare resources on a computational cluster for genome comparisons, while B may be lacking such resources. In a non-anonymous Grid model, company B will be reluctant to use the resources at company A, since A may be able to derive the ideas of B from the comparisons they are making. However, in a Grid that supports anonymous users, A will not know which company is running which comparisons, which makes the information much less valuable. In fact, many comparisons are likely to be part of research projects that map genomes and will thus reveal nothing but information that is already publicly available.

3.1.5 Fault Tolerance

Failing machines or processes within the Grid should not stop users or resources from using the Grid. While obvious, the lack of fault tolerance is apparent in most Grid middlewares today. The consequences of lacking fault tolerance range from fatal to annoying. Crashes are fatal when a crashed component effectively stops users from running on Grid, e.g. a hierarchy of Meta Directory Servers. If a resource that runs users' processes crashes, it becomes costly for the users that are waiting for the results of the now lost jobs. Finally, crashes are merely annoying when a crashed component simply does not reply and thus slows down the user's interactions with the Grid because of timeouts.

3.1.6 Firewall Compliant

MiG should be able to run on machines behind firewalls, without requiring new ports to be opened in the firewall. While this requirement is quite simple to both motivate and state, actually coping with the constraints of this point may prove highly difficult.

3.1.7 Strong Scheduling

MiG should provide real scheduling, not merely job placement, but it needs to do so without requiring exclusive ownership of the connected resources. Multi-node scheduling should be possible, as should user-defined scheduling for dynamic subtasking. In effect, MiG should also support meta-computing (metacomputing is a concept that precedes Grid computing; its purpose is to create a large virtual computer for executing a single application).
3.1.8 Cooperative Support

In order to improve the meta-computing qualities, MiG should provide access to shared user-defined data structures. Through these data structures a MiG-based Grid system can support collaborating applications and thus improve the usability of Grid.

4. The Abstract MiG Model

The principal idea behind MiG is to provide a Grid system with an overall architecture that mimics a classic, and proven, model: the Client-Server approach. In the Client-Server approach the user sends his or her job to the Grid and receives the result. The resources, on the other hand, send a request and receive a job. After completing the job the resource sends the result to the Grid, which can forward the reply to the user.
Figure 1. The abstract MiG model
The Grid system should be disjoint from both the users and the resources; thus the Grid appears as a centralized black box to both users and resources. This model allows us to remain in full control of the Grid: upgrades and troubleshooting can be performed locally within Grid, rather than relying on collaboration from a large number of system administrators. In addition, moving all the functionality into a physical Grid system lowers the entry level that is required for both users and resources to join, thus increasing the chances that more users and resources do join the Grid.

In MiG, storage is also an integrated component: users will have their own 'home directory' on MiG, which can be easily accessed and referenced directly in job descriptions, so that all issues with storage elements and replica catalogs are entirely eliminated.

For a user to join, all that is required is an x509 certificate which is signed by a certificate authority that is trusted by MiG. Accessing files, submitting jobs and retrieving results can then all be done through a web browser that supports certificate-based HTTPS. As a result the user need not install any software to access Grid, and if the certificate is carried on a personal storage device, e.g. a USB key, a user can access Grid from any internet-enabled machine.

The requirements for resources to join MiG should also be an x509 certificate, but in addition the resource must create a Grid account in which Grid jobs are run. Initially MiG requires that this user can SSH into the account, but alternatives to this model will be investigated.

4.1 The Simple MiG Model

In a simple version of the MiG model there is only a single node acting as the Grid. Clients and resources then communicate indirectly through that Grid node. The interface between
the user and Grid should be as simple as possible. The exact protocol remains a topic for investigation but, if possible, it will be desirable to use only the HTTP[10] protocol or a similar widely used, and trusted, protocol. Towards the resources the protocol should be equally simple, but in this case, as we also desire that no dedicated Grid service is running on the resource, one obvious possibility is to use the widely supported SSH[11] protocol.

When submitting a job, the user sends it to the Grid machine, which stores the job in a queue. At some point a resource requests a job, and the scheduler chooses a job to match the resources that are offered. Once the job is completed, the results are sent back to MiG. The user is informed that the job has completed and can now access MiG and retrieve the results.
Figure 2. The simple MiG model
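As a rough illustration of this request-reply flow, the single Grid node of the simple model could be sketched as below. The class and method names are hypothetical; this is not MiG's actual implementation, only a minimal model of the queue-based protocol just described.

import java.util.*;
import java.util.concurrent.*;

// Minimal sketch of the simple-model job flow: users submit jobs into
// a queue on the Grid node, resources pull jobs and push results back.
class SimpleGridNode {
    record Job(String id, String description) {}

    private final BlockingQueue<Job> queue = new LinkedBlockingQueue<>();
    private final Map<String, String> results = new ConcurrentHashMap<>();

    // User side: submit a job; the node stores it in the queue.
    String submit(String description) {
        String id = UUID.randomUUID().toString();
        queue.add(new Job(id, description));
        return id;
    }

    // Resource side: request a job (blocks until one is available).
    Job requestJob() throws InterruptedException {
        return queue.take();
    }

    // Resource side: deliver the result of a completed job.
    void deliverResult(String jobId, String result) {
        results.put(jobId, result);
    }

    // User side: retrieve the result, or null if not finished yet.
    String retrieveResult(String jobId) {
        return results.get(jobId);
    }
}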
4.1.1 Considering the Simple Model

The simple model, of course, is quite error-prone, as the single Grid machine becomes both a single point of failure and a bottleneck, which is not acceptable. The obvious solution is to add more Grid machines which can act as backups for each other.

4.2 The Full MiG Model

The obvious flaw in using the client-server model is that achieving robustness is inherently hard in a centralized server system, where potential faults include:

- Crashed processes
- Crashed computers
- Segmented networks
- Scalability issues
To function correctly in the presence of errors, including the above, redundancy is needed. The desired level of redundancy is a subject for further investigation, but should probably be made dynamic to match the requirements of different systems. To address the performance issues, Grid itself must be distributed so that users can contact a local Grid server; the workload will thus be distributed through the physical distribution of users. Once a job arrives at a Grid server, the server must ensure that the job is "deposited" at a number of other servers, according to the current replication rate. The user should not receive an acknowledgement of submission before the job has been correctly received and stored at the required number of servers. Once a resource has completed a job, the resource is expected to deliver the result. If the client has not provided a location for placing the result, the resource can still insist on uploading it. To facilitate this, the Grid should also host storage to hold results, and user input files if a resource cannot be allocated at the time the client submits a job.
To facilitate payment for resources and storage, a banking system should be implemented. To allow inter-organization resource exchange, the banking system should support multiple banks. Dynamic price negotiation for the execution of a job is a very attractive component that is currently a research topic. Supporting price negotiations in a system such as MiG, where no central knowledge is available, is an unsolved problem that must be addressed in the project. Likewise, scheduling in a system with no central coordination is very hard.
Figure 3. The full MiG model
4.2.1 Considering the Full Model

One topic for further investigation is: how do we schedule on multiple Grid servers? In principle we would prefer complete fairness, so that the order in which jobs are executed does not depend on where they are submitted, i.e. on which MiG node. Such full coordination between all nodes in MiG for each job submission is not realistic, since it would limit scalability; thus a model that allows scalability while introducing some level of load balancing and fairness will have to be invented.

4.3 MiG Components

4.3.1 Storage in MiG

One difficulty that users report when using Grid is file access. Files that are used by Grid jobs must be explicitly uploaded to a Grid storage element, and result files must be downloaded equally explicitly. On the other hand, it is a well-known fact that the expenses associated with a professional backup strategy often prohibit smaller companies from implementing such programs, relying instead on individual users to do the backup - a strategy that naturally results in a large loss of valuables annually. Some interesting statistics include[18]:

- 80% of all data is held on PCs (Source: IDC)
- 70% of companies go out of business after a major data loss (Source: DTI)
- 32% of data loss is due to user error (Source: Gartner Group)
- 10% of laptops are stolen annually (Source: Gartner Group)
- 15% of laptops suffer hardware failure annually (Source: Gartner Group)
By using the Grid, we do not just gain access to a series of computational resources, but also to a large amount of storage. Exploitation of this storage is already known from peer-to-peer systems, but under "well-ordered" conditions it can be used for true Hierarchical Storage Management, HSM. When working with HSM, the individual PC or notebook only has a working copy of the data, which is then synchronized with a real dataset located on Grid. By introducing a Grid-based HSM system, we can offer solutions to two important issues at once. Firstly, Grid jobs can now refer directly to the dataset in the home catalog, thus eliminating the need for explicit up- and downloads of files between the PC and Grid. Secondly, and for many smaller companies much more importantly, we can offer a professionally driven storage system with professional backup solutions: either conventional backup systems or, more likely, simple replica-based backup. The latter is more likely because disks are rapidly becoming less expensive, and keeping all data in three copies is easily cheaper than a conventional backup system and the manpower to run it. A Grid-based HSM system will allow small companies to outsource the service, while medium and large companies can choose either to outsource or to implement a Grid HSM in-house. Thus, by introducing Grid-based HSM, Grid can offer real value to companies that are not limited by computational power, and these companies will thus be "Grid integrated" when Grid becomes the de facto IT infrastructure.
Figure 4. MiG Storage support
4.3.2 Scheduling

Scheduling in Grid is currently done at submission time, and usually a scheduled task is submitted to a system where another level of scheduling takes place. In effect, the scheduling of a job provides neither fairness for users nor optimal utilization of the resources that are connected to the Grid; the current scheduling should probably just be called job placement. Furthermore, the current model has a built-in race condition, since the scheduler inquires all resources and submits to the one with the lowest time-to-execute. If two or more jobs are submitted at the same time, they will submit to the same resource, but only one will get the expected timeslot.

The MiG model makes scheduling for fairness much simpler, as the local scheduling comes before the Grid scheduling in the proposed model. Scheduling for the best possible resource utilization is much harder and of much more value. The problem may be described as: given the arrival of an available resource and an existing set of waiting jobs, which job should be chosen for the newly arrived resource so that the global utilization will be as high as possible? The above is the common case, where jobs are more frequent than resources; in the rare case that resources are more abundant than jobs, the same problem arises on the arrival of a job.

When scheduling a job, future arrivals of resources are generally not known, i.e., we are dealing with an on-line scheduling problem. On-line scheduling is an active research area, initiated as early as 1966[1] and continued in hundreds of papers; see [2] and [3] for a
survey. This problem, however, differs from the on-line scheduling problems investigated previously in that, in the common case, it is the resources, not the jobs, that arrive over time. The problem also has some similarity to on-line variable-sized bin packing [4][5][6], but again with a twist that has not been considered before: the bins, not the items to be packed, arrive on-line.

4.3.3 Security and Secrecy

In Grid, security is inherently important, and the MiG system must be at least as secure as the existing systems. The simple protocols and the minimal software base on the resources make this goal easy to achieve, but the mechanisms for security must still be investigated. Secrecy is much harder and is currently not addressed in Grid. Privacy will go a long way towards achieving secrecy, but other issues are also interesting topics of research. E.g., if a data file is considered valuable, say a genomic data sequence, how can we keep the contents of that file secret from the owner of the resource? In other words, can MiG provide means of accessing encrypted files without asking the user to add decryption support to his application?
Figure 5. Anonymity and security model
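The anonymity side of Figure 5 amounts to an identifier translation performed inside the Grid: the Grid knows the user's identity (UID) but hands the resource only an opaque session identifier (SID). The following is a hypothetical sketch of that mapping; the class and method names are invented for illustration and are not MiG's actual API.

import java.util.*;

// The Grid mints a fresh SID per job so that the resource never
// learns which user submitted the work; only the Grid can map back.
class AnonymityBroker {
    private final Map<String, String> sidToUid = new HashMap<>();

    // Called when a job is scheduled: mint a fresh SID for the resource.
    String createSessionFor(String uid) {
        String sid = UUID.randomUUID().toString();
        sidToUid.put(sid, uid);
        return sid; // the resource sees only this value
    }

    // Called when a result comes back tagged with the SID.
    String userFor(String sid) {
        return sidToUid.get(sid);
    }
}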
4.3.4 Fault-Tolerance

In a full Grid system, errors can occur at many levels, and failures must be tolerated on MiG nodes, resources, network connections and user jobs. Any single instance of these errors must be transparent to the user. More complex errors, or combinations of the simple errors, cannot be fully hidden from the users; i.e. if a user is on a network that is segmented from the remaining internet, we can do nothing to address this. Achieving fault tolerance in a system such as MiG is merely a question of never losing information when a failure occurs, i.e. keeping redundant replicas of all information. Figure 6 shows how a submitted job is replicated when it is submitted.
Figure 6. Replicating a new job (steps: 1. Submit; 2. Replica 1; 3. Replica 2; 4. OK)
Recovering from a failure is then a simple matter of detecting the failure and restoring the required number of replicas, as shown in Figure 7, where the number of replicas is three.
Figure 7. Recovering from a failure (steps: 1. Failure detection; 2. Replica 1; 3. Replica 2)
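The replicate-on-submit and re-replicate-on-failure behavior of Figures 6 and 7 can be sketched as follows. This is purely illustrative; server identifiers and the replication API are invented, and MiG's real protocol is not specified at this level in the paper.

import java.util.*;

// Sketch: a job is acknowledged only once stored on `replicationRate`
// servers; when a server fails, every affected job is re-replicated.
class ReplicatedJobStore {
    private final int replicationRate;                   // e.g. 3
    private final List<String> liveServers;              // Grid servers
    private final Map<String, Set<String>> jobLocations = new HashMap<>();

    ReplicatedJobStore(int replicationRate, List<String> servers) {
        this.replicationRate = replicationRate;
        this.liveServers = new ArrayList<>(servers);
    }

    // Submission: only acknowledge ("OK") once the job is stored on
    // replicationRate distinct servers.
    boolean submit(String jobId) {
        Set<String> placed = new HashSet<>();
        for (String server : liveServers) {
            if (placed.size() == replicationRate) break;
            placed.add(server);                          // store replica
        }
        jobLocations.put(jobId, placed);
        return placed.size() == replicationRate;
    }

    // Recovery: when a server fails, restore the replica count for
    // every job that had a copy there.
    void onServerFailure(String failed) {
        liveServers.remove(failed);
        for (Set<String> locations : jobLocations.values()) {
            if (locations.remove(failed)) {
                for (String server : liveServers) {
                    if (locations.size() == replicationRate) break;
                    locations.add(server);               // new replica
                }
            }
        }
    }
}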
4.3.5 Load Balancing and Economics

Load balancing in distributed systems is an interesting and well-investigated issue[13]. However, load balancing for, potentially, millions of resources, while maintaining a well-defined measure of fairness, is still an unsolved issue. Adding economics to the equation actually makes this easier. Since MiG should support a market-oriented economy, where the price for executing a job is based on demand and supply, a simple notion of fairness is introduced: resources should optimize their income, while users should minimize their expenses. In case there are more jobs than resources, which is the common case, the next job to execute is the job that is willing to pay most for the available resource. In case two or more jobs bid the same for the resource, the oldest of the bidders is chosen. In the rare case that there are more resources offering their services than there are jobs asking for a resource, the next available job is sent to the resource that will sell its services cheapest. In case more resources bid at the same price, the one that has been waiting longest wins the bid (a sketch of this selection rule is given below, after Sect. 4.3.6).

4.3.6 Shared Data-Structures for MiG

When people with no knowledge of Grid computing are first introduced to Grid, they often mistake it for meta-computing and expect the Grid to behave as one large parallel processor rather than a large network of resources. This misunderstanding is quite natural, since such a Grid computing model would be highly desirable for some applications; of course, most parallel applications cannot make use of such an unbalanced virtual parallel processor. However, to support the applications that can make use of Grid as a meta-computing system, MiG will provide support for shared data structures which are hosted on Grid.
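The bidding rule of Sect. 4.3.5 reduces to a simple ordered selection. The sketch below is our illustration of that rule only (types and names invented), not a fragment of MiG:

import java.util.*;

// A bid is either a job's offered price or a resource's asking price.
record Bid(String id, int price, long submitTimeMillis) {}

class Market {
    // Common case: a resource becomes available, many jobs wait.
    // Pick the job bidding the most; among equal bids, the oldest.
    static Bid selectJob(List<Bid> waitingJobs) {
        return waitingJobs.stream()
            .max(Comparator.comparingInt(Bid::price)
                 .thenComparing(Comparator.comparingLong(
                     Bid::submitTimeMillis).reversed()))
            .orElse(null);
    }

    // Rare case: a job arrives, many resources are idle.
    // Pick the cheapest offer; among equal prices, the longest-waiting.
    static Bid selectResource(List<Bid> idleResources) {
        return idleResources.stream()
            .min(Comparator.comparingInt(Bid::price)
                 .thenComparingLong(Bid::submitTimeMillis))
            .orElse(null);
    }
}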
4.3.7 Accounting/Price-Negotiations

Grid becomes really interesting once users can purchase resources on Grid, thus transforming Grid from a resource-sharing tool into a market place. To support this vision, MiG will not only do accounting but also support a job bourse, where the price for a task can be dynamically negotiated between a job and a set of resources. Such dynamic price-setting is also a known subject, but combining it with load balancing and fairness in a truly distributed system has not been investigated.

4.3.8 User-Defined Scheduling

An advanced extension of the on-line scheduling problem is the subtasking problem, where a job may be divided into many subjobs. If the subtasks have a natural granularity, the task is trivial and known solutions exist, including functioning systems such as SETI@Home. If, on the other hand, a subtask must be selected that solves the largest possible problem on the given resource, the problem becomes very hard, and no system provides means for this today.
Figure 8. Dynamic sub-scheduling
When comparing with on-line bin packing, this variant of the problem has one further twist: the size of an item (a subtask) may depend on which other items are packed in the same bin, since the data needed by different subtasks may overlap. MiG will seek to develop a model where a job can be accompanied by a function for efficient subtasking. The demonstration application for this will be a new version of the Grid BLAST application, which is used in bio-science for genome comparisons. The efficiency of BLAST depends on two parameters: input bandwidth and available memory. We are currently developing a dynamic subtasking algorithm that creates subjobs fitted to resources as they become available.

4.3.9 Graphics Rendering on Grid

Currently, Grid is used exclusively for batch job processing. However, for Grid truly to meet the original goal of "computing from a plug in the wall", graphics and interactivity are needed. In this respect MiG makes things more complex than the existing middlewares, since MiG insists on maintaining anonymity; e.g. we insist that a process can render output to a screen buffer whose address it cannot know. The solution to this problem is similar to the storage model. A 'per-user' frame buffer is hosted in the MiG infrastructure, and resources can render to this anonymous region. Users, on the other hand, can choose to import this buffer into their own frame buffer and thus observe the output from their processes without the hosts of these processes knowing the identity of the receiver. The approach for anonymous rendering in MiG is sketched in Figure 9.
Figure 9. Anonymous graphics rendering in MiG
4.4 Status

At the time of writing, MiG is fully functional from a usage perspective, with all the features described in Section 4.1, the simple model; further details on the implementation can be found in [14]. The storage model described in Section 4.3.1 is also fully functional and is described in some detail in [15].

5. Conclusions

The purpose of this paper is to motivate the work on a new Grid middleware, the Minimum intrusion Grid, MiG. MiG is motivated by a set of claimed weaknesses of the existing Grid middleware distributions, and by a desire to develop a model for Grid computing that is truly minimum intrusion. The proposed model will provide all the features known in today's Grid systems, and a few more, while lowering the requirements for a user to simply having an X.509 certificate, and for a resource to having a certificate and creating a Grid user who can access the resource through SSH. While MiG is still at a very initial stage, users can already submit jobs and retrieve their results, while maintaining complete anonymity from the resource that executes the job.

References
[1] R. L. Graham, Bounds for Certain Multiprocessing Anomalies, Bell Systems Technical Journal, vol. 45, pp. 1563-1581, 1966.
[2] Y. Azar, On-Line Load Balancing, in: Online Algorithms: The State of the Art, A. Fiat and G. J. Woeginger (eds.), Lecture Notes in Computer Science, vol. 1442, Springer-Verlag, 1998.
[3] J. Sgall, On-Line Scheduling, in: Online Algorithms: The State of the Art, A. Fiat and G. J. Woeginger (eds.), Lecture Notes in Computer Science, vol. 1442, Springer-Verlag, 1998.
[4] J. Csirik, An On-Line Algorithm for Variable-Sized Bin Packing, Acta Informatica, 26, pp. 697-709, 1989.
[5] J. Csirik and G. Woeginger, On-Line Packing and Covering Problems, in: Online Algorithms: The State of the Art, A. Fiat and G. J. Woeginger (eds.), Lecture Notes in Computer Science, vol. 1442, Springer-Verlag, 1998.
[6] L. Epstein and L. M. Favrholdt, On-Line Maximizing the Number of Items Packed in Variable-Sized Bins, Eighth Annual International Computing and Combinatorics Conference (to appear), 2002.
[7] I. Foster, The Grid: A New Infrastructure for 21st Century Science, Physics Today, 55(2):42-47, 2002.
[8] I. Foster and C. Kesselman, The Globus Project: A Status Report, Proc. IPPS/SPDP '98 Heterogeneous Computing Workshop, pp. 4-18, 1998.
[9] P. Eerola et al., Building a Production Grid in Scandinavia, IEEE Internet Computing, vol. 7, issue 4, pp. 27-35, 2003.
[10] R. Fielding et al., RFC2616: Hypertext Transfer Protocol - HTTP/1.1, The Internet Society, 1999, http://www.rfc.net/rfc2616.html.
[11] T. Ylonen, SSH - Secure Login Connections over the Internet, Proceedings of the 6th Security Symposium, p. 37, 1996.
[12] S. F. Altschul et al., Basic Local Alignment Search Tool, J. Mol. Biol. 215:403-410, 1990.
[13] G. Barish and K. Obraczka, World Wide Web Caching: Trends and Techniques, IEEE Communications Magazine, Internet Technology Series, May 2000.
[14] H. H. Karlsen and B. Vinter, Minimum intrusion Grid - The Simple Model, in Proc. of ETNGRID 2005 (to appear).
[15] R. Andersen and B. Vinter, Transparent Remote File Access in the Minimum Intrusion Grid, in Proc. of ETNGRID 2005 (to appear).
[16] The Community Scheduler Framework, http://csf.metascheduler.org, 2005.
[17] The Nordic Data Grid Facility (NDGF), www.ndgf.org, 2003.
[18] Data Clinic, http://www.dataclinic.co.uk/data-backup.htm.
Verification of JCSP Programs

Vladimir KLEBANOV,a Philipp RÜMMER,b,1 Steffen SCHLAGER c and Peter H. SCHMITT c
a University of Koblenz-Landau, Institute for Computer Science, D-56070 Koblenz, Germany
b Chalmers University of Technology, Dept. of Computer Science and Engineering, SE-41296 Gothenburg, Sweden
c Universität Karlsruhe, Institute for Theoretical Computer Science, D-76128 Karlsruhe, Germany

Abstract. We describe the first proof system for concurrent programs based on Communicating Sequential Processes for Java (JCSP). The system extends a complete calculus for the JavaCard Dynamic Logic with support for JCSP, which is modeled in terms of the CSP process algebra. Together with a novel efficient calculus for CSP, a rule system is obtained that enables JCSP programs to be executed symbolically and to be checked against temporal properties. The proof system has been implemented within the KeY tool and is publicly available.

Keywords. Program verification, concurrency, Java, CSP, JCSP
1. Introduction

Hoare's CSP (Communicating Sequential Processes) [10,16,18] is a language for modeling and verifying concurrent systems. CSP has a precise and compositional semantics. On the other hand, the semantics of concurrency in Java [8] (threads) is only given in natural language. Synchronization is based on monitors, and data transfer is primarily performed through shared memory; it has turned out that engineering complex programs using these concepts directly is very difficult and error-prone. In addition, verification of such programs is extremely difficult, and existing approaches do not scale up well.

The JCSP approach [13,20] tries to overcome the difficulties inherent to Java threads. It defines a Java library that offers functions corresponding to the operators of CSP. Using solely JCSP library functions for concurrency and communication (i.e., no explicit creation of threads and no communication via shared memory) allows the (concurrent) behavior of the Java program to be verified on the CSP level, instead of dealing with monitors on the Java level. Since the use of JCSP only makes sense with a strict discipline of not resorting directly to Java concurrency features, this should not be a severe restriction.

The paper is organized as follows. In Sect. 2 we give an overview of the architecture of our verification calculus, which is presented in detail in Sect. 4-6. In Sect. 3 we present a JCSP implementation which evaluates polynomials and serves as a running example. The verification of some properties of the running example is described in Sect. 7. Finally, in Sect. 8 we relate our verification system to existing approaches, and we draw conclusions in Sect. 9.

1 Correspondence to: Philipp Rümmer, Dept. of Computer Science and Engineering, Chalmers University of Technology, 412-96 Gothenburg, Sweden. Tel.: +46 (0)31 772 1028; Fax: +46 (0)31 165655; E-mail: [email protected].
Figure 1. Architecture of the verification calculus: (1) JavaCard calculus, (2) CSP model of JCSP, (3) CSP calculus, (4) calculus for modal logic correctness assertions.
2. Architecture of Verification Calculus

Our calculus allows deriving the truth of temporal correctness assertions of the kind S : φ, where S is a process term and φ a formula of some modal logic. The intended semantics is that the process described by S has the property φ or, in more technical terms, that S describes the Kripke structure in which φ is evaluated. Our approach is not limited to a particular modal logic. E.g., in the implementation we use an extended version of HML enriched with a least-fixed-point operator, which allows us to express the liveness property we proved for the running example presented in Sect. 3. However, in order to explain our approach in this paper, we restrict ourselves to plain Hennessy-Milner Logic (HML) [9] because of its simplicity.

An important part of our proof system is the calculus for the program logic JavaCard Dynamic Logic (JavaCardDL) that is developed in the KeY project [1]. JavaCard [19,5] roughly corresponds to the Java programming language omitting threads and is mainly used for programming smartcards (JavaCard lacks some further features of Java, e.g. floating-point numbers and support for graphical user interfaces, but also offers support for transactions, which is not available in Java). The KeY tool is a system for deductive verification of JavaCard programs, i.e., of Java programs without threads.

Fig. 1 shows the architecture of the verification system, which consists of four components. These correspond to the four stages of the main verification loop:

1. The first stage symbolically executes JavaCard statements until a JCSP library call is reached. This is performed by the standard KeY calculus [1]. Due to our assumptions, which allow only explicit inter-process communication, there is no interference between sequential process code. The sequential calculus from the KeY tool can thus be taken without modification. From a CSP point of view, pieces of sequential Java code can be seen as processes that produce only internal events.
2. The second part, operating in parallel with (1), replaces the JCSP library calls within the program by their CSP models (see Sect. 4).
3. Stage 3 is a rewriting system which transforms the process term into a normal form from which the first steps of the process can easily be deduced (see Sect. 5).
4. Finally, in stage 4, temporal correctness assertions are evaluated with respect to the possible initial behaviors of the process term (see Sect. 6).

As an important aspect concerning interactive proving, a translation of the considered JCSP program as a whole to a different formalism never takes place. Instead, each of the components works as "lazily" as possible, and all layers play together in an interleaved manner.

3. Verification Example

In order to illustrate the programs that can be handled by our verification system, we start by describing a simple application: an implementation of Horner's rule [12] in the JCSP framework. The program only makes use of some of the basic JCSP classes; other functionality, like the processing of integer streams, which is also provided by JCSP, is re-implemented to obtain a self-contained system.
import jcsp.lang.*;

abstract class BinGate implements CSProcess {
    protected ChannelInputInt input0, input1;
    protected ChannelOutputInt output;

    public BinGate ( ChannelInputInt input0, ChannelInputInt input1,
                     ChannelOutputInt output ) {
        this.input0 = input0;
        this.input1 = input1;
        this.output = output;
    }
}

class Adder extends BinGate {
    public Adder ( ChannelInputInt input0, ChannelInputInt input1,
                   ChannelOutputInt output ) {
        super ( input0, input1, output );
    }
    public void run () {
        while ( true )
            output.write ( input0.read () + input1.read () );
    }
}

class Multiplier extends BinGate {
    public Multiplier ( ChannelInputInt input0, ChannelInputInt input1,
                        ChannelOutputInt output ) {
        super ( input0, input1, output );
    }
    public void run () {
        while ( true )
            output.write ( input0.read () * input1.read () );
    }
}

class Prefix implements CSProcess {
    private int value, num;
    private ChannelInputInt input;
    private ChannelOutputInt output;

    public Prefix ( int value, int num,
                    ChannelInputInt input, ChannelOutputInt output ) {
        this.value = value;
        this.num = num;
        this.input = input;
        this.output = output;
    }
    public void run () {
        while ( num-- != 0 )
            output.write ( value );
        while ( true )
            output.write ( input.read () );
    }
}

class Propagator implements CSProcess {
    private int delay, num;
    private ChannelInputInt input;
    private ChannelOutputInt output;

    public Propagator ( int delay, int num,
                        ChannelInputInt input, ChannelOutputInt output ) {
        this.delay = delay;
        this.num = num;
        this.input = input;
        this.output = output;
    }
    public void run () {
        while ( delay-- != 0 )
            output.write ( input.read () );
        while ( num-- != 0 )
            CSProcessRaiseEventInt ( input.read () );
    }
}

class Repeat implements CSProcess {
    private int[] values;
    private ChannelOutputInt output;

    public Repeat ( int[] values, ChannelOutputInt output ) {
        this.values = values;
        this.output = output;
    }
    public void run () {
        int i = 0;
        while ( true ) {
            output.write ( values[i] );
            i = ( i + 1 ) % values.length;
        }
    }
}

public class PolyEval implements CSProcess {
    private int[] values;
    private int degree, num;
    private ChannelInputInt coeff;

    public PolyEval ( int[] values, int degree, int num,
                      ChannelInputInt coeff ) {
        this.values = values;
        this.num = num;
        this.degree = degree;
        this.coeff = coeff;
    }
    public void run () {
        One2OneChannelInt[] c = One2OneChannelInt.create ( 5 );
        new Parallel ( new CSProcess[] {
            new Repeat     ( values, c[0] ),
            new Prefix     ( 0, num, c[4], c[1] ),
            new Adder      ( c[1], coeff, c[2] ),
            new Propagator ( degree*num, num, c[2], c[3] ),
            new Multiplier ( c[0], c[3], c[4] )
        } ).run ();
    }
}
Figure 2. The source code of the verified system for evaluating polynomials (JCSP library classes, interfaces, and method calls are in bold). Apart from the special call CSProcessRaiseEventInt, all classes can directly be compiled using the JCSP library [20] and a recent version of Java. The statement CSProcessRaiseEventInt(v) makes the symbolic JavaCard interpreter implemented in KeY raise an observable CSP event jcspIntEvent(v), but does not have any further effects. For actually executing the network, one can for instance replace the statement with System.out.println(v).
The evaluation of polynomials is carried out by a network of 5 gates performing basic operations on streams of integers, which are connected using synchronous JCSP channels. The code of the complete system is given in Fig. 2 and introduces the following classes:

Adder, Multiplier: Processes that compute point-wise sums and products of integer streams. In contrast to similar classes that are provided by JCSP, pairs of input values are read sequentially and not in parallel, which makes the code a lot shorter and does not affect the functionality of the network in the present setting.

Prefix: A process that first outputs a fixed integer value num times, and afterwards copies its input stream to the output.

Propagator: A process that copies the first delay input values to its output, and that for the subsequent num input values vi raises an observable event jcspIntEvent(vi). We use such "logging" events to make the result of the computation visible to the formula φ of a correctness assertion S : φ.

Repeat: A process that creates a periodical stream of integers by repeatedly writing the contents of an array to its output.

PolyEval: The complete network that evaluates a number of polynomials in parallel. The computation result is made observable by an instance of Propagator.

In principle, the cyclic network can be used to evaluate an arbitrary number of polynomials pi(x) = ci,n·xⁿ + · · · + ci,0 (for i = 1, . . . , k) of the same degree n in parallel. To this end, the input vector (x1, . . . , xk) lists the positions that are examined, and the network is fed the coefficients of the polynomials through the stream coeff = (c1,n, c2,n, . . . , c1,n−1, c2,n−1, . . .). The gates Prefix and Propagator have to be set up with the correct number k and degree n of the polynomials.

For the purpose of this paper, however, we restrict the capacity of the network by choosing its channels to be zero-buffered. As each of the nodes is only able to store one intermediate result at a time, the system set up like this is bound to lock up as soon as more than three polynomials are evaluated at the same time. This can be observed both by actually executing the Java program and by symbolically simulating the network using our system. Symbolic execution with up to three polynomials is described in Sect. 7.

3.1. Verified Property of the System

When evaluating polynomials (p1, . . . , pk) at points (x1, . . . , xk), the network is expected to produce, after a finite number of (hidden) execution steps, a sequence of distinguished events jcspIntEvent(p1(x1)), . . . , jcspIntEvent(pk(xk)). In terms of temporal logic, this is captured by the requirement that on every computation path this sequence eventually occurs, preceded or interleaved only with unobservable steps. The temporal formula describing this behavior is subsequently denoted by eventually(p1(x1), . . . , pk(xk)) and can for instance be expressed in the modal μ-calculus [4]. (At this point HML is not expressive enough, because the number of computation steps is unknown; we have therefore enriched HML with a least fixed-point operator borrowed from the modal μ-calculus. This extension does not require induction in the calculus.)

Verification of this particular kind of property is possible without inductive proof arguments for a fixed number of polynomials of fixed degree; for handling polynomials of unbounded degree, which lead to an unbounded runtime of the network, induction would be necessary. Since we have not yet investigated the usage of induction techniques (as in [6]) in combination with our verification system, we stick to the simpler scenario and only consider quadratic polynomials in this document.
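For concreteness, one possible μ-calculus rendering of such a liveness property is shown below. This is our own illustrative formulation, not the paper's definition; τ is assumed to stand for an arbitrary unobservable step. For a single value v1:

eventually(v1) = μX. ( ⟨jcspIntEvent(v1)⟩ true  ∨  ⟨τ⟩ X )

For several values v1, . . . , vk the diamonds for the events jcspIntEvent(v1), . . . , jcspIntEvent(vk) are chained in order, with a least fixed point before each diamond again permitting interleaved τ-steps.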
To set up the verification problem, the coefficients of the polynomials are stored in a buffered JCSP channel, and the network is created with the correct parameters. The resulting program is judged by the temporal formula, which for the evaluation of two polynomials in parallel leads to the following proof obligation:
T( jcsp.lang.One2OneChannelInt coeff =
       new jcsp.lang.One2OneChannelInt ( new jcsp.util.ints.BufferInt ( 10 ) );
   coeff.write(c12); coeff.write(c22);
   coeff.write(c11); coeff.write(c21);
   coeff.write(c10); coeff.write(c20);
   new PolyEval ( new int[] { x1, x2 }, 2, 2, coeff ).run ();
 )
 : eventually(c12 · x1² + c11 · x1 + c10, c22 · x2² + c21 · x2 + c20)     (1)

4. CSP Model of JCSP
Process algebras like CSP allow processes to be assembled using algebraic connectives, for instance using interleaving composition ||| (we assume familiarity with the CSP notation). JCSP follows this concept roughly, but offers communication means (particularly channels) that only remotely correspond to the operators of CSP. For investigating the behavior of JCSP programs we need a more accurate modeling of the JCSP semantics, which we achieve by a (non-trivial) translation of JCSP primitives into CSP. This approach follows ideas from [13], though we are not aiming at a complete replication of multi-threaded Java but concentrate on JCSP.

The usage of its own interaction features is, for practical reasons, not strictly enforced by JCSP, and programs can be written in an "unclean" manner and circumvent JCSP by using shared memory or similar native Java functionality. Since we believe that such programs are not in line with the principles of JCSP, we regard them as ill-shaped. The following models of JCSP operations are simplified insofar as they do not predict the correct behavior of JCSP and Java for ill-shaped programs. Using such a simplified semantics for verification is beneficial because it shortens proofs, but in practice it has to be complemented with checks that reject ill-shaped programs right from the start. Though we have not yet investigated how to realize such tests, it seems possible to reach sufficient precision by employing static analysis or type systems to this end (in a completely automated manner).

Our principal idea for modeling JCSP programs is to construct a CSP process term in which sequential Java code can turn up as subterms (wrapped in an operator T(·)). JCSP components (such as channels) used to set up the network determine the way in which the sequential Java parts are connected. To illustrate this, the process term representing the scenario of two sequential JCSP processes (implemented as Java programs α, β) that communicate through a JCSP channel is:

idc : CHAN |[ idc.Σ ]| T(α) ||| T(β) \ idc.Σ     (2)

CHAN is a process modeling the JCSP channel that interfaces with the Java processes T(α), T(β) through messages of the alphabet Σ. To distinguish different channels, messages are tagged with an identifier idc.

4.1. JCSP Processes with Disjoint Memory and their Interfaces

The basis for assembling JCSP systems is to give the terms T(α), which wrap Java programs, a semantics as processes. We assume that such a process can only interact with its environment through the use of JCSP operations; this immediately rules out shared-memory communication, or any kind of communication that is not modeled explicitly through observable events raised by T(α). For defining the behavior of T(α), we equip Java with an operational semantics in which each execution step can (1) transform α into a continuation α′, (2) change the memory state of the process T(α), or (3) make T(α) engage in an event a that is observable by the rest of the system (the three possible outcomes do not exclude each other). Designing transition rules for symbolically executing Java code based on this semantics, we were able to start with
the operational semantics of sequential Java that is implemented in the KeY system, which essentially means that we only had to add rules for item (3). Concerning (1) and (2), the behavior of a program follows [19,8].

In JavaCardDL, memory contents are represented during the symbolic execution of a program using so-called updates, which are lists of assignments to variables, attributes and arrays. Terms and formulas can be preceded with updates in order to construct the memory contents that are in effect. With updates, for instance, the transition rule for side-effect-free assignments is

T({x=e; ...})  ⇒  {x := e}T({...})

The KeY system covers the complete JavaCard language and large parts of Java in terms of such transition rules.

Observable events are raised by a process T(α) only when JCSP operations (like channel accesses c.write(...)) are executed. The protocol that is followed for communication through a channel is described in Sect. 4.3; a simpler operation is the logging command that is used in Sect. 3 to make results visible. Such operations are handled with additional rules that insert CSP connectives as necessary:

T({CSProcessRaiseEventInt(v); ...})  ⇒  jcspIntEvent(v) → T({...})

4.2. Class Parallel

The most basic way of assembling processes in JCSP is the class Parallel for parallel composition. Modeling this feature in CSP is rather simple (assuming disjoint memory for processes) and boils down to inserting the interleaving operator |||. The magic operation that has to be trapped is Parallel.run, because this is the place where new processes are actually spawned. For an object parallel that is set up with children processes p1, . . . , pn, the effect of the run-method can be modeled in CSP as follows:

T({parallel.run(); ...})  ⇒  (T({p1.run();}) ||| · · · ||| T({pn.run();})) ; T({...})

Sequential composition ; is used to make the parent process continue its execution after the termination of the children. Because memory contents are stored in updates in front of terms T(α), each of the processes that are created will inherit the memory of the parent process, but will subsequently operate on a copy of that memory: write accesses of the programs pi are not visible to other processes.

4.3. Channels

We model the different kinds of channels that are provided by the JCSP library (they differ in the way data is buffered and have different access arbitration) following ideas from [13]. As already shown in Eq. (2), the behavior of a channel is simulated by an explicit routing process CHAN that is attached to a Java process as a slave. As a starting point, we adopted the CSP model from [13] of a zero-buffered and synchronous channel (Fig. 3):

LEFT = write ? msg → transmit ! msg → ack → LEFT
RIGHT = ready → transmit ? msg → read ! msg → RIGHT
ONE2ONECHANNEL = LEFT |[ transmit.Σ ]| RIGHT \ transmit.Σ     (3)

Figure 3. Model of a zero-buffered channel
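To relate this CSP model to executable Java, a zero-buffered synchronous integer channel can be sketched as below. This is our own illustration of the handshake expressed by the model (write, transmit, ack), not the actual JCSP One2OneChannelInt implementation.

// Minimal zero-buffered synchronous integer channel, mirroring the
// handshake of the CSP model above. Illustrative only.
class SyncChannelInt {
    private int slot;
    private boolean full = false;   // a "transmit" is pending
    private boolean taken = true;   // the "ack" has been received

    public synchronized void write(int msg) throws InterruptedException {
        while (full) wait();        // previous transmit not yet read
        slot = msg;
        full = true;
        taken = false;
        notifyAll();                // offer "transmit ! msg"
        while (!taken) wait();      // block until "ack"
    }

    public synchronized int read() throws InterruptedException {
        while (!full) wait();       // "ready": wait for a transmit
        int msg = slot;             // "transmit ? msg"
        full = false;
        taken = true;               // send "ack"
        notifyAll();
        return msg;
    }
}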
Our implementation contains further channel models, for instance an extended version of the model shown here that also supports the JCSP alternation operator. Channels with bounded buffering (as used in the example in Fig. 2) can be handled by the system as well. However, a complete set of CSP characterizations for the JCSP channels, together with a systematic verification that the models faithfully represent the actual JCSP library, is still to be developed.

The JCSP operations for creating and accessing channels are again realized by translating them to CSP connectives. Channels are created by allocating a new channel identifier idc (which in our implementation is just the reference to the created object of class One2OneChannel) and by spawning the appropriate routing process:

T({c=new One2OneChannel(); ...})  ⇒  idc : ONE2ONECHANNEL |[ idc.Σ ]| {c := idc}T({...}) \ idc.Σ

The Java process can then interact with the channel according to a certain protocol, which for the zero-buffered channel looks as follows:

T({c.write(o);...})  ⇒  idc.write ! msgo → idc.ack → T({...})
T({o=c.read();...})  ⇒  idc.ready → idc.read ? msgo → {o := . . .}T({...})

Because of the disjoint-memory assumption it is necessary to encode the complete information that messages contain as some term msgo, which we have so far implemented for integers (in combination with the JCSP channels for integers that are used, for instance, in Fig. 2). Treating arbitrary objects is possible through manipulations of updates and will be added to the proof system in a later version.

5. CSP calculus

The gist of evaluating HML assertions for processes is that certain events can or must be fired in a given state. It is thus crucial to obtain, for the process term at hand, the summary of events that it can fire in the next step, together with the corresponding process continuations. This goal is usually achieved by rewriting the process term into a certain normal form, from which this information can be syntactically gleaned. When working with a naive total-order semantics, a typical exploration (rewriting) of a process term (here the interleaving of two processes) looks like this:

a → P ||| b → Q
  ⇒  (a → (P ||| b → Q))  □  (b → (a → P ||| Q))
The subterms P and Q are duplicated, and in general the term size increases exponentially. On the other hand, Petri nets have been used in the past to give processes a partial-order semantics (also called step semantics) [3]. The net approach avoids a total ordering of independent events, which helps contain the state explosion. The representation of a transition system as a net graph is also usually more compact than as a tree. Following this tradition, we combine Petri nets and conventional process terms into one formalism (we call it netCSP), which allows succinct reasoning. We model CSP events as net transitions, and the evolution of the net marking corresponds to the derivation of the adjacent processes that are reached when a process performs activated execution steps.
netCSP terms are built up incrementally from conventional CSP process terms by the rewriting system outlined in the following. The incremental, or "lazy", manner of exploration makes it possible to have Java programs inside processes, since finite nets are not Turing-complete. It is the first (to our knowledge) rewriting system for efficiently creating combined process representations from conventional ones, and for exploring their behavior.

5.1. Monotonic Petri nets
Petri nets (see [15] for an introduction) are a formal and graphically appealing model long used for modeling non-sequential processes. To model CSP process behavior in a faithful and efficient way we introduce a slightly modified version of Petri nets, which we call monotonic Petri nets. Every place in such a net is in one of the three following states: empty (E), marked (M), or dead (D).

Figure 4. Life cycle of a place marking (E → M → D)

A transition t of a monotonic Petri net is called enabled for a marking M (a mapping from places to states) if all its input places in(t) are marked and all its output places out(t) are empty:

M(in(t)) ⊆ {M} ∧ M(out(t)) ⊆ {E}

An enabled transition t can fire, leading to a new marking, which for a place p is
Mnew(p) :=  D      if p ∈ in(t)
            M      if p ∈ out(t)
            M(p)   otherwise
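Sketched in Java (our illustration, not part of the KeY implementation), the enabledness check and the firing rule read as follows:

import java.util.*;

// Minimal sketch of a monotonic Petri net: places move through
// E -> M -> D only; a transition is enabled when all input places are
// marked and all output places are empty.
class MonotonicNet {
    enum State { E, M, D }

    final Map<String, State> marking = new HashMap<>();

    boolean enabled(Set<String> in, Set<String> out) {
        return in.stream().allMatch(p -> marking.get(p) == State.M)
            && out.stream().allMatch(p -> marking.get(p) == State.E);
    }

    // Firing kills the input places and marks the output places;
    // all other places keep their state.
    void fire(Set<String> in, Set<String> out) {
        if (!enabled(in, out)) throw new IllegalStateException("not enabled");
        in.forEach(p -> marking.put(p, State.D));
        out.forEach(p -> marking.put(p, State.M));
    }
}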
Thus, a marking of a place can only evolve in the monotonic progression depicted in Figure 4. This allows far-reaching estimations of the behavior of the net (e.g. places depending on dead places are blocked forever). Another immediate and favorable consequence of the above net semantics is the fact that every non-isolated transition can fire at most once, just as any particular CSP event can only be raised once. Finally, since monotonic nets are easily translated to standard 1-safe Petri nets, all common analysis techniques are still available.

5.2. netCSP: Combining Nets and Process Terms

The combination of conventional process terms and Petri nets is described algebraically by enriching the set of usual CSP operators with the following four:

- i P (token consumption): this term attaches a CSP process P to the place i of the net. The execution of P is now causally dependent on i. If i is marked with E, then P is currently blocked. P is not blocked if i is marked with M; execution of P then consumes the token in i. If i is marked with D, then P is blocked forever (and can be removed). In place of a single place i, a set of places can turn up; in this case a token is consumed from every place.

- i a o (transition): this operator expresses that a CSP event a is raised by the term, whilst a causal-dependency token is consumed from place i and placed in place o. Again, sets of places can play the role of i and o.

- p[v] : P (causal state): this operator sets the marking of the place p in P to the value v (which is one of E, M, or D).

- P |[ L[X]R ]| Q: this construct is a "bookkeeping" version of the standard parallelism operator P |[ X ]| Q, see Section 5.3.4.
The new operators are initially introduced by the rewriting system, which transforms conventional CSP terms into the combined representation. This rewriting system is described in the following section.

5.3. Rewriting System for Exploring Process Behavior

5.3.1. The Desired Normal Form

The rewriting system presented in this section transforms a CSP or a netCSP term into the following normal form (together with an implied marking M):

i1 a1 o1 ||| · · · ||| in an on ||| R     (NF)

where the ik ak ok are enabled transitions and the remainder R is blocked w.r.t. M, i.e., cannot raise an event at the current stage. The latter condition can be checked by a simple syntactic criterion on M, due to the benign properties of monotonic nets described above. The rewriting system achieves the normal form (NF) by pulling transitions out of the scope of the leading operator and moving them towards the root of the term. Since terms are finite, this procedure is guaranteed to terminate.

Example 1. Rewriting the channel routing process ONE2ONECHANNEL defined in Sect. 4.3, Eq. (3), to normal form yields the following term (p and q are initially empty):

C = ready {p} ||| write ? msg {q} ||| R,
where R = ( {p} (transmit ! msg → · · ·) |[ transmit.Σ ]| {q} (transmit ? msg → · · ·) ) \ transmit.Σ
is currently blocked.
In graphical representation:

  [Net diagram: the transitions ready and write?msg with output places p and q, feeding the blocked remainder R.]

The first steps of the process C are thus either ready or write?msg.

5.3.2. Translating Events (Prefix Operator)

Events are modeled as transitions of the Petri net; the firing of a transition corresponds naturally to the process's engagement in an event (Figure 5: CSP events as net transitions). This transformation is captured by the following rule:

  a → P   ⇝   p[E] : ( (a → {p}) ||| {p} P ),   p new in P
In practice, the rewriting strategy would, sensibly, start applying this rule at the leftmost possible position in a term.

5.3.3. Translating the Choice Operator

The choice operator also lends itself to a natural representation in the Petri net process framework (Figure 6: nondeterministic choice). This is achieved by the following rule:

  P ⊓ Q   ⇝   p[M] : ( {p} P ||| {p} Q ),   p new in P and Q

Since the place p initially carries a single token, only one of the two operands can ever consume it and proceed; consuming the token resolves the choice.
5.3.4. Translating the Parallelism Operator

The behavior of the parallelism operator P |[ X ]| Q varies with the synchronization set X, from total synchronization of the two processes to interleaving (P ||| Q). Interleaving has a special place within this scale, as it introduces no dependencies between its operands; it is treated separately in the next section. Here, in contrast, we assume that the synchronization set X is not empty. For events included in X we identify "matching" transitions in both operands and "merge" them outside of the scope of the parallelism operator. Since moving transitions out of the scope loses vital information, some additional bookkeeping is necessary. This is achieved with two lists of already worked-off transitions ("buffers"), L and R, which are part of the extended operator |[ L[X]R ]|. At the beginning, our rewriting system replaces the parallelism operator by this variant with the buffers initially empty:
  P |[ X ]| Q   ⇝   P |[ ∅[X]∅ ]| Q
The main rewriting step then records every (synchronized) worked-off transition from an operand in the corresponding buffer:

  P |[ L[X]R ]| ( (i → a → o) ||| Q )   ⇝
      (i → a → o) ||| ( P |[ L[X]R ]| Q )     if a ∉ X
      U ||| ( P |[ L[X]R′ ]| Q )              if a ∈ X

where R′ := R ∪ {i → a → o} and U is an interleaving of transitions, which arises from merging i → a → o with all transitions of the same name in buffer L:

  U := |||_{ (il → a → ol) ∈ L }  (i ∪ il) → a → (o ∪ ol)
The stop process can stand in for an empty Q, and a symmetrical rule can be given for the left operand.

Example 2. We continue Example 1 and complement the term C with a process ready → Q that accesses the channel for reading. By repeatedly applying the rule for handling parallelism, pending events are added to the buffers of the parallelism operator, and it is deduced that the whole system can engage in event ready as its first step. The buffer contents are the transition sets written inside the extended parallelism operator (underlined in the original presentation):

  C |[ Σ ]| (ready → Q)
    ⇝ · · · ⇝  r[E] : ( C |[ Σ ]| ( (ready → {r}) ||| {r} Q ) )
    ⇝ · · · ⇝  r[E] : ( (ready → {p, r}) ||| ( R |[ {ready → {p}, write?msg → {q}} [Σ] {ready → {r}} ]| {r} Q ) )

In the following net diagram, buffered transitions are denoted with dashed boxes:

  [Net diagram: the merged transition ready with output places p and r, the dashed (buffered) transitions ready and write?msg, and the continuations Q and R.]
  [Figure 7. Interleaving of processes is easy: the nets for P and Q are placed side by side, with interface places and connecting transitions duplicated.]
5.3.5. Translating Interleaving

The interleaving composition of two processes (A ||| B) forms a "base case" of the rewriting system. It has a very natural Petri net representation, due to the concurrency inherent in Petri nets: A ||| B can be translated by simply writing the nets for A and B side by side. Care must be taken, though, when connecting to other processes; in this case the interface places have to be duplicated, as well as the connecting transitions. This is described by the rule shown in Fig. 7. Due to lack of space we refrain from formally stating the rule and refer to [17], where a straightforward but lengthy formulation is given.

5.3.6. Further CSP Operators

The CSP operators for labeling, hiding, and message passing (e.g., a ? x → P) are also treated by the system, but are omitted here for space reasons.

5.3.7. Correctness of the Rewriting System

We have shown the correctness of our rewriting system by first developing a coalgebra-based denotational semantics of the process algebra at hand (based on Roscoe's SOS [16]), and then proving that our rewriting system preserves the meaning of process terms relative to this semantics. This result is documented in [17].

6. Evaluation of Temporal Correctness Assertions

In this section we consider generalized correctness assertions of the form S : M : φ, where S is a netCSP term, M its initial marking, and φ a formula of some modal logic. Here we use HML for simplicity, but more expressive logics like temporal logic or the μ-calculus can be handled as well. The syntax of HML is defined by the grammar

  ForHML ::= true | ¬ForHML | ForHML ∧ ForHML | ⟨Event⟩ForHML

where Event ranges over a set of events. The meaning of the Boolean connectives is as usual; a formula ⟨a⟩φ holds iff the concerned process can, by engaging in an event a, reach a state in which φ holds. Tab. 1 shows some HML correctness assertions and their truth values. Two of the correctness assertions evaluate to ff; the reason is that in both cases place o is already marked and, as a consequence, event a cannot be fired (since firing a requires place o to be empty).

6.1. Evaluation of netCSP Terms in Normal Form

The rules of the calculus presented in Sect. 5 transform a netCSP term into the normal form (NF) with a corresponding (implied) marking M:
  i1 → a1 → o1 ||| · · · ||| in → an → on ||| R,   where R is blocked w.r.t. M,

which is an efficient syntactic representation of the possible first events the process may fire. Now calculus rules for evaluating HML correctness assertions can be applied. We use a
Gentzen-style sequent calculus. Sequents are of the form Γ ⊢ Δ, where Γ and Δ are multisets of correctness assertions. The semantics of a sequent is that the conjunction of the correctness assertions on the left of the sequent symbol implies the disjunction of the assertions on the right. The semantics of a sequent calculus rule is that if the premisses (i.e., the sequents above the horizontal line) can be derived in the calculus, then the conclusion (i.e., the sequent below the line) can be derived as well. Note that in practice sequent rules are applied from bottom to top.

The following rule allows HML correctness assertions to be evaluated. Applied from bottom to top, it produces a number of new correctness assertions about the continuations of the process, which have to be examined subsequently:

           Γ ⊢ ( ak ≐ b ∧ i1 → a1 → o1 ||| · · · ||| in → an → on ||| R : M + (ik, ok) : Φ ),  Δ
               (one premiss for each k = 1, . . . , n with (ik, ok) ∈ En(M))
  (||| R)  ──────────────────────────────────────────────────────────────────────────────
           Γ ⊢ i1 → a1 → o1 ||| · · · ||| in → an → on ||| R : M : ⟨b⟩Φ,  Δ
The rule considers all transitions ak which are enabled, i.e., whose input places are marked and whose output places are empty ((ik, ok) ∈ En(M)). The expression M + (ik, ok) denotes the new marking after transition ak has fired. As an example we derive the HML correctness assertion

  ({i1} → a) ||| ({i2} → a) : (M, M) : ⟨a⟩⟨a⟩true

expressing that there is a possibility for the process ({i1} → a) ||| ({i2} → a) with initial marking (M, M) to fire two consecutive events a. Markings M are here represented as pairs (M(i1), M(i2)), since the process term only contains the places i1 and i2 (we assume i1 ≠ i2). A proof using rule (||| R) contains redundancy, since the only difference between the newly generated correctness assertions is their marking; both the process term and the HML formula stay the same. An obvious improvement is thus to consider correctness assertions with sets of markings. The example from above can then be derived more efficiently:
{i1 }
{i1 }
a ||| {i2 } a : {(D, D)} : true
(true R)
a ||| {i2 } a : {(D, M), (M, D)} : atrue {i1 }
a ||| {i2 } a : (M, M) : a atrue
(||| R) (||| R)
7. Verifying the Example

After loading proof goal (1) into the KeY prover, its verification proceeds without further user interaction. Automated application of rules is controlled in KeY by so-called strategies.

Table 1. Examples of HML correctness assertions

  netCSP term S              initial marking M(o)    HML formula φ     truth value
  a → {o}                    E                       ⟨a⟩true           tt
                             M                       ⟨a⟩true           ff
  a → {o} ||| {o} → b        E                       ⟨a⟩⟨b⟩true        tt
                             M                       ⟨a⟩⟨b⟩true        ff
                             M                       ⟨b⟩true           tt
[Figure 8. The KeY prover after loading the verification example]

Table 2. Number of rule applications and invocations of JCSP primitives for the evaluation of polynomials

  # Polynomials:                 1        2        3
  Rule applications in total     23551    40647    57047
  One2OneChannelInt.read         19       34       49
  One2OneChannelInt.write        17       32       47
  new ZeroBufferInt              5        5        5
  new BufferInt                  1        1        1
  Parallel.run                   1        1        1
A strategy selects in each proof situation a particular rule that is supposed to be applied next. For the example we use a strategy implemented as described in Sect. 6, which eventually reduces (1) to the tautology true, proving that the stated property holds.

7.1. Shape of the Proof

During execution of the polynomial evaluation program, essentially two phases can be identified. In the first part, the network is set up, i.e., JCSP processes are spawned and channels are created. The symbolic execution thereof needs about 7000 rule applications and results in a CSP process term that contains 6 JCSP processes (the gates that make up the network, as well as the network itself) and 6 further subterms modeling the JCSP channels according to the concept from Sect. 4. On the JavaCard level, this corresponds to 22 objects being created, of which 2 are arrays and the remaining 20 mostly belong to the internal implementation of channels.

The second phase covers the execution of the initialized network; the number of rule applications necessary in this part depends on how many polynomials are evaluated in parallel (see Tab. 2). No further processes are spawned in this part of the proof, which means that the shape of the CSP term is mostly preserved. Consequently, the proof gives a good presentation of the step-wise execution of the network, similar to what can be achieved with a debugger, but completely symbolic. The second phase ends with a sequence of events jcspIntEvent(p1(x1)), . . . , jcspIntEvent(pk(xk)) raised by an instance of class Propagator, and this completes the whole proof.

Tab. 2 gives an overview of the JCSP primitives that are invoked during the progression of the network. The write primitive is called less often than read, as some of the gates are already waiting (in vain) for their next input when the proof is closed.

The verification for one polynomial takes about 30 min on a common desktop computer (Pentium 4, 2.6 GHz), and is mostly determined by the currently limited performance of KeY
when dealing with very large terms like the netCSP process term during symbolic execution. More generally, the required time depends on each of the four components of the verification system of Sect. 2. For mostly deterministic programs, symbolic execution (parts (1), (2), (3)) will be the dominating factor, which scales essentially linearly in the code length, whereas for nondeterministic programs the exploration of the state space (part (4)) becomes more costly. We currently have only a naive implementation of the techniques described in Sect. 6, which makes the verification time climb to about 5 h when treating two or three polynomials simultaneously in our example.
8. Related Work

To our knowledge, this paper describes the first verification system for Java programs in combination with the JCSP library. An approach that has already been investigated, in contrast, is the automatic generation of JCSP programs from verified "pure" CSP implementations, as for instance in [14]. For JCSP systems that happen to be created this way, it can be expected that verification is much simpler and can be handled more efficiently, as interpretation of Java code is avoided. We have not compared performance empirically, as we consider the two problems too different. A further direction is the modeling of native Java concurrency features in CSP as a basis for verification, which is performed in [13]. Again, this idea differs significantly from the concept underlying our system.

The EVT system [2] provides a verification environment for Erlang programs based on the first-order μ-calculus. Similar to our method is the usage of temporal correctness assertions in EVT, and we expect that many results derived in the EVT project, particularly concerning induction for the μ-calculus and compositional verification, can also be useful for verifying JCSP programs.

A combination of Petri nets and process algebra is investigated in [3], and the algebra netCSP is designed following this idea to a considerable degree. Apart from that, the comparison of process algebras and Petri nets has a long tradition, see for instance [7]. A translation of CSP process terms to Petri nets comparable to our calculus for netCSP is outlined in [11] (but without integrating the two formalisms into one language and without giving a rewriting system); there, the Petri net representation is used for analysis purposes.
9. Conclusion

We have presented a complete verification approach for concurrent Java or JavaCard programs written using the JCSP library. The method has been implemented on top of the KeY system for deductive verification of Java programs and can be applied to ensure properties of real-world programs, with the restriction that concurrency in the programs must be implemented purely using JCSP functionality instead of the corresponding native Java features (like shared memory).

Our verification system consists of four different layers that are mostly orthogonal to each other, and that can all be realized or developed further independently. The basis is a calculus for the symbolic execution of sequential Java programs, which in our implementation is the already existing (complete) symbolic interpreter of the KeY prover. This interpreter is lifted to the concurrent case by embedding sequential Java programs in CSP terms. In order to make the execution of JCSP primitives possible, we add CSP models of JCSP classes and methods: currently a selection of different JCSP channels, alternation, and the most important JCSP process combinator (parallelism) are supported.
These first two components enable an incremental translation of JCSP programs to CSP terms. The behavior of such terms (resp. the represented processes) is explored stepwise by a calculus for CSP, for which we have chosen a rewriting system that operates on an extension of CSP (called netCSP) integrating process algebra with Petri nets. The usage of Petri nets at this point avoids an early total ordering of execution steps and has, in our implementation, been found to be far more efficient than rewriting systems establishing tree-shaped normal forms of CSP terms. In a last phase, the behavior of the CSP process is checked against a temporal specification. That issue is discussed in this paper for the particularly simple logic HML, which can be regarded as a basis for practically more relevant temporal logics like the μ-calculus.

Apart from the interpreter for sequential Java, we consider each of the components of the verification system as a target of future work:
1. complement the set of supported JCSP features and verify that the CSP models are faithful;
2. improve the netCSP calculus by integrating Petri net reachability analysis, which can be used to simplify process terms;
3. add complete support for more powerful temporal logics and induction;
4. investigate how our method can be combined with compositional verification techniques as for instance described in [6].

Acknowledgement

We thank W. Ahrendt, R. Bubel, W. Mostowski and A. Roth for important feedback on drafts of the paper. Likewise we are indebted to the anonymous referees for helpful comments.

References

[1] Wolfgang Ahrendt, Thomas Baar, Bernhard Beckert, Richard Bubel, Martin Giese, Reiner Hähnle, Wolfram Menzel, Wojciech Mostowski, Andreas Roth, Steffen Schlager, and Peter H. Schmitt. The KeY tool. Software and System Modeling, 4:32–54, 2005.
[2] T. Arts, G. Chugunov, M. Dam, L.-Å. Fredlund, D. Gurov, and T. Noll. A tool for verifying software written in Erlang. Int. Journal of Software Tools for Technology Transfer, 4(4):405–420, August 2003.
[3] J.C.M. Baeten and T. Basten. Partial-order process algebra (and its relation to Petri nets). In J. Bergstra, A. Ponse, and S. Smolka, editors, Handbook of Process Algebra. Elsevier, North-Holland, 2001.
[4] Julian Bradfield and Colin Stirling. Modal logics and mu-calculi: an introduction. In J. Bergstra, A. Ponse, and S. Smolka, editors, Handbook of Process Algebra. Elsevier, North-Holland, 2001.
[5] Zhiqun Chen. Java Card Technology for Smart Cards: Architecture and Programmer's Guide. Java Series. Addison-Wesley, 2000.
[6] M. Dam and D. Gurov. Mu-calculus with explicit points and approximations. Journal of Logic and Computation, 12(2):255–269, April 2002. Abstract in Proc. FICS'00.
[7] U. Goltz. On Representing CCS Programs by Finite Petri Nets. Number 290 in Arbeitspapiere der GMD. Gesellschaft für Mathematik und Datenverarbeitung mbH, Sankt Augustin, 1987.
[8] James Gosling, Bill Joy, Guy Steele, and Gilad Bracha. The Java Language Specification. Addison-Wesley, 2nd edition, 2000.
[9] Matthew Hennessy and Robin Milner. On observing nondeterminism and concurrency. In Proceedings of the 7th Colloquium on Automata, Languages and Programming, pages 299–309. Springer-Verlag, 1980.
[10] C. A. R. Hoare. Communicating Sequential Processes. Prentice-Hall, Englewood Cliffs, NJ, 1985. ISBN 0-13-153289-8.
[11] Krishna M. Kavi, Frederick T. Sheldon, and Sherman Reed. Specification and analysis of real-time systems using CSP and Petri nets. International Journal of Software Engineering and Knowledge Engineering, 6(2):229–248, 1996.
[12] Donald E. Knuth. The Art of Computer Programming: Seminumerical Algorithms. Addison-Wesley, third edition, 1997.
[13] P.H. Welch and J.M.R. Martin. A CSP Model for Java Multithreading. In P. Nixon and I. Ritchie, editors, Software Engineering for Parallel and Distributed Systems, pages 114–122. ICSE 2000, IEEE Computer Society Press, June 2000.
[14] V. Raju, L. Rong, and G. S. Stiles. Automatic Conversion of CSP to CTJ, JCSP, and CCSP. In Jan F. Broenink and Gerald H. Hilderink, editors, Communicating Process Architectures 2003, pages 63–81, 2003.
[15] Wolfgang Reisig. Petri Nets: An Introduction. Springer-Verlag, New York, 1985.
[16] A. W. Roscoe. The Theory and Practice of Concurrency. Prentice-Hall, 1998.
[17] Philipp Rümmer. Interactive verification of JCSP programs. Technical Report 2005–01, Department of Computer Science and Engineering, Chalmers University of Technology, Göteborg, Sweden, 2005. Available at: http://www.cs.chalmers.se/~philipp/publications/jcsp-tr.ps.gz.
[18] Steve Schneider. Concurrent and Real-Time Systems: The CSP Approach. John Wiley & Sons Ltd., 2000.
[19] Sun Microsystems, Inc., Palo Alto/CA, USA. Java Card 2.2 Platform Specification, September 2002.
[20] P.H. Welch and P.D. Austin. Java Communicating Sequential Processes home page. http://www.cs.ukc.ac.uk/projects/ofa/jcsp/.
Architecture Design Space Exploration for Streaming Applications through Timing Analysis

Maarten H. WIGGERS, Nikolay KAVALDJIEV, Gerard J. M. SMIT, Pierre G. JANSEN
Department of EEMCS, University of Twente, the Netherlands
{wiggers,nikolay,smit,jansen}@cs.utwente.nl

Abstract. In this paper we compare the maximum achievable throughput of different memory organisations of the processing elements that constitute a multiprocessor system on chip. This is done by modelling the mapping of a task with input and output channels on a processing element as a homogeneous synchronous dataflow graph, and using maximum cycle mean analysis to derive the throughput. In a HiperLAN/2 case study we show how these techniques can be used to derive the required clock frequency and communication latencies in order to meet the application's throughput requirement on a multiprocessor system on chip that has one of the investigated memory organisations.
Introduction

Advances in silicon technology enable multi-processor system-on-chip (MPSoC) devices to be built. MPSoCs provide high computing power in an energy-efficient way, making them ideal for multimedia consumer applications. Multimedia applications often operate on one or more streams of input data, for example: base-band processing, audio/video (de)coding, and image processing. An MPSoC consists of Processing Elements (PEs). For scalability reasons we envision that in the near future MPSoCs will include a Network-on-Chip (NoC) for communication between PEs, as in, e.g., [1].

Multimedia applications can be modelled conveniently using a task graph, where the vertices represent functions and the edges data dependencies. The data streams through the graph from function to function. A subclass of multimedia applications operates under hard real-time constraints: throughput and latency requirements are put on the inputs and outputs of the task graph. To satisfy these requirements, methods are needed that allow reasoning about, predicting, and guaranteeing the application performance for a given mapping on a multiprocessor architecture. Using such an analysis method, different architectures can be compared, so that for given timing requirements the architecture that runs at the lowest clock frequency can be found.

This paper analyses the temporal behaviour of multimedia applications mapped on a multiprocessor architecture by modelling the mapping with Homogeneous Synchronous DataFlow (HSDF) graphs and applying the associated analysis techniques. The contribution of this paper is that it shows how these analysis techniques can be used for design space exploration, to find an architecture instance given the timing constraints and given an optimisation criterion (in our case clock frequency), which influences the energy efficiency. We explore different memory organisations for the PEs and their consequences for the clock frequency of the processor and the requirements imposed on the NoC.
The approach is based on the following assumptions: i) an upper bound on the task's execution time can be given; ii) upper bounds on the data communication latencies can be given. Finding a tight upper bound on the execution time of a piece of code is a hard problem, but using techniques as presented by Li and Malik [2] this can be done. When multiple tasks are mapped on the same processor, a scheduling policy needs to be applied on this processor that provides an upper bound on the waiting time of a task. An upper bound on the communication latencies can be given by a communication infrastructure that provides guaranteed latency, such as [1][3]. Poplavko et al. [4] use SDF inter-processor communication (IPC) graphs [5] to find minimal buffer sizes by accurately modelling the Æthereal NoC [3] and analysing the temporal behaviour of a JPEG decoder mapped on an MPSoC consisting of ARM processors and the Æthereal NoC. We do not aim for buffer minimization, but for an architecture that meets the application's timing constraints at low energy consumption. An untimed HSDF graph is similar to a Marked Graph Petri Net [6]; the time semantics applied here for HSDF graphs is similar to that of time Petri Nets [7].

The organisation of this paper is as follows. In Section 1, the organisation of the MPSoC template is given. The HSDF model of computation and its associated analysis technique are presented in Section 2. In Section 3, the different memory organisations for the PEs are presented and their throughput is analysed, after which in Section 4 the consequences are described when an application is mapped over multiple PEs. Section 5 describes a case study in which the data processing part of a HiperLAN/2 receiver is mapped on an MPSoC consisting of a number of MONTIUM processing tiles [8], after which we conclude in Section 6.

1. System Organization
An abstract representation of the multiprocessor system considered in this paper is given in Figure 1. It consists of multiple Processing Elements (PEs) that are connected to a Network-on-Chip (NoC) through Network Interfaces (NI). A PE includes a processor, instruction memory, and data memory; the processor is for instance a domain-specific or general purpose processor. One or several tasks (Wi) can execute on a PE. When communicating tasks are mapped on the same PE then the communication channel between them is mapped on the local memory. When communicating tasks are mapped on different PEs then the channel is mapped over the local memories of both PEs and the NoC is used to transport data from one PE to the other. Tasks only access the PE’s local memory.
Figure 1. An abstract representation of a multiprocessor system
The NoC provides reliable, in-order, and guaranteed-latency services on connections. A connection is a channel between NIs, and can go over routers in the NoC. The size of the data items on the connection is known. Guaranteed latency provides an upper bound on the time between the moment that the first word of the data item is written on the connection and the moment that the last word is available for reading. Communication over the NoC is event-triggered: data can be transferred as soon as both NIs (sending and receiving) are ready for communication on the same connection.

The NI hides the NoC details from the PEs. It also has DMA (direct memory access) functionality and can transmit data from the PE's memory on the network and write data received from the network into the memory. The organisation of a PE together with its NI is presented in Figure 2. It consists of a processor, instruction memory, data memory and a NI. The NI can operate in parallel to the processor and accesses the memory for inter-PE communication. Furthermore, the NI has separate sending and receiving parts that operate independently. In this case three parties can request memory access at a particular time: the PE, and the sending and receiving parts of the NI. An extension to more than one input or output connection can be further considered, but for clarity reasons it will not be discussed in this paper.

[Figure 2. PE organization: a Processing Element containing a processor, instruction memory, data memory and an arbiter, attached through a Network Interface (connection1, connection2) to the Network-on-Chip.]
Conflicts between the three parties requesting memory access can be solved through scheduling of memory accesses or through multiple memory ports. Several options for solving the conflicts are discussed in this paper. Each of the options is studied as an HSDF model of a single task running on a PE. Throughput is derived for the models and compared.

2. Homogeneous Synchronous DataFlow
HSDF [9] is a model of computation in which multimedia applications can be conveniently modelled, and whose associated analysis techniques are well suited to deriving the throughput and latency of hard real-time applications. The vertices of an HSDF graph are called actors. Actors communicate by exchanging tokens over channels, which are represented by the edges of the graph. The channels are unbounded first-in first-out (FIFO) buffers. In the HSDF graph, tokens are represented as black dots on the edges. The actors in the HSDF graph represent some activity. An HSDF actor has a firing rule that specifies the number of tokens that needs to be present on the input channels. When the
firing rule is met the actor is enabled, after which it can fire. The difference between the finish time and the firing time is the execution time. At the finish time the actor atomically removes a predefined number of tokens from its input channels and places a predefined number of tokens on its output channels. By definition the actors in a homogeneous SDF graph always consume and produce a single token on a channel; SDF graphs allow the modelling of so-called multi-rate applications. For clarity reasons we restrict the present discussion to HSDF graphs; a similar approach can be taken with SDF graphs. In all the HSDF graphs the token consumption and production rates are omitted for clarity reasons. Self-timed execution of an HSDF graph means that an actor fires as soon as it is enabled.

Figure 3 shows an example HSDF graph that models a bounded FIFO buffer with a capacity of two data items. The actors A1 and A2 are the producer and consumer on this FIFO. The number of tokens on the cycle between the actors corresponds to the capacity of the FIFO. A self-edge with one initial token enforces that the previous firing of the actor must have finished before the next firing can start. A self-edge is required to model state over different firings of the same actor.
  [Figure 3. HSDF model of a FIFO: actors A1 (execution time ET1) and A2 (execution time ET2) on a cycle carrying two tokens, each actor with a self-edge carrying one token.]
HSDF graphs have two important properties: (1) monotonicity and (2) periodicity. Self-timed execution of an HSDF graph is monotonic [10]: decreasing actor execution times will only lead to non-increasing actor firing times, and thus to increasing or unchanged throughput. After an initial transient phase, the self-timed execution of a strongly connected HSDF graph will exhibit periodic behaviour. The throughput of the HSDF graph after the transient phase can be derived using Maximum Cycle Mean (MCM) analysis of a strongly connected HSDF graph [11]. The mean of a simple cycle c in an HSDF graph is defined as the sum of the execution times (ET) of the actors a on the cycle, divided by the number of tokens on the cycle. The MCM of an HSDF graph G, λG, is found by calculating the cycle mean of every simple cycle c:
  λG = max_{c ∈ G} [ ( Σ_{a ∈ c} ET(a) ) / tokens(c) ]        (1)
The throughput T of the graph G is:

  TG = 1 / λG        (2)
For example, the HSDF graph in Figure 3 contains three cycles, and its λG is max[ET1/1, ET2/1, (ET1+ET2)/2], while the throughput is the inverse of λG.
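As a concrete illustration of Eq. (1) and (2), the following C sketch (our own; the execution-time values are hypothetical) computes the cycle means of the three cycles of the FIFO model in Figure 3 and takes their maximum. A general implementation would enumerate the simple cycles of the graph, or use a maximum-cycle-mean algorithm such as Karp's; here the three cycles are known in advance.

  #include <stdio.h>

  /* Cycle mean of a simple cycle: sum of the execution times of the
   * actors on the cycle divided by the number of tokens on it, Eq. (1). */
  static double cycle_mean(double et_sum, int tokens) {
      return et_sum / (double) tokens;
  }

  static double max2(double a, double b) { return a > b ? a : b; }

  int main(void) {
      double et1 = 2.0, et2 = 3.0;          /* hypothetical ET1, ET2 */

      /* Figure 3 has three simple cycles: the self-edge of A1 (one token),
       * the self-edge of A2 (one token), and the buffer cycle between
       * A1 and A2 (two tokens). */
      double mcm = max2(cycle_mean(et1, 1),
                        max2(cycle_mean(et2, 1),
                             cycle_mean(et1 + et2, 2)));

      printf("MCM = %.2f, throughput = %.2f\n", mcm, 1.0 / mcm); /* Eq. (2) */
      return 0;
  }

With these values the buffer cycle has mean 2.5 and the self-edge of A2 has mean 3, so the MCM is 3 and the throughput is 1/3.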
3. Modelling of a Single Task on a PE
This section discusses a single task running on a PE. The task receives and sends its data from/to other PEs. It is shown how the task, including the communication, can be modelled as an HSDF graph, taking into account the PE architecture. The processor and the sending and receiving parts of the NI access the data memory in parallel, and contention may occur on the memory port. In order to resolve the contention, arbitration on the memory port is used. The arbitration can be done at two levels: token level and word level. At token level the arbitration is done at a coarse granularity: access is granted to either the processor or the NI until it finishes its operation (processing, sending or receiving of a data item, respectively). At word level the arbitration is done at a finer granularity: access to the memory is granted on a word-by-word basis. Intuitively, arbitration at the word level is advantageous if either the processor or the NI does not access the memory every clock cycle. This will for instance occur for control-oriented tasks, and for processors with a large register set or multi-cycle operations. In this paper we only consider token-level arbitration, because our focus is on the data processing part of the application, which frequently accesses the memory. For a discussion of word-level arbitration, see [12].
Figure 4. Mapping of an application graph on an MPSoC.
Figure 4 shows how a dataflow graph of an application is mapped on our MPSoC. The application is partitioned into three tasks: W1, W2 and W3. We call the dataflow graph in Figure 4 a mapping-unaware graph. Information about the mapping is included in the graph by extending the mapping-unaware graph with actors that model the communication latency.

[Figure 5. The dataflow between the receiving part of the NI (actor Ci-1, execution time ETCi-1), the processor (actor Wi, execution time ETWi), and the sending part of the NI (actor Ci, execution time ETCi).]
Figure 5 shows how the mapping-unaware graph of a single task, Wi, is extended with the knowledge that the tasks are mapped on different PEs and that communication between the tasks has a certain (guaranteed) latency. The annotated times (ETCi-1, ETWi, and ETCi) represent either the upper bound on the execution time, in the case of the tasks, or the upper bound on the latency of moving a data item from one memory to another. The graph from Figure 5 still does not contain all the information about the PE architecture. It has to be further extended with information about the memory organisation and the arbitration on the data memory port.
We consider three data memory organisations in the following subsections: (1) a single-port, (2) a dual-port, or (3) a three-port data memory organisation. For each organisation an HSDF model is constructed, and the achievable throughput is compared. In a later section it is shown how a model of a complete application running on multiple PEs can be derived using the results for a single PE.

3.1. Arbitration on 1 Memory Port
Assume a PE has one single-port data memory. To resolve the conflicts between the three entities (task, input connection and output connection) that access the memory, a static schedule S0 can be applied. Figure 6 presents this schedule as an HSDF graph. Because of the 1-to-1 mapping, one can view the actors as modelling either the logical entities just mentioned, or the processor, the receiving part of the NI, and the sending part of the NI. The token can be interpreted as a grant for memory usage: the actor that currently possesses the token owns the memory. The edges model the data dependencies between the entities: memory access should first be granted to the input connection Ci-1, then to the task on the processor Wi, and then to the output connection Ci. The execution time of an actor equals the maximal time that the corresponding entity will keep the memory.
Figure 6. The HSDF graph corresponding to schedule S0
Excluding the self-edges, the graph contains one cycle with one token. Applying Eq. (1) and (2), the throughput of the graph is derived:
  λS0 = ETCi-1 + ETWi + ETCi   →   TS0 = 1 / (ETCi-1 + ETWi + ETCi)
If a lower bound T on the throughput has to be guaranteed, then from the above equation we see that the following must hold:
  ETCi-1 + ETWi + ETCi ≤ 1/T
3.2. Arbitration on 2 Memory Ports
When the PE's data memory is implemented as a dual-port memory or as two separate single-port memories, two entities can access it simultaneously. Note that in the case of multiple single-port memories combined with a task that carries state from one firing to the next, special care needs to be taken for storing and retrieving the state. We assume here that the task is a function that does not have state (the self-edge only enforces sequential firings). Figure 7 and Figure 8 present HSDF graphs of two contention-free schedules, S1 and S2, for that memory organisation. There are two tokens circulating in the
graph that correspond to the two memory ports. The actor Wi corresponds to task i, and actors Ci-1 and Ci correspond to the task’s input and output connection respectively.
Figure 7. The HSDF graph corresponding to schedule S1.
Figure 8. The HSDF graph corresponding to schedule S2
Applying Eq. (1) and (2) the throughput of the schedules is:
  λS1 = (ETCi-1 + ETWi + ETCi) / 2   →   TS1 = 2 / (ETCi-1 + ETWi + ETCi)

  λS2 = max( ETCi-1 + ETWi , ETWi + ETCi )   →   TS2 = 1 / max( ETCi-1 + ETWi , ETWi + ETCi )
The throughput of S1 is greater than or equal to the throughput of S2. This is because in S2 the task is granted access to both memory ports. If a lower bound T on the throughput has to be guaranteed, then from the above equations it is seen that the following must hold:
  ETCi-1 + ETWi + ETCi ≤ 2/T,   for S1;

  ETCi-1 + ETWi ≤ 1/T  and  ETWi + ETCi ≤ 1/T,   for S2.
3.3. Arbitration on 3 Memory Ports
When the PE data memory is implemented as a three-port memory or as three separate single-port memories, all three actors can access a memory simultaneously. Arbitration on the memory ports is not needed; it is only necessary to keep the data dependencies. Two HSDF graphs, S3 and S4, for that memory organisation are shown in Figure 9 and Figure 10.
Figure 9. This HSDF graph corresponds to schedule S3
Figure 10. This HSDF graph corresponds to schedule S4.
Applying Eq. (1) and (2) we derive the throughput of the schedules:
  λS3 = (ETCi-1 + ETWi + ETCi) / 3   →   TS3 = 3 / (ETCi-1 + ETWi + ETCi)

  λS4 = max( (ETCi-1 + ETWi)/2 , (ETWi + ETCi)/2 )   →   TS4 = 1 / max( (ETCi-1 + ETWi)/2 , (ETWi + ETCi)/2 )
The throughput of schedule S3 is greater than or equal to the throughput of schedule S4. If a lower bound T on the throughput has to be guaranteed, then from the above equations it is seen that the following must hold:

  ETCi-1 + ETWi + ETCi ≤ 3/T,   for S3;

  ETCi-1 + ETWi ≤ 2/T  and  ETWi + ETCi ≤ 2/T,   for S4.

Extending this discussion to multiple tasks mapped on the processor, and thus multiple connections, can be done either by extending the static-order schedule with these tasks and connections, or by applying, e.g., Time Division Multiple Access (TDMA) arbitration on the processor and NIs, as presented by Bekooij [12].

3.4. Comparison
Table 1 summarises the results for the memory organisations discussed above. For each of them, the table gives the throughput and the constraints on the actors' execution times implied by an application throughput bound T.
Table 1. Summary of the results

  Mem.          Schedule   Throughput                                          Constraints
  Single-port   S0         TS0 = 1 / (ETCi-1 + ETWi + ETCi)                    ETCi-1 + ETWi + ETCi ≤ 1/T
  Dual-port     S1         TS1 = 2 / (ETCi-1 + ETWi + ETCi)                    ETCi-1 + ETWi + ETCi ≤ 2/T
                S2         TS2 = 1 / max(ETCi-1 + ETWi, ETWi + ETCi)           ETCi-1 + ETWi ≤ 1/T and ETWi + ETCi ≤ 1/T
  Three-port    S3         TS3 = 3 / (ETCi-1 + ETWi + ETCi)                    ETCi-1 + ETWi + ETCi ≤ 3/T
                S4         TS4 = 1 / max((ETCi-1 + ETWi)/2, (ETWi + ETCi)/2)   ETCi-1 + ETWi ≤ 2/T and ETWi + ETCi ≤ 2/T
To compare the throughput results we assume the same actor execution times (ETCi-1, ETWi and ETCi) in all five cases. This results in a lattice:

  TS0 ≤ TS1 ≤ TS3
  TS0 ≤ TS2 ≤ TS4
  TS2 ≤ TS1
  TS4 ≤ TS3
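As a quick sanity check (our own hypothetical numbers), take ETCi-1 = ETCi = 1 and ETWi = 2 time units. The formulas of Table 1 then give

  TS0 = 1/4,  TS1 = 1/2,  TS2 = 1/3,  TS3 = 3/4,  TS4 = 2/3,

which satisfies every inequality of the lattice.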
S0 has the lowest throughput and S3 the highest. As can be expected, an increase in memory ports (or in the number of separate memories used) leads to an increase of the PE throughput.

Given an application throughput bound T, the maximal achievable processor utilisation can be derived from the constraints in Table 1. Higher processor utilisation leads to lower clock frequencies and therefore to lower power consumption. Processor utilisation U is defined as the ratio between the time a processor is busy and the period at which the data arrives. For each data item a processor is busy for time ETWi, and the data arrival period is 1/T; thus U = T · ETWi. Taking into account that the throughput bound requires the execution times of all actors to be smaller than or equal to 1/T, we derive from the constraints the maximal achievable ETWi and thus the maximal achievable processor utilisation. The results are given in Table 2. S0 has the worst utilisation, while S1, S3 and S4 allow for 100% utilisation of the processor.

Table 2. Maximal achievable processor utilisation

  Mem.          Schedule   Maximal processor utilisation
  Single-port   S0         1 − T · (ETCi-1 + ETCi)
  Dual-port     S1         1
                S2         min( 1 − T · ETCi-1 , 1 − T · ETCi )
  Three-port    S3         1
                S4         1
In the same way the latency requirements can be compared. Consider the constraint inequalities in Table 1 and assume that the processing time ETWi is fixed. It can then be seen that the latency requirements (on ETCi-1 and ETCi) are strictest for S0 and most relaxed for S3 and S4.

4. Application Model
The previous section discussed how a single task of an application can be modelled such that information about the PE architecture on which the task runs is included in the HSDF graph. Here the model is extended to the entire application. Consider the application shown in Figure 4 and assume that all its tasks (W1, W2 and W3) are mapped on PEs with a single-port memory. The HSDF graph of the mapping is shown in Figure 11. It is constructed by extending the original application graph with the communication latencies and with the constraints between the different actors due to the scheduling on the memory port. The communication latency ETCi is the time that it takes to move a token (data item) from the data memory in PEi to the data memory in PEi+1.

[Figure 11. An HSDF graph of the application from Figure 4, assuming PEs with a single-port memory and direct communication between the tasks: a pipeline of connection actors C0..C3 (execution times ETC0..ETC3) and task actors W1..W3 (execution times ETW1..ETW3).]
This graph contains three simple cycles each with a single token. Applying Eq. (1) and (2) for this HSDF graph we find that the throughput of the application is:
  TG = 1 / max_{i ∈ {1,2,3}} ( ETCi-1 + ETWi + ETCi )
This can be restated in the following way: the necessary and sufficient condition for the application to have a throughput equal to or higher than T is:

  ETCi-1 + ETWi + ETCi ≤ 1/T,   for i ∈ {1, 2, 3}
This system of inequalities gives the relation between the global application throughput requirement T and the constraints for a particular mapping of the tasks.

When the communication between PEs is not direct and data is buffered in between, the application HSDF graph is changed as shown in Figure 12 for a buffer capacity of n data items. For example, data is written through the network to a logical FIFO implemented on a memory that is larger than the local memories, and later read again through the network. The execution times of the send (S) and receive (R) actors equal the latency guarantees given by the NoC for transmission of the data to and from this secondary memory, plus the time required to update the FIFO administration.
[Figure 12. Buffered communication between the PEs: tasks Wi and Wi+1 with receive actors Ri, Ri+1 (execution times TRi, TRi+1) and send actors Si, Si+1 (execution times TSi, TSi+1); storage with FIFO organisation and a capacity of n data items is assumed between Si and Ri+1.]
Figure 13 presents an HSDF model of the application from Figure 4, assuming PEs with a dual-port memory using schedule S2. It is derived by extending the original application graph with details about the PE architecture as in Figure 8. The communication between the PEs is direct.

[Figure 13. HSDF graph of the application from Figure 4, assuming PEs with dual-port memory and direct communication between the tasks.]
The graph contains six simple cycles, each with one token. Applying Eq. (1) and (2), the throughput of the application is derived:

  TG = 1 / max_{i ∈ {1,2,3}} ( ETCi-1 + ETWi , ETWi + ETCi )
If a lower bound T on the application throughput has to be guaranteed, then the following should hold:
  ETCi-1 + ETWi ≤ 1/T  and  ETWi + ETCi ≤ 1/T,   for i ∈ {1, 2, 3}
In the same way, HSDF models for the other PE organisations can be constructed. It is not necessary for all PEs to have the same organisation; the architecture can be heterogeneous, as for each PE a corresponding HSDF graph is substituted. Figure 14 shows an example HSDF graph of the same application, assuming that the first PE has a dual-port memory with schedule S1, the second PE has a three-port memory with schedule S4, and the PE that task W3 is mapped on has a single-port memory.
[Figure 14. HSDF graph of the application from Figure 4 with the heterogeneous PE organisation described above (dual-port memory with S1, three-port memory with S4, single-port memory) and direct communication between the tasks: connection actors C0..C3 (execution times ETC0..ETC3) and task actors W1..W3 (execution times ETW1..ETW3).]
The graph contains four simple cycles: three with two tokens on them and one with a single token. According to Eq. (1) and (2), the throughput of the application is:

  TG = 1 / max[ (ETC0 + ETW1 + ETC1)/2 , (ETC1 + ETW2)/2 , (ETW2 + ETC2)/2 , (ETC2 + ETW3 + ETC3)/1 ]
Each of the four terms in the max function corresponds to one of the cycles in the graph. If a lower bound T on the application throughput has to be guaranteed, then the following must hold:

  ETC0 + ETW1 + ETC1 ≤ 2/T
  ETC1 + ETW2 ≤ 2/T
  ETW2 + ETC2 ≤ 2/T
  ETC2 + ETW3 + ETC3 ≤ 1/T
5. HiperLAN/2 Example
In this section a HiperLAN/2 receiver is used as an example to demonstrate how HSDF throughput analysis is applied to real streaming applications. HiperLAN/2 [13] is a wireless local area network (WLAN) standard, based on Orthogonal Frequency Division Multiplexing (OFDM), which is defined by the European Telecommunications Standards Institute (ETSI). The HiperLAN/2 receiver will run on three PEs. The PEs are MONTIUM processing tiles [8], domain-specific processors for the domain of mobile communications. The tiles communicate through a NoC as presented in [1]. The application is partitioned into three tasks [14], each of which will run on a separate PE. The dataflow graph is given in Figure 15. The tasks W1, W2 and W3 implement the base-band processing of the HiperLAN/2 receiver. The graph is annotated with the sizes of the data items on the communication channels and with the number of clock cycles required for processing a data item on a MONTIUM. The data item size is required in order to request a guaranteed-latency connection, and the number of cycles enables calculation of the task execution times. Furthermore, the graph is a homogeneous SDF graph: all consumption and production rates are 1.
[Figure 15. Process graph of a HiperLAN/2 receiver: IN (T = 4 μs) → W1 → W2 → W3 → OUT, with data item sizes of 256 B, 256 B, 192 B and 36 B on the successive channels, and processing durations of 67, 204 and 110 clock cycles for W1, W2 and W3 respectively. W1: frequency offset correction; W2: inverse OFDM; W3: equalization, phase offset correction and de-mapping.]
A HiperLAN/2 receiver has to handle a new OFDM symbol (data item) every 4 μs; this is the throughput requirement of this application. It is required that the application has a throughput greater than or equal to 1/(4 μs) = 250 OFDM symbols per ms. The MONTIUM tile has a single-port memory and the NoC provides direct communication without buffering; therefore, the HSDF graph from Figure 11 can be used directly for modelling the application. Here the arriving OFDM symbols correspond to tokens arriving at the application, and the lower bound on the application throughput is T = 250 [tokens/ms]. Assuming that the three tiles run at a clock frequency of 100 MHz, and considering the number of cycles per firing given in Figure 15, we can calculate the execution times of the processing actors in the HSDF graph: ETW1 = 0.67 μs, ETW2 = 2.04 μs, ETW3 = 1.1 μs. Taking into account the throughput requirement T and the system of inequalities given for the graph in Figure 11,
1 , for i ^1,2,3` , T
we derive the constraints for the communication latencies: ETC 0 ETC1 d 3.33Ps ETC1 ETC 2 d 1.96 Ps ETC 2 ETC 3 d 2.9 Ps
One possible solution of this system of inequalities is: ETC0 = 2.35 μs, ETC1 = 0.98 μs, ETC2 = 0.98 μs, ETC3 = 1.92 μs. These are the upper bounds on the latency guarantees to be requested from the network. The utilisation of the MONTIUM tiles will be: U1 = 0.17, U2 = 0.51, U3 = 0.28. In case the network cannot provide the requested latency guarantees, we can take the lowest possible latency that can be provided. Starting with these fixed latencies, the system of inequalities then gives the minimum task execution times ETW1, ETW2 and ETW3, and consequently the minimum processor clock frequencies.
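These utilisation figures follow directly from the definition U = T · ETWi of Section 3.4, with T = 1/(4 μs) (our own arithmetic, as a check):

  U1 = 0.67 μs / 4 μs ≈ 0.17,   U2 = 2.04 μs / 4 μs = 0.51,   U3 = 1.1 μs / 4 μs ≈ 0.28.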
If the MONTIUM tiles had a dual-port memory, then according to Table 2 it would be possible to achieve 100% processor utilisation (applying S1). Assume that this is the case. In order to keep the tiles busy all the time, the task execution times are set equal to the arrival period of the data items: ETW1 = ETW2 = ETW3 = ET = 4 μs. Taking into account the number of cycles given in Figure 15, the tiles' clock frequencies are calculated: f1 = 16.75 MHz, f2 = 51 MHz, f3 = 27.5 MHz. Considering schedule S1, the graph in Figure 7 is used for constructing the HSDF graph, given in Figure 16, of the application running on the three tiles.

[Figure 16. HSDF graph of a HiperLAN/2 receiver running on three MONTIUM tiles, assuming the tiles had dual-port memories organised according to schedule S1.]
The throughput equations for the graph in Figure 7 have already been derived. They give the necessary and sufficient conditions for guaranteeing a lower bound T on the application throughput:

  ETCi-1 + ETWi + ETCi ≤ 2/T,   for i ∈ {1, 2, 3}.
Since the task execution times are already fixed, for the communication latencies it must hold that:

  ETC0 + ETC1 ≤ 4 μs
  ETC1 + ETC2 ≤ 4 μs
  ETC2 + ETC3 ≤ 4 μs
One possible solution of this system of inequalities is: ETC0 = 2 μs, ETC1 = 2 μs, ETC2 = 2 μs, ETC3 = 2 μs. Compared with the results for the real MONTIUM architecture, we see that with a dual-port memory and S1 the highest possible tile utilisation is achieved, while the latency requirements are the same or relaxed.

6. Conclusion
We have shown how different memory organisations of the processing elements that constitute an MPSoC can be compared based on their throughput. Further we have shown how the throughput of a mapping can be evaluated by first modelling the application as an HSDF graph and then extending this graph with actors that model the effects of the mapping, e.g. the latency of the communication channels. Even though we have only presented an application that is organised as a pipe, we believe that this approach can be extended in a straightforward way to include arbitrary application graph topologies. One of the strengths of this approach is that we can model the application as well as the mapping on possibly heterogeneous PEs in a single graph in an intuitive way. Throughput
can be derived from this graph by analytical means, allowing for tool support, which will be necessary for larger or multi-rate graphs. HSDF graphs can only model static behaviour, in the sense that they cannot model dynamic token consumption or production rates, or dynamic (data-dependent) execution times. How we can accurately model and analyse the interaction between the control and data parts of the application is therefore future work.

References

[1] N. Kavaldjiev, G.J.M. Smit, P.G. Jansen, A Virtual Channel Router for On-chip Networks, Proceedings of IEEE International SOC Conference, pp. 289–293, September 2004.
[2] Y.-T. S. Li and S. Malik, Performance Analysis of Real-Time Embedded Software, ISBN 0792383826, Kluwer Academic Publishers, 1999.
[3] E. Rijpkema, K.G.W. Goossens, and A. Radulescu, Trade-Offs in the Design of a Router with Both Guaranteed and Best-Effort Services for Networks on Chip. Proceedings of DATE'03, pp. 350–355, ACM, 2003.
[4] P. Poplavko, T. Basten, M. Bekooij, J. van Meerbergen, and B. Mesman. Task-Level Timing Models for Guaranteed Performance in Multiprocessor Networks-on-Chip, CASES'03, October 2003.
[5] S. Sriram and S.S. Bhattacharyya, Embedded Multiprocessors: Scheduling and Synchronization, Marcel Dekker, Inc., 2002.
[6] T. Murata, Petri Nets: Properties, Analysis, and Applications. Proceedings of the IEEE, vol. 77, no. 4, pp. 541–580, April 1989.
[7] A. Cerone and A. Maggiolo-Schettini, Time-based Expressivity of Time Petri Nets for System Specification. Theoretical Computer Science, 216, pp. 1–53, 1999.
[8] P.M. Heysters, G.J.M. Smit, and E. Molenkamp. A Flexible and Energy-Efficient Coarse-Grained Reconfigurable Architecture for Mobile Systems, The Journal of Supercomputing, vol. 26, issue 3, Kluwer Academic Publishers, November 2003.
[9] E.A. Lee and D.G. Messerschmitt, Synchronous Data Flow. Proceedings of the IEEE, vol. 75, pp. 1235–1245, 1987.
[10] M. Bekooij, O. Moreira, P. Poplavko, B. Mesman, M. Pastrnak, and J. van Meerbergen. Predictable embedded multi-processor system design. Scopes 2004, 8th International Workshop on Software and Compilers for Embedded Systems, Amsterdam, The Netherlands, 2–3 September 2004.
[11] F. Baccelli, G. Cohen, G.J. Olsder, and J.-P. Quadrat, Synchronization and Linearity. New York: Wiley, 1992.
[12] M. Bekooij, S. Parnar, and J. van Meerbergen, Performance Guarantees by Simulation of Process Networks. To appear in Scopes 2005.
[13] ETSI, Broadband Radio Access Networks (BRAN); HIPERLAN Type 2; Physical (PHY) layer, ETSI TS 101 475 V1.2.2 (2001-02), 2001.
[14] G.K. Rauwerda, P.M. Heysters, G.J.M. Smit, Mapping Wireless Communication Algorithms onto a Reconfigurable Architecture, Journal of Supercomputing, Kluwer Academic Publishers, December 2004.
[15] Pascal T. Wolkotte, Gerard J.M. Smit, L.T. Smit, Partitioning of a DRM Receiver, Proceedings of the 9th International OFDM-Workshop, pp. 299–304, Dresden, September 2004.
A Foreign-Function Interface Generator for occam-pi

Damian J. DIMMICH and Christian L. JACOBSEN
Computing Laboratory, University of Kent, Canterbury, CT2 7NZ, England.
{djd20, clj3}@kent.ac.uk

Abstract. occam-π is a programming language based on the CSP process algebra and the pi-calculus, and has a powerful syntax for expressing concurrency. occam-π does not, however, come with interfaces to a broad range of standard libraries (such as those used for graphics or mathematics). Programmers wishing to use these must write their own wrappers using occam-π's foreign function interface, which can be tedious and time-consuming. SWIG offers automatic generation of wrappers for libraries written in C and C++, allowing access to these from the target languages supported by SWIG. This paper describes the occam-π module for SWIG, which will allow automatic wrapper generation for occam-π, and will ensure that occam-π's library base can be grown in a quick and efficient manner. Access to database, graphics and hardware interfacing libraries can all be provided with relative ease when using SWIG to automate the bulk of the work.
Introduction

This paper presents a tool for rapid and automated wrapping of C libraries for occam-π [1], a small and concise language for concurrent programming. While occam-π already has a foreign function interface (FFI) [2], which provides the means for extensibility, creating and maintaining large-scale foreign library support manually is time-consuming. By automating the wrapping of foreign libraries, however, access can be provided to a large existing codebase from within occam-π without a prohibitively large investment in time. Both language developers and users will benefit, as both will be able to easily add support for new libraries.

SWIG (Simplified Wrapper and Interface Generator) [3] is a multi-language foreign function interface generator, providing the infrastructure needed for automatic library wrapping. Support for the generation of wrappers for individual languages is provided by self-contained modules, which are added to SWIG. This paper describes the occam-π module¹, which was created by the authors to enable automatic generation of occam-π wrappers for C code.

We start the paper by providing details on the background and motivation for this work and the tools used. In Section 2 we give a brief overview of the occam-π foreign function interface, followed by implementation details of the occam-π module for the SWIG framework in Section 3. Section 4 provides an overview of using SWIG, and Section 5 gives examples of using SWIG to wrap a simple fictitious C library, as well as of how occam-π can be integrated with the OpenGL library. Finally, Section 6 provides concluding remarks and ideas for future work.
¹ The occam-π module is not, at the time of writing, part of the official SWIG distribution.
1. Background and Motivation

occam-π is a programming language designed explicitly for concurrent programming, and is a concise expression of the CSP [4][5] process algebra. It also incorporates ideas from the pi-calculus [6]. While occam-π is a small language, it is nevertheless powerful enough to express applications ranging from control systems for small robots [7] to large applications modelling aspects of the real world [8].

It is not, however, always feasible to program an entire solution exclusively in occam-π. Applications needing to deal with file I/O, graphical user interfaces, databases or operating system services are not possible unless a very large existing code-base is rewritten in occam-π, or made available to occam-π through existing foreign libraries implementing such services. occam-π's foreign function interface allows users to reuse existing C code or write certain portions of an application in C. It does, however, require a large amount of wrapper code to interface a library with occam-π. This is not a big problem when dealing with small amounts of code, but writing a wrapper for even a relatively modest library can quickly become time-consuming, tedious, and therefore error-prone. This becomes a further problem when one considers that the library being wrapped can evolve over time: the wrappers must be updated to reflect changes in the library in order to remain useful.

Without better access to existing libraries and code, it may be difficult to argue that occam-π is a better choice for architecting large, complex systems than other languages. It must be made simpler to leverage the large amount of existing work and infrastructure that is provided through operating system and other libraries. We believe that it is imperative for the future success of occam-π that it does not just evolve the mechanisms needed to express new and exciting concurrent ideas, but that it is also able to make use of the large amount of existing work (system, graphical and database libraries) which has gone before it, and which will add functionality to occam-π.

1.1. Interface Generators

The work presented in this paper is not the first to provide automatic wrapping of foreign library code for occam-π. The occam-to-C interface generator Ocinf [9] was the first widely available interface generator for occam. Ocinf can generate the glue code to wrap C functions, data structures and macros so that they can be used from occam2. Since the occam-π syntax is a superset of occam2's, and they share the same FFI mechanisms, it would still be possible to use Ocinf to generate interfaces for occam-π. Ocinf, however, has not been maintained since 1996 and relies on outdated versions of Lex and Yacc [10]. It has proven difficult to get Ocinf to work, since Lex and Yacc have evolved and no longer support much of the old syntax. Making the Ocinf code base work with the current versions of the tools would require rewriting significant portions of 7,000 lines of Lex and Yacc productions. With the emergence of SWIG as the de facto standard in open source interface and wrapper generation, we chose not to pursue the Ocinf tool further.

The SWIG framework is a general purpose code generator which has been in constant development since 1996. It currently allows more than 11 languages to interface with C and C++ code, including Java, Perl, Python and Tcl.
SWIG is a modular framework written in C++; adding a new language-specific module provides interface generation support for virtually any programming language with a C interface. Additionally, SWIG comes with good documentation and an extensive test suite which can help to ensure higher levels of reliability. Uses of SWIG range from scientific [11] to business [12] to government. While other interface generators exist, they are generally language-specific and not designed to provide wrapper generation capabilities for other languages. SWIG was designed from the outset to be language-independent, which gives it a wide and active user base.
1.2. SWIG Module Internals

SWIG itself consists of two main parts: an advanced C/C++ parser, and language-specific modules. The C/C++ parser reads header files or specialised SWIG interface files and generates a parse tree. The specialised interface files can provide the parser with additional SWIG-specific directives which allow the interface file author to rename, ignore or apply contracts to functions, use target-language-specific features, or otherwise customise the generated wrapper. A language-specific module inherits a number of generic functions for dealing with specific syntax. Functions are overloaded by the modules and customised to generate wrapper code for a given language. The actual wrapper code is generated after the parse tree has undergone a series of transformations, some of which a specific module may take part in. Library functions are provided to allow easy manipulation and inspection of the parse tree. The SWIG documentation [13] provides more detailed insight into how SWIG functions, and details on how one would go about writing a new language module.

2. An Overview of the occam-π FFI

The occam-π FFI requires that foreign C functions be wrapped up in code that informs the occam-π compiler KRoC [14] how to interface with a given foreign function. This is required as the calling conventions for occam-π and the C code differ. The wrapper essentially presents the arguments on the occam-π stack to the external C code in a manner that it can use. We will illustrate the wrapping process with the following C function:

int aCfunction(char *a, int b);
occam-π performs all FFI calls as a call to a function with only one argument, using C calling conventions. The argument is a pointer into the occam-π stack, in which the actual arguments reside, placed there by virtue of the external function signature provided to the occam-π compiler. The arguments on the stack are all one word in length, and the pointer into the stack can therefore conveniently be accessed as an array of ints. In order to correctly access an argument, it must first be cast to the correct data-type, and possibly also dereferenced in cases where the argument on the stack is in fact only a pointer to the real data.

void _aCfunction(int w[])
{
    *((int *)(w[0])) = aCfunction((char *)(w[1]), (int)(w[2]));
}
The code above defines an occam-π callable external C function with the name _aCfunction, which takes an array of ints. This function's job is to call the real aCfunction with the provided arguments, which then performs the actual work. The array passed to _aCfunction contains pointers to data, or in some cases the data itself, which is to be passed to the wrapped function, as well as a pointer used to capture the return value of the called function. While a function in C may have a return value, external functions in occam-π are presented as PROCs, which may not return a value directly. Instead, reference variables can be used for this task. In cases where a function has no return value, one simply omits the use of a reference variable to hold the result of the called external function. In essence, the wrapper function just expands the array int w[] into its sub-components and typecasts them to the correct C types that the wrapped function expects. The occam-π components that complete the wrapping are defined as follows:
#PRAGMA EXTERNAL "PROC C.aCfunction(RESULT INT result, BYTE a, VAL INT b) = 0"
INLINE PROC aCfunction (RESULT INT result, BYTE a, VAL INT b)
  C.aCfunction (result, a, b)
:
The first component is a #PRAGMA compiler directive which informs the compiler of the name of the foreign function, its type and its parameters. The #PRAGMA EXTERNAL directive is similar to C's extern keyword. Function names are prefixed with one of "C.", "B.", or "BX." for a standard blocking2 C call, a non-blocking C call and an interruptible non-blocking C call respectively. This prefix is used to determine the type of foreign function call, and is not used when determining the name of the external C function, which should in fact be prefixed with an underscore instead (regardless of its type): the PROC C.aCfunction will call the C function called _aCfunction.

2 By blocking, we mean blocking of the occam-π run-time kernel.

The second PROC is optional and serves only to provide a more convenient name to the end user, by presenting the wrapped function without the call-type prefix. While this is not strictly necessary, it enables the wrapper to provide an interface which follows the wrapped library more closely. As demonstrated, it should be clear that manually producing wrapper code for a small number of functions is not a problem. However, writing such code for larger bodies of functions is laborious and error-prone. The OpenGL [15] library is a prime example of a library where automation is a must, as the library consists of over five hundred functions. More information on how to use KRoC's foreign function interface and the various types of system calls can be found in D. C. Wood's paper [2]. Details of performing non-blocking system calls from KRoC can be found in Fred Barnes' paper [16]. Non-blocking foreign functions ("B." and "BX." prefixes) cannot currently be generated automatically by SWIG; this is an area of future work.

3. Using SWIG to Generate Wrappers

The wrapper code generated by SWIG is much the same as one would generate by hand, as demonstrated above. In this section we provide more detail on how the occam-π SWIG module performs the mapping from the interface file to the generated wrapper.

3.1. Generating Valid occam-π PROC Names

In order to allow C names to be mapped to occam-π, all '_' characters must be replaced by '.' characters. This is done as the occam-π syntax does not allow underscore characters in identifier names. A function such as int this_function(char a_variable) would map to PROC this.function(RESULT INT r, BYTE a.variable). The only real effect this has is on function and struct naming, since parameter names are not actually used by the programmer.
3.2. Autogenerating Missing Parameter Names

SWIG needs to generate parameter names automatically for the occam-π wrappers should they be absent in function definitions. Consider a function prototype such as:

int somefn(int, int);
SWIG will automatically generate the missing parameter names for the PROCs which wrap such functions. This does not affect the user of the wrappers, as the parameter names are of
no importance, other than possibly providing semantic information about their use. Parameter names are, however, necessary in order to make the occam-π wrapper code compile. The code listed above would map to a PROC header similar to:

PROC somefn (RESULT INT return.value, VAL INT a.name0, VAL INT a.name1)
The occam-π module for SWIG generates unique variable names for all autogenerated parameter names in PROC headers, ensuring that there are no parameter name collisions.

3.3. Data Type Mappings

The mapping of primitive C data types to occam-π is straightforward, as there is a direct association from one to the other. The mappings are based on the way parameters are presented on the occam-π stack during a foreign function call. For example, an occam-π INT maps to a C int * (that is, the value on the occam-π stack is a pointer to the pass-by-reference INT, and dereferencing is needed to get to the actual value). The complete set of type mappings can be found in [2].

3.4. Structures

C's structs can be mapped to occam-π's PACKED RECORDs. Ordinary occam-π RECORDs cannot be used, as the occam-π compiler is free to lay out the fields in this type of record as it sees fit. PACKED RECORDs, on the other hand, are laid out exactly as they are specified in the source code, leaving it up to the programmer (or in this case SWIG) to add padding where necessary. As an example, the following C struct:

struct example {
    char a;
    short b;
};
would map to the following PACKED RECORD on a 32-bit machine:

DATA TYPE example
  PACKED RECORD
    BYTE a:
    BYTE padding:
    INT16 b:
:
The handling of structs is somewhat fragile, however, as it relies on C structs being laid out in a certain way. This may not necessarily be the case across different architectures or compilers, and certainly not when word sizes or endianness differ. This makes the use of structs a potential hazard when it comes to the portability of the generated wrapper. In cases where this would be a problem, it is possible to use the set of C accessor and mutator functions automatically generated by SWIG for the structure. These can be used by the occam-π program to access and mutate a given structure. It should even be possible for SWIG to produce code to automatically convert from an occam-π version of a structure to a C version (and vice versa), in order to provide more transparent struct access to the end user. This would of course be significantly slower than mapping the structures directly into occam-π PACKED RECORDs.
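One way to catch layout drift at build time is a compile-time check in the C wrapper. The following is our sketch, not part of SWIG's output; it fails compilation if struct example is not laid out as the PACKED RECORD above assumes:

#include <stddef.h>

/* Negative array sizes force a compile error if the layout
 * assumption breaks (e.g. on a compiler with other padding rules). */
typedef char check_b_offset[(offsetof(struct example, b) == 2) ? 1 : -1];
typedef char check_size[(sizeof(struct example) == 4) ? 1 : -1];

The accessor and mutator functions mentioned above might look like the following sketch; the names here are illustrative rather than SWIG's exact generated names:

short example_b_get(struct example *self)
{
    return self->b;
}

void example_b_set(struct example *self, short value)
{
    self->b = value;
}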
3.5. Unions

A C union allows several variables of different data types to be declared to occupy the same storage location. Since occam-π does not have a corresponding data type, a workaround needs to be implemented. The occam-π RETYPES keyword allows the programmer to assign data of one type into a data structure of another. A struct that is a member of a union can then be retyped to an array of bytes. This is useful since the PROC wrapping a function that takes a union as one of its arguments can take an array of BYTEs, which are then cast to the correct type in the C part of the wrapper. This means that any data structure which is a member of a union can be easily passed to C. Functions which return unions can return a char *, which can then be retyped to the corresponding structure in occam-π. The remaining difficulty with this approach is that occam-π programmers need to make sure that they are retyping to and from the correct type of data, as it is easy to mistakenly assign the BYTE array to the wrong data structure and vice versa. The occam-π compiler, like the C compiler, is unable to check for the correctness of such an assignment.
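The C side of this workaround might look like the following sketch, in the style of the FFI glue from section 2; the union and function names are hypothetical and only illustrate the byte-array cast described above:

union value {
    int i;
    float f;
};

/* The wrapped library function, taking a union argument. */
extern void use_value(union value v);

/* FFI glue: the occam-pi side passes its BYTE array, which arrives
 * here as an untyped word and is cast back to the union type. */
void _use_value(int w[])
{
    use_value(*((union value *)(w[0])));
}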
3.6. Pointers and Arrays

It is not always possible to know if a pointer is just a pointer to a single value, or in fact a pointer to an array. Different wrapping functions would be needed in each case. The problem occurs because, in C, an array can be specified using square brackets, int a[], or as a pointer, int *a. By default, pointers are treated as if they are not arrays, and mapped into a pass-by-reference occam-π parameter. If a parameter actually does refer to an array, it is possible to force SWIG to generate a correct wrapper for that function by prepending 'array_' to the parameter name. Examples of this are provided throughout the paper where needed.

3.7. Typeless Pointers

The current default behaviour for type mapping void * to occam-π is to use an INT data type. Since void * can be used in a function which takes an arbitrary data type, this restricts its usage somewhat, to only allow INTs to be passed to the function. As an example of mapping a void *, OpenGL's glCallLists function is shown here:

void glCallLists(GLsizei n, GLenum type, const GLvoid *array_lists);

Note that the 'array_' string is prepended to the 'lists' variable name, to indicate that it receives an array. The GLsizei n variable3 tells the function the size of the data being passed to it; the GLenum type variable specifies the type being passed into the GLvoid *array_lists, so that the function knows how to cast it and use it correctly. Since the default type mapping behaviour here is to type map a C void * to an occam-π INT, some of the ability to pass it data of an arbitrary type is lost. So, when calling glCallLists from occam-π, one always has to specify that one is passing an integer to glCallLists, by passing the correct enum value. Here is an example of calling glCallLists from occam-π:

-- A simple PROC that takes in an arbitrary-length array
-- of BYTEs and prints them to screen. The code needs to
-- cast the BYTEs to INTs so that they can be passed
-- to the wrapped glCallLists function.
PROC printStringi (VAL []BYTE s, VAL INT fontOffset, VAL INT length)
  MOBILE []INT tmp:
  SEQ
    tmp := MOBILE [length]INT
    SEQ i = 0 FOR length
      tmp[i] := (INT (s[i]))
    glPushAttrib (GL.LIST.BIT)
    glListBase (fontOffset)
    -- SIZE s returns the number of elements in s.
    glCallLists (SIZE s, GL.UNSIGNED.INT, tmp)
    glPopAttrib ()
:

3 GLsizei, GLenum and GLvoid are simply C typedef declarations, mapping to int, enum and void types respectively. These are used to enable more architecture-independent code, should types work slightly differently on other platforms.
It is possible, by manually writing some C helper functions, to allow the end users of a library to pass a greater range of data types to functions taking void * parameters. This can be done by writing proxy C functions for every type of data that the original function accepts, which then call the typeless function with the appropriate parameters. In this way it would, for example, be possible to provide a PROC glCallLists.R32(VAL INT n, []REAL32 lst) which accepts REAL32 data, as sketched below. Access to PROCs accepting other types can be provided in a similar manner. It may even be possible to let SWIG automate much of this work by using its macro system.
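The C side of such a proxy might look like the following sketch (assuming the OpenGL header is included; the function name mirrors the PROC name suggested above, but is our illustration rather than generated code):

/* Proxy for glCallLists that accepts 32-bit floating point data;
 * it fixes the 'type' argument, so the occam-pi caller need not
 * pass an enum value. */
void glCallLists_R32(GLsizei n, const GLfloat *lst)
{
    glCallLists(n, GL_FLOAT, (const GLvoid *)lst);
}

This proxy would then be wrapped by SWIG like any other C function, giving the occam-π program a typed interface to the typeless original.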
3.8. Dealing with Enumerations

C enums allow a user to define a list of keywords which correspond to an increasing integer value. These are wrapped as a series of integer constants. So, for an enum defined as:

enum example {
    ITEM1 = 1,
    ITEM4 = 4,
    ITEM5,
    LASTITEM = 10
};

the following occam-π code is generated:

-- Enum example wrapping Start
VAL INT ITEM1 IS 1:
VAL INT ITEM4 IS 4:
VAL INT ITEM5 IS 5:
VAL INT LASTITEM IS 10:
-- Enum example wrapping End
If several enumerations define the same named constant, a name clash occurs when the wrapper is generated. If this is a problem, it is possible to change the names of the enum members in the interface file (without affecting the original definition). This will in turn affect the names of the generated constants in the wrapper, thus making it possible to avoid the name clash. Should enum name clashes be a regular occurrence, it would be possible to implement an option of naming the enums differently to ensure that the wrapped constants have unique names. For example, the wrapped constant names above could be generated as VAL INT example.ITEM1 IS 1:, using the enum's name as a prefix to the constant's name. The programmer using the wrapped code would have to be aware of the convention used in the names of enum constants.
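Renaming an enum member in the interface file can be done with SWIG's %rename directive; the replacement name below is purely illustrative:

/* In the .i interface file: wrap ITEM1 under a non-clashing name,
 * leaving the C definition itself untouched. */
%rename(EXAMPLE_ITEM1) ITEM1;

enum example { ITEM1 = 1, ITEM4 = 4, ITEM5, LASTITEM = 10 };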
3.9. Preprocessor Directives

C's #define preprocessor directives are treated similarly to enums. Any value definitions are mapped to corresponding constants in occam-π. More complex macros are ignored. The following listings show how some #define statements map to occam-π:

/* C */
#define AN_INTEGER 42
#define NOT_AN_INTEGER 5.43

-- occam-pi
VAL INT AN.INTEGER IS 42:
VAL REAL64 NOT.AN.INTEGER IS 5.43:
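A function-like macro is a typical example of a 'more complex' macro that would simply be skipped (our illustration, not from the paper):

/* Function-like macro: no constant value, so no occam-pi mapping. */
#define MAX(a, b) ((a) > (b) ? (a) : (b))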
3.10. Wrapping Global Variables

SWIG's default behaviour for wrapping global variables is to generate wrapper functions which allow the client language to get and set their values. While this is not an ideal solution in a concurrent language, as one could be setting the global variable in parallel, leading to race conditions and unpredictable behaviour, it is the simplest solution. occam-π itself does not normally allow global shared data other than constants. There are plans to address this issue by adding functionality to the SWIG occam-π module which will allow the use of a locking mechanism, such as a semaphore, to make sure that global data in the C world does not get accessed in parallel. The wrapper generator could generate two wrapper PROCs for getting and setting the global variable, as well as a third PROC, which would need to be called by the user to initialise the semaphores at startup. occam-π provides an easy to use, lightweight semaphore library, and it would therefore be easy to manage access to global data from occam-π. If a library is not itself thread safe, the end user of the library currently needs to be aware of the dangers presented by global shared data, if the library contains it.
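Following the FFI conventions of section 2, the get and set wrappers for a C global might look like this sketch (the variable and function names are hypothetical, and no locking is shown):

int counter;    /* a global in the wrapped library */

/* PROC C.counter.get (RESULT INT v) */
void _counter_get(int w[])
{
    *((int *)(w[0])) = counter;
}

/* PROC C.counter.set (VAL INT v) */
void _counter_set(int w[])
{
    counter = w[0];
}

A thread-safe version would bracket both bodies with claims on a semaphore, initialised by a third start-up PROC as described above.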
4. Using SWIG

Using SWIG is very simple. From the command line you can generate wrappers from a C header file by running the command:

$ swig -occampi -module myheader myheader.h

where myheader.h is the C header file that contains the definitions of the functions that you would like to make use of from occam-π. In many cases it is enough to simply point SWIG at a C header file and have it generate a wrapper for occam-π from that header file. However, it is generally better to take a copy of the C header (.h) file which describes the functions and data-structures to be wrapped, and copy that into a SWIG interface (.i) file. So, when wrapping the OpenGL library, gl.h, the OpenGL header file, would be copied to gl.i. SWIG-specific directives can be defined at the head of the file, such as one defining the name of the module, which will determine the names of the generated wrappers. At the top of the interface file we might add the following code:

%module gl
%{
#include <GL/gl.h>
%}
This names the SWIG-generated wrappers "gl" and tells SWIG to include the line #include <GL/gl.h> in the generated wrappers, so that, when compiled, they are able to reference the target's original header files. To have SWIG generate a wrapper for occam-π from the newly created .i interface file, the following command would be run:

$ swig -occampi gl.i
Note that the "-module" command line option is no longer needed, since the module name is specified by the %module gl directive above. As the previous section stated, the occam-π module's default behaviour is to typemap pointers to their single-length corresponding primitives in occam-π. The interface file must specify which pointers are pointers to arrays. This can be done by modifying the name of the variable that is to be typemapped, prefixing it with 'array_'. For example, in the OpenGL interface file, for the function call glCallLists, which takes an int, an enum and an array of void *, one would modify the code from:

void glCallLists(GLsizei n, GLenum type, const GLvoid *lists);

to the following:

void glCallLists(GLsizei n, GLenum type, const GLvoid *array_lists);
When SWIG is run, it generates three files: modulename_wrap.c, modulename_wrap.h and occ_modulename.inc, where modulename is the name that the interface file specifies with the %module directive. The generated C files can then be compiled and linked into a shared library. On Linux one would run the commands:

$ gcc -I. -g -Wall -c -o modulename_wrap.o modulename_wrap.c
$ ld -r -o liboccmodulename.so modulename_wrap.o
The .inc file then needs to be included in the occam-π program with the following directive:

#INCLUDE "occ_modulename.inc"
The previously created shared library is then linked into the KRoC binary, along with the library that has just been wrapped. This command may need to be modified to include the correct library include and linking paths.

$ kroc myprogram.occ -I. -L. -loccmodulename -lwrappedlibrary
SWIG has many other features which are not specific to the occam-π module, designed to aid the interface builder in creating more advanced interfaces between higher-level languages and C. These are fully documented in the SWIG documentation [13].

5. Examples

5.1. A Simple Math Library Demo

This example is written to illustrate how one would use SWIG to interface with C code. A basic knowledge of occam-π and C will help in understanding the example, but is not necessary. In order to build the listed code, KRoC, SWIG and gcc are required. For this example we are using a fictitious floating point library called "calc.c", which contains a range of standard floating point arithmetic functions. The following listing shows the header file for this "calc.c" library.
float add(float a, float b);
float subtract(float a, float b);
float multiply(float a, float b);
float divide(float a, float b);
float square(float a);
An interface file for SWIG was created from the calc.h header file. In order to create the interface file, the calc.h header file was copied to a new file called calc.i. This file was then modified to look like the listing below. The first line tells SWIG what the generated wrapper files are to be called. The next three lines inform SWIG that it should embed the #include "calc.h" statement into the generated C header file. It is in the interface file that any additional information, such as whether pointers to data are arrays or single values, must be included. In this case the only modifications needed are the four lines of code that were added at the start of the file:

%module calc
%{
#include "calc.h"
%}

float add(float a, float b);
float subtract(float a, float b);
float multiply(float a, float b);
float divide(float a, float b);
float square(float a);
The occam-π program calculate.occ was written to demonstrate the use of the C functions:

#USE "course.lib"
#INCLUDE "occ_calc.inc"

PROC main (CHAN BYTE kyb, scr, err)
  INITIAL REAL32 a IS 4.25:
  INITIAL REAL32 b IS 42.01:
  REAL32 result:
  SEQ
    out.string ("SWIG/Occam-pi example for CPA 2005*n*n", 0, scr)
    add (result, a, b)
    out.string ("Result of addition: ", 0, scr)
    out.real32 (result, 0, 3, scr)
    subtract (result, a, b)
    out.string ("*nResult of subtraction: ", 0, scr)
    out.real32 (result, 0, 3, scr)
    multiply (result, a, b)
    out.string ("*nResult of multiplication: ", 0, scr)
    out.real32 (result, 0, 3, scr)
    divide (result, a, b)
    out.string ("*nResult of division: ", 0, scr)
    out.real32 (result, 0, 3, scr)
    square (result, a)
    out.string ("*nResult of squaring: ", 0, scr)
    out.real32 (result, 0, 3, scr)
    out.string ("*n*n", 0, scr)
:
A build script was then written which incorporates SWIG in the build process:
#!/bin/bash
# Generate wrappers
swig -occampi calc.i
# Compile C source file
gcc -c -o calc.o calc.c
# Compile C wrapper file
gcc -c -o calc_wrap.o calc_wrap.c
# Link source and wrapper into shared library
ld -r -o libcalc.so calc.o calc_wrap.o
# Compile occam control code, linking in the newly
# created library and the course library.
kroc calculate.occ -L. -lcalc -lcourse
That is all that is required for a simple set of functions to be wrapped.

5.2. Wrapping OpenGL

The development of the SWIG occam-π module was initially driven by the need for a robust graphics library for occam-π. The OpenGL library was chosen as the target to be wrapped since it is an industry standard which is supported on most modern platforms, often with hardware acceleration. The OpenGL standard itself contains no window management functionality or support for GUI events, so another library must provide the functionality needed to open a window, establish an OpenGL rendering context, and provide an input interface. For the window management, the SDL graphics and user interface library was chosen, due to its simplicity and high level of cross-platform compatibility. A subset of the SDL library was wrapped to allow the user to create and control windows, as well as to create a rendering context for OpenGL.

In order to create the wrapper for OpenGL, the header files gl.h and glu.h were copied to gl.i and glu.i respectively. The newly created .i files had the following code added at their heads (for gl.i and glu.i respectively):

%module gl
%{
#include <GL/gl.h>
%}
...

%module glu
%{
#include <GL/glu.h>
%}
...
A third file was then created, called opengl.i, which linked the previous two modules together into one:

%module opengl
%include gl.i
%include glu.i
Finally, the SWIG occam-π module was run to generate the wrappers:

$ swig -occampi opengl.i
This generated three files: opengl_wrap.c, opengl_wrap.h and occ_opengl.inc. To make use of the OpenGL library, the C wrappers were compiled into a shared library and the
occ_opengl.inc file was included in the occam-π program. The wrappers can be found at http://www.cs.kent.ac.uk/people/rpg/djd20/project.html, along with more detailed instructions on how to generate them. An application using the OpenGL library is described in [17]. Figure 1 is an image of the running application, depicting a cellular automaton written in occam-π.
Figure 1. Lazy simulation of Conway’s Game of Life.
6. Further Work

6.1. occam-π as a Control Language

A number of high-level languages have been used as control or infrastructure languages for legacy C code, allowing the user to combine the ability to express concisely things that would be difficult in C with the speed and power of the existing C code. The existing C code may be in the form of libraries or legacy applications, as well as new code specifically written for an application. High-level languages are able to provide features not often found in lower-level languages, such as pattern matching, higher-order data structures, or, in the case of occam-π, a powerful set of concurrency primitives. Higher-level languages are often considered simpler to maintain than lower-level programming languages, in terms of the infrastructure that one is able to create with them. The additional syntax-enforced structure that control languages provide allows for a cleaner implementation of the entire system. It has been noted in [11] that the overall quality of the code, including that of the faster, lower-level code, was improved through the use of a stricter, more structured control language as the legacy code was adapted to work better with the control infrastructure. Further areas of exploration could involve experimenting with occam-π to help ease the parallelising of scientific code, or exploring the use of occam-π as a control infrastructure for robotics or sensor networks. It would be interesting to work with the Player/Stage [18] project, for example, which is used for modelling robotics and provides a comprehensive staging environment for these.
The authors feel that occam-π's CSP-based model would be very suitable for parallelising existing code, as well as for distributing CPU workload across different machines using upcoming technologies such as KRoC.net [19]. SWIG also allows for easy wrapping of libraries such as MPICH2 [20] or LAM/MPI [21], allowing occam-π to take advantage of industry-standard MPI communications mechanisms on clusters. An MPI library wrapper for occam-π will be available shortly.

While the goal of the occam-π module was initially to support the wrapping of C libraries, there is nothing preventing it from supporting the wrapping of existing C++ code. SWIG is capable of parsing and generating C code and target-language wrappers for most C++ code. A future modification to the SWIG occam-π module would be to add support for wrapping C++ classes. SWIG creates C++ wrappers by generating C wrapper code, which is much simpler for foreign function interfaces to interface with. Wrapping of C++ code is not yet fully supported by the occam-π module for SWIG. Adding C++ support would allow a larger codebase to be used from occam-π. One could potentially wrap C++ classes as mobile processes and call member methods by communicating with them down channels, giving them a slightly object-oriented feel.

6.2. Further Improvements to the SWIG Module

Throughout the text we have mentioned several areas where the occam-π SWIG module could be improved, in order to generate code automatically more often, with less intervention by the user. These areas will be addressed as the occam-π module approaches maturity. Non-blocking and interruptible non-blocking C calls, as mentioned in section 2, are currently not supported. It should be possible, by using SWIG's 'feature' directive, to allow users to mark which functions they want wrapped as blocking ("C"), non-blocking ("B") and/or interruptible non-blocking ("BX") calls. Automatic wrapping of interruptible non-blocking system calls would be especially desirable, as they need to be further wrapped in a reasonable amount of template occam-π code in order to make them useful. The issue of cross-platform compatibility of generated wrappers when using C structs needs to be investigated, and a good default behaviour for the module must be chosen. The effect of name clashes for enums likewise needs to be investigated, and a default behaviour decided upon. Finally, it would be desirable for SWIG to be able to automatically generate code for functions using typeless pointers, so that they can be passed a range of data types. Further investigation of the SWIG macro system, which implements a similar feature for malloc and free, would be needed.

6.3. occam-π on Other Platforms

With the development of the Transterpreter [22] it is possible to run occam-π applications on practically any platform which has a C compiler. A Symbian port of the Transterpreter is close to completion, which will allow it to run on Nokia Series 60 and similar-class devices. The Transterpreter also runs on the LEGO Mindstorms, custom robotics hardware, and standard desktop hardware. The recent release of OpenGL ES [23], a new OpenGL-based graphics standard specifically targeted at mobile devices, would allow one, with SWIG-generated wrappers, to write mobile phone applications or games for such devices.
Currently the Transterpreter runs as a little-endian machine on all platforms, which is a problem when using the FFI on a big-endian machine, as passed parameters have the wrong endianness. It might be possible to instrument the occam-π SWIG module to generate code to byte-swap arguments as they pass in and out of C functions, but we are planning eventually to run the Transterpreter with the same endianness as the host architecture. While this would solve the problem, the changes needed in the compiler to mark up data contained
in the bytecode file have not yet been implemented. Such metadata would enable endianness correction at load time.

Acknowledgements

Many thanks to Matthew Jadud for the valuable feedback he gave us whilst writing this paper, as well as for regularly suggesting good ideas and new avenues to explore.

References

[1] F.R.M. Barnes and P.H. Welch. Communicating Mobile Processes. In I. East, J. Martin, P. Welch, D. Duce, and M. Green, editors, Communicating Process Architectures 2004 (WoTUG-27), volume 62 of Concurrent Systems Engineering, ISSN 1383-7575, pages 201–218, Amsterdam, The Netherlands, September 2004. IOS Press. ISBN: 1-58603-458-8.
[2] David C. Wood. KRoC – Calling C Functions from occam. Technical report, Computing Laboratory, University of Kent, Canterbury, August 1998.
[3] David M. Beazley. SWIG: An easy to use tool for integrating scripting languages with C and C++. 4th Annual Tcl/Tk Workshop, 1996.
[4] C.A.R. Hoare. Communicating Sequential Processes. Prentice-Hall, 1985.
[5] A.W. Roscoe. The Theory and Practice of Concurrency. Prentice-Hall, 1998.
[6] R. Milner, J. Parrow, and D. Walker. A Calculus of Mobile Processes – parts I and II. Journal of Information and Computation, 100:1–77, 1992. Available as technical report: ECS-LFCS-89-85/86, University of Edinburgh, UK.
[7] Christian L. Jacobsen and Matthew C. Jadud. Towards concrete concurrency: occam-pi on the LEGO Mindstorms. SIGCSE'05, St. Louis, Missouri, USA, February 2005.
[8] TUNA Group. Theory Underpinning Nanotech Assemblers, 2005. http://www.cs.york.ac.uk/nature/tuna/.
[9] C.S. Lewis. OCINF – The Occam-C Interface Generation Tool. Technical report, Computing Laboratory, University of Kent, Canterbury, 1996.
[10] John Levine, Tony Mason, and Doug Brown. lex & yacc. O'Reilly, 1992.
[11] David M. Beazley. Feeding a large-scale physics application to Python. 6th International Python Conference, San Jose, California, 1997.
[12] Greg Stein. Python at Google. Google at PyCon 2005, March 2005.
[13] David M. Beazley et al. SWIG-1.3 Documentation. Technical report, University of Chicago, 2005.
[14] P.H. Welch, J. Moores, F.R.M. Barnes, and D.C. Wood. The KRoC Home Page, 2000. Available at: http://www.cs.ukc.ac.uk/projects/ofa/kroc/.
[15] Mason Woo, Jackie Neider, Tom Davis, and Dave Shreiner. OpenGL Programming Guide. Addison Wesley, Reading, Massachusetts, third edition, 1999.
[16] F.R.M. Barnes. Blocking System Calls in KRoC/Linux. In P.H. Welch and A.W.P. Bakkers, editors, Communicating Process Architectures 2000, volume 58 of Concurrent Systems Engineering, pages 155–178, Amsterdam, The Netherlands, September 2000. WoTUG, IOS Press. ISBN: 1-58603-077-9.
[17] A.T. Sampson, P.H. Welch, and F.R.M. Barnes. Lazy cellular automata with communicating processes. In J. Broenink, H. Roebbers, J. Sunter, P. Welch, and D. Wood, editors, Communicating Process Architectures 2005, Amsterdam, The Netherlands, September 2005. IOS Press.
[18] The Player/Stage project, 2005. http://playerstage.sourceforge.net/.
[19] Mario Schweigler, Fred Barnes, and Peter Welch. Flexible, Transparent and Dynamic occam Networking with KRoC.net. In Jan F. Broenink and Gerald H. Hilderink, editors, Communicating Process Architectures 2003, volume 61 of Concurrent Systems Engineering Series, pages 199–224, Amsterdam, The Netherlands, September 2003. IOS Press.
[20] The MPICH2 project home page, 2005. http://www-unix.mcs.anl.gov/mpi/mpich2/index.htm.
[21] Greg Burns, Raja Daoud, and James Vaigl. LAM: An Open Cluster Environment for MPI. In Proceedings of Supercomputing Symposium, pages 379–386, 1994.
[22] C.L. Jacobsen and M.C. Jadud. The Transterpreter: A Transputer Interpreter. In I.R. East, D. Duce, M. Green, J.M.R. Martin, and P.H. Welch, editors, Communicating Process Architectures 2004, volume 62 of Concurrent Systems Engineering Series, pages 99–106. IOS Press, Amsterdam, September 2004. Available from http://www.transterpreter.org/.
[23] OpenGL ES, 2005. http://www.khronos.org/opengles/spec/.
Interfacing C and occam-pi

Fred BARNES
Computing Laboratory, University of Kent, Canterbury, Kent, CT2 7NF, England.
[email protected]

Abstract. This paper describes an extension to the KRoC occam-π system that allows processes programmed in C to participate in occam-π style concurrency. The uses of this are wide-ranging, from providing low-level C processes running concurrently as part of an occam-π network, through to concurrent systems programmed entirely in C. The easily extended API for C processes is based on the traditional Inmos C API, used also by CCSP, extended to cover new features of occam-π. One of the motivations for this work is to ease the development of low-level network communication infrastructures. A library that provides for networking of channel-bundles over TCP/IP networks is presented, in addition to initial performance figures.

Keywords. C, occam-pi, concurrency, processes, networks
Introduction

The occam-π language [1] extends classical occam [2] in numerous ways. Included in these extensions, and supported by the KRoC [3] implementation, are mechanisms that allow occam-π processes to interact with the external environment. Classical occam on the Transputer [4] had a very physical environment — hardware links to other Transputers. In contrast, modern systems support highly dynamic application environments, e.g. file-systems and networking, that occam-π applications should be able to take full advantage of. In most cases, interaction with anything external to an occam-π program requires interfacing with C — since the environments in which KRoC programs run have C as a common interface (e.g. UNIX). There are a few exceptions, however, such as the mechanism that provides low-level hardware I/O access directly from occam-π using "PLACED PORT"s (described in [5]).

The mechanisms currently supported by KRoC for interfacing with C are: simple external C calls [6]; blocking external C calls [7]; and a "user defined channels" mechanism that allows C calls (blocking and non-blocking) to be placed behind channel operations, including direct support for ALTing on completion of external calls. These mechanisms, although mostly adequate, lack the level of flexibility that programmers require. For example, it is not immediately clear how a low-level network communication infrastructure, such as that required by KRoC.net [8], would be implemented using the existing mechanisms. All of these existing mechanisms essentially attach 'dead' C function calls to various occam-π operations. Programming interactions between these calls, which would be required if multiplexing channels over IP links, is difficult and prone to error. On the other hand, most of the infrastructure could be programmed in occam-π, with only the lowest-level I/O inside C functions. However, occam-π does not lend itself to the type of programming we might wish to employ at this level — e.g. deliberate pointer aliasing for efficiency (which we know to be safe, but which cannot be checked by the current occam-π compiler).

The C interface mechanism presented here (CIF) attempts to address these issues by providing a very general framework for the construction of parallel processes and programs
in C. In some respects, this mechanism provides exactly what CCSP [9] provided in terms of support for C programs, but with the added benefits of occam-π (e.g. mobiles and extended synchronisations) and the ability to support mixed occam-π and C process networks. The interface presented to applications is based on the original Inmos and CCSP APIs.

The uses for this are wide-ranging. Applications that require only a limited amount of external interaction can encapsulate this in concurrent C processes, avoiding the overheads of repeated external C calls. The CIF mechanism can also be used to migrate existing C code into occam-π systems — e.g. minimal-effort porting of Linux device-drivers to RMoX [5]. At the far end of the scale, the CIF mechanism can be used to program entire concurrent systems in C. In contrast with some alternative parallel C environments, CIF offers very low overheads and a reasonable level of control. Unlike occam-π, however, the C compiler — typically 'gcc' [10] — does not perform parallel-usage checks, leaving the potential for race-hazard errors. The opportunity for such errors can be minimised by good application design.

Section 1 examines the technical aspects of the C interface, implementation and API. Section 2 presents a specific application of CIF for networking mobile channel-bundles, in addition to a general discussion of potential application areas. Conclusions and initial performance results are presented in section 3, together with plans for future work.

1. Interfacing C and occam-π

The C interface operates by encapsulating C processes such that the KRoC run-time system sees them as ordinary occam-π processes. No changes are required in the KRoC run-time to support these C processes, and no damage is caused to the performance of existing occam-π code. As a consequence, C processes incur a slight overhead each time they interact with the run-time system (switching from a CIF process-context to an occam-π one). This overhead is small, however (less than 100 nanoseconds on an 800 MHz Pentium-3).

C processes are managed through a variety of API calls, the majority of which require a C process context. Some do not, however, including those used for the initial creation of C processes. Creation and execution of the first C process in a system is slightly complicated, requiring the use of the basic C calling mechanism. For example, using the C interface, the standard 'integrate' component (a process with channels in? and out!) could be written as:

void integrate (Process *me, Channel *in, Channel *out)
{
    int v, total = 0;

    for (;;) {
        ChanInInt (in, &v);
        total += v;
        ChanOutInt (out, total);
    }
}
The 'me' parameter given to CIF processes gives the process a handle on itself. The CIF infrastructure always knows which particular C process is executing, however, raising questions about the necessity of this extra (and automatically provided) parameter. The above process shows examples of the 'ChanInInt' and 'ChanOutInt' API calls, whose usage is mostly obvious.

1.1. Starting C Processes

To create an instance of the above 'integrate' process requires a call to either 'ProcAlloc' or 'ProcInit'. To do this from occam-π requires the use of an external C call:
void real_make_integrate (Channel *in, Channel *out, Process **p)
{
    *p = ProcAlloc (integrate, 1024, 2, in, out);
}

void _make_integrate (int *ws)
{
    real_make_integrate ((Channel *)(ws[0]), (Channel *)(ws[1]),
                         (Process **)(ws[2]));
}
that can be called from an occam-π program after declaring with: #PRAGMA EXTERNAL "PROC C.make.integrate (CHAN INT in?, out!, RESULT INT p) = 0"
The usage of this in occam-π is slightly peculiar, since the call will return providing a process address in 'p', but having already consumed its 'in?' and 'out!' parameters. An inline occam-π procedure is provided by CIF that executes the C process, returning only when the C process has terminated — at which point it could be freed1 using 'ProcAllocClean'. For example:

#INCLUDE "cifccsp.inc"

PROC external.integrate (CHAN INT in?, out!)
  INT proc:
  SEQ
    C.make.integrate (in?, out!, proc)
    cifccsp.startprocess (proc)
:

1 There seems little point in cleaning up after this 'integrate' process, since it is not expected to terminate.
Creating and executing C processes inside a CIF process is much simpler. Processes are created in the same way using 'ProcAlloc', but are executed using 'ProcPar' (or one of its variants). It should be noted that the above two C functions, the entry-point '_make_integrate' and 'real_make_integrate', could be made into a single function. Separating them out, however, gives the parameters passed explicit names, instead of using indices into the 'ws' array. The 'real' function can be declared 'inline' to get equivalent performance if desired.
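For illustration, the combined single-function form might look like the following sketch; the local variable names are ours:

void _make_integrate (int *ws)
{
    /* give the raw workspace words explicit names */
    Channel *in  = (Channel *)(ws[0]);
    Channel *out = (Channel *)(ws[1]);
    Process **p  = (Process **)(ws[2]);

    *p = ProcAlloc (integrate, 1024, 2, in, out);
}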
1.2. Masquerading as occam

In order to present themselves as occam-π processes, CIF processes need a valid occam-π process workspace. This is a fixed-size block that contains the state of the CIF process, in addition to the 'magic' workspace fields used for process control. Figure 1 shows the layout of this structure, with word-offsets relative to the 'Process' pointer (equivalent to an occam process's workspace-pointer).

[Figure 1. CIF process workspace: a diagram of the workspace block, with word-offsets running from 12 down to -6. Offsets 4 to 12 hold the CIF process state (c-stack-pointer, c-stack-base, occam-stack, entry-point, Fptr, Bptr, call-succ, endp-link, nparamwords); the offsets around 0 hold the fields used by PAR (temp/par-succ, par-count, par-priority); the negative offsets hold the "magic" workspace fields (iptr, link, priority, pointer/state, timer-link, timer-state).]

The workspace below offset 0 is that normally associated with suspended occam-π processes. These fields are used only when the CIF process is inactive, e.g. blocked on channel communication. The workspace offsets from 0 to 2 are used by CIF processes that have gone parallel and are waiting for their sub-processes to terminate, in the same way that occam-π processes do. The workspace offsets from 4 to 12 hold the CIF-specific process state, including the stored state of the run-time system when a CIF process is executing (held in processor registers for occam-π processes).

When a CIF process is initially created, its entry-point is set to the C function specified in the call to 'ProcAlloc'. The iptr field is set to point at an assembler routine that starts the
process for the first time and handles its shutdown. When a CIF process is blocked, the entry-point field holds the real 'return' address in the user's C code, whilst the iptr field points to an assembler routine that resumes the process. Figure 2 shows the life-cycle of a CIF process.
[Figure 2. CIF process life-cycle: a state diagram. 'ProcAlloc()' creates the CIF process; 'cifccsp.startprocess' or 'ProcPar()' schedules it and suspends the caller. When the CIF process is scheduled, the occam state is saved and the C state restored, and the user C code runs; run-time kernel interactions save the C state and restore the occam state. When the process terminates, the parent is rescheduled (rescheduling if it was the last sub-process), 'cifccsp.startprocess' or 'ProcPar()' returns, and 'ProcAllocClean()' destroys the process.]
When entering the run-time kernel, a CIF process must set up its workspace in the same way that an occam-π process would. Furthermore, it must also use the correct calling convention for the particular entry-point. In-line assembler macros are used to achieve this, containing code very similar to that generated by the KRoC translator, 'tranx86' [11]. The return-address (in iptr) is always to a pre-defined block of assembler, however, that restores the CIF process correctly when it is rescheduled.
As an example, the following shows the pseudocode for the 'ChanInInt()' assembler routine (placed in-line within the C code):

// chan : channel address (in register)
// ptr  : destination pointer (in register)
chan-in-int (chan, ptr):
    push (frame-pointer)
    save-c-state
    restore-occam-state
    wptr[iptr] ⇐ global-resume-point
    jump (_Y_in32, chan, ptr)
local-resume-point:
    pop (frame-pointer)
There is a certain degree of unpleasantness in the actual assembler code, much of it due to subtle differences in the way that different GCC versions handle in-line assembler macros such as these2. The actual kernel call here, '_Y_in32', expects to be called with the channel-address in the EAX register and the destination address in the EBX register. These are handled using register constraints (a GCC feature) in the assembler-C interface.

2 This is not so much the fault of GCC, but rather of certain distributions that included development (and potentially unstable) versions of GCC.

The assembler macros represented by 'save-c-state' and 'restore-occam-state' are implemented respectively with:

frame-pointer ⇐ wptr
wptr[c-stack-pointer] ⇐ stack-pointer
wptr[entry-point] ⇐ local-resume-point

and:

stack-pointer ⇐ wptr[occam-stack]
Fptr ⇐ wptr[fptr]
Bptr ⇐ wptr[bptr]
The first of these saves the globally visible 'cifccsp_wptr' variable (containing the workspace-pointer of the CIF process, 'wptr') in the EBP register, which holds the workspace-pointer of occam-π processes. The current stack pointer is saved inside the CIF workspace, along with the address at which the C process should resume. The second of these macros restores the occam run-time state, consisting of its stack-pointer (which is the actual C stack-pointer of the run-time system) and the current run-queue pointers (held in the ESI and EDI registers). Strictly speaking, the copying of 'cifccsp_wptr' to the EBP register is part of restoring the occam run-time state, but since these macros typically always follow each other, restoring EBP early results in more efficient code.

The actual return address of the CIF process, as seen by the run-time system, is the address of the 'global-resume-point'. This is a linked-in assembler routine that performs, effectively, the inverse of these two macros, before jumping to the stored resume point.

1.3. Providing the API

The application interface and user-visible types are contained in the header file "cifccsp.h". Files containing CIF functions need only include this to access the API. The various functions that make up the API are either preprocessor macros that expand to blocks of in-line
assembler (as shown above), or, for some more complex operations (e.g. 'ProcPar()' and 'ProcAlt()'), actual C functions provided by the CIF library. The API includes the majority of functions available in the original Inmos C API and the CCSP API. Additional functions are provided specifically for new occam-π mechanisms, again a mixture of assembler macros and C functions. These include, for example, 'ProcFork()' to fork a parallel process (following the occam-π 'FORK' mechanism) and 'DMemAlloc()' to dynamically allocate memory. A complete description of the supported API, and some basic examples, can be found on the CIF web-page [12].

In addition to the standard and extended API functions, four additional macros are provided to make external C calls. The first two of these are used to make blocking C calls, i.e. calls that run in a separate thread with the expectation that they will block in an OS system-call. The second pair of macros are used to make ordinary external C calls, but only for certain functions. For each macro pair, there is one that is used to call functions with no arguments, and a second to call functions with an arbitrary number of arguments. For example:

#include <unistd.h>     /* for write() */

void do_write (int fd, const void *buf, size_t count, int *result)
{
    *result = write (fd, buf, count);
}

void my_process (Process *me, Channel *in, Channel *out)
{
    for (;;) {
        void *mobile_array[2];
        int fd, result;

        /* input INT descriptor followed by a MOBILE []BYTE
         * array of data.
         */
        ChanInInt (in, &fd);
        ChanMIn64 (in, mobile_array);

        BLOCKING_CALLN (do_write, fd, mobile_array[0],
                        (size_t)(mobile_array[1]), &result);

        DMemFree (mobile_array[0]);
        ChanOutInt (out, result);
    }
}
This process inputs an integer file-descriptor, followed by a dynamic mobile array, from the 'in' channel, then writes that data to the given file-descriptor (typically a network socket). After the call, the dynamic mobile array is freed, followed by communication of the underlying 'write' result on the 'out' channel. The corresponding occam-π interface for 'my_process' would be:
It should be noted that ordinary CIF routines may not be used inside an external C call. For blocking calls (e.g. ‘do write()’ in the above), code executes with a thread stack, not in the CIF process’s stack. For ordinary (non-blocking) external C calls, code may or may not execute in a thread stack. For example, the ‘BLOCKING CALLN’ in the above could be replaced with:
EXTERNAL_CALLN (do_write, fd, mobile_array[0], (size_t)(mobile_array[1]), &result);
The decision of whether to run 'do_write' in the CIF process's stack, or the occam-π run-time's stack, depends on whether POSIX threads [13] are enabled. Where POSIX threads are not enabled (and the run-time system uses Linux's native 'clone' thread mechanism), the above call will be reduced to just:

do_write (fd, mobile_array[0], (size_t)(mobile_array[1]), &result);
When POSIX threads are enabled, the call is redirected to a linked-in assembler routine, that performs the call on the occam-π run-time’s stack. This stack-switch is actually only required when the POSIX threads implementation stores thread-specific information in the stack, rather than in proessor registers. In this case it is relevant since the ‘write()’ call sets the global ‘errno’ value; however, the standard C library, in the presence of POSIX threads, re-directs this to a thread-specific ‘errno’ (so that concurrent system-calls in different threads do not race on ‘errno’). In cases where the POSIX threads implementation is built to store the thread-identifier in processor registers, locating this thread-specific ‘errno’ is no problem — and can be done safely when code is executing in a C stack. However, if POSIX threads are configured to use the stack to store thread-specific data, making the call from a CIF stack results in a crash (as the ‘pthreads’ code walks off the top of the CIF stack whilst looking for thread-specific data). Linux distributions vary in their handling of this, but it is arguably better to use spare processor registers for holding the thread identifier (avoiding the chance of false-positives in a stack search). 2. Applications CIF has a potentially huge range of application. Generally speaking, it allows the programmer to interface C with occam-π in a naturally compatible way, i.e. channel communication and other CSP-style concurrency mechanisms [14]. Despite the safety and practicality of occam-π, there are some things which are still more desirable to program in C — particularly low-level interface code that typically deals with pointers, which occam-π does not support natively. Explicit pointer types (such as those found in C) create the potential for aliasing and race-hazard errors, requiring care on the programmer’s part. One of the original motivations for CIF was in order to ease implementation of the ‘ENCODE.CHANNEL’ and ‘DECODE.CHANNEL’ compiler built-ins [15]. These transform occam channel communications into address,size pairs, using extended inputs to block the process outputting whilst the resulting address and size are handled. These “protocol converters” are necessary for implementing the KRoC.net infrastructure [8]3 — as well as other similar infrastructures — transforming application-level communications into something suitable for network communication. The standard implementation of ‘ENCODE.CHANNEL’ and ‘DECODE.CHANNEL’ is by means of tree re-writing inside the compiler, necessary because different channel protocols require different handling, for which run-time information is generally not available. Although the mechanism is fully sufficient for its intended uses, making it compatible with new occam-π types, e.g. a ‘MOBILE BARRIER’ [16], is non-trivial and time-consuming. A generic implementation of ‘ENCODE.CHANNEL’ and ‘DECODE.CHANNEL’ in C is relatively simple, provided that information about the structure of the channel-protocol is available. Recent versions of the KRoC system have the option of including this information in generated code. In practice, this is only supported for mobile channel-types, since they pro3
3. KRoC.net will be known as "pony" when released, to avoid confusion with a KRoC that targets .NET.
Figure 3 shows an example of how a generic protocol decoder could be used with an occam-π application.
Figure 3. Generic protocol decoding in C. (Diagram: an application process connected to a cif_decode_channel process, which in turn communicates with a network_iface process over TCP.)
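To give a flavour of why the C version is simpler, the following sketch shows the core loop such a generic decoder might have. This is illustrative only: 'cif_chan_in_addr', 'cif_chan_in_done', 'net_send' and the 'type_desc_t' structure are hypothetical stand-ins, not the actual CIF or KRoC interfaces.

    /* Hypothetical sketch of a generic decode loop (not the real CIF API). */
    typedef struct {
        int n_fields;          /* number of fields in the channel protocol */
        int field_size[8];     /* size in bytes of each field */
    } type_desc_t;

    /* hypothetical run-time hooks, declared here only so the sketch compiles */
    extern void cif_chan_in_addr (void *chan, void **addr);
    extern void cif_chan_in_done (void *chan);
    extern void net_send (int sock, const void *data, int len);

    void generic_decode (void *chan, const type_desc_t *td, int sock)
    {
        for (;;) {
            for (int i = 0; i < td->n_fields; i++) {
                void *addr;
                /* extended input: the sender stays blocked while we use its data */
                cif_chan_in_addr (chan, &addr);
                net_send (sock, addr, td->field_size[i]);
                cif_chan_in_done (chan);   /* release the sender */
            }
        }
    }

Because the type-description is examined at run-time, a single routine of this shape covers all channel protocols, where the compiler built-in needs a separate tree re-write for each.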
The C implementations of these protocol converters are substantially simpler than the compiler built-in versions. In the case of figure 3, the two C routines could be combined to a certain degree, providing a single CIF process that deals with networking of occam-π channels directly. Such a mechanism would be non-transparent, however, unlike KRoC.net, where transparency is key. The following section presents a library that uses CIF processes to provide networked mobile channels. Each networked channel-bundle results in multiple encode/decode processes and the necessary infrastructure to support them.

2.1. Networking Mobile Channels

A simple mobile channel-type networking mechanism for occam-π is currently being developed. In particular, it aims to facilitate the multiple-client/single-server arrangement of communication, for an arbitrary mobile channel-type. For example:

    PROTOCOL REQUEST IS MOBILE []BYTE:
    PROTOCOL RESPONSE IS MOBILE []BYTE:

    CHAN TYPE APP.LINK
      MOBILE RECORD
        CHAN REQUEST req?:
        CHAN RESPONSE resp!:
    :
Figure 4 gives an idea of what such a networked application might look like. New clients can connect to a server and "plug in" a client-end of the desired channel-type, provided they know where it is, i.e. host-name and TCP port. Unlike the KRoC.net infrastructure, this "application link layer" cannot cope with the communication of mobile channel-ends, which could alter the TCP 'wiring'; such communication is beyond its scope in any case. The implementation under development allows the user to specify different behaviours for the networked "virtual mobile-channel". In this example, and in order to operate as we intend, the infrastructure needs to know how communications on 'req?' correspond with those on 'resp!', if at all. To a certain extent, this is related to how the shared client-end 'CLAIM' gets handled. For the network shown in figure 4, application nodes will either compete internally for access to the server, or will delegate that responsibility to the server. Which behaviour is chosen can affect performance significantly. For instance, if each communication on 'req?' is followed by a communication on 'resp!', the client-end semaphore claim can remain local to the application nodes: the server knows that whichever client communicated on 'req?' will be expecting a response on 'resp!', or rather, it knows to which client the communication on 'resp!' should be sent. However, if the application behaviour is such that communications on 'resp!' can happen independently of those on 'req?', the server needs to be aware of client-end claims, so that it knows which client to send data output on 'resp!' to.
Figure 4. Networking any-to-one shared mobile-channels. (Diagram: application nodes node1, node2 and node3 each run a net-iface process, connected over TCP to the net-iface of the server node; together these provide the virtual mobile-channel.)
The primary aims of this link-layer are simplicity and efficiency. To connect to a server using the above protocol, a client will use code such as:

    SHARED APP.LINK! app.cli:
    APP.LINK? app.svr:
    INT result:
    SEQ
      app.cli, app.svr := MOBILE APP.LINK
      all.client.connect (app.svr, "korell:3238", result)
      IF
        result = 0
          SKIP
        -- else STOP
      ...  code using "app.cli"
The call to 'all.client.connect' dynamically spawns the necessary processes to handle communication, connecting to the server and verifying the protocol before returning. It is the server that specifies how communication is handled, for example:

    SHARED APP.LINK! app.cli:
    APP.LINK? app.svr:
    INT result:
    SEQ
      app.cli, app.svr := MOBILE APP.LINK
      all.server.listen (app.cli, "*:3238", "*(0 -> 1)", result)
      IF
        result = 0
          SKIP
        -- else STOP
      ...  code using "app.svr"
The string "*(0 -> 1)" is given as the usage-specification, stating that each communication on channel 0 ('req?') is followed by a communication on channel 1 ('resp!'), repeated indefinitely. These usage-specifications are essentially regular-expression style traces (for that channel-type only) and, like the direction-specifiers, are given from the server's point-of-view. Table 1 gives an overview of the supported specification language, in order of precedence. The usage specification, in addition to controlling the behaviour of client-side 'CLAIM's, is used to build a state-machine (a sketch of this construction follows Table 1). This state machine is used by client and server nodes to keep track of the current trace position. In particular, the infrastructure will not allow a communication to proceed if it is not 'expected'.
Table 1. Supported usage-specification expression syntax (most binding first)

    Syntax    Description
    (X)       sub-expression, where X is an expression               (most binding)
    *X        X repeated zero or more times, where X is an expression
    +X        X repeated one or more times, where X is an expression
    X | Y     X or Y, where X and Y are expressions
    n -> X    n followed by X, where n is a channel index and X is an expression
    n         communication on n, where n is a channel index         (least binding)
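As a sketch of how such a specification can drive a run-time check, the fragment below hard-codes the state machine for the trace "*(0 -> 1)" rather than deriving it from a parsed expression; the names are illustrative and not part of the described library:

    /* State machine for the usage-specification "*(0 -> 1)":
     * state 0: expect a communication on channel 0 (req?)
     * state 1: expect a communication on channel 1 (resp!)
     * Returns 1 if the communication is 'expected', 0 otherwise. */
    typedef struct { int state; } usage_sm_t;

    int usage_allow (usage_sm_t *sm, int channel)
    {
        static const int expect[2] = { 0, 1 };  /* channel expected per state */

        if (channel != expect[sm->state]) {
            return 0;                   /* not expected: refuse to proceed */
        }
        sm->state = (sm->state + 1) % 2;  /* advance along the trace */
        return 1;
    }

A full implementation would compile the expression grammar of Table 1 into such a transition table; the repetition operator '*' simply makes the accepting state loop back to the start.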
The infrastructure comprising this "application link layer" is dynamically created behind the relevant client and server calls. Figure 5 shows the infrastructure created at the server-end for the above 'APP.LINK' channel type.
Figure 5. Server-side channel-type networking infrastructure. (Diagram: the server.process channels 'req?' and 'resp!' pass through encode_channel and decode_channel processes, together with a shutdown_delta; these connect, via the op-channels, to all_server_linkif, which communicates with all_sock_if for the TCP/IP transport. Together these processes form the application link layer.)
The three 'op-channels' emerging from the channel-bundle are specially inserted by the compiler, which generates communications on entry to and exit from a 'CLAIM' block, and when the channel-end is freed by the application (i.e. when it leaves scope). Programming this infrastructure in C makes the handling of the dynamically created 'encode' and 'decode' processes easier. Internally, 'all_server_linkif' ALTs across its input channels and processes them accordingly. The 'all_sock_if' process is responsible for network communication; it operates by waiting in a 'select()' system-call, which allows it to be interrupted without side-effects, before reading or writing data.

The low-level protocol used by the current implementation does not respect occam-π channel semantics. Instead, the individual channels transported behave as buffered channels, where the size of the buffer is determined by the network and operating-system. This will be addressed in the future, once confidence in the basic mechanism has been established, i.e. once CIF is successfully being used to transport occam-π channel-communications over an IP network. The current implementation is reliable, however.

A future implementation will likely use UDP [17] instead of TCP [18], giving the link-layer explicit control over acknowledgements, timeouts and packet re-transmission. Having available a description of channel usage enables some optimisations to be made in the underlying protocol; these are currently being investigated.

3. Conclusions and Future Work

The C interface mechanism presented in this paper has a wide range of uses, from providing low-level C functionality to occam-π applications through to supporting entire CSP-style
applications written in C. Although CIF processes incur additional overheads (saving and restoring the C and occam states), these are not significantly damaging to performance. The 'commstime' benchmark is traditionally used to measure communication overheads in occam-π; it has been rewritten using CIF in order to get a practical measurement of the CIF overheads. On a 3.2 GHz Pentium-4, each loop of the occam-π commstime takes approximately 89 nanoseconds, against 396 nanoseconds for the CIF version. This corresponds to a complete save/restore overhead of 26 nanoseconds, which should be an acceptable overhead for the majority of applications. The current CIF implementation is not especially optimised: it does not, for instance, in-line certain run-time kernel calls, as 'tranx86' optionally does. Such optimisations will gradually appear in future releases of KRoC, as the C interface matures.

The one major drawback of the CIF interface is the inability of the C compiler to guarantee correct usage. This particularly applies to the handling of dynamic mobile types, whose internal reference-counts must be correctly manipulated. Incorrect handling can lead to memory-leaks, deadlocks and/or undefined behaviour (chaos). Despite this, it is hoped that users will find this C interface useful, both for its use with occam-π and as a software-engineering tool for applying CSP concurrency in C applications (e.g. migrating threaded C applications to a more compositional, and predictable/provable, framework).
References

[1] P.H. Welch and F.R.M. Barnes. Communicating mobile processes: introducing occam-pi. In A.E. Abdallah, C.B. Jones, and J.W. Sanders, editors, 25 Years of CSP, volume 3525 of Lecture Notes in Computer Science, pages 175-210. Springer Verlag, April 2005.
[2] Inmos Limited. occam2 Reference Manual. Prentice Hall, 1988. ISBN: 0-13-629312-3.
[3] P.H. Welch, J. Moores, F.R.M. Barnes, and D.C. Wood. The KRoC Home Page, 2000. Available at: http://www.cs.kent.ac.uk/projects/ofa/kroc/.
[4] M.D. May, P.W. Thompson, and P.H. Welch. Networks, Routers and Transputers, volume 32 of Transputer and occam Engineering Series. IOS Press, 1993.
[5] F.R.M. Barnes, C.L. Jacobsen, and B. Vinter. RMoX: a Raw Metal occam Experiment. In J.F. Broenink and G.H. Hilderink, editors, Communicating Process Architectures 2003, WoTUG-26, Concurrent Systems Engineering, ISSN 1383-7575, pages 269-288, Amsterdam, The Netherlands, September 2003. IOS Press. ISBN: 1-58603-381-6.
[6] David C. Wood. KRoC - Calling C Functions from occam. Technical report, Computing Laboratory, University of Kent at Canterbury, August 1998.
[7] F.R.M. Barnes. Blocking System Calls in KRoC/Linux. In P.H. Welch and A.W.P. Bakkers, editors, Communicating Process Architectures, volume 58 of Concurrent Systems Engineering, pages 155-178, Amsterdam, the Netherlands, September 2000. WoTUG, IOS Press. ISBN: 1-58603-077-9.
[8] M. Schweigler, F.R.M. Barnes, and P.H. Welch. Flexible, Transparent and Dynamic occam Networking with KRoC.net. In J.F. Broenink and G.H. Hilderink, editors, Communicating Process Architectures 2003, WoTUG-26, Concurrent Systems Engineering, ISSN 1383-7575, pages 199-224, Amsterdam, The Netherlands, September 2003. IOS Press. ISBN: 1-58603-381-6.
[9] J. Moores. CCSP - a Portable CSP-based Run-time System Supporting C and occam. In B.M. Cook, editor, Architectures, Languages and Techniques for Concurrent Systems, volume 57 of Concurrent Systems Engineering series, pages 147-168, Amsterdam, The Netherlands, April 1999. WoTUG, IOS Press. ISBN: 90-5199-480-X.
[10] Free Software Foundation, Inc. Using the GNU Compiler Collection (GCC), version 3.3.5, 2003. Available at: http://gcc.gnu.org/onlinedocs/gcc-3.3.5/gcc/.
[11] F.R.M. Barnes. tranx86 - an Optimising ETC to IA32 Translator. In Alan Chalmers, Majid Mirmehdi, and Henk Muller, editors, Communicating Process Architectures 2001, volume 59 of Concurrent Systems Engineering, pages 265-282, Amsterdam, The Netherlands, September 2001. WoTUG, IOS Press. ISBN: 1-58603-202-X.
[12] F.R.M. Barnes. The occam-pi C interface, May 2005. Available at: http://www.cs.kent.ac.uk/projects/ofa/kroc/cif.html.
[13] International Standards Organization, IEEE. Information Technology - Portable Operating System Interface (POSIX) - Part 1: System Application Program Interface (API) [C Language], 1996. ISO/IEC 9945-1:1996 (E), IEEE Std. 1003.1-1996 (incorporating ANSI/IEEE Stds. 1003.1-1990, 1003.1b-1993, 1003.1c-1995, and 1003.1i-1995).
[14] C.A.R. Hoare. Communicating Sequential Processes. Prentice-Hall, London, 1985. ISBN: 0-13-153271-5.
[15] Frederick R.M. Barnes. Dynamics and Pragmatics for High Performance Concurrency. PhD thesis, University of Kent, June 2003.
[16] P.H. Welch and F.R.M. Barnes. Mobile Barriers for occam-pi: Semantics, Implementation and Application. In J. Broenink, H. Roebbers, J. Sunter, P. Welch, and D. Wood, editors, Communicating Process Architectures 2005. IOS Press, September 2005.
[17] J.B. Postel. User Datagram Protocol. RFC 768, Internet Engineering Task Force, August 1980.
[18] J.B. Postel. Transmission Control Protocol. RFC 793, Internet Engineering Task Force, September 1981.
Interactive Computing with the Minimum intrusion Grid (MiG)

John Markus BJØRNDALEN a, Otto J. ANSHUS a and Brian VINTER b
a University of Tromsø, N-9037 Tromsø, Norway
b University of Southern Denmark, Campusvej 55, DK-5230 Odense M, Denmark

Abstract. Grid computing is finally starting to provide solutions for capacity computing, that is, problem solving where there is a large number of independent tasks for execution. This paper describes our experiences with using Grid for capability computing, i.e. solving a single task efficiently. The chosen capability application is driving a very large display, which requires enormous processing power due to its huge graphic resolution (7168 x 3072 pixels). Though we use an advanced Grid middleware, the conclusion is that new features are required to provide such coordinated calculations as the present application requires.
1. Introduction

Grid computing promises endless computing power to all fields of science and is already being established as the primary tool for eScience. Applications run on Grid, however, are generally capacity-class applications, i.e. applications that can trivially be divided into a large number of tasks without intercommunication or deadlines for termination. If Grid is really to provide processing power for all kinds of applications, it must also support capability-class applications [1], e.g. by supporting deadlines or intercommunication. In this paper we investigate the current performance of Grid when running a deadline-driven application: the rendering of very large images for a display wall.

A display wall is a high-resolution computer display the size of a wall, with the combined resolution and other graphical capabilities of several common off-the-shelf display cards. The display wall features as a physical wall in a room, with digital video sensors for calibration, gesture recognition, video-recording etc., and with multi-channel sound systems for audio input and output. The size and resolution are typically 230 inches and 22 megapixels, an order of magnitude larger than a high-end 23-inch display. Creating the content for such a large high-resolution display, coordinating the individual computers to deliver coherent images, and moving the individual megapixel tiles to each computer for displaying are all challenges. In this paper we report on the use of a Grid-type computing resource to quickly create content for interactive use.

1.1 Capacity vs. Capability Computing

High Performance Computing (HPC) is typically divided into two groups [1]: capacity computing and capability computing. Capability computing targets solutions that are not feasible on an ordinary computer, e.g. 'Grand Challenge Computing'. Tasks that are capability driven may be divided into three rough groups. The first group of problems requires so much memory that they only fit particular supercomputers. The computer that currently has the most shared memory is the NASA Columbia, with 20 TB of memory addressable from any processor in the machine. The second group of capability-driven applications includes those with such large computational
requirements that starting them on a PC would not make sense, since waiting for faster computers to become available would be quicker overall. An example of this is shown in Figure 1, where a computation task started on a current PC (Year 0) will take 10 years; waiting a year, to benefit from faster computers, drops the total time to wait for a result to just under 8 years, and in fact the best scenario is to wait three years, which allows a final result in just 5.5 years from Year 0.
Figure 1. The start time vs. end time of a fixed calculation when starting the computation immediately, or waiting to benefit from faster computers introduced later. (Chart: remaining work, 100% down to 0%, plotted against years 1 to 10, with one curve per start year, Year 0 to Year 4.)
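The trade-off behind Figure 1 is easy to reproduce. The sketch below assumes the job equals 10 Year-0 machine-years of work and that a machine bought after waiting w years is faster by a fixed annual factor; the 1.5x growth factor is an assumption, since the paper does not state the factor used:

    #include <math.h>
    #include <stdio.h>

    int main (void)
    {
        const double work = 10.0;    /* job size: 10 Year-0 machine-years */
        const double speedup = 1.5;  /* assumed speed growth per year */

        for (int wait = 0; wait <= 6; wait++) {
            /* run the whole job on the machine available after 'wait' years */
            double finish = wait + work / pow (speedup, wait);
            printf ("wait %d years -> result after %.1f years\n", wait, finish);
        }
        return 0;
    }

With these assumptions the total time falls from 10 years to roughly 6 years when waiting three to four years, matching the shape, though not the exact numbers, of Figure 1.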
The third and final application group for capability computing is applications with a deadline. These can be hard deadlines, such as weather forecasting, or soft deadlines, such as applications that are meant for human interaction. An example of such an application is the process of planning cancer radiation therapy [1].

Capacity computing, on the other hand, involves simpler tasks which typically can be executed on a PC; the only challenge is that there is a vast number of them. Such capacity-driven applications are common in science and are thus the driving motivation for many Grid projects. Examples of capacity-driven problems include parameter studies, Monte Carlo simulations and genetic algorithm design.

1.2 Grid Computing vs. Public Resource Computing

Looking for examples of Grid computing, many people first think of Public Resource Computing (PRC). Popular PRC projects include SETI@Home and Folding@Home. A successful platform for PRC is the Berkeley BOINC project [2]. PRC computing is, like Grid computing, very well suited to capacity computing. The main difference between Grid and PRC is that in Grid computing the resources allow multiple users to run arbitrary applications, while in PRC the resources allow a specific server to submit input tasks (called work units) to a specific application on the resource. Grid computing is thus far more flexible than PRC when it comes to diverse use of the resources.
1.3 Grid for Capability Computing

This paper seeks to investigate the possibilities for utilizing Grid for capability computing, rather than only serving the current capacity-computing model. None of the current Grid middlewares support real-time or even deadline scheduling of tasks, which in essence makes capability computing on Grid very hard. It is our hope to expose a set of deficiencies in Grid with respect to capability computing, which can then function as input for a process to introduce the functions required to enable Grid to handle capability computing. The motivating example is graphics rendering for a very large display, described further in Section 2. The Grid we use for the experiments, the Minimum intrusion Grid, is described in Section 3; Section 4 describes the experiments, while Section 5 analyses the results.

2. The Display Wall

The display wall¹ used in the experiments reported on in this paper is located at the University of Tromsø, Norway. The display wall uses back-projection, employing 28 off-the-shelf projectors, each with a resolution of 1024 x 768 pixels. The projectors are physically tiled as a 7 x 4 matrix, giving a combined resolution of 7168 x 3072 pixels (see Figure 8 for an image of the display wall). Separate display cards in separate display hosts drive each projector. The 28 tiles of the display are software-coordinated over a COTS (commodity off-the-shelf) local area network to achieve the appearance of one large, high-resolution, seamless computer display. Each computer driving a projector executes a VNC (Virtual Network Computing) [4] client, fetching a tile from a VNC server running on a remote computer. A 1-Gigabit Ethernet is used for the interconnect.

The compute resources for the display wall are physically located close to the display wall, but they are accessed through a Grid interface (MiG, see below) located on a computer at the University of Southern Denmark. File storage is also handled by the Grid interface, including the physical storage of the files.

3. Minimum intrusion Grid

MiG [5] is a Grid middleware model and implementation designed with previous Grid middleware experiences in mind. In MiG, central issues such as security, scalability, privacy, strong scheduling and fault tolerance are included by design. Other Grid middlewares tend to suffer from problems with at least one of those issues.

The MiG model seeks to be non-intrusive, in the sense that both users and resources should be able to join the Grid with a minimal initial effort and with little or no maintenance required. One way to obtain these features is to keep the required software installation to a functional minimum. The software that is required to run MiG includes only 'need to have' features, while any 'nice to have' features are completely optional. This design philosophy has been used, and reiterated, so stringently that in fact neither users nor resources are required to install any software that is MiG-specific.

Another area where MiG strives to be non-intrusive is the communication with users and resources. Users in general, and resources in particular, cannot be expected to have unrestricted network access in either direction. Therefore the MiG design enforces that all communication with resources and users should use only the most common protocols,
¹ Supported by the project "Advanced Scientific Equipment: Display Wall and Compute Cluster", Norwegian Research Foundation (NFR project no. 155550/420).
known to be allowed even with severely restricted networking configurations. Furthermore, resources should not be forced to run any additional network-listening daemons.
Figure 2. The abstract MiG model. (Diagram: several users and several resources, each communicating only with the GRID layer in the middle; users and resources never communicate directly.)
Figure 2 depicts the way MiG separates the users and resources with a Grid layer, which users and resources securely access through one of a number of MiG servers. The MiG model resembles a classic client-server model, where the clients are represented by either users or resources. The server side is represented by the Grid itself, which in the case of MiG is a set of actual computers, not simply a protocol for communicating between computers. Upon contacting the Grid, any client can request to either upload or download a file. Users can additionally submit a job file to the job queue, while resources can request a job. Most of the actual functionality is located at the MiG servers, where it can be fully maintained and controlled by the MiG developers. Thus, in addition to minimizing the user and resource requirements, the Grid layer simplifies consistent deployment of new versions of the software.

The security infrastructure relies on all entities (users, MiG-servers and resources) being identified by a signed certificate and a private key. The security model is based on sessions, and as such requires no insecure transfers or storage of certificates or proxy-certificates, as is required by some Grid middlewares. Users communicate securely with the server by means of the HTTPS protocol, using certificates for two-sided authentication and authorization. Server communication with the resources is slightly more complicated, as it combines SSH and HTTPS communication to provide secure communication and the ability to remotely clean up after job executions.

MiG jobs are described with mRSL, an acronym for minimal Resource Specification Language. mRSL is similar to other resource specification languages, but keeps the philosophy of minimum intrusion; thus mRSL tries to hide as many aspects of Grid computing as possible from the user. To further hide the complexities of Grid computing, MiG supplies every user with a Grid home directory where input and output files are stored. When a job makes a reference to a file, input or output, the location is simply given relative to the user's home directory, and thus all aspects of storage elements and transfer protocols are completely hidden from the user. The user can access her home directory through a web interface or through a set of simple MiG executables for use with scripting.
Job management and monitoring is very similar to file access, so it is also done either through the web interface or with the MiG executables. Users simply submit jobs to the MiG server, which in turn handles everything from scheduling and job hand-out to input and output file management. An important aspect of this is that a job is not scheduled to a resource before the resource is ready to execute the job. Resources request jobs from the MiG server when they become ready. The MiG server then seeks to schedule a suitable job for execution at the resource. If one is found, the job, with input files, is immediately handed out to the resource. Otherwise the resource is told to wait and request a job again later. Upon completion of a job, the resource hands the result back to the MiG server, which then makes the result available to the user through her home directory.

Even though MiG is a new model, we have already implemented a stable single-server version. It relies on the Apache web server (http://httpd.apache.org/) as a basis for the web interface, and further functionality is handled by a number of cgi-scripts communicating with a local MiG server process. We have decided to implement as much of the project as possible in Python (http://www.python.org/), since it provides a very clear syntax and a high level of abstraction, and it allows rapid development.

4. Experiments

As an example application, we use POV-Ray [5] to render an image at full resolution for the display wall. Rendering the example chess2.pov file at the full 7168 x 3072 resolution required a small change to the POV-Ray control file to render at the correct aspect ratio. We compare the time it takes to render a single image using one cluster node in Tromsø with the time it takes to run on MiG using jobs with 1, 2, 4, 8, 16, 23, and 63 tasks.

For the Grid benchmarks, we split the image into equal-size parts, and submit job description files describing the necessary parameters to each POV-Ray task. For the remainder of the paper, we use the term Job to describe the collection of tasks that produce the necessary fragments of an image to form the full image. Each task is submitted to MiG as a separate MiG job but, to avoid confusion, we will call these tasks. We use the +SR (Start Row) and +ER (End Row) parameters to POV-Ray to limit the number of rows each task should render (the row-splitting arithmetic is sketched at the end of this section). The partial images are downloaded to the client computer and combined. The execution time of the entire operation and of individual tasks are examined and compared to the sequential execution time. Also, we profile the application to examine MiG's limitations for near-interactive use, and to provide design input for MiG.

4.1 Methodology

We measure the time from when the first task is submitted until the last image fragment is received and all fragments are combined into one file. We also measure the time it takes to download the individual image fragments.

To profile the execution of our tasks in MiG, we use MiG's job status reporting facility. MiG provides a log that shows the time when each task is received by the MiG server, when the task is entered in the queue system, when the task is picked up and starts executing, and when the task finishes. This provides us with a tool to examine when each task was executed, how long it executed, and to examine some of the overheads in MiG that may limit scalability.
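The row splitting itself is simple arithmetic. The sketch below distributes the 3072 rows over a number of tasks and prints one +SR/+ER pair per task; the surrounding command line is illustrative only, since the paper does not give its exact POV-Ray invocation:

    #include <stdio.h>

    int main (void)
    {
        const int rows = 3072;    /* full image height */
        const int ntasks = 23;

        for (int t = 0; t < ntasks; t++) {
            /* spread 'rows' as evenly as possible; POV-Ray rows are 1-based */
            int sr = (t * rows) / ntasks + 1;
            int er = ((t + 1) * rows) / ntasks;
            printf ("task %02d: povray +Ichess2.pov +W7168 +H3072 +SR%d +ER%d\n",
                    t, sr, er);
        }
        return 0;
    }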
4.2 Hardware

Twenty-three of the MiG nodes have POV-Ray 3.6 installed, which limits the scalability we can study in this paper. The 23 nodes are part of a larger 40-node Rocks [6] cluster, consisting of Dell Precision Workstation 370 nodes, each with a 3.2-GHz Intel Pentium 4 Prescott EMT64 and 2 GB RAM. The MiG client is a Dell Precision Workstation 360 with a 3.0-GHz Intel P4 Northwood processor and 2 GB RAM, running Debian Linux (Debian unstable). The client machine is connected to the department's 100-Mbit backbone Ethernet. The Rocks cluster is located in Tromsø, the MiG server is located in Denmark, and the client machine is located in Tromsø.

4.3 Skewed System Clocks in the Experiments

Note that during our experiments, we found that the system clock in the MiG server was about 47 seconds slow (tasks were registered in the log 47 seconds before the client node sent them). Furthermore, the "finished" timestamps in the MiG status logs use the local clocks of the compute nodes; subtracting the "executing" timestamp from the "finished" timestamp therefore produces an incorrect execution time for the experiments. To correct for this, we subtracted 47 seconds from the "finished" timestamp when calculating the execution times measured with the MiG status log facility. This did not significantly alter our results or conclusions, except in one place: in Figure 6, instead of observing that the first task finished a few seconds after the last task started executing (which was the original conclusion), we now have 3 tasks that finish before the last task starts executing. Thus, we are not guaranteed that we are in fact using 23 nodes in the cluster. The results have been verified through multiple runs that showed similar behavior, but only one set of experiments is reported on in this paper.
Figure 3. Total execution time from when a client submits the first task until all results are received and combined into an image. The "sequential" time is the execution time of POV-Ray rendering the entire image on one of the cluster nodes without using MiG. (Bar chart: seconds, 0 to 9000, for the sequential run and for jobs of 1, 2, 4, 8, 16, 23 and 63 tasks.)
5. Results

Rendering time on one of our cluster nodes without using MiG is 8169.37 seconds, or 2 hours, 16 minutes, 19.37 seconds. A speedup of 23 would reduce this to 355.17 seconds, or 5 minutes and 55.17 seconds.

Figure 3 shows the total execution time of creating an image using MiG, including the time to submit the tasks, retrieve the partial image files and combine the fragments. The graph shows that the minimum job execution time, at 23 tasks, is 754 seconds, or 12 minutes 34 seconds. This is fast enough to get a result image while attending a meeting. The speedup, however, is 10.84, which is well below linear.

5.1 Task Submission and Result Retrieval Overheads

Part of the overhead when rendering with MiG is the time necessary to send the tasks to the MiG server (task submission overhead), and the time to retrieve the image fragments to the client and combine them. Figure 4 shows the total execution time broken down into MiG task submission time, time spent waiting for and downloading results, and the time spent combining the image fragments into a single image.
Figure 4. Component times. (Stacked bar chart: seconds, 0 to 3000, for jobs of 63, 23, 16, 8 and 4 tasks, split into submit, compute-and-receive, and combine components.)
Submitting tasks takes about 1 second per task. Compared to the total execution time of the rendering job, this is relatively small. The only overhead visible in the graphs is the task submission at 63 tasks, where task submission takes 67 seconds out of a total of 1057 seconds.

The overhead of transferring files from the MiG server to the client is partially hidden by transferring the files as soon as the client discovers them. The client retrieves a list of available files from the MiG server every 10 seconds. When the client discovers that an image fragment is available, it immediately downloads the fragment. With 23 tasks, the fragments are typically transferred in 4 to 5 seconds. Since the tasks do not finish at the same time, the download time of the last file is usually the only visible download overhead contributing to the total execution time for the job.
Combining the image fragments takes from 0.4 to 0.8 seconds for all variations of the number of tasks. For up to 63 tasks, this does not add enough overhead to significantly impact the scalability.

In total, for the 23-fragment job that has the lowest execution time, the reported time shown in Figure 4 is 754 seconds. Removing the measured overheads (including the download time of the last fragment), we get a MiG execution time of 728 seconds, which would have given us a speedup of 11.22.

5.2 MiG Internal Overheads

The major limitation to scalability in the system is clearly not the client side or the communication between the client and the server, so we investigate the internal overhead using the MiG job status log to profile the tasks. Figure 5 shows a timeline of the rendering job split into 4 tasks. Shown are:
• receive time: the time it takes before a task is received (measured from when the first task was received);
• queue overhead: the time it takes from when the task was received until it is entered into the job queue;
• queue time: the time the task spends in the queue until it is executed; and
• execution time: the execution time of the task.
Figure 5. Timeline for 4 tasks. Queue time is not significant here. The timeline shows that the execution time of each part of the image varies by more than a factor of 2.5. (Chart: bars for task000 to task003, showing receive, queue overhead, queue and exectime phases, against seconds from start, 0 to 3000.)
The figure shows that the irregular execution time of POV-Ray on different parts of the image, which is a result of the varying computational complexity of each image fragment, is a major limitation on the scalability of the application. This pattern continues to be a problem at all problem sizes that we have studied. Note that the other times have no impact in this example.
Figure 6. Timeline for 23 tasks. The first three tasks finished before the last three tasks started. Task execution time varies by a factor of 3. Tasks also spend a significant amount of time in the job queue before execution starts. (Chart: bars for task000 to task022, showing receive, queue overhead, queue and exectime phases, against seconds from start, 0 to 700.)
At 23 tasks, equal to the number of worker hosts, the minimum execution time is 144 seconds, the average 354 seconds, and the maximum 460 seconds. In Figure 6, we see that most tasks wait a minute or more before they start executing, contributing significantly to the computation time. What is worse, the task that starts latest is also the task that has the longest execution time; the most complex image fragment was scheduled last. This suggests a number of remedies to improve the scaling of the application: better task execution time balancing, better scheduling, and reduced overhead for the MiG queues.

Better task execution time balancing would bring the execution time down to around 354 seconds for each of the 23 tasks, reducing the total time by 106 seconds, but both this and better scheduling would require knowledge of the computational complexity of each row of the image. We may be able to approximate this by first rendering the image at a lower resolution and recording the computation time of sections of the image, but we have not experimented with this (a sketch of such cost-based partitioning follows Figure 7). Reducing queue overhead in MiG would also improve the scheduling: the overhead of task submission and queue time for the task that finished last was in total 181 seconds: it took 20 seconds until the task was received by MiG, 1 second to queue the task, and the task waited 160 seconds in the queue before it started executing.

A simple method of improving load balancing that often works in parallel applications is to divide the job into more tasks than there are workers. With 63 tasks, however, the execution time increases rather than decreases. Figure 7 shows that although the range of execution times is smaller than for 23 tasks (72 to 221 seconds vs. 144 to 460 for 23 tasks), new tasks are not immediately picked up by MiG workers, so much of the potential load balancing improvement is wasted in queue overheads.
Figure 7. Timeline for 63 tasks. There are more tasks than workers, but we fail to benefit from the potentially better load balancing, in part because the workers do not pick up new jobs immediately. (Chart: bars for task000 to task060, showing receive, queue overhead, queue and exectime phases, against seconds from start, 0 to 1000.)
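The cost-based partitioning suggested in Section 5.2 could work along the following lines. This is a sketch under the assumption that per-row cost estimates are available from a low-resolution pre-pass; it is not part of MiG:

    /* Split 'rows' image rows into 'n' contiguous chunks of roughly equal
     * estimated cost. cost[r] is the estimated render cost of row r, e.g.
     * measured from a low-resolution pre-render. start[k] receives the
     * first row of chunk k; chunk k covers rows start[k]..start[k+1]-1,
     * with start[n] taken to be 'rows'. */
    void partition_rows (const double *cost, int rows, int n, int *start)
    {
        double total = 0.0;
        for (int r = 0; r < rows; r++) {
            total += cost[r];
        }

        double acc = 0.0;
        int chunk = 1;
        start[0] = 0;
        for (int r = 0; r < rows && chunk < n; r++) {
            acc += cost[r];
            if (acc >= (total * chunk) / n) {
                start[chunk++] = r + 1;   /* next chunk begins after row r */
            }
        }
        while (chunk < n) {
            start[chunk++] = rows;        /* degenerate empty chunks, if any */
        }
    }

Each chunk would then be submitted as one task, giving every task approximately the same expected execution time instead of the same number of rows.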
5.3 MiG "Null Call"

To examine the internal overhead of the job management system, we measured a simple job that only executed the Unix "date" command. The execution time of this operation was recorded as 40 seconds, while the queue overhead was 1 second and the queue time was 0 seconds.

6. Discussion

Using remote resources for rendering introduces two main overheads that need to be coped with: the time it takes to submit the rendering job to the remote server, and the time it takes to retrieve the results back to the node that requested the rendering.

Task submission overhead did not significantly impact our jobs, but for jobs with a higher number of tasks, or with shorter tasks, the overhead should be reduced. One way this can be improved is by introducing a "multi-task" job, which allows the user to submit multiple MiG jobs using a single job submission request.

The result retrieval, in this case retrieving image fragments, is not a significant problem in the experiments we have run. The main reason for this is that the tasks do not finish at the same time. For the 23-task job, the first task is finished after 145 seconds, while the last task finishes after 641 seconds (shown as, respectively, the lowermost and topmost tasks in Figure 6), which allows nearly all of the fragments to be downloaded before the last task finishes.

Figure 6 provides an explanation for the large difference in task completion times. The first problem is seen in the upper left section of the figure, where the worker nodes are idle for a long time before the tasks start executing. Task submission overhead contributes to this, but the main reason for the idle time is that tasks spend a long time in the queue
before a worker host picks them up. MiG workers pick up tasks by polling the server, with a configurable number of seconds between attempts. Reducing the polling period would reduce the idle time somewhat, at the cost of increasing the load on the server. Alternatively, a signalling mechanism would be useful, where a worker could keep a connection to a local MiG server open and wait for "task ready" signals. A signalling mechanism would also be useful for clients: our client code has to poll the server to determine when tasks have completed and when files are ready to be retrieved.

At the lower left side of Figure 6, we see the result of the load imbalance problem. This problem can be solved by a better partitioning of the image. To do this, we need to know where the computationally intensive parts of the image are. This, however, depends on the rendered image, and is not trivially known before attempting to render it. To give an idea of the computational task at hand, Figure 8 shows a photograph of the final picture on the display wall.
Figure 8. The completed image. For size comparison, note that the portable computer on the table is a 17-inch notebook and that the display wall is about 3 meters from the table.
7. Related Work

Grid computing has been hyped for a number of years [8]. The most common Grid project is Globus [9], which has changed a lot since its beginning and is now moving towards a simple Web-service model. A fork from Globus revision 2 is NorduGrid ARC, which sticks more closely to the original Grid computing model [10]. The only Grid to include PRC, other than MiG, is Condor [11], an advanced model that includes dynamic process migration.
In [12], an approach to using a Grid to support interactive visualization, based on the Grid Visualization Kernel (GVK), is described. Two models for using a Grid for interactive visualization are identified:
(i) local visualization: compute the results on the Grid, download the results, and compute the visualizations locally;
(ii) remote visualization: compute the results and the visualization on the Grid, and download the finished visualization.

gSlick [13] is a Grid-enabled collaborative scientific visualization environment based on the Globus toolkit (GTK). While GVK and gSlick are built on top of, or as extensions to, existing Grid models, MiG is a Grid. We have not built any environments specifically for visualization, but are using MiG directly to move data and tasks to the compute resources, and to fetch the results afterwards. Thereby, we can provide performance data for the MiG Grid model without extra overhead. It also demonstrates the flexibility of the MiG approach.

Large-format digital displays have traditionally been used for high-end applications in science and engineering. Examples include the CAVE [14], the InfinityWall [15], Princeton's scalable display wall [16], the MIT DataWall, Stanford's Interactive Mural [17], and the PowerWall at the University of Minnesota.

8. Conclusions

In this paper we have introduced the problem of using a Grid for capability computing, and have run the rendering of a large image as an experiment. The overall conclusion is that while performance improvements can be obtained using the Grid computing model, a number of features still need to be added or improved for Grid to represent a true alternative for capability computing. First of all, a more convenient interface for retrieving results as soon as they are ready must be implemented. Secondly, a strong prioritization mechanism must be implemented, to ensure that deadline-driven applications are scheduled before capacity-driven applications. Finally, it is evident that better tools for monitoring the timing and performance of an application are needed to perform the kinds of experiments that we execute in this paper.

9. Future Work

Follow-up on this work will be twofold: first we will add prioritization to MiG and introduce a simpler interface for having results delivered when ready. Once these improvements have been implemented, we will rerun the experiments to verify their efficiency. Secondly, we will look into supporting capability computing that requires intercommunication. Since no existing Grid software supports intercommunication, except a special Globus version of MPI, we will seek to introduce different intercommunication mechanisms and test capability computing on Grid with true intercommunication support.
References

[1] A. Natrajan, M. Humphrey, and A. Grimshaw. Capacity and Capability Computing using Legion. ICCS 2001, May 28-30, LNCS 2073, p. 273, 2001.
[2] The BOINC project. http://boinc.berkeley.edu/
[3] EGSnrc. http://www.irs.inms.nrc.ca/inms/irs/EGSnrc/EGSnrc.html
[4] T. Richardson, Q. Stafford-Fraser, K.R. Wood, and A. Hopper. Virtual Network Computing. IEEE Internet Computing, Vol. 2, No. 1, pp. 33-38, Jan/Feb 1998.
[5] B. Vinter. The Architecture of the Minimum Intrusion Grid. Proceedings of CPA 2005, pp. 189-201, IOS Press, September 2005.
[6] POV-Ray. www.povray.org
[7] Rocks. http://www.rocksclusters.org/
[8] M. Bernhardt. Grid Computing - Hype or Tripe? GRID Today, December 6, 2004, vol. 3, no. 49.
[9] I. Foster and C. Kesselman. Globus: A Metacomputing Infrastructure Toolkit. Intl. J. Supercomputer Applications, 11(2):115-128, 1997.
[10] O. Smirnova et al. The NorduGrid Architecture and Middleware for Scientific Applications. ICCS 2003, LNCS 2657, p. 264, P.M.A. Sloot et al. (Eds.), Springer-Verlag Berlin Heidelberg, 2003.
[11] D. Thain, T. Tannenbaum, and M. Livny. Condor and the Grid. In Fran Berman, Anthony J.G. Hey, and Geoffrey Fox, editors, Grid Computing: Making The Global Infrastructure a Reality, John Wiley, 2003. ISBN: 0-470-85319-0.
[12] D. Kranzlmüller, H. Rosmanith, P. Heinzlreiter, and M. Polak. Interactive Virtual Reality on the Grid. In Proceedings of the Eighth IEEE International Symposium on Distributed Simulation and Real-Time Applications (DS-RT'04), pp. 152-158, October 2004.
[13] E.C. Wyatt and P. O'Leary. Interactive Poster: Grid-Enabled Collaborative Scientific Visualization Environment. In IEEE Visualization: Proceedings of the Conference on Visualization, 2004.
[14] C. Cruz-Neira, D.J. Sandin, and T.A. DeFanti. Surround-screen projection-based virtual reality: the design and implementation of the CAVE. In Proceedings of Computer Graphics, Anaheim, CA, USA, pp. 135-142. New York, NY, USA: ACM, 1993.
[15] M. Czernuszenko, D. Pape, D. Sandin, T. DeFanti, L. Dawe, and M. Brown. The ImmersaDesk and InfinityWall Projection-Based Virtual Reality Displays. In Computer Graphics, May 1997.
[16] R. Samanta, J. Zheng, T. Funkhouser, Kai Li, and Jaswinder Pal Singh. Load Balancing for Multi-Projector Rendering Systems. SIGGRAPH/Eurographics Workshop on Graphics Hardware, Los Angeles, California, August 1999.
[17] G. Humphreys and P. Hanrahan. A Distributed Graphics System for Large Tiled Displays. Proceedings of IEEE Visualization '99.
High Level Modeling of Channel-Based Asynchronous Circuits Using Verilog

Arash SAIFHASHEMI a,1 and Peter A. BEEREL b
a PhD Candidate, University of Southern California, EE Department, Systems Division
b Associate Professor, University of Southern California, EE Department, Systems Division

1 Corresponding Author: Arash Saifhashemi, Department of Electrical Engineering - Systems, EEB 218, Hughes Aircraft Electrical Engineering Building, 3740 McClintock Ave, Los Angeles, CA, 90089, USA; Email: [email protected]
Abstract. In this paper we describe a method for modeling channel-based asynchronous circuits using Verilog HDL. We suggest a method to model CSP-like channels in Verilog HDL. This method also describes nonlinear pipelines and high-level channel timing properties, such as forward and backward latencies, minimum cycle time, and slack. Using Verilog enables us to describe the circuit at many levels of abstraction and to use the commercially available CAD tools.

Keywords. CSP, Verilog, Asynchronous Circuits, Nonlinear Pipelines
Introduction

A digital circuit is an implementation of a concurrent algorithm [2]. Digital circuits consist of a set of modules connected via ports for exchanging data. A port is an electrical net whose logical value is read and/or updated. A complex module may consist of a collection of simpler modules working in parallel, whose ports are connected by wires. At a higher level of abstraction, however, complex modules can often be modeled as processes, which communicate with other complex modules through communication channels [1] that are implemented with a set of ports and wires and a handshaking protocol for communication.

This paper focuses on the modeling and simulation of a large class of asynchronous circuits which use CSP (Communicating Sequential Processes [1, 2]) channels for communication. In particular, any digital circuit that does not use a central clock for synchronization is called asynchronous. In channel-based asynchronous circuits, both synchronization and data communication among modules are implemented via channel communication. In fact, communication actions on channels are synchronous, i.e. the read action in a receiving module is synchronized with the write action of the sending module. This synchronization removes the need for a global clock and is the foundation of a number of demonstrated benefits in low power and high performance [9, 10]. Unfortunately, asynchronous circuits will not gain a large foothold in industry until asynchronous design is fully supported by a commercial-quality CAD flow. In this paper, we present a method to enhance Verilog with CSP constructs in order to use commercially available CAD tools for developing channel-based asynchronous circuits.

To model the high-level behavior of channel-based asynchronous designs, designers typically use some form of CSP language that has two essential features: channel-based communication and fine-grained concurrency. The former makes data exchange between module models an abstract action. The latter allows one to define nested sequential and concurrent threads in a model. Thus, a practical Hardware Description Language (HDL) for
high-level asynchronous design should implement the above two constructs. Furthermore, similar to many standard HDLs, the following features are highly desired:
• Support for various levels of abstraction: There should be constructs that describe the module at both high and low levels (e.g., transistor-level) of abstraction. This feature enables modeling designs at mixed levels of abstraction, which provides incremental verification as units are decomposed into lower levels, and enables arrayed units (e.g., memory banks) to be modeled at high levels of abstraction to decrease simulation run-time. Also, this enables mitered co-simulation of two levels of abstraction, in which the lower-level implementation can be verified against the higher-level, golden specification.

• Support for synchronous circuits: A VLSI chip might consist of both synchronous and asynchronous circuits [13]. The design flow is considerably less complex if a single language can describe both, so that the entire design can be simulated using a single tool. Consequently, modeling clocked units should be straightforward.

• Support for timing: Modeling timing and delays is important at both high and low levels of the design. Early performance verification and analysis of the high-level architecture using estimated delays is critical to avoid costly redesign later in the design cycle. Later, it is essential to verify the performance of the more detailed model of the implementation using accurate back-annotated delays.

• Availability of supporting CAD tools: In addition to the availability of powerful simulation engines, hooks to debugging platforms (e.g., GUI-based waveform viewers), synthesis tools, and timing-analyzers should also be available. There are many powerful CAD tools available in these areas, but in most cases they only support standard hardware design languages such as VHDL and Verilog.

• A standard syntax: The circuit description should be easily exchangeable among a comprehensive set of CAD tools. Using a non-standard syntax causes simulation of the circuit to become tool-dependent.
Several languages have been used for designing asynchronous circuits in the literature. They can be divided into the following categories:
• A new language. A new language with a syntax similar to CSP is created, for which a simulator is developed. Two examples of this method are LARD and Tangram [3, 14]. Simulation of the language is dependent on the academic tool, and tool support/maintenance is often quite limited. Also, the new language usually does not support modeling the circuit at the lower levels of abstraction, such as the transistor and logic levels.

• Using software programming languages like C++ and Java. For example, Java has been enhanced with a new library, JCSP [4, 12], in order to support CSP constructs in Java. This approach does not support timing and mixed-level simulation. Furthermore, integration with commercially-available CAD tools is challenging.

• Using standard hardware design languages such as Verilog and VHDL. Because of the popularity of VHDL and Verilog among hardware designers, and also the wide availability of commercial CAD tool support, several approaches have been made to enhance these languages so as to model channel-based asynchronous circuits. Previous works by Frankild et al. [5], Renaudin et al. [6], and Myers [7] employ VHDL to design asynchronous circuits. In VHDL, however, implementing fine-grained concurrency is cumbersome, because modeling the synchronization between VHDL processes requires extra signals. Moreover, in some cases [6], the native design language must be translated into VHDL. This makes the debugging phase more cumbersome, because the code that is debugged in the debugger is different from the original code: signals may have been added, and port names may have been changed, forcing the designer to know the details of the conversion. T. Bjerregaard et al. [18] propose using SystemC to model asynchronous circuits and have created a library to support CSP channels. Similar to VHDL, implementing fine-grained concurrency in SystemC is cumbersome. Also, modeling timing is not addressed in their approach. In [8], Verilog together with its PLI (Programming Language Interface) has been proposed. Using Verilog, modeling fine-grained concurrency is easily achieved using the built-in fork/join constructs of Verilog. The PLI is used for interfacing Verilog with pre-compiled C routines at simulation time. Using the PLI, however, has two disadvantages: first, the PLI interface significantly slows down the simulation speed; and secondly, the C code must be recompiled for each system environment, making compatibility across different system environments a challenge. Lastly, in the Verilog-PLI approach, handshaking variables are shared among all channels of a module. Unfortunately, this scheme breaks down for systems such as non-linear pipelined circuits, in which multiple channels of a module are simultaneously active.
This paper addresses the problems of the Verilog-PLI method [8] and makes CSP constructs available in Verilog without the above limitations. Besides the basic channel implementation, we propose to model the performance of asynchronous pipelines by modeling the forward/backward latency and minimum cycle time of channels as timing parameters of our high-level abstract model. It is worth mentioning that using Verilog also enables one to migrate to SystemVerilog [19], which commercial CAD tools are beginning to support. Since SystemVerilog is a superset of Verilog, our method will be directly applicable to future CAD tools that support SystemVerilog.

The remainder of this paper is organized as follows. In Section 1, relevant background on CSP and non-linear pipelines is presented. Section 2 explains the details of implementing SEND/RECEIVE macros in Verilog. Section 3 describes the modeling of asynchronous pipelines using these macros. Section 4 describes further improvements to the method, such as monitoring the channels' status, implementing channels that reshuffle the handshaking protocol, and supporting mixed-mode simulation. Section 5 presents a summary and conclusions.

1. Background

In this section we briefly describe relevant background on CSP communication actions and asynchronous nonlinear pipelines [9].
1.1 Communicating Sequential Processes

Circuits are described using concurrent processes. A process is a sequence of atomic or composite actions. In CSP, a process P that is composed of atomic actions s1, s2, …, sn, repeated forever, is written as follows:

    P = *[s1 ; s2 ; … ; sn]
Usually, processes do not share variables, but communicate via ports which are connected by channels. Each port is either an input or an output port. A communication action consists of either sending a variable to a port or receiving a variable from a port. Suppose we have a process S that has an output port out and a process R that has an input port in, and suppose S.out is connected to R.in via channel C. The send action is defined to be an event in S that outputs a variable to the out port and suspends S until R executes a receive action. Likewise, a receive action in R is defined to be an event that suspends R until a new value is put on channel C. At this point, R resumes and reads the value. The completion of send in S is said to coincide with the completion of receive in R. In CSP notation, sending the value of the variable v on the port out is denoted as follows:

    (out!v)
Receiving the value v from the port in is denoted as: (in?v)
Another construct, called a probe, has also been defined, by which a process p1 can determine whether another process p2 is suspended on the shared channel C, waiting for a communication action in p1 [2]. Using the probe, a process can avoid deadlock by not waiting to receive from a channel on which no other process has a pending write. The probe also enables the modeling of arbitration [2]. For two processes P and Q, the notation P||Q is used to denote that P and Q run concurrently, whereas the notation P;Q denotes that Q is executed after P. We can also use a combination of these operators, for example, in the following:

  *[(p1 || (p2;p3) || p4) ; p5]
process p1 will be executed in parallel with p4, and at the same time (p2; p3) will be executed. Finally, once all of p1, p2, p3, and p4 finish, p5 will be executed. Such nesting of sequential and concurrent processes at deeper levels enables the modeling of fine-grained concurrency.

1.2 Asynchronous pipelines

A channel in an asynchronous circuit is physically implemented by a bundle of wires between a sender and a receiver, together with a handshaking protocol that implements the send and receive actions and their synchronization. Various protocols and pipeline stage designs have been developed that trade off robustness, area, power, and performance. Channels are point-to-point, from an output port of one process to an input port of another process. Linear pipelines consist of a set of neighboring stages, each with one input and one output port. We can describe a stage of a simple linear pipeline that receives value x from its left port and sends f(x) on its right port as follows:

  Buffer = *[in?x ; y=f(x) ; out!y]
For this stage we define the following performance metrics [9], under the assumption that data is always ready at the in port and that a receiver is always ready to accept data from the out port:
1. Forward latency: the minimum time between consecutive receive at the in port and send at the out port.
2. Backward latency: the minimum time between consecutive send at the out port and receive at the in port.
3. Minimum cycle time: the minimum time between two consecutive receive actions (or between two consecutive send actions). In the above example, the minimum cycle time equals the minimum value of the sum of the execution times of the receive, the f(x) calculation, and the send.

A pipeline is said to be non-linear if a pipeline stage has multiple input and/or output channels. A pipeline stage is said to be a fork if it can send to multiple stages, and a join if it has input channels from multiple predecessor stages. Furthermore, complex non-linear pipelines support conditional communication on all channels, i.e., depending on the value read from a certain control input channel, the module either reads from or writes to different channels. Asynchronous circuits are often implemented using fine-grained non-linear pipelines to increase parallelism. In this paper, we show how to model the performance properties of such a pipeline at a high level of abstraction. In particular, in high-level performance models, it is necessary to estimate the amount of internal pipelining within a process. This pipelining is characterized as slack and is associated with the ports of pipeline stages as follows:

1. Input port slack: the maximum number of receive actions that can be performed at the input port without performing any send action at the output port(s) of the pipeline stage.
2. Output port slack: the maximum number of send actions that can be performed at the output port without performing any receive action at the input port(s) of the pipeline stage.

We adopt the modeling philosophy that the performance of a pipeline stage can be adequately modeled by specifying the forward and backward latencies and the minimum cycle time, together with the associated slack, at the input and output ports. In Section 3, we will describe how to capture and model the slack in our Verilog models.

2. Communication Actions in Verilog

Our approach to modeling communication actions in Verilog is to create two macros, SEND and RECEIVE, that model a hidden concrete implementation of the handshaking protocol [2] for synchronization. The challenge we faced is the limited syntax and semantics of Verilog macros: Verilog macros support only textual substitution with parameters, and do not support creating new variables via parameter concatenation, as is available in software languages like C. Among the different protocols, the bundled data handshaking protocol [10] has the lowest simulation overhead: alongside the bundle of data signals, the sender has an extra output signal called req, and the receiver has an extra output signal called ack (an input to the sender). When the sender wants to send data, it raises req to indicate that the new data is valid. Then, it waits for the receiver to receive the data. Once the data is received, the receiver informs the sender by asserting the ack signal. Finally, both req and ack are reset to zero. The behavior of this protocol in Communicating Hardware Processes (CHP) notation [2], a hardware variant of CSP, is as follows:
  Sender:   *[ req=1 || d7...d0 = produced data ; [ack] ; req=0 ; [~ack] ]
  Receiver: *[ [req] ; buffer = d7...d0 ; ack=1 ; [~req] ; ack=0 ]
Here, [x] means wait until the value of x becomes true.

Figure 1: Bundled Data Protocol (sender and receiver connected by the data bits d0...d7, with req running from sender to receiver and ack returning)
Our goal is to use Verilog macros to hide the handshaking details and make the actions abstract. First, we hide the extra handshaking signals, i.e., req and ack. This can be achieved by having two extra bits on each port: bit 0 is used for the req signal, and bit 1 is used for the ack signal. A naive Verilog implementation of the bundled data protocol, using those bits, is shown in Figure 2. Suppose that the out port of the Sender module is connected to the in port of the Receiver module.

module Sender(out);
  output [7+2:0] out;
  reg [7:0] d;
  always begin
    // Produce d
    out[9:2] = d;
    out[0] = 1;
    wait (out[1]==1);
    out[0] = 0;
    wait (out[1]==0);
  end
endmodule
module Receiver(in);
  input [7+2:0] in;
  reg [7:0] d;
  always begin
    wait (in[0]==1);
    d = in[9:2];
    in[1] = 1'b1;
    wait (in[0]==0);
    in[1] = 0; // error
    // Consume d
  end
endmodule
Figure 2: Verilog Implementation of Sender and Receiver Modules (a Naive Version)
The above code, however, does not work, because in the receiver module we are writing to an input port, which is illegal in Verilog. Changing the port type to inout does not solve the problem, because writing to an inout port from a sequential block of Verilog, i.e., an always block, is also illegal. Our solution is to use the force keyword, which allows us to change the value of any net or variable, including the reg type that Verilog uses in sequential blocks. Our goal is to hide the handshaking protocol using macros such as `SEND(port, value) and `RECEIVE(port, value). In the above code, one issue is that the width of the in and out ports must be available to the macros, so that the macros can assign the eight significant bits of in to d (d=in[9:2]). Rather than passing this width as an extra parameter to the macros, we use a dummy signal, as shown in Figure 3:
module Sender(out);
  output [2+7:0] out;
  reg [7:0] d;
  always begin
    // Produce d
    force out = {d,out[1],1'b1};
    wait (out[1]==1);
    force out[0] = 0;
    wait (out[1]==0);
  end
endmodule
module Receiver(in);
  input [2+7:0] in;
  reg [7:0] d;
  reg [1:0] dummy;
  always begin
    wait (in[0]==1);
    {d,dummy} = in;
    force in[1] = 1;
    wait (in[0]==0);
    force in[1] = 0;
    // Consume d
  end
endmodule
Figure 3: Correct Version of Sender and Receiver
The dummy signal is two bits wide, so the variable d is always assigned the actual data bits of in, i.e., bit 2 and higher. The first two bits - the handshaking variables - are thrown away. Notice that the dummy signal is written, but never read. We make the above code more efficient by moving the resetting phase of the handshaking protocol to the beginning of the communication action, thereby removing one wait statement. In this way, the Sender both resets the ack signal of the Receiver (bit 1) and sets its own req signal (bit 0). Similarly, the Receiver reads the data, and then both resets the req signal of the Sender and sets its own ack signal.

module Sender(out);
  output [2+7:0] out;
  reg [7:0] d;
  always begin
    // Produce d
    force out = {d,2'b01};
    wait (out[1]==1);
  end
endmodule
module Receiver(in);
  input [2+7:0] in;
  reg [7:0] d;
  reg [1:0] dummy;
  always begin
    wait (in[0]==1);
    {d,dummy} = in;
    force in[1:0] = 2'b10;
    // Consume d
  end
endmodule
Figure 4: Optimized Version of Sender and Receiver
The final definitions of the two macros for SEND and RECEIVE are as follows:

`define SEND(_port_,_value_) begin\
  force _port_={_value_,2'b01};\
  wait (_port_[1]==1'b1);\
end

`define RECEIVE(_port_,_value_) begin\
  wait (_port_[0]==1'b1);\
  {_value_,dummy}=_port_;\
  force _port_[1:0]=2'b10;\
end
We also need to hide the dummy signal definition and the input/output port definitions:

`define USES_CHANNEL         reg [1:0] dummy;
`define OUTPORT(port,width)  output [width+1:0] port;
`define INPORT(port,width)   input [width+1:0] port;
`define CHANNEL(c,width)     wire [width+1:0] c;
The designer should use the `USES_CHANNEL macro in modules that incorporate the communication protocol. The INPORT/OUTPORT and CHANNEL macros add two more bits to each port for handshaking. The final versions of Sender and Receiver, together with a top module that instantiates them, are shown in Figure 5.

module Sender(out);
  `OUTPORT(out,8)
  `USES_CHANNEL
  reg [7:0] d;
  always begin
    // Produce d
    `SEND(out,d)
  end
endmodule

module Receiver(in);
  `INPORT(in,8)
  `USES_CHANNEL
  reg [7:0] d;
  always begin
    `RECEIVE(in,d)
    // Consume d
  end
endmodule

module top;
  `CHANNEL(ch,8)
  Sender p(ch);
  Receiver c(ch);
endmodule
Figure 5: Final Version of Sender and Receiver
As shown in Figure 5, the SEND/RECEIVE macros are used at the same level of abstraction as their counterparts in CSP.

3. Modeling Performance

In this section we show how to incorporate pipeline performance properties such as forward/backward latency, minimum cycle time, and slack in our model.

3.1 Timing

The buffer described in Section 1.2 can be modeled in Verilog as shown in Figure 6. FL and BL are the forward and backward latencies as defined in Section 1.2. The slack of this buffer is 1 on both ports. Now, consider a simple two-input function, func, with the following description in CHP notation:

  func: *[ A?a || B?b ; c=func(a,b) ; C!c ]
Also, consider a pipelined implementation of the above function that has slack 3 on A, 2 on B, and 2 on C. We can model the behavior of the pipeline using the circuit shown in Figure 7.
module buffer (left, right);
  parameter width = 8;
  parameter FL = 5;
  parameter BL = 10;
  `USES_CHANNEL
  `INPORT(left,width)
  `OUTPORT(right,width)
  reg [width-1:0] data;
  always begin
    `RECEIVE(left, data)
    #FL;
    `SEND(right, data)
    #BL;
  end
endmodule
Figure 6: Modeling a Simple Buffer

Figure 7: A pipelined two-input function with slacks 3 and 2 on the inputs and 2 on the output (buffer chains ASlack and BSlack feed ports Aprt and Bprt of func; a buffer chain CSlack drains port Cprt)
The pipeline can have different forward/backward latencies on each port. For high-level modeling, it is desirable to make these parameters (forward/backward latency and slack) abstract, and to avoid having to explicitly instantiate extra buffers on each port. We propose to enhance the INPORT/OUTPORT macros so that they include all these parameters, i.e., all slack buffers are instantiated automatically through the INPORT/OUTPORT macros in module func. Suppose we have the information about the ports given in Table 1:

Table 1: Information about ports of the pipeline

  Port   Width   Slack   Forward Latency   Backward Latency
  A      8       3       5                 10
  B      8       2       10                5
  C      8       2       15                15
We can define a new macro, INPUT, as follows:

  `INPUT(slackModName, portName, portAlias, width, slack, FL, BL)
In a similar way, the OUTPUT macro can be defined for output ports. The INPUT macro instantiates a module called slackModule and sets the values of the forward/backward latencies and the slack through parameter passing. It also connects slackModule to the func module. Figure 7 shows how we use this macro. The details of the INPUT and OUTPUT macros are given in Figure 8.
module adder (A, B, C);
  parameter width = 8;
  `USES_CHANNEL
  reg [width-1:0] abuf, bbuf, cbuf;
  `INPUT(ASlack,A,aPort,8,3,5,10)
  `INPUT(BSlack,B,bPort,8,2,10,5)
  `OUTPUT(CSlack,C,cPort,8,2,15,15)
  always begin
    fork
      `RECEIVE(aPort, abuf)
      `RECEIVE(bPort, bbuf)
    join
    cbuf = func(abuf,bbuf);
    `SEND(cPort, cbuf)
  end
endmodule
Figure 7: A Pipelined Implementation of func
`define INPUT(slackName,portName,portAlias,\
               width,slack,FL,BL)\
  input [width+1:0] portName;\
  wire [width+1:0] portAlias;\
  slackModule #(width,slack,FL,BL)\
    slackName(portName,portAlias);

`define OUTPUT(slackName,portName,portAlias,\
                width,slack,FL,BL)\
  output [width+1:0] portName;\
  wire [width+1:0] portAlias;\
  slackModule #(width,slack,FL,BL)\
    slackName(portAlias,portName);

Figure 8: INPUT/OUTPUT Macros for a Pipeline

module slackModule (left, right);
  parameter width = 8;
  parameter SLACK = 5;
  parameter FL = 0;
  parameter BL = 0;
  `USES_CHANNEL
  `INPORT(left,width)
  `OUTPORT(right,width)
  wire [width+1:0] im [SLACK-1:0];
  genvar i;
  generate
    for (i=0; i<SLACK; i=i+1) begin:stage
      if (i==0)
        buffer #(width,FL,BL) buff(left, im[0]);
      else if (i==SLACK-1)
        buffer #(width,FL,BL) buff(im[i-1],right);
      else
        buffer #(width,FL,BL) buff(im[i-1],im[i]);
    end
  endgenerate
endmodule
Figure 9: Description of slackModule
slackModule contains a chain of connected buffers as defined in Figure 6. In this module, the values of the parameters specify how many buffers should be instantiated and what their latencies are. Although one can consider more efficient ways to implement this module to make simulation faster, for simplicity we use the generate loop construct of Verilog-2001 [15], as shown in Figure 9.
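The paper leaves such optimisations open. As one illustration, here is our own sketch (not from the paper, and only valid under the assumption that just the externally visible send/receive behaviour matters): the chain of buffer instances can be flattened into a single module holding a circular buffer, so each token costs O(1) state updates instead of one copy per stage. Note that this only approximates the timing of the chain, since FL and BL are each charged once per token rather than once per stage:

module fastSlack (left, right);
  parameter width = 8;
  parameter SLACK = 5;
  parameter FL = 0;
  parameter BL = 0;
  `USES_CHANNEL
  `INPORT(left,width)
  `OUTPORT(right,width)
  reg [width-1:0] fifo [0:SLACK-1];   // circular buffer replacing the stage chain
  integer head, tail, count;
  reg [width-1:0] rbuf, sbuf;
  initial begin head = 0; tail = 0; count = 0; end
  // Receiving process: accept a new token whenever a slot is free.
  always begin
    wait (count < SLACK);
    `RECEIVE(left, rbuf)
    #FL;                              // forward latency, charged once per token
    fifo[tail] = rbuf;
    tail = (tail + 1) % SLACK;
    count = count + 1;
  end
  // Sending process: emit a token whenever one is available.
  always begin
    wait (count > 0);
    sbuf = fifo[head];
    `SEND(right, sbuf)
    head = (head + 1) % SLACK;
    count = count - 1;
    #BL;                              // backward latency before freeing the slot
  end
endmodule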
4. Further Improvements

In this section, we first show how to monitor the status of channels and ports in a GUI debugging tool. Next, we explain how to model non-pipelined circuits that have a reshuffled handshaking protocol. Then, we explain how further improvements can be obtained using a converter program. Finally, we consider some extensions.

4.1 Debugging

As described before, since the SEND and RECEIVE actions are blocking, a circuit might deadlock when some module executes a RECEIVE action on a port for which no other module will execute a SEND action. Thus, for any language that implements the CSP communication actions, it is essential to make the debugging of channels straightforward. One important issue is that the designer should be able to see the status of each port and channel while simulating the circuit. This can be achieved by monitoring the handshaking signals, i.e., the extra two bits on each port. This will not work, however, for input ports, since the RECEIVE action is passive, i.e., it does not change the value of the handshaking signals until it actually receives a value. To overcome this limitation, we use one more extra bit per port (i.e., a total of 3 extra bits per port): whenever a RECEIVE starts executing, it sets the third bit, and when it finishes, it resets the third bit. An example of monitoring the status of channels in a GUI is shown in Figure 10, where we used mnemonic definitions sensitive to the last three bits of the channels. Another use of this extra bit is that the designer can employ it to implement the probe, which was defined in Section 1.1. Other researchers at Fulcrum Microsystems have independently identified a similar strategy.
Figure 10: Debugging in a GUI
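As a concrete illustration of the probe idea, the handshake bits can be tested without blocking. The paper gives no code for this, so the following macros are our sketch with assumed names: a receiver can test req (bit 0) directly, and, assuming the 3-extra-bit port layout just described, a sender can test the receive-pending bit:

// Hypothetical probe macros (assumed names; not part of the paper's library).
// For a receiver: has a sender already raised req (bit 0) on this port?
`define PROBE_IN(_port_) (_port_[0]===1'b1)
// For a sender, assuming the third status bit of Section 4.1 (bit 2,
// set while a RECEIVE is pending): is a receiver already waiting?
`define PROBE_OUT(_port_) (_port_[2]===1'b1)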
4.2 Reshuffling the Handshaking Protocols

In the bundled data protocol that we described in Section 2, output ports are active (i.e., they initiate the communication) and input ports are passive [10]. It is possible to consider other handshaking protocols and/or to reshuffle the handshaking protocol of the input and output ports. One example is a protocol that shuffles the handshaking of the receive at the input with the send at the output. In fact the CALL process [2], which essentially implements slack-0 communication and is common in non-pipelined systems [14], can be implemented by such a reshuffling of the handshaking protocol. This module can be described as follows:
  CALL: *[ [l_req]; buffer = dn...d0;
           r_req=1 || dn...d0 = buffer; [r_ack]; r_req=0; [~r_ack];
           l_ack=1; [~l_req]; l_ack=0 ]
The above process splits the handshaking of the left port into two parts - the initial [l_req] and data read, and the final l_ack handshake - and sends the value to the right port in between. It is straightforward to split our RECEIVE macro into two new macros, RECEIVE_PART1 and RECEIVE_PART2, and use them as in Figure 11. After its data has been read from the left port, the module connected to the left port remains suspended until the send on the right port is done.

module CALL (left, right);
  parameter width = 8;
  parameter FL = 5;
  parameter BL = 10;
  `USES_CHANNEL
  `INPORT(left,width)
  `OUTPORT(right,width)
  reg [width-1:0] buffer;
  always begin
    `RECEIVE_PART1(left, buffer)
    `SEND(right, buffer)
    `RECEIVE_PART2(left, buffer)
  end
endmodule

Figure 11: Implementation of a CALL Process
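The paper does not list RECEIVE_PART1 and RECEIVE_PART2; one straightforward split of the RECEIVE macro of Section 2, our reconstruction, is:

// PART1: wait for req and latch the data, but do not acknowledge -
// the sender stays suspended inside its SEND.
`define RECEIVE_PART1(_port_,_value_) begin\
  wait (_port_[0]==1'b1);\
  {_value_,dummy}=_port_;\
end

// PART2: complete the handshake (set ack, clear req), releasing the
// sender. The _value_ argument is unused here; it is kept only so the
// macro matches the call sites in Figure 11.
`define RECEIVE_PART2(_port_,_value_) begin\
  force _port_[1:0]=2'b10;\
end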
4.3 Improving the Performance via Conversion

As described in previous sections, there is some simulation overhead associated with these macros. Simulation speed can be further improved by using an external converter program to expand these macros into the bundled data protocol, with the extra handshaking signals explicitly declared as regs and manipulated without the use of force. The resulting Verilog code is more cumbersome to debug, so we recommend applying this converter on a unit-by-unit basis, after each unit's correctness has been verified, and only where simulation speed is of great importance. (A sketch of what such converted code might look like is given after Figure 12.)

4.4 Extensions

Some of the extensions to CSP channels that have been implemented in other methods can be considered here as well. For example, although not commonly used for asynchronous circuit design, communication actions can be extended via the PLI to use TCP/IP ports and communicate across an entire network of computers, possibly for distributing the simulation load over a compute farm. For mixed-mode simulation, it is straightforward to use our method in a module that communicates both at high and at low levels of abstraction. For example, in Figure 12, the mixed_buf module communicates at a high level on its left side, but uses explicit handshaking at a low level on its right side. It can therefore interface a circuit described at a high level to one described at a low level, such as the transistor level.
module mixed_buf (left, right_data, right_req, right_ack);
  parameter width = 8;
  `USES_CHANNEL
  `INPORT(left,width)
  output reg [width-1:0] right_data;
  output reg right_req;
  input right_ack;
  reg [width-1:0] buffer;
  always begin
    // Left side: high-level communication
    `RECEIVE(left, buffer)   // receive from left
    // Right side: explicit handshaking
    right_data = buffer;
    right_req = 1;
    wait (right_ack==1);
    right_req = 0;
    wait (right_ack==0);
  end
endmodule
Figure 12: Modeling a Simple Buffer for Mixed-Mode Simulation, Interfacing Modules Described at High and Low Levels of Abstraction
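To give a flavour of what the converter of Section 4.3 might emit, here is a hand-converted Sender, purely our illustration: the converter program, its naming conventions and its exact output are not described in the paper. The handshake bits become ordinary module ports driven as regs, so no force is required; the price is the more verbose four-phase code of Section 2:

module Sender(out_data, out_req, out_ack);
  output reg [7:0] out_data;  // data bundle (no packed handshake bits)
  output reg       out_req;   // sender-owned request
  input            out_ack;   // receiver-owned acknowledge
  reg [7:0] d;
  initial out_req = 0;
  always begin
    // Produce d
    out_data = d;
    out_req = 1;              // flag new data as valid
    wait (out_ack==1);        // receiver has taken the data
    out_req = 0;
    wait (out_ack==0);        // handshake reset complete
  end
endmodule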
5. Conclusions

This paper demonstrates that standard Verilog HDL can be used to model channel-based asynchronous circuits at a high level of abstraction, helping to bridge the gap between the pervasiveness of commercially standard tools and the advantages of asynchronous implementations. In particular, we described how to model the CSP communication primitives using Verilog macros, and applied this to modeling asynchronous non-linear pipelines and their typical performance characteristics, such as forward/backward latencies and slack. We then showed how to monitor the status of channels during debugging, and provided an implementation of reshuffled handshaking protocols for non-pipelined designs. Finally, we showed how to perform mixed-mode simulation, interfacing a module described at a high level to a module described at a low level. Compared to the state of the art, this work is the first to support abstract non-linear pipelines in Verilog.
References
[1] C.A.R. Hoare, Communicating Sequential Processes, Prentice Hall International, 1985.
[2] A.J. Martin, Synthesis of Asynchronous VLSI Circuits, Caltech-CS-TR-93-28, California Institute of Technology.
[3] P. Endecott and S. Furber, Modelling and Simulation of Asynchronous Systems using LARD, http://www.cs.man.ac.uk/amulet/projects/lard/
[4] D. Nellans, V. Krishna Kadaru, and E. Brunvand, ASIM - An Asynchronous Architectural Level Simulator, GLSVLSI'04.
[5] S. Frankild and J. Sparsø, Channel Abstraction and Statement Level Concurrency in VHDL++, Danish Maritime Institute & Technical University of Denmark.
[6] M. Renaudin, P. Vivet, and F. Robin, A Design Framework for Asynchronous/Synchronous Circuits Based on CHP to HDL Translation, Async'99.
[7] C.J. Myers, Asynchronous Circuit Design, John Wiley and Sons, July 2001.
[8] A. Saifhashemi and H. Pedram, Verilog HDL, Powered by PLI: a Suitable Framework for Describing and Modeling Asynchronous Circuits at All Levels of Abstraction, Proceedings of the 40th DAC.
[9] A.M. Lines, Pipelined Asynchronous Circuits, M.S. Thesis, Caltech, 1995.
[10] S. Hauck, Asynchronous Design Methodologies: An Overview, Proceedings of the IEEE, Vol. 83, No. 1, pp. 69-93, January 1995.
[11] K. van Berkel, R. Burgess, J. Kessels, M. Roncken, F. Schalij, and A. Peeters, Asynchronous Circuits for Low Power: A DCC Error Corrector, IEEE Design & Test of Computers, 11(2):22-32, Summer 1994.
[12] P.D. Austin and P.H. Welch, Communicating Sequential Processes for Java - JCSP, http://www.cs.ukc.ac.uk/projects/ofa/jcsp/
[13] P.A. Beerel, J. Cortadella, and A. Kondratyev, Bridging the Gap between Asynchronous Design and Designers, VLSID'04.
[14] K. van Berkel, J. Kessels, M. Roncken, R. Saeijs, and F. Schalij, The VLSI-programming language Tangram and its translation into handshake circuits, EDAC'91.
[15] IEEE Std. 1364-2001, IEEE Standard for Verilog Hardware Description Language, 2001.
[16] Mentor Graphics, http://www.model.com/
[17] Cadence Design Systems, http://www.cadence.com/
[18] T. Bjerregaard, S. Mahadevan, and J. Sparsø, A Channel Library for Asynchronous Circuit Design Supporting Mixed-Mode Modeling, PATMOS'04.
[19] http://www.systemverilog.org
Mobile Barriers for occam-pi: Semantics, Implementation and Application

Peter WELCH and Fred BARNES
Computing Laboratory, University of Kent, Canterbury, Kent, CT2 7NF, England.
{P.H.Welch, F.R.M.Barnes}@kent.ac.uk

Abstract. This paper introduces a safe language binding for CSP multiway events (barriers — both static and mobile) that has been built into occam-π (an extension of the classical occam language with dynamic parallelism, mobile processes and mobile channels). Barriers provide a simple way of synchronising multiple processes and are the fundamental control mechanism underlying both CSP (Communicating Sequential Processes) and BSP (Bulk Synchronous Parallelism). Formal semantics (through modelling in classical CSP), implementation details and early performance benchmarks (16 nanoseconds per process per barrier synchronisation on a 3.2 GHz Pentium IV) are presented, along with some likely directions for future research. Applications are outlined for the fine-grained modelling of dynamic systems, where barriers are used for maintaining simulation time and the phased execution of time steps, coordinating safe and desired patterns of communication between millions (and more) of processes. This work forms part of our TUNA project, investigating emergent properties in large dynamic systems (nanite assemblies).

Keywords. Barriers, events, processes, mobility, occam-pi, CSP, pi-calculus
Introduction This paper describes the addition of multiway barrier synchronisation to the KRoC [1,2] occam-π system. occam-π [3,4,5] extends classical occam [6], including mechanisms for data, channel and process mobility (taken from Milner’s π-calculus [7]), dynamic parallelism, extended rendezvous and process priority. Static barriers for occam-π were first reported in [8] — for completeness, some of that information is repeated here. The barriers presented in this paper may also be mobile, allowing them to be communicated to newly forked processes, as well as between processes. This lets us experiment with novel modelling techniques that closely follow real-world systems — such as the merging of biological organelles represented by clusters of parallel processes controlled through synchronisation on internal barriers. Barriers are a synchronisation primitive on which parallel processes enrol, synchronise and resign. When a process synchronises on a barrier, it is blocked until all other processes enrolled on the barrier have also synchronised. Once the barrier is completed, all blocked processes are rescheduled. The semantics of barrier synchronisation are exactly those of an event in Communicating Sequential Processes (CSP) [9,10]. However, the dynamics of barrier mobility, construction, enrolment and resignation have no immediate counterparts in terms of CSP events. Nevertheless, we present a full CSP formalisation, modelling each barrier as a process rather than, directly, as an event. This occam-π language binding is safe in the sense that enrolment and resignation are automatically coordinated and that a process can synchronise on a barrier if, and only if, it is enrolled.
Barriers are used for a variety of purposes and with varying granularity in parallel programs. For example, the Bulk Synchronous Parallelism (BSP) [11] model describes parallel processes that run (mostly) independently on separate processors, but periodically synchronise on a single global barrier to exchange data. Such models will be supported by the networked version of occam-π (not yet released [12]). In this paper, we are concerned with much finer levels of control, with processes enrolling, synchronising and resigning dynamically on multiple barriers. We are particularly interested in applying these mechanisms to the design and implementation of highly dynamic massively parallel systems, such as those being investigated in our TUNA [13] project.

A previous implementation of barriers in KRoC [14] provided user-defined abstract data types [15]. ‘BARRIER’ variables could be declared, explicitly flagged as shared (through the use of compiler directives which overrode parallel usage checks) and operated via a number of procedure-calls (‘initialise.barrier’, ‘synchronise.barrier’, etc.) implemented in ETC (Extended Transputer Code [16]) assembler. This was functional and fast, but the programmer had to ensure that barriers were initialised correctly, that only enrolled processes could synchronise or resign and that barriers were not assigned or communicated (the semantics of which were undefined).

In the language binding presented here, barriers may be declared static or mobile, in line with occam-π data types and channels. Static barriers are fixed — they may not be communicated or assigned. Mobile barriers may be communicated, assigned and cloned (so that the source variable of the communication or assignment does not lose it). All barriers may be renamed through parameter passing and abbreviation. Any process that declares a static barrier, or constructs a mobile one, is automatically enrolled on that barrier. Only processes enrolled on a barrier can synchronise on it. If an enrolled process itself goes parallel, there is a default constraint that at most one of its sub-processes inherits the enrolment — this is checked at compile time. However, an enrolled process may override this constraint by explicitly enrolling all parallel sub-processes on specific barrier(s) at the relevant PAR. An enrolled process automatically resigns from its barrier if it loses it (through communication, exit from scope of the variable referencing it or, sometimes, on termination), so that other processes may continue to synchronise on it. A process automatically enrols on a barrier if it gains it (through communication). An enrolled process may temporarily resign from a barrier — crucial for the ‘lazy’ execution of simulation processes that are idle for long periods of ‘time’ (see [17]). This is expressed through an explicit RESIGN block, with automatic re-enrolment at the end of the block. Often, such re-enrolment needs careful synchronisation and language support is proposed.

These occam-π barriers are more general than those of BSP (an occam-π system can contain any number of barriers, with some processes ignoring them and some registered with many). They are also more general than CSP events, incorporating ideas of mobility from the π-calculus and higher-level design patterns for process resignation. On the other hand, they are also, currently, less general than those of CSP (occam-π processes must commit to barrier synchronisation — which cannot, therefore, be used as a guard in a choice or ALT).
The language binding and informal semantics for occam-π barriers are covered in Section 1. A formal semantics is given in Section 2. An implementation outline is in Section 3, together with some early benchmarking results. Sample applications follow in Section 4. Finally, Section 5 summarises and discusses future work.
1. Language Binding and Informal Semantics

1.1. Static Barriers: Declaration

Barriers are declared in the same way as ordinary channels and variables, with the process following the declaration automatically enrolled. For example:

  BARRIER b:                -- declaration of ‘b’
  ...  process(es) synchronising on ‘b’

1.2. Static Barriers: Parallel Enrolment

To enrol all sub-processes on a barrier, the parallel composition must explicitly declare this. For example:

  PAR ENROLL b
    P (b)                   -- all these
    Q (b)                   -- sub-processes
    R (b)                   -- are enrolled on ‘b’
A replicated parallel may also enrol its sub-processes:

  PAR i = 0 FOR n ENROLL b
    worker (i, b)           -- all enrolled on ‘b’

In network diagrams, we represent a barrier as a ‘bar’, connected to all enrolled processes. Figure 1 shows the process network for the above ‘worker’ fragment.

Figure 1. Barrier synchronised worker processes (worker (0), worker (1), ..., worker (n−1), all connected to barrier ‘b’)
1.3. Static Barriers: Synchronisation

Barrier synchronisation is expressed through a new SYNC primitive. For example:

  PROC worker (VAL INT id, BARRIER x)
    SEQ
      ...  phase 0 computation
      SYNC x
      ...  phase 1 computation
  :

The execution of the above SYNC line blocks until all other processes enrolled on the barrier similarly SYNC. It divides the global (parallel) computation of the system in Figure 1 into two time-separated phases (or ‘supersteps’, in BSP terminology). Note that if a process has a barrier parameter, any invocation must have passed a barrier argument on which the invoking process was enrolled. Hence, we (and the compiler) may assume that a process with a barrier parameter is enrolled on whatever barrier is passed and that it is legal to synchronise.
1.4. Static Barriers: Parallel Non-Enrolment

An enrolled process that goes parallel in the normal way (i.e. without explicitly enrolling its barrier) passes its enrolment to at most one of its sub-processes. For example:

  PROC worker (VAL INT id, BARRIER x)
    PAR
      A ()                  -- not enrolled
      B (x)                 -- enrolled on ‘x’
      C ()                  -- not enrolled
  :

For a normal non-enrolling PAR such as this, exactly which (if any) of its sub-processes takes the enrolment does not matter. The compiler checks that no more than one enrols.

1.5. Static Barriers: Resign Blocks

An enrolled process may temporarily resign from a barrier through the use of a RESIGN-block. For example:

  PROC worker (VAL INT id, BARRIER x)
    SEQ
      P (x)                 -- enrolled on ‘x’
      RESIGN x
        A ()                -- not enrolled on (and cannot reference) ‘x’
      R (x)                 -- enrolled on ‘x’
  :

Whilst executing process ‘P(x)’, this ‘worker’ must synchronise on the barrier (or it will block other enrolled processes that are synchronising). However, whilst executing the RESIGN-block ‘A()’, it plays no part in the barrier and other enrolled processes can synchronise amongst themselves freely. After the RESIGN-block, it is back in the barrier. Note that some care must be taken to avoid non-determinism after exit from a RESIGN-block, since the precise time of that exit and consequent re-enrolment in the barrier is scheduling dependent. This is considered further in Section 1.9 below.

1.6. Static Barriers: Usage Rules

Process enrolment on a barrier is determined by the scope of its declaration, PAR ENROLL compositions and RESIGN blocks. The following usage rules for barriers are enforced by compiler checks:

• a process may reference a barrier (i.e. SYNC, RESIGN or pass to a procedure) if, and only if, it is enrolled on that barrier;
• at most one component process of a non-enrolling PAR remains enrolled on any barrier for which the PAR, as a whole, is enrolled;
• an individual barrier may be passed to only one parameter of a PROC. Strict anti-aliasing laws apply: different barrier names always refer to different barriers.

1.7. Parallel Enrolment with Multiple Barriers

We may enrol multiple barriers in the same PAR construct. In the following example, the ‘*.timer’ processes control the timing of ‘process.a’ and ‘process.b’ by synchronising on their respective barriers regularly (at ‘long’ or ‘short’ time intervals). Processes
‘process.a’ and ‘process.b’ (which may resign from either or both time-slicing controls from time to time) also use a private barrier, ‘b’, to synchronise between themselves:

  BARRIER long, short:
  PAR ENROLL long, short
    PAR
      long.timer (long)
      short.timer (short)
    BARRIER b:
    PAR ENROLL b, long, short
      process.a (long, short, b)
      process.b (long, short, b)

1.8. Parallel Enrolment, Termination and Auto-Resignation

Each component process of a PAR ENROLL construct resigns from its so-enrolled barrier(s) just before it terminates, apart from the last one to finish. This means that all components do not have to terminate in the same barrier cycle to avoid deadlock (as would be the case if occam-π barriers were direct reflections of CSP multiway events). Consider the example given in Section 1.2:

  PAR i = 0 FOR n ENROLL b
    worker (i, b)           -- all enrolled on ‘b’

Any worker process may terminate early, leaving its companion processes running and synchronising with each other successfully on ‘b’ — the early-terminated process has resigned from the barrier. In CSP, termination of components of a parallel composition happens simultaneously. If one component is ready to terminate, it commits exclusively to that and refuses all other events. So, if other components have not terminated and continue to try to synchronise on a multiway event bound to that parallel, there would be deadlock. For occam-π, we want to be able to build collections of processes disciplined by common synchronisation on barrier(s), but which do not have to be kept running and synchronising when their job is done, just so that they may terminate together. The chosen semantics give us this directly. If we really need the raw CSP semantics, we just declare and enrol an extra barrier and synchronise on it once:

  BARRIER alldone:
  PAR i = 0 FOR n ENROLL b, alldone
    SEQ
      worker (i, b)
      SYNC alldone

Now, when one worker terminates, its driving process commits to engage in ‘alldone’ and refuses ‘b’, on which it is still enrolled. All other worker processes must also terminate without further engagement on ‘b’ — else deadlock. Another nice property of these occam-π semantics is that SKIP is a unit of all versions of its PAR operator:

  PAR                    PAR ENROLL b
    P (b)         =        P (b)          =     P (b)
    SKIP                   SKIP
In the first system, ‘b’ must be a global barrier and SKIP is not enrolled. Hence, SKIP’s existence and termination have no impact on the continuing operation of P(b). In the second system, SKIP is enrolled on ‘b’. Unless P(b) finishes first, all this SKIP does is resign from ‘b’ and wait to terminate. Otherwise, it just terminates (together with P(b)). If P(b) synchronises on ‘b’, it cannot finish first and blocks until the SKIP has resigned (which will happen) and, then, continues as normal. If/when P(b) terminates, it does so with the waiting SKIP. Either way, the SKIP has no impact and we are left with P(b). In CSP, SKIP is a unit only of parallel interleaving. It is not a unit of any parallel operator bound to an event.

1.9. Controlled Exit from Resign Blocks

A subtle problem can arise through careless exit from RESIGN blocks. Consider:

  PROC always (BARRIER a, b)
    WHILE TRUE
      SEQ
        SYNC a
        ...  phase A compute (no SYNCs)
        SYNC b
        ...  phase B compute (no SYNCs)
  :

  PROC sometimes (BARRIER a, b)
    WHILE TRUE
      SEQ
        SYNC a
        ...  phase A compute (no SYNCs)
        SYNC b
        ...  phase B compute (no SYNCs)
        IF
          ... decide on a holiday
            RESIGN a, b
              ...  enjoy holiday (e.g. sleep)
          TRUE
            SKIP
  :

  PAR ENROLL a, b
    always (a, b)
    sometimes (a, b)

So long as ‘sometimes’ stays enrolled in its barriers, all goes well — ‘sometimes’ and ‘always’ will continue their respective phased computations in parallel, keeping in step with each other as each phase ends. If ‘sometimes’ decides to go on holiday, it resigns from its barriers and does other things (like sleep), leaving ‘always’ to continue on its own — all is still well. The problem arises if ‘sometimes’ decides to come back. When it exits its RESIGN block, it re-enrols on its barriers and waits to SYNC on ‘a’. If ‘always’ is in its phase B when this happens, we are lucky and the two processes resume in perfect synchronisation. But if ‘always’ is in phase A, its next SYNC is on ‘b’ and the system will deadlock. To do this safely, ‘sometimes’ must coordinate its return with ‘always’. One way to do this is for ‘sometimes’ to request permission from ‘always’ to return to their joint computations.
The ‘always’ process only grants this permission in its phase B and, then, waits for confirmation that ‘sometimes’ has re-enrolled (i.e. has left its RESIGN block). This behaviour is easy to manage by signalling and polling over standard channels:

  PROC sometimes (BARRIER a, b, CHAN BOOL signal!)
    WHILE TRUE
      SEQ
        SYNC a
        ...  phase A compute (no SYNCs)
        SYNC b
        ...  phase B compute (no SYNCs)
        IF
          ... decide on a holiday
            SEQ
              RESIGN a, b
                SEQ
                  ...  enjoy holiday
                  signal ! TRUE        -- request comeback
              signal ! TRUE            -- confirm comeback
          TRUE
            SKIP
  :

  PROC always (BARRIER a, b, CHAN BOOL signal?)
    WHILE TRUE
      SEQ
        SYNC a
        ...  phase A compute (no SYNCs)
        SYNC b
        ...  phase B compute (no SYNCs)
        PRI ALT
          BOOL any:
          signal ? any                 -- grant comeback
            signal ? any               -- wait for confirm
          SKIP
            SKIP
  :

and where the system is now:

  CHAN BOOL signal:
  PAR ENROLL a, b
    always (a, b, signal?)
    sometimes (a, b, signal!)

In a larger system, there may be many processes, like ‘sometimes’, that retire from the computation from time to time. Examples arise in large scale simulations of dynamic systems, where not all processes need to be continually active (because nothing is changing in their neighbourhood) but need to rejoin some barrier synchronisation (e.g. for managing simulation ‘time’) when something happens close to them — see [17]. In such cases, the above comeback/confirm protocol may be used between each resigning process and just one specialised process, like the above ‘always’, that is always cycling and
synchronising (and which need do nothing else). Separate comeback and confirm channels will be needed, SHARED at the resigning-process ends. We are considering language support for such a protocol. For example, the resigning processes execute:

  RESIGN b
    ...  resign block (may not reference ‘b’, ‘c’ or ‘d’)
  RESUME c! d!

where ‘c’ and ‘d’ are SHARED CHAN BOOLs. In the correct phase, the resuming process executes:

  RESUME c? d?

where this may be used as an ALT (or PRI ALT) guard.

1.10. Mobile Barriers: Declaration

Mobile barriers follow the same general rules for declaration, construction, communication and assignment as mobile channels and processes. The declaration introduces the variable name but leaves it undefined. Barrier variables become defined either through construction, communication or assignment. The compiler tracks defined-status and prevents use of undefined variables. Explicit run-time checks (using the DEFINED prefix operator) are forced for cases where the compiler cannot deduce the defined-status.

  MOBILE BARRIER b:
  ...  process (initially, ‘b’ is undefined)

At the end of scope of a mobile barrier declaration, if the variable ended up as defined, the process automatically resigns from the referenced barrier.

1.11. Mobile Barriers: Construction

Construction and assignment to a mobile variable are bound together:

  b := MOBILE BARRIER

where ‘b’ must be a MOBILE BARRIER variable. If ‘b’ were currently defined, the executing process would first resign from the currently referenced mobile barrier. After this statement, ‘b’ is defined and references a new mobile barrier, and the executing process is enrolled. Note that a mobile barrier may be declared and constructed in one line with the standard (though, in this case, rather unusual looking) initialising declaration:

  INITIAL MOBILE BARRIER b IS MOBILE BARRIER:
  ...  process (enrolled on ‘b’)

1.12. Mobile Barriers: Communication

Communication follows the same semantics as for all occam-π mobiles: the item moves to the new place, leaving the source variable undefined. For mobile barriers, there are additional rules about enrolment and resignation. In the following, ‘a’ and ‘b’ are MOBILE BARRIER variables, ‘b’ must be defined and ‘c’ is a CHAN MOBILE BARRIER (i.e. a channel carrying mobile barriers).

  c ? a
If ‘a’ is defined, the receiving process first resigns from the held barrier and, then, receives the new reference. Otherwise, it just receives the new reference. Either way, it is now enrolled on the received barrier.

  c ! b

This moves the barrier to another process, leaving ‘b’ undefined. The sending process resigns from the barrier. If we don’t want to lose it, we must send a clone:

  c ! CLONE b

In this case, the sending process remains enrolled on the barrier. All relevant enrols and resigns of the processes happen automatically and atomically with the communication.

1.13. Mobile Barriers: Assignment

Assignment follows the same mobility and enrol/resign rules. Again, suppose ‘a’ is a MOBILE BARRIER and ‘b’ is a defined MOBILE BARRIER.

  a := b

If ‘a’ were defined, the process first resigns from that barrier. The barrier reference moves from ‘b’ to ‘a’ and the process remains enrolled on it. Such assignments cannot introduce aliasing. However:

  a := CLONE b    -- this may get banned!

always introduces aliasing. As before, if ‘a’ were defined, the process resigns from that barrier. The barrier reference is copied from ‘b’ to ‘a’ — variables ‘a’ and ‘b’ now reference the same barrier and the process is enrolled twice! This aliasing may not be as bad as it seems. For example, the code on the left below is safe and may serve some purpose:

  SEQ                         PAR ENROLL b
    a := CLONE b                P (b)
    PAR                         Q (b)
      P (a)
      Q (b)
It is almost the same as the code on the right, but omits the auto-resignation semantics (see Section 1.8). To remain compatible with the rest of occam-π and to satisfy our intuition, assignment and communication should be related by laws that, in these cases, take the form:

  a := b          =     CHAN MOBILE BARRIER c:
                        PAR
                          c ? a
                          c ! b

and:

  a := CLONE b    =     CHAN MOBILE BARRIER c:
                        PAR
                          c ? a
                          c ! CLONE b
We definitely need to allow the cloned output mechanism — so simply banning cloned assignments is not enough to prevent aliasing.

1.14. Mobile Barriers: Forking

Passing arguments to forked processes in occam-π means communicating them — see [3]. Hence, forked processes may take mobile barrier parameters. If ‘b’ is a defined mobile barrier, then:

  FORK P (b)

moves the barrier to the new process. The forking process resigns from the barrier and ‘b’ becomes undefined. More usually, of course, the forking process retains the barrier (for passing to processes it may fork in the future) by passing a clone and remaining enrolled:

  FORK P (CLONE b)

Either way, the forked process is enrolled on the barrier. Just before the forked process terminates, it automatically resigns from whatever barrier (if any) its parameter is referencing. We need this for the same reason that auto-resignation was specified for parallel enrolled processes (Section 1.8). Note that the forking process must be enrolled on the barrier to be able to pass it to its forked processes. This enables the release of forked processes in the correct phase of barrier synchronisation with existing processes holding that barrier. Enrolment of the forked process happens atomically with its forking.

1.15. Mobile Barriers: Synchronisation, Parallel Enrolment and Resign Blocks

Synchronisation, parallel enrolment, parallel non-enrolment and resign blocks for mobile barriers have the same syntax and semantics as those for static barriers. The usual parallel usage rules for read/write access to variables apply to mobile barrier variables. A process enrolled on a mobile barrier is considered to have read access to the variable — i.e. its value cannot be changed in parallel. Marrying this with the usage rules for static barriers (Section 1.6), we note one extra rule:

• component processes in a PAR ENROLL construct whose bound barrier(s) is mobile may not change the held reference (e.g. by assignment, input or non-cloned output).

This also means that such component processes may only pass a PAR-ENROLL-bound mobile barrier to a static barrier parameter/abbreviation.

2. A CSP Model for Mobile and Static Barriers

Our original approach was to model occam-π barriers directly as CSP multiway events. The dynamics of mobility, resignation and enrolment were to be handled with auxiliary spinner processes, interleaving with the application processes on the barriers and taking over synchronisation on them when the application process was not enrolled. This worked well for static barriers, but managing the infinite sets of spinners needed to explain mobile barriers was proving troublesome (and would be hard for model checkers to accommodate). The approach presented here models each barrier as a process, rather than an event. It documents how they are supported by the occam-π kernel. It captures all the dynamic semantics of occam-π mobile barriers: run-time construction, communication and assignment, cloning, parallel enrolment and non-enrolment, termination resignation and resign blocks, and passing as arguments to forked processes.
Initially, we consider mobile barriers — the model for static barriers then follows trivially. A formal semantics for occam-π barriers then derives from the semantics of CSP.

2.1. Modelling an occam-π Mobile Barrier with a Process and Shared Channels

The insight is to give up trying to model these dynamic barriers directly with CSP events and spinner processes (maintaining synchronisation when their buddy application processes disengage). Instead, we model mobile barriers with processes and shared channels, but with added flexibility for the dynamic enrolment and resignation of processes. So, occam-π mobile barrier variables become (mobile) integer variables, holding indices to the actual barriers. The latter are (kernel) processes, running in parallel to all application processes, and created dynamically as needed. This means that they are always accessible to all application processes, even though they are triggered within individual ones. So, we don’t require the awkward scope extrusion concept of the π-calculus.

Table 1. Mobile barrier process fields

  Field   Name              Purpose
  b       index             identification — unique for each barrier
  refs    reference count   the number of mobile barrier variables currently holding a reference to ‘b’
  n       enrolled count    the number of processes currently enrolled on ‘b’
  count   sync count        the number of processes still left to synchronise on ‘b’ (to complete the barrier)

Table 2. Mobile barrier process events

  Event       Purpose
  enrol.b.p   enrol ‘p’ processes on barrier ‘b’
  resign.b    resign one process from barrier ‘b’
  tresign.b   temporarily resign one process from barrier ‘b’ (‘RESIGN’ block)
  tenrol.b    re-enrol one (temporarily resigned) process on barrier ‘b’
  sync.b      offer (committed) to synchronise on barrier ‘b’
  ack.b       complete synchronisation on barrier ‘b’
A mobile barrier process has four integer fields — shown in Table 1. System constraints will impose that (b > 0) and (refs ≥ n ≥ count ≥ 0). Index zero is reserved for mobile barrier variables currently undefined — this is just for convenience in the following model (not strictly necessary). The mobile barrier process with index ‘b’ engages on the events described in Table 2. Here is the process:

  BAR (b, refs, n, count) =
      enrol.b.p → BAR (b, refs + p, n + p, count + p)
    □ resign.b → BAR (b, refs − 1, n − 1, count − 1)
    □ tresign.b → BAR (b, refs, n − 1, count − 1)
    □ tenrol.b → BAR (b, refs, n + 1, count + 1)
    □ sync.b → BAR (b, refs, n, count − 1),                 if (count > 0)

  BAR (b, refs, n, 0) = BAR_ACK (b, refs, n, 0),            if (n > 0)

  BAR_ACK (b, refs, n, count) =
      ack.b → BAR_ACK (b, refs, n, count + 1),              if (n > count)
  BAR_ACK (b, refs, n, n) = BAR (b, refs, n, n)

  BAR (b, refs, 0, 0) = tenrol.b → BAR (b, refs, 1, 1),     if (refs > 0)
  BAR (b, 0, 0, 0) = SKIP

The difference between ‘resign.b’ and ‘tresign.b’ is that the latter does not decrement the reference count. There is a similar difference between ‘enrol.b.1’ and ‘tenrol.b’. ‘tresign.b’ and ‘tenrol.b’ will be used to bracket RESIGN blocks, whose existence is the only reason that the reference and enrolled counts may differ. SYNC operations, in application processes, map to a ‘sync.b’ immediately followed by an ‘ack.b’. The former just decrements the synchronisation count. If that reaches zero, the barrier process locks into a sequence of ‘ack.b’ events with length equal to the current enrolled count — these will all succeed, since there will be precisely that number of application processes blocked and waiting for them. Note: application processes interleave amongst themselves for engagement on all these barrier process control events. Any ‘resign.b’ event that reduces the reference count to zero will also, given the earlier constraint, have reduced the enrolled and synchronisation counts to zero — in which case, the barrier process simply terminates. Note that ‘tresign.b’ does not change the reference count and, so, cannot reduce it to zero.

2.2. Kernel and Application Processes

The mobile barrier processes are forked off as needed by a generator process:
  MB (b) = getMB!b → (BAR (b, 1, 1, 1) ||| MB (b + 1))
         □ noMoreBarriers → SKIP

For convenience, we also define:

  UNDEFINED_BAR = resign.0 → UNDEFINED_BAR
                □ noMoreBarriers → SKIP
Now, if SYSTEM is the occam-π application and SYSTEM′ is the CSP modelling of its mobile barrier primitives (see below), the full model is:

  ((SYSTEM′ ; noMoreBarriers → SKIP)  [| kernelchans |]  MobileBarrierKernel)  \  kernelchans

where:

  MobileBarrierKernel = MB (1)  [| {noMoreBarriers} |]  UNDEFINED_BAR

and:

  kernelchans = {enrol.b.p, resign.b, tresign.b, tenrol.b, sync.b, ack.b, getMB, noMoreBarriers | (b ≥ 0), (p ≥ 1)}
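To illustrate how these kernel processes behave, here is a small worked example of our own (not in the original text): a barrier ‘b’ with refs = n = count = 3 and no resignations. One complete synchronisation cycle runs:

  BAR (b, 3, 3, 3)
    --sync.b-->  BAR (b, 3, 3, 2)
    --sync.b-->  BAR (b, 3, 3, 1)
    --sync.b-->  BAR (b, 3, 3, 0)  =  BAR_ACK (b, 3, 3, 0)
    --ack.b-->   BAR_ACK (b, 3, 3, 1)
    --ack.b-->   BAR_ACK (b, 3, 3, 2)
    --ack.b-->   BAR_ACK (b, 3, 3, 3)  =  BAR (b, 3, 3, 3)

Each of the three enrolled processes contributes one sync.b; the third drives the count to zero, after which exactly three ack.b events release the blocked processes and the barrier returns to its initial state.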
2.3. Extending CSP with Variables and Assignment

To make the semantics of mobile barriers precise, we shall be using the syntax of Circus [18]. This introduces, amongst other things, variables and assignment into CSP. It allows us to work at a slightly higher, and clearer, level than pure CSP. Such variables and assignments could be removed by introducing parallel terminatable state-processes for each variable, whose duration matches their scope, plus ‘load’, ‘store’ and ‘kill’ channels for reading and writing their values and for termination. For example, the variable declaration and process:

  Var x : N • P

becomes:
  ((P′ ; killX → SKIP)  [| {loadX, storeX, killX} |]  VarX)  \  {loadX, storeX, killX}

where:

  VarX (x) = loadX!x → VarX (x)
           □ storeX?tmp → VarX (tmp)
           □ killX → SKIP
and P′ is the result of removing similar variables from P. An assignment process:

  x := y

becomes:

  loadY?tmp → storeX!tmp → SKIP

Any expression involving such variables requires prefixing with a sequence of loads into separate registers. For example:

  c!(x + y)

becomes:
    loadX?tmp0 → loadY?tmp1 → c!(tmp0 + tmp1) → SKIP
  □ loadY?tmp1 → loadX?tmp0 → c!(tmp0 + tmp1) → SKIP
All occam-π variables — including those for mobile barriers — map to such Circus variables. When reasoning formally about such CSP mappings, we should also take into account that occam-π processes are bound by the language’s parallel usage rules. These need formalising.

2.4. Modelling the occam-π Primitives for Mobile Barriers

2.4.1. Mobile Barrier Declaration

Mobile barrier variables map into mobile integer (actually natural number) variables, holding indices to the referenced barrier processes:

  MOBILE BARRIER b:
  P

becomes:
  Var b : N • (b := undefined ; P′ ; resign.b → SKIP)
where undefined is zero and P′ is the CSP model of P. Note that if ‘b’ is undefined when P terminates, the ‘resign.b’ is swallowed harmlessly by the UNDEFINED_BAR kernel process.
2.4.2. Mobile Barrier Construction

  b := MOBILE BARRIER

becomes:

  getMB?tmp → (b := tmp)
2.4.3. Mobile Barrier Synchronisation

  SYNC b

becomes:

  sync.b → ack.b → SKIP
2.4.4. Mobile Barrier Send (Uncloned)

  c ! b

becomes:

  c!b → (b := undefined)
2.4.5. Mobile Barrier Send (Cloned)

  c ! CLONE b

becomes:

  enrol.b.1 → c!b → SKIP

2.4.6. Mobile Barrier Receive

  c ? b

becomes:

  c?tmp → resign.b → (b := tmp)
2.4.7. Mobile Barrier Assign (Uncloned)

  a := b

becomes:

  resign.a → (a := b ; b := undefined)
2.4.8. Mobile Barrier Assign (Cloned)

  a := CLONE b

becomes:

  ((enrol.b.1 → SKIP) ||| (resign.a → SKIP)) ; (a := b)

2.4.9. Mobile Barrier Resign Block (Uncontrolled Resume)
  RESIGN b
    P

becomes:

  tresign.b → (P′ ; tenrol.b → SKIP)

2.4.10. Mobile Barrier Resign Block (Controlled Resume)

  RESIGN b
    P
  RESUME c! d!

becomes:

  tresign.b → (P′ ; c → tenrol.b → d → SKIP)

To coordinate resumption in the right phase, the resuming process should be enrolled on ‘b’. It executes:
  RESUME c? d?

which becomes:

  c → d → SKIP
Note: one resuming process can manage many resign-block processes. The latter interleave amongst themselves on the ‘c’ and ‘d’ channels, but synchronise on them with the former. We call them ‘channels’ since only two-way synchronisation is involved. No values are communicated over them.

2.4.11. Mobile Barrier Parallel Enrolment

  PAR i = start FOR n ENROLL b
    P (i, b)

becomes:

  enrol.b.(n − 1) →
    (( ||| i = start .. start + (n − 1) •
         (P′ (i, b) ; down?m → (SKIP <| (m = 0) |> (resign.b → SKIP))) )
     [| {down} |]  ParCount (n))

where P′ (i, b) is the CSP model of P (i, b) and:

  ParCount (n) = down!(n − 1) → ParCount (n − 1),    if (n > 0)
  ParCount (0) = SKIP
The usual occam-π parallel usage rules apply for the barrier variable ‘b’ here. So, the replicated process may use ‘b’ but may not change it. All it may do is SYNC on it, RESIGN from it and release CLONEs. Note that this captures the required semantics (Section 1.8) that each component process of the PAR ENROLL resigns from the barrier as it terminates, apart from the last one to finish. 2.4.12. Mobile Barrier Parallel Non-Enrolment No special semantics are needed in this case: the parallel just maps to a CSP parallel construction. The occam-π parallel usage rules apply — i.e. only (at most) one of the component processes may change the barrier variable. However, occam-π imposes a stricter constraint: only (at most) one of the component processes may reference the barrier at all (i.e. SYNC on it, RESIGN from it, CLONE it, change it). 2.4.13. Mobile Barrier Passing to a Forked Process FORK P (b)
forkP!b → (b := undefined)
where ‘forkP’ is a channel specific for forking instances of P. More usually, of course, the forking process retains the barrier (for passing to processes it may fork in the future) by passing a clone and remaining enrolled:

  FORK P (CLONE b)
enrol.b.1 → forkP!b → SKIP
Note that, either way, synchronisation on the barrier referenced by ‘b’ cannot afterwards complete without participation by the forked process (e.g. by synchronisation or resignation). To fork a process, we must be running in a FORKING block (which, by default, is the whole system). An explicit such block, that forks only instances of P(b) for some mobile barrier variable ‘b’:

  FORKING
    X
  ( (X′ ; done → SKIP) [| {forkP, done} |] ForkP ) \ {forkP, done}

where X′ is the CSP model of X, ‘done’ is chosen so that it does not occur free in X′ or P′(b), and:

  ForkP = ( forkP?b → ( (P′(b) ; resign.b → done → SKIP) [| {done} |] ForkP ) )
          □
          ( done → SKIP )

where P′(b) is the CSP model of P(b). Note that forked processes — like components of a PAR ENROLL construct — resign from whatever barriers (if any) are referenced by their parameters as they terminate. Note also that termination of the forking block waits for all forked processes to terminate.

2.5. Modelling the occam-π Primitives for Static Barriers

The semantics of static barriers did work out with the spinner mechanism previously considered. However, static barriers can always be replaced by mobile barriers that take no advantage of their mobility (i.e. communication and assignment). So, we may as well go with these new semantics!
To transform static barriers into mobiles, their declarations:

  BARRIER b:

simply become the combined mobile declaration and initialisation:

  INITIAL MOBILE BARRIER b IS MOBILE BARRIER:

All BARRIER parameters/abbreviations become MOBILE BARRIERs. No other transformations are needed, so we have their semantics. Note: with static barriers, all we can do is synchronise, parallel enrol and resign block. If that is sufficient, use them rather than mobiles. There can be no aliasing problems with static barriers and their run-time overheads (memory and execution) are slightly lower.

3. Implementation and Benchmarking

Implementation follows all the mechanisms documented in the formal semantics given in Section 2. However, scheduling of the barrier processes is automatically serialised with inline instructions generated by the occam-π compiler, supported by its kernel — no actual processes or channels are introduced. Each barrier is managed through just five words of memory: three for the reference, enrolled and synchronisation counts (see Section 2.1) and two holding the front and back pointers to a queue holding processes blocked on the barrier. Barrier variables hold the start address (index) of this structure. For mobile barriers, the space is allocated dynamically in occam-π mobile-space (see [19]); for static barriers, the space lives on the stack of the declaring process.

A process synchronising on a barrier, unless the last to synchronise, is held on the barrier queue (rather than on an ‘ack.b’ channel) and the next process is scheduled. A process completing a barrier (i.e. reducing the synchronisation count to zero) releases all the others — this is done in unit time by simply appending the barrier queue to the run queue, leaving the former empty. All adjustments to the barrier counts follow the rules defined in Sections 2.1 and 2.4 for modelling all the occam-π primitives in CSP.

Figure 2 shows the results of a benchmark that measures the time per barrier synchronisation for increasing numbers of concurrent processes, run on 3.2 GHz Pentium IV machines. Each process synchronises a fixed number of times, from which the average individual synchronisation time is calculated. A stride length is used to control the start-up (and subsequent scheduling) order of parallel sub-processes, demonstrating the effect of the processor’s cache pre-fetching. Each curve in the figure reflects a different stride. The memory footprint for the 16 million process benchmark (actually 2^24 processes) was just over 700 megabytes (approximately 44 bytes per process), so cache-misses will be heavy. The processes are allocated their workspaces contiguously according to their index. The stride forces their scheduling so that consecutively run process workspaces are (44*stride) bytes apart. For small strides, the Pentium IV cache pre-fetching eliminates the problem of cache miss. For larger strides, and especially for the randomised striding, the pre-fetching is defeated and cache miss penalties are felt.

Despite this, Figure 2 shows the implementation to be ultra-lightweight. The time for a sixteen-million-wide barrier synchronisation was only 16 ns per process in the best case (163 ms for the whole barrier) and 247 ns per process in the worst case. Typical application mixes will show some coherence in memory usage — the worst case above is really cruel!
Also, applications running real processes (with real work to do) will not be able to afford more than the order of a million of them (because of memory limitations with current technology). The barrier mechanisms presented in this paper are useful and fast.
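To make the five-word layout concrete, here is a minimal sketch of such a barrier record (ours, in Java rather than the compiler-generated code, with invented names). The real kernel splices the blocked queue onto the run queue by its front/back pointers in constant time, where this sketch copies for clarity:

  import java.util.ArrayDeque;

  // Sketch of the five-word barrier record: three counts plus the blocked
  // queue, whose front and back pointers are the remaining two words.
  final class BarrierRecord {
    int refCount;    // word 1: barrier variables/processes referencing this record
    int enrolled;    // word 2: processes currently enrolled
    int countDown;   // word 3: enrolled processes still to SYNC this cycle
    final ArrayDeque<Runnable> blocked = new ArrayDeque<>();  // words 4-5

    // Returns true if the caller completed the barrier (and keeps running);
    // otherwise the caller must add itself to 'blocked' and deschedule.
    boolean sync(ArrayDeque<Runnable> runQueue) {
      if (--countDown > 0) {
        return false;             // not the last to SYNC: block on the barrier queue
      }
      runQueue.addAll(blocked);   // last one: release all the others...
      blocked.clear();            // ...leaving the barrier queue empty
      countDown = enrolled;       // reset for the next synchronisation cycle
      return true;
    }
  }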
Figure 2. Synchronisation time for different strides. [Plot: sync time per process (ns, 0 to 250) against the number of processes (1 up to 16M); one curve per stride: 1, 4, 16, 1024, 16384, 65536 and random.]
4. Sample Applications

4.1. The TUNA Project

This work binding barrier synchronisation safely and efficiently into the occam-π language was prompted by the needs of TUNA (Theory Underpinning Nanite Assemblers) [13], a project involving researchers from the Universities of York, Surrey and Kent in the United Kingdom. This is investigating the emergent properties of systems containing millions of interacting agents — such as nanites or biological organelles. Here, goals are achieved by emergent behaviour from force of numbers, not by complicated programming or external direction. Such systems are complex, but not complicated. Medium term aims are the development of sufficient theory to enable the design of self-assembling nanite systems with controlled and predictable properties for application in human medicine.

A working case study looks at mechanisms of blood clotting. The model is loosely based on the medical process of haemostasis. Platelets are passive quasi-cells carried in the bloodstream. A platelet becomes active when a balance of chemical stimulators and suppressants changes in favour of activation, usually because of physical damage to the linings of blood vessels. Activated platelets become sticky and form clusters that restrict blood flow — a necessary first phase in limiting blood loss, healing of the wound and recovery.

Unlike systems developed for traditional embedded and parallel supercomputing applications, TUNA networks will be highly dynamic — with elements, such as channels and processes, growing and decaying in reaction to environmental pressures. Computational network topologies continually evolve as the organelles/nanites replicate, combine and decay. To model more directly (and, hence, simply) the underlying biological/mechanical interactions, extremely fine-grained concurrency will be used. Complex behaviour will be obtained not by direct programming of individual process types, but by allowing maximum flexibility for self-organisation following encounters between mobile processes — randomised modulo physical constraints imposed by their modelled environments. We will need to develop location awareness for the lowest level processes, so they may be aware of other processes in their neighbourhood and what they have to offer. We will need to synchronise the development of organisms to maintain a common awareness of time.
Barrier mechanisms with user-defined and dynamic binding to processes are promising to be very helpful in this context.

4.2. Static Barrier Application: First Blood Clotting Model (Busy)

The clotting model and implementation described here are a gross simplification of what we will eventually require for TUNA. It is crucial, however, that we have a firm understanding of, and confidence in, simple models before attempting more elaborate ones. We would not wish for any emergent behaviour of the system to be wholly determined by implementation-specific artifacts, such as programming errors arising from a lack of understanding.

Space is modelled as a one-dimensional pipeline of ‘cell’ processes representing a section of a blood vessel. Platelets are in their activated (i.e. sticky) state. They flow through the cells at (average) speeds inversely proportional to the size of the clot in which they become embedded — these speeds are randomised slightly. Clots that bump together stay together, forming larger clots spanning many cells. Each cell maintains internal state indicating whether it contains a platelet. The model is time-stepped by having the cells synchronise on a barrier [8], which is also used to coordinate safe access to shared data.

4.2.1. System Network and Two-Phased Cycles
Figure 3. ‘Busy’ clotting model process network (phase 0). [Pipeline gen → cell → ... → cell → hole over the ‘draw’ barrier, with shared ‘display state’ and ‘running’ variables, plus ‘display’ (screen) and ‘keywatch’ (keyboard).]
Figure 4. ‘Busy’ clotting model process network (phase 1). [The same network as Figure 3, shown in its other computational phase.]
Figures 3 and 4 show the two computational phases of the process network used in this clotting model. The ‘generator’ process determines (stochastically) whether a new platelet is generated and, if so, injects it. The ‘hole’ process just acts as a sink for platelets flowing out of the pipeline. The ‘display’ process renders the (full or empty) state of the cells for visualisation and shows system parameters (such as platelet generation and display rates). The
‘keywatch’ process allows user-interaction for setting those parameters and for terminating the system. The ‘display state’ and ‘running’ flag are not actually processes, but variables shared between the ‘cell’ and ‘display’ processes. (Such variables could, of course, be made into processes if we were worried about this — see Section 2.3).

Figures 3 and 4 extend the symbology of Figure 1. The shaded rounded boxes represent state variables. They are stuck on the barrier, ‘draw’, to indicate that access to them is controlled through the barrier. The dotted arrows between the processes and the shared variables indicate two things: reading/writing (depending on the arrow direction) and that the processes must synchronise on the underlying barrier to coordinate that reading or writing.

Race hazards to shared memory (and consequential loss of control) are normally avoided by occam-π’s parallel usage rules, which enforce CREW (Concurrent Read Exclusive Write) principles. However, these apply between component processes of a PAR or between a FORKed process and the rest of the system. Here, we need a finer granularity of enforcement and this is managed through the ‘draw’ barrier.

All ‘cell’ processes together with ‘generator’, ‘hole’ and ‘display’ cycle through two phases, synchronised by the ‘draw’ barrier on which they are enrolled. To check CREW conformance, we just have to check that no read/write or write/write on shared state happens in the same phase. In this system, different components of the ‘display state’ are written by the cells in phase 1; they are read by the rendering ‘display’ process in phase 0. The ‘running’ flag is read by all enrolled processes in phase 0 and written, by ‘display’, in phase 1.

4.2.2. The ‘cell’ Process

Here is outline code for the ‘cell’. The first two reference data parameters give this process access to its component of the ‘display state’ (shared with the ‘display’ process) and the ‘running’ flag (shared with most other processes):

  PROC cell (BYTE my.visible.state, BOOL running, BARRIER draw,
             CHAN CELL.CELL left.in?, left.out!, right.in?, right.out!)
    ... local declarations / initialisations (phase 0)
    WHILE running
      SEQ
        SYNC draw                                            -- phase 1
        ...  PAR-I/O exchange of full/empty state with neighbour cells
        ...  if full
        ...    discover clot size (initiate or pass on count)
        ...    if head of clot
        ...      decide on move (non-deterministic choice)
        ...      if move, tell empty cell ahead (push decision)
        ...    else
        ...      receive decision from cell ahead (pull decision)
        ...    if not tail of clot, pass movement decision back (pull)
        ...    if tail and movement, become empty
        ...  else
        ...    if clot behind exists and moves (push), become full
        SYNC draw                                            -- phase 0
        ...  update my.visible.state
  :
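For readers more at home with conventional threads libraries, the same two-phase discipline can be sketched as follows. This is an illustration of the scheme only (all names are ours), not the occam-π implementation; the ‘display’ process is assumed to be the extra party awaiting the same barrier:

  import java.util.concurrent.BrokenBarrierException;
  import java.util.concurrent.CyclicBarrier;

  // Two-phase CREW sketch: cells write visibleState only in phase 1;
  // it is read (by the display party) only in phase 0.
  final class CellSketch implements Runnable {
    static final int CELLS = 8;
    static final CyclicBarrier draw = new CyclicBarrier(CELLS + 1);  // cells + display
    static final boolean[] visibleState = new boolean[CELLS];
    static volatile boolean running = true;  // written by the display party in phase 1

    private final int me;
    CellSketch(int me) { this.me = me; }

    public void run() {
      try {
        while (running) {                        // read in phase 0
          draw.await();                          // everyone enters phase 1 together
          visibleState[me] = !visibleState[me];  // stand-in for the clot logic
          draw.await();                          // everyone enters phase 0 together
          // phase 0: cells only read; the display party renders visibleState now
        }
      } catch (InterruptedException | BrokenBarrierException e) {
        Thread.currentThread().interrupt();      // abandon the simulation
      }
    }
  }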
The ‘CELL.CELL’ protocol used for communication between cells is defined with:

  PROTOCOL CELL.CELL
    CASE
      state; BOOL     -- full/empty
      push; BOOL      -- move/no-move decision
      pull; BOOL      -- move/no-move decision
      size; INT       -- clot size
  :
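In the phase-1 exchange, for instance, a cell might report itself to its neighbour with an output such as ‘right.out ! state; full’ (where ‘full’ is its BOOL status), reserving the push, pull and size variants for the follow-up negotiation. This fragment is a plausible usage of the protocol, not code quoted from the paper.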
The barrier synchronisation forces all enrolled processes to start their phase 1 computations together. The I/O-PAR communications of state between the ‘cell’s, which only use the above ‘state’ variant, cannot introduce deadlock [20]. After that, each cell knows the state of its immediate neighbours and works out what further communications, using the other variants of the ‘CELL.CELL’ protocol, are needed. All cells follow the same rules and reach matching decisions about those communications — so there can be no deadlock, despite this part of the logic not being I/O-PAR.

The ‘generator’ and ‘hole’ processes are cut-down versions of the ‘cell’. Additionally, ‘generator’ polls its input channel from ‘keywatch’ for user-updates to the generation rate and makes decisions, based on that rate, for releasing new platelets (which it does by appearing empty or full to the first ‘cell’ process). The ‘keywatch’ process is lazy and not enrolled on the barrier. It is triggered solely by user keystrokes.

It is worth noting that the movement decisions (by a ‘cell’ process at the head of a clot) and the new platelet release decisions (by the ‘generator’) are the only places in the system where non-determinism occurs (modelled in CSP as an internal choice). The ‘cell’ processes do not even contain a single ALT construct.

4.2.3. Scaling Up

In this system, every cell is always active, regardless of whether it contains a platelet — it is a classic busy Cellular Automaton (CA). It works well for systems with the order of hundreds of thousands of cells. For TUNA, we will need to be working in three dimensions, modelling many different types of agent all with much richer rules of engagement. To enable scaling up two (and more) orders of magnitude, these automata must become lazy, whereby only processes with things to do remain in the computation. One technique for achieving this is given in the next section; another is reported in [17].

4.3. Mobile Barrier Application: Second Blood Clotting Model (Lazy)

Something unsatisfactory about the CA approach described in the previous section is that the logic focusses on the cell processes. The rules for different stages in the life cycle of platelets or clots are coded into different cycles of the cells. From the point of view of the cell, which is what we design and program, we see lots of different platelets — sometimes bunched together forming clots — passing through. No process models the development of an individual clot.

4.3.1. Mobile Barriers, Mobile Channels and Forking

This model focusses on the life cycle of clots, each one directly represented by a ‘clot’ process. Initially, these are forked off by the ‘generator’ process as singleton platelets, straddling the first cell in the pipeline. Because these ‘clot’s need enrolment on the barrier,
the barrier must be passed to them by the ‘generator’. Because passing arguments to forked processes involves communication, the barrier must be a mobile.

As before, space is represented by the pipeline of ‘cell’ processes — but this time they are not enrolled on the barrier. These cell processes are passive servers, responding to client requests on their service channel bundles — represented in Figures 5-10 by the vertical bidirectional channels on the top of the cells. Neighbourhood topology is determined by each cell’s (shared) access to the next cell’s service channels. Because we only support forward clot movements in this model, a cell only needs forward access — it would be easy to make connections in both directions should other models need this. Cells hold state indicating whether they are being straddled by a passing clot; this state is shared with the ‘display’ process. They are idle except when the front and rear boundaries of a clot pass through them.

Each ‘clot’ process connects feeler channels to the cells immediately before and after the group of cells currently straddled — see the figures. It also connects to the last cell in its group, in which it deposits the writing end of its tail-channel — that deposition is not shown in the figures, but left free-standing for clarity. All channels, apart from those connecting ‘keywatch’ and the ‘generator’ and ‘display’ processes, are mobile. The cell processes are shown underlain by the ‘draw’ barrier. This means that processes connected to them (i.e. the clots and the display) must be enrolled on that barrier and coordinate their interaction with the cells through synchronisation on the barrier.

4.3.2. Computation Phase 0
Figure 5. ‘Lazy’ clotting model — before move (phase 0). [A ‘clot’ process straddles a group of cells in the gen → cell → ... → cell pipeline over the ‘draw’ barrier; ‘display’ (screen) and ‘keywatch’ (keyboard) as before.]
Figure 6. ‘Lazy’ clotting model — after move (phase 0). [As Figure 5, with the clot advanced by one cell.]
Through barrier synchronisation, we maintain the following invariant at the start of phase 0 of each cycle: for each clot in the system, there are empty cells on either side of the (full) cells in the clot. This condition is shown in Figure 5. The computation proceeds by deciding whether to move and, if positive, moving the clot forwards by one cell — Figure 6. This requires communicating the client-ends of the cell service channel-bundles through the existing connections of the clot process, updating those connections accordingly, dragging the clot’s tail forward one cell, marking the old rear cell empty and the new front one full. This all happens in phase 0, during which the ‘display’ process is not reading the cell states (maintaining CREW rules).

4.3.3. Computation Phase 1

Following another barrier synchronisation, we are in phase 1. The invariant here is that no clots are moving. This allows them to inspect their environment — location awareness — by interrogating through their front and rear feelers. If other clots are detected, the bumping clots coalesce — Figures 7-10.

In Figure 7, two clots detect that they have touched. The left one, using its front feeler, acquires the writing end of the tail-channel of the one on the right (which was deposited in the cell probed by that feeler). The two clot processes have dynamically set up a connection between them — Figure 8.
Figure 7. ‘Lazy’ clotting model — bump detected (phase 1). [Two clot processes straddle adjacent groups of cells.]
Figure 8. ‘Lazy’ clotting model — communication established (phase 1). [The left clot has acquired the writing end of the right clot’s tail-channel.]
The left clot communicates four items: its size, the reading end of its tail-channel and the client ends of its rear feeler and last clot cell services. The right clot increments its size accordingly and overwrites its corresponding connections with the three channel/bundle-ends received — Figure 9. Finally, the left clot terminates, the right clot having taken over the merger — Figure 10.
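The hand-over can be pictured with a record of a clot’s rear-facing connections. This is a sketch of ours with invented names, in which ChannelEnd merely stands in for the occam-π mobile channel/bundle-ends actually communicated:

  // Placeholder for a mobile channel (or channel-bundle) end.
  final class ChannelEnd { }

  // Sketch: the rear-facing connections a clot replaces when it absorbs
  // the clot behind it.
  final class ClotConnections {
    int size;                   // platelets in this clot
    ChannelEnd tailReadEnd;     // reading end of this clot's tail-channel
    ChannelEnd rearFeeler;      // client end of the cell just behind the clot
    ChannelEnd lastCellClient;  // client end of the last straddled cell

    // The right-hand clot absorbs the left-hand one: add its size and
    // overwrite the three rear connections with the ends received from it.
    void absorb(int leftSize, ChannelEnd leftTail,
                ChannelEnd leftRearFeeler, ChannelEnd leftLastCell) {
      size += leftSize;
      tailReadEnd = leftTail;
      rearFeeler = leftRearFeeler;
      lastCellClient = leftLastCell;
    }
  }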
Figure 9. ‘Lazy’ clotting model — tail and back legs passed (phase 1). [The right clot now holds the merged size and the left clot’s rear connections.]
Figure 10. ‘Lazy’ clotting model — clots merged, rear one terminated (phase 1). [A single clot process straddles the combined group of cells.]
During this phase, the (full or empty) state of the cells does not change and it is safe for the ‘display’ process to read and render it. Not shown in these figures is a shared ‘running’ flag, operated across the phases in the same way as for the previous model — Section 4.2. Terminating the cell processes cannot be done via this ‘running’ flag, since they are not enrolled on the barrier and have no way, safely, to read its value and ensure that all read it in the same cycle. Instead, termination has to be done in the classical way, using a poison message sent through the pipeline — see [21].

4.4. Performance of the Models

For the ‘busy’ cellular automata of Section 4.2, performance is proportional to the number of cells, since they are all active all the time. It also depends on the number of platelets in the system, since cells holding platelets have additional work to do. Further, clot sizes are recomputed every cycle — so large clumps also increase the cost. For the ‘lazy’ but dynamic system of Section 4.3, the number of cells only impacts on memory requirements — though that may cause cache-miss problems at run-time. Otherwise, its performance depends only on the number of clots in the system — their size (i.e. the number of platelets) is irrelevant.

Table 3 gives the cycle times per cell for systems of around 10K cells, running on a 2.4 GHz Pentium 4-m. The number of platelets in the system depends on the generation rate — these rates are given in the first column as fractions of 256 and represent the probability of release in each cycle. Each run, of course, has different properties, but the overall performance does not change much. These results are averaged over 10 runs for each model and for each generation rate.
Table 3. Cell cycle times for the two models

  Generation Rate (n/256)    ‘Busy’ (ns)    ‘Lazy’ (ns)
           0                     650              0
           1                     660              8
           2                     670             12
           4                     680             14
           8                     700             16
          16                     740             18
          32                    1070              0
A generation rate of zero implies no work is done by the ‘lazy’ model. A generation rate of 32/256 is too much for the bloodstream and causes a total jam, with the vessel containing one continuous clot. This causes extra work for the ‘busy’ model, computing its length each cycle — as well as cycling all processes. For the ‘lazy’ model, there is again nothing to do. On balance, the ‘lazy’ model is more than 40 times faster than the ‘busy’ cellular automaton (at a generation rate of 8/256, for example, 700 ns against 16 ns per cell-cycle, a factor of about 44) — in some circumstances, it is infinitely more efficient. Its logic is also simpler, more directly modelling the players in the system.

4.5. Emergent Behaviour

The clotting model presented here is particularly simple. It has been developed to try out techniques that need to be matured before the real modelling can be attempted. Nevertheless, unprogrammed behaviour has emerged that is encouraging and relevant to our TUNA investigations. Considering the 1-dimensional pipeline as a capillary in the blood circulation system, these results reflect certain observed realities. Above a certain probability of platelet activation (resulting, initially, from tissue damage) and length, such a capillary always becomes blocked. Figure 11 shows a screen-shot of a visualisation for a 100×50 cell grid (arranged as a 1-dimensional pipe) using 16 pixels-per-cell and with a 4/256 probability of clot platelet generation at the start of the pipe (top-left in the picture).
Figure 11. Clot model visualisation
The pipeline is displayed snaking down the image, with the first cell at the top-left, the next cells moving right along the first row, then left along the second row, etc. In the early rows of Figure 11, only small (mainly single-celled) clots are seen. Further down the pipeline (blood vessel), small randomised variations in their speed have resulted in them bumping and coalescing into larger and slower moving clots. Even so, they manage to flow away fast enough that the faster moving singletons behind them coalesce into similarly large clots that cannot catch them and the stream continues to flow.
With higher probabilities of clot generation (not shown in the above figure), larger clots are formed that move slower still. Above a threshold (to be found by in silico experiment), these larger clots cannot escape being caught by smaller clots behind them — which leads to eventual catastrophic clotting of the whole system.

4.6. TUNA Perspective

For the introduction of nanites implementing artificial blood platelets, getting the balance right between the stimulation and inhibition of clotting reactions will be crucial to prevent a catastrophic runaway chain reaction. This model is a crude (as yet) platform for investigating the impact of many factors on that balance. Our ambitions in the TUNA project call for scaling the size of these models through three orders of magnitude (i.e. tens of millions of processes) and hard-to-quantify orders of complexity. We will need to model (and visualise) two and three dimensional systems, factor in a mass of environmental stimulators, inhibitors and necessary supporting materials (such as fibrinogen) and distribute the simulation efficiently over many machines (to provide sufficient memory and processor power).

We suspect that simple cellular automata, as described in Section 4.2, will not be sufficient. We need to develop lazy versions, in which cells that are inactive make no demands on the processor. We also need to concentrate our modelling on processes that directly represent nanites/organelles, that are mobile and that attach themselves to particular locations in space (which can be modelled as passive server processes that do not need to be time-synchronised). Barrier resignation will be crucial to manage this laziness; but care will need to be applied to finding design patterns that overcome the non-determinism that arises from unconstrained use. Such an approach is taken in the model developed in Section 4.3. Another is presented in [17].

Achieving this will be a strong testing ground for the dynamic capabilities (e.g. mobile processes, channels and barriers) built into the new occam-π language, its compiler and runtime kernel. Currently, occam-π is the only candidate software infrastructure (of which we are aware) that offers support for our required scale of parallelism and relevant concurrency primitives. Further, it is backed up with compiler-checked rules against their misuse. We need the very high level of concurrency to give a chance for interesting complex behaviour to emerge that is not pre-programmed. We need to be able to capture rich emergent behaviour to investigate and develop the necessary theories to underpin the safe deployment of nanite technology in medicine and elsewhere. How those theories will/may relate to the process algebra underlying occam-π semantics (i.e. Hoare’s CSP and Milner’s π-calculus) is a very interesting and very open question. This work will contribute to the (UK) ‘Grand Challenges for Computer Science’ areas 1 (In Vivo ⇔ In Silico) and 7 (Non-Standard Computation).

5. Summary and Future Work

This paper has reported the introduction of mobile BARRIERs into the occam-π multiprocessing language. These provide an extra synchronisation mechanism, based upon the concept of multiway events from CSP and mobility from the π-calculus. The language binding, rules and semantics were presented first informally — followed by complete formal semantics through modelling in standard CSP.
The current implementation mechanisms for occam-π were outlined, together with benchmark performance figures (from systems with up to 16 million processes). Finally, an application was described whose efficiency is transformed through the use of these barriers and their ability to be communicated.
The desired semantics for occam-π barrier synchronisation are precisely the same as those for CSP multiway events. Despite this, the former are not directly modelled by the latter, because of the need to capture the dynamics of run-time construction, enrolment, resignation and mobility (which are alien to CSP events). However, it turned out surprisingly easy to capture both the fundamental (CSP) synchronisation of barriers and their (π-calculus) dynamics — and we did not have to step outside standard CSP. All that proved necessary was to model the support built into the occam-π kernel and the code generation sequences from the compiler (that interact with the kernel). Barriers become kernel processes operated through indexed control channels over which all application processes interleave. It would, perhaps, have been a better story to say that this CSP modelling came first (accompanied by some formal sanity check verifications and/or model checking) before the kernel and compiler were developed. Alas, we thought and did things the other way around.

This CSP modelling gives us both a denotational semantics (through the standard traces/failures/divergences semantics of CSP) and an operational semantics (describing the implementation). It enables formal verification and (finite) model checking for occam-π systems using mobile barriers. The denotational aspect further supports formal system specification and development through refinement. The operational aspect provides machine-independent formal documentation of the necessary compiler code generation and run-time kernel support.

This work has triggered a similar approach for the modelling of (occam-π) mobile channels in CSP. Again, kernel processes, rather than channels, are used to capture the synchronisation and dynamic semantics. This is a very recent result and will have to be reported elsewhere. It may now be possible to provide a formal CSP model documenting the entire occam-π run-time kernel and supporting code generation. That would enable formal specification, development and analysis of all application systems, as well as provide a formal specification for the porting of occam-π to new target platforms (including the design of direct silicon support in future microprocessors).

Another development of this work could lead to a complete formal specification of a compiler from occam-π down to a simple register-based machine code — for example, see Section 2.3. Adding in formal constraints imposing the parallel and anti-aliasing usage rules of occam-π would further permit re-ordering of code sequences, necessary for the efficient operation of many modern microprocessors. Assistance for this is also given by avoiding unnecessary serialisation of code sequences in the formal definition — for example, Sections 2.3 and 2.4.8, where refinement into particular serialisations can be chosen at any stage (including their deferral till run-time). These re-orderings would be both safe (in terms of sequential consistency and multiprocessor execution) and understandable (by mortal systems designers and coders). Such work is for the future, but should be relevant and within the timescale of the UK ‘Grand Challenges in Computer Science’ [22] project on Dependable Systems [23].

The TUNA applications work, described in Section 4, is the beginning of contributions towards two of the other Grand Challenge areas: In Vivo ⇔ In Silico [24] and Non-Classical Computation [25].
Acknowledgements We are grateful to our colleagues on the TUNA project for insights and much debate. Thanks especially to Jim Woodcock, Steve Schneider and Ana Cavalcanti for suggesting the blood clotting case study and for their own CSP models developing it — and for motivating us to
the importance of finding a formal semantics for the occam-π mobiles. We would also like to thank the anonymous reviewers for their helpful comments on an earlier version of this work.
References

[1] P.H. Welch and D.C. Wood. The Kent Retargetable occam Compiler. In Proceedings of WoTUG 19, pages 143–166. IOS Press, March 1996. ISBN: 90-5199-261-0.
[2] P.H. Welch, J. Moores, F.R.M. Barnes, and D.C. Wood. The KRoC Home Page, 2000. Available at: http://www.cs.kent.ac.uk/projects/ofa/kroc/.
[3] P.H. Welch and F.R.M. Barnes. Communicating mobile processes: introducing occam-pi. In A.E. Abdallah, C.B. Jones, and J.W. Sanders, editors, 25 Years of CSP, volume 3525 of Lecture Notes in Computer Science, pages 175–210. Springer Verlag, April 2005.
[4] Frederick R.M. Barnes. Dynamics and Pragmatics for High Performance Concurrency. PhD thesis, University of Kent, June 2003.
[5] F.R.M. Barnes and P.H. Welch. Prioritised dynamic communicating and mobile processes. IEE Proceedings – Software, 150(2):121–136, April 2003.
[6] Inmos Limited. occam 2.1 Reference Manual. Technical report, Inmos Limited, May 1995. Available at: http://wotug.org/occam/.
[7] R. Milner. Communicating and Mobile Systems: the Pi-Calculus. Cambridge University Press, 1999. ISBN-10: 0521658691, ISBN-13: 9780521658690.
[8] F.R.M. Barnes, P.H. Welch, and A.T. Sampson. Barrier synchronisations for occam-pi. In Proceedings of the 2005 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA’2005). CSREA press, June 2005.
[9] C.A.R. Hoare. Communicating Sequential Processes. Prentice-Hall, London, 1985. ISBN: 0-13-153271-5.
[10] A.W. Roscoe. The Theory and Practice of Concurrency. Prentice Hall, 1997. ISBN: 0-13-674409-5.
[11] L.G. Valiant. A bridging model for parallel computation. Communications of the ACM, 33(8):103–111, August 1990.
[12] M. Schweigler. Adding Mobility to Networked Channel-Types. In Proceedings of Communicating Process Architectures 2004, pages 107–126, September 2004. ISBN: 1-58603-458-8.
[13] S. Stepney, P.H. Welch, F.A.C. Pollack, J.C.P. Woodcock, S. Schneider, H.E. Treharne, and A.L.C. Cavalcanti. TUNA: Theory underpinning nanotech assemblers (feasibility study), January 2005. EPSRC grant EP/C516966/1. Available from: http://www.cs.york.ac.uk/nature/tuna/index.htm.
[14] Peter H. Welch and David C. Wood. Higher Levels of Process Synchronisation. In Proceedings of WoTUG 20, pages 104–129. IOS Press, April 1997. ISBN: 90-5199-336-6.
[15] D.C. Wood and J. Moores. User-Defined Data Types and Operators in occam. In Proceedings of WoTUG 22, pages 121–146. IOS Press, April 1999. ISBN: 90-5199-480-X.
[16] M.D. Poole. Extended Transputer Code – a Target-Independent Representation of Parallel Programs. In Proceedings of WoTUG 21, pages 187–198. IOS Press, April 1998. ISBN: 90-5199-391-9.
[17] A.T. Sampson, P.H. Welch, and F.R.M. Barnes. Lazy Simulation of Cellular Automata with Communicating Processes. In J. Broenink, H. Roebbers, J. Sunter, P. Welch, and D. Wood, editors, Communicating Process Architectures 2005. IOS Press, September 2005.
[18] J.C.P. Woodcock and A.L.C. Cavalcanti. The Semantics of Circus. In ZB 2002: Formal Specification and Development in Z and B, volume 2272 of Lecture Notes in Computer Science, pages 184–203. Springer-Verlag, 2002.
[19] F.R.M. Barnes and P.H. Welch. Mobile Data, Dynamic Allocation and Zero Aliasing: an occam Experiment. In Alan Chalmers, Majid Mirmehdi, and Henk Muller, editors, Communicating Process Architectures 2001, volume 59 of Concurrent Systems Engineering, pages 243–264, Amsterdam, The Netherlands, September 2001. WoTUG, IOS Press. ISBN: 1-58603-202-X.
[20] P.H. Welch, G.R.R. Justo, and C.J. Willcock. Higher-Level Paradigms for Deadlock-Free High-Performance Systems. In R. Grebe, J. Hektor, S.C. Hilton, M.R. Jane, and P.H. Welch, editors, Transputer Applications and Systems ’93, Proceedings of the 1993 World Transputer Congress, volume 2, pages 981–1004, Aachen, Germany, September 1993. IOS Press, Netherlands. ISBN: 90-5199-140-1. See also: http://www.cs.kent.ac.uk/pubs/1993/279.
[21] P.H. Welch. Graceful Termination – Graceful Resetting. In Applying Transputer-Based Parallel Machines, Proceedings of OUG 10, pages 310–317, Enschede, Netherlands, April 1989. Occam User Group, IOS Press, Netherlands. ISBN: 90-5199-007-3.
[22] UKCRC. Grand Challenges for Computing Research, 2004. http://www.nesc.ac.uk/esi/events/Grand Challenges/.
[23] J.C.P. Woodcock. Dependable Systems Evolution, May 2004. Available from: http://www.nesc.ac.uk/esi/events/Grand Challenges/proposals/.
[24] R. Sleep. In Vivo ⇔ In Silico: High fidelity reactive modelling of development and behaviour in plants and animals, May 2004. Available from: http://www.nesc.ac.uk/esi/events/Grand Challenges/proposals/.
[25] S. Stepney. Journeys in Non-Classical Computation, May 2004. Available from: http://www.nesc.ac.uk/esi/events/Grand Challenges/proposals/.
Exception Handling Mechanism in Communicating Threads for Java

Gerald H. HILDERINK
Boulevard 1945 – 139, 7500 AE, Enschede, The Netherlands
[email protected]
Abstract. The concept of exception handling is important for building reliable software. An exception construct is proposed in this paper, which implements an exception handling mechanism that is suitable for concurrent software architectures. The aim of this exception construct is to bring exception handling to a high level of abstraction, such that exception handling scales well with the complexity of the system. This is why the exception construct supports a CSP-based software design approach. The proposed exception construct embraces informal semantics, which are nevertheless intuitive and suitable for software engineering. The exception construct is prototyped in the CSP for Java library, called CTJ.

Keywords. Exception handling, CSP, concurrency, real-time, embedded software.
Introduction

Reliable software should deal with all circumstances in its environment which can affect the behaviour of the program. The environment of the program encompasses the computer hardware. Unusual circumstances are exceptional occurrences that can bring the program, when not dealt with, into a state of undesirable behaviour. This causes an exceptional state, which manifests an error or simply an exception. The processes in the program that are affected by the exception should not progress after the occurrence of the exception. Each process that encounters an exception should escape to a handler process that is able to deal with the exception. This handler process is called an exception handler.

Reasoning about the behaviour of the program in the presence of exceptions can be very complex. Branching to an exception handler can occur at many places in the program. Exceptions occurring in exception handlers require branching from exception handler to exception handler. Exceptions are related to the concurrent behaviour of the system: they can occur asynchronously or simultaneously in concurrent systems. Exceptions should be handled by proper design concepts that deal with these complexities. Therefore, a proper concurrency model is inevitable in order to manage the complexity of exception handling.

Proper design concepts can be found in the CSP concurrency model. The CSP concepts provide sufficient abstraction, compositionality and sound semantics, which are very suitable for designing and implementing mission-critical embedded software. However, CSP does not specify a simple solution for describing exception handling. An informal description of an exception construct is presented, which offers a simple solution to handle exceptions in concurrent software architectures while remaining in accordance with CSP terminology. A formal description and analysis in CSP is not part of this paper. The feasibility of the exception construct has been investigated: it has been prototyped in the Communicating Threads for Java (CTJ) library [1; 2]. The method of approach is based on a software engineering perspective.
The notion of exceptions is discussed in Section 1. This notion follows the CSP terminology. The role of the environment of the program and the poisoning of channels and processes are discussed. The concept of exception handling is discussed in Section 2. An example program with nested exception constructs is described in Section 3. Various aspects of the exception construct are discussed in Section 4. Section 5 deals with the conclusions.

1. Exceptions

1.1. Processes, Events, and Channels

An elegant way to design and implement mission-critical software in embedded systems is the use of Communicating Sequential Processes (CSP) concepts [3; 4]. CSP is a theory of programming, which offers formal concepts for describing and reasoning about the behaviour of concurrent systems. Furthermore, CSP offers pragmatic concepts and guidelines for developing reliable and robust concurrent process architectures. These concepts are process-oriented and they offer abstraction, compositionality, separation of concerns and a concurrency model with simple and clean semantics.

CSP is a notation for describing concurrent systems by means of processes and events. Processes are self-contained entities, which are defined in terms of events. Processes do not see each other, but they interact with other processes via channels. An event is an occurrence in time and space. An event represents the rendezvous between two or more processes on which they synchronize together. A process that is willing to engage in an event must wait until another process is also willing to engage in the event. This is called the rendezvous or handshake principle. The rendezvous between two processes that are willing to communicate via a shared channel is called the communication event. A process that successfully terminates engages in a termination event with its subsequent process.

1.2. Process in Exception

An exception is a state in which:

a) an instruction is causing an error and the instruction cannot complete or successfully terminate, e.g. division by zero or an illegal address; or

b) a communication event is refused by the environment of the program, e.g. the channel implementation is malfunctioning.

In either case, the environment in which the program runs cannot let the processes continue after the point of exception. A process in exception will never engage in any event after the exception has been raised in the process; i.e. it behaves as STOP. Furthermore, a process is in exception when at least one of its sub-processes is in exception. Conceptually, the exception construct interrupts the process in exception and it will be replaced by a handler process — the exception handler. If the exception handler terminates, the process terminates normally.

1.3. The Role of the Environment

The role of the environment in the exception handling mechanism is important in order to understand the source of exceptions. For some reason it could happen that a device in hardware (the environment of the program) is malfunctioning and it cannot establish or complete communication. In other words, the communication event is refused by the
environment of the program. The event will never occur and the process may wait forever for the event to happen. This could cause the program to deadlock or livelock. This is an inconvenient circumstance, which manifests an exception. It is more convenient for the program to escape from the exception and to do something useful, e.g. dealing with the exception.

The environment of a program is usually something complex from which the software engineer wants to abstract away. The software engineer is interested in the causality between a misbehaving environment and the behaviour of the program. The CSP channel model supports this view. Figure 1 illustrates two parallel processes communicating via channel c. The figure is called a CSP diagram [1]. The writer process W writes data to the channel and the reader process R reads the data from the channel. The figure does not show exactly when they communicate. The points of communication will be illustrated in Section 2.2. During the design of a program, the environment is not included in the design, but its effect on the design should be considered.
Figure 1. Two parallel processes communicate via a channel (CSP diagram): (a) communication diagram — W writes to R over channel c; (b) composition diagram — W || R.
The abstract role of the environment is illustrated in Figure 2. The figure illustrates that the environment can be depicted as a parallel process, named ENV. The environmental process ENV is listening on channel c and it decides whether or not to participate in the communication event. ENV is dotted to illustrate the role of the environment, but the environment is not part of the design. The environmental process is hidden in the design, but it is an integral part of the implementation of the design. In fact, the channel implementation can be viewed as an environmental process, because the channel implementation can directly control the underlying hardware.

The interface of the channel separates the program from its environment. The processes in the program should only access the devices via channels. This abstraction and separation of concerns keeps the processes free from hardware-dependent code. Of course, this hardware-independence does not extend to the integrity of the processes, i.e. processes still depend on the data provided by the channels (or devices).
Figure 2. The role of the environmental process: (a) the environment accepts c — the system functions properly; (b) the environment refuses c — the malfunctioning channel throws an exception to each of W and R.
In case the channel c breaks, the environment will prevent c from happening. This is illustrated in Figure 2b. Instead, the channel will throw (or raise) an exception to each involved process. The grey arrows indicate the source and destination of throwing
exceptions from the channel implementation to the invoking processes. The channel is modelled as an active partner process in its communications, which can be put into "refusal" and exception-throwing mode, where "exceptions" are just events for which other processes have (CSP) interrupt handlers waiting.

1.4. Poisoning Channels and Processes

A channel being refused by the environment of the program is called a poisoned channel. A poisoned channel will never cause a communication event to happen as long as it is poisoned. A poisoned channel devotes its channel ends to throwing exceptions to the processes that are willing to communicate via its poisoned channel ends. After the exception is thrown, the process is poisoned and it will eventually die; i.e. a poisoned process never engages in any event with its environment and it never terminates normally. Successively, the exception will be caught by the exception construct.

In CTJ, the methods refuse(channel), refuse(channel,exc) and accept(channel) were introduced [2]. These methods are defined by a static call-channel that is connected to the environmental process ENV. Invoking one of the refuse methods will request a "refuse" service from the environmental process ENV. After ENV accepts the request, any communication event on the specified channel will be refused. In case the exception argument exc is specified, the channel will throw exc via the channel end on which a process is willing to perform a read or write. For example, invoking refuse(c,exc) corresponds to the situation as depicted in Figure 2b. We prefer the method name refuse rather than poison, which is in accordance with the CSP terminology of "refusals". The method accept(channel) requests the environmental process to accept communication events on channel, i.e. to undo the poisoning, if possible.

The refuse(..) and accept(..) methods are meant to be used by the implementation of channels or by the underlying kernel. The program could use these methods for studying the effects of poisoning channels on the behaviour of the program, i.e. simulating the effect of malfunctioning devices. In all other situations, we do not encourage these methods to be used by the program. Poisoning channels by processes can be error-prone and therefore it should not be encouraged for deliberately killing processes. For example, poisoning channels by an exception handler could cause exceptions to propagate outside the scope of the exception construct. If this is not desired, poisoning channels is not useful.

The C++CSP library [5; 6] uses a different approach, whereby the poison() method is part of the channel interface. This is called stateful poisoning of channels. A process that is poisoned while attempting to access a poisoned channel must poison all the channels it uses before terminating gracefully. Special functions can be used, which provide channel ends that cannot be used to poison the channel. Processes can choose whether the channel ends they pass to their sub-processes can be poisoned or not.

The refuse() or poison() methods can be misused. It is safe to leave the killing (deliberate poisoning) of processes up to the channels and the exception constructs. A process in exception does not need to poison its channel ends. The poisoning of channel ends is performed by the exception constructs. This mechanism is elaborated in Section 2.
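To make the poisoning idea concrete, here is a toy poisonable rendezvous channel of our own devising (it is not the CTJ implementation). Once refused, every pending and future read or write throws the stored exception:

  // Toy poisonable one-place rendezvous channel (ours, not CTJ).
  final class PoisonableChannel<T> {
    private T slot;
    private boolean full, poisoned;
    private RuntimeException poison;

    // Refuse the channel: wake everyone and make all accesses throw 'exc'.
    synchronized void refuse(RuntimeException exc) {
      poisoned = true;
      poison = exc;
      notifyAll();
    }

    // Undo the poisoning, if possible.
    synchronized void accept() {
      poisoned = false;
      poison = null;
    }

    synchronized void write(T value) throws InterruptedException {
      while (full && !poisoned) wait();
      if (poisoned) throw poison;
      slot = value;
      full = true;
      notifyAll();
      while (full && !poisoned) wait();  // rendezvous: wait for the reader
      if (poisoned) throw poison;
    }

    synchronized T read() throws InterruptedException {
      while (!full && !poisoned) wait();
      if (poisoned) throw poison;
      T value = slot;
      full = false;
      notifyAll();                       // release the writer
      return value;
    }
  }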
1.5. Termination and Resumption

This exception handling approach encompasses two models of exception handling, namely:

Resumption model. The resumption model allows an exception handler to correct the exception and then return to the point where the exception was thrown. This requires that recovery is possible with acceptable overhead costs. The resumption model is most easily understood by viewing the exception handler as an implicit procedure which is called when the exception is raised. The resumption model is also called the retry model [7].

Termination model. In the termination model, control never returns to the point in the program execution where the exception was raised. This results in the executing process being terminated. The termination model is necessary when error recovery is not possible, or difficult to realize, with acceptable overhead costs. The termination model is also called the escape model [7].

Error recovery or resumption is sometimes possible at the level of communication, i.e. by the channels. The channel implementation can detect errors and possibly fix them with if-then-else or try-catch constructs. In case the error is fixed and communication is re-established by the channel, this can be viewed as resumption. In this case, processes are not aware of any exceptions that were fixed. A channel that cannot fix the internal error should escape from resumption. The channel should raise (or throw) an exception via its interface to the process domain. The process domain and the channel domain are depicted in Figure 3.
Figure 3. Process and channel domains. [The channel interface separates the process domain (processes) from the channel domain (channels and devices, where if-then-else and try-catch recovery applies).]
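The distinction can be illustrated at the channel level with a retry loop. This is a sketch of ours, with invented device and exception types: transient faults are retried invisibly inside the channel domain (resumption); if recovery fails, an exception escapes to the process domain and control never returns to the write (termination):

  class TransientFault extends Exception { }

  class ChannelPoisonedException extends RuntimeException {
    ChannelPoisonedException(Throwable cause) { super(cause); }
  }

  interface Device { void transmit(byte[] data) throws TransientFault; }

  final class DeviceChannel {
    private static final int MAX_RETRIES = 3;
    private final Device device;
    DeviceChannel(Device device) { this.device = device; }

    void write(byte[] data) {
      for (int attempt = 1; ; attempt++) {
        try {
          device.transmit(data);  // may fail transiently
          return;                 // fixed internally: the process never notices
        } catch (TransientFault e) {
          if (attempt == MAX_RETRIES) {
            throw new ChannelPoisonedException(e);  // escape to the process domain
          }
        }
      }
    }
  }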
The channel interface separates both domains. The process domain supports the termination model and the channel domain supports both the resumption and the termination models. The exception construct resides in the process domain and supports the termination model.

2. Exception Handling

2.1. Exception Construct

In Hilderink [1], a notation was introduced to describe exception handling in a compositional way. The exception handling is based on an exception construct with a formal graphical syntax, but with informal semantics. The exception construct composes two processes P and EH; borrowing the shape of the CSP interrupt operator that it resembles, we write it here as:

  P ▵ EH    (exception construct)
This process behaves as EH when P is in exception; otherwise it behaves as P. Process P is in exception on the occurrence of an error from which P must not continue. At the point of exception, P behaves as STOP. Process EH is the exception handler. On the occurrence of an exception, the exception construct requires that all the channel ends being claimed by P are released. The exception construct must reckon with a complex composition of sub-processes of P. The released channel ends can be re-claimed by other processes, for example by the exception handler EH. A poisoned channel end cannot be re-claimed as long as it is poisoned. The exception construct resembles the interrupt operator in CSP; we omit a theoretical discussion of the relationship between the formal interrupt operator and the informal exception construct.

Consider the CSP diagram in Figure 1. This example is enhanced with exception constructs. Figure 4 illustrates two different enhancements. The processes are shown transparently. Each compositional diagram depicts a different composition. Figure 4a illustrates the two processes W and R, each guarded by an exception construct. Exception handler EHW deals with the exception at the writer’s side of channel c and EHR deals with the reader’s side of channel c. On exception in c, the processes EHW and EHR run in parallel. Figure 4b illustrates the circumstance where the exception handler EH deals with both sides of channel c. EH could be any sequence of EHW and EHR.
Figure 4. Compositional diagrams enhanced with exception handling: (a) disjoint exception constructs — W guarded by EHW and R guarded by EHR, composed in parallel; (b) joint exception construct — the parallel composition P of W and R guarded by a single EH.
Figure 5 shows an equivalence property between two compositions. The process SKIP doesn’t do anything, except successfully terminating. Since SKIP does not deal with any exception, the exception handler EH will take over.
Figure 5. Equivalent exception compositions. [Guarding P = W || R with SKIP, nested inside a construct guarded by EH, is equivalent to guarding P directly by EH.]
2.2. Exception Handling Mechanism

The conceptual behaviour of the exception handling mechanism of the proposed exception construct is described in this section. The required steps that are performed by the mechanism are explained by a simple example. Furthermore, the channel ends and the scope of the exception construct are explained. The following steps are taken by the exception handling mechanism:

1. Registering. Register each channel end, on its first being invoked by a process, with the associated exception construct. Also, register each nested exception construct with the upper exception construct.

2. Notifying. Notify the exception construct that an exception has occurred; the exception will be collected by the exception construct.

3. Poisoning. Poison the registered channel ends and nested exception constructs. A poisoned exception construct will propagate its poison. All poisoned channel ends that were claimed by a process will be released.

4. Throwing. The channel ends throw NULL exceptions, which propagate via the CSP constructs until they are caught by the exception construct.

5. Healing. Before the exception handler is executed, the registered channel ends and nested exception constructs must be healed; otherwise these channel ends cannot be re-claimed by the exception handler. Those channel ends that belong to poisoned channels remain poisoned and cannot be re-claimed by the exception handler.

6. Handling. The associated exception handler reads the exception set and handles each exception one by one. Exceptions that have been handled by the exception handler must be removed from the set.

Step 1 is performed when no exception has occurred. Steps 2 to 6 are performed on the first occurrence of an exception. Each of these steps is explained in the following example.

The example consists of the processes U, P, T and EH. See Figure 6. Process P is defined by the processes R and S. The communication relationships a, b and c, and the compositional relationships, are depicted in one diagram. The compositional relationships are in grey. Process P is related to the exception handler EH. The channel inputs and outputs are depicted by primitive reader and writer processes, respectively labelled with ‘?’ and ‘!’. These primitive reader and writer processes mark the channel ends of the associated channel. In R and S, the channel ends are related to a sequential composition, which defines: first input, then output. Each exception construct defines a scope to which a group of channel ends is related. This example illustrates that the channel ends in P are in the scope of the nearest exception construct, associated with EH. The channel ends of U and T are not within the scope of the exception construct.

The processes R and S are randomly scheduled on a single processor system. We start with process R. Assume R is performing the input on channel a. Since the start of P, this is the first time this channel end is accessed. On this first access, the channel end is registered with the nearest exception construct. See step 1 in Figure 7. A second access does not require registering, since the channel end was already registered with an exception construct. Note that each thread keeps a reference to the exception construct of which it is part. After S is scheduled, S is willing to input from channel b. This channel end will also be registered with the exception construct, in step 2. Process S is waiting for channel b.
Figure 6. Example of a program consisting of four processes U, P, T and EH.
Figure 7. Registering of channel ends to the exception construct.
In the meantime, something bad has happened with the implementation of channel c. After R is scheduled and has received data from channel a, R is willing to output on channel c. Since channel c is poisoned, its channel ends are also poisoned. Registering a poisoned channel end is not necessary, which saves at least one registering operation. On the output operation, the channel end will notify the exception construct that an internal exception has occurred; see step 3 in Figure 8. The exception is collected by the exception construct.
Figure 8. The channel notifies the exception construct that an exception has occurred.
After the occurrence of an exception has been notified to the exception construct, the exception construct will immediately poison all registered channel ends. A poisoned channel end will release its synchronization with any process. In this example, the registered channel ends
were the input channel end in R and the input channel end in S; see step 4 in Figure 9. The input channel end in S is blocking S, and poisoning it will therefore unblock S. The input channel end in R needs no unblocking, because R no longer claims the channel end of a. The procedure of poisoning the registered channel ends can detect other exceptions in the associated channels. The newly detected exceptions will be collected by the exception construct. It is possible that not all exceptions are detected by this procedure. This is not a problem, since the as yet undetected exceptions will be detected at a later time, or not at all. In the latter case, no harm is done, since these channel ends are never used again.
Figure 9. The exception construct poisons the registered channel ends.
The channel ends of a poisoned channel will throw NULL exceptions to each process that accesses the channel end. These exceptions are passed up the hierarchy of compositional constructs until the associated exception construct is reached. See steps 5 and 6 for process R and steps 7 and 8 for process S in Figure 10.
Figure 10. NULL exceptions are thrown from the channel end up the parallel construct.
The NULL exception does not contain information about the actual exception. Note: the actual exception was already collected in step 3 in Figure 8, and NULL exceptions are not collected. This concept of throwing NULL exceptions provides a mechanism for immediately terminating processes in modern programming languages, such as Java and C++. In case a process performs an illegal instruction, an ordinary exception can be thrown instead of a NULL exception. This exception will be caught by the CSP construct in which the process runs. The CSP construct makes sure that the exception will be collected by the nearest exception construct and that a single NULL exception will be thrown further. This way,
duplicated exceptions are avoided and sets of exceptions do not have to be thrown. Furthermore, compatibility is preserved with the try-catch clauses in Java and C++. The parallel construct will wait until all parallel branches have joined. Subsequently, a NULL exception is passed to the exception construct; see step 9 in Figure 11. The exception construct catches the NULL exceptions and will try to heal the registered channel ends; see step 10. The channel ends of channel b cannot be healed, and they remain poisoned as long as the channel remains poisoned. After healing, the exception construct will perform process EH. EH gets the set of exceptions. The set of exceptions must not be empty; otherwise EH can be ignored. The non-empty set of exceptions must be read by EH. The set of exceptions does not contain NULL exceptions.
Figure 11. The parallel construct throws a NULL exception, which is caught by the exception construct. The exception construct tries to heal its channel ends before EH is executed.
If not all exceptions have been handled when EH terminates, the exception construct will notify the upper exception construct and pass the remaining exceptions to it in the same way as channel ends do (channel ends pass only one exception).

This example illustrates that the processes U and T are not affected by the exception in P. If U and T must terminate due to an exception in P, the program must be designed such that the composition of exception constructs and exception handlers specifies this behaviour. The method refuse() or poison() is not required.

2.3. Example of Nested Exception Constructs

This section illustrates an example of nested exception constructs. The steps of the previously described example are also briefly discussed for this example. The example shows that exception constructs can be composed in various ways, which results in nested behaviours. It will illustrate three kinds of behaviour that can be modelled with this exception handling mechanism. The example is implemented with CTJ.

Figure 12 shows a CSP diagram of the parallel processes, which model a pipeline of communication via the channels a, b and c. The processes EHPQ and EHR are in parallel. The grey arrows in Figure 13 illustrate the registration of channel ends with their exception construct and the registration of lower exception constructs with upper exception constructs. This figure illustrates a complete registration of all elements, i.e. channel ends and nested exception constructs. The same arrows depict the possible paths of notification. The reverse arrows depict the paths of poisoning and healing the registered elements; see Figure 14.
Figure 12. Example of nested exception constructs.
Figure 13. Registering elements to the nested exception constructs.
Figure 14. Poisoning or healing elements.
In case channel a is in exception and process Q is the first process willing to communicate via channel a, this process is the first to go into exception. That is, process Q will stop engaging in any event. The channel end will add the exception to the associated exception construct and throw a NULL exception. This notification starts with the bold arrow between the input of process Q and the exception construct; see Figure 15.
Figure 15. Example of a chain reaction in a nested exception construct.
The exception construct will immediately poison its registered channel ends. The exception remains hidden by the exception construct until the exception handler has dealt with the exception and terminates. The exception cannot be observed by the upper exception construct. In case the exception handler terminates and one or more exceptions were not handled, the exceptions become observable by the upper exception construct. The exception will be notified and passed to the upper exception handler EH; see the chain reaction of the dotted arrows. The upper exception construct will poison all other registered elements. This makes sure that the sub-processes go into exception. After all sub-processes are in exception and the exception construct catches a NULL exception, the registered channel ends will be healed; otherwise EH cannot reclaim the channel ends. See also Figure 14. After healing, the exception handler EH will be executed.

When channel c is poisoned, the exception will be added to the exception construct of EHR or to the exception construct of EH. This choice depends on which thread of control, in R or S, was first to execute a channel end of c. Assume process S was executed before process R; see the bold arrow in Figure 16. This exception starts a chain reaction whereby all processes in the scope of the exception construct will be poisoned. A process that is poisoned before it executes will not execute at all. This can happen to the processes P, Q and R in this example.

In case process R outputs on c before S inputs on c, process EHR will be executed. If EHR uses channel ends, then these channel ends will be poisoned when S is scheduled and tries to input from c. EHR will go into exception. However, EHR can perform communication events in the meantime. Thus, an exception in channel c results in a nondeterministic choice between different traces of events. A trace of events is a sequence of communication events in which a process can engage. If certain traces of events are unwanted, the following measures can be applied in this example:

1. EHR should be designed such that it immediately terminates when an exception occurs on channel c, i.e. it must not engage in any communication event. EH should take care of the exception, not EHR.
2. Process S could be executed at a higher priority than R, which makes the choice of possible traces of events deterministic.
Figure 16. S detects the exception before R on channel c.
3. Example Program

3.1. Source Code of the Program

In this section, the CTJ (Java) code of the example in the previous section is listed. A detailed discussion of the implementation of the exception construct itself is deferred to a later paper.

  public static void main(String[] args) {
    // Declare the channels and channel ends
    final DataChannel a = new DataChannel();
    final ChanIn a_in = a.in();
    final ChanOut a_out = a.out();
    final DataChannel b = new DataChannel();
    final ChanIn b_in = b.in();
    final ChanOut b_out = b.out();
    final DataChannel c = new DataChannel();
    final ChanIn c_in = c.in();
    final ChanOut c_out = c.out();

    // Declare the processes
    Process p = new Process() {
      public void run() throws Exception {
        System.out.println("P: running");
        System.out.println("P: writing to channel a");
        a_out.write(10);
        System.out.println("P: terminated");
      }
    };
    Process q = new Process() {
      public void run() throws Exception {
        System.out.println("Q: running");
        System.out.println("Q: reading from channel a");
        int x = a_in.read(null);
        System.out.println("Q: writing to channel b");
        b_out.write(x);
        System.out.println("Q: terminated");
      }
    };

    Process r = new Process() {
      public void run() throws Exception {
        System.out.println("R: running");
        System.out.println("R: reading from channel b");
        int y = b_in.read(null);
        System.out.println("R: writing to channel c");
        c_out.write(y);
        System.out.println("R: terminated");
      }
    };

    Process s = new Process() {
      public void run() throws Exception {
        System.out.println("S: running");
        System.out.println("S: reading from channel c");
        int z = c_in.read(null);
        System.out.println("S: value = " + z);
        System.out.println("S: terminated");
      }
    };

    // Declare the exception handlers
    Process ehpq = new Process() {
      public void run() throws Exception {
        System.out.println("EHPQ: running");
        LinkedList<Exception> exclist = ExceptionCatch.getExceptionSet();
        //...
        exclist.removeFirst(); // exception is handled, remove from set
        System.out.println("EHPQ: terminated");
      }
    };

    Process ehr = new Process() {
      public void run() throws Exception {
        System.out.println("EHR: running");
        LinkedList<Exception> exclist = ExceptionCatch.getExceptionSet();
        //...
        exclist.removeFirst(); // exception is handled, remove from set
        System.out.println("EHR: terminated");
      }
    };
    Process eh = new Process() {
      public void run() throws Exception {
        System.out.println("EH: running");
        LinkedList<Exception> exclist = ExceptionCatch.getExceptionSet();
        //...
        exclist.removeFirst(); // exception is handled, remove from set
        System.out.println("EH: terminated");
      }
    };

    // Declare the compositional construct
    Process proc = new ExceptionCatch(
      new Parallel(new Process[] {
        new ExceptionCatch(
          new Parallel(new Process[] {p, q}),
          ehpq),
        new ExceptionCatch(
          r,
          ehr),
        s,
      }),
      eh);

    // Poison one or more channels to study the effects
    csp.lang.System.refuse(c, new Exception("Exception in channel c"));

    // Start the program
    try {
      proc.run();
    } catch (Exception ex) {
      java.lang.System.out.println("Exception = " + ex);
    }
    java.lang.System.out.println("\nProgram has terminated");
  }
After a channel is declared, its input and output channel ends must be obtained from the channel using the in() and out() methods, respectively. The processes can read from an input channel end and write to an output channel end. The references to the channel ends are final, so the channel ends may be used directly by the processes. This makes the use of constructors superfluous and keeps the program compact (for the purpose of this paper). The exception construct is implemented by the process ExceptionCatch. An exception handler must retrieve the set of exceptions with:

  LinkedList<Exception> exclist = ExceptionCatch.getExceptionSet();
The getExceptionSet() method is a read-only static method; it returns the set of exceptions. Note: the ExceptionCatch plays the role of a call-channel. Any process can invoke the getExceptionSet() method, but only exception handlers can retrieve the set of exceptions; otherwise the set will be empty. This also implies that the set of exceptions can be retrieved by parallel exception handlers associated with the same exception construct.
The exception handler can retrieve the first exception in the set with exclist.getFirst(). Since the set is a LinkedList object, other useful methods are available. After an exception has been handled, it must be removed from the set with exclist.removeFirst(), as shown in the example, or with the other methods the list provides. Care must be taken to prevent a race condition in which elements are deleted simultaneously; therefore, parallel exception handling must be disjoint. Handling an exception twice is asking for trouble anyway.
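To make the handling pattern concrete, an exception handler body might drain the set along the following lines. This is only a sketch, using the LinkedList type and the getExceptionSet() method shown above; the printing is purely illustrative:

  // Sketch: drain the exception set one exception at a time.
  // Assumes this handler has disjoint access to the set, as required above.
  LinkedList<Exception> exclist = ExceptionCatch.getExceptionSet();
  while (!exclist.isEmpty()) {
      Exception ex = exclist.getFirst();    // inspect the first exception
      java.lang.System.out.println("handling: " + ex);
      exclist.removeFirst();                // handled, so remove it from the set
  }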
3.2. Results

In case channel c is poisoned, the possible paths of abnormal termination are given in Table 1.

Table 1. Output of the program with channel c poisoned.

Result 1:
  Q: running
  Q: reading from channel a
  P: running
  P: writing to channel a
  P: terminated
  Q: writing to channel b
  R: running
  R: reading from channel b
  R: writing to channel c
  EHR: running
  EHR: terminated
  Q: terminated
  S: running
  S: reading from channel c
  EH: running
  EH: terminated

  Program has terminated

Result 2:
  S: running
  S: reading from channel c
  EH: running
  EH: terminated

  Program has terminated

Result 3:
  Q: running
  Q: reading from channel a
  P: running
  P: writing to channel a
  P: terminated
  Q: writing to channel b
  S: running
  S: reading from channel c
  EH: running
  EH: terminated

  Program has terminated
4. Discussion

The steps performed by the implementation of the exception construct and the channel ends are concurrent paths of execution. These paths of execution must be properly synchronized. The result is a multithreaded object-oriented framework that is too detailed for the human mind to reason about directly. Fortunately, the exception construct encapsulates this complex and hazardous framework and turns it into a simple and secure design pattern.

The exception handling mechanism has been carefully designed such that the overhead is reasonably low. The overhead is allotted to the process of registering, poisoning and healing channel ends and nested exception constructs. A program that does not move channel ends or processes around registers its channel ends and its lower exception constructs only once. For a program that is never in exception, the cost is one instruction (i.e. a Boolean check which remains false) for each channel communication and for entering or leaving an exception construct. If the program never goes into exception, the exception constructs can be removed from the composition; this design decision lowers the overhead even further. In most cases, one outer exception construct is always present; for example, this outer exception construct prints the strings of exceptions on the console provided by the operating system.
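To make the claimed overhead concrete, the fast path of a channel-end operation can be pictured roughly as follows. This is an illustrative sketch only, not the CTJ implementation; the names registered, poisoned, register(), notifyException() and NULL_EXCEPTION are hypothetical:

  // Hypothetical fast path of a channel-end write; the read side is analogous.
  // In a program that never goes into exception, only the Boolean checks run.
  void write(Object value) throws Exception {
      if (!registered) {             // first access only (step 1: registering)
          register();                // register with the nearest exception construct
          registered = true;
      }
      if (poisoned) {                // remains false while no exception has occurred
          notifyException();         // step 2: notifying the exception construct
          throw NULL_EXCEPTION;      // step 4: propagate up the CSP constructs
      }
      // ... normal channel rendezvous and data transfer ...
  }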
After the channel ends and nested exception constructs are registered with the upper exception construct, the process of poisoning or healing by the upper exception construct is based on a short list of elements. Poisoning and healing is straightforward, deterministic and lightweight.

There can be more than one path of abnormal termination for a single exception. The performance of each path of abnormal termination needs to be taken into account for real-time systems. As long as the traces of events are deterministic, the delays will be deterministic. Poisoning channels and processes via the exception construct is faster than graceful termination [8] and faster than poisoning channel ends by processes [6].

The read and write operations can be viewed as illegal instructions. Hence, throwing exceptions by channels is similar to throwing exceptions by illegal instructions. Therefore this approach does not conflict with the ordinary try-catch mechanism in Java or C++. The application programming interface (API) was not affected by adding the exception construct to CTJ. The protected interfaces of the channel ends required a few additional methods for poisoning and healing the channel ends. These methods are invisible to the user.

A process that performs an infinite loop and does not invoke channel ends cannot be poisoned. In this circumstance the method Expr.evaluate(Boolean expression) can be used in while(..) statements. Normally, the method returns the result of the Boolean expression. The surrounding exception construct can poison the method so that it will throw a NULL exception; the loop will then terminate immediately.

In future work, the implementation of the exception construct needs to be formalized and model-checked in order to prove that the implementation is free from pathological problems, such as race hazards, deadlock or livelock.

The alternative construct was not discussed in the examples. The alternative construct has been adapted to support asynchronous exceptions. It has the simple task of not performing when at least one guard is poisoned. This is obvious, since no legitimate choice can be made when a guard is poisoned. In CTJ, a channel end can play the role of a guard. The exception of each poisoned guard must be notified to the surrounding exception construct, which collects all the exceptions. Subsequently, the alternative construct will throw a NULL exception.

5. Conclusions

We have implemented and presented a simple exception construct in CTJ for capturing exceptions in concurrent systems. The steps that are required to perform the exception handling mechanism were discussed. The concept of poisoning channels and processes is intuitive and easy to understand. The behaviour of exception handling is attributed to the composition of constructs. This approach is justified in CSP terms. The semantics of this exception construct is informal and needs to be formalized in CSP. A full CSP description is in our future work plans. Researchers are invited to contribute.

Acknowledgements

The author wants to thank Peter Welch for his comments and input. Thoughts have been exchanged about formalizing this exception construct in CSP.
References

[1] G.H. Hilderink, Managing Complexity of Control Software through Concurrency, PhD Thesis, Laboratory of Control Engineering, University of Twente, ISBN 90-365-2204-8, 2005.
[2] G.H. Hilderink and J.F. Broenink, Sampling and Timing: a Task for the Environmental Process, Communicating Process Architectures 2003, J.F. Broenink and G.H. Hilderink (Eds.), IOS Press, Volume 61, Concurrent Systems Engineering Series, September 2003.
[3] C.A.R. Hoare, Communicating Sequential Processes, Prentice-Hall, London, UK, 1985.
[4] A.W. Roscoe, The Theory and Practice of Concurrency, Series in Computer Science, C.A.R. Hoare and R. Bird (Eds.), Prentice-Hall, 1998.
[5] N.C.C. Brown, C++CSP Networked, Communicating Process Architectures 2004, I.R. East, J.M.R. Martin, P.H. Welch, D. Duce and M. Green (Eds.), IOS Press, Volume 62, Concurrent Systems Engineering Series, pp. 185-200, September 2004.
[6] N.C.C. Brown and P.H. Welch, An Introduction to the Kent C++CSP Library, Communicating Process Architectures 2003, J.F. Broenink and G.H. Hilderink (Eds.), IOS Press, Volume 61, Concurrent Systems Engineering Series, pp. 139-156, September 2003.
[7] A. Burns and A. Wellings, Real-Time Systems and their Programming Languages, International Computer Science Series, Addison-Wesley, 1990.
[8] P.H. Welch, Graceful Termination – Graceful Resetting, Applying Transputer-Based Parallel Machines, Proceedings of OUG 10, pp. 310-317, occam User Group, IOS Press, Enschede, Netherlands, April 1989.
R16: A New Transputer Design for FPGAs

John JAKSON
Marlboro MA, USA
[email protected], [email protected]

Abstract. This paper describes the ongoing development of a new FPGA-hosted Transputer using a Load Store RISC style Multi Threaded Architecture (MTA). The memory system throughput is emphasized as much as the processor throughput, using the recently developed Micron 32MByte RLDRAM, which can start fully random memory cycles every 3.3ns with 20ns latency when driven by an FPGA controller. The R16 shares an object-oriented Memory Manager Unit (MMU) amongst multiple low-cost Processor Elements (PEs) until the MMU throughput limit is reached. The PE has been placed and routed at over 300MHz in a Xilinx Virtex-II Pro device and uses around 500 FPGA basic cells and 1 Block RAM. The 15-stage pipeline uses 2 clocks per instruction to greatly simplify the hardware design, which allows for twice the clock frequency of other FPGA processors. There are instruction and cycle accurate simulators as well as a C compiler in development. The compiler can now emit optimized small functions needed for further hardware development, although compiling itself requires much work. Some occam and Verilog language components will be added to the C base to allow a mixed occam and event-driven processing model. Eventually it is planned to allow occam or Verilog source to run as software code or be placed as synthesized co-processor hardware attached to the MMU.

Keywords. Transputer, FPGA, Multi Threaded Architecture, occam, RLDRAM
Introduction

The initial development of this new Transputer project started in 2001 and was inspired by post-Transputer papers and articles by R. Ivimey-Cook [1], P. Walker [2], R. Meeks [3] and J. Gray [4] on what could follow the Transputer and whether it could be resurrected in an FPGA. J. Gray concluded that this was unlikely to succeed; he also suggested a 4-way threaded design as a good candidate for implementation in FPGA. In 2004 M. Tanaka [5] described an FPGA Transputer with about 25 MHz of performance, limited by the long control paths in the original design. By contrast, DSPs in FPGA can clock at 150 MHz to 300 MHz and are usually multi-threaded by design. Around 2003, Micron [6] announced that the new RLDRAM was in production, the first interesting DRAM in 20 years. It was clear that if a processor could be built like a DSP, it might just run as fast as one in FPGA.

It seems the Transputer was largely replaced by the direct application of FPGAs and DSPs, and by more recent chips such as the ARM and MIPS families. Many of the original Transputer module vendors became FPGA, DSP or networking hardware vendors. The author concludes that the Transputer 8-bit opcode stack design was reasonable when CPUs ran close to the memory cycle time, but became far less attractive when many instructions could be executed in each memory cycle with large amounts of logic available. The memory-mapped register set or workspace is still an excellent idea, but the implementation prior to the T9000 paid a heavy price for each memory register access. The real failure was not having process independence. Inmos should have gone fabless when that trend became clear, and politics probably interfered too. Note that the author was an engineer at Inmos during 1979-1984.
In this paper, Section 1 sets the scene on memory computing versus processor computing. Section 2 gives our recipe for building a new Transputer, with an overview of the current status of its realization in Section 3. Section 4 outlines the instruction set architecture and Section 5 gives early details of the C compiler. Section 6 details the processor elements, before some conclusions (with a pointer to further information) in Section 7.

1. Processor Design – How to Start

1.1 Processor First, Memory Second

It is usual to build processors by concentrating on the processor first and then building a memory system with high performance caches to feed the processor's bandwidth needs. In most computers, which means in most PCs, the external memory is almost treated as a necessary evil, a way to make the internal cache look bigger than it really is. The result is that the processor imposes strong locality-of-reference requirements on the unsuspecting programmer at every step. Relatively few programs can be constructed with locality in mind at every step, but media codecs are one good example of specially tuned cache-aware applications.

It is easy to observe what happens when data locality is nonexistent, by issuing continuous random memory access patterns across the entire DRAM (a small sketch of such a measurement closes Section 1.3). The performance of a 2GHz Athlon (XP2400) with 1GByte of DDR DRAM can be reduced to about 300ns per memory access, even though the DRAMs are much faster than that. The Athlon typically includes Translation Look-aside Buffers (TLBs) with 512 ways for both instruction and data references, with an L1 cache of 16 Kbytes and an L2 cache of 256 Kbytes or more. While instruction fetches can exhibit extremely good locality, data accesses for large-scale linked lists, trees and hash tables do not. A hash table value insertion might take 20 cycles in a hand cycle count of the source code but actually measures 1000 cycles in real time. Most data memory accesses do not involve very much computation per access. To produce good processor performance when using low-cost high-latency DRAM, it is necessary to use complex multilevel cache hierarchies, with TLBs hiding a multilevel page table system and with Operating System (OS) intervention occurring on TLB and page misses.

1.2 DRAM, a Short History

The Inmos 1986 data book [7] first described the T414 Transputer alongside the SRAM and DRAM product lines. The data book describes the first Inmos CMOS DRAM, the IMS2800. Minimum random access time and full cycle time were 60ns and 120ns respectively for 256Kbits. At the same time the T414 also cycled between 50ns and 80ns; they were almost matched. Today, almost 20 years later, the fastest DDR DRAM cycles about twice as fast, with far greater I/O bandwidth, and is now a clocked synchronous design storing 1Gbit. Twenty years of Moore's law were used to quadruple the memory size at a regular pace (about every 3 years), but cycle performance only improved slightly. The reasons for this are well known, and were driven by the requirement for low-cost packaging. Since the first 4Kbit 4027 DRAM from Mostek, the DRAM has used a multiplexed address bus, which means multiple sequenced operations at the system and PCB level. This severely limits the opportunities for system improvement. Around the mid 1990s, IBM [8] and then Mosys [9] described high performance DRAMs with cycle times close to 5ns. These have been used in L3 cache and embedded in many Application Specific Integrated Circuits (ASICs).
In 2001 Micron and Infineon announced the 256Mbit Reduced Latency DRAM (RLDRAM) for the Network Processor Unit (NPU) industry, targeted at processing network packets. This not only reduced the minimum cycle time from 60ns to a maximum cycle time of 20ns, it threw the address multiplexing out in favor of an SRAM access structure. It further pipelined the design so that new accesses could start every 2.5ns on 8 independent memory banks. This product has generated little interest in the computer industry, because of the focus on cache-based single-threaded processor design and the continued use of low-cost generic DRAMs.

1.3 Memory First, Processor Second

In the reverse model, the number of independent uncorrelated accesses into the largest affordable memory structure is maximized, and the system is then given enough computing resources to make it work. Clearly this is not a single-threaded model but requires many threads, and these must be communicating and highly interleaved. Here, the term process is used in the occam sense, and thread is used in the hardware sense, to carry processes while they run on a processor. This model can be arbitrarily scaled by replicating the whole memory-processor model. Since the memory throughput limit is already reached, additional processors must share higher-order memory object space via communication links – the essence of a Transputer.

With today's generic DRAMs the maximum issue rate of true random accesses is somewhere between a 40ns and 60ns rate, which is not very impressive compared to the Athlon best case of 1ns L1 cache, but is much better than the outside 300ns case. The typical SDRAM has 4 banks but is barely able to operate 1.5 banks concurrently; the multiplexing of address lines interferes with continuous scheduling. With RLDRAMs the 60ns figure can be reduced to 3.3ns with an FPGA controller giving 6 clocks or 20ns latency, which is less than the 25ns instruction elapsed microcycle period. An ASIC controller can issue every 2.5ns with 8 clocks latency. The next generation RLDRAM3 scales the clock by 4/3 to 533MHz and the issue rate to just below 2ns, with 15ns latency and 64Mbytes. There may even be a move to much higher banking ratios, as requested by customers; any number of banks greater than 8 helps reduce collisions and pushes performance closer to the theoretical limit. The author suggests the banking should follow the DRAM core arrays, which means 64K banks for 16Kbit core arrays, or at least many times the latency/command rate.

Rambus is now also describing XDR2, a hoped-for successor to XDR DRAM with some threading but still long latency. Rambus designs high performance system interfaces and not DRAM cores – hence no latency reduction. Rambus designed the XDR memory interface for the new Playstation3 and Cell processor. There are other modern DRAMs, such as Fast Cycle DRAM (FCDRAM), but these do not seem to be so intuitive in use. There are downsides, such as bank and hash address collisions that will waste some cycles, and no DIMM modules can be used. This model of computing can also still work with any memory type, but with different levels of performance. It is also likely that a hierarchy of different memory types can be used, with FPGA Block RAM innermost, plus external SRAM or RLDRAM, and then outermost SDRAM. This has yet to be studied; combining the benefits of RLDRAM with the lower cost and larger size of SDRAM would look like a 1Million-way TLB.
It isn't possible to compete with mainstream CPUs by building more of the same; but turn the table upside down, by compiling sequential software into slower highly parallel hardware in combination with FPGA Transputing, and things get interesting.
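The locality effect described in Section 1.1 is easy to reproduce with a pointer-chasing loop, in which every load depends on the previous one and lands on an effectively random line. The following is a minimal sketch (shown in Java purely for illustration; the array size and seed are arbitrary):

  // Build one random cycle through a 64MB array (Sattolo's algorithm),
  // then chase it: every load depends on the previous one and hits an
  // effectively random line, defeating the caches and TLBs.
  import java.util.Random;

  public class PointerChase {
      public static void main(String[] args) {
          final int n = 1 << 24;                // 16M ints = 64MB, bigger than any cache
          int[] next = new int[n];
          for (int i = 0; i < n; i++) next[i] = i;
          Random rnd = new Random(42);
          for (int i = n - 1; i > 0; i--) {     // Sattolo: yields a single cycle
              int j = rnd.nextInt(i);           // j in [0, i-1], never i itself
              int t = next[i]; next[i] = next[j]; next[j] = t;
          }
          long t0 = System.nanoTime();
          int p = 0;
          for (int k = 0; k < n; k++) p = next[p];  // each load depends on the last
          long t1 = System.nanoTime();
          System.out.println((t1 - t0) / (double) n + " ns per access (sink " + p + ")");
      }
  }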
1.4 Transputer Definition

In this paper, a Transputer is defined as a scalable processor that supports concurrency in the hardware, with support for processes, channels and links based on the occam model. Object creation and access protection have been added, which protects processes and makes them easier to write and validate. When address overflows are detected, the processor can use callbacks to handle the fault, or the process can be stopped or other action taken.

1.5 Transputing with New Technology

The revised architecture exploits FPGA and RLDRAM with multi-threading, multiple PEs and an inverted-page MMU. Despite these changes, the parallel programming model is intended to be the same or better, but the changes do affect programming languages and compilers in the use of workspaces and objects. Without FPGAs the project could never have been implemented. Without multi-threading and RLDRAM, the other changes could not have occurred and the FPGA performance would be much poorer.

1.6 Transputing at the Memory Level

Although this paper is presented as a Transputer design, it is also a foundation design that could support several different computing styles, with multiple or single PEs per MMU. The Transputer as an architecture exists mostly in the MMU design, which is where most of the Transputing instructions take effect. Almost all instructions that define occam behaviour involve selective process scheduling and/or memory moves through channels or links, and all of this is inside the MMU. The PEs start the occam instructions and then wait for the MMU to do the job, usually taking a few microcycles and always fewer than 20. The PE thread may be swapped by the process opcodes as a result.

1.7 Architecture Elements

The PE and MMU architectures are both quite detailed and can be described separately. They can be independently designed, developed, debugged, modeled and even replaced by alternate architectures. Even the instruction set is just another variable. The new processor is built from a collection of PEs and a shared MMU, adding more thread slots until the MMU memory bandwidth limit is reached. The PE-to-MMU ratio varies with the type of memory attached to the MMU and the accepted memory load. The ratio can be higher if PEs are allowed to wait their turn on memory requests. The number of Links is treated the same way; more Links demand more MMU throughput, with less available for the PEs. A Link might be viewed as a small specialized communications PE, or Link Element (LE), with a physical I/O port of an unspecified type. Indeed, a Transputer with no PEs but many LEs would make a router switch. Another type of attached cell would be a Co-processor Element (CE); this might be an FPU or a hardware-synthesized design.

1.8 Designing for the FPGA

The new processor has been specifically targeted at FPGAs, which are much harder to design for because many limits are imposed. The benefit is that one or more Transputers can be embedded into a small FPGA device with room to spare for other hardware structures, at a potentially low cost nearing $1 per PE, based on 1 Block RAM and about
500 LUTs. The MMU cost is expected to be several times that of a single PE, depending on the capabilities included. Unfortunately the classic styles of CPU design – even RISC designs – transferred to FPGA do not produce great results, and designs such as the Xilinx MicroBlaze [10] and the Altera NIOS [11] hover in the 120-150 Mips zone. These represent the best that can be done with a Single Threaded Architecture (STA), aided by vendor insight into their own FPGA strengths. The cache and paging model is expensive to implement too. An obvious limit is the 32-bit ripple add path, which gives a typical 6ns limit. The expert arithmetic circuit designer might suggest carry select, carry look-ahead and other well known speed-up techniques [12], but these introduce more problems than they solve in FPGA. Generally, VLSI transistor-level designs can use elaborate structures with ease; a 64-bit adder can be built in 10 levels of logic in 1ns or less. FPGAs, by contrast, force the designer to use whatever repeated structure can be placed into each and every LUT cell – nothing more and nothing less. A 64-bit ripple adder will take 12ns or so. Using better logic techniques means using plain LUT logic, which adds lots of fanout and irregularity. A previous PE design tried several of these techniques, and each consumed disproportionate amounts of resources in return for a modest speed-up over a ripple add. Instead, the best solution seems to be to pipeline the carry half way and use a 2-cycle design (sketched at the end of Section 1.9). This uses just over half the hardware at twice the clock frequency; now 2 PEs can be had with a doubling of thread performance.

1.9 Threading

The real problem with FPGA processor design is the sequential combinatorial logic: an STA processor must make a number of decisions at each pipeline clock, and these usually need to perform the architecture-specified width addition in 1 clock, along with detecting branch conditions and getting the next instruction ready just in time – difficult even in VLSI. Threading has been known about since the 1950s and has been used in several early processors such as the Control Data Corp CDC 6600. The scheme used here is Fine Grained or Vertical Multi Threading, which is also used by the Sun Niagara (SPARC ISA), Raza (MIPS ISA) and the embedded Ubicom products [13, 14, 15]. The last two are focused on network packet processing and wireless systems. The Niagara will upgrade the SPARC architecture for throughput computing in servers. A common thread between many of these is 8 PEs, each with 4-way threading, sharing the classic MMU and cache model. The applications for R16 are open to all comers using FPGA or Transputer technology.

The immediate benefit of threading a processor is that it turns it into a DSP-like engine, with the decision making logic given N times as many cycles to determine big changes in flow. It also helps simplify the processor design; several forms of complexity are replaced by a more manageable form of thread complexity, which revolves around a small counter. A downside to threading is that it significantly increases pressure on traditional cache designs, but in R16 it helps the MMU spread out references into the hashed address space. Threading also lets the architect remove advanced STA techniques such as Super Scalar, Register Renaming, Out-of-Order Execution and Very Long Instruction Word (VLIW), because they are irrelevant to MTA.
The goal is not to maximise PE performance at all costs; instead it is to obtain maximum performance for a given logic budget, since more PEs can be added to make it up. More PE performance simply means fewer PEs can be attached to the MMU for the same overall throughput: the MMU memory bandwidth is the final limit.
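As promised in Section 1.8, the carry-pipelined 2-cycle adder can be illustrated with a small behavioural model. The real thing is FPGA logic, not software; this Java sketch only shows how the 32-bit sum is split into two 16-bit halves with the carry held in a pipeline register between clocks:

  static int carryReg;                          // pipeline register between the two cycles

  static int addLow(int a, int b) {             // cycle 1: low 16 bits
      int lo = (a & 0xFFFF) + (b & 0xFFFF);
      carryReg = lo >>> 16;                     // registered carry into the high half
      return lo & 0xFFFF;
  }

  static int addHigh(int a, int b, int lo) {    // cycle 2: high 16 bits plus carry
      int hi = (a >>> 16) + (b >>> 16) + carryReg;
      return (hi << 16) | lo;                   // 32-bit sum, modulo 2^32
  }
  // Usage: int lo = addLow(a, b); int sum = addHigh(a, b, lo);

Each half is a short 16-bit ripple path, so the clock can run at roughly twice the rate a full 32-bit ripple adder would allow; with 4-way threading, the two cycles of one instruction simply occupy one microcycle.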
1.10 Algorithms and Locality of Reference – Big O or Big Oh

Since D. Knuth first published 'The Art of Computer Programming', Volumes 1-3 [16], from 1962, these tomes have been regarded as a bedrock of algorithms. The texts describe many algorithms and data structures using a quaint MIX machine to run them, with the results measured and analyzed to give big O notation expressions for the cost function. This was fine for many years, while processors executed instructions in the same ballpark as the memory cycle time. Many of these structures are linked-list or hashing type structures and do not exhibit much locality when spread across a large memory, so the value of big O must be questioned. This is one of the most important ideas in computing: random numbers cannot be used in indexing or addresses without paying the locality tax, except on very small problems.

1.11 Pentium Grows Up

When the 486 and then the Pentium-100 were released, a number of issues regarding the x86 architecture were cleaned up: the address space went to a flat 32-bit space, segments were orphaned, and a good selection of RISC-like instructions became 1-cycle codes. The Pentium offered a dual data path, presenting even more hand-optimization possibilities. This change came with several soft-cover optimization texts by authors such as M. Abrash [17], and later M. Schmit [18] and R. Booth [19], that concentrated on making some of the heavier material in Knuth and Sedgewick [20] usable in the x86 context. At this time the processors clocked near 100MHz, were still only an order of magnitude faster than the DRAM, and caches were much smaller than today. The authors demonstrated assembly coding techniques to hand-optimize for all aspects of the processor as they understood it. By the time the Out-of-Order Pentium Pro arrived, the cycle counting game came to an end. Today we don't see these texts any more; there are too many variables in the architecture between Intel, AMD and others to keep up. Few programmers would want to optimize for 10 or more processor variations, some of which might have opposing benefits. Of course these are all STA designs.

Today there is probably only one effective rule: memory operations that miss the cache are hugely expensive, and even more so as the miss reaches the next cache level and the TLBs; but all register-to-register operations, and even quite a few short branches, are more or less free. In practice the processor architects took over the responsibility of optimizing the code actually executed by the core, by throwing enough hardware at the problem to keep the IPC from free-falling as the cache misses went up. It is now recognized by many that as the processor frequency goes up, the usual trick of pushing the cache size up with it doesn't work anymore, since the predominant area of the chip is cache, which leaks. Ironically, DRAM cells (which require continued refreshing) leak orders of magnitude less than SRAM cells: now if only they could just cycle faster (and with latency hiding, they effectively can). All this makes measuring the effectiveness of big O notation somewhat questionable if many of the measured operations are hundreds of times more expensive than others. The current regime of extreme forced locality must force software developers either to change their approach and use more localized algorithms, or to ignore it. Furthermore, most software running on most computers is old, possibly predating many processor generations, the operating system particularly so.
While such software might occasionally get recompiled with a newer compiler version, most of the source code and data structures were likely written with the 486 in mind rather than the Athlon or P4. In many instances, programmers are becoming so isolated from the processor that they cannot do anything
about locality … consider that Java and .NET use interpreted bytecodes with garbage-collecting memory management and many layers of software in the standard APIs. In the R16, the PEs are reminiscent of the earlier processors, when instructions cycled at DRAM speeds. Very few special optimizations are needed to schedule instructions, other than common-sense general cases, making big O usable again. With a cycle accurate model, the precise execution of an algorithm can be seen; program cycles can also be estimated by hand quite easily from measured or traced branch and memory patterns.

2. Building a New Transputer in 8 Steps

Acronyms: Single-Threaded Architecture (STA), Multi-Threaded Architecture (MTA), Virtual Address (VA), Physical Address (PA), Processor Element (PE), Link Element (LE), Co-processor Element (CE).

[1] Change the STA CPU to an MTA CPU.
[2] Change STA memory to MTA memory.
[3] Hash VA to PA, to spread PAs over all banks equally.
[4] Reduce the page size to something useful, like a 32-byte object.
[5] Hash the object reference (or handle) with the object linear addresses for each line.
[6] Use objects to build processes, channels, trees, linked lists, hash tables, queues.
[7] Use lots of PEs with each MMU; also add custom LEs and CEs.
[8] Use lots of Transputers.
In step 1, the single-threaded model is replaced by the multi-threaded model; this removes considerable amounts of design complexity in hazard detection and forwarding logic, at the cost of threading complexity and thread pressure on the cache model.

In step 2, the entire data cache hierarchy and address-multiplexed DRAM is replaced by MTA DRAM or RLDRAM, which is up to 20 times faster than SDRAM.

In step 3, the virtual-to-physical address translation model is replaced by a hash function that spreads linear addresses to completely uncorrelated address patterns, so that all address lines have an equal chance to be different. This allows any lg(N) address bits to be used for the bank select of an N-way banked DRAM with the least amount of undesired collisions. This scheme is related to an Inverted Page Table MMU, where the tables point to conventional DRAM pages of 4 Kbyte or much larger and use chained lists rather than rehashing.

In step 4, the page size is reduced to something the programmer might actually allocate for the tiniest useful object: a string of 32 bytes, a linked-list atom or a hash table entry. This 32-byte line is also convenient for use as the burst block transfer unit, which improves DRAM efficiency by using DDR to fetch 4 sequential 64-bit words in 2 clocks, which is 1 microcycle. At this level, only the Load and Store operations use the bottom 5 address lines to select parts of lines; otherwise the entire line is transferred to the ICache, or to and from the RCache, and possibly to and from outer levels of MTA SDRAM.
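Steps 3 and 4 can be pictured together with a small address-decomposition sketch. The mixer below is only a placeholder for the real hash of Section 3.6, and the 8-bank assumption is just an example:

  // Illustrative decomposition of a virtual address under steps 3 and 4.
  static int spread(int lineIndex) {            // placeholder for the real hash
      return lineIndex * 0x9E3779B9;            // any good mixing function will do here
  }

  static void decompose(int va) {
      int byteInLine = va & 0x1F;               // bottom 5 bits: used only by Load/Store
      int hashedLine = spread(va >>> 5);        // uncorrelated physical line address
      int bank       = hashedLine & 0x7;        // lg(8) = 3 bits select one of 8 banks
      int row        = hashedLine >>> 3;        // remaining bits address within the bank
      System.out.println("bank " + bank + ", row " + row + ", byte " + byteInLine);
  }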
In step 5, objects are added by including a private key, handle or reference in the hash calculation. This is simply a unique random number assigned to the object reference when the object is created by New[], using a pseudo-random number generator (PRNG) combined with some software management. The reference must be returned to Delete[] to reverse the allocation steps. The price paid is that every 32-byte line requires a hit flag to be set or cleared. Allocation can be deferred until the line is needed.

In step 6, the process, channel and scheduler objects are created, using the basic storage object. At this point the MMU has minimal knowledge of these structures but has some access to a descriptor just below index 0, and this completes a basic Transputer. Other application-level objects might use a thin STL-like wrapper. Even the Transputer occam support might now be in firmware running on a dedicated PE or thread, perhaps customized for the job of editing the linked lists of the schedule.

In step 7, multiple PEs are combined with each MMU to boost throughput until the bandwidth is stretched. Mix and match PEs with other CEs and LEs to build an interesting processor. A CE could be a computing element, such as an FPU from QinetiQ [21], or a co-processor designed in occam or Verilog that might run as software and then be switched to a hardware design. An LE is some type of Link Element, Ethernet port, etc. All elements share the physical memory system, but all use private protected objects, which may be privately shared through the programming model.

In step 8, lots of Transputers are combined, first inside the FPGA, then outside, to further boost performance using custom links and the occam framework. But remember that FPGAs offer the best value for money in the middle-size parts and the slower speed grades. While the largest FPGA may hold more than 500 Block RAMs, it is limited to 250 PEs before including MMUs, and would likely be starved for I/O pins for each Transputer MMU-to-memory port. Every FPGA has a limit on the number of practical memory interfaces that can be hosted, because each needs specialized clock resources for high-speed signal alignment. Some systolic applications may be possible with no external memory for the MMU, instead using spare local Block RAMs; in these cases, many Transputers might be buried in an FPGA, if the heat output can be managed. Peripheral Transputers might then manage external memory systems. The lack of internal access to external memory might be made up for by using more Link bandwidth over wider connections.

3. Summary of Current Status

3.1 An FPGA Transputer Soft Core

A new implementation of a 32-bit Transputer is under development, targeted at design in FPGA at about 300MHz but also suitable for ASIC design at around 1GHz. Compared to the last production Transputers, the new design is probably 10 to 40 times faster per PE in FPGA, and can be built as a soft core for several dollars' worth of FPGA resources, and much less in an ASIC, ignoring the larger NRE issue.

3.2 Instruction Set

The basic instruction word format is 16 bits, with up to 3 optional 16-bit prefixes. The instruction set is very simple, using only 2 formats: the primary 3-register RRR form, and the 1-register-with-literal RL form. The prefix can be RRR or RL and follows the meaning of the final opcode. Prefixes can only extend the R and L fields. The first prefix has no cycle penalty, so most instructions with 0 or 1 prefix take 1 microcycle.
The R register specifier can select 8, 64, 512, or 4096 registers mapped onto the process workspace (using 0-3 prefixes). The register specifier is an unsigned offset from the frame pointer (fp). The lower 64 registers offset from fp are transparently cached in the register cache, to speed up most RRR or RL opcodes to 1 microcycle. Register references above 64 are accessed from the workspace memory using hidden load and store cycles. Aliasing between the registers in memory and the register cache is handled by the hardware. From the compiler's and programmer's point of view, registers only exist in the workspace memory, and the processor is a memory-to-memory design.

By default, pointers can reach anywhere in the workspace (wp) data side and, with another object handle, anywhere through other objects. Objects or workspace pointers are not really pointers in the usual sense, but the term is used to keep familiarity with the original Transputer term. For most opcodes, wp is used implicitly as a workspace base by composing or hashing with a linear address calculation.

Branches take 1, 2, or several microcycles respectively if not taken, taken near, or taken far outside the instruction cache. Load and Store will likely take 2 microcycles. Other system instructions may take longer. The instructions conform to recent RISC ISA thinking by supplying components rather than solutions; these can be assembled into a broad range of program constructs. Only a few very simple hand-prepared programs have been run so far on the simulators while the C compiler is readied. These include Euclid's GCD and a dynamic branch test program. The basic branch control and basic math codes have been fully tested on the pipeline model shown in the schematic. The MMU and the Load and Store instructions are further tested in the compiler.

Load and Store instructions can read or write 1, 2, 4 or 8 byte operands, usually signed, and the architecture could be upgraded to 64 bits width. For now, registers may be paired for 64-bit use. Register R0 is treated as a read of 0 unless just written; the value is cleared as soon as it is read or a branch instruction follows (taken or not). Since the RRR codes have no literal format, the compiler puts literals into RRR instructions using a previous load-literal (signed or unsigned) into R0. Other instructions may also write R0, which is useful for single-use reads.
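Since each prefix appears to extend the 3-bit R field by a further 3 bits (8, 64, 512 and 4096 registers for 0 to 3 prefixes), the encoding cost of a register operand can be sketched as follows. The helper below is purely illustrative and not part of the R16 toolchain:

  // Illustrative: prefixes needed to encode a workspace register offset,
  // assuming a 3-bit R field extended by 3 bits per prefix (8/64/512/4096).
  static int prefixCount(int regOffset) {
      if (regOffset < 8)    return 0;  // fits the base instruction
      if (regOffset < 64)   return 1;  // the first prefix carries no cycle penalty
      if (regOffset < 512)  return 2;
      if (regOffset < 4096) return 3;
      throw new IllegalArgumentException("offset beyond addressable registers");
  }

Read together with the register cache description above, this suggests keeping hot variables within the first 8 (or at most 64) workspace slots, since those encode with no prefix, or with the one penalty-free prefix.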
3.3 Multi-Threaded Pipeline

The PEs are 4-way threaded and use 2 cycles (a microcycle) per instruction to remove the usual hazard detection and forwarding logic. The 2-cycle design dramatically simplifies the design and lowers the FPGA cost to around 500 LUTs, from a baseline of around 1000 LUTs, plus 1 or more Block RAMs of 2 KBytes per PE, giving up to 150 Mips per PE. The total pipeline is around 15 stages, which is long compared to the 4 or 5 stages of MIPS or ARM processors; but instructions from each of the 4 threads use only every fourth pair of pipeline stages. The early pipeline stage includes the instruction counter and ICache address plus the branch decision logic. The middle pipeline stage is the instruction prefetch expansion and the basic decode and control logic. The last stage is the datapath and condition code logic. The PEs execute code until an occam process instruction is executed or a time limit is reached, and then swap threads between the processes. The PEs have reached place and route in Xilinx FPGAs; the PE schematic diagram is included – see Figure 3.

3.4 Memory System

The MMU supports each different external memory type with a specific controller; the primary controllers are for RLDRAM, SRAM and DRAM. The memory is assumed to have a large flat address space with constant access time, multiple banks and low cost. All large transfers occur in multiples of 32-byte lines.
A single 32 MByte RLDRAM and its controller has enough throughput to support many PEs, possibly up to 20 if some wait states are accepted. Bank collisions are mostly avoided by the MMU hashing policy. There are several Virtex-II Pro boards with RLDRAM on board which can clock the RLDRAM at 300MHz with DDR I/O, well below the 400MHz specification, but the access latency is still 20ns or 8 clocks. This reduction loses 25% on issue rate but helps reduce collisions. The address bus is not multiplexed, but the data bus may be common or split. The engineering of a custom RLDRAM FPGA PCB is particularly challenging, but is the eventual goal for a TRAM-like module.

An SRAM and its very simple controller can support several PEs, but the size and cost are not good. Many FPGA evaluation boards include 1MByte or so of 10ns SRAM and no DRAM. The 8-way banked RLDRAM will initially be modelled by an SRAM with artificial banking on a low-cost Spartan3 system.

An SDRAM or DDR DRAM and controller may only support 1 or 2 PEs and has much longer latency, but allows large memory size at low cost. The common SDRAM or DDR DRAM is burdened with the multiplexed row and column address, which does not suit true random accesses, contrary to RLDRAM. These have effectively 20 times less throughput, with at least 3 times the latency and severe limits on bank concurrency. But a 2-level system using either SRAM or RLDRAM with a very large SDRAM may be practical.

For a really fast, expensive processor, an all Block RAM design may be used for main memory. This would allow many PEs to be serviced, with a much higher banking ratio than even RLDRAM and an effective latency of 1 cycle. The speed is largely wasted, since all PEs send all memory requests through one MMU hash translation unit, but the engineering is straightforward. An external 1MByte SRAM is almost as effective.

3.5 Memory Management Unit

The MMU exists as a small software library used extensively by the C compiler. It has not yet been used much by either of the simulators. The MMU hardware design is in planning. It consists of several conventional memory interfaces, specific to the memory type used, combined with the DMA engines, the hashing function, and interfaces for several PEs with priority or polled arbitration logic. It will also contain the Link-layer shared component to support multiple Links or LEs.

3.6 Hash Function

The address hash function must produce a good spread even for small linear addresses on the same object reference. This is achieved by XORing several components. The MMU sends the bottom 5 bits of the virtual address directly to the memory controller. The remaining address is XORed with itself backwards, with shifted versions of the address, with the object reference, and with a small table of 16 random words indexed by the lowest 4 address lines being hashed. The resulting physical line address is used to drive the memory above the 5 lower address bits. If a collision should occur, the hash tries again by including an additional counter value in the calculation. The required resources are mostly XOR gates and a small wide random table. For a multi-level DRAM system there may be a secondary hash path to produce a wider physical hashed address for the second memory.
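A software model of the spreading function might look like the following sketch, assuming 32-bit virtual addresses. The particular shifts, the bit reversal and the table contents are stand-ins; the paper does not fix them:

  // Illustrative hash: spread (object reference, address) over physical line addresses.
  static final int[] RANDOM_TABLE = new int[16]; // 16 random words, filled once
  static {
      java.util.Random rnd = new java.util.Random(1);
      for (int i = 0; i < 16; i++) RANDOM_TABLE[i] = rnd.nextInt();
  }

  static int hashLine(int objectRef, int va, int retry) {
      int line = va >>> 5;                      // low 5 bits bypass the hash entirely
      int h = line;
      h ^= Integer.reverse(line);               // the address XORed with itself backwards
      h ^= (line << 7) ^ (line >>> 3);          // shifted versions of the address
      h ^= objectRef;                           // the object reference (private key)
      h ^= RANDOM_TABLE[line & 0xF];            // table indexed by the lowest 4 line bits
      h ^= retry;                               // extra counter value after a collision
      return h;                                 // drives memory above the 5 low bits
  }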
3.7 Hit Counter Table
Of course, in a classic hash system there are eventually collisions, which require that all references be checked by inspecting a tag. For every 32-byte line there is a required tag, which should hold the virtual address pair of object reference and index. To speed things up, there is a 2-bit hit counter for each line, which counts the number of times an allocation occurred at the line. The values are 0, 1, many, or unknown. This is stored in a fast SRAM in the MMU. When an access is performed, this hit table is checked and the data is fetched anyway. If the hit table returns a 0, the access is invalid and the provided object reference determines the next action. If the hit table returns a 1, the access is done and no tag needs to be checked. Otherwise the tag must be checked and a rehash performed, possibly many times. When a sparse structure is accessed with unallocated lines and the access test does not know in advance whether the line is present, the tag must be checked.
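To make the control flow concrete, here is a C sketch of the lookup policy just described; the table sizes, the hash_line function from the sketch above, and the tag layout are placeholders rather than the real MMU interface.

#include <stdint.h>

enum { HIT_NONE = 0, HIT_ONE = 1, HIT_MANY = 2, HIT_UNKNOWN = 3 };

typedef struct { uint32_t ref, index; } tag_t;  /* virtual address pair */

#define LINES (1u << 16)               /* toy memory: 64K 32-byte lines */
static uint8_t hit_table[LINES];       /* the 2-bit allocation counts */
static tag_t   tags[LINES];            /* one tag per line */

uint32_t hash_line(uint32_t ref, uint32_t va, uint32_t retry);

/* Return the physical line for (ref, va), or -1 for an invalid access. */
int32_t mmu_lookup(uint32_t ref, uint32_t va)
{
    for (uint32_t retry = 0; retry < LINES; retry++) {
        uint32_t line = hash_line(ref, va, retry) % LINES;
        switch (hit_table[line]) {
        case HIT_NONE:      /* 0 hits: invalid, consult the descriptor */
            return -1;
        case HIT_ONE:       /* exactly 1 allocation: no tag check needed */
            return (int32_t)line;
        default:            /* many/unknown: check tag, rehash on mismatch */
            if (tags[line].ref == ref && tags[line].index == (va >> 5))
                return (int32_t)line;
            break;
        }
    }
    return -1;
}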
3.8 The Instruction Cache
The Instruction Cache or ICache is really an instruction look-ahead queue. It can be up to 128 opcodes long and is a single continuous code window surrounding the instruction pointer (ip). When a process swap, function call, function return, or long branch occurs, the ICache is invalidated. For several microcycles the thread is stalled while the MMU performs 2 bursts of 32-byte opcode fetches (16 opcodes each) into 2 of the 8 available ICache lines. As soon as the second line starts to fill, ip may resume fetching instructions until another long branch occurs. When ip moves, it may branch backwards within the ICache queue for inner loops, or branch forward into the ICache. There will be a hint opcode to suggest that the system fetch farther ahead than 2 lines; if a loop can fit into the ICache but has complex branching that can jump forward by 16 or more opcodes, it should hint first and load the entire loop. The cycle accurate simulations show that the branch mechanism works well; it is expected that half the forward branches will be taken and a quarter of those will be far enough to trigger a cache refill. The idea is simply to reduce the required instruction fetch bandwidth from main memory to a minimum. While the common N-way set-associative ICache is considered a more obvious solution, this is really only true for STA processors, and such designs use considerably more FPGA resources than a queue design. The single Block RAM used for each PE gives each of the 4 threads an ICache and a Register Cache.
3.9 The Register Cache
The Register Cache (RCache) uses a single continuous register window that stays just ahead of the frame pointer (fp). In other words, the hardware is very similar to the ICache hardware, except that fp is adjusted by the function entry and exit codes, and this triggers the RCache to update. Similarly, process swaps will also move fp and cause the RCache to swap all content. A fixed register model has been in use in the cycle simulation since the PE model was written; it has not yet been upgraded with a version of the ICache update logic, since the fp model has not been completed either. Some light processes will want to limit the RCache size to allow much faster process swaps; possibly even 8 registers will work.
3.10 The Data Cache
There is no data cache, since the architecture revolves around RLDRAM and its very good threading or banked latency performance to hide multiple fetch latencies. However, each RLDRAM is a 32 MByte chip and could itself be a data and instruction cache to another level of SDRAM; this has yet to be explored. A Block RAM array might also serve as a lower level cache to RLDRAM, but it is about 1000 times more expensive per byte and not much faster. It is anticipated that the memory model will allow any line of data to be exclusively in RCache, ICache, RLDRAM and so on out to SDRAM. Each memory system duplicates the memory mapping system. The RLDRAM MMU layer hashes and searches to its virtual address width. If the RLDRAM fails, the system retries with a bigger hash to the slower DRAM and, if it succeeds, transfers multiple 32-byte lines closer to the core, either to RLDRAM or to RCache or DCache, but then invalidates the outer copy.
3.11 Objects and Descriptors
Objects of all sorts can be constructed by using the New[] opcode to allocate a new object reference. All active objects must have a unique random reference identifier, usually given by a PRNG. The reference could be any width, determined by how many objects an MMU system might want to have in use. A single RLDRAM of 32 MBytes could support 1 million unique 32-byte objects with no descriptor. An object with a descriptor requires at least 1 line of store just below the 0 address. Many interesting objects will use a descriptor containing multiple double links, possibly a callback function pointer, permissions, and other status information. A 32-bit object reference could support 4 billion objects, each of which could be up to 4 GBytes, provided the physical DRAM can be constructed. There are limits to how many memory chips a system can drive, so a 16 GByte system might use multiple DRAM controllers. One thing to consider is that PEs are cheap while memory systems are not. When objects are deleted, the reference could be put back into a pool of unused references for reuse. Before doing this, all lines allocated with that reference must be unallocated, line by line. For larger object allocations of 1 MByte or so, possibly more than 32000 cycles will be needed to allocate or free all memory in one go, but then each line should be used several times, at least once to initialize. This is the price for object memory. It is perfectly reasonable not to allocate unless initializing, so that uninitialised accesses can be caught as unallocated. A program might write a memory line with unknown by deallocating it; this sort of use must have tag checking turned on, which can be useful for debugging. For production, a software switch could disable that feature and could then avoid tag checking by testing the hit table for fully allocated structures. When an object is deleted, any dangling references to it will be caught as soon as they are accessed, provided the reference has not been reused for a newer object.
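As a concrete picture of what one 32-byte descriptor line might carry, the struct below collects the fields mentioned above (double links, callback, permissions, status). The exact field widths and ordering are guesses, not the R16 layout.

#include <stdint.h>

/* Hypothetical 32-byte descriptor, stored just below address 0 of the
   object it describes.  Fields follow the prose above; widths and order
   are invented. */
typedef struct {
    uint32_t content_next,  content_prev;   /* content double links */
    uint32_t instance_next, instance_prev;  /* instance double links */
    uint32_t callback;      /* handler for illegal accesses, 0 = default */
    uint16_t permissions;   /* executable, readable, writeable ... */
    uint16_t kind;          /* workspace, channel, array, sparse ... */
    uint32_t size;          /* allocated length in 32-byte lines */
    uint32_t status;        /* other status information */
} descriptor_t;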
3.12 Privileged Instructions
Every object reference can use a 32-bit linear space; the object reference will be combined with this to produce a physical address just wide enough to address the memory used. Usually an index combined with an object reference is unsigned and never touches the descriptor, but privileged instructions would be allowed to use signed indexes to reach into the descriptors and change their contents. A really large descriptor might actually contain an executable binary image well below address 0. Clearly the operating system now gets very close to the hardware in a fairly straightforward way.
3.13 Back to Emulation
Indeed, the entire Transputer kernel could be written in privileged microcode, with a later effort to optimize slower parts into hardware. The STL could also be implemented as a thin wrapper over the underlying hardware. Given that PEs are cheap and memory systems are not, the Transputer kernel could be hosted on a privileged or dedicated or even customized PE, rather than designing special hardware inside the MMU. If this kernel PE does not demand much instruction fetch bandwidth, then the bandwidth needed to edit the process and channel data structures may be the same, but the latency a little longer using software.
3.14 Processes and Channels
Whether the Transputer kernel runs as software on a PE or as hardware in the MMU could also change the possible implementation of Processes and Channels. Assuming both models are in sync using the same data structures, it is known that process objects will need 3 sets of double linked lists for the content, instance, and schedule or event links, stored in the descriptors for workspaces. To support all linked list objects, the PE or MMU must include some support for linked list management, as software or hardware. In software, that might be done with a small linked list package executed as software, with possible help from special instructions; as hardware, the same package would be a model for how that hardware should work. Either way, the linked list package will get worked out in the C compiler, as the MMU has already done. The compiler uses linked lists for the peephole optimizer and code emit, and could use them more extensively in the internal tree; a minimal sketch of such a package is given below, after section 3.18.
3.15 Process Scheduler
The schedule lists form a list of lists; the latter are for processes waiting at the same priority or for the same point in future time. This allows occam style prioritized processes to share time with hardware simulation. Every process instance is threaded through 1 of the priority lists.
3.16 Instruction Set Architecture Simulator
This simulator includes the MMU model, so it could run some test functions once the compiler finishes the immediate back end optimizations and encoding. So far it has only run hand written programs. This simulator is simply a forever switch block.
3.17 Register Transfer Level Simulator
Only the most important codes have been implemented in the C RTL simulator. The PE can perform basic ALU opcodes and conditional branches from the ICache across a 32-bit address space. The more elaborate branch-and-link is also implemented, with some features turned off. The MMU is not included yet; the effective address currently goes to a simple data array.
3.18 C Compiler Development
A C compiler is under development that will later include occam [22] and a Verilog [23] subset. It is used to build test programs to debug the processor logic, and will be self ported to R16. It can currently build small functions and compiles itself, with much work remaining. The compiler reuses the MMU and linked list capabilities of the processor to build structures.
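The linked list package referred to in section 3.14 might look like the following minimal C sketch: a generic circular double linked list with a sentinel head, not the project's actual API.

/* Minimal double linked list package of the kind the compiler and MMU
   could share; generic sketch only. */
typedef struct node { struct node *next, *prev; } node_t;

static void list_init(node_t *head)      /* circular, head is a sentinel */
{
    head->next = head->prev = head;
}

static void list_insert_after(node_t *pos, node_t *n)
{
    n->next = pos->next; n->prev = pos;
    pos->next->prev = n; pos->next = n;
}

static void list_remove(node_t *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
    n->next = n->prev = n;               /* leave the node self-linked */
}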
4. Instruction Set Architecture
4.1 Instruction Table
The R16 architecture can be implemented on 32- or 64-bit wide registers. This design uses a 32-bit register PE for the FPGA version, using 2 cycles per instruction slot; an ASIC version might be implemented in 1 cycle with more gates available. An instruction slot is referred to as 1 microcycle. Registers can be paired for 64-bit use. Opcodes are 16 bits. The Instruction Set is very simple and comes in 3 Register (RRR) or 1 Register with Literal (RL) format. The Register field is a multiple of 3 bits; the Literal field is a multiple of 8 bits. The PREFIX opcode can precede the main opcode up to 3 times, so Register selects can be 3-12 bits wide and the Literal can be 1-4 bytes wide. The first PREFIX has no cycle penalty. These are used primarily to load a single use constant into an RRR opcode, which has no literal field. The 3 Register fields are Rz <= Rx op Ry with a base value 0x0.z.x.y.

Table 1: Instruction Opcodes

[15:12]   0x000,008   0x080,088   0x800,808   0x880,888   RRR Format
0         add, adc    sub, sbb    and, orl    msk, xor    Arithmetic, Logicals
1         sll, srl    slc, src    slx, srx    srs, srr    Shifts Left, Right
2         mul, div    rem, TBD    extends     swaps       Extended Math
3,4,5,6   TBD         TBD         TBD         TBD         Reserved
7         occam       object      memory      other       Reserved for MMU
8         ld1, ld2    ld4, ld8    st1, st2    st4, st8    Load, Store N Bytes
9         mv1, mv2    mv4, mv8    la1, la2    la4, la8    Block Move, Address
10        ji0, ji1    ceq, cne    cgt, cle    clt, cge    Conditional Reg Jump
11        jcc         teq, tne    tgt, tle    tlt, tge    Test Reg

[15:12]   0xC0xx      0xC0xx      0xC8xx      0xC8xx      RL Format
12        bcc         bcc         bcc         bcc         Conditional Branch
13        push        push        pop         pop         Arithmetic, Logicals
14        ldi         ldi         adi         adi         Load, Add Literal
15        cmi         cmi         PREFIX      PREFIX      Compare, FIX Literal
There are 16 basic arithmetic, logical and shift opcodes. Notes: msk is x&~y; in the shift set, srs and srr are shift right signed and shift right rounded. The extended math opcodes are not defined, but might include the usual mul, div, rem as well as various sign extends, byte swaps, bit count, priority encoding etc. Blocks 3-6 are reserved or undefined. Block 7 is reserved for the MMU instructions that support occam, objects, and other memory operations. Blocks 8 and 9 are related: Load and Store are opposing z <= x[y] and z => x[y], Load Address is obviously z <= &x[y], and Block Move (mv) is x[] <= y[] for Rz transfers. In all 4 cases the transfer size is specified as 1, 2, 4 or 8 bytes, and all memory transfers are byte aligned. The MMU does not require special alignment rules, because the design is much simpler than the usual cache and page table designs; DRAMs can be configured to stream bytes directly.
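Reading the table's column codes as opcode select bits suggests one consistent decode: bits [15:12] select the row, bits 11, 7 and 3 select one of 8 opcodes within it, and the three 3-bit register fields sit at [10:8], [6:4] and [2:0]. This layout is an inference from the table values (e.g. add = 0x0000, xor = 0x0888), not a layout stated explicitly in the text.

#include <stdint.h>
#include <stdio.h>

/* Inferred RRR field extraction; treat as an assumption. */
typedef struct { unsigned row, op, rz, rx, ry; } insn_t;

static insn_t decode_rrr(uint16_t w)
{
    insn_t i;
    i.row = (w >> 12) & 0xF;                       /* table row */
    i.op  = ((w >> 9) & 4) | ((w >> 6) & 2) | ((w >> 3) & 1);
    i.rz  = (w >> 8) & 7;                          /* Rz field */
    i.rx  = (w >> 4) & 7;                          /* Rx field */
    i.ry  = w & 7;                                 /* Ry field */
    return i;
}

int main(void)
{
    /* 0x0000 = add (op 0); 0x0888 = xor (op 7), all registers R0 */
    insn_t a = decode_rrr(0x0000), b = decode_rrr(0x0888);
    printf("row %u op %u | row %u op %u\n", a.row, a.op, b.row, b.op);
    return 0;
}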
4.2 Calls, Returns, Jumps, Branches
Block 10 gives the conditional register-defined jump opcodes ji0, ji1, ceq, cne, cgt, cle, clt, cge. A set of 2 boolean bit[0] tests and 6 signed arithmetic tests on any Rx register can be used to conditionally call or jump through Ry, with the option to save ip to Rz. Several variations can be found by using R0 in any of the Rz, Rx, Ry fields:

    Rz <= ip, if test (Rx, 0) ip <= Ry [+ip];
If Rz is R0, ip is not saved; this is used for calls and, later, the return, or even coroutines. If Rx is R0, the test is null, the indirect version is selected, and ip is not added to Ry. If Ry is R0, then the jump target is R0+ip, or just R0 if Rx is also R0. If R0 was not previously just written, the effect is either to jump to 0+ip (a redundant skip) or to 0 (a stop condition); otherwise the target is R0+ip or just R0. These 4 versions could be restated as skip, stop, jmp relative or jmp absolute to a target using the R0 value, and all 4 still have the option to save ip. These would be used for skip, stop, call, return, and switch tables. Note that R0 may have been written with a load literal for relative or absolute branching, or with any other normal instruction for computed branching.
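A simulator-style C rendering of these rules might read as follows; the register file model is simplified here, in particular R0's write-then-revert behaviour is reduced to a plain read of 0, so this is a sketch of the semantics rather than the cycle simulator's code.

#include <stdint.h>
#include <stdbool.h>

uint32_t R[8];                      /* R[0] modelled as always 0 here */
uint32_t ip;

typedef bool (*test_fn)(uint32_t);  /* ji0, ji1, ceq, cne, ... on Rx */

/* Block 10 semantics: Rz <= ip, if test(Rx, 0) ip <= Ry [+ip].
   Rx == R0 selects the null test and the indirect (absolute) target. */
void reg_jump(unsigned rz, unsigned rx, unsigned ry, test_fn test)
{
    uint32_t target = (rx == 0) ? R[ry]        /* absolute through Ry */
                                : R[ry] + ip;  /* relative through Ry */
    if (rz != 0)
        R[rz] = ip;                            /* optional save of ip */
    if (rx == 0 || test(R[rx]))
        ip = target;
}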
4.3 Conditional tcc Test, jcc Jump Opcodes
Block 11 gives 6 tcc opcodes, which are similar to the previous ccc opcodes except that Ry is used for the test with Rx, there is no jump, and the result is saved in Rz:

    Rz <= test (Rx, Ry);

for signed arithmetic compares. The jcc opcodes use Rx for the condition code and perform:

    Rz <= ip, if (ccc is T/F) ip <= ip + Ry;
Again, if Rz is R0, ip is not saved, and if Ry is R0, the result is a skip or a relative or computed relative conditional branch. This is the long form of the bcc opcode.
4.4 Conditional bcc Branch Opcodes
Block 12 uses the literal compact form of relative conditional branch:

    Rz <= ip, if (ccc is T/F) ip <= ip + L;
The 16 conditional bcc relative branches include an 8-bit signed literal offset, but have no option to save ip. If this is not flexible enough, the previous register jumps can be used. Clearly, most short if, for, while, goto, break and continue statements will use this bcc form, or a prefixed version to increase the offset. All the various branches and jumps above use the same underlying mechanism. The ip register is a 32-bit index, so an 8-bit offset can reach +127 to -128 opcodes into a code or program object cp. The MMU logically reads from cp[0[ip]], where the outer [] is the MMU hashing access function. The MMU may hold the various object cp, wp, fp registers.
4.5 Condition Codes
Modern RISC architects frown on the classic condition code model, although R16 supports both condition codes and register testing models. The issues regarding condition codes raised in the computer architecture texts are not relevant to R16, since it is not expected ever to implement superscalar, out of order, register renaming, or VLIW techniques. Indeed, the x86 and PPC use condition codes too, despite the architectural complexity this adds. The 8 condition flag values are C, V, N, Z, and Lt, Le, Ls, 0. The C, V, N, Z flags are decoded into the LessThan, LessEqual, LessSame flags for signed or unsigned tests, very similar to the 68000. The bcc and jcc opcodes take the Rz or Rx field respectively as a 3-bit ccc field, which selects 1 of the 8 condition flags above; this is conditionally inverted for the branch decision or test value. The mnemonics currently used are the same as the 68000, but arranged differently.
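The flag decode can be written down in a few lines of C. The derived-flag equations below (Lt = N xor V, Le = Z or Lt, Ls = C or Z) follow the usual 68000-style derivations, since the paper does not spell out R16's exact equations; treat them as an assumption.

#include <stdbool.h>

typedef struct { bool c, v, n, z; } flags_t;

/* Select one of the 8 flag values {C,V,N,Z,Lt,Le,Ls,0} by the 3-bit ccc
   field, optionally inverted by the opcode's T/F sense. */
static bool cond(flags_t f, unsigned ccc, bool invert)
{
    bool lt = f.n != f.v;           /* signed LessThan */
    bool le = f.z || lt;            /* signed LessEqual */
    bool ls = f.c || f.z;           /* unsigned LessSame */
    bool sel;
    switch (ccc & 7) {
    case 0: sel = f.c; break;
    case 1: sel = f.v; break;
    case 2: sel = f.n; break;
    case 3: sel = f.z; break;
    case 4: sel = lt;  break;
    case 5: sel = le;  break;
    case 6: sel = ls;  break;
    default: sel = false; break;    /* the constant 0 flag */
    }
    return invert ? !sel : sel;
}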
4.6 Registers
R16 does not define a limit on the number of registers available; Rz, Rx, Ry are simply offsets from fp into wp workspace memory. There is a process specified limit that forces register names below it to use the register cache (RCache). This value could be any multiple of 8, up to 64 or so. Register names beyond that limit force the PE to issue hidden Load or Store microcycles to main memory in each particular case. This makes R16 both a register-to-register and a memory-to-memory ISA. It is only practical because the design uses no data cache and the minimum memory latency is actually less than the instruction latency. Typically 0 or 1 prefix is used: 0 for hot low registers and 1 for higher register access. There will be a mechanism for choosing other objects to index into besides wp, for channels. The RCache holds multiples of 8 registers, which is also the same size as 16 opcodes or 32 bytes, which in turn is the MMU line size used for burst transfers around the processor and the basic memory allocation unit. Implementations can contain different amounts of RCache and ICache. In R16, each process thread is currently assigned up to 64 registers and 128 opcodes in the ICache (really a fetch-ahead queue). The compiler will work with function frames that are multiples of 8 registers, bumping fp up or down by 8N on entry and exit. As fp moves, the RCache follows the fp window, using almost the same hardware mechanism as when the ICache follows ip. Workspace accesses that alias into the RCache are handled by having the RCache drag a shadow hole over the workspace to trap these accesses.
4.7 Register R0 == 0
As in many RISC designs, R0 is treated differently. The R0 value is usually read as 0 unless it was just written, but it reverts back to 0 after reading or branching. This allows the 0 value to be very close and always available. No instruction in the RRR format has a literal field; instead, the compiler should write R0 with the literal, using ldi Rz, L or ldu Rz, L for unsigned, and then the following instruction consumes R0. If the literal is needed more than once in the immediate time line, it should not be stored in R0. Many ISAs with a literal option in each source field impose odd restrictions on the effective literal size and can have complex variable length encodings. Since RRR has no literal field, either Rx or Ry can read R0 to fetch a single use literal, adding 1 microcycle for 8 or 16 bit values. Literals are most commonly needed by Load, Add, Compare, and conditional branches, and the RL form allows these to carry an 8 or 16 bit literal with no cycle penalty.
4.8 Functions
Function calls are simply branches or jumps with a save of ip to a designated safe place; the return instruction is just an indirect branch. The compiler places each stack frame starting from fp[0] and on up, and much of this will overlap with the RCache. When a function is called, parameters are written further up fp, starting 8n words higher, just after the last in scope variable. With 1 prefix, almost all parameters will be register writes into the RCache. On function entry, fp is pulled up to the first parameter, and the new function body gets to use the RRR opcodes for its own context. Since R fields cannot be negative, the function cannot see below fp, but it can use fp as a pointer into wp[], and load and store can be used to explore the frames. To minimize the opcode sizes, the compiler may want to rearrange the frame so that the bottom few registers are used as hot temporaries, with the parameters and locals just above them. Return results would likely be left in the hot registers; there wouldn't necessarily be any limit on the number of return values either.
4.9 Switch Statements
While building the switch statement target address, the skip opcode can save ip to Rz, which can then be used to compute a target address to jump through. By placing this switch block code just after the last case or default statement, the switch jump can also save ip again just before jumping. Now the usual case breaks can be replaced by a return through that last saved ip. There is an even more interesting version of the switch statement that takes advantage of the associative nature of the MMU memory system. In this scheme, a sparse array of 32-byte labels is allocated as a label array object. The switch simply accesses the sparse array for the target address stored in the label cell. If a callback is stored in the label descriptor, the switch code amounts to just a load target register sequence and a register jump; the callback handles the default case where there is no valid label. The compiler would have to pre-build a label array for each switch statement. For a critical inner loop switch statement, only a few opcodes are needed, for any arbitrary switch statement. There is also the possibility that the label array can be dynamically altered, and since it is associative, many labels can switch to the same case, allowing wild xxx values or case ranges, something found in other languages such as Verilog. The C compiler will probably use this scheme.
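On a conventional host, the associative switch of section 4.9 can be mimicked with a small hash table standing in for the sparse label array, with a default function playing the role of the descriptor callback; on R16 itself the MMU's object memory would provide the sparse lookup directly. All names here are illustrative.

#include <stdint.h>

typedef void (*case_fn)(void);
typedef struct { uint32_t key; case_fn target; } label_t;

#define NLABELS 64                       /* power of two, kept sparse */
static label_t labels[NLABELS];

static void label_set(uint32_t key, case_fn target)
{
    uint32_t i = (key * 2654435761u) & (NLABELS - 1);
    while (labels[i].target && labels[i].key != key)
        i = (i + 1) & (NLABELS - 1);     /* probe past collisions */
    labels[i].key = key;
    labels[i].target = target;
}

static void switch_dispatch(uint32_t key, case_fn deflt)
{
    uint32_t i = (key * 2654435761u) & (NLABELS - 1);
    while (labels[i].target) {
        if (labels[i].key == key) { labels[i].target(); return; }
        i = (i + 1) & (NLABELS - 1);
    }
    deflt();                             /* no valid label: default case */
}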
4.10 Orthogonality
The use of prefixes allows the R16 ISA to give very compact code while remaining orthogonal and very simple to decode. The various branch and jump codes are minor variations on the basic branch operation and mostly affect the ICache logic, while the math codes are variations on the add instruction that affect the data path, RCache and condition codes. The Load, Store and Move codes are also variations, affecting mostly the MMU.
4.11 Hennessy & Patterson
On a side note, the Hennessy & Patterson text [24] was thoroughly reviewed for typical statistics on all aspects of instruction opcodes. The text merely confirmed the obvious choices and designer prejudices. The highest priority was given to opcodes that the compiler must actually use to compile C code and to support objects. Half of the RRR format is reserved for future or application specific opcodes. Many well-known instruction sets were also referenced during the instruction selection. An unknown figure is the Load/Store to other opcode ratio, which will have some impact on the PE to MMU ratio.
5. Compiler Development
5.1 Compiler Introduction
This section describes the on-going C compiler development, which will later include occam and Verilog extensions. The compiler can now compile itself to about 8000 opcodes, but requires much more work. Smaller examples such as Quicksort are almost ready to run; see Figure 1. The C compiler is based on a long run of earlier compiler work, dating back to C to Verilog and Verilog to C translators and more recent C compiler prototypes. The Fraser and Hanson textbook, "A Retargetable C Compiler: Design and Implementation" [25], has been very useful. Half the project time has been used in compiler development.
5.2 Reusing MMU Components
The C compiler and the ISA and Cycle simulators are combined into a single project, so that the MMU memory allocation code can also be used by the compiler to manage its own hash tables and linked list structures. Previously the compiler had its own memory management code, but it became obvious that developing 2 similar packages did not make sense. If the compiler can use high-level functions that are in the R16 instruction set, then these opcodes get thoroughly debugged in the compiler before ever executing on the processor. Further, when the compiler is later cross-targeted to R16, much of its code will be native instructions to the MMU to manage or access its memory, rather than a compile of a software package. This will likely give the R16 native compiler a boost in performance compared to the PC hosted C compiler, which runs the MMU as low locality functions.
5.3 Test Programs
To design and debug a processor requires that machine programs be prepared to run on several simulation models, both to prove correctness of the design and to explore various architectural ideas. R16 uses prefixed variable length opcodes, which quickly increases the effort of manual hand assembly of test programs more than a few lines long. A few test programs have been hand written in hex code, and these have been enough to thoroughly check out the entire Instruction Cache and the conditional branch logic, as well as the multithreaded pipeline and data path. Some of these programs are dynamically changed to stress the processor hardware design while comparing the behaviour with a predicted model. To push the development cycle much further requires that C functions be compiled to the ISA, which can then be compared against the same C code compiled into the simulator compiler package via the host Visual C compiler. This not only tests the processor design, it also tests the developing C compiler design. Differences between the ISA and cycle simulators running a cross compiled test program and the results of the same Visual C compiled test program can usually identify bugs to be fixed. The first batch of C test programs to be compiled and tested will be well-known examples such as Quicksort and other classics from the Knuth or Sedgewick texts. The qualities desired of a test program are that it be moderately long, with a modest amount of input and output; simple array type problems with a variety of branching patterns will do. The Quicksort (Figure 1) can be used to stress test the memory system, but its inner loops use only 4 opcodes (Figure 2), so it does not test much of the processor.
Many small programs will fit inside the ICache, so the simulator can compare programs that generate much instruction fetch traffic to the MMU with those that do not.

void quicksort(int a[], int l, int r)
{
    int v, i, j, t;
    if (r > l)
    {
        v = a[r]; i = l-1; j = r;
        while (true)
        {
            while (a[++i] < v);    // critical loop, should be 4 opcodes
            while (a[--j] > v);
            if (i >= j) break;
            t = a[i]; a[i] = a[j]; a[j] = t;
        }
        t = a[i]; a[i] = a[r]; a[r] = t;
        quicksort(a, l, i-1);
        quicksort(a, i+1, r);
    }
}
Figure 1: Quicksort in C

5.4 Native C Compiler
An immediate goal is to complete the C compiler so that it can retarget itself to R16, and this is nearing completion, even though much general work remains. When the C compiler is finally correct in compiling itself, R16 can run it to recompile itself, and if that produces the same binary result, the project will have reached a major milestone. If it does not, the C compiler will be instrumented to find trace differences between the systems.
5.5 Compiler Structure
The C compiler previously used an Lcc lexer and parser solution with separate passes. It has since gone back to a single parse with lexical token scanning on demand, which makes it much easier to add backtracking within the source text. The parser directly emits an RPN tree, which is later scanned to emit a linked list of output codes, which then go through various stages of peephole optimization. In earlier compilers a preprocessor was also built in, and these early compilers used the Visual C preprocessor to build themselves. It was later realized that the compiler might be a lot cleaner if it did not use any preprocessor, and that this would make the compiler much easier to self-compile. If the compiler can reach the point where inlined functions are supported, the quality will be much better, both in the source and in the final output. For the later Verilog support, inlining is required to completely smash entire programs into a single module, so inlining is a major feature to implement. In a previous Verilog to C translator, this was already achieved.
5.6 V++ Compiler
Once the processor and C compiler are functional and reasonably correct, the long-term goal is to upgrade the current C compiler sources to compile the V++ language extensions, which are intended to include an occam and Verilog subset. Since the processor is designed to include occam support, the compiler must also include it. Further, the processor will later generalize that to a Verilog event-driven subset, so that must be included too; this latter, however, is less well defined. The compiler framework has had these in mind all along. Note that Verilog and C have very similar expression syntax, but the Verilog block level syntax is more like Pascal. One could imagine occam as the primary language for writing parallel software, and also use it to synthesize hardware blocks that can be placed on the same FPGA as a Transputer.
quicksort:              // missing fp +ve adjust
    tgt R1,r,l          // entry point
    brf L19             // if (r>l) {
    ldw R1,a,r          // REDUNDANT
    mov v,R1            // v = a[r];
    sub i,l,1           // i = l-1;
    mov j,r             // j = r;
    bra L23             // DEAD CODE
    bra L27             // DEAD BRA
L27:                    // while (true) {
    add i,i,1
    law R1,a,i          // REDUNDANT
    ldw R0,R1,R0        // ldw R0,a,i
    tlt R0,R0,v
    brt L27             // while (a[++i]<v);
    tge R1,i,j          // if (i>=j) break;
    brf L36
    bra L26             // DEAD BRA
    bra L36             // DEAD BRA
L36:
    ldw R1,a,i          // REDUNDANT
    mov t,R1            // ldw t,a,i       t = a[i];
    law R1,a,i          // REDUNDANT
    ldw R2,a,j          // ldw R2,a,j
    stw R2,R1,R0        // stw R2,a,i      a[i] = a[j];
    law R1,a,j          // REDUNDANT
    stw t,R1,R0         // stw t,a,j       a[j] = t;
L23:
    tne R1,true,0       // DEAD CODE
    brt L27             // bra L27
L26:
    ldw R1,a,i
    mov t,R1
    law R1,a,i
    ldw R2,a,r
    stw R2,R1,R0
    law R1,a,r
    stw t,R1,R0
    mov T8,a
    mov T9,l
    sub T10,i,1
    bfn quicksort
    mov T8,a
    add T9,i,1
    mov T10,r
    bfn quicksort
    bra L19
L19:
ret:                    // missing fp -ve adjust
Figure 2: C compiler output for Quicksort
Figure 2 shows the output assembler emitted for the Quicksort function found in Sedgewick. The only edits applied to that output were to remove internal log columns to the left of the label column and to add comments on where future work remains. Note that the labels have already been optimized away, so that only those that are needed remain. The function does not yet have the stack frame instructions, nor does it actually allocate variables to registers. The first redundant opcode shows v = a[r], which should be a single load opcode. The RPN internal structure evaluates the left and right sides using pointer referencing and should merge those when the pointer is used once. The same applies to the law-ldw pairs, which load an address and then dereference it. There are also several dead bra opcodes; these result from if-then-else statements with empty else parts. It was felt that rather than optimize that at the RPN stage, the peephole optimizer could perform that chore without caring why excessive bra codes are emitted.
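One such peephole rule is easy to state over the compiler's linked list of emitted opcodes: a bra whose target is the very next label is dead and can be unlinked. The sketch below invents stand-in structures for the compiler's real ones.

#include <string.h>

/* Stand-ins for the emitted-code list; head is a sentinel node. */
typedef struct op {
    struct op *next;
    char mnem[8];        /* "bra", "ldw", ... or "" for a label */
    char label[16];      /* branch target, or the label's own name */
} op_t;

/* Remove every bra whose target is the immediately following label. */
static void kill_dead_bra(op_t *head)
{
    for (op_t *p = head; p->next; ) {
        op_t *n = p->next;
        if (n->next && strcmp(n->mnem, "bra") == 0 &&
            n->next->mnem[0] == '\0' &&           /* next op is a label */
            strcmp(n->label, n->next->label) == 0) {
            p->next = n->next;                    /* unlink the dead bra */
        } else {
            p = p->next;
        }
    }
}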
5.7 Other Compilers
Compared to the original Transputer, new compilers will see a new instruction set that is very straightforward to compile to and very orthogonal. The biggest difference is the register Load/Store architecture; the opcodes are now variable length 16 bit opcodes with only two simple formats, RRR and RL. The MMU supports memory allocation and access checking in hardware on objects. That should be used at the language level, otherwise entirely sequential programs can still be written using a single workspace, heap and stack. The C support for New[] and Delete[] needs to be enhanced to allow the Transputer C programmer to direct object construction of fully or partially allocated objects using the hardware support. This could be done by allowing assembler level codes, by calling a library of wrapper functions, or by enhancing the C syntax.
6. Processor Element Details
6.1 Multi Cycling: 2 Clock Cycle Design
Most of the R16 instructions execute in 1 microcycle (2 clocks), passing twice through a fast 16-bit data path. The operands are fetched and saved as 32-bit words. This requires 2 read and 1 write accesses to the register file in a dual port Block RAM; another access fetches a pair of 16 bit opcodes on a 32-bit aligned address. The operand reads occur on the odd clock, and the write back and opcode fetch occur on the even clock. With 2 clocks, these 4 accesses can be handled by just 1 dual port Block RAM organized as 512 by 32 bits. This maximizes performance for minimum FPGA resources, and leads to many simplifications that allow the entire PE to fit into about 500 LUTs, running at 320MHz (after P/R) for about 150Mips or less on the Virtex-II Pro. R16 achieves close to 0.45 IPC. As IPC rises, the resources needed to obtain that performance rise much faster. CPUs that push IPC higher than 1 use orders of magnitude more resources for single-threaded designs and are fragile, especially if they use caches that miss too often with low locality code. The R16 instruction decoder uses variable length prefixed opcodes up to 4 words total. This is simplified by the 2 clocks, since each clock only considers 1 word; 2 clocks consider a single prefix opcode pair. Multiple prefixes are simply forced to take another microcycle, since they are mostly used for infrequent long literals or long branches. This design allows multiple PEs to be used with a shared MMU, depending on the available memory bandwidth. With 2 PEs back to back, the throughput is clearly better than that of 1 PE running instructions every clock, which uses more than twice the hardware resources. The 2-clock design is fundamental to circuit performance, since it allows every PE to run at the maximum rate that a Block RAM, a 16-bit adder, or 3 levels of LUT logic can achieve for the entire design. Typically 3 levels of LUT logic equals 10 gate delays. The design learned a great deal from R3, a previous, somewhat similar ISA design, which had a theoretical IPC of 1.3 and a data path clock rate of 300MHz. It eventually collapsed with 3 times the complexity and 1500 LUTs, and an actual IPC nearer to 1; later the clock fell to 70MHz, due to unplanned long control logic paths in the variable length instruction queue. While trying to rescue that design, it became obvious that 2 clocks could drastically simplify every part of the design, and simplify the ISA too, so R16 was born. Many of R3's best points survived in a more rational form.
6.2 Multi Processing: N-way Design
This brings up the possibility that CPUs matched with several 32 MByte RLDRAMs, using several PEs per MMU, can achieve sustained zero locality memory accesses every O*3.3ns, versus the Athlon worst case of a 300ns regular DRAM cycle. That would give a theoretical 90/O speed-up for zero locality codes during this dead memory access period. The value of O averages between 1 and 3 due to collisions, but with a possible unbounded limit that sometimes reaches 10 to 200 when the memory becomes 90% or more full. O also gradually increases when New[] and Delete[] are called in quick succession, which dirties the hit table, but a periodic clean up scheme can lower O again. High locality data memory references are irrelevant with a hash based MMU, except at the register level, but they help the Athlon reach 1ns cycles. The question is what percentage of memory accesses have no locality – maybe 1%?
6.3 Multi Threaded Architecture: 4-way Design
The PE divides nicely into 3 distinct blocks: the barrel controller; the instruction fetch queue plus prefix and opcode decode; and lastly the main data path. Each runs in permanent lock step in a never-ending cycle, no matter what the 4 threads are doing within. Each block can be decomposed into smaller, simpler sections, which can be analyzed or driven with test stimuli to verify each part independently. Each pipeline has very little logic in any one stage. The longest critical path is either the 16-bit adder, the Block RAM cycle, or in many places 3 LUT levels worth of logic, which is equivalent to 10 gates worth in an ASIC. This even varies somewhat by Xilinx FPGA family, making it very hard to be optimal for all families. The relative simplicity allows the entire design to clock at the fastest rates for FPGAs, which is where only high performance DSP engines usually live. These 3 blocks have been through the Xilinx XST flow for Spartan3, Virtex-II Pro and now Virtex-4, with 225MHz, 320MHz and 320-370MHz results respectively after hand driven Place and Route; the Virtex-4 is too early to be taken further. With 8 pipelines per microcycle, there is plenty of time to make some of the bigger decisions per instruction. In a classic single-threaded RISC, the same decision would have to be made entirely within 1 clock cycle, which severely limits the rate of control decisions, often requiring stall states to keep up; this always lowers IPC.
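The lock step rotation is easy to visualize with a toy C model: each microcycle belongs to a fixed thread slot, and a stalled thread simply wastes its slot without disturbing the others. Field names here are invented; this only loosely mirrors the cycle simulator's style.

#include <stdint.h>
#include <stdio.h>

typedef struct { uint32_t ip; int stalled; } thread_t;

int main(void)
{
    thread_t t[4] = { {0,0}, {100,0}, {200,1}, {300,0} };

    for (unsigned microcycle = 0; microcycle < 8; microcycle++) {
        thread_t *cur = &t[microcycle & 3]; /* fixed rotation, no arbiter */
        if (!cur->stalled)
            cur->ip += 1;                   /* issue one instruction slot */
        /* a stalled thread wastes its slot; the others are unaffected */
    }
    for (int i = 0; i < 4; i++)
        printf("thread %d ip=%u\n", i, t[i].ip);
    return 0;
}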
6.4 Barrel Controller
The barrel controller controls the PE. It consists of 4 rows of thread state that rotate every odd clock in a circular fashion. The thread state contains a number of distinct fields for controlling the Instruction Queue, the Instruction Cache, and the 4 ip values. The 4 rows include some combinatorial logic in a few places to adjust each field value to its next state. The most complex part is the ip increment. In the barrel engine, ip is only 7 bits wide, which eventually drives the ICache instruction pair fetch; the remainder of ip is left in the Block RAM register file. While instructions can be fetched from the ICache, ip follows the true position of the instruction pointer modulo 128. When branch opcodes occur, 1 microcycle is used to recompute the true ip for the next instruction and to leave that in some register. If the branch decision is true, a second microcycle performs the ip += offset or ip = target, depending on the instruction options. The branch decision logic includes the 8 to 1 condition code selection and various related logic. No presentable diagram is available at this time.
6.5 Processor Schematic
Figure 3 shows both the data path and the instruction fetch queue and prefix path. Most prominent at top left is the Block RAM, driven by microcode z, x, y addresses and the Thread Ids. A pair of operands flows straight down the 2-cycle data path and back to the Block RAM on the other cycle phase, about 5 clocks later. This is matched by a 32-bit pair of opcodes sent to the IQ FIFO, usually 2 by 2 as the IQ empties. The other side of the IQ FIFO pulls opcodes out along with any prefix and arranges them into the 4 Microcode fields M3-M0, along with an optional delayed upper prefix pair. Only M0 drives control decodes; M3-M1 carry prefix payload, which goes to the z, x, y address logic or the literal input path. Not shown are the many small logic blocks that make this all work. This view accurately reflects the C cycle RTL and Verilog RTL codes.
7. Conclusions
This paper reports a different way to obtain good performance from an FPGA processor, by using a relatively simple processor design that can be clocked at DSP speeds. It can be replicated to increase performance, and it has enough memory bandwidth to allow several PEs to remain busy using a threaded DRAM, rather than the usual multi level cache with paged table system. It is difficult to predict how performance will compare with other processors, or even with the past Transputer, until it can be benchmarked, but early studies suggest that each PE may be only 5-8 times slower than a full Athlon XP2400 for a few dollars of FPGA resources (based on a huge Quicksort trace analysis). Further, it is hoped that bringing forth a new design for a Transputer will prove to have been the right thing to do; only time can tell. With continued development, most of this work should come to fruition. This paper is an edited version of a longer (40+ pages) document that additionally provides: details of simulators for cycles and instructions; support for objects and processes; details of inter-processor links and FPU co-processors; support for high performance computing; ASIC implementation; and FPGA tools and hardware/software co-design (with V++ Verilog, occam and C). Future developments include completing the compiler, the PE and MMU designs, bringing up the FPGA board, and later designing or finding a TRAM-like FPGA RLDRAM
module with space for another function. If the project can be commercialized, there is plenty of work for a dozen or even a hundred people.

Acknowledgement
The author gratefully acknowledges the reviewers' advice and comments on this paper; the result is a paper that became much more clearly focussed on the important aspects. The author also thanks Ruth Ivimey-Cook for preparing the processor illustration and for the initial paper that triggered the "what if, why not" thought. Also, thanks to the editors for making it possible to present the ideas here and for cleaning up the text and layout. Thanks especially to my wife for enduring many years of isolated development. I also thank the regular posters at the comp.arch newsgroup for many interesting computer architecture conversations; a lot can be learned by listening in. Finally, I thank Inmos for having existed and allowing me to spend my early years there.

References
[1] R. Ivimey-Cook, Legacy of the Transputer. In 'Architectures, Languages and Techniques for Concurrent Systems (WoTUG 22)', IOS Press, Amsterdam, 1999.
[2] P. Walker, Hardware for Transputing without Transputers. In 'Parallel Processing Developments (WoTUG 19)', IOS Press, Amsterdam, 1996.
[3] R. Meenakshisundaram, ClassicCmp, http://www.classiccmp.org/
[4] J. Gray, FPGA CPU News, http://www.fpgacpu.org/
[5] M. Tanaka et al., Design of a Transputer Core and its Implementation in an FPGA. In 'Communicating Process Architectures 2004', IOS Press, Amsterdam, 2004.
[6] Micron Technology, Inc., http://www.micron.com
[7] SGS-THOMSON Microelectronics, Inmos Databook, 1986.
[8] IBM United States, http://www.ibm.com/
[9] Monolithic Systems Technology, Inc., http://www.mosys.com
[10] Xilinx Inc., MicroBlaze RISC Architecture, http://www.xilinx.com
[11] Altera Inc., NIOS RISC Architecture, http://www.altera.com
[12] M. Flynn and S. Oberman, Advanced Computer Arithmetic Design, Wiley, ISBN 0-471-41209-0, 2001, 121.
[13] Sun Microsystems, Niagara, http://www.sun.com
[14] Raza Microelectronics Inc., http://www.razamicroelectronics.com/products/xlr.html
[15] Ubicom, http://www.ubicom.com
[16] D. Knuth, The Art of Computer Programming, Vols 1-3, Addison-Wesley, ISBN 0-201-89683-4, 1997.
[17] M. Abrash, ZEN of Code Optimization, Coriolis Group Books, ISBN 1-883577-03-9, 1994.
[18] M. Schmit, Pentium Processor Optimization Tools, AP Professional, ISBN 0-12-627230-1, 1995.
[19] R. Booth, Inner Loops, Addison-Wesley, ISBN 0-201-47960-5, 1997.
[20] R. Sedgewick, Algorithms in C, p. 118, Addison-Wesley, ISBN 0-201-51425-7, 1990.
[21] QinetiQ, http://www.qinetiq.com
[22] A. Burns, Programming in occam 2, Addison-Wesley, ISBN 0-201-17371-9, 1988.
[23] D.E. Thomas and P.R. Moorby, The Verilog Hardware Description Language, Kluwer Academic Publishers, ISBN 1-4020-7089-6, June 2002, 1-23.
[24] J.L. Hennessy and D.A. Patterson, Computer Architecture: A Quantitative Approach, pp. 69-118, Elsevier Science & Technology Books, ISBN 1-55860-329-8, August 1995.
[25] C. Fraser and D.R. Hanson, A Retargetable C Compiler: Design and Implementation, Addison-Wesley, ISBN 0-805-31670-1, January 1995.
Figure 3: Schematic for the Data Path & Instruction Queue
Appendix: Main Terms Used Throughout the Paper

ALU: Arithmetic Logic Unit, centre of most data paths.
ASIC: Application Specific Integrated Circuit, typically 1-5 times faster than FPGA.
Callback: Function pointer stored in the descriptor, called by the MMU when an illegal access has occurred; it might be harmless or serious. A zero field defaults the action to possibly stopping the process or raising an exception.
Permissions: Used to determine if an object is executable, readable, writeable etc.
CE: Computing Element or co-processor designed by others that looks like a PE.
Channel: Used to synchronize communicating processes and transfer data, either processor to processor or process to process respectively.
CISC: Complex Instruction Set Computer, typically memory to register.
CLA: Carry Look Ahead, uses a tree of adders, often 4 way, adds in log time.
CPI: Cycles Per Instruction, the inverse measure used on earlier microprogrammed CPUs.
CS: Computer Science.
CSA: Carry Select Adder, try 2 solutions with carry in of 0 and 1, select one later.
CSP: Communicating Sequential Processes, a way of looking at parallel programs as processes modelling hardware, with channels and links replacing wires.
Cycle: Simulation sometimes used instead of RTL simulation, but often in C or HDL.
DCache: Data Cache; R16 uses an inverted page scheme in RLDRAM instead.
Descriptor: Indicates whether the object is workspace, channel, simple array, sparse etc.
DMA: Direct Memory Access, minimal overhead block move; supports all vector operations, store moves, Links, Channels, process swaps, cache updates.
DSP: Digital Signal Processor, either a special purpose CPU or special HW.
EDA: Electronic Design Automation, or CAD specifically targeted to chip design. Used to be schematic based, with timing driven HDL simulation; now almost entirely Synthesis, STA and automatic P/R.
EE: Electronic Engineer.
FIFO: First In First Out.
FPGA: Field Programmable Gate Array, essentially reconfigurable HW.
FPU: Floating Point Unit, typically IEEE standard, a special purpose CE.
FW: Firmware, as in a bitfile or EPROM to configure an FPGA or embedded CPU.
HandelC: C combined with occam, used for higher level modelling and synthesis of hardware; useful for converting algorithms to hardware with much less EE effort. A downside to HandelC and other commercial C based hardware synthesis is that they cannot reach the entry level where free to use FPGA tools are available for Verilog and VHDL, and are all proprietary. Others of note include PrecisionC, ImpulseC, SystemC, but these are intended more for system level design and co-design.
Handle: An object reference, basically a unique random number assigned by New[].
HDL: Hardware Description Language, Verilog or VHDL (C v Pascal/ADA style).
HW: Hardware, as in hard electronic stuff.
ICache: Instruction Cache; R16 uses a simple instruction queue with hidden prefetch.
IP: Intellectual Property, especially for VLSI modules as soft cores.
ip, wp, fp: Instruction, workspace and frame pointers, similar to the original Transputer.
IPC: Instructions Per Cycle; up to one is fairly easy, above one becomes progressively harder and is what modern processors strive to achieve.
IPM: Inverted Page Mapping, an alternative IBM scheme used on higher end servers to map a VA to a PA by means of an address hash, with many benefits in hardware reduction but some costs in variable access times. Usually used for large page sizes; in R16 it is used for 32-byte lines, leading onto object memory management. IPM also requires additional tag and hit table RAM.
ISA: Instruction Set Architecture.
KRoC: Kent Retargetable occam Compiler, http://www.cs.kent.ac.uk/projects/ofa/kroc/.
LE: Link Element, which could be a DS Link, a USB or Ethernet or Video port.
Line: RCache, ICache and DMA transfers use 32-byte line by line burst transfers.
Link: Serial DMA interface with very low hardware costs; connects to a Channel.
LUT: Look Up Table, basic FPGA component equivalent to a few or many logic gates.
MAC: Multiplier Accumulator, multiplier plus wide summing adder in 1 circuit, used extensively by programmable DSPs and in FPGAs, usually paired with a Block RAM. Virtex-4 FPGA devices build these into embedded DSP cores.
MMU: Memory Management Unit, usually paged tables or inverted page tables. In R16 the MMU is the main hub for multiple PEs, LEs, CEs, and also delivers the Object support for Processes through a hash address scheme.
MTA: Multi-Threaded Architecture, a recent trend to hide latency, usually in CPUs.
NPU: Network Processor Unit, used to process network packets.
NRE: Non Returnable Expense, design engineering costs.
Object: A block of memory with an associated descriptor just below address 0; contains link list fields, callbacks, permissions; used by the Transputer kernel and other processes.
occam: The original native language for the Transputer, with a small syntax that implemented CSP in a practical form, almost an assembly level language.
OoO: Out-of-Order, as in rearranging instructions in time order rather than broken.
OS: Operating System, usually software, but in Transputers partially in hardware.
P/R: Place & Route, takes synthesized logic and completes final wiring etc.; may include human Floor Planning to help the tool produce better or worse results.
PA: Physical Address, the actual address that drives the DRAM arrays.
PCB: Printed Circuit Board.
PDL: Process Description Language, such as occam, or even Verilog or VHDL.
PE: Processor Element, a smaller incomplete CPU, requires MMU access.
PRNG: Pseudo-Random Number Generator, a special form of counter that counts through a random looking sequence. This is used to generate Object handles or references, which might be collected after de-allocation and reused.
Process: The basic unit of computing, models a hardware process or module. The other meaning describes semiconductor fabrication processes.
R16: This CPU; uses a 16-bit 2-clock data path, follows the previous R3 32-bit design.
RA: R16 Register Address, typically 3, 6 or even 9 or 12 bits wide; these are always mapped into VAs into Workspace memory at fp[RA], with hidden Load/Store operations for those beyond a specified limit but direct register access for those below. Such operations are both memory to memory and register to register. The RCache fills in a memory hole in the workspace.
RC: Reconfigurable Computing, changing FPGA function on the fly.
RCache: Register Cache; R16 uses a sliding register window over the workspace.
RISC: Reduced Instruction Set Computer, typically Load/Store to register.
RPN: Reverse Polish Notation, used in compilers.
RTL: Register Transfer Level, construct compute pipeline codes in HDL or C.
SOC: System-on-Chip, used by marketing more than by EEs.
STA: Single-Threaded Architecture: most processors today. Also Static Timing Analysis, analysis of circuits saving untold simulations, which largely replaced edit and simulate with timing cycles. Combined with Synthesis, most designs are now modelled cycle accurate without detailed timing.
STL: Standard Template Library, a C++ package of standard objects that could be easily mapped onto the object memory system.
SW: Software, as in soft code stuff, although the definition can be blurred.
Synthesis: HDL RTL code is synthesized into logic that can be Placed & Routed to the target device; it asks how fast, the designer says 3ns, it tries its best.
Thread: Hardware threads in a PE execute a software process for a length of time.
TLB: Translation Look-aside Buffer, a small associative address cache.
TRAM: TRAnsputer Module, a Transputing PCB module about the size of a credit card.
Transputer: A processor that supports occam style processes and communications.
ULSI: VLSI * 10; no newer terms were added, although SOC is in vogue today.
V++: A proposed language that combines HDL, C, occam (PAR, SEQ, PAR+SEQ), allowing different coding styles to be mixed or linked, something like gcc, a common compiler framework with multiple front ends. This will permit algorithms to be developed in one language and then moved into another to reach other tools. Allows source code to run as executable code or synthesize as a co-processor, possibly an LE connected to the MMU.
VA: Virtual Address, the address constructed by a Load/Store instruction. In most processors the VA to PA translation is performed by multilevel page table lookups, accelerated with a bypass Address Cache or TLB, but with relatively few entries, easily broken with low locality or multithreaded code.
VLSI: Very Large Scale Integration, full custom, typically 5-25x faster than FPGA.
DDR: Double Data Rate, clocks data on both clock edges, typically a 300 MHz clock.
DIMM: Dual Inline Memory Module, the usual packaging format for DRAMs.
DLL: Delay Locked Loops, recent additions to FPGAs that allow high speed serial I/Os. Note that all modern DRAM and many SRAM interfaces are now essentially communications systems that use extensive PLLs and DLLs to align signal and clock edges; some even perform line characterization to tune the IO circuits.
PLL: Phase Locked Loops, used to multiply (and divide) clock frequencies.
QDR: Quad Data Rate, clocks data at 4x the clock rate, requires fancy PLLs, DLLs.
WebPack: Xilinx version of a free to use EDA FPGA package covering all modern but small to medium devices, which can still include up to 1 million gate equivalent designs. Altera has Quartus as a near equivalent. Pay versions are unlocked for all the largest FPGAs and still only cost 1% of ASIC EDA tools. Both have neutral support for Verilog and VHDL.
XDR2: Newer version of XDR from Rambus that now supports threaded DRAM.

Block RAM (also BRAM): SRAM internal to the FPGA, dual ported, 18 Kbits in various shapes, around 300 MHz cycle rate. Available in 4 to 550 or so instances, about $1 each; best used for their bandwidth, not as bulk SRAM. Can be grouped for more bits.
DDRAM: Superseded SDRAM with the minor change of transferring data on both clock edges, i.e. DDR.
DRAM: Dynamic RAM, hierarchical memory structure with access times varying from 60ns to 1.5ns depending on Bank, Column, Row; typically used RAS/CAS clocks and address and data multiplexing to save pins.
EEPROM: Electrically Erasable Programmable Read Only Memory, reusable but wears. Some FPGAs may include the EEPROM internally to boot the FPGA; high performance FPGAs generally boot from external ROM.
EPROM: Erasable Programmable Read Only Memory (using UV light; largely superseded by EEPROM).
FLASH: EEPROM in a small package, such as the low cost Smart Media used in digital cameras, now usable for booting FPGAs with a suitable boot interface. Historically, EEPROM used specifically for FPGAs has been very expensive. Note that by definition all MOS devices wear out, but EE devices much more so.
PROM: Programmable Read Only Memory, often 1 shot use only.
QDRAM: QDR DRAM, expected to supersede DDRAM, with even more bandwidth.
RLDRAM: Reduced Latency DRAM, very high performance DRAM, multi-banked with few bank restrictions, access times of 20ns, a new command cycle every 2.5ns.
SDRAM: Synchronous DRAM, superseded old style RAS/CAS DRAM; uses 1 system clock and reduces RAS/CAS to mode control bits.
SRAM: Static RAM, memory structure with a typical access time of 10ns or less.

ARM: Highly successful but almost invisible embedded RISC processor. Notable features include good basic opcodes combined with optional shifts and conditional operations; highly protected IP licence.
HPC: High Performance Computing, a term used in scientific computing.
MIPS: Classic RISC processor, also a highly successful embedded processor, also a highly protected IP licence.
Niagara: A threaded processor design from Afara, acquired by Sun Microsystems. It features 8 processor cores, each 4 way threaded, running at about 1GHz. Afara was also housed in a venture capital outfit connected to Atiq Raza.
PPC: Power PC architecture, descended from John Cocke's seminal early RISC work. Of note is that it has many more instructions than other RISCs; it has 8 condition code sets selected by each opcode; and it has been implemented in many forms, both 32 and 64 bits wide, in order and out of order, superscalar etc. It is the last standing high performance competitor to x86 at the high frequency end, now featured as the processor of choice in the next generation of gaming systems, in particular the Cell, which some have likened to a Transputer because of the inclusion of 8 additional coprocessors.
Raza XLR: A threaded processor design using the MIPS ISA, from Atiq Raza, who was also the primary architect for the NexGen 686 and the AMD Athlon series. Features 8 processor cores, each 4 way threaded, running at about 1.5 GHz; also supports RLDRAM; aimed at network packet processing.
Tera MTA: A famous MTA architecture dating back to the Denelcor HEP, designed by Burton Smith during the late 1980s and later reincarnated as the Tera MTA. It featured multi threading in the processor and memory and could extract instruction level parallelism, although at great hardware expense.
Ubicom: Another MTA, an 8 way threaded processor at 250 MHz for wireless markets.
Towards Strong Mobility in the Shared Source CLI

Johnston STEWART a, Paddy NIXON b, Tim WALSH c and Ian FERGUSON a
a SmartLab, Dept. of Comp. and Inf. Sciences, University of Strathclyde, Glasgow
b Computer Science Department, University College Dublin, Dublin 4, Ireland
c Department of Computer Science, University of Dublin, Trinity College, Dublin, Ireland
{johnston.stewart, ian.ferguson}@cis.strath.ac.uk, [email protected], [email protected]

Abstract. Migrating a thread while preserving its state is a useful mechanism to have in situations where load balancing within applications with intensive data processing is required. Strong mobility systems, however, are rarely developed or implemented, as they introduce a number of major challenges into the implementation of the system. This is due to the fact that the underlying infrastructure that most computers operate on was never designed to accommodate such a system, and because of this it actually impedes the development of these systems to some degree. Using a system based around a virtual machine, such as Microsoft's Common Language Runtime (CLR), circumnavigates many of these problems by abstracting away system differences. In this paper we outline the architecture of the threading mechanism in the shared source version of the CLR, known as the Shared Source Common Language Infrastructure (SSCLI). We also outline how we are porting strong mobility into the SSCLI, taking advantage of its virtual machine.

Keywords. Thread Mobility, Virtual Machine, Common Language Runtime.
Introduction
The term mobility, when applied to code, has two guises: weak mobility and strong mobility. Weak mobility [1] is defined as the ability to allow code transfer across nodes – the code can be accompanied by some initialization data, but the execution state cannot be migrated. Strong mobility [1] is defined as the ability to migrate both the code and the execution state to a remote host. Weak mobility is favoured for most mobile systems in the commercial domain as it is relatively easy to design and implement such a system. Strong mobility systems, on the other hand, are rarely developed or implemented as they introduce a number of major challenges for the implementation of the system. This is because the underlying infrastructure that most computers operate on was never designed to accommodate such a system, and because of this it actually impedes the development of these systems somewhat. Using a system based around a virtual machine, however, circumvents the problems introduced by varying underlying operating systems and system architectures. We propose to augment the SSCLI's thread model to allow the stack of a thread to be captured, saved and then reseeded and resumed on a new node. Doing this in a safe and managed manner is a challenging proposition and is necessary for the implementation of closures and continuations.
The SSCLI is a portable implementation of the programming tools and libraries that comprise the ECMA-335 Common Language Infrastructure standard [2]. The ECMA (European Computer Manufacturers Association) CLI is a standardised specification for a virtual execution environment, with virtual execution occurring within the CLI under the control of its execution engine (EE). The ECMA CLI describes a data-driven architecture, in which various components in various languages can be brought together to form self-assembling, type-safe software systems. This process is driven by metadata, which is used by the developer to describe the behaviour of the software and is used by the execution engine (EE) to allow the safe loading of managed components from different sources. The EE (often referred to as the runtime) hosts components by interpreting the metadata which describes them at runtime. On the one hand, the CLI execution engine is similar to an operating system: it is a privileged piece of code which provides services and managed resources for code executing under its control. Programs can explicitly request services, or they can be made available as a part of the execution model. At the same time the CLI is similar to the traditional model of compiler, linker and loader, performing in-memory layout, compilation and symbol resolution. The CLI enables seamless sharing of computing resources and responsibilities, allowing unmanaged code to coexist safely with managed code. The SSCLI (also known as Rotor) consists of a fully functional CLI execution engine, a C# compiler, essential programming libraries and a number of development tools, built using a combination of C++ and C# and a small amount of assembler for processor-specific details [2].

1. Motivation
It is the belief of the authors that a coherent implementation of thread migration will be a foundational step for more advanced agent-related research and is a core requirement for the implementation of ambient-style systems. Many applications could benefit from the possibility of migrating threads between machines. Safety-critical applications, for instance, often cannot afford to be shut down for any reason. If running threads could migrate to another host temporarily, then maintenance and upgrades could occur without disruption to the service. Some of the main advantages of mobile computation [3] are noted below:
• Load sharing: distributing computations among processors around the system can lighten the load on heavily used processors and make use of underused resources.
• Communications performance: active objects which interact intensively can be migrated to the same node to reduce the communication overheads for the duration of their interaction.
• Availability: objects can be moved to different nodes in order to improve service and provide better protection against failure and lost connections.
• Reconfiguration: migrating objects allows continued service during scheduled downtime or node failure.
• Resource utilisation: an object can visit a node in order to take advantage of services or capabilities at that particular location.
2. Threads in the SSCLI
Thread structures within the SSCLI provide a method for associating the microprocessor's execution stack with related runtime data (by adding a chain of execution engine frames to the stack before JIT code is executed). This data includes security information, garbage collection markers and program variables, as well as other information [2]. The logical abstraction of a thread of control is captured by an instance of the System.Threading.Thread object in the class library [4], and is known as a managed thread.

2.1 PAL Threads
Within the execution engine (EE), managed threads are implemented on top of Platform Adaptation Layer (PAL) threads. This is done in order to abstract away the threading details, and the differing semantics of threading, of the differing implementations of the threading model used by the different operating systems that the SSCLI can run on [5] [2]. The PAL hides these differences beneath a single set of APIs. PAL threads themselves have a one-to-one relationship with OS threads (this does not, however, imply a one-to-one relationship between managed and unmanaged threads) [2].

Figure 1. Overview of the threading mechanism
Managed threads are wrapped around PAL threads and always have an associated PAL thread; however, a PAL thread need not always have an associated managed thread. Threads which originate from outside of the Common Language Runtime (CLR) are known as unmanaged threads [2]. PAL threads need to be able to maintain private, per-thread state so that the EE can track specific threads by associating a Thread instance with the underlying PAL thread. This is achieved by using the m_ThreadHandle field (a private attribute) in the Thread type, which contains a HANDLE to the PAL thread (see Figure 1), in order to control and schedule execution on the thread.
There is no PAL call to enable navigating from a thread handle to a managed thread, so the EE maintains a ThreadStore which can be used to enumerate managed threads from either managed or unmanaged code (Figure 1) [2, 6].

2.2 Interoperation of Managed and Unmanaged Code
The CLR allows managed code to call unmanaged code and vice versa, allowing managed and unmanaged code to interoperate freely; managed threads mix the execution state of managed and unmanaged code on a single stack (Figure 2) [2]. Also, a managed process can contain many different threads of control (Figure 2). The EE uses PAL threads, which become associated with managed code, to maintain exception handlers, scheduling priorities, and a set of structures that the underlying platform uses to save the context (which contains the values held in the machine registers and the state of the current execution stack) whenever it pre-empts the thread's execution [2]. Basically, the thread context contains all of the information that the thread needs in order to seamlessly resume execution [7].
Figure 2. Threads within a managed process
Transitions between managed and unmanaged code can be created in many ways. Within managed code, application boundaries or remoting contexts can be crossed, security permissions can change and exceptions can be thrown, and in all of these cases isolation needs to be maintained [2]. When a PAL thread attempts to enter managed code, an instance of the Thread class is set up to wrap it using the SetupThread method. Two checks are made. First, a call to GetThread looks for a cached Thread instance in the Thread Local Storage (TLS), a feature which allows a PAL consumer to associate data with a specific thread [2], so that the calling PAL thread can make sure it is not already known to the EE. Next, SetupThread ensures that the call is not coming from a different thread than the thread being initialised, by checking the ThreadStore for a matching identifier. If an identifier is found, SetupThread returns; if not, the PAL thread is unknown to the EE and a new Thread object is created, installed on the PAL thread's TLS, and marked as started [2]. The EE then adds the new Thread instance to the ThreadStore's list of all the threads ever seen. Threads that wander into the EE which are not previously known are set as background threads [2]. The easy way to control the creation and scheduling of threads is to let the CLR do it, using the thread pool.
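For illustration, the SetupThread flow just described might look roughly as follows. This is a minimal sketch using portable C++ stand-ins, not the actual SSCLI source; the Thread structure and thread store shown here are assumptions that compress the real EE bookkeeping.

    #include <map>
    #include <mutex>
    #include <thread>

    // Stand-in for the EE's managed Thread object (assumed, simplified).
    struct Thread { bool started = false; bool background = false; };

    static std::mutex g_storeLock;
    static std::map<std::thread::id, Thread*> g_threadStore; // threads ever seen
    static thread_local Thread* g_tlsThread = nullptr;       // models the TLS slot

    Thread* SetupThread() {
        // Check 1: a cached Thread instance in this PAL thread's TLS means
        // the thread is already known to the EE.
        if (g_tlsThread != nullptr) return g_tlsThread;

        // Check 2: search the ThreadStore for a matching identifier, so the
        // call is not mistaken for a different thread being initialised.
        std::lock_guard<std::mutex> lock(g_storeLock);
        auto it = g_threadStore.find(std::this_thread::get_id());
        if (it != g_threadStore.end()) return it->second;

        // Unknown PAL thread: create a Thread object, install it on the PAL
        // thread's TLS, mark it started, and record it in the store. Threads
        // that wander in like this are set as background threads.
        Thread* t = new Thread();
        t->started = true;
        t->background = true;
        g_tlsThread = t;
        g_threadStore[std::this_thread::get_id()] = t;
        return t;
    }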
2.3 The Thread Pool and Separate Threads in the SSCLI
The SSCLI includes a pooling mechanism which caters for simple multithreaded scenarios. The ThreadPool class provides a fixed pool of worker threads to service incoming requests [2]. This class is provided within the SSCLI so that systems like Windows 98, which does not have native support for thread pools, can still utilise them [8]. Initially all of the thread pool threads are idle; when a request comes in, the system looks up the thread pool, finds an idle thread and assigns that thread to the request which is to be serviced. If all of the threads in the thread pool are busy when a request comes in, the system either grows the thread pool by one and assigns the new thread to the request or, if the thread pool is already operating with its maximum number of worker threads, waits for a thread within the pool to become free [9]. Each thread in the thread pool runs at the default priority and uses the default stack size [10]. There is a single managed thread pool per process, access to which is gained through the ThreadPool class [2]. Using the thread pool is the easiest way to code for tasks which are required to handle multiple threads for relatively short tasks without blocking other threads, and where there is no need for a specific order of scheduling for the tasks performed. However, the thread pool is not a good choice if the task to be performed needs to be set at a specific priority or if it might run for an extended period of time, as this would block other threads [7]. The thread pool is also of no use if a stable identity has to be associated with the thread (so that it can be suspended or discovered by name). Even when using the C++ method GetThreadInfo there is no guarantee that the threadID returned is the correct Win32 threadID of the underlying OS thread, as the thread pool is abstracted one layer away from the Win32 threadID [11]. In order to retain some degree of control over a managed thread's scheduling, and to be able to accurately discover its associated OS thread, separate instances of the Thread class must be used. The alternative to using the threads from the thread pool is to create separate instances of the Thread class. Management of all threads within the CLR is done using the Thread class; this includes threads created by the CLR, and any threads which originate outside of the runtime that enter the managed environment in order to execute code [2].

2.4 Managed Thread to OS Thread Relationship
In the current implementation of the CLR, an OS thread will have only one managed thread associated with it in a given Application Domain, known as an AppDomain, which is a form of isolation which the runtime uses to ensure that code running in one application cannot affect other applications. AppDomains provide a secure and versatile unit of processing that the CLR uses to isolate applications from each other [12]. If an OS thread executes code in more than one AppDomain, each AppDomain will have a distinct managed thread associated to that thread (see Figure 3) [13]. The threads within the CLR may or may not have a corresponding OS thread, as threads which have stopped, or which have been created but not started, do not have a corresponding OS thread. Also, if an OS thread has not yet executed any managed code, there will not be a managed thread object corresponding to it [2]. A sketch of this per-AppDomain association follows.
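As an illustration only (not the CLR's actual bookkeeping), the one-managed-thread-per-(OS thread, AppDomain) rule can be pictured as a table keyed by the pair:

    #include <map>
    #include <thread>
    #include <utility>

    // Illustrative stand-ins; not the real CLR types.
    struct ManagedThread {};
    using AppDomainId = int;

    // One managed thread per (OS thread, AppDomain) pair: the same OS thread
    // executing in two AppDomains is represented by two distinct managed
    // threads.
    static std::map<std::pair<std::thread::id, AppDomainId>, ManagedThread*> g_table;

    ManagedThread* GetManagedThreadFor(AppDomainId domain) {
        auto key = std::make_pair(std::this_thread::get_id(), domain);
        auto it = g_table.find(key);
        if (it != g_table.end()) return it->second;   // already associated
        ManagedThread* mt = new ManagedThread();      // distinct per AppDomain
        g_table[key] = mt;
        return mt;
    }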
Within managed threading, Thread.GetHashCode is the stable identification for a managed thread and this will not collide with the value of any other thread for the lifetime of the thread which it identifies, regardless of the application domain from which the value is obtained. An OS ThreadId has no fixed relationship to a managed thread, as many managed threads may be serviced by the same OS thread (Figure 4) [5].
Figure 3. AppDomains and Threads
Figure 4. Managed thread to OS thread relationship
2.5 Overview of the Thread Class
The Thread class creates and controls a thread, sets its priority, and gets its status. The Thread type is safe for multithreaded operations. The ThreadPriority property can be used to schedule a priority level for a thread; however, this is not guaranteed to be honoured by the underlying operating system [5]. A managed thread can be started, joined, suspended and killed. It is exposed and used as a managed component, although much of the implementation behind it is written in C++ and is internal to the EE. The code used to represent managed instances can be found in the C++ file Comsynchronisable.cpp [2].

2.6 Functionality of the Thread Class
The Thread class does not have a method with which the thread ID of an OS thread can be discovered, as ECMA standard 335 states that the Thread class represents a logical thread and not necessarily an operating system thread, as explained above. The OS thread ID can be discovered, along with other information about the physical thread, by using the System.Diagnostic.ProcessThread class, or the AppDomain.GetCurrentThreadId
method, which is a wrapper built around the Win32 GetCurrentThreadId function [14]. The managed threads within the CLR map loosely to the Win32 threads in the underlying OS (Table 1) [15]; however, this does not represent identical functionality, as some methods or functions have no equivalent. The Thread class itself provides the other methods needed to create and control a thread object.

Table 1. Functionality mapping between Win32 and CLR

Win32                                   CLR
CreateThread                            Mix of Thread and ThreadStart
TerminateThread                         Thread.Abort
SuspendThread                           Thread.Suspend
ResumeThread                            Thread.Resume
Sleep                                   Thread.Sleep
WaitForSingleObject on thread handle    Thread.Join
ExitThread                              No equivalent
GetCurrentThread                        Thread.CurrentThread
SetThreadPriority                       Thread.Priority
No equivalent                           Thread.Name
No equivalent                           Thread.IsBackground
Close to CoInitializeEx (OLE32.DLL)     Thread.ApartmentState
3. Thread Local Storage
With respect to managed threads, thread local storage (TLS) provides dynamic data slots, unique to a thread and AppDomain combination, which are used to store thread-specific data. There are two types of data slots: named and unnamed. Named slots can be given a mnemonic identifier; however, other components can then (intentionally or not) modify them by using the same name for their own thread-relative storage. If an unnamed slot is not exposed to any other code, it cannot be used by any other component [16]. PAL threads have their own TLS, which allows a PAL consumer to associate data with a specific thread for later retrieval in the thread's context. In unmanaged code a TLS slot is obtained by calling TlsAlloc. This slot can then have its value set by TlsSetValue (which stores a pointer to the allocated memory) or retrieved using TlsGetValue (which returns the pointer to the thread's memory slot). This memory is freed on thread termination [2].
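A minimal sketch of this unmanaged pattern, using the Win32 calls that the PAL mirrors (error handling reduced to the bare minimum):

    #include <windows.h>
    #include <stdio.h>

    int main(void) {
        DWORD slot = TlsAlloc();                  // reserve a TLS slot
        if (slot == TLS_OUT_OF_INDEXES) return 1;

        int* data = (int*)HeapAlloc(GetProcessHeap(), 0, sizeof(int));
        *data = 42;
        TlsSetValue(slot, data);                  // store a per-thread pointer

        int* back = (int*)TlsGetValue(slot);      // retrieve it later, from
        printf("per-thread value: %d\n", *back);  // the same thread

        HeapFree(GetProcessHeap(), 0, back);      // the PAL frees such memory
        TlsFree(slot);                            // on thread termination;
        return 0;                                 // here we clean up explicitly
    }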
The CLR maintains a lot of information within the PAL thread's TLS, in particular references to the current AppDomain and managed Thread object. Whenever a PAL thread crosses from one AppDomain to another, the CLR adjusts the references in the TLS to point to the new current managed thread and AppDomain. The current implementation of the CLR maintains a per-AppDomain table which ensures that any PAL thread is associated with only one managed thread object per AppDomain [13].

4. Method State
The CLI manages multiple concurrent threads of control (which are not necessarily the same as the threads provided by the host operating system), multiple managed heaps and a shared memory address space (Figure 5) [4]. A thread of control can be thought of as a singly linked list of method states (see Figure 5), where a new state is created and linked back to the current state by a method call instruction, then removed when the method call completes (by a normal return, a tail-call, or an exception).
Figure 5. The machine state model, including threads of control, method states, and multiple heaps in a shared address space.
Method state describes the environment within which a method executes. In conventional compiler terminology, the method state corresponds to a superset of the information captured in the invocation stack frame. The CLI method state [4] (Figure 6) consists of the following items: an instruction pointer (IP), an evaluation stack, a local variable array (starting at index 0), an argument array, a methodInfo handle, a local memory pool, a return state handle and a security descriptor. The method state's four areas (incoming arguments array, local variables array, local memory pool and evaluation stack, Figure 6) are specified as if they were logically distinct areas [4].
Figure 6. Method State
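These items can be pictured as a record linked back to its caller's state; the following C++ sketch is illustrative only and does not reflect the actual SSCLI data structures.

    #include <cstdint>
    #include <vector>

    struct Value;                      // stand-in for an evaluation-stack slot

    // Illustrative picture of the CLI method state items listed above.
    struct MethodState {
        const uint8_t*      ip;           // instruction pointer (IP)
        std::vector<Value*> evalStack;    // evaluation stack
        std::vector<Value*> locals;       // local variable array (index 0 up)
        std::vector<Value*> args;         // incoming argument array
        const void*         methodInfo;   // methodInfo handle
        void*               localMemPool; // local memory pool
        MethodState*        returnState;  // return state handle: the caller,
                                          // so a thread of control is a singly
                                          // linked list of method states
        const void*         securityDesc; // security descriptor
    };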
5. SSCLI Stack Management for Managed Threads
There is only one physical stack per process and all the threads within a process share this stack, using their linked lists of stack frames (as explained above) to simulate logically separate stacks. This means that although the threads serviced by a single PAL thread all have their frames on the same stack, they can be thought of as having their own 'virtual' stack. These stack frames will be referred to as activation records, in accordance with the Microsoft convention. Once a PAL thread has a Thread instance associated with it, managed code can be executed on it. Before any JIT-compiled code is executed, two important tracking structures are pushed onto the stack: an exception handler, to wrap the unmanaged code, and a chain of execution engine frames, to annotate parts of the stack with runtime information produced by the EE. A service called the code manager is used to deal with the intricacies of tracing the stack, and joins the managed and unmanaged portions of the stack into a single coherent view. When it is needed, the information is gleaned from the stack by traversing the stack's call chain to extract the current execution state, in what is known as a stackwalk. A linked list of EE frames (instances of the Frame class or one of its subclasses) is associated with each managed Thread object [2].

6. Towards Strong Mobility in the SSCLI
The advantage of strong mobility is that long-running or long-lived threads can suddenly move or be moved from one host to another. A transparent mechanism such as this would allow continued thread execution without data loss in its ongoing execution. This approach is useful in building distributed systems with complex load balancing requirements, where the threads involved need not know about their movement. Strong mobility also has disadvantages: for example, if a particular resource on the original node is necessary for the thread's execution, this will cause problems at the destination node. Using containers [17] allows us to package necessary objects with the executing thread and send them during the migration.

6.1 Containers
The container abstraction closely associates threads and their corresponding data objects (Figure 7); containers are the unit of migration. Containers may also contain passive objects for retaining data and results. Containers execute in 'homes', which act as a sandbox, receive containers and allow them to resume execution. Using a container allows us to
group threads and their respective objects into a single structured unit for migration [17]. Using appropriately set up AppDomains as containers, and the CLR's virtual machine as a 'home', gives us a starting point for developing strong mobility in the SSCLI. Because the SSCLI was written with distributed applications in mind, crossing AppDomain boundaries is a relatively simple task. Such boundary crossing, however, would quickly break an application which seeks to contain the inhabitants of a specific AppDomain and disallow entry to entities outside of the AppDomain. Since code can be written in this way, we currently rely on code being written responsibly in order to avoid breaking the application.
Figure 7. Containers
6.2 Thread Migration
Our solution is currently a work in progress based around using the SSCLI's AppDomain as a container and a modified version of the simple migration system known as Migrants [18] to facilitate transfer of the container from one node to another. Managed threads in the CLR are objects; however, they wrap PAL threads which in turn sit on top of OS threads, and because of this they are deliberately not serializable [19]. This means that it is not possible to simply serialize the thread in order to migrate the container to the remote host. At present the threading mechanisms within the SSCLI provide the basic functionality, with no mechanism for state capture or thread migration [20]. We can, however, use the Win32 API in order to discover the managed thread's underlying OS thread and then use this information to save that particular thread's data from the OS thread's stack (a sketch of this step follows the conclusions). The threads can then be serialized and sent to the remote host with their instance variables and code, where they will be unpackaged; the container can then be reseeded and execution resumed.

7. Conclusions
Using the abstraction of a container within the SSCLI, utilising its virtual machine and taking advantage of AppDomains, makes thread migration with strong mobility a possibility. The initial implementation will be easily broken; however, this is something which can be made more robust over time. It can be argued that the proposed system is not one of true strong mobility, as references outside of the container cannot be supported. However, this can be seen as the first step to instantiating usable, useful strong mobility (for such areas as load balancing).
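Returning to the state-capture step of Section 6.2, the sketch below illustrates how the underlying OS thread could be frozen and its register context read through the Win32 API. This is a hedged illustration, not the authors' implementation; serializing the captured context and stack data is the hard part and is omitted here.

    #include <windows.h>

    // Freeze the OS thread behind a managed thread and capture its registers.
    // Returns TRUE on success; the caller decides what to do with the context.
    BOOL CaptureThreadState(DWORD osThreadId, CONTEXT* ctx) {
        HANDLE h = OpenThread(THREAD_ALL_ACCESS, FALSE, osThreadId);
        if (h == NULL) return FALSE;

        if (SuspendThread(h) == (DWORD)-1) {   // freeze the thread first
            CloseHandle(h);
            return FALSE;
        }
        ctx->ContextFlags = CONTEXT_FULL;      // registers, incl. stack pointer
        BOOL ok = GetThreadContext(h, ctx);    // snapshot the execution state

        // The thread's stack could now be read (between the captured stack
        // pointer and the stack base) before resuming or migrating the thread.
        ResumeThread(h);
        CloseHandle(h);
        return ok;
    }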
Acknowledgements
This research was supported by Microsoft Research and the University of Strathclyde.

References
[1] A. Fuggetta, G.P. Picco, G. Vigna, Understanding Code Mobility, IEEE Trans. of Software Eng., vol. 24, no. 5, pp. 342-361, May 1998.
[2] David Stutz, Ted Neward and Geoff Shilling, Shared Source CLI Essentials, O'Reilly, 2003, First Edition.
[3] E. Jul, H. Levy, N. Hutchinson, A. Black, Fine-Grained Mobility in the Emerald System, ACM Trans. on Computer Systems, vol. 6, no. 1, pp. 109-133, February 1988.
[4] Common Language Infrastructure (CLI) Partition I: Concepts and Architecture, ECMA TC39/TG3, Final draft, October 2002.
[5] Microsoft Corporation, Thread Class Overview, http://msdn.microsoft.com/library/default.asp?url=/library/enus/cpref/html/frlrfsystemthreadingthreadclasstopic.asp – accessed 02/08/05.
[6] Jason Whittington, Inside Rotor Presentation, http://staff.develop.com/jasonw/tools_rotor_2002.ppt – accessed 02/08/05.
[7] Microsoft Corporation, Threads and Threading, http://msdn.microsoft.com/library/default.asp?url=/library/enus/cpguide/html/cpconthreadsthreading.asp – accessed 02/08/05.
[8] Jeffrey Richter, The CLR's Threadpool, http://msdn.microsoft.com/msdnmag/issues/03/06/NET/ – accessed 02/08/05.
[9] Microsoft Corporation, Threadpooling, http://msdn.microsoft.com/library/default.asp?url=/library/enus/cpguide/html/cpconthreadpooling.asp – accessed 02/08/05.
[10] Microsoft Corporation, Threadpool Class, http://msdn.microsoft.com/library/default.asp?url=/library/en-us/cpref/html/frlrfSystemThreadingThreadPoolClassTopic.asp – accessed 02/08/05.
[11] Brian Dowds, Introduction to the CLR, http://support.microsoft.com/default.aspx?scid=%2Fservicedesks2Fwebcasts%2Fwc022802%2FWCT022802.asp – accessed 02/08/05.
[12] Microsoft Corporation, AppDomain Class, http://msdn.microsoft.com/library/default.asp?url=/library/enus/cpref/html/frlrfsystemappdomainclasstopic.asp – accessed 02/08/05.
[13] Don Box with Chris Sells, Essential .NET, Volume 1, Addison Wesley, 2003.
[14] Dino Esposito, Windows Hooks in the .NET Framework, http://msdn.microsoft.com/msdnmag/issues/02/10/cuttingedge/ – accessed 02/08/05.
[15] Microsoft Corporation, Managed and Unmanaged Threading in Microsoft Windows, http://msdn.microsoft.com/library/default.asp?url=/library/enus/cpguide/html/cpconmanagedunmanagedthreadinginmicrosoftwindows.asp – accessed 02/08/05.
[16] Microsoft Corporation, Thread Local Storage and Thread-Relative Static Fields, http://msdn.microsoft.com/library/default.asp?url=/library/enus/cpguide/html/cpconthread-localstoragethread-relativestaticfields.asp – accessed 02/08/05.
[17] Tim Walsh, Paddy Nixon, Simon Dobson, As Strong As Possible Mobility: An Architecture for Stateful Object Migration on the Internet, Technical Report TCD-CS-2000-11, Department of Computer Science, Trinity College Dublin, 2000.
[18] Simon Dobson, Migrants – A Brain-Numbingly Simple Mobile Code System, 2003, https://www.cs.tcd.ie/Simon.Dobson/software/migrants/index.html.
[19] SSCLI source code, Thread class, sscli\clr\src\bcl\system\threading\thread.cs.
[20] David Platt, Introducing Microsoft .NET, Microsoft Press, 2001.
gCSP occam Code Generation for RMoX† Marcel A. GROOTHUIS, Geert K. LIET, Jan F. BROENINK Twente Embedded Systems Initiative, Drebbel Institute for Mechatronics and Control Engineering, Faculty of EE-Math-CS, University of Twente, P.O.Box 217, 7500 AE, Enschede, the Netherlands [email protected] Abstract. gCSP is a graphical tool for creating and editing CSP diagrams. gCSP is used in our labs to generate the embedded software framework for our control systems. As a further extension to our gCSP tool, an occam code generator has been constructed. Generating occam from CSP diagrams gives opportunities to use the Raw-Metal occam eXperiment (RMoX) as a minimal operating system on the embedded control PCs in our mechatronics laboratory. In addition, all processors supported by KRoC can be reached from our graphical CSP tool. The commstime benchmark is used to show the trajectory for gCSP code generation for the RMoX operating system. The result is a simple means for using RMoX in our laboratory for our control systems. We want to use RMoX for future research on distributed control and for performance comparisons between a minimal operating system and our CTC++/RT-linux systems. Keywords. CSP, Embedded Control Systems, Real-time, occam
Introduction
For broad acceptance of an engineering paradigm, a graphical notation and a supporting design tool are needed. This is especially the case for CSP-based software, since it is concurrent software. Designers often draw a kind of block diagram to indicate the flow of data along the channels that connect the processes [1-6]. Besides standardization, a graphical tool supporting the gCSP graphical notation allows for proper consistency between the diagrams and the resulting concurrent code. This opens the way towards a model-driven development environment, where the diagram of the structure of the concurrent processes is its specification. The consistency between diagram and concurrent software is thus intrinsically guaranteed. From one model, different kinds of code can be generated, using multiple code generators that all use the same input data from the graphical tool. Conclusions drawn from tests on one kind of generated code can be used for another kind of generated code (for example, the results of model checks with FDR2 using the CSPm generator can be applied to the executable code generated with the CTC++ generator). CSP and formal checking are used in our labs to generate high-quality control software free of deadlocks and divergence. Following our experience with the gCSP tool [7], an obvious extension is an occam code generator. gCSP already supports CSPm and (CT)C++ code generators.
† This research is supported by PROGRESS, the embedded system research program of the Dutch organization for Scientific Research, NWO, the Dutch Ministry of Economic Affairs and the Technology Foundation STW.
The generated occam code can then be used as source code for the lightweight RMoX operating system [2] running on PCs. Since embedded PCs are often used to control mechatronic setups, using RMoX extended with control software will result in a small embedded control system. Section 1 gives information on gCSP and the RMoX operating system. In Section 2, the architecture and software construction issues of this gCSP occam code generation extension are treated. Section 3 presents a case study in which the commstime benchmark is used to test the code generation for RMoX. Section 4 closes the paper with conclusions and our future work on gCSP/RMoX.

1 gCSP and RMoX
1.1 gCSP
gCSP is a graphical tool for creating and editing CSP diagrams. It is based on the Graphical Modeling Language developed by Hilderink [8, 9]. CSP diagrams are dataflow diagrams, connecting processes with channels. Besides the dataflow, the concurrency structure is also indicated. The nodes in the graph are connected by two kinds of edges, namely the channels and compositional relations (see Figure 4 for an example). The structure of the processes, including their concurrency relation, is presented as a tree. This is another view of the structure in the diagram, with focus on the composition, since occam-like programs are always shaped as a strict tree-like hierarchy of SEQ, (PRI)PAR and (PRI)ALT constructs as branches and user-defined processes as leaves. Besides editing, gCSP does basic consistency checks and can generate code from the diagrams, thus intrinsically guaranteeing consistency between diagram and resulting code. The code generation outputs of gCSP are currently CSPm, as input for model checkers like FDR2, and CTC++, the CSP / C++ library of our group. The third code generation output, namely occam, is one of the subjects of this paper.

1.2 RMoX
The RMoX operating system [2] is a small CSP-based operating system. The core is built around a stripped-down version of the Linux kernel, for the low-level operating system operations, and the RMoX kernel. This RMoX kernel is an occam program that can also run within a standard Linux environment (User Mode RMoX). All RMoX components, like device drivers, consoles and the occam demo applications, are included in this single program as occam processes. Since RMoX is a minimal operating system based on CSP, occam and Linux, it is an interesting target OS for embedded control PCs, which are often equipped with small flash-based disks and a small amount of memory. Embedded control PCs do not need all the functionality offered by general-purpose operating systems like Linux; instead, they require accurate timing for a hard real-time control loop. The Linux basis gives the flexibility of adding existing Linux drivers for our I/O hardware to RMoX. The RMoX kernel gives us high-speed CSP concurrency. The target platforms that can be reached by RMoX are restricted by two factors: the KRoC compiler should support the platform and a Linux kernel port should exist for it. RMoX is currently designed for Pentium-based systems. Supporting smaller targets like DSPs will require much porting effort for both KRoC and the Linux kernel.
2 The gCSP occam Code Generator
The essential architectural choice here is that the different code generators start from the same data model (i.e. the data structure in the graphical editor), and that all other transformations are common to all the code generation outputs; see Figure 1.

Figure 1: Global structure of gCSP with its code outputs
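As an aside, this one-model/many-back-ends choice can be sketched as a common generator interface over a shared model. The class names below are purely illustrative and do not come from gCSP itself.

    #include <memory>
    #include <string>
    #include <vector>

    // Illustrative shared data model: a process, channel or composition node.
    struct ModelNode {
        std::string kind;                    // e.g. "SEQ", "PRIPAR", "Writer"
        std::vector<ModelNode> children;
    };

    // Common interface that the CSPm, CTC++ and occam back ends would
    // implement; every generator consumes the same, already-checked model,
    // so diagram and generated code cannot drift apart.
    class CodeGenerator {
    public:
        virtual ~CodeGenerator() = default;
        virtual std::string generate(const ModelNode& model) = 0;
    };

    class OccamGenerator : public CodeGenerator {
    public:
        std::string generate(const ModelNode& model) override {
            std::string out = "PROC gCSPModel(CHAN BYTE screen)\n";
            // ... walk the model and emit occam constructs here ...
            (void)model;
            return out + ":\n";
        }
    };

    std::string emitOccam(const ModelNode& model) {
        std::unique_ptr<CodeGenerator> gen = std::make_unique<OccamGenerator>();
        return gen->generate(model);
    }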
The occam code generator is another output next to the two existing code generator outputs gCSP already has (namely CSPm and CTC++). Since all three code generation target languages are based on CSP, it is rather obvious that the occam code generator is comparable to the other two code generators. However, there are differences, which are caused by the differences in the target languages that gCSP generates code for. In the current implementation, the following six differences were recognized:
1. Initialization and body of a process are not separated in occam, in contrast to CTC++.
2. At an ALT construct, the readers in the guard and the guarded process need special attention for occam code generation.
3. occam only supports input guards in an ALT construct.
4. gCSP uses different names for arithmetic data types, double and float, which are cast to REAL64 and REAL32 in occam respectively.
5. occam has more types than gCSP currently supports: for instance the TIMER type.
6. occam uses channels instead of functions for screen output and keyboard input.
Item one implies that the code for sub-processes has to be generated in-line instead of as separate functions. By using an appropriate folding editor and sophisticated comment lines, the overview of the generated occam code can be supported. However, inspecting the generated occam code should hardly be necessary. The 'real' source code is the gCSP diagram, including its code blocks to specify the algorithmic bodies of the processes. The implementation of the ALT construct in CTC++ combines the readers in the guards with the readers in the alternative processes, thus preventing a double read action. This behaviour is different in occam. Furthermore, occam only supports input guards, whilst CTC++ also supports output guards in an ALT construct. The sixth item implies that the generated occam code for processes that use screen output or keyboard input needs more channels than the corresponding CTC++ code. This is solved by adding 'hidden' screen or keyboard channels to an occam process if needed. These channels exist in the generated occam code, but are invisible in the gCSP model to maintain the overview. Another solution would be the use of external (linkdriver) channels to access the screen and the keyboard, but it is currently not possible in gCSP to draw any-to-one and one-to-any channels.
gCSP produces the code of Listing 1 from the producer – consumer example shown in Figure 2. The producer contains a SEQ of a code block and a writer. The consumer contains a SEQ of a reader and a code block. The producer produces data and writes it to a channel, while the consumer reads the data and writes it to the screen. The producer has a higher priority than the consumer (the arrow above the || points to the process with the highest priority).
Figure 2: Producer - Consumer example in gCSP: composition tree (left) and block diagram (right).
Because these processes are small, a new inline generation feature has been added to the gCSP code generation to optimize the generated occam code. With this option enabled, the contents of the producer and consumer process will be generated as part of the model process instead of separate processes (compare Listing 1 with Listing 2). This feature is not yet available for CTC++. PROC gCSPModel(CHAN BYTE screen) ---Initialization INITIAL REAL64 y IS 0.0 : INITIAL REAL64 x IS 0.0 : CHAN REAL64 chan1: ---Process Body WHILE TRUE PRI PAR ---Producer SEQ y := y + 1.0 IF y > 10.0 y := 0.0 TRUE SKIP chan1 ! y ---Consumer SEQ chan1 ? x out.string("Value read from channel: *n",0,screen) out.real64(x,0,0,screen) out.string("*n",0,screen) :
Listing 1: occam code of Figure 2 generated by gCSP using inline generation
The gCSP-generated occam code always contains one parent process with the name gCSPModel. This process will contain all occam code for all processes, either inline generated or using subprocesses.

3 Example: commstime Benchmark
To test the occam code generation in combination with RMoX, the occam commstime benchmark program delivered with KRoC was used as an example. Figure 3 shows the route from a gCSP model to a running example under RMoX.
Figure 3: Overview for the gCSP to RMoX route
The first step is drawing the commstime example in gCSP. This results in the gCSP diagram for the commstime demo shown in Figure 4. The top right part of this figure shows four processes in parallel, of which three send data in a circle while the fourth, the TimeAnalyser, measures the loop time and the time required for context switching. The left part of the figure shows the composition, and the bottom right part shows the contents of the first of the three processes, the Successor. The internals of the other processes are not shown, but they are comparable.
Figure 4: Commstime example in gCSP: composition tree (left) and block diagram (right)
The occam code for this diagram is generated using the new occam code generation output. The generated code is shown in Listing 2. All processes are now generated without inlining, which results in five sub processes (including the Identity subprocess) and a main body with the four commstime processes in parallel. Note that the TimeAnalyser process contains only the initial warm-up loop. The occam display code for this process will be added further down, because it cannot be drawn completely in gCSP. PROC gCSPModel(CHAN BYTE screen) PROC TimeAnalyser(CHAN INT chan4, CHAN BYTE screen) INT value: INT t0: INT t1: INITIAL INT looptime IS 100000 : INITIAL INT warmup IS 16 : SEQ WHILE warmup > 0 SEQ chan4 ? value warmup := warmup - 1 : PROC Successor(CHAN INT chan3, CHAN INT chan2) INT message: WHILE TRUE SEQ chan2 ? message message := message + 1 chan3 ! message : PROC Delta(CHAN INT chan1, CHAN INT chan4, CHAN INT chan2) INT n: WHILE TRUE SEQ chan1 ? n chan2 ! n chan4 ! n : PROC Identity(CHAN INT chan1, CHAN INT chan3) INT message: WHILE TRUE SEQ chan3 ? message chan1 ! message : PROC Prefix(CHAN INT chan1, CHAN INT chan3) INITIAL INT message IS 0 : SEQ chan1 ! message Identity(chan1, chan3) : CHAN INT chan1: CHAN INT chan4: CHAN INT chan3: CHAN INT chan2: PAR Successor(chan3, chan2) Prefix(chan1, chan3) Delta(chan1, chan4, chan2) TimeAnalyser(chan4, screen) :
Listing 2: Generated occam 3 code for the commstime benchmark using normal generation.
To show the commstime statistics on the screen, a code block was added to the TimeAnalyser process with occam code to display the results every loop. This is done in gCSP using the code dialog window shown in Figure 5. This figure shows the gCSP code dialog that can be used to add (optional) code to a CSP process. Currently CTC++ code, CSPm code and occam code can be added. It is possible to add code for all three languages at the same time, resulting in one gCSP model that can be used for multiple languages. The missing occam code for displaying the statistics has been copied from the consume process in the original KRoC commstime example. After the addition of the display code, the generated occam code is comparable with the original commstime code delivered with the KRoC compiler.
Figure 5: Code dialog for the TimeAnalyser showing part of the missing occam code
The occam code block from Figure 5 uses support functions (out.string, cursor.x.y) which are not regularly available in occam libraries. For RMoX, out.string can be found in the occ_utils library; however, cursor.x.y is not available. This is solved by adding the missing occam support code to the top-level model. The generated occam code is now complete; compiling it with the KRoC compiler and running it under Linux gives a working commstime example with a looptime of 990 ns and a context switch of 248 ns (on a 400 MHz Pentium II). The next step is to get the program running under RMoX. The current version of RMoX (version 0.1.3) has no provision for loading and executing programs. All functionality is included in the RMoX kernel. To run the generated code under RMoX, the RMoX kernel has been extended with an additional console process that contains the generated gCSP program. After code generation, the RMoX operating system has to be (re)compiled with KRoC in order to get the gCSP program running under RMoX. The result is shown in Figure 6. The benchmark times under User Mode RMoX are roughly doubled (probably due to the emulation). Running under native RMoX gives almost identical results to the example running under normal Linux.
Figure 6: User-Mode RMoX running the gCSP Commstime demo
4 Conclusion and Future Work
The construction of a first version of an occam code generator output for gCSP has been successful. gCSP is now able to generate occam code from its models. This code can be compiled with KRoC and runs under RMoX and Linux. We expect future use of RMoX on our distributed mechatronic setups, where PC/104 PCs are used as control computers in a hardware-in-the-loop simulation setting. Furthermore, the work on using RMoX gives us the idea for a lean-and-mean embedded OS variant of our CTC++ library, and it makes performance comparisons between a minimal operating system and our existing CTC++/RT-Linux systems in our control laboratory possible. Before RMoX can be used for control systems, interfaces from occam code to our I/O hardware devices are needed. occam ports or the new KRoC C-interface can be used for constructing these device drivers. Furthermore, accurate timing support is necessary for control systems to fulfil the requirement of practically jitter-free equidistant time stamps for sampling [10]. RMoX provides a way of blocking an occam process waiting for a hardware interrupt. This can be used to unblock controller processes waiting for a timer interrupt. The tests used here used a significant portion of occam code in the code blocks. Not every part of an occam program can be drawn completely in gCSP. For example, the use of constants, shared channels and FOR loops is not yet supported in gCSP. This should be added in future versions. Besides this, syntax highlighting in the gCSP code blocks would be a useful addition.

References
[1] F.R.M. Barnes, occwserv: An occam Web-Server, in Communicating Process Architectures 2003, J.F. Broenink and G.H. Hilderink, Eds. Enschede, Netherlands: IOS Press, 2003, pp. 251-268, ISBN: 1 58603 381 6.
[2] F.R.M. Barnes, C. Jacobson, and B. Vinter, RMoX: A raw-metal occam Experiment, in Communicating Process Architectures 2003, J.F. Broenink and G.H. Hilderink, Eds. Enschede, Netherlands: IOS Press, 2003, pp. 269-288, ISBN: 1 58603 381 6.
[3] F.R.M. Barnes and P.H. Welch, Communicating Mobile Processes, in Communicating Process Architectures 2004, I.R. East, J.M.R. Martin, P.H. Welch, D. Duce, and M. Green, Eds. Oxford, UK: IOS Press, 2004, pp. 201-218, ISBN: 1586034588.
[4] A.L. Lawrence, Overtures and hesitant offers: hiding in CSPP, in Communicating Process Architectures 2003, J.F. Broenink and G.H. Hilderink, Eds. Enschede, Netherlands: IOS Press, 2003, pp. 97-109, ISBN: 1 58603 381 6.
[5] V. Raju, L. Rong, and G.S. Stiles, Automatic Conversion of CSP to CTJ, JCSP, and CCSP, in Communicating Process Architectures 2003, J.F. Broenink and G.H. Hilderink, Eds. Enschede, Netherlands: IOS Press, 2003, pp. 63-81, ISBN: 1 58603 381 6.
[6] M. Schweigler, F.R.M. Barnes, and P.H. Welch, Flexible, Transparent and Dynamic occam Networking With KRoC.net, in Communicating Process Architectures 2003, J.F. Broenink and G.H. Hilderink, Eds. Enschede, Netherlands: IOS Press, 2003, pp. 199-224, ISBN: 1 58603 381 6.
[7] D.S. Jovanovic, B. Orlic, G.K. Liet, and J.F. Broenink, gCSP: A Graphical Tool for Designing CSP systems, in Communicating Process Architectures 2004, I. East, J. Martin, P.H. Welch, D. Duce, and M. Green, Eds. Oxford, UK: IOS Press, 2004, pp. 233-251, ISBN: 1586034588.
[8] G.H. Hilderink, Graphical modelling language for specifying concurrency based on CSP, IEE Proceedings Software, vol. 150, pp. 108-120, 2003.
[9] G.H. Hilderink, Managing Complexity of Control Software through Concurrency, PhD thesis, University of Twente, Netherlands, 2005, ISBN: 90-365-2204-8.
[10] G.H. Hilderink and J.F. Broenink, Sampling and timing a task for the environmental process, in Communicating Process Architectures 2003, J.F. Broenink and G.H. Hilderink, Eds. Enschede, Netherlands: IOS Press, 2003, pp. 111-124, ISBN: 1 58603 381 6.
Assessing Application Performance in Degraded Network Environments: An FPGA-Based Approach Mihai IVANOVICI a,1 , Razvan BEURAN a and Neil DAVIES b a CERN, Geneva, Switzerland b Predictable Network Solutions, Bristol, UK Abstract. Network emulation is a technique that allows real-application performance assessment under controllable and reproducible conditions. We designed and implemented a hardware network emulator on an FPGA-based custom-design PCI platform. Implementation was facilitated by the use of the Handel-C programming language, which allows rapid development and fast translation into hardware and has specific constructs for developing systems with concurrent processes. We report on tests performed with web-browsing applications. Keywords. Application performance assessment, network emulation, FPGA, hardware implementation, Handel-C
1 Corresponding Author: Mihai Ivanovici, CERN, 1211 Geneva 23, Switzerland. Tel.: +41 22 767 39 08; Fax: +41 22 767 39 00; E-mail: [email protected].

Introduction
Network emulation is a technique that makes it possible to assess real-application performance under controllable and reproducible conditions. This hybrid technique combines the advantages of network simulation with those of tests in real networks [1]. Most of the existing network emulators are implemented in software; therefore the quality degradation they introduce is imprecise and unreproducible. Current hardware [4], [5], [6], [7] and software [2], [3] approaches exhibit an additional important drawback: they all introduce unrealistic degradation. The reason is twofold: packets in a flow are treated independently, and quality degradation effects are not correlated (e.g., packet loss and delay are independent). We designed and implemented a hardware network emulator on an FPGA-based custom-design PCI platform [8], [12]. This ensures high-accuracy emulation, as well as high performance: the system supports packet rates up to 1.5 million packets per second in each direction. In addition, our implementation is a new approach to the emulation of quality degradation in networks, permitting reproducible experiments in realistic network scenarios. In this approach, network conditions are described in terms of network-element behavior models, which are aggregated into a single representation that is used effectively for emulation. Implementation was facilitated by the use of the Handel-C programming language [9], which allows rapid development and fast translation into hardware. This article focuses on the basic principles of our methodology and the architectural choices in the emulator design, including an assessment of the costs and benefits of the choices we made. CERN [10] collaborates with Predictable Network Solutions [11] to develop the emulator as a tool permitting the quantitative evaluation of the influence of the experienced network quality degradation on distributed application performance. We studied several network
applications used in domestic and specialized environments that would benefit from assurance of bounds on quality degradation. We report on the behavior of short-lived TCP transfers as occurring in the case of web-browsing applications. We conclude that our approach is suitable for the assessment of applications and approaches that help deliver network-based services that are predictable, a prerequisite for the support of safety-critical services.

1. Application Performance Assessment
A key issue in application performance assessment is understanding that network environments perturb application behavior by delaying and dropping the application traffic. Networks are therefore degraded environments, and quality degradation in the network is reflected in performance degradation at the application level.

1.1. Principles
There are three steps to take in order to assess application performance: (i) observe the application behavior at the end-node level, (ii) accurately measure the quality degradation experienced by the application traffic and (iii) correlate the above. A general setup is depicted in Figure 1.
Figure 1. Observing end-to-end application performance and measuring the quality degradation in the network.
Scientific method requires the use of objective metrics to perform both the network-level and application-level performance assessments. In the case of network quality degradation there is already a series of widely used metrics [15], [16]: one-way delay [17], one-way packet loss [18] and throughput. However, when application performance must be determined, each application class requires the definition of specific metrics that take the application's nature into account. For example, for Voice over IP (VoIP) one can use the Perceptual Evaluation of Speech Quality score [19]. In the case of file transfer, useful metrics are transfer time performance and goodput [20].

1.2. Our Approach
There is an issue with the setup in Figure 1: one has no control over the degradation introduced by the network. Other people's traffic, and the loss and delay it induces, are what they are at that moment. Consequently, any measurement only reflects the conditions at that particular time of day. A much more practical approach is to use a network emulator instead of the real network. This allows for varying network conditions in a controllable and reproducible manner, hence effectively exploring application performance in a much wider range of conditions.
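Either way, the assessment rests on the metrics listed in Section 1.1. Purely as an illustration (the per-packet record below is an assumption, not a format defined by the paper), they might be computed as:

    #include <cstdint>
    #include <vector>

    // Assumed record of one packet observed at both measurement taps.
    struct PacketRecord {
        double   txTime;    // send timestamp (s)
        double   rxTime;    // receive timestamp (s); negative if lost
        uint32_t payload;   // application payload bytes
    };

    // One-way delay, averaged over delivered packets only.
    double meanOneWayDelay(const std::vector<PacketRecord>& pkts) {
        double sum = 0.0; std::size_t n = 0;
        for (const PacketRecord& p : pkts)
            if (p.rxTime >= 0) { sum += p.rxTime - p.txTime; ++n; }
        return n ? sum / n : 0.0;
    }

    // One-way packet loss, as a fraction of the packets sent.
    double lossRatio(const std::vector<PacketRecord>& pkts) {
        std::size_t lost = 0;
        for (const PacketRecord& p : pkts) if (p.rxTime < 0) ++lost;
        return pkts.empty() ? 0.0 : double(lost) / double(pkts.size());
    }

    // Goodput: application payload delivered per unit transfer time (bit/s).
    double goodput(const std::vector<PacketRecord>& pkts, double transferTime) {
        double bytes = 0.0;
        for (const PacketRecord& p : pkts) if (p.rxTime >= 0) bytes += p.payload;
        return transferTime > 0 ? (bytes * 8.0) / transferTime : 0.0;
    }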
Figure 2 shows the experimental setup we used. Quality degradation, denoted by ΔQ, is correlated with the User-Perceived Quality (UPQ) for the application under test.
Figure 2. Experimental setup.
There are several emulators available at the moment, and we have had hands-on experience with one of them; the results obtained with VoIP and file transfer applications were already reported in [1]. One of the problems of many network emulators, especially the software versions, is the lack of accuracy of the degradation they induce. Although it could seem that this is not essential from the point of view of delay (as long as the error is within reasonable bounds), it becomes important where packet loss is concerned. This is because packet loss is the result of a critical race for resources. Whether or not packet loss occurs at a certain moment depends on the relative timing of the packets. Accurate emulation of these effects is needed in order to get correct loss rates and distributions, a mandatory feature of an emulator since these are critical for the performance of most applications. A hardware implementation, such as ours, ensures correct behavior regarding timing. But current hardware emulators have another drawback: they all introduce unrealistic degradation. This comes from the approach to emulation that is generally taken. Firstly, packets in a flow are treated independently, which may lead to packet reordering within the same stream. This has a disastrous effect on TCP application performance, which is optimized for the normal case when packets arrive in order; we already encountered such problems in real networks during the performance measurements that were part of feasibility studies related to the ATLAS Event Filter at CERN (ATLAS is one of the four experiments being built at CERN on the Large Hadron Collider accelerator). Secondly, in current network emulators, quality degradation effects are not correlated (e.g., packet loss and delay are independent). By allowing the delay and loss distributions to be configured independently, the natural dependence that exists between these two parameters is destroyed. Moreover, most of the existing network emulators are network-topology oriented. They use a node-by-node representation of the emulated network. However, this approach becomes unfeasible for large-scale networks. We believe that network quality degradation should be emulated by using compact models of the network. These models are obtained by aggregating simple network elements into an object with equivalent observable properties. In this simple way we address the design shortcomings of the existing approaches and also achieve high accuracy through a hardware implementation. We identified two basic elements that are the building blocks of any network system: the wire and the queue. The wire represents the transmission medium, which can be considered, in a first approximation, error free. Therefore its main characteristic is the constant propagation delay.
The queue is characterized by its length and service rate. It introduces variable delay and loss. This degradation is introduced by the intra-stream and inter-stream competition for resources: a packet competes both with other packets from the same traffic flow and with packets from other streams. Hence the need arises to emulate the inter-stream competition, for which task we found two techniques. The first one is to use background traffic generation so as to artificially consume resources (queue space and service time). The second technique is to use the "server with vacations" paradigm, in which the queue server takes vacations that correspond to servicing the other streams. Since any network emulator is in fact a system that introduces quality degradation in the network, from now on we will use the term "quality degrader" to refer to it.
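These two building blocks can be sketched in software as follows; this is an illustrative model only (the real emulator realises them in FPGA firmware), with times in seconds and queue length in packets:

    #include <algorithm>
    #include <cstddef>
    #include <deque>

    struct Packet { double arrival; };   // arrival time at the element (s)

    // The wire: an error-free medium whose only effect is a constant delay.
    struct Wire {
        double propagationDelay;
        double deliver(const Packet& p) const { return p.arrival + propagationDelay; }
    };

    // The queue: finite length plus a service rate, hence variable delay and,
    // when full, loss.
    class Queue {
    public:
        Queue(std::size_t length, double serviceRate)
            : maxLen(length), serviceTime(1.0 / serviceRate) {}

        // Returns the departure time, or a negative value if the packet is lost.
        double enqueue(const Packet& p) {
            while (!backlog.empty() && backlog.front() <= p.arrival)
                backlog.pop_front();                    // drop served entries
            if (backlog.size() >= maxLen) return -1.0;  // queue full: loss
            double start = backlog.empty() ? p.arrival
                                           : std::max(p.arrival, backlog.back());
            double depart = start + serviceTime;        // waiting => variable delay
            backlog.push_back(depart);
            return depart;
        }
    private:
        std::size_t maxLen;
        double serviceTime;
        std::deque<double> backlog;   // departure times of queued packets
    };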
2. Quality Degrader Architecture
Our quality degrader is implemented in hardware, in order to achieve high accuracy and packet rates up to 1.5 million packets per second. This section presents the hardware platform and provides details about the core of the emulator.

2.1. Hardware Platform
Our implementation is based on a custom-design PCI platform [8], [12]. This hardware platform uses one Altera Stratix FPGA (EP1S25F780C7) with 25 k logic elements, two Gigabit Ethernet RJ45 ports and memory (128 MB SDRAM and 1 MB SSRAM). The schematic is presented in Figure 3:
Figure 3. Schematic of the hardware platform.
The FPGA manages all the resources and contains the IP (Intellectual Property) cores for the two Ethernet MACs (Medium Access Control) and the SDRAM and PCI controllers. The user-defined higher-level functionality is implemented on the same FPGA, using the Handel-C programming language. A low-level library that provides primitives for memory and PCI access was created. The dual-port on-chip RAM blocks provided by the Stratix FPGA are used extensively to implement queues between concurrent processes. The FIFO paradigm ensures that messages and data are processed in order and, at the same time, decouples through buffering the two processes that access the two ends of the queue. Data is of course lost if a queue fills, but this should not happen during normal operation and is an indication of a malfunction.

The SDRAM is used to store packet data temporarily; the SSRAM is used to store the configuration of the emulator. Packet data flows between the two Gigabit Ethernet ports, allowing the seamless integration of the platform into a network—a prerequisite for a network emulator. To facilitate deployment, the board can be hosted by ordinary PCs owing to its standard 3V3 PCI connector. The PCI bus is used to configure the application firmware and to collect statistics. The whole system is driven at PC level by a Python-based [13] control system, which allows the creation of automated procedures for performing experiments in a simple and flexible manner. The low-level communication with the hardware platform via PCI is ensured by a custom Linux kernel module [14].

2.2. Internal Architecture

The architecture of the network emulator is depicted in Figure 4; thick arrows represent the data paths and thin arrows the control paths. The architecture is based on modules—blocks of code with a specific functionality—which run concurrently. Each module consists of several parallel processes, and communicates with other modules and internally by means of channels.

Briefly, the emulator behaves as follows. The traffic is first classified, to enable different degradation to be applied to it. Quality degradation is then introduced by means of a system of queues; the length of each queue can be specified. We chose the approach of background traffic generation, implemented as part of the Degradation Emulation Engine module described below. Packets are then serviced according to one of two scheduling algorithms, Strict Priority (SP) or Weighted Round Robin (WRR); the weights for WRR are user configurable. We now describe each of the modules in more detail.
Figure 4. Quality degrader architecture. [The figure shows the module blocks MAC, Packet Data Receiver, Classifier, Degradation Emulation Engine, Microflow Sequence Preservation, Microflow Service, Fixed Delay, Packet Data Forwarder and MAC, with Packet Data Storage and the SDRAM Server serving the data path.]
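The module structure of Figure 4 maps naturally onto a top-level par of channel-connected processes, as discussed in Section 2.3. The following Handel-C fragment is our own reconstruction of that structure; the process bodies, channel names, and the 32-bit payload are assumptions, not the authors' code.

    // Sketch of the top-level process network of Figure 4 (our assumption).
    set clock = external "clk";

    chan unsigned 32 rx_to_cls;     // Packet Data Receiver -> Classifier
    chan unsigned 32 cls_to_dee;    // Classifier -> Degradation Emulation Engine
    // ... further channels (to Microflow Sequence Preservation, Microflow
    // Service, Fixed Delay, Packet Data Forwarder) elided

    void PacketDataReceiver(void)
    {
        unsigned 32 ref;
        while (1)
        {
            ref = 0;                // reception and storage elided
            rx_to_cls ! ref;        // hand the packet reference on
        }
    }

    void Classifier(void)
    {
        unsigned 32 ref;
        while (1)
        {
            rx_to_cls ? ref;        // block until a reference arrives
            // classification rules run sequentially here (Section 2.3)
            cls_to_dee ! ref;
        }
    }

    void DegradationEmulationEngine(void)
    {
        unsigned 32 ref;
        while (1)
            cls_to_dee ? ref;       // degradation logic elided
    }

    void main(void)
    {
        par                         // every module is a concurrent process
        {
            PacketDataReceiver();
            Classifier();
            DegradationEmulationEngine();
            // ... the remaining modules of Figure 4 run here as well
        }
    }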
The MAC Receiver module provides packet data to the Packet Data Receiver, which configures and controls it. The Packet Data Receiver checks with the MAC Receiver whether there was any error on reception, and discards the packet data if this is the case. Note that the MAC Receiver is an IP macro.

The Packet Data Receiver module manages the reception of complete packets from the MAC interface. It maintains two queues: one for full packet data and one for packet descriptors. If a packet is correctly received, a descriptor is placed in the descriptor queue. The packets are then sent to the Packet Data Storage module, which returns the memory address where the packet data is stored. The address of the next available memory slot is determined using a bitmap representation of the memory occupation. Memory is divided into slots of 2048 bytes; a slot is seized when a packet is received and freed when the packet is either discarded internally or transmitted. The memory address of the current packet and its descriptor form a packet reference. The packet reference, along with information extracted from the IP packet header, is sent to the Classifier module. The following fields are retained: protocol number, Type Of Service (TOS), and source and destination IP addresses.

The Classifier module classifies packets based on the retained fields forwarded by the Packet Data Receiver. Once a packet has been classified, the corresponding packet reference and the identifier specifying the degradation queue are sent to the Degradation Emulation Engine module.

The Degradation Emulation Engine module induces quality degradation through the management of a system of queues. Upon receiving a packet reference and a degradation queue identifier, the process sends them to Microflow Sequence Preservation for enqueueing. Under certain conditions the decision can be taken to discard the packet immediately; in this case no enqueueing takes place. The next queue to be serviced is determined by a scheduling algorithm (SP or WRR) and is indicated to Microflow Service. An essential feature of this module is the background traffic generator associated with each queue. These generators can be independently started, stopped, and configured to transmit user-defined artificial traffic patterns; the patterns are uploaded onto the board into the on-chip RAM. So far we have used CBR (Constant Bit Rate) and Poisson distributions for the inter-packet arrival time of the background traffic packets. The Degradation Emulation Engine module also implements the transmission-rate limiting mechanism.

The Microflow Sequence Preservation module stores and manages, in a FIFO manner, the packet references received from the Degradation Emulation Engine, thus preventing packet reordering. It uses eight queues, one of which is assigned to unclassified traffic. The size of each queue can be configured.

The Microflow Service module manages the retrieval of packet references from Microflow Sequence Preservation, based on queue identifiers received from the Degradation Emulation Engine. The packet reference is subsequently sent to the Fixed Delay module.

The Fixed Delay module introduces a constant delay by enqueueing the descriptors in a queue. The constant delay represents the propagation delay and is user configurable.

The Packet Data Forwarder module manages the transmission of the packets to the MAC interface. It maintains two queues: one for full packet data and one for packet descriptors. When a packet reference is received, the corresponding packet is retrieved from the Packet Data Storage. Once a packet has been transmitted, a "free reference" message is sent to Packet Data Storage.

The MAC Forwarder module receives packet data from the Packet Data Forwarder in chunks of 32-bit words. The MAC Forwarder is configured and controlled by the Packet Data Forwarder. Like the MAC Receiver, the MAC Forwarder is an IP macro.
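The bitmap-based slot allocation used by Packet Data Storage might be sketched as follows in Handel-C. This is our own illustration under stated assumptions: the slot count of 64, the identifiers, the linear scan, and the requirement that the caller never allocates from a full bitmap are not taken from the actual implementation.

    // Bitmap slot allocation sketch (assumptions as stated in the text).
    #define NSLOTS 64

    unsigned 64 bitmap;                 // bit i set <=> slot i in use

    unsigned 6 alloc_slot(void)         // index of a free 2048-byte slot;
    {                                   // caller ensures one exists
        unsigned 6  i;
        unsigned 64 mask;
        i = 0;
        mask = 1;
        while ((bitmap & mask) != 0)    // scan for the first clear bit
            par { mask = mask << 1; i++; }
        bitmap = bitmap | mask;         // seize the slot
        return i;
    }

    void free_slot(unsigned 6 i)        // on discard or after transmission
    {
        unsigned 64 mask;
        mask = 1;
        mask = mask << i;               // variable shift: builds a decoder
        bitmap = bitmap & ~mask;        // release the slot
    }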
2.3. Implementation Philosophy

For a design of this complexity the obvious description language of choice would be VHDL; however, the learning curve for non-hardware specialists was estimated to be too steep. We chose instead Handel-C, a language whose close affinity to C made it readily accessible to software engineers, while at the same time offering constructs that allowed us to exploit the natural parallelism of the hardware. Handel-C had been employed before in networking applications at CERN [21], but the magnitude and complexity of this design would present a real challenge.

The most obvious construct of use is the par, which allows the parallel execution of statements or complete processes. In practice every module shown in Figure 4 runs as a process under a top-level par. There are limits to par, however, as we found when trying to parallelize the packet classifier. If the different classification rules operated in parallel, each process would attempt to access the data at the same time, causing significant FPGA routing problems and increased delay. In addition, this solution does not scale to a large number of rules. Replicating the data to provide each process with its own copy was equally time consuming, so in the end it proved better to execute the classification rules sequentially.

Channel objects allow data to be communicated between processes; the transfer occurs only when both processes are ready, and so forces the synchronization of parallel processes. Channels proved especially useful for several reasons. During the development phase, we could independently debug and validate individual modules, comfortable in the knowledge that, whatever logical or timing changes happened as a result, the module would still fit back into the full design and communicate with its adjacent modules as before. If, in the worst case of a design error, there is some channel mismatch, then the whole system freezes, allowing the debugger to retrieve the state and correct the error. This independent development is especially powerful when one considers that, for the full design, a compile, place and route cycle takes at least half an hour.

Channels also solve the problem of passing data across clock domain boundaries. The design cannot avoid multiple domains, since the PCI interface requires 33 MHz and the MAC interface 25 MHz. Having the facility to cross clock domains easily meant that we could choose something close to the optimum frequency for several different tasks: for example, the SDRAM server runs at 83.3 MHz and the main core at 62.5 MHz. Without this option we would not have been able to meet the design requirements of the project. Again, however, there are limits. The channel is a complex structure with up to four cycles of overhead, which becomes the limiting factor for very high-speed transfers. For the fastest transfer logic we needed to use an internal hardware feature, the dual-port RAM, as shared memory between two clock domains: the shared memory acts as a mailbox, while the requests for transfer are still sent over channels. This came at the cost of having to ensure the synchronization with our own logic, an error-prone process that cost considerable debugging time.

Although ordered and synchronous operations have clear advantages, there are cases where data has to be retrieved as fast as possible—such as the ingress from the MAC interfaces, which must be cleared irrespective of the state of the modules that will consume the data. We used queues in this case to accept the incoming data asynchronously.

The use of multiple channels in each module led to yet another problem: arbitrating the access to all these channels. Handel-C provides a solution by means of the prialt construct, which allows the channel that is first ready to perform a transfer—and it even works between different clock domains. We used it, for example, in the Packet Data Storage module, to which "free reference" messages are addressed from multiple sources (from the Degradation Emulation Engine on packet discard, and from the Packet Data Forwarder module on packet transmission).
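As an illustration, that arbitration might look as follows in Handel-C. Again this is a sketch of ours; the channel names, the payload width, and the free_slot call (from the earlier hypothetical sketch) are assumptions.

    // prialt-based arbitration of "free reference" messages (a sketch).
    chan unsigned 32 free_from_engine;      // discard path
    chan unsigned 32 free_from_forwarder;   // transmit path

    void PacketDataStorageArbiter(void)
    {
        unsigned 32 ref;
        while (1)
        {
            prialt
            {
                case free_from_engine ? ref:
                    // slot of an internally discarded packet becomes free
                    break;
                case free_from_forwarder ? ref:
                    // slot freed after successful transmission
                    break;
            }
            // free_slot(...) would clear the bitmap entry here
        }
    }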
In retrospect the choice of language for this project was fully justified. It allowed a formal approach to the design process, and we found that with each iteration we moved further from a shared-memory architecture and closer to channel-based ones, which facilitated both debugging and execution. As we became more competent with channels and CSP, the code we wrote became smaller and simpler; as a result, maintenance and modification are also easier.

3. Experimental Results

Using the network emulator we assessed the performance of several applications, such as web browsing, file transfer, VoIP, and video streaming. Web browsing is an HTTP-based application characterized by short-lived TCP transfers. The performance of such an application depends strongly on packet loss, hence we chose to present the results obtained with it. The traffic of interest (HTTP) competes with the background traffic to occupy queue space—which induces loss—and to be serviced—which induces delay. We compare two cases, in which the background traffic source has a CBR or a Poisson pattern.

For all the tests the emulator was configured to introduce a fixed delay of 12.5 ms, equivalent to a 25 ms RTT (Round Trip Time: the time needed for a packet to travel back and forth between two end nodes on a particular network connection). The available bandwidth was limited to 10 Mb/s and the size of the queue was 128 packets. We used the setup presented in Figure 2. The end PCs ran Linux with kernel 2.4.21; the HTTP server was Apache 2.0 (httpd-2.0.46) and the client was wget (wget-1.8.2), a non-interactive network retriever that allows the automation of tests. The interconnect employed was Fast Ethernet, because the taps we currently use work only at 100 Mb/s; the emulator itself is, however, able to run at 1 Gb/s as well.

For the Apache server all parameters had default values, including the Timeout of 300 s (the number of seconds before receives and sends time out). KeepAlive, which indicates whether or not to allow persistent connections (i.e., with more than one request per connection), was "on" and "off" in turn. When "off", a new TCP connection is opened and closed for each file transfer; this represents the most inefficient case. When "on", the same TCP connection is used for up to MaxKeepAliveRequests = 100 transfers separated by no more than KeepAliveTimeout = 15 s.

We chose a representative web-page structure for our tests, containing both images and text. The site consists of 499 files, with a total size of 1.6 MB; the average file size is approximately 3 kB, close to the average file size on the Web [22]. The results in Figures 5 and 6 show the dependence of site download duration on the offered background-traffic load, for KeepAlive "off" and "on", respectively. The site download duration is a measure of the user-perceived quality of web-browsing applications. The reference value is that obtained when the application has exclusive use of the network (i.e., when the background traffic generator is disabled). The offered background-traffic load varies from 0 to 100%, and is a measure of the congestion induced by the emulator.

Note in Figure 5 that CBR background traffic has almost no influence on the performance, since this case is equivalent to a constant diminution of the bandwidth available to the application. A constant amount of available bandwidth leads to a steady performance of TCP. Since web browsing implies only transfers of relatively small amounts of data, the available bandwidth can be low without a significant impact on performance. When the background traffic load approaches 100%, however, the available bandwidth becomes insufficient. There is then a rapid increase in the download duration, followed by denial of service and, ultimately, complete application failure.
When the background traffic is Poisson (and therefore more realistic), noticeable performance degradation starts occurring at loads of 60% and above. At loads larger than 80% the degradation becomes significant, reaching values more than an order of magnitude higher than in the CBR case. The intrinsic burstiness of the Poisson traffic explains the larger deviations of the results.

One can observe in Figure 6 an improvement of the worst-case behavior by an order of magnitude when KeepAlive is "on", due to the reuse of the same TCP connection for multiple transfers. This reduces the probability of losing connection establishment and termination packets; such loss is the main culprit for the performance drop of TCP-based applications.
Figure 5. Site download duration versus offered background-traffic load (KeepAlive "off"). [Plot: site download duration [s], 0-2000, against offered background-traffic load [%], 0-100, with curves for CBR and Poisson background traffic.]
Figure 6. Site download duration versus offered background-traffic load (KeepAlive "on"). [Plot: site download duration [s], 0-140, against offered background-traffic load [%], 0-100, with curves for CBR and Poisson background traffic.]
4. Conclusions

In this paper we present our approach to the emulation of quality degradation in networks. We implemented our concepts on an FPGA-based hardware platform, which proved to be the most appropriate solution for network emulation given the requirements for accuracy, reproducibility, and high-speed operation. The implementation was made easier by the use of Handel-C, a programming language rich in constructs for developing systems of concurrent processes; in particular, we made intensive use of channels to synchronize the parallel processes in our implementation.

The core of the emulator is a system of queues, which guarantees a realistic dependency between packet loss and delay. We identified two strategies for emulating the effects of other traffic flows on the traffic of interest: background traffic generation and the "server with vacations" paradigm. We have already implemented the first approach and are currently comparing the two strategies. This will be useful for the next generation of emulators, able to emulate large-scale networks through the aggregation of network element models.

We used this system for application performance assessment, and report here on web browsing, characterized by many short-lived HTTP transfers. We determined experimentally the dependence between site download duration and the offered load of the background traffic, i.e., between user-perceived quality and the congestion level in the network. As mentioned before, we have already run tests with various other applications, such as VoIP (Voice over IP) and video streaming, using a software network emulator. We plan to perform similar tests using the hardware network emulator we developed, in order to obtain more accurate results and emphasize the differences between the two emulation approaches. In addition, the emulator will be used to perform tests in connection with the design of the ATLAS Data Collection system at CERN; preliminary tests have already taken place and we will report on them in a future paper.

Acknowledgments

The work presented here was supported by PPARC PIPSS Project No. PPA/I/S/2002/00653. We would like to acknowledge Brian Martin and Jaroslav Pech for the design of the hardware platform. We are grateful to Matei Ciobotaru for the development of the low-level access libraries. We would also like to thank Brian Martin and the anonymous CPA-2005 referees for their pertinent comments and suggestions.

References
[1] R. Beuran, M. Ivanovici, B. Dobinson, N. Davies, P. Thompson, "Network Quality of Service Measurement System for Application Requirements Evaluation", International Symposium on Performance Evaluation of Computer and Telecommunication Systems, July 20-24, 2003, Montreal, Canada, pp. 380-387.
[2] L. Rizzo, "Dummynet: a simple approach to the evaluation of network protocols", ACM SIGCOMM Computer Communication Review, vol. 27, no. 1, January 1997.
[3] NISTNet Network Emulator, http://www-x.antd.nist.gov/nistnet
[4] Simena, http://www.simena.net
[5] Anue Systems, http://www.anuesystems.com
[6] Empirix, http://www.empirix.com
[7] Shunra, http://www.shunra.com
[8] M. Ciobotaru, M. Ivanovici, R. Beuran, S. Stancu, "Versatile FPGA-based Hardware Platform for Gigabit Ethernet Applications", 6th Annual Postgraduate Symposium, Liverpool, UK, June 27-28, 2005.
[9] Celoxica, http://www.celoxica.com
[10] CERN, The European Organization for Particle Physics, http://www.cern.ch
[11] Predictable Network Solutions, http://www.pnsol.com
[12] M. Ciobotaru, S. Stancu, M. LeVine, B. Martin, "GETB—A Gigabit Ethernet Application Platform: its Use in the ATLAS TDAQ Network", Real Time 2005, Stockholm, Sweden, June 10, 2005.
[13] The Python Programming Language, http://www.python.org
[14] M. Joss, "IO RCC—A package for user level access to I/O resources on PCs and compatible computers", CERN, Technical report ATL-D-ES-0008, October 2003.
[15] ITU-T Recommendation I.380, "Internet Protocol (IP) Data Communication Service—IP Packet Transfer and Availability Performance Parameters", ITU-T, February 1999.
[16] V. Paxson, G. Almes, J. Mahdavi, M. Mathis, "Framework for IP Performance Metrics", IETF RFC 2330, May 1998.
[17] G. Almes, S. Kalidindi, M. Zekauskas, "A One-way Delay Metric for IPPM", IETF RFC 2679, September 1999.
[18] G. Almes, S. Kalidindi, M. Zekauskas, "A One-way Packet Loss Metric for IPPM", IETF RFC 2680, September 1999.
[19] ITU-T Recommendation P.862, "Perceptual Evaluation of Speech Quality (PESQ), An Objective Method for End-to-end Speech Quality Assessment of Narrow-band Telephone Networks and Codecs", ITU-T, February 2001.
[20] R. Beuran, M. Ivanovici, V. Buzuloiu, "File Transfer Performance Evaluation", Scientific Bulletin of University "POLITEHNICA" Bucharest, C Series (Electrical Engineering), vol. 66, no. 2-4, 2004, pp. 3-14.
[21] F.R.M. Barnes, R. Beuran, R.W. Dobinson, M.J. LeVine, B. Martin, J. Lokier, C. Meirosu, "Testing Ethernet Networks for the ATLAS Data Collection System", IEEE Trans. Nucl. Sci., vol. 49, no. 2, April 2002, pp. 516-520.
[22] M.F. Arlitt, C.L. Williamson, "Web Server Workload Characterization: The Search for Invariants", Proc. SIGMETRICS, Philadelphia, PA, USA, April 1996.
Communication and Synchronization in the Cell Processor (Invited Talk) H. Peter HOFSTEE IBM Systems and Technology Group 11500 Burnet Rd, Austin, TX 78758, USA [email protected] Abstract. This talk will first present an overview of the Cell Broadband Processor Architecture and then provide a more in-depth look at the various communication and synchronization mechanisms the architecture supports. We will look at the various types of coherent DMA commands supported in this architecture, at the asynchronous “channel” command interface between the core of the synergistic processor and its DMA processor, and at a number of other mechanisms that enable fast processor-to-processor communication for short messages. Keywords. Cell processor, architecture, communication, synchronization, channel, DMA processor, command interface
Homogeneous Multiprocessing for Consumer Electronics (Invited Talk) Paul STRAVERS Building WDC P048 (M/S WDC 01), High Tech Campus, Prof Holstlaan 4, 5656 AA, Eindhoven, The Netherlands [email protected] Abstract. Processor architectures have reached a point where it is getting increasingly hard to improve their performance without resorting to complex and exotic measures. Pollack observed in 2000 that Intel processors had been "on the wrong side of a square law" for almost a decade. Embedded processors for consumer and telecommunication chips are now confronted with the same rule of diminishing returns: to further improve their performance, the processors are getting disproportionately bigger and consume much more energy per operation than previous generations. Traditionally, embedded systems-on-chip (SoC) have been designed as heterogeneous multiprocessors, where most processors are not programmable and a single control processor synchronizes all communication. Obvious advantages of such systems include low cost and low power consumption; in high-volume products, this outweighs disadvantages like a low degree of design reuse, little software reuse, and long product lead times. Despite all the hard work and good intentions, it has proved difficult to establish a platform around heterogeneous SoC architectures. With the rise of non-recurrent engineering costs and an increasingly global and competitive semiconductor market, the need for a successful SoC platform is felt more strongly than ever in the industry. Next to cost, the availability of qualified engineers is often an even bigger problem: given that it is not unusual to spend several hundreds of person-years on software development for a single product, it is easy to see that even a multinational company can only have a very limited number of products in development at any point in time. The solution we propose is to move away from heterogeneous SoC and instead embrace homogeneous embedded multiprocessors. In this talk we discuss embedded multiprocessor architectures and how they relate to programming models. We contrast heterogeneous with homogeneous architectures, and we show how the traditional efficiency gap between the two is narrowing. We also discuss issues related to hardware and software reuse, and the quest for composable systems to speed up the often lengthy process of embedded system integration. Finally, we introduce Wasabi, the Philips high-performance homogeneous multiprocessor in 65 nm technology, which will be available to the research community in 2006. Keywords. Wasabi, SoC, homogeneous embedded multiprocessor, embedded multiprocessor architectures.
Handshake Technology: High Way to Low Power (Invited Talk) Ad PEETERS Handshake Technology, High Tech Campus 12, 5656 AE Eindhoven [email protected] Abstract. Handshake Solutions [1] is a business line of the Philips Technology Incubator that has developed a design flow that enables a fast path to exploit the benefits of self-timed circuits, such as ultra-low power consumption and reduced electromagnetic emission. The quick route is enabled by the combination of two elements: a high-level "timeless" design language and an intermediate architecture based on handshake circuits. The design language, Haste, offers a syntax similar to behavioral Verilog, and in addition has explicit constructs for parallelism, communication, and hardware sharing. Haste is a behavioral language in the sense that it supports sequential composition of actions (such as assignments and communications) without reference to their timing, and also allows for data-dependent while-loops and other data-dependent execution traces. As a parallel programming language, Haste supports CSP concepts such as synchronized channel communication, both via point-to-point channels and through broadcast and narrowcast channels. In addition, Haste also allows for the design of interfaces and protocols through its support of synthesizable edge/posedge/negedge (wait for event) and wait (for state) constructs. Designers have experienced, and reported, high productivity in Haste. It turns out that, compared to synthesizable VHDL, the number of code lines is more than halved, thus facilitating design-space exploration and re-use, and improving design productivity. Starting from Haste, one can compile to behavioral Verilog for functional verification, to clock-gated circuits for mapping onto FPGAs, and to clockless circuits for ultra-low-power and low-EME VLSI implementations. This compilation exploits an intermediate architecture based on handshake components, which implement language constructs of Haste using handshake protocols. Handshake components and circuits support a modular design style, and can easily be implemented both as a clocked and as a self-timed circuit. The Handshake Solutions design flow is complementary to and compatible with third party EDA tools, e.g. for logic synthesis, test-pattern generation, and placement and routing. The clockless design flow works with standard-cell libraries, does not need any special cells and has a scan-test solution implemented. The talk will highlight the expressive power of Haste, how it is implemented in handshake circuits, and how this has led to concrete IC designs on the market. Keywords. Handshake, Haste, FPGA, communication, parallel programming language, parallelism, design language, clockless, interfaces, protocols.
Reference [1] Handshake Solutions home page, http://www.handshakesolutions.com
If Concurrency in Software is So Simple, Why is it So Hard? (Invited Talk) Guy BROADFOOT Verum, Paradijslaan 28, 5611 KN Eindhoven [email protected] Abstract. Not available at time of press. Keywords. ASD, Analytical Software Design, model checking.
Author Index

Allen, A.R. 71
Anshus, O.J. 261
Barclay, K. 13
Barnes, F. 165, 249, 289
Beerel, P.A. 275
Beuran, R. 385
Bjørndalen, J.M. 261
Broadfoot, G. 403
Broenink, J.F. v, 29, 375
Chalmers, K. 109
Davies, N. 385
Dierssen, W. 147
Dimmich, D.J. 235
East, I. 1
Ferguson, I. 363
Gardner, W.B. 129
Geelen, T.J.H. 43
Gopalakrishnan, A. 43
Groothuis, M.A. 375
Happe, H.H. 155
Hilderink, G.H. 317
Hofstee, H.P. 397
Ivanovici, M. 385
Jacobsen, C.L. 235
Jakson, J. 335
Jansen, P.G. 219
Jovanovic, D.S. 29
Kavaldjiev, N. 219
Kerridge, J. 13, 109
Klebanov, V. 203
Kooij, N. 147
Liet, G.K. 375
Nixon, P. 363
Orlic, B.E. 29
Peeters, A. 401
Rem, B. 43
Roebbers, H.W. v, 43
Rümmer, P. 203
Saifhashemi, A. 275
Sampson, A. 165
Savage, J. 13
Schlager, S. 203
Schmitt, P.H. 203
Schoute, A. 147
Seesink, R. 147
Smit, G.J.M. 219
Smith, M.L. 177
Sputh, B.H.C. 71
Stewart, J. 363
Stravers, P. 399
Sunter, J.P.E. v
Vinter, B. 155, 189, 261
Walsh, T. 363
Welch, P.H. v, 165, 289
Wiggers, M.H. 219
Wood, D.C. v