COMMUNICATING PROCESS ARCHITECTURES 2009
Concurrent Systems Engineering Series
Series Editors: M.R. Jane, J. Hulskamp, P.H. Welch, D. Stiles and T.L. Kunii

Volume 67

Previously published in this series:
Volume 66, Communicating Process Architectures 2008 (WoTUG-31), P.H. Welch, S. Stepney, F.A.C. Polack, F.R.M. Barnes, A.A. McEwan, G.S. Stiles, J.F. Broenink and A.T. Sampson
Volume 65, Communicating Process Architectures 2007 (WoTUG-30), A.A. McEwan, S. Schneider, W. Ifill and P.H. Welch
Volume 64, Communicating Process Architectures 2006 (WoTUG-29), P.H. Welch, J. Kerridge and F.R.M. Barnes
Volume 63, Communicating Process Architectures 2005 (WoTUG-28), J.F. Broenink, H.W. Roebbers, J.P.E. Sunter, P.H. Welch and D.C. Wood
Volume 62, Communicating Process Architectures 2004 (WoTUG-27), I.R. East, J. Martin, P.H. Welch, D. Duce and M. Green
Volume 61, Communicating Process Architectures 2003 (WoTUG-26), J.F. Broenink and G.H. Hilderink
Volume 60, Communicating Process Architectures 2002 (WoTUG-25), J.S. Pascoe, P.H. Welch, R.J. Loader and V.S. Sunderam
Volume 59, Communicating Process Architectures 2001 (WoTUG-24), A. Chalmers, M. Mirmehdi and H. Muller
Volume 58, Communicating Process Architectures 2000 (WoTUG-23), P.H. Welch and A.W.P. Bakkers
Volume 57, Architectures, Languages and Techniques for Concurrent Systems (WoTUG-22), B.M. Cook
Volumes 54–56, Computational Intelligence for Modelling, Control & Automation, M. Mohammadian
Volume 53, Advances in Computer and Information Sciences ’98, U. Güdükbay, T. Dayar, A. Gürsoy and E. Gelenbe
Volume 52, Architectures, Languages and Patterns for Parallel and Distributed Applications (WoTUG-21), P.H. Welch and A.W.P. Bakkers
Volume 51, The Network Designer’s Handbook, A.M. Jones, N.J. Davies, M.A. Firth and C.J. Wright
Volume 50, Parallel Programming and JAVA (WoTUG-20), A. Bakkers

Transputer and OCCAM Engineering Series
Volume 45, Parallel Programming and Applications, P. Fritzson and L. Finmo
Volume 44, Transputer and Occam Developments (WoTUG-18), P. Nixon
Volume 43, Parallel Computing: Technology and Practice (PCAT-94), J.P. Gray and F. Naghdy
Volume 42, Transputer Research and Applications 7 (NATUG-7), H. Arabnia

ISSN 1383-7575
Communicating Process Architectures 2009
WoTUG-32

Edited by
Peter H. Welch University of Kent, UK
Herman W. Roebbers TASS, Eindhoven, the Netherlands
Jan F. Broenink University of Twente, the Netherlands
Frederick R.M. Barnes University of Kent, UK
Carl G. Ritson University of Kent, UK
Adam T. Sampson University of Kent, UK
Gardiner S. Stiles Utah State University, USA
and
Brian Vinter University of Copenhagen, Denmark
Proceedings of the 32nd WoTUG Technical Meeting, 1–4 November 2009, TU Eindhoven, Eindhoven, the Netherlands
Amsterdam • Berlin • Tokyo • Washington, DC
© 2009 The authors and IOS Press. All rights reserved.
No part of this book may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, without prior written permission from the publisher.

ISBN 978-1-60750-065-0
Library of Congress Control Number: 2009937770

Publisher
IOS Press BV
Nieuwe Hemweg 6B
1013 BG Amsterdam
Netherlands
fax: +31 20 687 0019
e-mail: [email protected]

Distributor in the USA and Canada
IOS Press, Inc.
4502 Rachael Manor Drive
Fairfax, VA 22032
USA
fax: +1 703 323 3668
e-mail: [email protected]
LEGAL NOTICE
The publisher is not responsible for the use which might be made of the following information.

PRINTED IN THE NETHERLANDS
Communicating Process Architectures 2009 P.H. Welch et al. (Eds.) IOS Press, 2009 © 2009 The authors and IOS Press. All rights reserved.
Preface

This thirty-second Communicating Process Architectures conference, CPA 2009, takes place as part of Formal Methods Week, 1-6 November 2009. Under the auspices of WoTUG, CPA 2009 has been organised by TASS (formerly Philips TASS) in co-operation with the Technische Universiteit Eindhoven (TU/e). This is the second time: we had a very successful conference here in 2005 and are very pleased to have been invited back.

We see growing awareness, and growing adoption, of the ideas characterized by “Communicating Process Architectures”. The complexity of modern computing systems has become so great that no one person – maybe not even a small team – can understand all aspects and all interactions. The only hope of making such systems work is to ensure that all components are correct by design and that the components can be combined to achieve scalability and predictable function. A crucial property is that the cost of making a change to a system depends linearly on the size of that change – not on the size of the system being changed. This must be true whether that change is a matter of maintenance (e.g. to take advantage of increasing multicore capability) or the addition of new functionality. One key is that system composition (and disassembly) introduces no surprises. A component must behave consistently, no matter the context in which it is used – which means that component interfaces must be explicit, published and free from hidden side-effects. Our view is that concurrency, underpinned by the formal process algebras of Hoare’s Communicating Sequential Processes and Milner’s π-Calculus, provides the strongest basis for the development of technology that can make this happen. Many current systems cannot be maintained unless they have concurrency working for them, not against them – certainly not if multicores are involved!
We have again received an interesting set of papers covering a wide range of topics: system design and implementation (for both hardware and software), tools (concurrent programming languages, libraries and run-time kernels), formal methods and applications. They have all been strongly refereed and are of high quality. As these papers are presented in a single stream, you won’t have to miss out on anything. As always, we will have plenty of space for informal contact, and do not forget the evening Fringe Programme – where we will not have to worry about the bar closing at half past ten!

We are very pleased this year to have Professor Michael Goldsmith of the e-Security Group (WMG Digital Laboratory, University of Warwick) as our keynote speaker. He is one of the creators of the CSP model checker FDR, which is used by many in industry and academia to check concurrent systems for deadlock and livelock, and to verify correct patterns of behaviour through refinement checks – all before any code is written.

We thank the authors for their submissions and the Programme Committee for their hard work in reviewing the papers. We also thank Tijn Borghuis and Erik de Vink for inviting CPA 2009 to join FMweek 2009 and for making the arrangements with the TU/e.

Peter Welch (University of Kent), Herman Roebbers (TASS, Eindhoven), Jan Broenink (University of Twente), Frederick Barnes (University of Kent), Carl Ritson (University of Kent), Adam Sampson (University of Kent), Dyke Stiles (Utah State University), Brian Vinter (University of Copenhagen).
Editorial Board

Dr. Frederick R.M. Barnes, University of Kent, UK
Dr. Jan F. Broenink, University of Twente, The Netherlands
Mr. Carl G. Ritson, University of Kent, UK
Mr. Herman Roebbers, TASS, the Netherlands
Mr. Adam T. Sampson, University of Kent, UK
Prof. Gardiner (Dyke) Stiles, Utah State University, USA
Prof. Brian Vinter, University of Copenhagen, Denmark
Prof. Peter H. Welch, University of Kent, UK (Chair)
Reviewing Committee

Dr. Alastair R. Allen, Aberdeen University, UK
Dr. Paul S. Andrews, University of York, UK
Dr. Bahareh Badban, University of Konstanz, Germany
Dr. Iain Bate, University of York, UK
Dr. John Markus Bjørndalen, University of Tromsø, Norway
Dr. Jim Bown, University of Abertay Dundee, UK
Dr. Phil Brooke, University of Teesside, UK
Mr. Neil C.C. Brown, University of Kent, UK
Dr. Kevin Chalmers, Edinburgh Napier University, UK
Dr. Barry Cook, 4Links Ltd., UK
Dr. Ian East, Oxford Brookes University, UK
Dr. Oliver Faust, Altreonic, Belgium
Prof. Wan Fokkink, Vrije Universiteit Amsterdam, The Netherlands
Dr. Leo Freitas, University of York, UK
Dr. Bill Gardner, University of Guelph, Canada
Mr. Marcel Groothuis, University of Twente, The Netherlands
Dr. Kohei Honda, Queen Mary & Westfield College, UK
Mr. Jason Hurt, University of Nevada, USA
Ms. Ruth Ivimey-Cook, Creative Business Systems Ltd., UK
Dr. Jeremy Jacob, University of York, UK
Mr. Humaira Kamal, University of British Columbia, Canada
Dr. Adrian E. Lawrence, University of Loughborough, UK
Dr. Gavin Lowe, University of Oxford, UK
Dr. Jeremy M.R. Martin, GlaxoSmithKline, UK
Dr. Alistair McEwan, University of Leicester, UK
Dr. MohammadReza Mousavi, Eindhoven University of Technology, The Netherlands
Dr. Jan B. Pedersen, University of Nevada, USA
Mr. Brad Penoff, University of British Columbia, Canada
Dr. Fiona A.C. Polack, University of York, UK
Mr. Jon Simpson, University of Kent, UK
Dr. Marc L. Smith, Vassar College, USA
Mr. Bernhard H.C. Sputh, Altreonic, Belgium
Prof. Susan Stepney, University of York, UK
Prof. Gardiner (Dyke) Stiles, Utah State University, USA
Mr. Bernard Sufrin, University of Oxford, UK
Dr.ir. Johan P.E. Sunter, TASS, The Netherlands
Dr.ir. Jeroen P.M. Voeten, Eindhoven University of Technology, The Netherlands
Prof. Alan Wagner, University of British Columbia, Canada
Mr. Doug N. Warren, University of Kent, UK
Prof. George C. Wells, Rhodes University, South Africa
Contents

Preface Peter H. Welch, Herman Roebbers, Jan F. Broenink, Frederick R.M. Barnes, Carl G. Ritson, Adam T. Sampson, Gardiner (Dyke) Stiles and Brian Vinter
v
Editorial Board
vi
Reviewing Committee
vii
Beyond Mobility: What Next After CSP/π? Michael Goldsmith
1
The SCOOP Concurrency Model in Java-Like Languages Faraz Torshizi, Jonathan S. Ostroff, Richard F. Paige and Marsha Chechik
7
Combining Partial Order Reduction with Bounded Model Checking José Vander Meulen and Charles Pecheur
29
On Congruence Property of Scope Equivalence for Concurrent Programs with Higher-Order Communication Masaki Murakami
49
Analysing gCSP Models Using Runtime and Model Analysis Algorithms Maarten M. Bezemer, Marcel A. Groothuis and Jan F. Broenink
67
Relating and Visualising CSP, VCR and Structural Traces Neil C.C. Brown and Marc L. Smith
89
Designing a Mathematically Verified I2C Device Driver Using ASD Arjen Klomp, Herman Roebbers, Ruud Derwig and Leon Bouwmeester
105
Mobile Escape Analysis for occam-pi Frederick R.M. Barnes
117
New ALT for Application Timers and Synchronisation Point Scheduling (Two Excerpts from a Small Channel Based Scheduler) Øyvind Teig and Per Johan Vannebo
135
Translating ETC to LLVM Assembly Carl G. Ritson
145
Resumable Java Bytecode – Process Mobility for the JVM Jan Bækgaard Pedersen and Brian Kauke
159
OpenComRTOS: A Runtime Environment for Interacting Entities Bernhard H.C. Sputh, Oliver Faust, Eric Verhulst and Vitaliy Mezhuyev
173
Economics of Cloud Computing: A Statistical Genetics Case Study Jeremy M.R. Martin, Steven J. Barrett, Simon J. Thornber, Silviu-Alin Bacanu, Dale Dunlap and Steve Weston
185
An Application of CoSMoS Design Methods to Pedestrian Simulation Sarah Clayton, Neil Urquhart and Jon Kerridge
197
An Investigation into Distributed Channel Mobility Support for Communicating Process Architectures Kevin Chalmers and Jon Kerridge
205
Auto-Mobiles: Optimised Message-Passing Neil C.C. Brown
225
A Denotational Study of Mobility Joël-Alexis Bialkiewicz and Frédéric Peschanski
239
PyCSP Revisited Brian Vinter, John Markus Bjørndalen and Rune Møllegaard Friborg
263
Three Unique Implementations of Processes for PyCSP Rune Møllegaard Friborg, John Markus Bjørndalen and Brian Vinter
277
CSP as a Domain-Specific Language Embedded in Python and Jython Sarah Mount, Mohammad Hammoudeh, Sam Wilson and Robert Newman
293
Hydra: A Python Framework for Parallel Computing Waide B. Tristram and Karen L. Bradshaw
311
Extending CSP with Tests for Availability Gavin Lowe
325
Design Patterns for Communicating Systems with Deadline Propagation Martin Korsgaard and Sverre Hendseth
349
JCSP Agents-Based Service Discovery for Pervasive Computing Anna Kosek, Jon Kerridge, Aly Syed and Alistair Armitage
363
Toward Process Architectures for Behavioural Robotics Jonathan Simpson and Carl G. Ritson
375
HW/SW Design Space Exploration on the Production Cell Setup Marcel A. Groothuis and Jan F. Broenink
387
Engineering Emergence: An occam-π Adventure Peter H. Welch, Kurt Wallnau and Mark Klein
403
Subject Index
405
Author Index
407
Communicating Process Architectures 2009 P.H. Welch et al. (Eds.) IOS Press, 2009 © 2009 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-60750-065-0-1
Beyond Mobility: What Next After CSP/π?

Michael GOLDSMITH
WMG Digital Laboratory, University of Warwick, Coventry CV4 7AL, UK
[email protected]

Abstract. Process algebras like CSP and CCS inspired the original occam model of communication and process encapsulation. Later the π-calculus, and various treatments of mobility in CSP, added support for mobility, as realised in practical programming systems such as occam-π, JCSP, CHP and Sufrin’s CSO, which allow a rather abstract notion of motion of processes and channel ends between parents or owners. Milner’s Space and Motion of Communicating Agents, on the other hand, describes the bigraph framework, which makes location more of a first-class citizen of the calculus, and which evolves through reaction rules that rewrite both the place and link graphs of matching sections of a system state, allowing more dramatic dynamic reconfigurations of a system than simple process spawning or migration. I consider the tractability of the notation, and to what extent the additional flexibility reflects or elicits desirable programming paradigms.

Keywords. bigraphs, formal modelling, refinement, verification
Introduction

The Communicating Process Architecture approach to programming can be traced back to the occam of the 1980s [1], which in turn was the result of grafting together a rather simple (and hence amenable to analysis) imperative language with the interaction model of the two competing process algebras, CCS [2] and CSP [3,4]; occam lies squarely within the sphere where the two can agree, although both allow more dynamic process evolution (such as recursion through parallel operators, in the case of CSP), while occam deliberately enforces a rather static process structure, to simplify memory allocation and usage checking. The π-calculus [5,6] extends CCS with the ability dynamically to create and communicate (event/channel) names, which can then be used for further communication by their recipients. These notions have been realised in practical imperative programming systems such as Communicating Sequential Processes for Java (JCSP) [7], occam-π [8] and Communicating Scala Objects [9], while Communicating Haskell Processes (CHP) [10] embeds them into a functional-programming setting.

We have grown accustomed to thinking of the ability to transfer channel ends and (running) processes as “mobility”, but often this is little more than a metaphor: movement is between logical owners and perhaps processors, but (at least within these languages) there is no notion of geographical location or physical proximity, such as would be needed to treat adequately of pervasive systems or location-aware services. For considerations of this kind, something like the Ambient Calculus [11] might be more appropriate, but this concentrates rather exclusively on entry to and exit from places, rather than communication and calculation per se. Milner’s bigraphs [12] provide a framework within which all these formalisms can be represented and perhaps fruitfully cohabit.
In this paper, I give a brief overview of the notions and some musings on the potential and challenges they bring with them.
M. Goldsmith / Beyond Mobility
1. Bigraphs

A bigraph is a graph with two distinct sets of edges or, if you prefer, a pair of graphs with a (largely) common set of vertices. One is the place graph, a forest of vertices whose leaves (if any – there may equally be nodes with out-degree 0) are sites: placeholders into which another place graph might be grafted. A signature is a set of control types (with which the vertices of the place graph are annotated) and the arity (or number of ports) associated with each. The link graph, on the other hand, is a hypergraph relating zero or more ports belonging to nodes of the place graph, together with (possibly) names constituting the inner and outer faces of the bigraph. Names on the inner face (conventionally drawn below the diagrammatic representation) provide a connection to the link graph of a bigraph plugging into the sites of the place graph; those on the outer face link into any context into which the bigraph itself may be plugged. Thus a given bigraph can be viewed as an adaptor mediating the fit of a bigraph with as many regions (roots of its place-graph forest) as this one has sites, and with the set of names on its outer face equal to those on the inner face of the one in question, into a context with as many sites as this one has regions and inner name-set equal to our outer. For those so inclined, it can be regarded as an arrow from its inner interface¹ to its outer interface² in a suitable category, and composition of such arrows constitutes such a plug-in operation. Both places and links may be required to obey a sorting discipline, restricting the control types which may be immediate children of one another, or for instance restricting each link to joining at most one “driving” port; such well-formedness rules restrict the number of compositions that are possible, in comparison with those amongst unsorted bigraphs.
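To make the two layers concrete, the structure just described can be sketched as plain data: a forest of controlled nodes (the place graph) alongside a set of hyperedges over ports and face names (the link graph). The Java below is purely illustrative – every class and field name is invented here and is not the API of any bigraph tool:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative data-structure sketch only; names are not any tool's API.
class Node {
    final String control;                           // control type from the signature
    final int arity;                                // number of ports
    final List<Node> children = new ArrayList<>();  // place-graph children
    Node(String control, int arity) { this.control = control; this.arity = arity; }
}

class Link {
    // One hyperedge of the link graph: any number of ports plus face names.
    final Set<String> ports = new HashSet<>();      // e.g. "badge:0" = port 0 of a node
    final Set<String> names = new HashSet<>();      // inner/outer face names it touches
}

class Bigraph {
    final List<Node> regions = new ArrayList<>();   // roots of the place-graph forest
    final List<Link> links = new ArrayList<>();     // the link hypergraph
}
```

Composition (plugging one bigraph's regions into another's sites, and fusing links through shared face names) is exactly what this sketch leaves out; it is where the categorical machinery earns its keep.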
A bigraph provides only a (potentially graphical) representation of a state of a system, like a term in a traditional process algebra; it does not in itself define any notion of evolution such as to give rise to an operational semantics. This is presented in the form of a number of reaction rules, where a bigraphical redex is associated with a reactum into which any subgraph matching the redex may be transformed. The precise meaning of “match” is defined in categorical terms, in that there must exist contexts for both the current bigraph and the redex, such that the two in their respective contexts are equal. The result of the reaction is the reactum placed in the context required for the redex. For any well-behaved reaction system this last composition will be well defined, and one might hope that the context of the source bigraph could be stripped off the result to reflect a self-contained rewrite, but this is not part of the definition. Place graphs have to play a dual role: they may indeed represent physical location, and so juxtaposition (and parallel execution) of their children; but they are also the only structuring mechanism available, to represent possession of data or the continuation of a process beyond some guarding action. Thus it may be necessary to distinguish active and inactive control types, and to restrict matching of a redex to contexts where all its parents are active, lest the second term of a sequential composition, for example, should be reduced (and so effectively allowed to execute) before the first has successfully terminated. Similarly links might be expected to be used to model events or channels, and indeed one can formulate process algebras within this framework in such a way that they do, but they can also be used to represent more abstract relations between nodes, such as joint possession of some “secret” data value which it would be inconvenient to model as a distinguished type of control which two agents would have to “possess”. 
Note that links can relate multiple nodes, like events in CSP, and are not restricted to point-to-point connections. Thus they can also represent “barriers” in the occam-π sense.

¹ The pair of the number of sites and the inner name-set.
² The pair of the number of regions and the outer name-set.
1.1 Simple Examples

Figure 1 shows perhaps the simplest useful ambient-like reaction rule on place graphs: an atomic agent a who is adjacent to a room r might invoke this rule to move inside the room. Even here there is more than may meet the eye: we are in fact defining a parametric family of reaction rules, parameterised not only by the identities a and r, but also by the existing contents of the room (represented graphically by the pair of dots). There is also some abuse of notation involved, since strictly controls ought to have a fixed number of children (their rank), and here we have changed the number of nodes both inside and immediately outside the room; but there is some punning which lets us get away with such flexibility. Note that the context that matches the redex against (a context of) the source bigraph must also be flexible in this way, since the children of the node where it plugs in reduce in number.
Figure 1: the enter.a.r reaction – agent a moves from beside room r to inside it; the room’s existing contents are preserved.
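The shape of this rule can be mimicked, very loosely, as an in-place tree rewrite. The Java sketch below treats only the place graph, ignores link graphs and arities, and matches by node identity rather than up to context, so it captures the Figure 1 shape and nothing more; all names are invented for illustration:

```java
import java.util.ArrayList;
import java.util.List;

// Place-graph node: a control name plus children (illustrative only).
class Place {
    final String control;
    final List<Place> children = new ArrayList<>();
    Place(String control) { this.control = control; }
}

class EnterRule {
    // enter.a.r: if agent a and room r are siblings under parent, move a
    // inside r. r's existing contents (the parameter of the rule, drawn as
    // the pair of dots in Figure 1) are untouched.
    static boolean apply(Place parent, Place a, Place r) {
        if (parent.children.contains(a) && parent.children.contains(r)) {
            parent.children.remove(a);
            r.children.add(a);
            return true;
        }
        return false;
    }
}
```

Once a has entered, the redex no longer matches, so a second application fails – a crude stand-in for the side-condition that matching real bigraph redexes is defined up to context.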
Figure 2 shows a similar transformation, but here a secure room x will only admit a user who carries a badge with the correct authorisation. In this version (and so at this level of abstraction) we are probably treating the badge as a passive token, and the U control should probably be defined to be inactive; but in a slightly richer model it might be that it had an expiry date and an appropriate timeout reaction would unlink it from the unlocks.x attribute (though not necessarily from other unlocks permissions), and in this case U should be active to allow this to proceed autonomously.
Figure 2: the swipe.u.x reaction – user u, carrying a badge linked to unlocks.x, is admitted to the secure room x.
So far all the reactions have preserved the number and kind of nodes in the graph, and merely restructured the place graph. That certainly need not be the case.

Figure 3: the spend.u.x reaction – u’s single-use token is consumed on entry to x, leaving an empty pocket.
For example, in Figure 3, rather than a long-term badge, u holds a single-use token – at least until he uses it in order to enter x, at which point it simply vanishes, leaving him with an empty pocket. As a final example, consider communication in a language like occam: two processes, possibly in widely separated parts of a graph, can interact over a channel to resolve an alternative construct where the input end is guarding one of the branches (Figure 4).

Figure 4: the c.v reaction – an output !v synchronises over channel c with an input ?x guarding one branch of an ALT, which proceeds with x := v.
Note that the other alternatives are discarded from the ALT, that there remains an obligation to complete the data transfer on the right-hand side, and that the channel c remains available for connection of either or both of the components with each other or the other’s (sequential) context.
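As a loose programming-language analogy for the rendezvous in Figure 4 (not for its bigraph encoding, and ignoring the ALT): both sides of an occam channel block until the transfer happens, much as both sides of a Java SynchronousQueue do. The sketch below is illustrative only:

```java
import java.util.concurrent.SynchronousQueue;

// Loose Java analogy for the c!v / c?x rendezvous of Figure 4: a
// SynchronousQueue blocks each side until the other arrives, much as an
// occam channel does. (The ALT and the bigraph encoding are not modelled.)
class Rendezvous {
    static int transfer() {
        SynchronousQueue<Integer> c = new SynchronousQueue<>();
        Thread sender = new Thread(() -> {
            try {
                c.put(42);                 // c ! 42 on the output end
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        sender.start();
        try {
            int x = c.take();              // x := v (c ? x), once both are ready
            sender.join();
            return x;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return -1;
        }
    }
}
```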
2. Challenges

2.1 Semantics

The bigraph framework is extremely flexible and powerfully expressive; but with great power comes great responsibility! The category-theoretical semantics is quite rebarbative, at least to a CSP-ist, and much of the theory seems aimed at finding restrictions on a given bigraphical reactive system to ensure that the resulting transition system can be trimmed to make it tractable, while preserving its discrimination power (up to bisimulation). The labelled transition system naturally derived from a set of reaction rules over some set of ground bigraphs has, as its labels, the contexts into which the source bigraph must be placed in order to be unified with an (unspecified) redex, itself in some (unspecified) context. This results in a system where both the labels and the results of the transitions are infinite in number and arbitrarily complex in structure; this is clearly undesirable for any practical purpose. One set of conditions is designed to ensure that the essentials of this transition system are captured when attention is restricted to the minimal contexts enabling the rule to fire; another allows attention to be restricted further to prime engaged transitions (see [12]). But even here I find something unsatisfactory: the label describes only what needs to be done to the source bigraph to allow it to engage in some reaction. There is no immediate way to determine by which reaction rule, nor where in that composed process, the rewrite occurs. For instance, any reductions that can take place internally to the current state give rise to a transition labelled with the null (identity) context. In some ways this is quite natural, by analogy with the internal τ-actions of the operational semantics of CCS or CSP, but my instinct is that it will often be of crucial interest which reactions gave rise to those essentially invisible events.
In many cases it can be contrived (by subtle use of link graphs) that the minimal or prime engaged systems do agree with a more intuitive labelling scheme (such as might arise from the reaction names decorating the arrows in the above examples). But it remains far from clear to me whether there is any systematic way of attaching an intuitive scheme to (some subclass of) reactive systems, or of contriving a reactive system whose natural semantics are equivalent to one conceived from a given intuitive labelling. Of course, given my background, it is not really bisimilarity that interests me – I want to be able to encode interesting specifications and check for refinement using FDR [13]. At least the failures pre-order is known to be monotonic over well-behaved bigraphs (and failures equivalence is a congruence), so there is no fundamental problem here; but there remains a certain amount of research needed to attain fluency and to map onto FDR.

2.2 Applications

There are some areas where a tractable bigraphical treatment offers immediate benefit over the sometimes convoluted modelling needed in CSP, say, to achieve the same effect: the interface between physical and electronic security (of which the examples above are simplistic illustrations), pervasive adaptive systems, location-based and context-aware services, vehicle telematics, and so on. The security application has already received welcome attention in the form of Blackwell’s Spygraphs [14,15], which adds cryptographic primitives in the style of the Spi Calculus [16]. What others are there? Do they exhibit particular challenges or opportunities?

2.3 Language Features

The encryption and decryption schemas for symmetric and asymmetric cryptography lend themselves quite naturally to a bigraphical description with nodes representing data items –
see Figure 5. (Note however that this reaction rule is not nice, in the technical sense, because it duplicates the contents of the site in the redex into the reactum, and so is not affine.)

Figure 5: the decrypt.k reaction – with the key k present, an ENC node is opened to reveal (a copy of) its contents.
But are there any other data-processing features which could usefully be described in this way, possibly without the adjacency requirement? Perhaps external correlation attacks against putatively anonymised databases? More generally, suppose I describe some pervasive-adaptive system, say, and establish that the design has desirable properties; some of the agents in the system will describe people and other parts of the environment, but others will be components which need to be implemented. What is the language most appropriate for programming such components, so that we can attain a reasonable assurance that they will behave as their models do in all important aspects? Is it something like occam-π, or do we need new language features or whole new paradigms to reflect the potentially quite drastic reconfigurations that a reaction can produce? A challenge for the language designers out there!

References

[1] Inmos Ltd, occam Programming Manual, Prentice Hall, 1984.
[2] R. Milner, A Calculus of Communicating Systems, Springer Lecture Notes in Computer Science 92, 1980.
[3] C.A.R. Hoare, Communicating Sequential Processes, Prentice Hall, 1985.
[4] A.W. Roscoe, The Theory and Practice of Concurrency, Prentice-Hall, 1998.
[5] R. Milner, Communicating and Mobile Systems: the Pi-Calculus, Cambridge University Press, 1999.
[6] D. Sangiorgi and D. Walker, The π-calculus: A Theory of Mobile Processes, Cambridge University Press, 2001.
[7] JCSP home page. http://www.cs.kent.ac.uk/projects/ofa/jcsp/
[8] P.H. Welch, An occam-pi Quick Reference, 1996-2008, https://www.cs.kent.ac.uk/research/groups/sys/wiki/OccamPiReference
[9] B. Sufrin, Communicating Scala Objects, Communicating Process Architectures 2008, pp. 35-54, IOS Press, ISBN 978-1-58603-907-3, 2008.
[10] N.C.C. Brown, Communicating Haskell Processes: Composable Explicit Concurrency using Monads, Communicating Process Architectures 2008, pp. 67-83, IOS Press, ISBN 978-1-58603-907-3, 2008.
[11] L. Cardelli and A.D. Gordon, Mobile Ambients, in Foundations of Software Science and Computation Structures, Springer Lecture Notes in Computer Science 1378, pp. 140-155, 1998.
[12] R. Milner, The Space and Motion of Communicating Agents, Cambridge University Press, 2009.
[13] Formal Systems (Europe) Ltd, Failures-Divergence Refinement: the FDR2 Manual, http://www.fsel.com/fdr2_manual.html, 1997-2007.
[14] C. Blackwell, Spygraphs: A Calculus for Security Modelling, in Proceedings of the British Colloquium for Theoretical Computer Science (BCTCS), 2008.
[15] C. Blackwell, A Security Architecture to Model Destructive Insider Attacks, in Proceedings of the 8th European Conference on Information Warfare and Security, 2009.
[16] M. Abadi and A.D. Gordon, A Calculus for Cryptographic Protocols: The Spi Calculus, in Proceedings of the Fourth ACM Conference on Computer and Communications Security, pp. 36-47, 1997.
Communicating Process Architectures 2009 P.H. Welch et al. (Eds.) IOS Press, 2009 © 2009 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-60750-065-0-7
The SCOOP Concurrency Model in Java-like Languages

Faraz TORSHIZI a,1, Jonathan S. OSTROFF b, Richard F. PAIGE c and Marsha CHECHIK a

a Department of Computer Science, University of Toronto, Canada
b Department of Computer Science and Engineering, York University, Canada
c Department of Computer Science, University of York, UK

Abstract. SCOOP is a minimal extension to the sequential object-oriented programming model for concurrency. The extension consists of one keyword (separate) that avoids explicit thread declarations, synchronized blocks and explicit waits, and eliminates data races and atomicity violations by construction, through a set of compiler rules. SCOOP was originally described for the Eiffel programming language. This paper makes two contributions. Firstly, it presents a design pattern for SCOOP, which makes it feasible to transfer SCOOP’s concepts to different object-oriented programming languages. Secondly, it demonstrates the generality of the SCOOP model by presenting an implementation of the SCOOP design pattern for Java. Additionally, we describe tools that support the SCOOP design pattern, and give a concrete example of its use in Java.

Keywords. Object oriented, concurrency, SCOOP, Java
Introduction

Concurrent programming is challenging, particularly for less experienced programmers who must reconcile their understanding of programming logic with the low-level mechanisms (like threads, monitors and locks) needed to ensure safe, secure and correct code. Brian Goetz (a primary member of the Java Community Process for concurrency) writes:

[...] multicore processors are just now becoming inexpensive enough for midrange desktop systems. Not coincidentally, many development teams are noticing more and more threading-related bug reports in their projects. In a recent post on the NetBeans developer site, one of the core maintainers observed that a single class had been patched over 14 times to fix threading-related problems. Dion Almaer, former editor of TheServerSide, recently blogged (after a painful debugging session that ultimately revealed a threading bug) that most Java programs are so rife with concurrency bugs that they work only “by accident”. [...] One of the challenges of developing concurrent programs in Java is the mismatch between the concurrency features offered by the platform and how developers need to think about concurrency in their programs. The language provides low-level mechanisms such as synchronization and condition waits, but these mechanisms must be used consistently to implement application-level protocols or policies. [1, p.xvii]
F. Torshizi et al. / The SCOOP Concurrency Model in Java-like Languages

Java 6.0 adds richer support for concurrency, but the gap between low-level language mechanisms and programmer cognition persists. The SCOOP (Simple Concurrent Object Oriented Programming) model of concurrency [2,3] contributes towards bridging this gap in the object-oriented context. Instead of providing typical concurrency constructs (e.g., threads, semaphores and monitors), SCOOP operates at a different level of abstraction, providing a single new keyword, separate, that is used to denote an object running on a different processor. A processor is an abstract notion used to define behaviour. Processors can be mapped to virtual machine threads, OS threads, or even physical CPUs (in the case of multi-core systems). Calls within a single processor are synchronous (as in sequential programming), whereas calls to objects on other processors are dispatched asynchronously to those processors for execution while the current execution continues. This is all managed by the runtime, invisibly to the developer. The SCOOP compiler eliminates race conditions and atomicity violations by construction, thus automatically eliminating a large number of concurrency errors via a static compiler check. However, it remains prone to user-introduced deadlocks. SCOOP also extends the notion of Design by Contract (DbC) to the concurrent setting. Thus, one can specify and reason about concurrent programs with much of the simplicity of reasoning about sequential programs, while preserving concurrent execution [4].

1 Corresponding Author: Faraz Torshizi, 10 King's College Road, Toronto, ON, Canada, M5S 3G4. E-mail: [email protected].
Contributions
The SCOOP model is currently implemented in the Eiffel programming language, making use of Eiffel's built-in DbC support. In this paper, we make two contributions:

1. We provide a design pattern for SCOOP that makes it feasible to apply the SCOOP concurrency model to other object-oriented programming languages, such as Java and C#, that do not have Eiffel's constructs for DbC. The design has a front-end and a back-end. The front-end is aimed at software developers and allows them to write SCOOP code in their favorite language. The back-end (invisible to developers) translates their SCOOP programs into multi-threaded code. The design pattern adds the separate and await keywords using the language's meta-data facility (e.g., Java annotations or C# attributes). The advantage of this approach is that developers can use their favorite tools and compilers for type checking. The translation rules enforced by the design pattern are applied in the back-end.

2. We illustrate the design pattern with a prototype for Java, based on an Eclipse plug-in called JSCOOP. Besides applying the SCOOP design pattern to Java, and allowing SCOOP programs to be written in Java, the JSCOOP plug-in allows programmers to create projects and provides syntax highlighting and compile-time consistency checking for "traitors" that break the SCOOP concurrency model (Section 2.1). The plug-in uses a core library and translation rules for the generic design pattern to translate JSCOOP programs to multi-threaded Java. Compile-time errors are reported at the JSCOOP level using the Eclipse IDE. Currently the Eclipse plug-in supports preconditions (await) but not postconditions. This is sufficient to capture the major effects of SCOOP, but future work will need to address the implementation of full DbC.

In Section 1 we describe the SCOOP model in the Eiffel context. In Section 2 we describe the front-end design decisions for transferring the model to Java and C#.
In Section 3, we describe the back-end design pattern via a core library (see the UML diagram in Fig. 1 on page 8). Section 4 then presents a number of generic translation rules used by the pre-processor to generate multi-threaded code using the core library facilities. In Section 5 we describe the Eclipse plug-in, which provides front-end and back-end tool support for JSCOOP (the Java implementation of SCOOP); this shows that the design pattern is workable and allows us to write SCOOP code in Java. In Section 6, we compare our work to other work in the literature.
1. The SCOOP Model of Concurrency

The SCOOP model was described in detail by Meyer [2], and refined (with a prototype implementation in Eiffel) by Nienaltowski [3]. The SCOOP model is based on the basic concept of OO computation: a routine call t.r(a), where r is a routine called on a target t with argument a. SCOOP adds the notion of a processor (handler). A processor is an abstract notion used to define behaviour: routine r is executed on object t by a processor p. The notion of processor applies equally to sequential and concurrent programs. In a sequential setting, there is only one processor in the system and behaviour is synchronous. In a concurrent setting there is more than one processor; as a result, a routine call may continue without waiting for previous calls to finish (i.e., behaviour is asynchronous). By default, new objects are created on the same processor assigned to handle the current object. To allow the developer to denote that an object is handled by a different processor than the current one, SCOOP introduces the separate keyword. If the separate keyword is used in the declaration of an object t (e.g., an attribute), a new processor p will be created as soon as an instance of t is created. From that point on, all actions on t will be handled by processor p. The assignment of objects to processors does not change over time.

 1 class PHILOSOPHER create
 2   make
 3 feature
 4   left, right: separate FORK
 5   make (l, r: separate FORK)
 6     do
 7       left := l; right := r
 8     end
 9
10   act
11     do
12       from until False loop
13         eat (left, right) -- eating
14         -- thinking
15       end
16     end
17
18   eat (l, r: separate FORK)
19     require not (l.inuse or r.inuse)
20     do
21       l.pickup; r.pickup
22       if l.inuse and r.inuse then
23         l.putdown; r.putdown
24       end
25     end
26 end

Listing 1. PHILOSOPHER class in Eiffel.
The familiar example of the dining philosophers provides a simple illustration of some of the benefits of SCOOP. A SCOOP version of this example is shown in listings 1 and 2. Attributes left and right of type FORK are declared separate, meaning that each fork object is handled by its own processor (thread of control) separate from the current processor (the thread handling the philosopher object). Method calls from a philosopher to a fork object (e.g., pickup) are handled by the fork’s dedicated processor in the order that the calls are received. The system deadlocks when forks are picked up one at a time by competing philosophers. Application programmers may avoid deadlock by recognizing that two forks must
class FORK
feature
  inuse: BOOLEAN
  pickup is
    do inuse := True end
  putdown is
    do inuse := False end
end

Listing 2. FORK class in Eiffel.
be obtained at the same time for a philosopher to safely eat to completion. We encode this information in an eat method that takes a left and right fork as separate arguments. The eat routine invoked at line 13 waits until both fork processors are allocated to the philosopher and the precondition holds. The precondition (require clause) is treated as a guard that waits for the condition to become true, rather than as a correctness condition that generates an exception if it fails to hold. In our case, the precondition asserts that both forks must not be in use by other philosophers. A scheduling algorithm in the back-end ensures that resources are fairly allocated. Thus, philosophers wait for resources to become available, but are guaranteed that they eventually will become available, provided that all reservation methods like eat terminate. Such methods terminate provided they have no infinite loops and have reserved all relevant resources.

One of the difficulties of developing multi-threaded applications is the limited ability to re-use sequential libraries. A naive re-use of library classes that have not been designed for concurrency often leads to data races and atomicity violations. The mutual exclusion guarantees offered by SCOOP make it possible to assume a correct synchronization of client calls and to focus on solving the problem without worrying about the exact context in which a class will be used. The separate keyword does not occur in class FORK. This shows how classes (such as FORK) written in the sequential context can be re-used without change for concurrent access. Fork methods such as pickup and putdown are invoked atomically from the eat routine (which holds the locks on these forks). Once we enter the body of the eat routine (with a guarantee of locks on the left and right forks and the precondition satisfied), no other processors can send messages to these forks; they are under the control of this routine.
When a philosopher (the client) calls fork procedures (such as l.pickup), these procedures execute asynchronously (e.g., the pickup routine call is dispatched to the processor handling the fork, and the philosopher continues executing the next instruction). In routine queries such as l.inuse and r.inuse at line 22, however, the client must wait for a result and thus must also wait for all previous separate calls on the fork to terminate. This is called wait-by-necessity.

2. The Front-end for Java-like Languages

Fair versions of the dining philosophers in pure Java (e.g., see [5, p. 137]) are quite complex. The low-level constructs for synchronization add to the complexity of the code. In addition, the software developer must explicitly implement a fair scheduler to manage multiple resources in the system. The scheduling algorithm for a fair Java implementation in [5] requires an extra 50 lines of dense code in comparison to the SCOOP code of Listings 1 and 2. Is there a way to obtain the benefits of SCOOP programs for languages such as Java and C#? Our design goal in this endeavor is to allow developers to write SCOOP code using their favorite editors and compilers for type checking.
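For contrast, even a minimal hand-rolled attempt in plain Java must manage locks explicitly. The sketch below (with hypothetical RawFork and RawPhilosopher classes; this is not the fair implementation of [5], which needs considerably more machinery) shows the lock-ordering discipline a developer must get right by hand, and which SCOOP's reservation of separate arguments makes automatic:

```java
import java.util.concurrent.locks.ReentrantLock;

// Sketch: acquiring both forks with raw locks. Deadlock avoidance already
// requires a global lock order; fairness and guards need still more code.
class RawFork {
    final ReentrantLock lock = new ReentrantLock();
    boolean inUse;
}

class RawPhilosopher {
    void eat(RawFork l, RawFork r) {
        // Lock in a fixed global order to avoid deadlock (identity hash as a
        // crude, collision-prone ordering; a real program needs a total order).
        RawFork first = System.identityHashCode(l) <= System.identityHashCode(r) ? l : r;
        RawFork second = (first == l) ? r : l;
        first.lock.lock();
        try {
            second.lock.lock();
            try {
                l.inUse = true; r.inUse = true;
                // ... eat ...
                l.inUse = false; r.inUse = false;
            } finally { second.lock.unlock(); }
        } finally { first.lock.unlock(); }
    }
}
```
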
public class Philosopher {
  private @separate Fork rightFork;
  private @separate Fork leftFork;

  public Philosopher (@separate Fork l, @separate Fork r) {
    leftFork = l; rightFork = r;
  }

  public void act() {
    while (true) {
      eat(leftFork, rightFork); // non-separate call
    }
  }

  @await(pre="!l.isInUse()&&!r.isInUse()")
  public void eat(@separate Fork l, @separate Fork r) {
    l.pickUp(); r.pickUp(); // separate calls
    if (l.isInUse() && r.isInUse()) {
      l.putDown(); r.putDown();
    }
  }
}

Listing 3. Example of a SCOOP program written in Java.
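The @separate and @await annotations used in Listing 3 are not part of standard Java. One plausible way to declare them (a sketch; the actual JSCOOP library definitions may differ) is:

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Sketch of the two JSCOOP annotations; names mirror the listings.
@Retention(RetentionPolicy.RUNTIME) // visible to the pre-processor and via reflection
@Target({ElementType.FIELD, ElementType.PARAMETER, ElementType.LOCAL_VARIABLE})
@interface separate {}

@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface await {
    String pre(); // the guard expression, e.g. "!l.isInUse()&&!r.isInUse()"
}

public class AnnotationDemo {
    static class Fork { boolean isInUse() { return false; } }

    @separate Fork leftFork; // field handled by its own processor

    @await(pre = "!l.isInUse()&&!r.isInUse()")
    public void eat(@separate Fork l, @separate Fork r) { }
}
```

Because the annotations have no effect on normal compilation, the ordinary Java compiler type-checks the program while the back-end pre-processor reads the markers to drive translation.
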
public class Philosopher {
  [separate] private Fork rightFork;
  [separate] private Fork leftFork;

  public Philosopher ([separate] Fork l, [separate] Fork r) {
    leftFork = l; rightFork = r;
  }

  public void act() {
    while (true) {
      eat(leftFork, rightFork); // non-separate call
    }
  }

  [await("!l.isInUse()&&!r.isInUse()")]
  public void eat([separate] Fork l, [separate] Fork r) {
    l.pickUp(); r.pickUp();
    if (l.isInUse() && r.isInUse()) {
      l.putDown(); r.putDown();
    }
  }
}

Listing 4. Example of a SCOOP program written in C#.
At the front-end, we can satisfy our design goal by using the meta-data facility of modern languages, as shown in Listings 3 and 4. The meta-data facility (e.g., annotations in Java and attributes in C#) provides data about the program that is not part of the program itself; it has no direct effect on the operation of the code it annotates. The listings show SCOOP for Java (which we call JSCOOP) and SCOOP for C# (which we call CSCOOP) via the annotation facility of these languages. In Eiffel, the keyword require is standard and is re-used in SCOOP for the guard; thus only one new keyword (separate) is needed. Java and C# do not support contracts.
So we need to add two keywords via annotations: separate and await (for the guard). The front-end uses the Java or C# compiler to do type checking. In addition, the front-end must perform a number of consistency checks to ensure the correct usage of the separate construct (see below).

2.1. Syntax Checking

In SCOOP, method calls are divided into two categories:

• A non-separate call, where the target object of the call is not declared as separate. This type of call is handled by the current processor. Waiting happens only if the parameters of this method contain references to separate objects (i.e., objects handled by different processors) or there exists an unsatisfied await condition. Before executing a non-separate call, two conditions have to hold: (a) locks must be acquired automatically on all processors handling separate parameters, and (b) the await condition must be true. Locks are passed down further (if needed) and are released at the end of the call. Therefore, atomicity is guaranteed at the level of methods.

• A separate call. In the body of a non-separate method one can safely invoke methods on the separate objects (i.e., objects declared as separate in the parameters). This type of call, where the target of the call is handled by a different processor, is referred to as a separate call. Execution of a separate call is the responsibility of the processor that handles the target object (not the processor invoking the call). Therefore, the invoking processor can send a message (call request) to the processor handling the target object and move on to the next instruction without waiting for the separate call to return. The invoking processor waits for the results only at the point where they are necessary (e.g., when the results are needed in an assignment statement); this is referred to as wait-by-necessity.
Due to the differences between the semantics of separate and non-separate calls, it is necessary to check that a separate object is never assigned to a variable that is declared as non-separate. Entities declared as non-separate but pointing to separate objects are called traitors. It is important to detect traitors at compile-time, so that we have a guarantee that remote objects cannot be accessed except through the official locking mechanism, which guarantees freedom from atomicity and race violations. The checks for traitors must be performed by the front-end. We use the Eclipse plug-in (see Section 5) for JSCOOP to illustrate the front-end, which must check the following separateness consistency (SC) rules [3]:

• SC1: If the source of an attachment (assignment instruction or parameter passing) is separate, its target entity must be separate too. This rule makes sure that the information regarding the processor of a source entity is preserved in the assignment. As an example, line 9 in Listing 5 is an invalid assignment because its source x1 is separate but its target is not. Similarly, the call in line 11 is invalid because the actual argument is separate while the corresponding formal argument is not. There is no rule prohibiting attachments in the opposite direction, from non-separate to separate entities (e.g., line 10 is valid).

• SC2: If an actual argument of a separate call is of a reference type, the corresponding formal argument must be declared as separate. This rule ensures that a non-separate reference passed as an actual argument of a separate call is seen as separate outside the processor boundary. Assume that method f of class X (line 22 in Listing 5) takes a non-separate argument and method g takes a separate argument (line 23). The client is not allowed to use its non-separate attribute a as an actual argument of x.f because, from the point of view of x, a is separate. On the other hand, the call x.g(a) is valid.
• SC3: If the source of an attachment is the result of a separate call to a function returning a reference type, the target must be declared as separate. If query q of class X returns a reference, that result should be considered as separate with respect to the client object. Therefore, the assignment in line 34 is valid while the assignment in line 35 is invalid.

 1 public class Client{
 2   public static @separate X x1;
 3   public static X x2;
 4   public static A a;
 5
 6   public void sc1(){
 7     @separate X x1 = new X();
 8     X x2 = new X();
 9     x2 = x1; // invalid: traitor
10     x1 = x2; // valid
11     r1 (x1); // invalid
12   }
13
14   public void r1 (X x){}
15
16   public void sc2(){
17     @separate X x1 = new X();
18     r2 (x1);
19   }
20
21   public void r2 (@separate X x){
22     x.f(a); // invalid
23     x.g(a); // valid
24   }
25
26   public void sc3() {
27     @separate X x1 = new X();
28     s (x1);
29   }
30
31   public void s (@separate X x){
32     @separate A res1;
33     A res2;
34     res1 = x.q; // valid
35     res2 = x.q; // invalid
36   }
37 }

Listing 5. Syntax checking.
A snapshot of the tool illustrating the above compile errors is shown in Fig. 6 on page 16.

3. The Core Library for the Back-end

In this section we focus on the design and implementation of the core library classes shown in Fig. 1; these form the foundation of the SCOOP implementation. The core library provides support for essential mechanisms such as processors, separate and non-separate calls, atomic locking of multiple resources, wait semantics, wait-by-necessity, and fair scheduling. Method calls on an object are executed by its processor. Processors are instances of the Processor class. Every processor has a local call stack and a remote call queue. The local stack is used for storing non-separate calls and the remote call queue is used for storing calls
[Figure 1 (UML diagram): the CoreLibrary package contains the ScoopThread interface and the Scheduler, Processor, LockRequest and Call classes; the «Java» JSCOOP package (JScoopThread, extending Runnable) and the «C#» CSCOOP package (CScoopThread, using System.Threading.Thread) both import the core library.]
Figure 1. Core library classes associated with the SCOOP design pattern (see Fig. 9 in the appendix for more details).
made by other processors. A processor can add calls to the remote call queue of another processor only when it has a lock on the receiving processor. All calls need to be parsed to extract method names, return types, parameters, and parameter types. This information is stored in Call objects (acting as wrappers for method calls). The queue and the stack are implemented as lists of Call elements.

The ScoopThread interface is implemented by classes whose instances are intended to be executed by a thread, e.g., Processor (whose instances run on a single thread acting as the "processor" executing operations) or the global scheduler Scheduler. This interface allows the rest of the core library to rely on certain methods being present in the translated code. Depending on the language in which SCOOP is implemented, ScoopThread can be redefined to obey the thread creation rules of that language. For example, the Java-based implementation of this interface (JScoopThread) extends Java's Runnable interface.

The Scheduler should be instantiated once for every SCOOP application. This instance acts as the global resource scheduler, and is responsible for checking await conditions and acquiring locks on supplier processors on behalf of client processors. The scheduler manages a global lock request queue where locking requests are stored. The execution of a method by a processor may result in the creation of a call request (an instance of Call) and its addition to the global request queue. It is the job of the SCOOP scheduler to atomically lock all the arguments (i.e., the processors associated with the arguments) of a routine. For example, the execution of eat (Listing 3) blocks until both processors handling the left and right forks have been locked. The maximum number of locks acquired atomically is the same as the maximum number of formal arguments allowed in the language.

Every class that implements the ScoopThread interface must define a run() method. Starting the thread causes the object's run() method to be called in that thread. In the run() method, each Processor repeatedly performs the following actions:

1. If there is an item on the call stack which has a wait condition or a separate argument, the processor sends a lock request (LockRequest) to the global scheduler Scheduler and then blocks until the scheduler sends back the "go-ahead" signal. A lock request maintains a list of processors (corresponding to the separate arguments of the method) that need to be locked, as well as a Semaphore object which allows this processor to block until it is signaled by the scheduler.

2. If the remote call queue is not empty, the processor dequeues an item from the remote call queue and pushes it onto the local call stack.

3. If both the stack and the queue are empty, the processor waits for new requests to be enqueued by other processors.

3.1. Scheduling

When a processor p is about to execute a non-separate routine r that has parameters, it creates a call request and adds this request to the global request queue (handled by the scheduler processor). A call request contains (a) the identity of the requesting processor p, (b) the list of resources (processors) requested to be locked, (c) the await condition, and (d) the routine r itself. When the request is added to the global queue, p waits (blocks) until the request is granted by the SCOOP scheduler. Requests are processed by the SCOOP scheduler in FIFO order. The scheduler first tries to acquire locks on the requested processors on behalf of the requesting processor p.
If all locks are acquired, the scheduler checks the await condition. According to the original description of the algorithm in [3], if either (a) acquiring the locks on the processors or (b) the await condition check fails, the scheduler moves on to the next request, leaving the current one intact. If both locking and the await check succeed, the request is granted, giving the client processor permission to execute the body of the associated routine. The client processor releases the locks on the resources (processors) as soon as it finishes executing the routine.

 1 public class ClassA
 2 {
 3   private @separate ClassB b;
 4   private @separate ClassC c;
 5   ...
 6   this.m (b, c);
 7   ...
 8   @await(pre="arg-B.check1()&&arg-C.check2()")
 9   public void m(@separate ClassB arg-B, @separate ClassC arg-C){
10     arg-B.f(arg-C); // separate call (no waiting)
11     arg-C.g(arg-B); // separate call (no waiting)
12     ...
13   }
14   ...
15 }

Listing 6. Example of a non-separate method m and separate methods f and g.
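The FIFO scheduling pass described in Section 3.1 can be sketched in Java as follows. The names are illustrative (they echo, but do not reproduce, the core library), binary semaphores stand in for processor locks, and the real scheduler runs this loop continuously on its own thread:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.Semaphore;
import java.util.function.BooleanSupplier;

// Sketch of a lock request: resources to lock, the await condition, and
// the semaphore the client processor blocks on until it is granted.
class LockRequest {
    final List<Semaphore> resources;          // locks on the requested processors
    final BooleanSupplier awaitCondition;     // the translated await clause
    final Semaphore goAhead = new Semaphore(0);

    LockRequest(List<Semaphore> resources, BooleanSupplier awaitCondition) {
        this.resources = resources;
        this.awaitCondition = awaitCondition;
    }
}

class Scheduler {
    private final Deque<LockRequest> queue = new ArrayDeque<>();

    void submit(LockRequest r) { queue.add(r); }

    // One pass over the request queue, in FIFO order.
    void schedulePass() {
        Iterator<LockRequest> it = queue.iterator();
        while (it.hasNext()) {
            LockRequest r = it.next();
            List<Semaphore> acquired = new ArrayList<>();
            boolean locked = true;
            for (Semaphore lock : r.resources) {
                if (lock.tryAcquire()) acquired.add(lock);
                else { locked = false; break; }
            }
            if (locked && r.awaitCondition.getAsBoolean()) {
                it.remove();
                r.goAhead.release(); // grant: the client runs the routine body
                // (the client releases the resource locks when the routine ends)
            } else {
                // Locking or the await check failed: back off and leave the
                // request intact for a later pass.
                for (Semaphore lock : acquired) lock.release();
            }
        }
    }
}
```

Note how a failed request is left in place with its partial locks rolled back, matching the "move to the next request, leaving the current one intact" behaviour described above.
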
Figure 2. Sequence diagram showing a non-separate call followed by two separate calls involving three processors A, B and C (processor is abbreviated as proc and semaphore as sem).
3.2. Dynamic Behaviour

In order to demonstrate what happens during a separate or non-separate call, consider the JSCOOP code in Listing 6. We assume that processor A (handling this of type ClassA) is about to execute line 6. Since method m involves two separate arguments (arg-B and arg-C), A needs to acquire the locks on both processors handling the objects attached to these arguments (i.e., processors B and C). Fig. 2 is a sequence diagram illustrating the actions that happen when executing m. First, A creates a request asking for locks (lock-semaphore-B and lock-semaphore-C) on both processors handling the objects attached to the arguments, and sends this request to the global scheduler. A then blocks on another semaphore (call-semaphore-A) until it receives the "go-ahead" signal from the scheduler. The scheduler acquires both lock-semaphore-B and lock-semaphore-C in a fair manner and safely evaluates the await condition. If the await condition is satisfied, the scheduler releases call-semaphore-A, signaling processor A to continue to the body of m. A can then safely add the separate calls f and g (lines 10 and 11) to the remote call queues of B and C respectively, and continue to the next instructions without waiting for f and g to return.

The call semaphore described above is also used for query calls, where wait-by-necessity is needed. After submitting a remote call to the appropriate processor, the calling object blocks on the call semaphore. The call semaphore is released by the executing object after the method has terminated and its return value has been stored in the Call object. Once the calling object has been signaled, it can retrieve the return value from the Call object. What happens if the body of method f has a separate call on the object attached to arg-C (i.e., processor B invoking a call on C)? Since A already has a lock on C, B cannot get that lock and would have to wait until the end of method m.
To avoid cases like this, the caller processor "passes" the needed locks to the receiver processor and lets the receiver processor release those locks. Therefore, A passes the lock that it holds on C to B, allowing B to continue safely. Locks are passed down further (if needed) and are released at the end of the call.
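The call-semaphore mechanism behind wait-by-necessity can be condensed into a small sketch. QueryCall below is a hypothetical stand-in for the core library's Call wrapper, which stores the result internally:

```java
import java.util.concurrent.Semaphore;

// Sketch of wait-by-necessity: the client blocks on the call's semaphore
// only at the point where it actually needs the result.
class QueryCall<T> {
    private final Semaphore callSemaphore = new Semaphore(0);
    private T result;

    // Supplier side: the handler processor runs the query, stores the
    // result, and releases the semaphore.
    void complete(T value) {
        result = value;
        callSemaphore.release();
    }

    // Client side: returns immediately if the call has already finished,
    // otherwise blocks until the supplier processor signals completion.
    T awaitResult() throws InterruptedException {
        callSemaphore.acquire();
        return result;
    }
}
```

For example, a philosopher evaluating the query l.isInUse() would block in awaitResult() until the fork's processor has executed all previously queued calls and stored the answer.
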
4. Translation Rules

In this section we describe some of the SCOOP translation rules for Java-like languages. These rules take the syntax-checked annotated code (with separate and await) as their input and generate pure multi-threaded code (without the annotations) as their output. The output uses the core library facilities to create the dynamic behavior required by the SCOOP model. All method calls in SCOOP objects must be evaluated upon translation to determine whether a call is non-separate, separate, requires locks, or has an await condition. If the call is non-separate, requires no locks (i.e., does not have any separate arguments) and has no await condition, it can be executed as-is. However, if the call is separate, requires locks, or has an await condition, it is flagged for translation. All classes that use separate variables, have await conditions, or are used in a separate context are marked for translation. In addition, the main class (the entry point of the application) is marked for translation. For each marked class A, a corresponding SCOOP class SCOOP_A is created which implements the ScoopThread interface.

4.1. Non-separate Method

Fig. 3 shows the rule for translating non-separate methods taking separate argument(s). To make the rule more generic, we use variables (entities starting with %). The method call (%Method) is invoked on an entity (%VAR2) which is not declared as separate. This call is therefore a non-separate call and should be executed by the current processor. The call takes one or more separate arguments (indicated as %SepArg0 to %SepArgN) and zero or more non-separate arguments (indicated as %Arg0 to %ArgN). The method may have a return value of type %ReturnType. The SCOOP model requires that all formal separate arguments of a method have their processors locked by the scheduler before the method can be executed.
In order to send the request to the scheduler, this processor needs to wrap the call and the required resources in a Call object. The creation of the Call wrapper scoop_call is done in line 20. In order to create such a wrapper we need to (1) create a LockRequest object that contains the list of locks required by this call, i.e., the processors handling the objects attached to the arguments (lines 3–8), (2) capture the arguments and their types (lines 12–18), and (3) capture the return type of this call. We can assume that the classes corresponding to the separate arguments are marked for translation to SCOOP_%Type0 to SCOOP_%TypeN. This allows us to access the remote call queues of the processors associated with these arguments. The new Call object is added to the stack of this processor (line 23) to be processed by the run() method (not shown here). Finally, the current processor blocks on its call semaphore (line 24) until it receives the go-ahead signal (release()) from the scheduler. The thread can safely enter the body of the method (line 26) after the scheduler issues the signal (the body of the method is translated by the separate-call rule). At the end of the method, the lock semaphores on all locked processors are released (lines 28–32).

Annotated code:

   // Method call on a non-separate object, including method
   // calls on 'this'.
   Class %ClassName {
     ...
     %ReturnType %VAR1;
     %TargetType %VAR2;
     ...
     [%VAR1 =] %VAR2.%Method(%Type0 %SepArg0, [..., %TypeN %SepArgN,
       %Type00 %Arg0, ..., %TypeNN %ArgN]);
     ...
   }

Translation:

    1  Class SCOOP_%ClassName {
    2    ...
    3    List<Processor> locks;
    4    // loop
    5    locks.add(%SepArg0.getProcessor());
    6    ... // do this for all separate args
    7    locks.add(%SepArgN.getProcessor());
    8    // end loop
    9    lock_request := new LockRequest(%VAR2, locks,
   10      this.getProcessor().getLockSemaphore());
   11    List<Object> args_types, args;
   12    // loop
   13    args_types.add(SCOOP_%Type0);
   14    args.add(%SepArg0);
   15    ... // do this for all args
   16    args_types.add(%TypeNN);
   17    args.add(%ArgN); ...
   18    // end loop
   19    Semaphore call_semaphore = new Semaphore(0);
   20    scoop_call := new Call("%Method", args_types, args,
   21      %ReturnType, lock_request, %VAR2,
   22      call_semaphore);
   23    getProcessor().addLocalCall(scoop_call);
   24    call_semaphore.acquire();
   25    // call the translated version of %Method (see Rule 3)
   26    [%VAR1 =] %VAR2.translated_%Method(%SepArg0, [..., %SepArgN,
   27      %Arg0, ..., %ArgN]);
   28    // loop
   29    locks[0].unlockProcessor();
   30    ... // do this for all locked processors
   31    locks[N].unlockProcessor();
   32    // end loop
   33    ...
   34  }

Figure 3. Translation rule for invoking a non-separate method taking separate arguments.

4.2. Await Condition

Fig. 4 shows the translation rule that deals with await conditions. The await conditions for all methods of a class are encapsulated in a single boolean method checkPreconditions located in the generated SCOOP_%ClassName. This method is executed only by the scheduler thread (right after the lock semaphores on all requested processors are acquired). The scheduler sets the attribute call of this class and then runs checkPreconditions. In this method, each of the await conditions is translated and suited to the SCOOP_%ClassName context, i.e., separate variables in the await clause are cast to the corresponding SCOOP types. As an example, if we have the await condition a.m1() == b.m2(), where a is a separate variable of type A and b is a non-separate variable of type B, the corresponding await condition translation looks like ((SCOOP_A) a).m1() == b.m2(). checkPreconditions returns true iff the await condition associated with the call evaluates to true.

4.3. Separate Call

This rule describes the translation of a separate method (%SepMethod) invoked on a separate target object %SepArg0 in the body of a non-separate call %Method. Since %Method has at least one argument that is declared to be separate (i.e., %SepArg0), we can assume that the corresponding classes are marked for translation (i.e., SCOOP_%Type0). This allows us to access the remote call queues of the processors associated with those arguments. As in the first rule, we first create a list of processors that need to be locked for %SepMethod (lines 8–12), and create a lock request from that information (line 13). We collect the arguments and their
Annotated code Class % ClassName { ... // method signature % AwaitCondition0 ... % Method0 (% Args0 ...% ArgsA ) { ... } ... % AwaitConditionN ... % MethodN (% Args0 ...% ArgsB ) { ... } }
Translation
Class SCOOP_%ClassName {
    ...
    public boolean checkPreconditions() {
        Call this_call = call; // set by the scheduler
        String method_name = this_call.getMethodName();
        int num_args = this_call.getNumArgs();
        // "Reflection" method to check wait condition
        if (method_name.equals("%Method0")) {
            if (num_args == A) {
                if (Translated_%AwaitCondition0)
                    return true;
                else
                    return false;
            }
        } // this is done for all methods
        ...
        else if (method_name.equals("%MethodN")) {
            if (num_args == B) {
                if (Translated_%AwaitConditionN)
                    return true;
                else
                    return false;
            }
        }
        ...
    }
}
Figure 4. Translation rule for methods with await condition.
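The generated dispatch of Figure 4 amounts to a name-and-arity switch that evaluates the translated await condition for the requested method. A hand-written Java approximation gives its shape; the class, the method names, and the stand-in condition below are invented for illustration and are not the tool's actual output.

```java
// Hand-written approximation of a generated checkPreconditions(): dispatch on
// method name and argument count, then evaluate the translated await condition
// for that method. Names and conditions are illustrative only.
public class Preconditions {
    private final boolean forksFree; // stands in for a translated await condition

    public Preconditions(boolean forksFree) {
        this.forksFree = forksFree;
    }

    public boolean checkPreconditions(String methodName, int numArgs) {
        if (methodName.equals("eat") && numArgs == 2) {
            return forksFree;   // Translated_%AwaitCondition for "eat"
        } else if (methodName.equals("think") && numArgs == 0) {
            return true;        // no await condition: always schedulable
        }
        return false;           // unknown call: never schedulable
    }
}
```

The scheduler can call such a method repeatedly until it returns true, which is exactly the role checkPreconditions plays in the generated code.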
types in lines 15–22. The Call object is finally created in line 23. In order to schedule the call for execution, we add the call object to the end of the remote call queue of the processor responsible for %SepArg0 (line 25). After the request has been sent, the current thread can safely continue to the next instruction without waiting.

This rule only deals with void separate calls. If the call returns a value needed by the current processor, the current processor has to wait for the result (wait-by-necessity). This is achieved by waiting on the call semaphore: the supplier processor releases the call semaphore as soon as it is done with the call.

In this section we only showed three of the main translation rules; the rest are available online¹. A sample translation of JSCOOP code to Java can be found in the appendix.

5. Tool Support

JSCOOP is supported by an Eclipse plug-in² that provides well-formedness checking, as well as syntax highlighting and other typical features of a development environment. Overall, the
¹ http://www.cs.toronto.edu/~faraz/jscoop/rules.pdf
² http://www.cs.toronto.edu/~faraz/jscoop/jscoop.zip
Annotated code
Class %ClassName {
    ...
    // method body containing a separate call
    %ReturnType %Method(%Type0 %SepArg, ...) {
        ...
        %SepArg0.%SepMethod([%SepArg0, ..., %SepArgN, ...]);
        ...
    }
    ...
}
Class %Type {
    ...
    // supplier side
    void %SepMethod(%Type00 %Arg0, ... %TypeNN %ArgN) { ... }
    ...
}
Translation
 1  Class SCOOP_%ClassName {
 2      ...
 3      // translated method
 4      %ReturnType %Method(SCOOP_%Type0 %SepArg, ...) {
 5          ...
 6          Call call;
 7          List<Processor> locks;
 8          // loop
 9          locks.add(%SepArg0.getProcessor());
10          ... // do this for all separate arguments of %SepMethod
11          locks.add(%SepArgN.getProcessor());
12          // end loop
13          lock_request = new LockRequest(%SepArg, locks,
14              %SepArg.getProcessor().getLockSemaphore());
15          List<Object> args_types, args;
16          // loop
17          args_types.add(%Type00); // use SCOOP_ for separate types
18          args.add(%Arg0);
19          ... // do this for all arguments of %SepMethod
20          args_types.add(%TypeNN);
21          args.add(%ArgN);
22          // end loop
23          call = new Call("%SepMethod", args_types, args, void,
24              lock_request, %SepArg0, void);
25          %SepArg0.getProcessor().addRemoteCall(call);
26          ... // move on to the next operation without waiting
27      }
28      ...
29  }
Figure 5. Translation rule for separate methods.
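The essential effect of Figure 5 is fire-and-forget scheduling: append the call to the supplier processor's remote queue and continue without waiting. That effect can be mimicked with a plain queue; `RemoteQueue` below is a stand-in for illustration, not part of the generated code.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch of asynchronous scheduling: the client appends a call to the
// supplier's queue and moves on immediately; the supplier later drains the
// queue in FIFO order, so calls execute in the order they were issued.
public class RemoteQueue {
    private final BlockingQueue<Runnable> calls = new LinkedBlockingQueue<>();

    public void addRemoteCall(Runnable call) {
        calls.add(call); // non-blocking: the client does not wait
    }

    public void drain() { // supplier side: run all queued calls
        Runnable r;
        while ((r = calls.poll()) != null) {
            r.run();
        }
    }

    public static void main(String[] args) {
        RemoteQueue q = new RemoteQueue();
        StringBuilder log = new StringBuilder();
        q.addRemoteCall(() -> log.append("pickUpL;"));
        q.addRemoteCall(() -> log.append("pickUpR;"));
        log.append("client-continues;"); // the client never waited
        q.drain();
        System.out.println(log); // client-continues;pickUpL;pickUpR;
    }
}
```

The log order shows the client resuming before either queued call runs, which is the "move on to the next operation without waiting" of line 26.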
plug-in consists of two packages:

1. edu.jscoop: This package is responsible for well-formedness and syntax checking, and for GUI support for JSCOOP project creation. A single visitor class is used to parse JSCOOP files for problem determination.

2. edu.jscoop.translator: This package is responsible for the automated translation of JSCOOP code to Java code. A basic framework for automated translation has been designed, in anticipation of future development to fully implement this feature.

The plug-in reports errors to the editor on incorrect placement of annotations and checks for violations of the SCOOP consistency rules. We use the Eclipse AST parser to isolate methods decorated by an await annotation. Await conditions, passed to await as strings, are translated into corresponding assert statements which are used to detect problems. As a result,
all valid assert conditions compile successfully when passed to await in the form of a string, while all invalid assert conditions are marked with a problem. For example, on the input @await(pre="x=10"), our tool reports "Type mismatch: cannot convert from Integer to boolean." All await statements are translated into assert statements and rewritten to a new abstract syntax tree, which is in turn written to a new file. This file is then traversed for any compilation problems associated with the assert statements, which are reflected back to the original JSCOOP source file for display to the developer.

The Eclipse AST parser is also able to isolate separate annotations. We use the Eclipse AST functionality to type-check the parameters passed in a method call by comparing them to the method declaration. Separateness consistency rules SC1 and SC2 are checked when the JSCOOP code is compiled using the plug-in. Fig. 6 illustrates these compile-time checks. In order to process the JSCOOP annotations, we have created a hook into Eclipse's JDT CompilationParticipant. The JSCOOP CompilationParticipant acts on all JSCOOP projects, using a JSCOOP Visitor to visit and process all occurrences of separate and await in the source code.

6. Related Work

The Java with Annotated Concurrency (JAC) system [6] is similar in intent and principle to JSCOOP. JAC provides concurrency annotations (specifically, controlled and compatible) that are applicable to sequential program text. JAC is based on an active object model [7]. Unlike JSCOOP, JAC does not provide a wait/precondition construct, arguing instead that both waiting and exceptional behaviour are important for preconditions. Also, via the compatible annotation, JAC provides means for identifying methods whose execution can safely overlap (without race conditions), i.e., it provides mutual exclusion mechanisms.
JAC also provides annotations for methods to indicate whether calls are synchronous, asynchronous, or autonomous (in the sense of active objects). The former two annotations are subsumed by the separate annotation of SCOOP (and JSCOOP). Overall, JAC provides additional annotations beyond those of SCOOP and JSCOOP, thus allowing a greater degree of customisability, while requiring more from the programmer in terms of annotation and guaranteeing less in terms of safety (i.e., through the type safety rules of [3]).

Similar annotation-based mechanisms have been proposed for JML [8,9]; the annotation mechanisms for JML are at a different level of abstraction than JSCOOP, focusing on method-level locking (e.g., via a new locks annotation). Of note with the JML extension of [8] is support for model checking via the Bogor toolset. The RCC toolset is an annotation-based extension to Java designed specifically to support race condition detection [10]. Additionally, an extension of Spec# to multi-threaded programs has been developed [11]; the annotation mechanisms in this extension are very strong, in the sense that they provide exclusive access to objects (making an object local to a thread), which may reduce concurrency.

The JCSP approach [12] supports a different model of concurrency for Java, based on the process algebra CSP. JCSP defines a Java API and a set of library classes for CSP primitives and, unlike JSCOOP and JAC, does not make use of annotations. Polyphonic C# is an annotated version of C# that supports synchronous and asynchronous methods [13]. Polyphonic C# makes use of a set of private messages to underpin its communication mechanisms. It is based on a sound theory (the join calculus), and is now integrated in the Cω toolset from Microsoft Research.
Morales [14] presents the design of a prototype of SCOOP's separate annotation for Java; however, preconditions and general design-by-contract were not considered, nor was support for type safety and well-formedness. More generally, a modern abstract programming framework for concurrent or parallel
Figure 6. Snapshot of the Eclipse plug-in.
programming is Cilk [15]. Cilk works by requiring the programmer to specify the parts of the program that can be executed safely and concurrently; the scheduler then decides how to allocate work to (physical) processors. Cilk is also based, in principle, on the idea of annotation, this time of the C programming language. There are a number of basic annotations, including mechanisms to annotate a procedure call so that it can (but need not) operate in parallel with other executing code. Cilk is not yet object-oriented, nor does it provide design-by-contract mechanisms (though recent work has examined extending Cilk to C++). It has a powerful execution scheduler and run-time, and recent work focuses on minimising and eliminating data race problems.

Recent refinements to SCOOP, its semantics, and its supporting tools have been reported.
Ostroff et al. [16] describe how to use contracts for both concurrent programming and rich formal verification in the context of SCOOP for Eiffel, via a virtual machine, thus making it feasible to use model checking for property verification. Nienaltowski [17] presents a refined access control policy for SCOOP. Nienaltowski also presents [4] a proof technique for concurrent SCOOP programs, derived from proof techniques for sequential programs. A semantics for SCOOP using the process algebra CSP is provided in [18]. Brooke [19] presents an alternative concurrency model with similarities to SCOOP that theoretically increases parallelism. Some of the open research questions regarding SCOOP are addressed in [20]. Our approach allows us to take advantage of these recent SCOOP developments and re-use them in Java and C# settings.

7. Conclusion and Future Work

In this paper, we described a design pattern for SCOOP which enables us to transfer the concepts and semantics of SCOOP from its original instantiation in Eiffel to other object-oriented programming languages such as Java and C#. We have instantiated this pattern in an Eclipse plug-in, called JSCOOP, for handling Java. Our initial experience with JSCOOP has been very positive, allowing us to achieve clean and efficient handling of concurrency independently from the rest of the program.

The work reported in this paper should be extended in a variety of directions. First of all, we have not shown correctness of our translation. While a complete proof would show that the translation is correct for every program, we can get partial validation of correctness by checking the translation on some programs. For example, Chapter 10 of [3] includes many examples of working SCOOP code. We intend to rewrite these programs in JSCOOP and check that they have the same behaviour as (i.e., are bisimilar to) programs written directly in Java.
These experiments will also enable us to check the efficiency (e.g., time to translate) and the effectiveness (size and performance of the resulting Java code) of our Eclipse plug-in implementation. We also need to extend the design pattern and add features to our implementation. Full design-by-contract includes postconditions and class invariants, and we intend to extend our design pattern to allow for these. Further, we need to implement support for inheritance and the full lock-passing mechanism of SCOOP [17]. Support for C# also remains future work.

SCOOP is free of atomicity violations and race conditions by construction, but it is still prone to user-introduced deadlocks (e.g., an await condition that is never satisfied). Since JSCOOP programs are annotated Java programs, we can re-use existing model checkers and theorem provers to reason about them. For example, we intend to verify the executable code produced by the Eclipse plug-in for the presence of deadlocks using Java PathFinder [21].

Acknowledgements

We would like to thank Kevin J. Doyle, Jenna Lau and Cameron Gorrie for their contributions to the JSCOOP project. This work was conducted under a grant from the Natural Sciences and Engineering Research Council of Canada.

References

[1] Brian Goetz, Tim Peierls, Joshua Bloch, Joseph Bowbeer, David Holmes, and Doug Lea. Java Concurrency in Practice. Addison-Wesley, May 2006. ISBN-10: 0321349601, ISBN-13: 978-0321349606.
[2] Bertrand Meyer. Object-Oriented Software Construction. Prentice Hall, 1997.
[3] Piotr Nienaltowski. Practical Framework for Contract-Based Concurrent Object-Oriented Programming. PhD thesis 17031, Department of Computer Science, ETH Zurich, 2007.
[4] Piotr Nienaltowski, Bertrand Meyer, and Jonathan Ostroff. Contracts for concurrency. Formal Aspects of Computing, 21(4):305–318, 2009.
[5] Stephen J. Hartley. Concurrent Programming: The Java Programming Language. Oxford University Press, Inc., New York, NY, USA, 1998.
[6] Klaus-Peter Lohr and Max Haustein. The JAC system: Minimizing the differences between concurrent and sequential Java code. Journal of Object Technology, 5(7), 2006.
[7] Oscar Nierstrasz. Regular types for active objects. In OOPSLA, pages 1–15, 1993.
[8] Edwin Rodríguez, Matthew B. Dwyer, Cormac Flanagan, John Hatcliff, Gary T. Leavens, and Robby. Extending JML for modular specification and verification of multi-threaded programs. In ECOOP, pages 551–576, 2005.
[9] Wladimir Araujo, Lionel Briand, and Yvan Labiche. Concurrent contracts for Java in JML. In ISSRE '08: Proceedings of the 19th International Symposium on Software Reliability Engineering, pages 37–46, Washington, DC, USA, 2008. IEEE Computer Society.
[10] Cormac Flanagan and Stephen Freund. Detecting race conditions in large programs. In PASTE '01: Proceedings of the 2001 ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering, pages 90–96, New York, NY, USA, 2001. ACM.
[11] Bart Jacobs, Rustan Leino, and Wolfram Schulte. Verification of multithreaded object-oriented programs with invariants. In Proc. Workshop on Specification and Verification of Component Based Systems. ACM, 2004.
[12] Peter H. Welch, Neil Brown, James Moores, Kevin Chalmers, and Bernhard H. C. Sputh. Integrating and extending JCSP. In CPA, pages 349–370, 2007.
[13] Nick Benton, Luca Cardelli, and Cédric Fournet. Modern concurrency abstractions for C#. ACM Trans. Program. Lang. Syst., 26(5):769–804, 2004.
[14] Francisco Morales. Eiffel-like separate classes. Java Developer Journal, 2000.
[15] Robert Blumofe, Christopher Joerg, Bradley Kuszmaul, Charles Leiserson, Keith Randall, and Yuli Zhou. Cilk: An efficient multithreaded runtime system. In Proc. Principles and Practice of Parallel Programming. ACM Press, 1995.
[16] Jonathan S. Ostroff, Faraz Ahmadi Torshizi, Hai Feng Huang, and Bernd Schoeller. Beyond contracts for concurrency. Formal Aspects of Computing, 21(4):319–346, 2009.
[17] Piotr Nienaltowski. Flexible access control policy for SCOOP. Formal Aspects of Computing, 21(4):347–362, 2009.
[18] Phillip J. Brooke, Richard F. Paige, and Jeremy L. Jacob. A CSP model of Eiffel's SCOOP. Formal Aspects of Computing, 19(4):487–512, 2007.
[19] Phillip J. Brooke and Richard F. Paige. Cameo: An alternative concurrency model for Eiffel. Formal Aspects of Computing, 21(4):363–391, 2009.
[20] Richard F. Paige and Phillip J. Brooke. A critique of SCOOP. In First International Symposium on Concurrency, Real-Time, and Distribution in Eiffel-like Languages (CORDIE), 2006.
[21] Klaus Havelund and Tom Pressburger. Model checking Java programs using Java PathFinder. Software Tools for Technology Transfer (STTT), 2(4):72–84, 2000.
Appendix
JSCOOP
...
public void act() {
    ...
    eat(leftFork, rightFork);
    ...
}

Java
...
public void act() {
    Semaphore j_call_semaphore = new Semaphore(0);
    JSCOOP_Call j_call;
    JSCOOP_Call[] j_wait_calls;
    ...
    // eat(leftFork, rightFork);
    j_req_locks = new LinkedList<JSCOOP_Processor>();
    j_req_locks.add(leftFork.getProcessor());
    j_req_locks.add(rightFork.getProcessor());
    j_lock_request = new JSCOOP_LockRequest(this, j_req_locks,
        this.getProcessor().getLockSemaphore());
    j_arg_types = new Class[2];
    j_arg_types[0] = leftFork.getClass();
    j_arg_types[1] = rightFork.getClass();
    j_args = new Object[2];
    j_args[0] = leftFork;
    j_args[1] = rightFork;
    j_call = new JSCOOP_Call("eat", j_arg_types, j_args, null,
        j_lock_request, this, j_call_semaphore);
    getProcessor().addLocalCall(j_call);
    j_call_semaphore.acquire();
    eat(leftFork, rightFork);
    scheduler.releaseLocks(j_call);
    ...
}
Figure 7. Mapping from Philosopher to JSCOOP Philosopher: calling eat method.
JSCOOP
@await(pre="!l.isInUse()&&!r.isInUse()")
public void eat(@separate Fork l, @separate Fork r) {
    l.pickUp();
    r.pickUp();
    if (l.isInUse() && r.isInUse()) {
        status = 2;
    } ...
} ...

Java
public boolean checkPreconditions() {
    ...
    if (method_name.equals("eat")) {
        ...
        if ((((JSCOOP_Fork) args[0]).isInUse() == false)
                && (((JSCOOP_Fork) args[1]).isInUse() == false))
            return true;
        else
            return false;
    } else if (method_name.equals(...)) {
        ...
    }
}

public void eat(JSCOOP_Fork l, JSCOOP_Fork r) {
    JSCOOP_Call[] j_wait_calls;
    Semaphore j_call_semaphore = new Semaphore(0);
    JSCOOP_Call j_call;
    // l.pickUp();
    j_req_locks = new LinkedList<JSCOOP_Processor>();
    j_lock_request = new JSCOOP_LockRequest(l, j_req_locks,
        l.getProcessor().getLockSemaphore());
    j_arg_types = null;
    j_args = null;
    j_call = new JSCOOP_Call("pickUp", j_arg_types, j_args, null,
        j_lock_request, l, null);
    l.getProcessor().addRemoteCall(j_call);
    // r.pickUp();
    ...
    // if (l.isInUse() && r.isInUse())
    j_wait_calls = new JSCOOP_Call[2];
    j_req_locks = new LinkedList<JSCOOP_Processor>();
    j_lock_request = new JSCOOP_LockRequest(l, j_req_locks,
        l.getProcessor().getLockSemaphore());
    j_arg_types = null;
    j_args = null;
    j_wait_calls[0] = new JSCOOP_Call("isInUse", j_arg_types, j_args,
        Boolean.class, j_lock_request, l, j_call_semaphore);
    j_req_locks = new LinkedList<JSCOOP_Processor>();
    j_lock_request = new JSCOOP_LockRequest(r, j_req_locks,
        r.getProcessor().getLockSemaphore());
    j_arg_types = null;
    j_args = null;
    j_wait_calls[1] = new JSCOOP_Call("isInUse", j_arg_types, j_args,
        Boolean.class, j_lock_request, r, j_call_semaphore);
    l.getProcessor().addRemoteCall(j_wait_calls[0]);
    r.getProcessor().addRemoteCall(j_wait_calls[1]);
    // (wait-by-necessity)
    j_call_semaphore.acquire(2);
    // Execute the if statement with returned values
    if ((Boolean) j_wait_calls[0].getReturnValue()
            && (Boolean) j_wait_calls[1].getReturnValue())
        status = 2;
    ...
}
Figure 8. Mapping from Philosopher to JSCOOP Philosopher.
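In Figure 8, the caller issues two asynchronous isInUse queries and then waits once for both results with acquire(2) on a shared semaphore. That multi-result wait can be shown stand-alone; the class `TwoResultWait` and its parameters below are invented for illustration.

```java
import java.util.concurrent.Semaphore;

// Two asynchronous "queries" share one semaphore; the client blocks until
// both supplier threads have released it (wait-by-necessity for two results
// at once, as in j_call_semaphore.acquire(2) above).
public class TwoResultWait {
    public static boolean bothFree(boolean leftInUse, boolean rightInUse)
            throws InterruptedException {
        Semaphore done = new Semaphore(0);
        boolean[] results = new boolean[2];
        // each "supplier" computes its answer and signals completion
        new Thread(() -> { results[0] = leftInUse;  done.release(); }).start();
        new Thread(() -> { results[1] = rightInUse; done.release(); }).start();
        done.acquire(2); // wait for both answers before reading them
        return !results[0] && !results[1];
    }
}
```

The release/acquire pairing also makes the suppliers' writes visible to the client, so reading the results array after acquire(2) is safe.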
[Figure 9 is a UML class diagram, not reproduced here. It shows the core library: a ScoopThread interface (extending Java's Runnable) and classes Processor (locked_by, local_call_stack, remote_call_queue, sleep_semaphore and lock_semaphore; run, lockProcessor, unlockProcessor, addLocalCall, addRemoteCall, invokeCall), Scheduler (lock_requests, all_processors, locked_processors, sleep_semaphore; run, createProcessor, removeProcessor, releaseLocks, addRequest), Call (method_name, arg_types, return_value, processor, object, call_semaphore, lock_request; getMethodName, getObject, getObjectProcessor, getCallSemaphore, getLockRequest, setProcessor, getProcessor, checkPreconditions, setCall, getCall) and LockRequest (semaphore, requester, locks; getLocks, getSemaphore, getRequester). The JSCOOP (Java) and CSCOOP (C#, System.Threading) packages import this design as the JSCOOP_* and CS_* classes, instantiated by the JSCOOP_Philosopher and CS_Philosopher examples (eat, run, think, live) with left_fork and right_fork of type JSCOOP_Fork.]
Figure 9. Core library classes.
Communicating Process Architectures 2009 P.H. Welch et al. (Eds.) IOS Press, 2009 © 2009 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-60750-065-0-29
Combining Partial Order Reduction with Bounded Model Checking

José VANDER MEULEN and Charles PECHEUR
Université catholique de Louvain, Louvain-la-Neuve, Belgium
{jose.vandermeulen, charles.pecheur}@uclouvain.be

Abstract. Model checking is an efficient technique for verifying properties of reactive systems. Partial-order reduction (POR) and symbolic model checking are two common approaches to deal with the state-space explosion problem in model checking. Traditionally, symbolic model checking uses BDDs, which can suffer from space blowup. More recently, bounded model checking (BMC) using SAT-based procedures has been used as a very successful alternative to BDDs. However, this approach gives poor results when it is applied to models with a lot of asynchronism. This paper presents an algorithm which combines partial-order reduction methods and bounded model checking techniques in an original way that allows efficient verification of temporal logic properties (LTL_X) on models featuring asynchronous processes. The encoding to a SAT problem strongly reduces the complexity and non-determinism of each transition step, allowing efficient analysis even with longer execution traces. The starting point of our work is the Two-Phase algorithm (Nalumasu and Gopalakrishnan), which performs partial-order reduction on process-based models. First, we adapt this algorithm to the bounded model checking method. Then, we describe our approach formally and demonstrate its validity. Finally, we present a prototype implementation and report encouraging experimental results on a small example.
Introduction

Model checking is a technique used to verify concurrent systems such as distributed applications and communication protocols. It has a number of advantages. In particular, model checking is automatic and usually quite fast. Also, if the design contains an error, model checking will produce a counterexample that can be used to locate the source of the error [1].

In the 1980s, several researchers introduced very efficient temporal logic model checking algorithms. McMillan achieved a breakthrough with the use of symbolic representations based on Ordered Binary Decision Diagrams (BDDs) [2]. By using symbolic model checking algorithms, it is possible to verify systems with a very large number of states [3]. Nevertheless, the size of the BDD structures themselves can become unmanageable for large systems. Bounded Model Checking (BMC) uses SAT solvers instead of BDDs to search for errors on bounded execution paths [4]. BMC offers the advantage of polynomial space complexity and has proven to provide competitive execution times in practice.

A common approach to verifying a concurrent system is to compute the product finite-space description of the processes involved. Unfortunately, the size of this product is frequently prohibitive due, among other causes, to the modelling of concurrency by interleaving. The aim of partial-order reduction (POR) techniques is to reduce the number of interleaving sequences that must be considered. When a specification cannot distinguish between two interleaving sequences that differ only by the order in which concurrently executed events are taken, it is sufficient to analyse one of them [5].
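The interleaving blow-up that POR attacks is easy to quantify: two independent processes with n and m atomic steps admit (n+m)! / (n! m!) interleavings, all equivalent when the property cannot tell them apart. A small illustration (the class name is ours, not from the paper):

```java
// Number of interleavings of two independent step sequences of lengths n and
// m: the binomial coefficient C(n+m, n). A POR method would explore a single
// representative instead of all of them.
public class Interleavings {
    public static long count(int n, int m) {
        long c = 1;
        // incremental binomial: after step i, c == C(m+i, i), so the
        // division by i is always exact
        for (int i = 1; i <= n; i++) {
            c = c * (m + i) / i;
        }
        return c;
    }

    public static void main(String[] args) {
        System.out.println(count(10, 10)); // 184756 schedules for 10+10 steps
    }
}
```

Even two tiny processes of ten steps each already yield 184756 equivalent schedules, which is why exploring one representative per equivalence class pays off.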
30 J. Vander Meulen and C. Pecheur / Combining Partial Order Reduction with Bounded Model Checking
This paper presents a technique which combines the BMC and POR methods for verifying linear temporal logic properties. We start from the Two-Phase algorithm (TP) of Nalumasu and Gopalakrishnan [6]. We merge a variant of TP with the BMC procedure. This allows the verification of models featuring asynchronous processes. Intuitively, from a model and a property, the BMC method constructs a propositional formula which represents a finite unfolding of the transition relation and the property. Our method proceeds in the same way, but instead of using the entire transition relation during the unfolding of the model, we only use a safe subset based on POR considerations. This produces a propositional formula which is well suited for most modern SAT solvers. In contrast, our previous work introduced an algorithm which merges POR and BDD-based model checking to verify branching temporal logic properties [7].

To assess the validity of our approach, we start by introducing two methods which can be combined for transforming computation trees. The POR method captures partial-order reduction criteria [8,5,1]. The idle-extension shows how a finite number of transitions can be added while also preserving temporal logic properties. Then, the Stuttering Bounded Two-Phase (SBTP) reduction is introduced as a particular instance of a combination of these two methods, inspired by TP. Finally, we present how a finite unfolding of SBTP is encoded as a propositional formula suitable for BMC.

The remainder of the paper is structured as follows. Section 1 recalls some background concepts, definitions and notations that are used throughout the paper: bounded model checking, bisimulations and POR. In Section 2, two transformations of computation trees which preserve CTL*_X properties are presented, as well as the SBTP algorithm and its transformation to a BMC problem. Section 3 presents the extension of our prototype implementing the BMC of SBTP method.
In Section 4, we present the results obtained by applying our method on a case study. Section 5 reviews related work. Finally, Section 6 gives conclusions as well as directions for future work.

1. Background

1.1. Transition Systems

A transition system is a particular class of state machine that represents the behaviour of a system. A state of the transition system is a snapshot of the system at a particular time; formally, each state is labelled with atomic propositions. The actions performed by the system are modelled by means of transitions between states. Formally, each transition carries a label which represents the performed action [1]¹. In the rest of this paper, we assume a set AP of atomic propositions and a set A of transitions. Without loss of generality, the set AP can be restricted to the propositions that appear in the property to be verified on the system.

Definition 1 (Transition System). Given a set of transitions A and a set of atomic propositions AP, a transition system (over A and AP) is a structure M = (S, T, s0, L) where S is a finite set of states, s0 ∈ S is an initial state², T ⊆ S × A × S is a transition relation and L : S → 2^AP is an interpretation function over states.
We write s −α→ s′ for (s, α, s′) ∈ T. A transition α is enabled in a state s iff there is a state s′ such that s −α→ s′. We write enabled(s, T) for the set of enabled transitions of T in s. When the context is clear, we write enabled(s) instead of enabled(s, T). We assume that T

¹ Our treatment differs slightly from [1], which views T as a set of (unlabelled) transition relations α ⊆ S × S. Using labelled transitions amounts to the same structures and is mathematically cleaner to define.
² For simplicity, s0 is a single state. All the arguments of this paper can easily be generalized to many initial states (i.e. S0 ⊆ S).
is total (i.e. enabled(s) ≠ ∅ for all s ∈ S). A transition α is deterministic in a state s iff there is at most one s′ such that s −α→ s′.

A transition α ∈ A is invisible if for each pair of states s, s′ ∈ S such that s −α→ s′, L(s) = L(s′). A transition is visible if it is not invisible. An execution path of M is an infinite sequence of consecutive transition steps s0 −a0→ s1 −a1→ s2 −a2→ ···.

A computation tree can be built from a transition system M: s0 ∈ S is the root of a tree that unwinds all the possible executions from that initial state [1]. The computation tree of M, written CT(M), is itself a transition system and is essentially equivalent to M, in a sense that will be made precise below.

1.2. Model Checking

This section briefly introduces model checking; for more details, we refer the reader to [1]. Model checking is an automatic technique to verify that a concurrent system, such as a distributed application or a communication protocol, satisfies a given property. Intuitively, the system is modeled as a finite transition system, and model checking performs an exhaustive exploration of the resulting state graph to fulfill the verification. If the system violates the property, model checking will generate a counterexample which helps to locate the source of the error.

A common approach to verifying a concurrent system is to compute the combined finite-space description of the processes involved. Unfortunately, the size of this combination can grow exponentially, due to all the different interleavings among the executions of all the processes. Partial Order Reduction (POR) techniques reduce the number of interleaving sequences that must be considered. When a specification cannot distinguish between two interleaving sequences that differ only by the order in which concurrently executed events are taken, it is sufficient to analyse one of them [5]. Temporal logic is used to express properties to be verified.
In addition to the elements of propositional logic, temporal logic provides temporal operators for reasoning over different steps of the execution. There are several temporal logics, such as linear temporal logic (LTL), computation tree logic (CTL), and CTL*, which subsumes both LTL and CTL. For instance, LTL formulæ are interpreted over each execution path of the model. In LTL, Gϕ (globally ϕ) says that ϕ holds in all future states; Fϕ (finally ϕ) says that ϕ holds in some future state; ϕ U ψ (ϕ until ψ) says that ψ holds in some future state and ϕ holds in every preceding state; and Xϕ (next ϕ) says that ϕ is true in the next state. In this paper we consider LTL_X, the fragment of LTL without the X operator. Similarly, CTL*_X (resp. CTL_X) is the fragment of CTL* (resp. CTL) without the X operator.

By using temporal logic model checking algorithms, we can check automatically whether a given system, modeled as a transition system, satisfies a given temporal logic property. In the 1980s, very efficient temporal logic model checking algorithms were introduced for these logics [9,10,11,12]. For instance, to check whether a system M satisfies an LTL property ϕ, the algorithm presented in [10] constructs an automaton B over infinite words, called a Büchi automaton, from the negation of ϕ [13]. It then searches for violations of ϕ by checking the executions of the state graph which results from the combination of M and B.

1.3. Bounded Model Checking
The idea of BMC is to characterize an error
32 J. Vander Meulen and C. Pecheur / Combining Partial Order Reduction with Bounded Model Checking
execution path of length k as a propositional formula, and to search for solutions to that formula with a SAT solver. This formula is obtained by combining a finite unfolding of the system's transition relation with an unfolding of the negation of the property being verified. The latter is obtained on the basis of expansion equivalences such as p U q ≡ q ∨ (p ∧ X(p U q)), which allow us to propagate across successive states the constraints corresponding to the violation of the LTL property. If no counterexample is found, k is incremented and a new execution path is searched for. This process continues until a counterexample is found or some limit is reached. BMC allows checking LTL properties of a system. Since BMC works on finite paths, an approximate bounded semantics of LTL is defined. Intuitively, the bounded semantics treats differently paths with a back-loop (c.f. Figure 1(a)) and paths without such a back-loop (c.f. Figure 1(b)). The former can be seen as an infinite path formed by a finite number of states; in this case the classical semantics of LTL can be applied. In contrast, the latter is a finite prefix of an infinite path. In some cases, such a prefix π is sufficient to show that a path violates a property f. For instance, let f be the property Gp. If π contains a state which does not satisfy p, then all paths which start with the prefix π violate Gp.
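The iterative-deepening loop described above can be made concrete with a toy sketch. A real BMC tool encodes the bounded search as a propositional formula handed to a SAT solver; the explicit path enumeration below, and the example system, are purely illustrative.

```python
# BMC-style search for a counterexample to the safety property G p:
# look for a length-k prefix whose last state violates p, incrementing
# k until a counterexample is found or a limit is reached.

def bmc_refute_Gp(succ, s0, p, limit):
    """Return a finite prefix refuting G p, or None within the bound."""
    frontier = [[s0]]
    for k in range(limit + 1):
        for path in frontier:
            if not p(path[-1]):        # the prefix suffices to refute G p
                return path
        # unfold the transition relation one more step
        frontier = [path + [t] for path in frontier for t in succ[path[-1]]]
    return None

# Toy system: states 0..3, p holds everywhere except in state 3.
succ = {0: [1, 2], 1: [0], 2: [3], 3: [3]}
cex = bmc_refute_Gp(succ, 0, lambda s: s != 3, limit=5)  # -> [0, 2, 3]
```

The returned prefix plays the role of the satisfying assignment a SAT solver would produce for the unfolded formula.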
Figure 1. The two cases for a bounded path [4].
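The two cases of Figure 1 can be illustrated by representing a bounded path as a list of states plus, for case (a), the index the back-loop returns to. The encoding below is a sketch of the bounded semantics, not the paper's formalization.

```python
# Case (b): loop-free prefix. Case (a): lasso with a back-loop.

def prefix_refutes_Gp(states, p):
    """Case (b): a loop-free prefix refutes G p as soon as one state
    falsifies p, since every extension of the prefix visits that state."""
    return any(not p(s) for s in states)

def lasso_satisfies_GFp(states, loop, p):
    """Case (a): a back-loop to index `loop` yields a genuine infinite
    path, so classical LTL semantics applies; G F p holds iff some state
    inside the loop satisfies p, because only loop states recur forever."""
    return any(p(s) for s in states[loop:])
```

The contrast between the two functions is the point: a prefix can only refute Gp, while a lasso supports a full verdict for properties such as GFp.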
The propositional formula [[M, ¬f]]_k which is submitted to the SAT solver is constructed as follows, where f is the LTL property to be verified.

Definition 2 (BMC encoding). Given a transition system M = (S, T, s0, L), an LTL formula f, and a bound k ∈ N:

[[M, ¬f]]_k = [[M]]_k ∧ ( (¬L_k ∧ [[¬f]]) ∨ ⋁_{l=0}^{k} ( _lL_k ∧ _l[[¬f]] ) ) where
• [[M]]_k is a propositional formula which represents the unfolding of k steps of the transition relation,
• _lL_k is a propositional formula which is true iff there is a transition from s_k to s_l,
• L_k is a propositional formula which is true iff there exists an l such that _lL_k,
• [[¬f]] is a propositional formula which is the translation of ¬f when [[M]]_k does not contain any back-loop, and
• _l[[¬f]] is a propositional formula which is the translation of ¬f when [[M]]_k does contain a back-loop to state s_l.

It is shown in [4] that if M ⊭ f then there is a k ≥ 0 such that [[M, ¬f]]_k is satisfiable. Conversely, if [[M, ¬f]]_k has no solution for any k, then M ⊨ f.³ Given a propositional formula p produced by the BMC encoding, a SAT solver decides whether p is satisfiable or not. If it is, a satisfying assignment is given that describes the path violating the property. Most SAT solvers apply a variant of the Davis–Putnam–Logemann–Loveland (DPLL) algorithm [15]. Intuitively, DPLL alternates between two phases. The first one chooses a value for some variable. The second one propagates the implications of

³ Actually, [4] shows that it is sufficient to look for bounded solutions of [[M, ¬f]]_k up to a bound k ≤ K which depends on f and M.
this decision that are easy to infer. This method is known as unit propagation. The algorithm backtracks when a conflict is reached.

1.4. Bisimulation Relations

A bisimulation is a binary relation between two transition systems M and M′. Intuitively, a bisimulation can be constructed between two systems if one can simulate the other and vice versa. For instance, bisimulation techniques are used in model checking to reduce the number of states of M while preserving some kinds of properties (e.g. LTL_X, CTL_X, ...). The literature proposes a large number of variants of bisimulation relations [16,8]. This section describes the two kinds of bisimulation relations used in the sequel.

1.4.1. Bisimulation Equivalence

Bisimulation equivalence is the classical notion [16], here adapted to transition systems by requiring identical state labellings, which ensures that CTL∗ properties are preserved [1]. Intuitively, bisimulation equivalence groups states that are impossible to distinguish, in the sense that both have the same labelling and offer the same transitions leading to equivalent states.

Definition 3 (Bisimulation Equivalence). Let M = (S, T, s0, L) and M′ = (S′, T′, s′0, L′) be two structures with the same set of atomic propositions AP. A relation B ⊆ S × S′ is a bisimulation relation between M and M′ if and only if for all s ∈ S and s′ ∈ S′, if B(s, s′) then the following conditions hold:

• L(s) = L′(s′).
• For every state s1 ∈ S such that s −α→ s1 there is an s′1 ∈ S′ such that s′ −α→ s′1 and B(s1, s′1).
• For every state s′1 ∈ S′ such that s′ −α→ s′1 there is an s1 ∈ S such that s −α→ s1 and B(s1, s′1).

M and M′ are bisimulation-equivalent iff there exists a bisimulation relation B such that B(s0, s′0). In [1] it is shown that unwinding a structure results in a bisimulation-equivalent structure. So we conclude that the computation tree generated from a model M is bisimulation-equivalent to M.
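For finite structures, the conditions of Definition 3 can be checked by a naive greatest-fixpoint computation: start from all pairs with equal labels and remove pairs until both transfer conditions hold. The encoding below (structures as triples of initial state, transition triples, and labelling) is illustrative only.

```python
# Naive fixpoint computation of bisimulation equivalence (Definition 3).

def bisimilar(m1, m2):
    s0, t1, l1 = m1
    s0p, t2, l2 = m2
    # start from all label-compatible pairs, then refine
    b = {(s, sp) for s in l1 for sp in l2 if l1[s] == l2[sp]}
    changed = True
    while changed:
        changed = False
        for (s, sp) in list(b):
            # every a-move of s matched by an a-move of sp into b, and back
            forth = all(any((t, tp) in b for (x, c, tp) in t2
                            if x == sp and c == a)
                        for (y, a, t) in t1 if y == s)
            back = all(any((t, tp) in b for (y, c, t) in t1
                           if y == s and c == a)
                       for (x, a, tp) in t2 if x == sp)
            if not (forth and back):
                b.discard((s, sp))
                changed = True
    return (s0, s0p) in b

# A two-state loop and a partial unwinding of it are bisimulation-equivalent.
m = (0, {(0, "a", 1), (1, "a", 0)}, {0: "p", 1: "q"})
mu = (0, {(0, "a", 1), (1, "a", 2), (2, "a", 1)}, {0: "p", 1: "q", 2: "p"})
```

This mirrors the unwinding remark above: `bisimilar(m, mu)` holds, so properties are preserved between a structure and its unwinding.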
Furthermore, bisimulation equivalence preserves CTL∗ properties, as shown in [1]. Figure 2 (a) and Figure 2 (b) are bisimulation-equivalent: each dashed oval of Figure 2 (b) groups states that are related to a state of Figure 2 (a) (e.g. B(1, 3)). On the other hand, Figure 2 (a) and Figure 2 (c) are not bisimulation-equivalent, because node 7 in Figure 2 (c) does not correspond to any state in Figure 2 (a).

1.4.2. The Visible Bisimulation

Visible bisimulation is a weaker equivalence that only preserves CTL∗_X properties, and thus also CTL_X and LTL_X properties. Our POR methods preserve visible bisimilarity and therefore those logics. Intuitively, a visible bisimulation associates two states s and t that are impossible to distinguish, in the sense that if a visible action a is reachable in the future from s, then a also belongs to t's future.

Definition 4 (Visible Bisimulation [8]). A relation B ⊆ S × S′ is a visible simulation between two structures M = (S, T, s0, L) and M′ = (S′, T′, s′0, L′) iff B(s0, s′0) and for every s ∈ S, s′ ∈ S′ such that B(s, s′), the following conditions hold:

1. L(s) = L′(s′)
Figure 2. Bisimilar and nonbisimilar structures.
2. Let s −a→ t. There are two cases:
• a is invisible and B(t, s′), or
• there exists a path s′ −c0→ s′1 −c1→ ··· −c_{n−1}→ s′n −a′→ t′ in M′, such that B(s, s′i) for 0 < i ≤ n, B(t, t′), and ci is invisible for 0 ≤ i < n. Furthermore, if a is visible, then a′ = a; otherwise, a′ is invisible.
3. If there is an infinite path s = s0 −a0→ s1 −a1→ ··· in M, where all ai are invisible and B(si, s′) for i ≥ 0, then there exists a transition s′ −c→ t′ such that c is invisible and, for some j > 0, B(sj, t′).
B is a visible bisimulation iff both B and B⁻¹ are visible simulations. M and M′ are visibly-bisimilar iff there is a visible bisimulation B.

Figure 3 (a) and Figure 3 (b) are visibly-bisimilar. To see this, we construct the relation which puts together the states of Figure 3 (a) and the states of Figure 3 (b) that are linked by a dashed line. The actions a and b can be executed in either order leading to the same result, from the standpoint of verification. Figure 3 (a) and Figure 3 (c) are not visibly-bisimilar: node 12 in Figure 3 (c) does not correspond to any state in Figure 3 (a).
Figure 3. Visibly-bisimilar and not visibly-bisimilar structures.
There also exists a weaker bisimulation, called stuttering bisimulation. The POR literature generally relies on the notion of stuttering bisimulation to reason about POR. In [8], it is shown that a visible bisimulation is also a stuttering bisimulation, and thus also preserves CTL∗_X properties. In general, it is easier to reason about visible bisimulation than about stuttering bisimulation, because the former involves an argument about states while the latter
involves an argument about infinite paths. Actually, the POR method applied in the sequel produces, from a model M, a reduced graph which is visibly-bisimilar to M.

1.5. Partial-Order Reduction

The goal of partial-order reduction (POR) methods is to reduce the number of states explored by model checking, by avoiding the exploration of different but equivalent interleavings of concurrent events [8,5,1]. Naturally, these methods are best suited for strongly asynchronous programs. The interleavings that must be preserved may depend on the property to be checked. Partial-order reduction is based on the notions of independence between transitions and invisibility of a transition. Two transitions are independent if they do not disable one another and executing them in either order results in the same state. Intuitively, if two independent transitions α and β are invisible w.r.t. the property f that one wants to verify, then it does not matter whether α is executed before or after β, because they lead to the same state and do not affect the truth of f. Partial-order reduction consists in identifying such situations and restricting the exploration to one of the two alternatives. In effect, POR amounts to exploring a reduced model M′ = (S′, T′, s0, L) with S′ ⊆ S and T′ ⊆ T. In practice, classical POR algorithms [5,1] execute a modified depth-first search (DFS). At each state s, an adequate subset ample(s) of the transitions enabled in s is explored. To ensure that this reduction is adequate, that is, that verification results on the reduced model hold for the full model, ample(s) has to respect a set of conditions, based on the independence and invisibility notions previously defined. In some cases, all enabled transitions have to be explored. The following conditions are set forth in [1,8]:

C0 ample(s) = ∅ if and only if enabled(s) = ∅.
C1 Along every path in the full state graph that starts at s, a transition that is dependent on a transition in ample(s) cannot be executed without a transition in ample(s) occurring first.
C2 If ample(s) ≠ enabled(s), then all transitions in ample(s) are invisible.
C3 A cycle is not allowed if it contains a state in which some transition α is enabled, but is never included in ample(s) for any state s on the cycle.

On finite models, conditions C0, C1, C2 and C3 are sufficient to guarantee that the reduced model preserves properties expressed in LTL_X. On infinite models (such as computation trees), condition C3 must be rephrased as the following condition, which intuitively states that all the transitions in enabled(s) will eventually be expanded.

C3b An infinite path is not allowed if it contains a state in which some transition α is enabled, but is never included in ample(s) for any state s on the path.

To demonstrate that C0, C1, C2 and C3b preserve LTL_X properties, an argument similar to the one presented in [1] can be used. The only difference is the method applied for demonstrating Lemma 28 of [1], which can be proved using condition C3b instead of condition C3. Ensuring preservation of branching temporal logics requires an additional constraint which is significantly more restrictive [8]:

C4 If ample(s) ≠ enabled(s), then ample(s) contains only one transition, which is deterministic in s.

When conditions C0 to C4 are satisfied, [8] shows that there is a visible bisimulation between the complete and reduced models, which ensures preservation of CTL∗_X properties (and thus CTL_X and LTL_X). The same argument can be used to demonstrate that there is
also a visible bisimulation between the full and the reduced state graphs, when conditions C0, C1, C2, C3b and C4 are satisfied. Conditions C1 and C2 depend on the whole state graph and are not directly exploitable in a verification algorithm. Instead, one uses sufficient conditions, typically derived from the structure of the model description, to safely decide where reduction can be performed.

1.6. Process Model

In the sequel, we assume a process-oriented modeling language. We define a process model as a refinement of a transition system:

Definition 5 (Process Model). Given a transition system M = (S, T, s0, L), a process model consists of a finite set P of m processes p0, p1, ..., pm−1. For each pi, we define safe deterministic actions Ai ⊆ A and safe deterministic transitions Ti = T ∩ (S × Ai × S) such that for all a ∈ Ai, a is invisible, and for all s ∈ S: ample(s) = enable(s, Ti) = {a} satisfies conditions C1 and C4. All Ti contain only safe deterministic transitions. Given a state s and a Ti, s is safe deterministic w.r.t. Ti if and only if enable(s, Ti) ≠ ∅.

For instance, consider a concurrent program S composed of a finite number m of threads. Each thread has exclusive access to some local variables, as well as to some global variables that all threads can read or write. This program can be translated into a process model M. The translation procedure may translate each thread of S into a process pi. In particular, a (deterministic) instruction of pi that affects only a local variable x (e.g. x = 3) will meet the conditions of Definition 5 and can be modelled as a safe deterministic action a_{x=3} ∈ Ai. Indeed, every transition s −a_{x=3}→ t resulting from the execution of that instruction will be safe deterministic. Thus, ample(s) = enable(s, Ti) = {a_{x=3}} is a valid ample set for POR.

1.7. The Two-Phase Approach to Partial Order Reduction

This section presents the Two-Phase algorithm (TP), which was first introduced in [6].
Starting from a model M, it generates a reduced model M′ which is visibly-bisimilar to M. It is a variant of the classical DFS algorithm with POR [5,1]. It alternates between two distinct phases:

• Phase-1 only expands safe deterministic transitions, considering each process in turn, in a fixed order. As long as a process is deterministic, the single transition that is enabled for that process is executed. Otherwise, the algorithm moves on to the next process. After expanding all processes, the last reached state is passed on to Phase-2.
• Phase-2 performs a full expansion of the state resulting from Phase-1, then applies Phase-1 recursively to all reached states.

To avoid postponing a transition indefinitely, at least one state is fully expanded on each cycle in the reduced state space. Such an indefinite postponing can only arise within Phase-1. It is handled by detecting cycles within the current Phase-1 expansion. When such a cycle is detected, the algorithm moves to the next process or to Phase-2. As shown in [17], the Two-Phase algorithm produces a reduced state space which is visibly-bisimilar to the whole one and therefore preserves CTL∗_X properties. This follows from the fact that TP is a classical DFS algorithm with POR and that ample(s) meets conditions C0 to C4 of Section 1.5.
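The alternation of the two phases can be sketched as follows, under simplifying assumptions: each process i exposes a function `det[i]` giving its unique safe deterministic successor (or `None`), `full(s)` gives all successors for the Phase-2 expansion, and Phase-1 cycle detection only remembers the states seen in the current Phase-1 run. Names and interfaces are illustrative, not the algorithm of [6] verbatim.

```python
# Schematic Two-Phase exploration: deterministic runs, then full expansion.

def two_phase(s0, det, full):
    visited, edges = set(), []

    def phase1(s):
        seen = {s}
        for di in det:                      # each process, in a fixed order
            t = di(s)
            while t is not None and t not in seen:  # stop on a local cycle
                edges.append((s, t))
                seen.add(t)
                s, t = t, di(t)
        return s

    stack = [s0]
    while stack:
        s = phase1(stack.pop())
        if s in visited:
            continue
        visited.add(s)                      # Phase-2: full expansion
        for t in full(s):
            edges.append((s, t))
            stack.append(t)
    return visited, edges

# Two local counters incremented deterministically to 2, then a reset.
det = [lambda s: (s[0] + 1, s[1]) if s[0] < 2 else None,
       lambda s: (s[0], s[1] + 1) if s[1] < 2 else None]
visited, edges = two_phase((0, 0), det, lambda s: [(0, 0)] if s == (2, 2) else [])
```

In this toy run only (2, 2) is fully expanded; the intermediate deterministic states occur only along the reduced Phase-1 edges.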
2. The Stuttering Bounded Two-Phase Algorithm

This section presents a variant of the two-phase approach to partial-order reduction, called the Stuttering Bounded Two-Phase method (SBTP). In contrast to the original TP, which performs Phase-1 partial expansions as long as possible, our SBTP method imposes a fixed number n of Phase-1 expansions for each process. If fewer than n successive steps are enabled for some process, invisible idle transitions are performed instead. Figure 4 illustrates the resulting computation tree for two processes with n = 3 transitions each: three steps of T0 (else idle), three steps of T1 (else idle), then the full relation T, repeated.

Figure 4. SBTP(M, 3) with two processes and n = 3.
This approach ensures that, at a given depth in the execution, the same (partial or global) transition relation is applied to all states, which greatly simplifies the encoding and resolution of this exploration as a bounded model checking problem using SAT solvers. We consider computation trees (CTs) rather than general transition systems. This offers the advantage that states of the original transition system that can be reached through different paths in the original model, and thus be expanded in different ways, become different states in the computation tree, each with its unique expansion. This matches naturally with the SAT-based bounded model checking approach, which does not attempt to prevent exploring the same state several times on the same path, as opposed to conventional enumerative model checkers. To precisely define the SBTP approach, we first characterize a broad class of derived CTs, reduced according to partial-order criteria and extended with (finite chains of) idle transitions, and show that they are visibly-bisimilar to the CT they derive from. Then we define the CT corresponding to the SBTP method we just outlined as a particular instance of this class of derived CTs. Finally, we express a constraint system whose solutions are (finite or infinite periodic) bounded execution paths of the CT produced by SBTP.

2.1. Transforming the Computation Tree

This section presents two classes of derived computation trees that are visibly-bisimilar to a given computation tree CT: partial-order reductions (POR) of CT, which remove states and transitions according to POR criteria, and idle-extensions of CT, which add (finitely many) idle transitions to each state.
Both derivations can be combined: indeed, given an initial CT, we can successively derive CT′ as a POR of CT, then CT″ as an idle-extension of CT′. By transitivity of the equivalence, we know that CT″ is visibly-bisimilar to CT.

Partial-Order Reduction of CTs. The definition of POR on computation trees is a straight application of the criteria quoted in Section 1.5.

Definition 6 (POR). Given two computation trees CT = (S, T, s0, L) and CT′ = (S′, T′, s0, L) such that S′ ⊆ S and T′ ⊆ T, CT′ is a partial-order reduction (POR) of CT if and only if ample(s) respects the conditions C0, C1, C2, C3b and C4 from Section 1.5 over CT, where for all s in S, ample(s) = enabled(s, T′) when s ∈ S′ and ample(s) = enabled(s, T) otherwise.⁴

Theorem 1. If CT′ is a partial-order reduction of CT, then CT′ ≈ CT.

Proof. This can be demonstrated by constructing a visible bisimulation between M and M′. The relation ∼ ⊆ S × S is defined such that s ∼ s′ iff there exists a path s = s1 −a1→ s2 −a2→ ··· −a_{n−1}→ sn = s′ such that each ai is invisible and {ai} satisfies C1 from state si, for 1 ≤ i < n. It was shown in [8] that the relation ≈ = ∼ ∩ (S × S′) is a visible bisimulation between M and M′.

Idle-Extension of CTs. The idle-extension consists in adding a finite (possibly null) number of idle transitions to states of CT, giving CT′. Intuitively, an idle transition is a transition which does nothing, and thus does not modify the current state.

Definition 7 (Idle-Extension). Given a computation tree CT = (S, T, s0, L), an idle-extension of CT is a computation tree CT′ = (S′, T′, s0, L) over an extended set of transitions A ∪ {idle}, with S′ ⊇ S, such that for all s ∈ S there is a finite sequence s = s0 −idle→ s1 −idle→ ··· −idle→ sn in CT′ where:

• s1, ..., sn are new states not in S,
• L(s1) = ··· = L(sn) = L(s),
• idle is the only enabled transition in s0, ..., sn−1,
• for all s −a→ t in CT we have sn −a→ t in CT′.

We write s −idle*→ si when such a sequence exists, and call si an idle-successor of s and s the idle-origin of si. Note that the idle transition is invisible according to this definition. Since the idle-extension is a tree, idle-successors are never shared between multiple idle-origins.

Theorem 2. If CT′ is an idle-extension of CT, then CT′ ≈ CT.

Proof. Let CT = (S, T, s0, L) and CT′ = (S′, T′, s0, L). We define B ⊆ S × S′ such that B(s, s′) iff s′ is an idle-successor of s (including s itself). We will prove that B is a visible bisimulation between CT and CT′. First, obviously we have B(s0, s0). Next, we consider s, s′ such that B(s, s′) and check that the three conditions of Definition 4 are satisfied both ways. By definition of B, s′ is an idle-successor of s.

1. L(s) = L(s′), by Definition 7.
2. If s −a→ t in CT, then there is s′ −idle*→ sn −a→ t in CT′, with B(t, t). Conversely, if s′ −a→ t′ in CT′, then either a = idle, which is invisible, and t′ is another idle-successor of s, so B(s, t′); or a ≠ idle, in which case s′ is the last idle-successor of s and s −a→ t′ in CT, with B(t′, t′).
3. Suppose that there exists an infinite path s −a1→ t1 −a2→ t2 ··· in CT, where all ai are invisible and B(ti, s′) for all ti. Then s′ is a shared idle-successor of all ti, which is impossible according to Definition 7. Conversely, suppose that there exists an infinite path s′ −a1→ t1 −a2→ t2 ··· in CT′, where all ai are invisible and B(s, ti) for all ti. Then all ti are idle-successors of s, which is again impossible according to Definition 7.

⁴ The case where s ∉ S′ is for technical soundness only.
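On finite trees, Definition 7 has a direct constructive reading: pad selected states with a chain of fresh idle states whose labelling equals the origin's, so that the original moves are only enabled at the end of the chain. The encoding below (nodes as integers, trees as successor maps) is illustrative only.

```python
# Idle-extension of a finite computation tree (Definition 7, sketch).

def idle_extend(tree, labels, pad):
    """tree: node -> list of (action, child); pad: node -> chain length."""
    ext, ext_labels, fresh = {}, dict(labels), max(tree) + 1
    for s, succs in tree.items():
        last = s
        for _ in range(pad.get(s, 0)):
            ext[last] = [("idle", fresh)]    # idle is the only enabled move
            ext_labels[fresh] = labels[s]    # idle-successors keep L(s)
            last, fresh = fresh, fresh + 1
        ext[last] = succs                    # original transitions at the end
    return ext, ext_labels

tree = {0: [("a", 1)], 1: []}
ext, ext_labels = idle_extend(tree, {0: "p", 1: "q"}, {0: 2})
```

Here state 0 gains two fresh idle states (2 and 3) before its original `a`-move, exactly the shape required by the bullets of Definition 7.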
2.2. The Stuttering Bounded Two-Phase Computation Tree

In order to accelerate the SAT procedure, we want to consider a modified computation tree of a model, such that the same (possibly partial) transition relations are applied to all states at a given depth across the tree. This result can be obtained by applying the Stuttering Bounded Two-Phase (SBTP) construction, a variant of the Two-Phase algorithm (TP). For the simplicity of the arguments, the method presented in this section and in Section 2.3 considers only the case of finite traces without back-loops (c.f. Section 1.3). Section 2.4 explains how to reason about back-loops. We consider a process model M = (S, T, s0, L) with m processes p0, p1, ..., pm−1 (c.f. Section 1.6). SBTP's Phase-1 expands exactly n deterministic transitions of p0, then n deterministic transitions of p1, ..., then n deterministic transitions of pm−1 (n ∈ N). If fewer than n safe deterministic transitions are enabled, idle transitions are performed instead. After Phase-1, a Phase-2 expansion occurs even if there are safe deterministic transitions remaining. The computation tree produced by SBTP(M, n) is defined in Listing 1, where BCT(s, t, i) computes the transition relation from state t at depth i using transitions from state s, and p = (i mod c) div n.

SBTP(M, n) = (S′, T′, s0, L) where
  c = m · n + 1,
  T′ = BCT(s0, s0, 0),
  BCT(s, t, i) =
    if n · p ≤ i mod c < n · (p + 1) ∧ s −a→ s′ ∈ Tp then
      {t −a→ s′} ∪ BCT(s′, s′, i + 1)
    else if n · p ≤ i mod c < n · (p + 1) ∧ enable(s, Tp) = ∅ then
      {t −idle→ t′} ∪ BCT(s, t′, i + 1) where t′ is an idle-successor of s
    else // i mod c = m · n (Phase-2)
      ⋃_{(s,a,s′)∈T} ({t −a→ s′} ∪ BCT(s′, s′, i + 1)), and
  S′ = {s | s is reachable from s0 using T′}

Listing 1. SBTP.
It is easily seen that the computation tree produced by SBTP is an idle-extension of a partial order reduction of CT (M ), and is therefore visible-bisimilar to CT (M ). We notice that when n equals 0 no partial order reduction is performed and the resulting computation tree is the same as the original computation tree. Figure 5(b) illustrates the result of applying one full cycle of SBTP to the CT of Figure 5(a), with two processes and n = 3. The gray arrows of Figure 5(a) represent transitions which are ignored by Phase-1.
Figure 5. CT(M) vs SBTP(M, 3); if s and s′ are linked by a dashed line then s ≈ s′.
2.3. Applying Bounded Model Checking to SBTP

This section describes the actual bounded model checking problem used in our approach. This problem encodes bounded executions of the SBTP computation tree defined in the previous section. Given a process model M with m processes, an LTL_X property f, and n, k ∈ N, our approach uses a variant of the method presented in [4] to create a propositional formula [[M, ¬f]]^SBTP_{k,n}. Contrary to the classical bounded model checking methods, which use a single transition relation to carry out the required computation on the state space, we define m + 1 transition relations. One is the full transition relation T used in Phase-2. The others, used in Phase-1, contain, for each process pi, only the safe deterministic transitions of pi, plus idle transitions on states where no such safe deterministic transitions are enabled. We denote these transition relations by Ti^idle. Given two states s, t and an action a, Ti^idle(s, a, t) if and only if either enable(s, Ti) = {a} and s −a→ t, or enable(s, Ti) = ∅ and a = idle. Given the number of processes m and the parameter n, we know which phase is used at each depth i of the unfolding. Furthermore, if Phase-1 is expanded at depth i, we know which process is being unfolded (c.f. Figure 4). The transition relation Tn(i, s, a, s′) expanded at level i is defined as follows:

Definition 8 (Tn(i, s, a, s′)). Given M = (S, T, s0, L) with m processes p0, p1, ..., pm−1, let c = m · n + 1 be the number of steps of a cycle: n Phase-1 steps for each of the m processes plus one Phase-2 step. For i ∈ N, s, s′ ∈ S, and a ∈ A:

Tn(i, s, a, s′) := T(s, a, s′) if i mod c = m · n (Phase-2)
Tn(i, s, a, s′) := Tj^idle(s, a, s′) where j = (i mod c) div n, otherwise (Phase-1)
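The depth-to-relation schedule of Definition 8 is a small arithmetic function; a sketch (names illustrative):

```python
# Which relation Definition 8 unfolds at depth i, for m processes and
# n Phase-1 steps per process.

def relation_at(i, m, n):
    """Return 'T' for a Phase-2 step, or ('T_idle', j) for a Phase-1
    step of process j, at unfolding depth i."""
    c = m * n + 1              # cycle length: n steps per process, plus T
    r = i % c
    return "T" if r == m * n else ("T_idle", r // n)

# With m = 2 and n = 3 (Figure 4): depths 0-2 unfold process 0,
# depths 3-5 process 1, depth 6 the full relation T, then repeat.
```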
We can apply POR to bounded model checking by substituting the previous definition into Definition 2, which translates a transition system and an LTL property into a propositional formula:
Definition 9 (SBTP encoding). Let M be a process model with m processes, f an LTL_X property, and n, k ∈ N:

[[M, ¬f]]^SBTP_{k,n} := I(s0) ∧ ⋀_{i=0}^{k−1} Tn(i, si, ai, si+1) ∧ ¬L_k ∧ [[¬f]]_k

Once the propositional formula [[M, ¬f]]^SBTP_{k,n} is built, a decision procedure is used to check its satisfiability. An error is found if [[M, ¬f]]^SBTP_{k,n} is satisfiable. The validity of this method stems from the following observations. By comparing the constructions of SBTP(M, n) and [[M, ¬f]]^SBTP_{k,n}, it is clear that the latter is the BMC encoding of the former, i.e. [[M, ¬f]]^SBTP_{k,n} = [[SBTP(M, n), ¬f]]_k (restricted to finite traces). The rest derives from the validity of BMC and SBTP, as follows:
Theorem 3. Let M be a process model with m processes, f an LTL_X formula, and n ∈ N. There exists k ≥ 0 such that [[M, ¬f]]^SBTP_{k,n} is satisfiable if and only if M ⊭ f.

Proof.

∃k : [[M, ¬f]]^SBTP_{k,n} is satisfiable
⇐⇒ ∃k : [[SBTP(M, n), ¬f]]_k is satisfiable ([[M, ¬f]]^SBTP_{k,n} = [[SBTP(M, n), ¬f]]_k)
⇐⇒ ∃k : SBTP(M, n) ⊭_k f (by validity of BMC, c.f. Theorem 2 of [4])
⇐⇒ SBTP(M, n) ⊭ f (by validity of BMC, c.f. Theorem 1 of [4])
⇐⇒ M ⊭ f (SBTP(M, n) ≈ CT(M) ≈ M)

[[M, ¬f]]^SBTP_{k,n} is well suited for the DPLL algorithm, in the sense that the Phase-1 transition relations Tj^idle produce mostly efficient unit propagation with little backtracking. Suppose that we want to find a satisfying assignment for the path s0 −a0→ s1 ··· and that s0 is a deterministic state. Once the values of the variables of s0 are completely decided, the values of the variables of s1 can be completely inferred by the unit propagation phase. Because s0 is a deterministic state, there is exactly one possibility for the values of the variables of s1.
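This propagation effect can be illustrated with a minimal unit-propagation routine (decisions and backtracking omitted). The two-clause encoding of a one-bit deterministic transition below is hypothetical, chosen only to show the next state being inferred without any decision.

```python
# Unit propagation on CNF clauses: repeatedly assign the literal of any
# clause that is unsatisfied and has exactly one unassigned literal.

def unit_propagate(clauses, assignment):
    """clauses: lists of signed ints; returns the extended assignment."""
    assignment = dict(assignment)
    changed = True
    while changed:
        changed = False
        for clause in clauses:
            unassigned, satisfied = [], False
            for lit in clause:
                val = assignment.get(abs(lit))
                if val is None:
                    unassigned.append(lit)
                elif (lit > 0) == val:
                    satisfied = True
            if not satisfied and len(unassigned) == 1:
                assignment[abs(unassigned[0])] = unassigned[0] > 0
                changed = True
    return assignment

# Variable 1: state bit x at step 0; variable 2: x at step 1.
# The clauses encode the deterministic transition x' <-> not x.
clauses = [[-1, -2], [1, 2]]
result = unit_propagate(clauses, {1: True})  # x=1 forces x'=0
```

Fixing the step-0 variable forces the step-1 variable by propagation alone, which is exactly why deterministic Phase-1 steps are cheap for DPLL.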
2.4. BMC with Back-Loops

This section shows why paths with back-loops invalidate the arguments of Section 2.3, and how to extend those arguments to deal with back-loops. Figure 6 represents a path π which contains a back-loop. It is easy to see that this finite loop induces an infinite path which does not belong to SBTP(M, 2). This path belongs to the computation tree presented in Figure 7. All the execution paths start with a prefix of length k1 = 3 of the form s0 −T1^idle→ s1 −T1^idle→ s2 −T2^idle→, followed by an infinite expansion of the loop of length k2 = 6: si −T2^idle→ si+1 −T→ si+2 −T1^idle→ si+3 −T1^idle→ si+4 −T2^idle→ si+5 −T→.

Given a process model M = (S, T, s0, L) with m processes, and lengths k1 and k2, we can build variants of SBTP(M, n) that correspond to the computation tree of Figure 7. These variants are still idle-extensions of partial-order reductions of CT(M), hence visibly-bisimilar to M. We can then construct a complete version of [[M, ¬f]]^SBTP_{k,n} with back-loops, similar to Definition 2, that essentially corresponds to the union of those modified SBTP computation trees. Note that in order to satisfy condition C3b of Section 1.5, the full transition relation T must be used to check whether there exists a back-loop or not. If a transition relation Tj^idle were used instead, we could have a loop that does not contain a Phase-2 expansion, thus postponing some transitions indefinitely and violating condition C3b.
Figure 6. A finite path with a back-loop.
Figure 7. Variant of SBTP(M, n).
3. Implementation

We extended the model checker presented in [7] to support BMC over SBTP models. It allows us to describe concurrent systems and to verify LTL_X properties. Our prototype has been implemented in the Scala language [18]. We chose Scala because it is a multi-paradigm programming language, fully interoperable with Java, designed to integrate features from object-oriented and functional programming. Scala is a pure object-oriented language in the sense that every value is an object; it is also a functional language in the sense that every function is a value. The model checker defines a language for describing transition systems. The design of the language has been influenced on the one hand by process algebras and on the other hand by the NuSMV language [19]. A model of a concurrent system declares a set of global variables, a set of shared actions and a set of processes. A process pi declares a set of local variables, a set of local actions and the set of shared actions on which pi synchronizes. Each process has a distinguished local program counter variable pc. For each value of pc, the behavior of a process is defined by means of a list of action-labelled guarded commands of the form [a] c → u, where a is an action, c is a condition on variables and u is an assignment updating some variables. Shared actions are used to define synchronization between the processes. A shared action occurs simultaneously in all the processes that share it, and
only when all of them enable it. For each process pi, we use a heuristic based on syntactic information about pi to compute a safe approximation Ai of the safe deterministic actions. These conditions are described in [20]. We only allow properties to refer to global variables. Intuitively, the safe deterministic transitions are those which perform a deterministic action (e.g. a deterministic assignment) and do not access any global variables or global labels. A more complex heuristic could take into account the variables occurring in the properties being verified, to improve the quality of the safe approximation Ai. The model checker takes a model in this language as input. The number of steps per process in Phase-1 (the parameter n) is fixed by the user. To find an error, it applies iterative deepening, producing a SAT problem corresponding to Definition 9 for increasing depths k. The Yices SMT solver is used to check the satisfiability of the generated formula [21]. We chose Yices because it offers built-in support for the arithmetic operators defined in our language. If a counterexample is found, a trace which violates the property is displayed.
4. Case Study
In order to assess the effectiveness of our method, we applied it to a variant of a producer-consumer system where all producers and consumers contribute to the production of every single item. The model is composed of 2m processes: m producers and m consumers. The producers and consumers communicate via a bounded buffer. Each producer works locally on a piece p, then waits until all producers have finished their task. Then p is added to the bounded buffer, and the producers start processing the next piece. When the consumers remove p from the bounded buffer, they work locally on it. When all the consumers have finished their local work, another piece can be removed from the bounded buffer. Two properties have been analyzed on this model: P1 states that the bounded buffer is always empty, and P2 states that in all cases the buffer will eventually contain more than one piece. Table 1 and Table 2 compare the classical BMC method and the SBTP method when applied to P1 and P2. Notice that BMC proceeds by increasing the depth k until an error is found (cf. iterative deepening). Classical BMC quickly runs out of resources, whereas our method can treat much larger models in a few minutes. Regarding verification time, our method significantly outperforms the BMC method on this example. We also notice that SBTP traces are 3.4 to 6.75 times longer. This difference can come either from the addition of the idle transitions or from the considered paths themselves: contrary to BMC, our method does not consider all possible interleavings, so it is possible that the shortest error traces are not considered. Table 3 analyses the influence of the number of times Phase-1 is executed for each process (i.e. the parameter n). We notice that, for a given number of producers and consumers, n influences in a non-monotonic way the length of the error execution path, the verification time and the memory used during the verification.
n influences two aspects of the transformation of the model. On the one hand, the graph is reduced further as n is increased, due to more partial-order reduction. On the other hand, the number of added idle transitions is also influenced by this parameter. When n is increased, the number of cycles on the discovered error path tends towards the minimum number of unsafe transitions which participate in the violation of the property. We notice that each time the number of cycles is decremented by one (cf. n = 4), the CPU time and the memory needed reach a local minimum. Then the CPU time and the memory used increase until the number of cycles is decremented again.
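The round structure of the case-study model (all producers synchronize before a piece enters the buffer; all consumers synchronize before a piece is removed) can be illustrated with a small concurrent sketch. The Java thread-and-barrier version below is our own illustration, not the guarded-command model actually fed to the tool, and the local work of each process is elided:

```java
// Hedged sketch of the case-study protocol: one piece enters the buffer only
// after all m producers have finished a round, and one piece leaves it only
// after all m consumers have finished a round.
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

public class ProducerConsumerRound {
    // Runs the protocol for the given number of rounds and returns the number
    // of pieces that passed through the buffer.
    public static int run(int m, int rounds) throws Exception {
        BlockingQueue<Integer> buffer = new LinkedBlockingQueue<>();
        AtomicInteger produced = new AtomicInteger();
        AtomicInteger consumed = new AtomicInteger();
        // Barrier action: once every producer finished its local work,
        // one piece is appended to the buffer.
        CyclicBarrier prod = new CyclicBarrier(m, () -> buffer.add(produced.incrementAndGet()));
        // Barrier action: once every consumer finished its local work,
        // one piece is removed from the buffer.
        CyclicBarrier cons = new CyclicBarrier(m, () -> {
            try { buffer.take(); consumed.incrementAndGet(); }
            catch (InterruptedException e) { throw new RuntimeException(e); }
        });
        ExecutorService pool = Executors.newFixedThreadPool(2 * m);
        for (int i = 0; i < m; i++) {
            pool.submit(() -> { for (int r = 0; r < rounds; r++) prod.await(); return null; });
            pool.submit(() -> { for (int r = 0; r < rounds; r++) cons.await(); return null; });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return consumed.get();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run(3, 5));  // 5 rounds, so 5 pieces pass through
    }
}
```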
Table 1. Statistics of property P1 of the producer-consumer model using the BMC approach and the SBTP approach with n = 8. m is the number of producers (resp. consumers), # states is the state space size, k is the smallest bound for which an error is found, TIME is the verification time (in seconds), MEM is the memory used by Yices when the bound equals k (in Megabytes), and # cycles is the number of cycles: Phase-1/Phase-2. — indicates that the computation did not end within 8 hours.
                      |  BMC property P1                |  SBTP property P1
m   # states          |   k   TIME (sec)  MEM (MB)      |    k   # cycles  TIME (sec)  MEM (MB)
1   1,059             |  10          10        29       |   34       2           7         30
2   51,859            |  18          44        41       |   66       2           8         49
3   3,807,747         |  26      11,679        65       |   98       2          16         85
4   ≈ 10^8            |   —           —         —       |  130       2          31        122
5   ≈ 10^10           |   —           —         —       |  162       2          43        169
6   ≈ 10^12           |   —           —         —       |  194       2          57        224
7   ≈ 10^14           |   —           —         —       |  226       2          77        288
Table 2. Statistics of property P2 of the producer-consumer model using the BMC approach and the SBTP approach with n = 8. m is the number of producers (resp. consumers), # states is the state space size, k is the smallest bound for which an error is found, TIME is the verification time (in seconds), MEM is the memory used by Yices when the bound equals k (in Megabytes), and # cycles is the number of cycles: Phase-1/Phase-2. — indicates that the computation did not end within 8 hours.
                      |  BMC property P2                |  SBTP property P2
m   # states          |   k   TIME (sec)  MEM (MB)      |      k   # cycles  TIME (sec)  MEM (MB)
1   1,059             |  26          73        33       |    153       9         122         96
2   51,859            |  44      29,898       131       |    297       9         211        224
3   3,807,747         |   —           —         —       |    441       9         401        363
4   ≈ 10^8            |   —           —         —       |    585       9       1,238        680
5   ≈ 10^10           |   —           —         —       |    729       9       1,338        983
6   ≈ 10^12           |   —           —         —       |    873       9       1,926      1,438
7   ≈ 10^14           |   —           —         —       |  1,017       9       4,135      1,618
Table 3. Influence of the parameter n when the number of producers (resp. consumers) equals 2. k is the smallest bound for which an error is found, # cycles is the number of cycles: Phase-1/Phase-2, TIME is the verification time (in seconds), and MEM is the memory used by Yices when the bound equals k (in Megabytes).
    |  property P1                          |  property P2
n   |   k   # cycles  TIME (sec)  MEM (MB)  |    k   # cycles  TIME (sec)  MEM (MB)
0   |  18      18          44        41     |   44      44        29,898      131
1   |  35       7          12        41     |   95      19           855      159
2   |  45       5          11        40     |  135      15           235      167
3   |  39       4          10        47     |  169      13           305      194
4   |  51       3           8        47     |  187      11           217      192
5   |  63       3          10        50     |  231      11           375      308
6   |  75       3          12        57     |  275      11           381      240
7   |  87       3          13        58     |  319      11           583      318
8   |  66       2           8        49     |  297       9           211      224
9   |  74       2           9        57     |  333       9           240      295
5. Related Work

Different approaches have been developed to apply symbolic model checking to asynchronous systems. In [22], Enders et al. show how to encode a transition relation T(s, a, s′) into BDDs. This paper focuses on the ordering of the variables within BDDs. It is well-known that the size of BDDs, and therefore the performance of BDD-based model checking, strongly depends on this ordering. In general, finding the best variable ordering is an NP-complete problem. The paper presents a heuristic which, according to experimental results, produces BDDs that grow linearly in the number of asynchronous components. In [8], Gerth et al. show how to perform partial order reduction in the context of process algebras. They show that conditions C0 to C4 of Section 1.5 can be applied to produce a reduced structure that is branching-bisimilar, and hence preserves Hennessy-Milner logic [23]. Other approaches combine symbolic model checking and POR to verify different classes of properties. In [24], Alur et al. transform an explicit model checking algorithm performing partial order reduction. This algorithm is able to check invariance of local properties. They start from a DFS algorithm and obtain a modified BFS algorithm. Both expand an ample set of transitions at each step. In order to detect cycles, they pessimistically assume that each previously expanded state might close a cycle. In [25], Abdulla et al. present a general method for combining POR and symbolic model checking. Their method can check safety properties either by backward or by forward reachability analysis. To perform the reduction, they employ the notion of commutativity in one direction, a weakening of the dependency relation which is usually used to perform POR. In [26], Kurshan et al. introduce a partial order reduction algorithm based on static analysis. They notice that each cycle in the state space is composed of some local cycles.
The method performs a static analysis of the checked model so as to discover local cycles and set up all the reductions at compile time. The reduced state space can then be handled with symbolic techniques. This paper complements our previous work combining symbolic model checking and partial order reduction [7]. That work introduces the FwdUntilPOR algorithm, which combines two existing techniques to provide symbolic model checking of a subset of CTL on asynchronous models. The first technique is the ImProviso algorithm, which efficiently merges POR and symbolic methods [20]. It is a symbolic adaptation of the Two-Phase algorithm. The second technique is the forward symbolic model checking approach applicable to a subset of CTL [27]. Contrary to FwdUntilPOR, which checks CTL\X properties using BDD-based model checking, our method deals with LTL\X properties using a SAT solver. In [28], Jussila presents three improvements for applying bounded model checking to asynchronous systems. Jussila considers reachability properties, whereas our method allows the verification of LTL\X properties. The partial order semantics replaces the standard interleaving execution model with non-standard models allowing the execution of several independent actions simultaneously. Then, on-the-fly determinization consists of determinizing the different components during their composition. This is done by creating a propositional formula whose models correspond to the executions of the determinized equivalents of the components. We point out that the state automata resulting from the determinization of the components are never constructed. Finally, the merging of local transitions can be seen as introducing additional transitions into the components. These transitions correspond to the execution of a sequence of local actions. When a transition is added, the component has to contain a path between the transition's source and target states.
The partial order semantics addresses the same problem as we do. Both methods consider a model which contains fewer execution paths than the original model. On-the-fly determinization can be seen as a method complementary to ours. In general, when asynchronous systems are considered, two causes of non-determinism are identified: the first one comes
from the components themselves, and the second one comes from the interleaving execution model. The former is handled by on-the-fly determinization, while our method tackles the latter. All three approaches are potentially applicable and open interesting directions for further work. However, none of those methods provides a BMC encoding using only a subset of the transition relation at some steps, which has proven to provide important performance gains in our approach.

6. Conclusion

In this paper, we introduced a technique which applies partial order reduction methods to bounded model checking. It provides an algorithm suited to the verification of LTL\X properties on asynchronous models. The formulæ produced by this approach allow for more efficient processing by the DPLL algorithm used in BMC, compared to those produced by the conventional bounded model checking approach. These formulæ are obtained by using only a restricted, safe subset of the transition relation, based on POR considerations, at many steps in the unfolding of the model. In order to assess the correctness of our method, we defined two general procedures for transforming a computation tree CT into a visible-bisimilar one: the partial-order reduction of CT, which reduces CT according to classical POR criteria, and the idle-extension of CT, which adds a finite number of idle transitions to each state. Then we defined the SBTP algorithm, which is a particular instance of these transformations. Finally, we presented the transformation of SBTP into a bounded model checking problem. We extended a model checker which is currently under development at our university to support our method. We showed on a simple case study that our method achieves an improvement over the classical bounded model checking algorithm. However, our method needs to be tested on a larger range of case studies and to be compared with other methods and tools such as NuSMV [19] or FDR [29].
Furthermore, one could explore how to apply our method to those tools. Our approach can be extended in the following ways:
• The SBTP algorithm can be extended to handle models featuring variables over infinite domains. This can be achieved by using the capabilities of Satisfiability Modulo Theories solvers such as Yices [21] and MathSAT [30].
• When a partial T_j^idle(s_i, a, s_{i+1}) is applied, only local variables of p_j are modified, the other variables y being constrained to remain the same (y_i = y_{i+1}). Based on that, we could merge these variables and remove the corresponding constraints. This would amount to parallelizing safe transitions of different processes, approaching the result of Jussila's first method from a different angle.
• The heuristic used to determine the safe deterministic transitions is quite simple. Meanwhile, there exists a large body of literature on this subject. Based on that, we could explore better approximations that result in detecting more safe deterministic transitions.

Acknowledgements

This work is supported by project MoVES under the Interuniversity Attraction Poles Programme — Belgian State — Belgian Science Policy. The authors are grateful for the fruitful discussions with Stefano Tonetta, Marco Roveri, and Alessandro Cimatti during a stay at the Fondazione Bruno Kessler. These discussions were the starting point of this work.
References
[1] Edmund M. Clarke, Orna Grumberg, and Doron Peled. Model Checking. MIT Press, 1999.
[2] R. E. Bryant. Graph-based algorithms for boolean function manipulation. IEEE Transactions on Computers, C-35(8), 1986.
[3] J. R. Burch, E. M. Clarke, K. L. McMillan, D. L. Dill, and J. Hwang. Symbolic model checking: 10^20 states and beyond. Information and Computation, 98(2):142–170, 1992.
[4] Armin Biere, Alessandro Cimatti, Edmund M. Clarke, Ofer Strichman, and Yunshan Zhu. Bounded model checking. Advances in Computers, 58:118–149, 2003.
[5] Patrice Godefroid. Partial-Order Methods for the Verification of Concurrent Systems – An Approach to the State-Explosion Problem, volume 1032 of Lecture Notes in Computer Science. Springer-Verlag, 1996.
[6] Ratan Nalumasu and Ganesh Gopalakrishnan. A new partial order reduction algorithm for concurrent system verification. In CHDL'97: Proceedings of the IFIP TC10 WG10.5 International Conference on Hardware Description Languages and Their Applications, pages 305–314, London, UK, 1997. Chapman & Hall, Ltd.
[7] José Vander Meulen and Charles Pecheur. Efficient symbolic model checking for process algebras. In 13th International Workshop on Formal Methods for Industrial Critical Systems (FMICS 2008), volume 5596 of LNCS, pages 69–84, 2008.
[8] Rob Gerth, Ruurd Kuiper, Doron Peled, and Wojciech Penczek. A partial order approach to branching time logic model checking. Information and Computation, 150(2):132–152, 1999.
[9] Orna Lichtenstein and Amir Pnueli. Checking that finite state concurrent programs satisfy their linear specification. In POPL '85: Proceedings of the 12th ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages, pages 97–107, New York, NY, USA, 1985. ACM.
[10] Rob Gerth, Doron Peled, Moshe Y. Vardi, and Pierre Wolper. Simple on-the-fly automatic verification of linear temporal logic. In Proc. 15th Workshop on
Protocol Specification, Testing, and Verification, Warsaw, June 1995. North-Holland.
[11] E. M. Clarke, E. A. Emerson, and A. P. Sistla. Automatic verification of finite-state concurrent systems using temporal logic specifications. ACM Transactions on Programming Languages and Systems, 8(2):244–263, 1986.
[12] E. Clarke, E. A. Emerson, and A. P. Sistla. Automatic verification of finite-state concurrent systems using temporal logic. In 10th Annual Symposium on Principles of Programming Languages. ACM, 1983.
[13] Julius R. Büchi. On a decision method in restricted second order arithmetic. In Ernest Nagel, Patrick Suppes, and Alfred Tarski, editors, Proceedings of the 1960 International Congress on Logic, Methodology and Philosophy of Science, pages 1–11. Stanford University Press, June 1962.
[14] J. R. Burch, E. M. Clarke, K. L. McMillan, D. L. Dill, and J. Hwang. Symbolic model checking: 10^20 states and beyond. Information and Computation, 98(2):142–170, 1992.
[15] Martin Davis, George Logemann, and Donald Loveland. A machine program for theorem-proving. Commun. ACM, 5(7):394–397, 1962.
[16] Robin Milner. Communication and Concurrency. Prentice-Hall, 1989.
[17] Ratan Nalumasu and Ganesh Gopalakrishnan. An efficient partial order reduction algorithm with an alternative proviso implementation. Formal Methods in System Design, 20(3):231–247, 2002.
[18] Martin Odersky, Philippe Altherr, Vincent Cremet, Burak Emir, Sebastian Maneth, Stéphane Micheloud, Nikolay Mihaylov, Michel Schinz, Erik Stenman, and Matthias Zenger. An overview of the Scala programming language. Technical Report IC/2004/64, EPFL Lausanne, Switzerland, 2004.
[19] A. Cimatti, E. Clarke, F. Giunchiglia, and M. Roveri. NuSMV: a new symbolic model verifier. In Proc. of International Conference on Computer-Aided Verification, 1999.
[20] Flavio Lerda, Nishant Sinha, and Michael Theobald. Symbolic model checking of software.
In Byron Cook, Scott Stoller, and Willem Visser, editors, Electronic Notes in Theoretical Computer Science, volume 89. Elsevier, 2003.
[21] Bruno Dutertre and Leonardo de Moura. The Yices SMT solver. Tool paper at http://yices.csl.sri.com/toolpaper.pdf, August 2006.
[22] Reinhard Enders, Thomas Filkorn, and Dirk Taubner. Generating BDDs for symbolic model checking in CCS. In CAV '91: Proceedings of the 3rd International Workshop on Computer Aided Verification, pages 203–213, London, UK, 1992. Springer-Verlag.
[23] Matthew Hennessy and Robin Milner. Algebraic laws for nondeterminism and concurrency. J. ACM, 32(1):137–161, 1985.
[24] Rajeev Alur, Robert K. Brayton, Thomas A. Henzinger, Shaz Qadeer, and Sriram K. Rajamani. Partial-order reduction in symbolic state space exploration. In Computer Aided Verification, pages 340–351, 1997.
[25] Parosh Aziz Abdulla, Bengt Jonsson, Mats Kindahl, and Doron Peled. A general approach to partial order reductions in symbolic verification (extended abstract). In Computer Aided Verification, pages 379–390, 1998.
[26] Robert P. Kurshan, Vladimir Levin, Marius Minea, Doron Peled, and Hüsnü Yenigün. Static partial order reduction. In TACAS '98: Proceedings of the 4th International Conference on Tools and Algorithms for Construction and Analysis of Systems, pages 345–357, London, UK, 1998. Springer-Verlag.
[27] Hiroaki Iwashita, Tsuneo Nakata, and Fumiyasu Hirose. CTL model checking based on forward state traversal. In ICCAD '96: Proceedings of the 1996 IEEE/ACM International Conference on Computer-Aided Design, pages 82–87, Washington, DC, USA, 1996. IEEE Computer Society.
[28] Toni Jussila. On bounded model checking of asynchronous systems. Research Report A97, Helsinki University of Technology, Laboratory for Theoretical Computer Science, Espoo, Finland, October 2005. Doctoral dissertation.
[29] A. W. Roscoe. Model-checking CSP. In A Classical Mind: Essays in Honour of C. A. R. Hoare. Prentice Hall International (UK) Ltd., Hertfordshire, UK, 1994.
[30] Roberto Bruttomesso, Alessandro Cimatti, Anders Franzén, Alberto Griggio, and Roberto Sebastiani. The MathSAT 4 SMT solver. In CAV '08: Proceedings of the 20th International Conference on Computer Aided Verification, pages 299–303, Berlin, Heidelberg, 2008. Springer-Verlag.
Communicating Process Architectures 2009 P.H. Welch et al. (Eds.) IOS Press, 2009 © 2009 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-60750-065-0-49
On Congruence Property of Scope Equivalence for Concurrent Programs with Higher-Order Communication Masaki MURAKAMI Department of Computer Science, Graduate School of Natural Science and Technology, Okayama University, 3-1-1 Tsushima-Naka, Okayama, 700-0082, Japan
[email protected]
Abstract. Representation of scopes of names is important for the analysis and verification of concurrent systems. However, it is difficult to represent the scopes of channel names precisely with models based on process algebra. We introduced a model of concurrent systems with higher-order communication based on graph rewriting in our previous work. In that model, a bipartite directed acyclic graph represents a concurrent system that consists of a number of processes and messages. The model can represent the scopes of local names precisely. We defined an equivalence relation such that two systems are equivalent not only in their behavior, but also in the extrusion of scopes of names. This paper shows that our equivalence relation is a congruence relation w.r.t. τ-prefix, new-name, replication and composition, even when higher-order communication is allowed. We also show that our equivalence relation is not a congruence w.r.t. input-prefix, though it is congruent w.r.t. input-prefix in the first-order case. Keywords. theory of concurrency, π-calculus, bisimilarity, graph rewriting, higher-order communication
Introduction

There are a number of formal models of concurrent systems. In models such as the π-calculus [11], "a name" represents, for example, an IP address, a URL, an e-mail address, a port number and so on. Thus, the scopes of names in formal models are important for the security of concurrent systems. On the other hand, it is difficult to represent the scopes of channel names precisely with models based on process algebra. In many such models, the scope of a name is represented using a binding operation such as the ν-operation. Thus the scope of a name is a subterm of an expression that represents a system. For example, in the π-calculus term νa2(νa1(b1 | b2) | b3), the scope of the name a2 is the subterm (νa1(b1 | b2) | b3) and the scope of the name a1 is the subterm (b1 | b2). However, this method has several problems. For example, consider a system S consisting of a server and two clients. A client b1 communicates with the server b2 using a channel a1 whose name is known only by b1 and b2. A client b3 communicates with b2 using a channel a2 that is known only by b2 and b3. In this system a1 and a2 are private names. Since b1 and b2 know the name a1 but b3 does not, the scope of
M. Murakami / On Congruence Property of Scope Equivalence for Concurrent Programs
Figure 1. Scopes of names in S.
a1 includes b1 and b2, and the scope of a2 includes b2 and b3. Thus the scopes of a1 and a2 are not nested, as shown in Figure 1. The method of denoting private names as bound names using the ν-operator cannot represent the scopes of a1 and a2 precisely, because scopes of names are subterms of a term and are therefore nested (or disjoint) in any π-calculus term. Furthermore, it is sometimes impossible to represent the scope even of one name precisely with the ν-operator. Consider the example νa(v̄a.P) | v(x).Q, where x does not occur in Q. In this example, a is a private name and its scope is v̄a.P. The scope of a is extruded by communication between the prefixes v̄a and v(x). The result of the action is νa(P | Q), and Q is included in the scope of a. However, as a does not occur in Q, it is equivalent to (νa P) | Q by the rules of structural congruence. We cannot see the fact that a has been 'leaked' to Q from the resulting expression (νa P) | Q. Thus we must keep the trace of communications for the analysis of scope extrusion. This makes it difficult to analyze extrusions of scopes of names. In our previous work we presented a model that is based on graph rewriting instead of process algebra as a solution to the problem of representing the scopes of names [6]. We defined an equivalence relation on processes called scope equivalence such that it holds if two processes are equivalent not only in their behavior but also in the scopes of channel names. We showed the congruence results of weak bisimulation equivalence [7] and of scope equivalence [9] on the graph rewriting model. On the other hand, a number of formal models with higher-order communication have been reported. LHOπ (Local Higher Order π-calculus) [12] is one of the most well-studied models in that area. It is a subcalculus of the higher-order π-calculus with asynchronous communication. However, the problem of the scopes of names also arises in LHOπ.
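The nesting problem can be made concrete with a small term representation. In this hedged Java sketch (our own illustration, not part of the paper's model), the scope of a restricted name is by construction the set of processes under its ν-binder, so it is always nested or disjoint with any other scope, and the two overlapping scopes of the system S cannot both be expressed exactly:

```java
// Hedged sketch: a minimal π-term syntax where a ν-scope is a subterm, so
// scopes of names are necessarily nested (or disjoint).
import java.util.*;

public class PiScope {
    interface Term {}
    static class Proc implements Term { final String id; Proc(String id) { this.id = id; } }
    static class Par implements Term { final Term l, r; Par(Term l, Term r) { this.l = l; this.r = r; } }
    static class Nu implements Term { final String name; final Term body;
        Nu(String name, Term body) { this.name = name; this.body = body; } }

    // all process leaves occurring in a term
    static Set<String> leaves(Term t) {
        Set<String> s = new TreeSet<>();
        if (t instanceof Proc) s.add(((Proc) t).id);
        else if (t instanceof Par) { s.addAll(leaves(((Par) t).l)); s.addAll(leaves(((Par) t).r)); }
        else if (t instanceof Nu) s.addAll(leaves(((Nu) t).body));
        return s;
    }

    // the scope of a restricted name is exactly the subterm under its ν-binder
    static Set<String> scopeOf(Term t, String name) {
        if (t instanceof Nu) {
            Nu nu = (Nu) t;
            if (nu.name.equals(name)) return leaves(nu.body);
            return scopeOf(nu.body, name);
        }
        if (t instanceof Par) {
            Set<String> s = scopeOf(((Par) t).l, name);
            return s.isEmpty() ? scopeOf(((Par) t).r, name) : s;
        }
        return Collections.emptySet();
    }

    public static void main(String[] args) {
        // νa2(νa1(b1 | b2) | b3): the scope of a1 is forced inside the scope of a2
        Term t = new Nu("a2", new Par(new Nu("a1", new Par(new Proc("b1"), new Proc("b2"))),
                                      new Proc("b3")));
        System.out.println(scopeOf(t, "a2")); // [b1, b2, b3]
        System.out.println(scopeOf(t, "a1")); // [b1, b2]
        // The intended scopes {b1,b2} for a1 and {b2,b3} for a2 overlap without
        // nesting, so no placement of the two ν-binders expresses both exactly.
    }
}
```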
We need a model with higher-order communication that can represent the scopes of names precisely. We extended the graph rewriting model of [6] to systems with higher-order communication [8]. We extended the congruence results of the behavioral equivalence to the model with higher-order communication [10]. This paper discusses the congruence property of scope equivalence for the graph rewriting model with higher-order communication introduced in [8]. We show that the scope equivalence relation is a congruence relation w.r.t. τ-prefix, new-name, replication and composition even if higher-order communication is allowed, as presented in Section 4.1. These results are extensions of the results presented in [9]. On the other hand, in Section 4.2, we show that it is not a congruence w.r.t. input-prefix, though it is congruent w.r.t. input-prefix in the first-order case [9]. Congruence results on bisimilarity based on graph rewriting models are reported in [2,13]. Those studies adopt a graph transformation approach as a proof technique. In this paper, graph rewriting is introduced to extend the model for the representation of name scopes.
Figure 2. A bipartite directed acyclic graph.
Figure 3. A message node.
Figure 4. A behavior node α.P .
Figure 5. Message receiving.
1. Basic Idea

Our model is based on graph rewriting systems such as [2,3,5,4,13]. We represent a concurrent program that consists of a number of processes (and messages on the way) with a bipartite directed acyclic graph. A bipartite graph is a graph whose nodes are decomposed into two disjoint sets, source nodes and sink nodes, such that no two nodes within the same set are adjacent. Every edge is directed from a source node to a sink node. The system of Figure 1, which consists of three processes b1, b2 and b3 and two names ai (i = 1, 2) shared by bi and bi+1, is represented by a graph as in Figure 2. Processes and messages on the way are represented by source nodes. We call source nodes behaviors. In Figure 2, b1, b2 and b3 are behaviors.

message: A behavior node that represents a message is a node labeled with the name of the recipient n (called the subject of the message) and the contents of the message o, as in Figure 3. The contents of the message is a name or a program code, as we allow higher-order messages. As a program code to be sent is represented by a graph structure, the content of a message may itself have a bipartite graph structure. Thus the message node has a nested structure containing a graph inside the node.

message receiving: A message is received by a receiver process that executes an input action and then continues its execution. We denote a receiver process by a node that consists of its
Figure 6. Extrusion of the scope of n.
Figure 7. Message sending.
Figure 8. Receiving a program code.
epidermis, which denotes the first input action, and its content, which denotes the continuation. For example, a receiver that executes an input action α and then becomes a program P (denoted α.P in CCS terms) is denoted by a node whose epidermis is labeled with α and whose content is P (Figure 4). As the continuation P is a concurrent program, it has a graph structure inside the node. Thus the receiver process also has a nested structure. Message receiving is represented as follows. Consider a message sending an object (a name or an abstraction) n and a receiver with a name m (Figure 5a). The execution of the message input action is represented by "peeling the epidermis of the receiver process node". When the message is received, it vanishes, the epidermis of the receiver is removed and the content is exposed (Figure 5b). Now the continuation P is activated. The received object n is substituted for the name x in the content P. The scope of a name is extruded by message passing. For example, the π-calculus has a τ-transition such that (νn ān) | a(y).P → νn P[n/y]. This extrusion is represented by graph rewriting as in Figure 6. A local name n occurs in the message node, but there is no edge from the node of the receiver because n is new to the receiver. After receiving the message, as n is a newly imported local name, a new sink node corresponding to n is added to the graph and new edges are created from each behavior of the continuation to n, as the continuation of the receiver is in the scope of n.

message sending: In the asynchronous π-calculus, message sending is represented in the same way as process activation. We adopt a similar idea. Consider an example that executes an action α and sends a message m (Figure 7, left). When the action α is executed, the epidermis is peeled and the message m is exposed, as in Figure 7, right. Now the message m is transmitted and can move to the receiver. The execution of Q continues.
higher-order communications: Consider the case where the variable x occurs as the subject of a message, like xu, in the content of a receiver (Figure 8a). If the received object n is a program code, then nu becomes a program to be activated. As in LHOπ, a program code to be transferred takes the form of an abstraction in a message. An abstraction, denoted (y)Q, consists of a graph Q representing a program and its input argument y. When an abstraction (y)Q is sent to the receiver and substituted for x in Figure 8a, the behavior node (y)Qu is exposed and ready to be activated (Figure 8b). To activate (y)Qu, u is substituted for y in Q (Figure 8c). This action corresponds to β-conversion in LHOπ. Then we have a program Q with input value u, and it is activated. Note that new edges are created from each behavior of Q to the sink node which had an edge from xu.
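The β-conversion step above can be illustrated by modelling an abstraction as a function value. This hedged Java fragment is an illustration only: names, edges and the graph structure of Q are elided, and only the activation of an application is shown:

```java
// Hedged sketch: a higher-order message content is program code; modelling an
// abstraction (y)Q as a Java function, activating the application ((y)Q)<u>
// is function application, i.e. the β-conversion step of Section 1.
import java.util.function.Function;

public class HigherOrder {
    // activation of an application A<o>: substitute the object for the
    // abstraction's parameter and run the resulting program
    static <A, B> B activate(Function<A, B> abstraction, A object) {
        return abstraction.apply(object);   // ((y)Q)<u> -> Q[u/y]
    }

    public static void main(String[] args) {
        // program code sent as a message content: (y)Q where Q doubles its input
        Function<Integer, Integer> code = y -> y * 2;
        // the receiver binds the code to x and then activates x<21>
        System.out.println(activate(code, 21));   // prints 42
    }
}
```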
2. Formal Definitions

In this section, we present formal definitions of the model presented informally in the previous section.

2.1. Programs

First, a countably-infinite set of names is presupposed, as in other formal models based on process algebra.

Definition 2.1 (program, behavior) Programs and behaviors are defined recursively as follows. (i) Let a1, . . . , ak be distinct names. A program is a bipartite directed acyclic graph with source nodes b1, . . . , bm and sink nodes a1, . . . , ak such that
• Each source node bi (1 ≤ i ≤ m) is a behavior. Duplicated occurrences of the same behavior are possible.
• Each sink node is a name aj (1 ≤ j ≤ k). All aj's are distinct.
• Each edge is directed from a source node to a sink node. Namely, an edge is an ordered pair (bi, aj) of a source node and a name. For any source node bi and any name aj there is at most one edge from bi to aj.
For a program P, we denote the multiset of all source nodes of P as src(P), the set of all sink nodes as snk(P) and the set of all edges as edge(P). Note that the empty graph 0, such that src(0) = snk(0) = edge(0) = ∅, is a program. (ii) A behavior is an application, a message, or a node consisting of an epidermis and a content, defined as follows. In the rest of this definition, we assume that neither any element of snk(P) nor x occurs anywhere else in the program.
1. A tuple of a variable x and a program P is an abstraction, denoted (x)P. An object is a name or an abstraction.
2. A node labeled with a tuple of a name n (called the subject of the message) and an object o is a message, denoted no.
3. A node labeled with a tuple of an abstraction and an object is an application. We denote an application as Ao, where A is an abstraction and o is an object.
4. A node whose epidermis is labeled with "!" and whose content is a program P is a replication, denoted !P.
5. An input prefix is a node (denoted a(x).P) whose epidermis is labeled with a tuple of a name a and a variable x and whose content is a program P.
6. A τ-prefix is a node (denoted τ.P) whose epidermis is labeled with the silent action τ and whose content is a program P.

Definition 2.2 (local program) A program P is local if, for any input prefix c(x).Q and any abstraction (x)Q occurring in P, x does not occur in the epidermis of any input prefix in Q. An abstraction (x)P is local if P is local. A local object is a local abstraction or a name. The locality condition says that "no one can use a name received from another one to receive messages". Though this condition affects the expressive power of the model, we do not consider the damage to the expressive power caused by this restriction to be significant: as transfer of receiving capability is implemented by transfer of sending capability in many practical examples, we consider that local programs have enough expressive power for many important and interesting examples. So in this paper we consider local programs only. Theoretical motivations for this restriction are discussed in [12].

Definition 2.3 (free/bound name) 1. For a behavior or an object p, the set of free names of p, fn(p), is defined as follows: fn(0) = ∅, fn(a) = {a} for a name a, fn(ao) = fn(o) ∪ {a}, fn((x)P) = fn(P) \ {x}, fn(!P) = fn(P), fn(τ.P) = fn(P), fn(a(x).P) = (fn(P) \ {x}) ∪ {a} and fn(o1o2) = fn(o1) ∪ fn(o2).
2. For a program P where src(P) = {b1, . . . , bm}, fn(P) = (∪i fn(bi)) \ snk(P). The set of bound names of P (denoted bn(P)) is the set of all names that occur in P but not in fn(P) (including elements of snk(P) even if they do not occur in any element of src(P)).
The role of free names in our model is a little different from that in the π-calculus. For example, a free name x occurring in Q is used as a variable in (x)Q or a(x).Q.
A channel name that is used for communication with the environment is an element of snk, so it is not a free name.

Definition 2.4 (normal program) A program P is normal if for any b ∈ src(P) and any n ∈ fn(b) ∩ snk(P), (b, n) ∈ edge(P), and any program occurring in b is also normal. It is quite natural to assume normality for programs, because a process must know a name to use it. In the rest of this paper we consider normal programs only.

Definition 2.5 (composition) Let P and Q be programs such that src(P) ∩ src(Q) = ∅ and fn(P) ∩ snk(Q) = fn(Q) ∩ snk(P) = ∅. The composition P Q of P and Q is the program such that src(P Q) = src(P) ∪ src(Q), snk(P Q) = snk(P) ∪ snk(Q) and edge(P Q) = edge(P) ∪ edge(Q). Intuitively, P Q is the parallel composition of P and Q. Note that we do not assume snk(P) ∩ snk(Q) = ∅. Obviously P Q = QP and ((P Q)R) = (P (QR)) for any P, Q and R from the definition. The empty graph 0 is the unit of composition. Note that src(P) ∪ src(Q) and edge(P) ∪ edge(Q) denote multiset unions while snk(P) ∪ snk(Q) denotes set union. It is easy to show that for normal and local programs P and Q, P Q is normal and local.

Definition 2.6 (N-closure) For a normal program P and a set of names N such that N ∩ bn(P) = ∅, the N-closure νN(P) is the program such that src(νN(P)) = src(P), snk(νN(P)) = snk(P) ∪ N and edge(νN(P)) = edge(P) ∪ {(b, n) | b ∈ src(P), n ∈ N}. We write νN1(νN2(P)) as νN1νN2(P) for a program P and sets of names N1 and N2.
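The multiset operations of Definitions 2.5 and 2.6 can be sketched with `collections.Counter`. This is a hypothetical illustration, assuming behaviours and names are plain string labels and that the side conditions of the definitions (disjoint src, fn/snk disjointness) hold; they are not checked here.

```python
from collections import Counter

def compose(P, Q):
    """Parallel composition P Q (Definition 2.5)."""
    return {"src": P["src"] + Q["src"],     # multiset union
            "snk": P["snk"] | Q["snk"],     # set union: snk may overlap
            "edge": P["edge"] + Q["edge"]}  # multiset union

def closure(N, P):
    """N-closure (Definition 2.6): add N to snk, connect every behaviour to N."""
    extra = Counter((b, n) for b in P["src"].elements() for n in N)
    return {"src": P["src"],
            "snk": P["snk"] | set(N),
            "edge": P["edge"] + extra}

P = {"src": Counter(["b1"]), "snk": {"a"}, "edge": Counter([("b1", "a")])}
Q = {"src": Counter(["b2"]), "snk": {"a"}, "edge": Counter()}
PQ = compose(P, Q)            # snk(P) ∩ snk(Q) = {"a"} is allowed
C = closure({"n"}, PQ)        # both b1 and b2 are now in the scope of n
```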
Definition 2.7 (deleting a behavior) For a normal program P and b ∈ src(P), P \ b is the program obtained by deleting the node b and the edges connected with b from P. Namely, src(P \ b) = src(P) \ {b}, snk(P \ b) = snk(P) and edge(P \ b) = edge(P) \ {(b, n) | (b, n) ∈ edge(P)}. Note that src(P) \ {b} and edge(P) \ {(b, n) | (b, n) ∈ edge(P)} denote multiset subtractions.

Definition 2.8 (context) Let P be a program and b ∈ src(P) where b is an input prefix, a τ-prefix or a replication and the content of b is 0. A simple first-order context is a graph P[ ] such that the content 0 of b is replaced with a hole "[ ]". We call a simple context a τ-context if the hole is the content of a τ-prefix, an input context if it is the content of an input prefix, and a replication context if it is the content of a replication. Let P be a program such that b ∈ src(P) and b is an application (x)0 Q. An application context P[ ] is a graph obtained by replacing the behavior b with (x)[ ] Q. A simple context is a simple first-order context or an application context. A context is a simple context or the graph P[Q[ ]] that is obtained by replacing the hole of P[ ] with Q[ ] for a simple context P[ ] and a context Q[ ] (with some renaming of the names which occur in Q if necessary). For a context P[ ] and a program Q, P[Q] is the program obtained by replacing the hole in P[ ] by Q (with some renaming of the names which occur in Q if necessary).

2.2. Operational Semantics

We define the operational semantics with a labeled transition system. The substitution of an object into a program, a behavior or an object is defined recursively as follows.

Definition 2.9 (substitution) Let p be a behavior, an object or a program and o be an object. For a name a, we assume that a ∈ fn(p). The mapping [o/a] defined as follows is a substitution.
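Definition 2.7 above is a one-occurrence multiset subtraction plus an edge filter. A small Python sketch on the same hypothetical dict/Counter encoding used earlier (an illustration, not the paper's notation):

```python
from collections import Counter

def delete(P, b):
    """P \\ b: remove one occurrence of b and every edge incident to b."""
    return {"src": P["src"] - Counter([b]),     # multiset subtraction src(P) \ {b}
            "snk": set(P["snk"]),               # snk is unchanged
            "edge": Counter({e: m for e, m in P["edge"].items()
                             if e[0] != b})}    # drop edges (b, n)

P = {"src": Counter(["b1", "b2"]), "snk": {"n"},
     "edge": Counter([("b1", "n"), ("b2", "n")])}
D = delete(P, "b1")
```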
• for a name c, c[o/a] = o if c = a, and c[o/a] = c otherwise,
• for behaviors, ((x)P)[o/a] = (x)(P[o/a]), (o1 o2)[o/a] = o1[o/a] o2[o/a], (!P)[o/a] = !(P[o/a]), (c(x).P)[o/a] = c(x).(P[o/a]) and (τ.P)[o/a] = τ.(P[o/a]),
• and for a program P with a ∈ fn(P), P[o/a] = P′ where P′ is the program such that src(P′) = {b[o/a] | b ∈ src(P)}, snk(P′) = snk(P) and edge(P′) = {(b[o/a], n) | (b, n) ∈ edge(P)}.

For the cases of abstraction and input prefix, note that we can assume x ≠ a, because a ∈ fn((x)P) or a ∈ fn(c(x).P), without losing generality. (We can rename x if necessary.)

Definition 2.10 Let p be a local program or a local object. A substitution [a/x] is acceptable for p if for any input prefix c(y).Q occurring in p, x ≠ c. In the rest of this paper, we consider only acceptable substitutions for a program or an abstraction, because in any execution of a local program, if a substitution is applied by one of the rules of the operational semantics then it is acceptable. Namely, we assume that [o/a] is applied only to objects such that a does not occur as the subject of any input prefix. It is easy to show from the definitions that substitution and N-closure distribute over composition and "\".
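The clauses of Definition 2.9 can be sketched on a tuple encoding of behaviours (hypothetical, for illustration): ("name", c), ("abs", x, P), ("app", o1, o2), ("bang", P), ("tau", P), ("in", c, x, P). The input clause leaves the subject c untouched, which is sound only for acceptable substitutions in the sense of Definition 2.10; α-renaming (x ≠ a) is assumed rather than performed.

```python
def subst(p, o, a):
    """p[o/a]: replace free occurrences of name a by object o (Definition 2.9)."""
    tag = p[0]
    if tag == "name":                           # c[o/a] = o if c = a, else c
        return o if p[1] == a else p
    if tag == "abs":                            # ((x)P)[o/a] = (x)(P[o/a]); x != a assumed
        return ("abs", p[1], subst(p[2], o, a))
    if tag == "app":                            # (o1 o2)[o/a] = o1[o/a] o2[o/a]
        return ("app", subst(p[1], o, a), subst(p[2], o, a))
    if tag == "bang":                           # (!P)[o/a] = !(P[o/a])
        return ("bang", subst(p[1], o, a))
    if tag == "tau":                            # (τ.P)[o/a] = τ.(P[o/a])
        return ("tau", subst(p[1], o, a))
    if tag == "in":                             # (c(x).P)[o/a] = c(x).(P[o/a]); c != a: acceptable
        return ("in", p[1], p[2], subst(p[3], o, a))
    return p                                    # ("zero",) and other name-free nodes
```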
Definition 2.11 (action) For a name a and an object o, an input action is a tuple of a and o denoted as a(o), and an output action is a tuple denoted as a⟨o⟩. An action is a silent action τ, an output action or an input action.

Definition 2.12 (labeled transition) For an action α, −α→ is the least binary relation on normal programs that satisfies the following rules.
input: If b ∈ src(P) and b = a(x).Q, then P −a(o)→ (P \ b) ν{n | (b, n) ∈ edge(P)}νM(Q[o/x]) for an object o and a set of names M such that fn(o) ∩ snk(P) ⊆ M ⊆ fn(o) \ fn(P).
β-conversion: If b ∈ src(P) and b = ((y)Q)o, then P −τ→ (P \ b) ν{n | (b, n) ∈ edge(P)}(Q[o/y]).
τ-action: If b ∈ src(P) and b = τ.Q, then P −τ→ (P \ b) ν{n | (b, n) ∈ edge(P)}(Q).
replication 1: P −α→ P′ if !Q = b ∈ src(P) and P ν{n | (b, n) ∈ edge(P)}Q′ −α→ P′, where Q′ is a program obtained from Q by renaming all names in snk(R) to distinct fresh names that do not occur elsewhere in P nor in programs executed in parallel with P, for all R's where each R is a program that occurs in Q (including Q itself).
replication 2: P −τ→ P′ if !Q = b ∈ src(P) and P ν{n | (b, n) ∈ edge(P)}(Q1 Q2) −τ→ P′, where each Qi (i = 1, 2) is a program obtained from Q by renaming all names in snk(R) to distinct fresh names that do not occur elsewhere in P nor in programs executed in parallel with P, for all R's where each R is a program that occurs in Q (including Q itself).
output: If b ∈ src(P) and b = a⟨v⟩, then P −a⟨v⟩→ P \ b.
communication: If b1, b2 ∈ src(P), b1 = a⟨o⟩ and b2 = a(x).Q, then P −τ→ ((P \ b1) \ b2) ν{n | (b2, n) ∈ edge(P)}ν(fn(o) ∩ snk(P))(Q[o/x]).
In all rules above except replication 1/2, the behavior that triggers an action is removed from src(P). Then the edges from the removed behavior no longer exist after the action. The set of names M that occurs in the input rule is the set of local names imported by the input action. Some names in M may be new to P, and others may already be known to P even though b is not in their scope.
We can show that for any programs P and P′ and any action α such that P −α→ P′, if P is local then P′ is local, and if P is normal then P′ is normal.
Proposition 2.1 For any normal programs P, P′ and Q, and any action α, if P −α→ P′ then P Q −α→ P′ Q.
proof (outline): By induction on the number of replication 1/2 rules used to derive P −α→ P′.
Proposition 2.2 For any programs P, Q and R and any action α, if P Q −α→ R is derived by one of input, β-conversion, τ-action or output immediately, then R = P′ Q for some P −α→ P′ or R = P Q′ for some Q −α→ Q′.
proof (outline): Straightforward from the definition.
Proposition 2.3 If Q −a⟨o⟩→ Q′ and R −a(o)→ R′, then Q R −τ→ Q′ R′ (and R Q −τ→ R′ Q′).
proof (outline): By induction on the total number of replication 1/2 rules used to derive Q −a⟨o⟩→ Q′ and R −a(o)→ R′.
2.3. Behavioral Equivalence

The strong bisimulation relation is defined as usual. It is easy to show that ∼ as defined in Definition 2.13 is an equivalence relation.

Definition 2.13 (strong bisimulation equivalence) A binary relation R on normal programs is a strong bisimulation if for any (P, Q) ∈ R (or (Q, P) ∈ R), for any α and P′, if P −α→ P′ then there exists Q′ such that Q −α→ Q′ and (P′, Q′) ∈ R ((Q′, P′) ∈ R), and for any Q −α→ Q′ the similar condition holds. Strong bisimulation equivalence ∼ is defined as ⋃{R | R is a strong bisimulation}.

The following proposition is straightforward from the definitions.

Proposition 2.4 If src(P1) = src(P2) then P1 ∼ P2.

We can show the congruence results of strong bisimulation equivalence [10] as Propositions 2.5–2.10 and Theorem 2.1. First we have the congruence result w.r.t. composition.

Proposition 2.5 For any program R, if P ∼ Q then P R ∼ QR.

The following Propositions 2.6–2.9 say that ∼ is a congruence relation w.r.t. τ-prefix, replication, input prefix and application respectively.

Proposition 2.6 For any P and Q such that P ∼ Q and for any τ-context R[ ], R[P] ∼ R[Q].
Proposition 2.7 For any P and Q such that P ∼ Q and for any replication context R[ ], R[P] ∼ R[Q].
Proposition 2.8 For any P and Q such that P ∼ Q and for any input context R[ ], R[P] ∼ R[Q].
Proposition 2.9 For any P and Q such that P ∼ Q and for any application context R[ ], R[P] ∼ R[Q].

From Propositions 2.6–2.9, we have the following result by induction on the definition of context.

Theorem 2.1 For any P and Q such that P ∼ Q and for any context R[ ], R[P] ∼ R[Q].

For the asynchronous π-calculus, the congruence result w.r.t. name restriction, "P ∼ Q implies νxP ∼ νxQ", has also been reported. We can show the corresponding result by an argument similar to the first-order case [7].

Proposition 2.10 For any P and Q and a set of names N such that N ∩ (bn(P) ∪ bn(Q)) = ∅, if P ∼ Q then νN(P) ∼ νN(Q).
3. Scope Equivalence

This section presents an equivalence relation on programs which ensures that two systems are equivalent both in their behavior and in the scopes of names.

Definition 3.1 For a process graph P and a name n ∈ snk(P), P/n is the program defined as follows: src(P/n) = {b | b ∈ src(P), (b, n) ∈ edge(P)}, snk(P/n) = snk(P) \ {n} and edge(P/n) = {(b, a) | b ∈ src(P/n), a ∈ snk(P/n), (b, a) ∈ edge(P)}.
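Definition 3.1 keeps exactly the behaviours connected to n. A hypothetical Python sketch on the dict/Counter graph encoding used above (for illustration only):

```python
from collections import Counter

def restrict(P, n):
    """P/n (Definition 3.1): the subsystem of P in the scope of n."""
    src = Counter({b: m for b, m in P["src"].items()
                   if (b, n) in P["edge"]})            # behaviours connected to n
    snk = P["snk"] - {n}                               # n itself is removed
    edge = Counter({(b, a): m for (b, a), m in P["edge"].items()
                    if b in src and a in snk})         # edges between survivors
    return {"src": src, "snk": snk, "edge": edge}

# toy graph: b1 sees n1 only, b2 sees n2 only
P = {"src": Counter(["b1", "b2"]), "snk": {"n1", "n2"},
     "edge": Counter([("b1", "n1"), ("b2", "n2")])}
Pn = restrict(P, "n1")
```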
Figure 9. The graph P/a1 .
Intuitively, P/n is the subsystem of P that consists of the behaviors which are in the scope of n. Let P be the example of Figure 2; P/a1 is the subgraph of Figure 2 obtained by removing the node b3 (and the edge from b3 to a2) and a1 (and the edges to a1), as shown in Figure 9. It consists of the process nodes b1 and b2 and the name node a2.

The following propositions are straightforward from the definitions. We will refer to them in the proofs of the congruence results w.r.t. scope equivalence, which is defined below.

Proposition 3.1 For any P, Q and n ∈ snk(P) ∪ snk(Q), (P Q)/n = P/n Q/n.
Proposition 3.2 For a program P, a set of names N such that N ∩ bn(P) = ∅ and n ∈ snk(P), (νN(P))/n = νN(P/n).
Proposition 3.3 Let R[ ] be a context and P be a program. For any name m ∈ snk(R), (R[P])/m = R/m[P].

Definition 3.2 (scope bisimulation) A binary relation R on programs is a scope bisimulation if for any (P, Q) ∈ R,
1. P = 0 iff Q = 0,
2. src(P/n) = ∅ iff src(Q/n) = ∅ for any n ∈ snk(P) ∩ snk(Q),
3. P/n ∼ Q/n for any n ∈ snk(P) ∩ snk(Q), and
4. R is a strong bisimulation.
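Conditions 1–3 of Definition 3.2 are directly checkable given an oracle for strong bisimilarity ∼ (condition 4 quantifies over the whole relation and is not checked here). A hypothetical sketch on the dict/Counter encoding, where "P = 0" is approximated by an empty src and the toy oracle exploits Proposition 2.4 by treating all behaviour nodes as copies of one and the same process:

```python
from collections import Counter

def restrict(P, n):   # P/n, Definition 3.1
    src = Counter({b: m for b, m in P["src"].items() if (b, n) in P["edge"]})
    snk = P["snk"] - {n}
    edge = Counter({(b, a): m for (b, a), m in P["edge"].items()
                    if b in src and a in snk})
    return {"src": src, "snk": snk, "edge": edge}

def scope_conditions(P, Q, bisim):
    """Conditions 1-3 of Definition 3.2; `bisim` is an oracle for ~."""
    if (not P["src"]) != (not Q["src"]):            # 1: P = 0 iff Q = 0 (proxy)
        return False
    for n in P["snk"] & Q["snk"]:
        Pn, Qn = restrict(P, n), restrict(Q, n)
        if (not Pn["src"]) != (not Qn["src"]):      # 2: emptiness of src agrees
            return False
        if not bisim(Pn, Qn):                       # 3: P/n ~ Q/n
            return False
    return True

# every node below stands for the same process, so the toy oracle
# only compares node counts (justified by Proposition 2.4)
toy = lambda A, B: sum(A["src"].values()) == sum(B["src"].values())
P = {"src": Counter(["b1", "b2"]), "snk": {"n1", "n2"},
     "edge": Counter([("b1", "n1"), ("b2", "n2")])}
Q = {"src": Counter(["b"]), "snk": {"n1", "n2"},
     "edge": Counter([("b", "n1"), ("b", "n2")])}
Q2 = {"src": Counter(["b"]), "snk": {"n1", "n2"},
      "edge": Counter([("b", "n1")])}      # nothing in the scope of n2
ok = scope_conditions(P, Q, toy)           # conditions 1-3 hold
bad = scope_conditions(P, Q2, toy)         # fails condition 2 at n2
```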
It is easy to show that the union of all scope bisimulations is a scope bisimulation, and that it is the unique largest scope bisimulation.

Definition 3.3 (scope equivalence) The largest scope bisimulation is called scope equivalence and denoted as ⊥. It is obvious from the definition that ⊥ is an equivalence relation. The motivation and background of the definition of ⊥ are reported in [6,8]. As ⊥ is a strong bisimulation by condition 4 of Definition 3.2, we have the following proposition.

Proposition 3.4 P ⊥ Q implies P ∼ Q.

Definition 3.4 (scope bisimulation up to ⊥) A binary relation R on programs is a scope bisimulation up to ⊥ if for any (P, Q) ∈ R,
1. P = 0 iff Q = 0,
2. src(P/n) = ∅ iff src(Q/n) = ∅ for any n ∈ snk(P) ∩ snk(Q),
3. P/n ∼ Q/n for any n ∈ snk(P) ∩ snk(Q), and
4. R is a strong bisimulation up to ⊥, namely for any P and Q such that (P, Q) ∈ R (or (Q, P) ∈ R) and for any P′ such that P −α→ P′, there exists Q′ such that Q −α→ Q′ and P′ ⊥R⊥ Q′ (Q′ ⊥R⊥ P′).
The following proposition is straightforward from the definition and the transitivity of “⊥”.
Figure 10. Graph representation of P1 and m⟨o⟩.
Proposition 3.5 If R is a scope bisimulation up to ⊥, then ⊥R⊥ is a scope bisimulation.

Proposition 3.6 If b ∈ src(P) and !Q = b, then P ν{n | (b, n) ∈ edge(P)}Q′ ⊥ P, where Q′ is a program obtained from Q by renaming the names in bn(Q) to fresh names.
proof (outline): We have the result by showing that the relation {(P ν{n | (b, n) ∈ edge(P)}Q′, P) | !Q ∈ src(P), Q′ is obtained from Q by fresh renaming of bn(Q)} ∪ ⊥ is a scope bisimulation up to ⊥, and applying Proposition 3.5.

Example 3.1 Consider the following (asynchronous) π-calculus processes: P1 = m(x).τ.Q and P2 = νn(m(u).n⟨a⟩ | n(x).Q). Assume that neither x nor n occurs in Q. P1 and P2 are strongly bisimilar. Consider the case that a message m⟨o⟩ is received by Pi (i = 1, 2). In P1, the object o reaches τ.Q by the execution of m(o). On the other hand, o does not reach Q in the case of P2. Assume that o is so confidential that it must not be received by any unauthorized process, and that Q and τ.Q are not authorized. (Here, we consider that even just receiving is not allowed, even if the data is not stored.) Then P1 is an illegal process but P2 is not. Thus P1 and P2 should be distinguished, but they cannot be distinguished by the usual behavioral equivalences of the π-calculus. Furthermore, we cannot see whether o reached an unauthorised process just from the resulting processes Q and νnQ. This means that for a system implemented in a programming language based on a model such as the π-calculus, if the system is optimized into a behaviourally equivalent one without taking care of the scopes, the security of the original system may be damaged.

One may say that stronger equivalence relations such as syntactic equivalence or structural congruence work. Of course, syntactic equivalence can distinguish these two cases, but it is not convenient. How about structural congruence? Unfortunately, it is not successful.
It is easy to give an example of processes such that P2 ≢ P3 but both of them are legal (and behaviourally equivalent), for example P3 = νn1νn2(m(u).(n1⟨a1⟩ | n2⟨a2⟩) | n1(x1).n2(x2).Q). (Furthermore, we can also give an example of two processes which are structurally congruent where one of them is legal but the other is not.) We can use the bipartite directed acyclic graph model presented here to distinguish P1 and P2. The example that corresponds to the system consisting of P1 and the message m⟨o⟩ is given by the graph on the left of Figure 10.¹ This graph evolves to the graph on the right of Figure 10 (in the case that o is a name), which corresponds to the π-calculus process Q. This graph explicitly denotes that Q is in the scope of the newly imported name o. On the other hand, the example of P2 with m⟨o⟩ is on the left of Figure 11. After receiving the message carrying o, the graph evolves into the graph on the right of Figure 11. This explicitly shows that Q is not in the scope of o. We can see this difference by showing P1 ⊥̸ P2. One may consider that an equivalence relation similar to ⊥ can be defined on a model based on process algebra, for example, by encoding a graph into an algebraic term. However, it is
¹The sink nodes corresponding to n are not depicted in the following examples.
Figure 11. Graph representation of P2 and m⟨o⟩.
Figure 12. Graphs P and Q.
not easy to define an operational semantics for which we can enjoy the merits of an algebraic model by a naive encoding of the graph model. In particular, it seems difficult to give an orthodox structural operational semantics or a reduction semantics consisting of a set of rules that rewrite subterms locally. We consider that some non-trivial idea is needed for such an encoding.
4. Congruence Results of Scope Equivalence

This section discusses the congruence properties of scope equivalence.

4.1. Congruence Results w.r.t. Composition, τ-prefix and Replication

The next proposition says that ⊥ is a congruence relation w.r.t. composition.

Proposition 4.1 If P ⊥ Q then P R ⊥ QR.
proof: See Appendix I.

The following proposition is also straightforward from the definitions.

Proposition 4.2 For any programs P and Q, let P′ and Q′ be programs obtained from P and Q respectively by renaming n ∈ snk(P) ∩ snk(Q) to a fresh name n′. If P ⊥ Q then P′ ⊥ Q′.

The following proposition is the congruence result of ⊥ w.r.t. new names.

Proposition 4.3 For any P and Q and a set of names N such that N ∩ (bn(P) ∪ bn(Q)) = ∅, if P ⊥ Q then νN(P) ⊥ νN(Q).
proof (outline): We show that the following relation is a scope bisimulation: {(νN(P), νN(Q)) | P ⊥ Q}. Conditions 1 and 2 of Definition 3.2 are straightforward from the definition. Condition 3 follows from Propositions 3.2 and 2.10. Condition 4 is by induction on the number of applications of replication 1/2.
Proposition 4.4 For any P and Q such that P ⊥ Q and for any τ-context R[ ], R[P] ⊥ R[Q].
proof: See Appendix II.

Proposition 4.5 For any P and Q such that P ⊥ Q and for any replication context R[ ], R[P] ⊥ R[Q].
proof: See Appendix III.

4.2. Input and Application Context

We can show that strong bisimulation equivalence is congruent w.r.t. input prefix contexts and application contexts [10]. Unfortunately, this is not the case for the scope equivalence of higher-order programs. Our results show that ⊥ is congruent neither w.r.t. input contexts nor w.r.t. application contexts. The essential problem is that ⊥ is not congruent w.r.t. substitution of abstractions, as the following counterexample shows.

Example 4.1 (i) Let P be the graph such that src(P) = {b1, b2}, edge(P) = {(b1, n1), (b2, n2)} and snk(P) = {n1, n2}, and Q be the graph such that src(Q) = {b}, edge(Q) = {(b, n1), (b, n2)} and snk(Q) = {n1, n2}, where both b and bi (i = 1, 2) are !x⟨a⟩, as in Figure 12. Note that nj (j = 1, 2) occurs neither in b nor in bi (i = 1, 2).

Lemma 4.1 Let P and Q be as in Example 4.1 (i). Then we have P ⊥ Q.
proof (outline): We show that the relation {(P, Q)} is a scope bisimulation. Condition 1 of Definition 3.2 is obvious as neither P nor Q is an empty graph. For nj (j = 1, 2), both P/nj and Q/nj are not 0, so condition 2 holds. For condition 3, P/nj is the graph such that src(P/nj) = {bj} and Q/nj is the graph such that src(Q/nj) = {b}. As bi = b = !x⟨a⟩, src(P/nj) = src(Q/nj). From Proposition 2.4, P/nj ∼ Q/nj. For condition 4, it is easy to show that the relation {(P, Q)} is a bisimulation because P −x⟨a⟩→ P and Q −x⟨a⟩→ Q are the only transitions for P and Q respectively.

Example 4.1 (ii) Let P and Q be as in Example 4.1 (i). Now, let o be the abstraction (y)c(u).d(v).R where R is a program. P[o/x] is the graph such that src(P[o/x]) = {b1[o/x], b2[o/x]}, snk(P[o/x]) = {n1, n2} and edge(P[o/x]) = {(b1[o/x], n1), (b2[o/x], n2)}, as in Figure 13a, top. And Q[o/x] is the graph such that src(Q[o/x]) = {b[o/x]}, snk(Q[o/x]) = {n1, n2} and edge(Q[o/x]) = {(b[o/x], n1), (b[o/x], n2)}, where b[o/x] and bi[o/x] (i = 1, 2) are !((y)c(u).d(v).R)⟨a⟩, as in Figure 13b, top.

Lemma 4.2 Let P[o/x] and Q[o/x] be as in Example 4.1 (ii). Then, P[o/x] ⊥̸ Q[o/x].
proof: See Appendix IV.

Note that the object o in the counterexample is an abstraction. This incongruence happens only in the case of higher-order substitution. In fact, scope equivalence is congruent w.r.t. substitution of any first-order term, by an argument similar to that presented in [9]. From Lemmas 4.1 and 4.2, we have the following results.

Proposition 4.6 There exist P and Q such that P ⊥ Q but P[o/x] ⊥̸ Q[o/x] for some object o.

Proposition 4.7 There exist P and Q such that P ⊥ Q but I[P] ⊥̸ I[Q] for some input context I[ ].
Figure 13. Transitions of P[o/x] (a) and Q[o/x] (b).
proof (outline): Let P and Q be as in Example 4.1 (i) and let I[ ] be an input context with a behavior m(x).[ ]. Consider the transitions I[P] −m(o)→ P[o/x] and I[Q] −m(o)→ Q[o/x] for the o of Example 4.1 (ii).

Proposition 4.8 There exist P and Q such that P ⊥ Q but A[P] ⊥̸ A[Q] for some application context A[ ].
proof (outline): Let P, Q and o be as in Example 4.1 (ii) and let A[ ] be an application context with a behavior ((x)[ ])o.
5. Conclusion and Future Work

This paper presented congruence results for scope equivalence w.r.t. new names, composition, τ-prefix and replication for a model with higher-order communication. We also showed that scope equivalence is congruent neither w.r.t. input contexts nor w.r.t. application contexts. As we presented in [9], scope equivalence is congruent w.r.t. input contexts in the first-order case. Thus, the non-congruence problem arises from higher-order substitutions. The lack of substitutivity of the equivalence relation makes analysis and verification of systems difficult. We will study this problem via the following approaches as future work.

The first approach is a revision of the definition of scope equivalence. The definition of ⊥ is based on the idea that two processes are equivalent if, for each name, the components that know the name are equivalent. This idea is implemented as condition 3 of Definition 3.2. Our alternative idea for the third condition is P/N ∼ Q/N for each subset N of common private names
instead of P/n ∼ Q/n. The P and Q of Lemma 4.1 are not equivalent under this definition. We should study whether this alternative definition is suitable for the equivalence of processes.

The second approach involves the counterexample. As the counterexample presented in Section 4.2 is an artificial one, we should study whether there are any practical examples.

Finally, we must reconsider our model of higher-order communication. In our model, an output message has the same form as a tuple of a process variable that receives a higher-order term and an argument term. This idea is from LHOπ [12]. One of the main reasons why LHOπ adopts this approach is type-theoretical convenience. As we saw in Lemma 4.2, this identification of output messages and process variables causes the problem with congruence. Thus we should reconsider the model of higher-order communication used.
References
[1] Martin Abadi and Andrew D. Gordon. A Calculus for Cryptographic Protocols: Spi Calculus. Information and Computation, 148, pp. 1-70, 1999.
[2] Hartmut Ehrig and Barbara König. Deriving Bisimulation Congruences in the DPO Approach to Graph Rewriting with Borrowed Contexts. Mathematical Structures in Computer Science, vol. 16, no. 6, pp. 1133-1163, 2006.
[3] Fabio Gadducci. Term Graph Rewriting for the π-calculus. Proc. of APLAS '03 (Programming Languages and Systems), LNCS 2895, pp. 37-54, 2003.
[4] Barbara König. A Graph Rewriting Semantics for the Polyadic π-Calculus. Proc. of GT-VMT '00 (Workshop on Graph Transformation and Visual Modeling Techniques), pp. 451-458, 2000.
[5] Robin Milner. Bigraphical Reactive Systems. Proc. of CONCUR '01, LNCS 2154, Springer, pp. 16-35, 2001.
[6] Masaki Murakami. A Formal Model of Concurrent Systems Based on Bipartite Directed Acyclic Graph. Science of Computer Programming, Elsevier, 61, pp. 38-47, 2006.
[7] Masaki Murakami. Congruence Results of Behavioral Equivalence for a Graph Rewriting Model of Concurrent Programs. Proc. of ICITA 2008, pp. 636-641, 2008 (to appear in IJTMS, Inderscience).
[8] Masaki Murakami. A Graph Rewriting Model of Concurrent Programs with Higher-Order Communication. Proc. of TMFCS 2008, pp. 80-87, 2008.
[9] Masaki Murakami. Congruence Results of Scope Equivalence for a Graph Rewriting Model of Concurrent Programs. Proc. of ICTAC 2008, LNCS 5160, pp. 243-257, 2008.
[10] Masaki Murakami. Congruence Results of Behavioral Equivalence for a Graph Rewriting Model of Concurrent Programs with Higher-Order Communication. Submitted to FST-TCS 2009.
[11] Davide Sangiorgi and David Walker. The π-calculus: A Theory of Mobile Processes. Cambridge University Press, 2001.
[12] Davide Sangiorgi. Asynchronous Process Calculi: The First- and Higher-order Paradigms. Theoretical Computer Science, 253, pp. 311-350, 2001.
[13] Vladimiro Sassone and Paweł Sobociński. Reactive Systems over Cospans. Proc. of LICS '05, IEEE, pp. 311-320, 2005.
Appendix I: Proof of Proposition 4.1 (outline)

We can show that the following relation R is a scope bisimulation: R = {(P R, QR) | P ⊥ Q}. Condition 1 of Definition 3.2 is straightforward from the definition of composition. Condition 2 is also straightforward from Proposition 3.1 and the definition of composition. Condition 3 follows from Propositions 2.5 and 3.1.
Condition 4 is by induction on the number of replication 1/2 rules used to derive P R −α→ P′. If it is derived by one of input, β-conversion, τ-action or output immediately, there exists Q′ such that QR −α→ Q′ from Propositions 2.2 and 2.1, and (P′, Q′) ∈ R as ⊥ is a bisimulation. For the case that it is derived from the communication rule immediately, we consider two cases. First, if both b1 and b2 are in one of src(P) or src(R), we can show the existence of Q′ such that QR −τ→ Q′ and P′ ⊥ Q′ by an argument similar to the cases of input etc. mentioned above. The second case is that one of b1 and b2 is in src(P) and the other is in src(R). If b1 is in src(P), then P −a⟨o⟩→ P1 by output and R −a(o)→ R1 by input. From P ⊥ Q, Q −a⟨o⟩→ Q1 and P1 ⊥ Q1. From Proposition 2.3, QR −τ→ Q1R1, and (P1R1, Q1R1) ∈ R from the definition. The case that b2 is in src(P) is similar.

Consider the case that P R −α→ P′ is derived by applying k + 1 replication 1/2 rules. If the (k + 1)-th rule is replication 1, then b = !S ∈ src(P R). First we consider b ∈ src(P). From the premises of replication 1, P R ν{n | (b, n) ∈ edge(P R)}S′ −α→ P′. As b ∈ src(P), ν{n | (b, n) ∈ edge(P)}S′ = ν{n | (b, n) ∈ edge(P R)}S′. From Proposition 3.6 and the transitivity of ⊥, P ν{n | (b, n) ∈ edge(P)}S′ ⊥ Q. Thus, (P ν{n | (b, n) ∈ edge(P)}S′ R, QR) ∈ R. And P R ν{n | (b, n) ∈ edge(P)}S′ −α→ P′ is derived by applying k replication 1/2 rules. From the inductive hypothesis, there exists Q′ such that QR −α→ Q′ and P′ ⊥ Q′.
If b ∈ src(R), then ν{n | (b, n) ∈ edge(P R)}S′ = ν{n | (b, n) ∈ edge(R)}S′. From the premises of replication 1, P R ν{n | (b, n) ∈ edge(R)}S′ −α→ P′ with k applications of replication 1/2. As (P R ν{n | (b, n) ∈ edge(R)}S′, QR ν{n | (b, n) ∈ edge(R)}S′) ∈ R, there exists Q′ such that QR ν{n | (b, n) ∈ edge(R)}S′ −α→ Q′ and P′ ⊥ Q′ from the inductive hypothesis. As b = !S ∈ src(QR), QR −α→ Q′ by replication 1. The case of replication 2 is similar.
Appendix II: Proof of Proposition 4.4 (outline)

We have the result by showing that the following relation R is a scope bisimulation: R = {(R[P1], R[P2]) | P1 ⊥ P2, R[ ] is a τ-context} ∪ ⊥. Condition 1 of Definition 3.2 is straightforward from the definitions. Condition 2 is from Proposition 3.3. Condition 3 is from Propositions 3.3, 3.4 and 2.6. For condition 4, we can assume that R[ ] has the form τ.[ ] R1, where τ.[ ] is a context that consists of just one behavior node that is a τ-prefix with a hole. Then any transition R[P1] −α→ P1′ of R[P1] is derived by application of the τ-rule to τ.[P1] or is caused by a transition of R1. In the first case, P1′ has the form νN(P1) R1. Similarly, there exists a transition R[P2] −α→ νN(P2) R1 for R[P2]. As P1 ⊥ P2, we have νN(P1) R1 ⊥ νN(P2) R1 from Propositions 4.1 and 4.3. If the transition is derived by applying some rule to R1, P1′ has the form τ.[P1] R1′ where R1 −α→ R1′. Then we have τ.[P2] R1 −α→ τ.[P2] R1′ from Proposition 2.1, and (τ.[P1] R1′, τ.[P2] R1′) ∈ R.
Appendix III: Proof of Proposition 4.5 (outline)

We can show the result by showing that the following relation R is a scope bisimulation up to ⊥, and applying Proposition 3.5: R = {(R[P1], R[P2]) | P1 ⊥ P2, R[ ] is a replication context} ∪ ⊥. Condition 1 of Definition 3.4 is straightforward from the definitions. Condition 2 is from Proposition 3.3. Condition 3 is from Propositions 3.3, 3.4 and 2.7.

Condition 4 is by induction on the number of replication rules used to derive R[P1] −α→ R1′. We can assume that R[ ] has the form ![ ] R1, where ![ ] is a context that consists of just one behavior node that is a replication of a hole.

If R[P1] = ![P1] R1 −α→ R1′ is derived without any application of replication 1/2, it is a transition of R1. For this case, we can show that there exists R2′ such that R[P2] −α→ R2′ and (R1′, R2′) ∈ R by an argument similar to the proof of Proposition 4.4. Now we go into the induction step. If replication 1/2 is applied to R1, then we can show the result in a similar way to the base case again.

We consider the case that R[P1] −α→ R1′ is derived by replication 1 for ![P1]. Then ![P1] ν{n | (![P1], n) ∈ edge(R[P1])}(P1′) R1 −α→ R1′, where P1′ is a renaming of P1. By the induction hypothesis, there exists R2″ such that ![P2] ν{n | (![P1], n) ∈ edge(R[P1])}(P1′) R1 −α→ R2″ and (R1′, R2″) ∈ R. From Proposition 4.2, P1 ⊥ P2 implies P1′ ⊥ P2′, and we have ![P2] ν{n | (![P1], n) ∈ edge(R[P1])}(P1′) R1 ⊥ ![P2] ν{n | (![P1], n) ∈ edge(R[P1])}(P2′) R1 from Propositions 4.3 and 4.1. From this, we have ![P2] ν{n | (![P1], n) ∈ edge(R[P1])}(P2′) R1 ⊥ ![P2] ν{n | (![P2], n) ∈ edge(R[P2])}(P2′) R1 as {n | (![P1], n) ∈ edge(R[P1])} = {n | (![P2], n) ∈ edge(R[P2])}. Then there exists R2′ such that ![P2] ν{n | (![P2], n) ∈ edge(R[P2])}(P2′) R1 −α→ R2′ and R2″ ⊥ R2′. Thus (R1′, R2′) ∈ ⊥R⊥. The case of replication 2 is similar, and then R is a scope bisimulation up to ⊥.

Appendix IV: Proof of Lemma 4.2 (outline)

We show that no relation R with (P[o/x], Q[o/x]) ∈ R is a scope bisimulation. If R is a scope bisimulation, then R is a strong bisimulation from Definition 3.2. Then for any P[o/x]′ such that P[o/x] −α→ P[o/x]′, there exists Q[o/x]′ such that Q[o/x] −α→ Q[o/x]′ and (P[o/x]′, Q[o/x]′) ∈ R. From replication 1 and β-conversion, we have P[o/x]′ such that src(P[o/x]′) = {b′} ∪ src(P[o/x]) where b′ = c(u).d(v).R, snk(P[o/x]′) = snk(P[o/x]) and edge(P[o/x]′) = edge(P[o/x]) ∪ {(b′, n1)}, for α = τ (Figure 13a, middle). On the other hand, the only τ-transition of Q[o/x] is Q[o/x] −τ→ Q[o/x]′ where src(Q[o/x]′) = {b′} ∪ src(Q[o/x]), b′ = c(u).d(v).R, snk(Q[o/x]′) = snk(Q[o/x]) and edge(Q[o/x]′) = edge(Q[o/x]) ∪ {(b′, n1), (b′, n2)} (Figure 13b, middle), by replication 1 and β-conversion.
If R is a scope bisimulation, there exists Q[o/x]″ such that Q[o/x]′ −c(m)→ Q[o/x]″ and (P[o/x]″, Q[o/x]″) ∈ R for any P[o/x]′ −c(m)→ P[o/x]″. Let P[o/x]″ be the graph such that src(P[o/x]″) = {b″} ∪ src(P[o/x]) where b″ = d(v).R[m/u], snk(P[o/x]″) = snk(P[o/x]) and edge(P[o/x]″) = edge(P[o/x]) ∪ {(b″, n1)}, obtained by applying the input rule (Figure 13a, bottom). The only transition of Q[o/x]′ by c(m) makes src(Q[o/x]″) = {b″} ∪ src(Q[o/x]) where b″ = d(v).R[m/u], snk(Q[o/x]″) = snk(Q[o/x]) and edge(Q[o/x]″) = edge(Q[o/x]) ∪ {(b″, n1), (b″, n2)} (Figure 13b, bottom). Then (P[o/x]″, Q[o/x]″) is in R if R is a bisimulation. However, (P[o/x]″, Q[o/x]″) does not satisfy condition 3 of Definition 3.2 because P[o/x]″/n2 and Q[o/x]″/n2 (Figure 14) are not strongly bisimilar. Thus R cannot be a scope bisimulation.

Figure 14. P[o/x]″/n2 and Q[o/x]″/n2.
Communicating Process Architectures 2009 P.H. Welch et al. (Eds.) IOS Press, 2009 © 2009 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-60750-065-0-67
Analysing gCSP Models Using Runtime and Model Analysis Algorithms

M. M. BEZEMER, M. A. GROOTHUIS and J. F. BROENINK
Control Engineering, Faculty EEMCS, University of Twente, P.O. Box 217, 7500 AE Enschede, The Netherlands
{M.M.Bezemer, M.A.Groothuis, J.F.Broenink}@utwente.nl

Abstract. This paper presents two algorithms for analysing gCSP models in order to improve their execution performance. Designers tend to create many small separate processes for each task, which results in many (resource-intensive) context switches. The research challenge is to convert the model created from a design point of view into models which have better performance during execution, without limiting the designers in their ways of working. The first algorithm analyses the model during run-time execution in order to find static sequential execution traces that allow for optimisation. The second algorithm analyses the gCSP model for multi-core execution. It tries to find a resource-efficient placement on the available cores for the given target systems. Both algorithms are implemented in two tools and are tested. We conclude that both algorithms complement each other and that the analysis results are suitable for creating optimised models.

Keywords. CSP, embedded systems, gCSP, process scheduling, traces, process transformation
Introduction

Increasingly, machines and consumer applications contain embedded systems to perform their main operations. Embedded systems are fed by signals from the outside world and use these signals to control the machine. Designing embedded systems becomes increasingly complex as the requirements grow. To aid developers with this complexity, Communicating Sequential Processes (CSP) [1, 2] can be used. The Control Engineering (CE) group has created a tool to graphically design and debug CSP models, called gCSP [3, 4]. The generated code makes use of the Communicating Threads (CT) library [5], which provides a C++ framework for the CSP language. An overview is shown in Figure 1.
Figure 1. gCSP & CT framework.
Figure 2. Locations of the analysers.
gCSP allows designers to create models and submodels from their designer’s point of view. When executing a model, it is desired to have a fast executing program that corresponds
M.M. Bezemer et al. / Analysing gCSP Models Using Runtime and Model Analysis Algorithms
to the modelled behaviour. Transformations are required to flatten the model (submodels are removed) as much as possible, representing the model in its simplest form, so that the processes can be scheduled on the available resources without too much overhead. This paper presents the results of research on the first step of model transformations: it looks for ways to analyse models in order to retrieve information about ways to optimise them. The transformations themselves are not included yet. An example of such resources is the multiprocessor chip described in [6]. It is problematic to schedule all processes efficiently on one of these processors by hand, keeping in mind communication routes, communication costs and processor workloads. Researching the conversion of processes from a design point of view to an execution point of view is the topic of this paper.

Related Work

This section briefly describes work on algorithms to schedule processes on available resources. It finishes by explaining which algorithm was chosen as a basis for this research. Boillat and Kropf [7] designed a distributed algorithm to map processes to the nodes of a network. It starts by placing the processes in a randomised way; next it measures the delays and calculates a quality factor. The processes which have a bad influence on the quality factor are mapped again and the quality is determined again. The target system is a Transputer network in which the communication delays and available paths differ depending on the nodes which need to communicate, so related processes are automatically mapped close to each other by this algorithm. Magott [8] tried to solve the optimisation evaluation by using Petri nets. First a CSP model is converted to a Petri net, which forms a kind of dependency graph but still contains a complete description of the model.
His research added time factors to these Petri nets, and using these time factors he is able to optimise the analysed models according to his theorems. Van Rijn describes in [9] and [10] an algorithm to parallelise model equations. He deduces a task dependency graph from the equations and uses it to schedule the equations on a Transputer network. Equations with a lot of dependencies form heaps and should be kept together to minimise the communication costs. Furthermore, his scheduling algorithm tries to find a balanced load for the Transputers, while keeping the critical path as short as possible. In this work, van Rijn's algorithms are used as a basis for the model analysis algorithms, mainly because the requirements of both sets of algorithms are quite similar. The CSP processes can be compared to the model equations: both kinds of entities depend on predecessors and their results are required for other entities. The target systems both consist of a network of nodes having communication costs and similar properties. However, van Rijn's algorithm is not perfectly suitable for this work, so it will be extended into a more CSP-dedicated algorithm. The results of the runtime analyser can be presented in different ways. Brown and Smith [11] describe three kinds of traces: CSP, VCR and structural traces, but they indicate that even more possibilities of making traces visible are available. Because the runtime analyser tries to find a sequential order in the execution of a gCSP model, the results of the tool use the sequential CSP symbol '->' between the processes. This way of showing the sequential process order is quite intuitive, but for bigger models it gets (too) complex; a solution is presented in the recommendations section.

Goals of the Research

Designing and implementing algorithms to analyse gCSP models is the main goal of this research. It must be possible to see from the algorithm results how the performance of the
Figure 3. Architecture of the analyser and the executable.
execution of the model can be optimised in order to save resources. Two methods of analysis are implemented, as shown in Figure 2.
• The runtime analyser looks at the executing model; its results can be used to determine a static order of running processes.
• The model analyser looks at the models at design time and schedules the processes for a given target system.
Both analysers return different results due to their different points of view. These results can be combined and used to make an optimised version of the model, which is currently still manual labour.

Outline

First the algorithms and results of the runtime analyser are described in Section 1. Section 2 contains the algorithms and results of the model analyser. The paper ends (Section 3) with conclusions about both analysers and some recommendations.

1. Runtime Analyser

This section describes the algorithms and the implementation of the gCSP Runtime Analyser, called the runtime analyser from now on. The algorithm tries to find execution patterns by analysing a compiled gCSP program. Hilderink writes in [12] (section 3.9) about design freedom. He implies that the designer does not need to specify the process order of the complete node, but can leave it to the underlying system. The runtime analyser sits between the model and the underlying system, the CT library. It suggests static solutions for (some of) the process orders, so the underlying system is (partly) relieved from this task, making it more efficient.

1.1. Introduction

Figure 3 shows the runtime analyser architecture with respect to the model to be analysed. In order to execute a model, it is compiled together with the CT library into an executable. This can be done by the tool, or manually if compiling requires some extra steps. The main part of the analyser consists of the implementation of the used algorithm. The executable is started after compilation and communication between the tool and the executable is established.
After the executable is started, it initialises all processes and waits for the user to either run or step through the executable. Communication is required to feed the algorithms with runtime information. Therefore two types of communication are used:
• TCP/IP communication, which is sent over a TCP/IP channel between the CT library and the Communication Handler. It was originally used for animation purposes
[4, 13], but the runtime analyser reuses it to send commands to the executable, for example to start the execution, and to receive information about the active processes and their states.
• Command line communication, consisting of texts normally visible on the command line. It is used for debugging only; the texts are shown to the user in a log view.
The processes in a gCSP model have several states, which are used by the algorithms: new, ready, running, blocked and finished. Upon creation a process starts in the new state. When the scheduler decides that a process is ready to be started, it is put in the ready queue and its state is changed to the ready state. After the scheduler decides to start that process, it enters the running state. From the running state a process is able to enter the finished or the blocked state, depending on whether the process finished or was blocked. Blocking mostly occurs when rendezvous communication is required and the other side is not ready to communicate yet. From the finished state a process can be put in the ready state again via a restart or via the next execution of a loop. When the cause of the blocked state is removed, the process is able to resume its behaviour when it is put in the running state again.

1.2. Algorithms

The runtime analyser uses two algorithms: one to reconstruct the model tree and the other to determine the order of the execution of the processes. Both algorithms run in parallel and both use the information from changes in the process states. Before the algorithms can be started, the initialisation information of the processes is required to let the algorithms know which processes are available. This initialisation information is sent upon the creation of all processes, when they enter their new state. This occurs before the execution of the model is started.

1.3. Model Tree Construction Algorithm

Model tree (re)construction is required because the executable does not contain a model tree anymore. Using the model tree from the gCSP model is difficult, because the current version of gCSP has unusable, complex internal data structures. This will be improved in a new version (gCSP2), but until then the model tree construction algorithm is required. An advantage of reconstructing the tree is the possibility to be sure that the used gCSP model tree corresponds with the model tree hidden in the code of the executable. If both model trees are not equal, an error has occurred: for example, the code generation might have generated incorrect code, or communication between the executable and the analyser has been corrupted. Figure 4 shows the reconstructed model tree of the dual ProducerConsumer model shown in Figure 15. The numbers between parentheses indicate the number of times a process finished. This is shown for simple profiling purposes, so the user is able to see how active all processes are. Processes are added to the tree when they are started for the first time. All grouped processes, indicated with the dashed rectangles in Figure 15, appear on the same branch. In Figure 4 the label B shows a branch, which is made black for this example. The two processes on this branch, Consumer1 and Consumer2, are grouped and belong to Seq_C. The grouped processes indicate a scheduling order; in this example they have a sequential relationship, as shown by the CSP notation to the right of the Seq_C tree node. The rule for model tree creation, but also for process ordering, is based on the internal algorithms which are used by the CT library to activate and execute the processes. When a process becomes active it creates its underlying processes and adds them to the ready queue of the scheduler. Therefore one rule is sufficient to reconstruct the model tree, as shown in Rule 1.
[Figure 4 tree: Model → Par2 → {REP_P → Seq_P → {Producer1 → P1_Seq → {P1_C, P1_Wr}, Producer2 → P2_Seq → {P2_C, P2_Wr}}, REP_C → Seq_C → {Consumer1 → C1_Seq → {C1_Rd, C1_C}, Consumer2 → C2_Seq → {C2_Rd, C2_C}}}; each node is annotated with its finish count, e.g. Producer1 (61).]

Par2 = REP_P || REP_C
REP_P = Seq_P; REP_P
Seq_P = Producer1; Producer2
Producer1 = P1_C; P1_Wr
Producer2 = P2_C; P2_Wr
REP_C = Seq_C; REP_C
Seq_C = Consumer1; Consumer2
Consumer1 = C1_Rd; C1_C
Consumer2 = C2_Rd; C2_C
Figure 4. A reconstructed model tree of the dual ProducerConsumer model, with the CSP notation at the right.
Rule 1. If a process enters its ready or running state for the first time, make it a sibling of the last started process which is still running.
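Rule 1 can be sketched as a small event-driven tree builder. The sketch below is our own illustration, not the tool's implementation: following the walk-through and the tree of Figure 4, each newly seen process is attached under the most recently started process that is still running; all names and the event interface are assumptions.

```python
class Node:
    """A node in the reconstructed model tree."""
    def __init__(self, name, parent=None):
        self.name = name
        self.children = []
        self.running = False
        if parent is not None:
            parent.children.append(self)

class TreeBuilder:
    def __init__(self):
        self.root = Node("Model")
        self.root.running = True           # 'Model' acts as a pseudo-process
        self.nodes = {"Model": self.root}
        self.started = ["Model"]           # start order, most recent last

    def on_state_change(self, name, state):
        # Rule 1: on the first 'ready'/'running' event, attach the process
        # under the last started process that is still running.
        if state in ("ready", "running") and name not in self.nodes:
            parent = next(self.nodes[n] for n in reversed(self.started)
                          if self.nodes[n].running)
            self.nodes[name] = Node(name, parent)
        if state == "running":
            self.nodes[name].running = True
            self.started.append(name)
        elif state in ("finished", "blocked"):
            self.nodes[name].running = False
```

Feeding it the state changes from the walk-through below reproduces the top of the tree in Figure 4: Par2 under Model, and REP_P and REP_C under Par2.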
In order to make the rule clearer, the creation of the shown model tree, based on the behaviour of the scheduler built into the CT library, will be discussed. First 'Model' is added as the root of the tree and marked as the last started process, even though it is not a real process. On the first step Par2 enters its running state and thus is added to 'Model'. Since Par2 is a parallel construct, both REP_P and REP_C enter their ready state and both are added to Par2, since this was the last process entering its running state. In this example REP_P enters its running state; since it is added already, nothing happens. Now Seq_P enters its running state, so it is added to REP_P, and so on, until P1_C enters its finished state and P1_Wr enters its running state. P1_Wr is not added to P1_C because that process is not running anymore; instead the last running process is P1_Seq. After P1_Wr blocks (no rendezvous is possible yet), REP_C enters its running state; since it was already added to the tree nothing happens. Seq_C enters its running state and is added to REP_C since that is the last running process. This keeps on going until all processes are added to the tree and are removed from the list of unused processes. When this list is empty, the reconstruction algorithm is finished. Otherwise, it may be an indication of (process) starvation behaviour or a modelling error.

1.4. Process Ordering

Like the model tree construction algorithm, the process ordering algorithm also uses the changes in the process states. This time only the state change to 'finished' is used, since the required process order is the finished order of the processes and not the starting order. The result of the algorithm is a set of chains of processes showing the execution order of the executable. A chain is a sequential order of a part of the processes; it ends with one or more cross-references to other chains.
The algorithm operates in two modes: one for chains which are not complete and therefore do not have cross-references to other chains yet, and one for chains which are finished and have cross-references. A chain is finished when the same process is added to it for a second time; after it is finished no processes can be added to it. The reason to finish chains is that they are not allowed to contain the same process twice, as this would result in a single chain which gets endlessly long, instead of a set of multiple chains showing the static running order of the processes. Therefore two sets of rules are required, one for each mode.
The combination of the created chains should be equal to the execution trace of the model. Basically, the set of chains is another notation for the corresponding trace, so by following the chains and their cross-references the trace can be reconstructed.

1.4.1. Used Notation

D->C->F->B->(B,D*)
[start]->A->B->C->D->(B)
Figure 5. Example notation of a set of chains.
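The claim that a set of chains is just another notation for a trace can be illustrated with a small replay function. This is our own sketch, not part of the tool: chains are stored as (processes, cross-references) pairs, the process names are illustrative, and the order in which chains become active is supplied explicitly since it depends on the execution engine.

```python
def replay(chains, activations):
    """Reconstruct (a prefix of) the trace by emitting the processes of
    each chain in the order in which the chains become active."""
    # Each activation must be reachable via a cross-reference of the
    # previously active chain.
    for prev, cur in zip(activations, activations[1:]):
        assert cur in chains[prev][1], "activation must follow a cross-reference"
    trace = []
    for start in activations:
        procs, _refs = chains[start]
        trace.extend(procs)
    return trace

# Illustrative chains: [start]->A->B->C->D->(B) and a looping chain B->E->F->(B*)
chains = {
    "[start]": (["A", "B", "C", "D"], ["B"]),
    "B":       (["B", "E", "F"], ["B"]),
}
```

Replaying `["[start]", "B"]` yields the trace `A->B->C->D->B->E->F`; a longer activation list unrolls the loop further.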
The notation of the chains is shown in Figure 5. Each chain has a unique starting process followed by a series of processes, and is ended by one or more cross-references. The cross-references indicate which chain(s) might become active after the current chain is ended, and are shown at the end of a chain between parentheses. Asterisks behind a cross-reference indicate that the chain is looped: one asterisk is added when an ending process refers to the beginning of its chain; two asterisks are added when an ending process points to a process in the middle of its chain. During the explanation of the algorithm there always is an active chain, indicating that the current execution is somewhere in that chain. This chain is shown in bold face and starts with an asterisk; in this example notation the chain starting with B is the active one. The execution order starts at starting point '[start]'. In the example of the used notation, the chain starting with B is activated after the '[start]' chain is finished, since the cross-reference points to this chain. At the end of chain B two cross-references are available and the outcome depends on the underlying execution engine: either the current chain stays active and continues at process E, or the chain starting with D becomes active, and so on.

1.4.2. Rules for Chains with no Cross-References

These rules are responsible for creating and extending new chains. The trace shown in Figure 6 is used for the explanation; the numbers above the trace indicate a position, which is referred to in the explanation.

      1  2           3
      |  |           |
A->B->C->B->D->E->F->B
Figure 6. The used trace.
Figure 7. A new chain.
Rule 2. If the state of process p changes to 'finished', add p to the end of the active chain.
Rule 2 is the most basic rule to create new chains. When position 1 is reached, the result is a chain with processes A, B and C added, as shown in Figure 7. At position 2 process B is finished for a second time. Since it is not allowed to add a process twice to a chain, Rules 3 and 4 are required.

Rule 3. If process p is finished, but is already present in the active chain, it becomes a cross-reference of this chain.
Rule 4. If process p becomes the cross-reference of a chain, the chain starting with p is looked up. If the chain is not available, a new chain starting with p is created. The existing or newly created chain becomes the new active chain.
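Taken together, Rules 2–4 build the chain set for a trace before any cross-reference checking is needed. The following is a minimal sketch under our own data layout (a chain as a dict of process list plus cross-reference list); it is an illustration, not the tool's code.

```python
def build_chains(trace):
    """Apply Rules 2-4 to a trace of finished processes."""
    start = {"procs": [], "refs": []}   # the '[start]' chain
    chains = [start]
    active = start
    for p in trace:
        if p in active["procs"]:
            # Rule 3: p is already in the active chain -> cross-reference
            active["refs"].append(p)
            # Rule 4: find the chain starting with p, or create it,
            # and make it the new active chain
            nxt = next((c for c in chains if c["procs"][:1] == [p]), None)
            if nxt is None:
                nxt = {"procs": [p], "refs": []}
                chains.append(nxt)
            active = nxt
        else:
            # Rule 2: extend the active chain with p
            active["procs"].append(p)
    return chains

# The trace of Figure 6:
chains = build_chains(["A", "B", "C", "B", "D", "E", "F", "B"])
```

The result is the '[start]' chain `A->B->C->(B)` and the looping chain `B->D->E->F->(B*)` (the final cross-reference points at the chain's own start).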
[start]->A->B->C->(B)
B->D->E->F->(B*)
[start]->A->B->C->(B)
Figure 8. A second chain is added.
Figure 9. The complete trace converted to chains.
Rule 3 defines when a chain should be ended. When a chain that starts with process B is not found, Rule 4 states that a new chain should be added. The result is shown in Figure 8. When position 3 is reached, process B again applies to Rule 4. This time a chain starting with process B is found: the current active chain. No new chain needs to be added and the current chain is ended, as shown in Figure 9.

1.4.3. Rules for Chains with Cross-References

When the active chain is finished, and thus has one or more cross-references, the algorithm should check whether this chain is correctly created. A chain is correctly created if it is valid for the complete execution of the model; when this is not the case, the chain needs modification. In order to check whether a chain is correct or not, the position in the currently active chain is required. To explain these rules the trace is extended, as shown in Figure 10.

      1  2           3  4  5
      |  |           |  |  |
A->B->C->B->D->E->F->B->D->G->H
Figure 10. The extended trace.
First of all it should be checked whether the finished process is at the end of the chain. Position 3 ends the chain starting with B shown in Figure 9. Rule 5 performs this check and finds the referenced chain; in this case it is the same chain.

Rule 5. If the finished process p is at the end of the chain, find the chain starting with process p and make it the active chain.
When the chain is not ended by the finished process, the process should match the next process in the chain; this results in Rule 6. At position 4 process D finishes. Previously process B was finished, so process D is expected next. No problems arise since the expected and the finished processes match.

Rule 6. If the expected process does not match the finished process p, the chain must be split.
At position 5 process G finishes, instead of the expected process (E), which comes after the previous process (D) in the chain. According to the rule, the active chain should be split. Splitting chains is a complicated task, since the resulting chains still need to match the previous part of the trace. For the current trace the active chain should be split after process D, since process E was still expected. Rule 7 defines the steps to be taken in such a situation.

Rule 7. The active chain should be split at process e when process p is unexpected and a chain starting with process e is not yet present. To split a chain:
• end the current chain before process e;
• put the processes after e and the cross-references in a new chain starting with process e;
• add e and p as new cross-references;
• create a new chain starting with process p and make it the active chain.
It results in a new chain starting with E containing the remainder of the split chain. A new chain is created starting with process G and is made active. The resulting chains are
shown in Figure 11. When following the trace from the start till position 5, the set of chains in the figure matches the given trace again. The active chain is the chain starting with process G; it is not yet finished, so the rules of the previous section apply to it.

E->F->(B)
B->D->(E,G)
A->B->C->(B)
Figure 11. Chains after splitting chain B.
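The four steps of Rule 7 translate almost directly into code. The helper below is a hypothetical sketch (our own names and data layout): it ends the active chain before e, moves the tail and the old cross-references into a new chain starting with e, sets e and p as the new cross-references, and creates the new active chain starting with p.

```python
def split_chain(active, e, p, chains):
    """Rule 7: 'active' expected process e, but p finished, and no
    chain starting with e exists yet."""
    i = active["procs"].index(e)
    # steps 1+2: end the chain before e; the tail and the old
    # cross-references form a new chain starting with e
    tail = {"procs": active["procs"][i:], "refs": active["refs"]}
    active["procs"] = active["procs"][:i]
    # step 3: e and p become the cross-references of the ended chain
    active["refs"] = [e, p]
    # step 4: a new chain starting with p becomes the active chain
    new = {"procs": [p], "refs": []}
    chains += [tail, new]
    return new

# Splitting chain B->D->E->F->(B) at e='E' with p='G' (position 5):
chains = [{"procs": ["B", "D", "E", "F"], "refs": ["B"]}]
active = split_chain(chains[0], "E", "G", chains)
```

This yields `B->D->(E,G)` and `E->F->(B)` as in Figure 11, plus the new active chain starting with G.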
1.4.4. Rule for Splitting Chains when a Matching Chain Already Exists

The steps of Rule 7 can only be used if a chain starting with process e is not present. If the trace is extended again, according to Figure 12, a problem arises at position 7. Until position 6 no problems occur; Figure 13 shows the resulting set of chains up to this position.

      1  2           3  4  5              6  7
      |  |           |  |  |              |  |
A->B->C->B->D->E->F->B->D->G->H->B->D->G->H->I
Figure 12. The final trace.
G->H->(B)
E->F->(B)
B->D->(E,G)
A->B->C->(B)
Figure 13. The chains till position 6.
At position 6 process H is finished and expected, so position 7 is reached. A problem occurs at position 7: process I is finished but process B is expected, so according to Rule 6 the chain should be split. The problem is that a chain starting with process B is already present. Having two chains starting with the same process is not allowed, since it would not be deterministic which chain should be activated when a cross-reference to one of these chains is reached. However, it is clear that the processes after process H match the chain starting with B exactly. Rule 8 provides a solution for these situations.

Rule 8. The active chain should be split at process e when process p is unexpected, but a chain starting with process e is present already. Compare the processes after e with the chain starting with process e:
• if both parts are equal, remove the remaining processes in the active chain starting at process e, add the cross-references to the chain starting with e if they are not present at this chain, and create a new chain starting with p, making it the active chain;
• if both parts are unequal, the comparison did not succeed: the static process order cannot be determined and the algorithm should stop.
G->H->(B,I)
E->F->(B)
B->D->(E,G)
A->B->C->(B)
Figure 14. The chains after position 7.
Figure 14 shows the result after applying the rule. Process I is active and can be added to the set of chains using the described rules. If the comparison of the rule fails, it indicates that two unequal chains starting with the same process would have to be created. This situation is not supported by the algorithm, because the execution engine of the CT library in combination with this gCSP model does not result in a static execution order and the model is non-deterministic. A simple example which has a chance to fail the comparison would be a 'one2any channel': one process writes on the channel and multiple processes are candidates to read the value. If the reading order of these candidate processes is not deterministic, Rule 8 fails.
1.5. Results

This section describes the results of the analyser. First a functional test is performed; next, a scalability test is performed to see whether the analyser still functions when a bigger real-life model is used.

1.5.1. Functional Test

As a functional test a simple producer-consumer model is created, as shown in Figure 15. The Producer and Consumer submodels are shown below the main model.
Figure 15. The dual ProducerConsumer model.
The model is loaded in the analyser, which automatically generates code using gCSP, compiles it and creates the executable. Figure 16 shows the result of the analyser.

P1_C->C1_Rd->C1_C->P1_Wr->P2_C->P2_Wr->C2_Rd->C2_C->(P1_C*)
[start]->(P1_C)
Figure 16. The process chains for the dual ProducerConsumer model.
One big chain is created and the [start]-chain is empty, indicating that no startup behaviour is present. Keeping in mind that it shows the finished order of the processes, it can be verified that it corresponds with the model. First P1_C is finished. Next P1_Wr blocks since no reader is active yet, so C1_Rd becomes active and reads the value from the channel. After C1_C finishes, C2_Rd blocks since P2_Wr is not active yet. P1_Wr finally can finish as well; this can be continued for the other Producer-Consumer pair. The analysis of the model results in a single chain; in order to compare the results of the optimised model against the results of the original model, at least two chains are required. Therefore the model of Figure 15 is reduced to a single Producer-Consumer pair, by removing P2 and C2. The results of the analysis of this new model are shown in Figure 17.

P1_Wr->P1_C->C1_Rd->C1_C->(C1_Rd)
C1_Rd->C1_C->P1_Wr->P1_C->(P1_Wr)
[start]->(C1_Rd)
Figure 17. The process chains for the single ProducerConsumer model.
The chains are about half the size of the original model's, which is as expected, but the extra chain is new. The difference between both chains is that one chain first writes to the
channel and then reads, while the other chain first reads and then writes. When the chain of the original model is closely examined this behaviour is also visible; it is less noticeable since the two ProducerConsumer pairs share this behaviour. The analysis result now has two chains, so a simplified version can be created, which is shown in Figure 18. It has two processes, one for each chain, both containing a code block representing a chain. The repetitions are replaced by infinite while-loops in the code blocks, resulting in a model which is as optimal as possible.
Figure 18. The simplified ProducerConsumer model, rebuilt using the runtime analyser results.
The original gCSP model needs quite some remodelling in order to be usable for time measurements, as can be seen in Figure 19. Mainly this is because of the lack of 'init' and 'finish' events for code blocks, so other code blocks are required to simulate these events. Therefore the results will not be exactly accurate, since extra context switches are introduced by the measurement system. Errors also occur because the models are modified compared to their originals, but these errors are the same for both models, so for the comparison they do not matter.
Figure 19. The ProducerConsumer model with measurement added.
To test the speed differences between the original and the optimised model, they are built without animation functionality in order to have less external disturbance. For the comparison, the producer and consumer run during a certain interval, after which the amount of communicated data is stored. This test is repeated 10 times and the average of the resulting values is calculated. In order to get an even more accurate result, this series of 10 tests is repeated 60 times and each average result is plotted in Figure 20. In the figure, the x-axis shows the measurement number and the y-axis shows the average amount of data produced and consumed for each measurement run. The more data is processed the better, since it indicates that the model is running faster. It is clear that the optimised model indeed has better results compared to the original model. The average factor between the processed data of both models is 1.3, so the optimised
Figure 20. Measurement results.
model is 30% faster. Since the optimised model, using the results produced by the runtime analyser, is better than the original, the tool is functional. Whether the speedup is optimal or not must be determined by performing more tests while taking more details into account, which is out of scope of this paper.

1.5.2. Scalability Test

The previous section showed that the analyser is working and that the optimised model is indeed better than the original. This section describes the analysis results of a real control model, in order to show that the analyser is able to handle big models as well. The model [14], shown in Figure 21, is used to control a Cartesian plotter. The motion sequencer uses a data file to create a motion path for the pen. The motor controllers block contains a 20-sim [15] model to control the X, Y and Z motors of the plotter. The safety block contains safety checks and is able to disable the X, Y or Z motor signal to make sure no unsafe situations occur. The last block, Scaling, scales the X, Y, Z and VCC signals within expected value ranges so the plotter receives correct signals.
Figure 21. Plotter model.
Normally the LinkDrivers, visible in the figure, act as glue between the model and the hardware. For analysis purposes they are implemented to be non-blocking and to generate dummy data, instead of receiving data from the hardware, so the model is able to run without real IO. Apart from the changed LinkDrivers, everything is the same as it would be when controlling the actual plotter. After the runtime analyser finishes the analysis, the results shown in Figure 22 are obtained. The first two letters of a reader or writer process are the abbreviation of the parent process and the number at the end corresponds with the numbers of the channels in the figure; for example, 'Sc_Rd8' is the reader from Scaling which reads the values from channel number 8.
Sc_Rd8->DoubletoBooleanConversion->Sc_Wr17->Sa_Wr8->Sa_Rd4->MC_Wr4->Sa_Rd7->MC_Wr7
->Sa_Rd6->MC_Wr6->Sa_Rd5->MC_Wr5->Sa_Rd_ESX2_2->Sa_Rd_ESX2_1->Sa_Rd_ESX1_2
->Sa_Rd_ESX1_1->MC_Rd12->MC_Rd13->Safety_X->Sa_Rd_ESY1->Sa_Rd_ESY2->Safety_Y
->Safety_Z->MC_Rd1->MS_Wr1->MC_Rd2->MS_Wr2->Sa_Wr9->Sc_Rd9->MC_Rd3
->LongtoDoubleConversion->Controller->MS_Wr3->Sc_Rd10->Sa_Wr10->(Sc_Rd11)
Sc_Rd11->DoubletoShortConversion->Sc_Wr14->Sc_Wr15->Sc_Wr16
->Sa_Wr11->(Sc_Rd8, HPGLParser)
MC_Rd12->MC_Rd13->Sa_Rd_ESX2_2->Sa_Rd_ESX2_1->Sa_Rd_ESX1_2->Sa_Rd_ESX1_1->MC_Rd1
->MS_Wr1->Safety_X->Sa_Rd_ESY1->Sa_Rd_ESY2->Safety_Y->Safety_Z->MC_Rd2->MS_Wr2
->MC_Rd3->LongtoDoubleConversion->Controller->MS_Wr3->Sa_Wr9->Sc_Rd9->Sc_Rd10
->Sa_Wr10->(HPGLParser)
HPGLParser->(MC_Rd12, Sc_Rd11)
[start]->MC_Rd12->MC_Rd13->HPGLParser->MS_Wr1->MC_Rd1->MC_Rd2->MS_Wr2->MC_Rd3
->LongtoDoubleConversion->Controller->MS_Wr3->MC_Wr5->Sa_Rd5->MC_Wr6->Sa_Rd6->MC_Wr7
->Sa_Rd7->MC_Wr4->Sa_Rd4->(HPGLParser)
Figure 22. Result of the analyser for the plotter controller.
This shows that the analyser also works for big, real controller models. It is practically impossible to validate the results manually. From ‘[start]’ to ‘Sa_Rd4’ the order seems reasonable. After that part some optimisation effects start to play a role and ‘HPGLParser’ is finished for the second time, even before the rest of the model has finished. These optimisation effects are the result of a channel optimisation in the CT library (described in [12], section 5.5.1). It only allows context switches when really required; for example, after reading a result from the channel, the group containing the reading process stays active and continues with its execution, as opposed to the corresponding writer becoming active again, which would give a more natural flow. The reader might try to read data from the channel again at some point, but now the channel will be empty, since the writer did not become active yet. Now the reader blocks and the writing process is activated again. This results in hard-to-explain analysis results. In the long run every process is called as often as the other processes, which is as expected. To see whether the results are indeed valid, the set of chains should be used to create an optimised model, to check if the model still runs as expected. Determining which chain is started when is not possible from the analysis results themselves. Using the stepping option of the tool, it becomes possible to manually determine that the order of chains is as shown in Figure 23. The last ‘HPGLParser’ references back to the first one and the loop is complete. This complex order is also a result of the channel optimisations.

[start]->HPGLParser->MC_Rd12->HPGLParser->Sc_Rd11->Sc_Rd8->Sc_Rd11->(HPGLParser*)
Figure 23. Execution order of the chains.
1.6. Discussion

First of all, the runtime analyser seems to work as expected, being able to analyse most (precompiled) models. Two known problems are:
• the use of One2Any or Any2Any channels with three or more processes connected. Models which use these types of channels might fail during analysis, because of the lack of deterministic behaviour of such constructs. In these cases the tool will give a warning that it is not possible to continue analysing.
• models using recursion blocks or alternative channels. This may lead to processes which are not used during the analysis, because they might depend on (combinations of) external events which might not be available during analysis.
The rules described in section 1.4 are sufficient for deterministic models and for simple non-deterministic models as well. However, the rules are not proven to be complete, since they were defined by comparing the results of relevant tests with the expected results. When new (non-deterministic) situations are analysed, it is possible that new rules are required.

The bigger the model becomes, the harder the results are to interpret. In the plotter example, it is clear that certain sets of processes occur multiple times, like:

Safety_X->Sa_Rd_ESY1->Sa_Rd_ESY2->Safety_Y

In order to make the results easier to interpret, such sets could be replaced by a group symbol, so the groups can be used multiple times.

Currently the results of the analysis are based on a single-thread scheduler. When multicore systems are used and the scheduler of the CT library is able to schedule multiple threads simultaneously, the analysis results are undefined. The processes will be running truly in parallel, and the received state changes do not represent one sequential order but multiple sequential orders. To solve this problem the received state changes should be accompanied by thread information, so the multiple sequential orders can be separated. The analysis algorithms can then construct the chains for each thread separately using this extra information.

Another usage possibility of the current runtime analyser tool would be post-game analysis [16]. The executable is executed on a target system first and the trace is stored in a log. A small tool could be created to replay the logged trace by using the animation protocol to connect to the runtime analyser tool. The analyser would not be able to notice the difference, and it becomes possible to do analysis for models which run on their target system. Even though the results are likely to become very complex, as became clear with the scalability test, usable information can still be obtained.
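The grouping of repeated process sets suggested above can be sketched in a few lines. A minimal illustration (the group symbol ‘SafetyXY’ is our own invention, not part of the tool):

```python
# Sketch: replace every occurrence of a known repeated set of processes
# in an analysis chain by a single group symbol, so long chains become
# easier to read. Not the authors' implementation.

def group_chain(chain, group, symbol):
    """Replace each occurrence of the sub-list `group` in `chain` by `symbol`."""
    result, i, n = [], 0, len(group)
    while i < len(chain):
        if chain[i:i + n] == group:
            result.append(symbol)
            i += n
        else:
            result.append(chain[i])
            i += 1
    return result

chain = ["MC_Rd13", "Safety_X", "Sa_Rd_ESY1", "Sa_Rd_ESY2", "Safety_Y", "Safety_Z"]
group = ["Safety_X", "Sa_Rd_ESY1", "Sa_Rd_ESY2", "Safety_Y"]
print(group_chain(chain, group, "SafetyXY"))  # ['MC_Rd13', 'SafetyXY', 'Safety_Z']
```

Applying this to all chains with a small dictionary of named groups would shorten the plotter results considerably.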
Processes which are unused will not become part of the model tree; this information can be used as a coverage test. Of course these processes might become active depending on external inputs, like a safety-checking process that might incorporate processes which only run when unsafe situations occur. Such a situation might never occur when just analysing a model. It also might indicate that an alternative channel is behaving unexpectedly or that a recursion is looping infinitely. We can however use this information to detect processes which should be reviewed.

A more complex usage of the results is to actually implement them. By using the chains it becomes possible to create big processes containing the processes of each chain. By replacing the channels which are used internally in the new processes with variables, the processes will not block and also become even simpler. Most processes are control related; these processes mostly contain no (finite) state machines, or only very small ones, so combining them is not likely to give state-space-explosion related problems. These steps result in an optimised model which requires fewer system resources and runs faster, as shown in section 1.5.1.

2. Model Analyser

This section describes the developed model analyser tool. After a short introduction, an explanation of the algorithm is given. Next, the testing and results are given using the same models as the ones used for the runtime analyser in section 1. This section ends with conclusions about the model analyser and its usability.

2.1. Introduction

The model analyser consists of several algorithm blocks connected to each other, shown in Figure 24. For each block the corresponding section describing it is shown within the parentheses.
Figure 24. Algorithm steps of the model analyser: gCSP model -> Model Tree Creator (2.2.1) -> Dependency Graph Creator (2.2.2) -> Critical Path Creator (2.2.3) -> Heap Scheduler (2.2.4) -> Core Scheduler (2.2.5) -> analysis results, with extra data supplied via the user interface.
The analysis results are calculated using the gCSP model information combined with extra data from the user interface. The following user data is used by the algorithm:
• Number of cores: the number of cores available to schedule the processes on.
• Core speed: the relative speed of a core.
• Channel setup time: the time it takes to open a channel from one core to another.
• Communication time: the time it takes to communicate a value over the channel.
• Process weight: the time a process needs to finish after it was started.
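As an illustration only, these user-supplied parameters could be collected in a record like the following; the field names and defaults are assumptions, not the tool’s actual interface:

```python
# Sketch of the analysis settings as a record; values follow the
# defaults mentioned later in the text (communication weight 25,
# process weight 10), but are otherwise assumptions.
from dataclasses import dataclass

@dataclass
class AnalysisSettings:
    num_cores: int = 3            # cores available to schedule on
    core_speed: float = 1.0       # relative speed of a core
    channel_setup_time: int = 0   # time to open a channel between cores
    communication_time: int = 25  # time to communicate a value over a channel
    process_weight: int = 10      # time a process needs to finish once started

settings = AnalysisSettings(num_cores=1)  # e.g. the single-core functional test
print(settings.num_cores, settings.communication_time)
```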
The results of the separate algorithm steps are made visible in the user interface. All algorithm steps built into the model analyser are automatic; the user only needs to open a gCSP model and wait for the analysis results. The analyser offers the possibility to step through the algorithm steps, in order to find out why certain decisions are made, for example why a certain heap is placed on a certain core, or why several processes are grouped on the same heap.

2.2. Algorithms

The algorithm created by van Rijn [9] is used, as stated before. The important parts are based on a set of rules that can be extended, in order to add new functionality to the analyser. Originally it was created for model equations running on a Transputer network. Our improved algorithm is usable for scheduling CSP processes on available processor cores or distributed system nodes.

Scheduling the processes on the available cores is too complex for a single algorithm step; therefore heaps were introduced by van Rijn, which are created by the heap scheduler. Heaps are groups of closely related processes which should be scheduled onto the same core. The heaps can be handled as one process, so the heap scheduler reduces the number of elements to be analysed by the core scheduler and lightens its task. Van Rijn noticed a disadvantage in the use of heaps: they might have lots of dependencies on other heaps, since the process dependencies become heap dependencies after the processes are placed onto the heaps. Heaps might even contain circular dependencies with other heaps, and these circular dependencies cannot be solved by the core scheduler. To solve this problem the processes on heaps are regrouped using index blocks, which start at processes which have an external incoming dependency. The core scheduler is able to schedule these index blocks, since they solve the circular dependencies.

The algorithm blocks, shown in Figure 24, are independent blocks chained together.
So it is fairly easy to extend these blocks or to add a new one in between, without the need to rewrite the surrounding blocks, assuming of course that the data sent to the existing blocks stays compatible. This data is also used by the user interface to update its views after each step.

Unless stated otherwise, the model of Figure 25 is used to create the results shown in the following sections while explaining the behaviour of the blocks. It is the same as Figure 15, but all sub-models are exploded to make all dependencies clearer.

2.2.1. Creation of the Model Tree

The model tree is only for informational use and is shown in the user interface of the tool; the algorithm itself does not use it. It is different from the real Model Trees as used in gCSP: this
Figure 25. Exploded view of the dual ProducerConsumer model.
tree only shows elements which influence the analysis results. This algorithm step is simple:
• it loads the selected model,
• it builds the tree by recursively walking through all submodels to add the available parts,
• and it sends the loaded model to the dependency graph creator.

2.2.2. Dependency Graph Creator

The dependency graph is extensively used by the algorithm blocks to find and make use of the dependencies of the processes. The dependency graph creator recursively walks through the model: first to add all vertices, which represent the available processes, and a second time to add edges, representing the dependencies. The first step, adding vertices, is done by finding suitable processes: code blocks, readers and writers. For the second step, adding edges, sequential relations are used: rendezvous channels and sequential compositions. Figure 26 shows an example dependency graph. The edges are created using the sequential relationships, except for the dependencies between the writers and readers; those two dependencies are derived from the corresponding two channels. The graphical representation also uses the predecessors and the successors to place the vertices in a partially ordered way: in the direction of time flow (i.e. the vertical direction in the figure).
Figure 26. Dependency graph of the ProducerConsumer model.
Figure 27. Critical path of the ProducerConsumer model.
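The graph-building pass of section 2.2.2 can be sketched as follows; the concrete vertex and edge lists are assumptions modelled on Figures 25 and 26, not output of the tool:

```python
# Sketch of dependency-graph construction for the dual ProducerConsumer
# model: vertices for code blocks, readers and writers; edges for
# sequential compositions and rendezvous channels (writer -> reader).

sequential = [("start", "P1_C"), ("P1_C", "P1_Wr"), ("P1_Wr", "P2_C"),
              ("P2_C", "P2_Wr"), ("C1_Rd", "C1_C"), ("C1_C", "C2_Rd"),
              ("C2_Rd", "C2_C"), ("C2_C", "end")]
channels = [("P1_Wr", "C1_Rd"), ("P2_Wr", "C2_Rd")]  # writer -> reader

graph = {}  # adjacency list: vertex -> list of successors
for src, dst in sequential + channels:
    graph.setdefault(src, []).append(dst)
    graph.setdefault(dst, [])

print(sorted(graph["P1_Wr"]))  # successors via sequence and via channel
```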
2.2.3. Critical Path Creator

A critical path is required for the heap scheduler, see Figure 24, to schedule the heaps efficiently. Finding this critical path is easy: first the highest cumulative weight for each vertex is determined by following paths from the start vertex to the end vertex and storing the highest calculated weight for each vertex. Secondly, the path containing the highest calculated weights is determined to be the critical path. Figure 27 shows the critical path with the thick line; the values in the vertices are the highest cumulative weights of the vertices.

2.2.4. Heap Scheduler

As described earlier, the heap scheduler groups processes onto heaps, which can be seen as groups of closely related processes, to lighten the work of the core scheduler. Each process on a heap has a start time set, to indicate when the heap needs to start the process. Using this start time and the process weight data, a heap end time can be calculated. Since all processes scheduled on heaps have these values, the heaps can be seen as sequentially scheduled groups of processes.

The heap scheduling algorithm is complex and comprehensive; the complete algorithm and its rules are explained in [17]. In general the algorithm puts the vertices of the critical path in heap 1. The remaining vertices are grouped in sequential chains and assigned to their own heaps. The start time assigned to each process placed on a heap is the end time of the previous process. As described before, the end time of a process is calculated by adding the weight of the process to its start time, and can be used as the start time of the next vertex on the heap. Sometimes copying a vertex and its predecessors onto multiple heaps might be cheaper than communicating the result from one heap to another, so several heaps might contain the same vertices, and the start and end times might vary for a process placed on different heaps.
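The critical-path pass of section 2.2.3 amounts to a longest-path computation over the dependency graph. A sketch; the graph and the uniform process weight of 10 are assumptions modelled on the ProducerConsumer example, not the tool’s actual code:

```python
# Sketch: propagate the highest cumulative weight from start to end,
# then read the critical path back from the stored predecessors.

edges = [("start", "P1_C"), ("P1_C", "P1_Wr"), ("P1_Wr", "P2_C"),
         ("P2_C", "P2_Wr"), ("P1_Wr", "C1_Rd"), ("C1_Rd", "C1_C"),
         ("C1_C", "C2_Rd"), ("P2_Wr", "C2_Rd"), ("C2_Rd", "C2_C"),
         ("C2_C", "end")]
weight = {v: 10 for e in edges for v in e}  # assumed uniform weight
weight["start"] = weight["end"] = 0

order = ["start", "P1_C", "P1_Wr", "P2_C", "C1_Rd", "P2_Wr",
         "C1_C", "C2_Rd", "C2_C", "end"]  # a topological order

cum, pred = {"start": 0}, {}
for u, v in sorted(edges, key=lambda e: order.index(e[0])):
    c = cum[u] + weight[v]           # highest cumulative weight via u
    if c > cum.get(v, -1):
        cum[v], pred[v] = c, u

path, node = [], "end"               # walk back along the heaviest path
while node in pred:
    path.append(node)
    node = pred[node]
path.append("start")
path.reverse()
print(cum["end"], path)
```

Processing edges in topological order of their source vertex guarantees that each cumulative weight is final before it is propagated further.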
Having processes scheduled in multiple locations does not give any problems in general, because a process receives some values, performs a calculation of some sort and sends the results to another process. The communication takes place using rendezvous channels; this mechanism makes sure that processes receive the correct information, and it does not matter whether a value is calculated multiple times because that is faster. An exception are processes which have IO-pins: duplicating these kinds of processes is not possible, since in general only one set of IO-pins is available.

The result of the example can be found in Figure 28. At the left the dependency graph is shown; the colours of the vertices correspond with the heaps they have been placed on. The texts inside the vertices represent the heap number, the start time and the end time. At the right, bars are visible which represent the heaps, with the vertices placed on them in the determined chronological order.
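The duplicate-or-communicate trade-off described above can be captured in a single comparison. A sketch (not the authors’ implementation), using the default weights from the example:

```python
# Sketch: copying a vertex and its predecessors onto a second heap is
# preferred when their combined weight is below the communication
# weight. Defaults (25 and 10) follow the example in the text.

def cheaper_to_duplicate(subchain_weights, communication_weight=25):
    """True if re-computing the sub-chain beats communicating its result."""
    return sum(subchain_weights) < communication_weight

# 'P1_C' and 'P1_Wr' weigh 10 each: duplication (20) beats communication (25).
print(cheaper_to_duplicate([10, 10]))
```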
Figure 28. Results of heap scheduler: (a) dependency graph; (b) heap timing.
The figure shows that ‘P1_C’ and ‘P1_Wr’ are scheduled on both heaps, which is understandable since the communication weight is set by default to 25 and the weight of a vertex to 10. So the two vertices have a combined weight of 20, and putting them on both heaps saves a weight value of 5, resulting in less workload for the cores.

2.2.5. Core Scheduler

The last algorithm block is the core scheduler. This block schedules the heaps onto a number of available cores, which can be seen as independent units, like real cores of a processor or nodes in a distributed system. The scheduler tries to find an optimum, depending on the number of available cores and their relative speeds. Like the heap scheduler, the core scheduler is also very complex and comprehensive, so only the basics are explained here.

The algorithm is divided in two parts: an index block creator and an index block scheduler. Index blocks are groups of vertices within heaps which are easier to schedule than the heaps themselves: for example, if two heaps are dependent on each other they cannot be scheduled according to the scheduling rules, since a heap is scheduled only after all heaps it depends on are scheduled. Therefore index blocks are defined in such a way that this scheduling problem and others will not occur. The actual scheduler consists of many rules to come to a single optimal scheduling result for each index block. A distinction is made between heaps which are part of the critical path and the other heaps, to make sure that the critical path is scheduled optimally and the end time of this path does not become unnecessarily long. The overall rules are designed to keep the scheduled length, and thus the ending time, as short as possible.

Figure 29 shows the results of the core scheduler in the same way as the heap results are shown. Three cores, named c1, c2 and c3, were made available. Their core names are visible in the dependency graph with the vertex timing behind them.
The bars view also shows the core names, with the vertices placed in chronological order. The figure shows that core c3 is unused, which is no surprise since only two heaps are available, both completely placed on their own core.
Figure 29. Results of core scheduler: (a) dependency graph; (b) core timing.
2.3. Results

This section describes the results of the analyser. The setup is the same as for the runtime analyser, described in section 1.5.

2.3.1. Functional Test

As a functional test the dual producer–consumer model is used again, as shown in Figures 15 and 25, but now the number of cores is reduced to one. The result can be seen in Figure 30.
Figure 30. Results of the core scheduler with one available core: (a) dependency graph; (b) heap timing.
Both heaps are now scheduled on the same core. The following constraints should be met in order to pass the functional test:
• ‘P1_C’ and ‘P1_Wr’ are available in both heaps, but of course they should be scheduled only once.
• the timing of all processes should be correct. It is not allowed to have two processes with overlapping timing estimates.
• the dependencies of the processes should be honoured. A process should be scheduled after its predecessors, even when those predecessors were scheduled on different heaps.
All of these constraints are met when looking at the figure. In fact, these constraints are still met when the available settings are varied further, like varying the number of cores, the communication weight or the weight of processes, which is not shown here. In [17] more functional tests are performed, and it is concluded that the analyser is functional.

2.3.2. Scalability Test

The plotter model is analysed and its results are shown in Figure 31. It has a long sequential path, which is expected since the model was designed to be sequential: first sensor information is read, then it gets processed, checked for safety, and finally the motors are controlled. The model even has sequential paths for the calculations for the X, Y and Z directions of the plotter. The scalability test shows that the analyser indeed works for bigger models. In this case, a long sequential path does not give any problems and the analyser is still able to run the algorithms. When looking at the figure, it becomes clear that the cores are scheduled as optimally as possible, meaning that the end time of the schedule is as low as possible for the current configuration. This is visible from the fact that the gaps in the schedule of the first core are smaller than the combined weight of the processes scheduled on the other cores beneath the gaps: these processes would not fit in the gaps.
In order to make the gaps smaller, or even to get rid of them, the communication delays must be made smaller. From the analysis results it can be concluded that the model should be made more parallel by design when it needs to run on a multi-core target system. Currently, it is almost completely scheduled onto a single core, mainly due to the number of dependencies between the processes, resulting in the long critical path. Parallelising the X, Y and Z paths in the model, so that the scheduler schedules these paths onto their own cores, results in a model which is more suitable to run on the target. Another solution is creating a pipeline for the four tasks and putting each step of this pipeline on its own core.
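The functional-test constraints of section 2.3.1 can also be checked mechanically. A sketch; the schedule data is an assumption modelled on Figure 30, not the tool’s output format:

```python
# Sketch: validate a schedule against the functional-test constraints:
# no overlapping intervals on a core, and every process starting only
# after all of its predecessors have ended.

def check_schedule(schedule, deps):
    """schedule: process -> (core, start, end); deps: (pred, succ) pairs."""
    by_core = {}
    for proc, (core, start, end) in schedule.items():
        by_core.setdefault(core, []).append((start, end))
    for intervals in by_core.values():
        intervals.sort()
        for (s1, e1), (s2, e2) in zip(intervals, intervals[1:]):
            if s2 < e1:              # overlapping timing estimates
                return False
    # dependencies honoured: predecessor ends before successor starts
    return all(schedule[a][2] <= schedule[b][1] for a, b in deps)

schedule = {"P1_C": ("c1", 0, 10), "P1_Wr": ("c1", 10, 20),
            "C1_Rd": ("c1", 20, 30), "C1_C": ("c1", 30, 40)}
deps = [("P1_C", "P1_Wr"), ("P1_Wr", "C1_Rd"), ("C1_Rd", "C1_C")]
print(check_schedule(schedule, deps))  # True for this schedule
```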
Figure 31. Plotter results: (a) dependency graph; (b) core timing; (c) heap timing.
2.4. Discussion

As seen in the results section, the analyser provides useful information about a model. It allows us to see whether a model is suitable to be scheduled on multiple cores efficiently, or to see how the timing is estimated and which processes are the bottleneck, depending on the target system. However, a couple of things need to be improved:
• the communication costs only have one value, which is too simple. A delay between the predecessor and the process is added, but in a more realistic situation this is not sufficient. In such situations the delay values should be different for each combination of cores, for example depending on the type of network between the cores or the speed of the cores.
• the communication delays are also made too generic. When defining the heaps it is not yet known whether communication is going to take place or not. Therefore the default communication costs should be removed from the heap scheduler and added to the core scheduler. This should result in lower communication costs, since they are only added when communication actually takes place.
• communication between cores always has the same costs. In real situations cores might be on the same processor or on different nodes in a distributed system. So the communication costs should be able to differ depending on which cores are communicating.
• the heap scheduler needs to be improved to produce more balanced heaps. Currently heaps tend to be very long, mainly due to the fact that the communication costs are too dominant. To prevent these dominant communication costs, some experiments are required in order to get (relative) values for average communication costs and process weights. These values can then be used as default values for the analyser, so the analysis will give better results. Big heaps are hard to schedule optimally, since a heap is completely placed on one core.
• processes might be node-dependent. This is for example the case for processes which depend on external I/O, which is probably wired to one node, so the process should be present on that particular node. For these processes a possibility to force them to be scheduled on a particular node would be useful. More information about process–core affinity is described by Sunter [16], section 3.3.
More improvements and recommendations are available in [17]. Besides these recommendations, the algorithm could be improved with additional functionality: a lot of results are presented for which new applications can be developed. One of these applications is the implementation of a deadlock analyser, which needs information about dependencies of processes to find situations which might become a problem. An example of a problematic situation is a circular path in the dependency graph. Such paths indicate that the processes keep waiting on each other. Circular paths also have a bad influence on the algorithm and are already handled internally. Adding some extra rules to the internal problem-handling code could be enough to detect deadlocks during analysis. When building a model designer application (like gCSP) this method could be used to check for deadlocks during the design of a model.

Another improvement would be a new algorithm which is able to bundle channels into a multiplexed channel. This would result in fewer context switches, since multiple writers and readers are combined into a single multiplexed writer and reader. After the heap scheduler has finished, it is known which processes will be grouped together. Parallel channels between these groups could probably be combined into a single channel which transfers the data in a single step.
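The deadlock check suggested above amounts to finding circular paths in the dependency graph, which a standard depth-first search with a recursion stack can do. A sketch with illustrative graphs:

```python
# Sketch: detect circular paths (potential deadlocks) in a dependency
# graph given as an adjacency list. Process names are illustrative.

def has_cycle(graph):
    visited, on_stack = set(), set()

    def visit(v):
        visited.add(v)
        on_stack.add(v)
        for w in graph.get(v, []):
            if w in on_stack or (w not in visited and visit(w)):
                return True
        on_stack.discard(v)
        return False

    return any(v not in visited and visit(v) for v in graph)

acyclic = {"P1_Wr": ["C1_Rd"], "C1_Rd": ["C1_C"], "C1_C": []}
cyclic = {"A_Wr": ["B_Rd"], "B_Rd": ["B_Wr"], "B_Wr": ["A_Rd"], "A_Rd": ["A_Wr"]}
print(has_cycle(acyclic), has_cycle(cyclic))  # False True
```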
3. Conclusions and Recommendations

Both algorithms can be used to analyse gCSP models in order to make the execution of gCSP models more efficient, each from a different point of view and with different results. The runtime analyser returns a set of static chains, which can be used to build a new gCSP model. This model implements each chain as one process, which saves context switches and thus system resources. The model analyser presents schedules for gCSP models, keeping a given system architecture in mind. The system load, available resources and communication costs are kept as optimal as possible.

However, some work on refining both algorithms must be done. The runtime analyser results appear hard to interpret, so a better way of representing them is required. As described, grouping sets of processes which occur more than once might decrease the lengths of the chains, making the set of chains more understandable. Automatically creating a new optimised model from the runtime analyser results would be the most appropriate solution. As described in section 1.6, support for analysing multi-core applications is required as well, especially when the CT library gets updated to support multiple parallel threads. The section also describes state-space-explosion related problems as unlikely when combining multiple processes into one. We need to look further into this statement in order to see if this indeed is the case.

The model analyser lacks a lot of functionality to make it a usable tool. As described in section 2.4, things like improved communication costs, a new channel multiplexing algorithm and deadlock checking could help to mature the tool.

After both tools become more usable, it is possible to make them part of our tool chain. This can be done by integrating them into gCSP2, which also needs more work first, in order to analyse the models while designing them.
With the possibility of design-time checks, new features can be implemented as well, like multi-node code generation: producing multiple sets of code results in multiple executables, one for each node. For this, the results of the model analyser need to be included with the code generated by gCSP2; implementing this is not trivial and needs more structured work.

From the designer’s point of view, having lots of small processes for each task continues to be a good way of designing models. The results of both algorithms are useful for converting these models to an execution point of view.

Acknowledgements

The authors would like to thank the anonymous reviewers for their helpful and constructive feedback. Without this feedback the paper would not have been ready for publication as it is now.

References

[1] C.A.R. Hoare. Communicating Sequential Processes. Prentice Hall International, 1985. ISBN 0-13-153289-8.
[2] A.W. Roscoe, C.A.R. Hoare, and R. Bird. The Theory and Practice of Concurrency. Prentice Hall PTR, Upper Saddle River, NJ, USA, 1997. ISBN 0136744095.
[3] D.S. Jovanovic, B. Orlic, G.K. Liet, and J.F. Broenink. Graphical Tool for Designing CSP Systems. In Communicating Process Architectures 2004, pages 233–252, September 2004. ISBN 1-58603-458-8.
[4] T.T.J. van der Steen, M.A. Groothuis, and J.F. Broenink. Designing Animation Facilities for gCSP. In Communicating Process Architectures 2008, volume 66 of Concurrent Systems Engineering Series, page 447, Amsterdam, September 2008. IOS Press. ISBN 978-1-58603-907-3. doi: 10.3233/978-1-58603-907-3-447.
[5] G.H. Hilderink, J.F. Broenink, and A. Bakkers. Communicating Java Threads. In Parallel Programming and Java, Proceedings of WoTUG 20, volume 50, pages 48–76, University of Twente, Netherlands, 1997. IOS Press, Netherlands.
[6] D. May. Communicating process architecture for multicores. In A.A. McEwan, W. Ifill, and P.H. Welch, editors, Communicating Process Architectures 2007, pages 21–32, July 2007. ISBN 978-1586037673.
[7] J.E. Boillat and P.G. Kropf. A fast distributed mapping algorithm. In CONPAR 90: Proceedings of the joint international conference on Vector and parallel processing, pages 405–416, New York, NY, USA, 1990. Springer-Verlag New York, Inc. ISBN 0-387-53065-7.
[8] J. Magott. Performance evaluation of communicating sequential processes (CSP) using Petri nets. Computers and Digital Techniques, IEE Proceedings E, 139(3):237–241, May 1992. ISSN 0143-7062.
[9] L.C.J. van Rijn. Parallelization of model equations for the modeling and simulation package CAMAS. Master’s thesis, Control Engineering, University of Twente, November 1990.
[10] K.C.J. Wijbrans, L.C.J. van Rijn, and J.F. Broenink. Parallelization of Simulation Models Using the HSDE Method. In E. Mosekilde, editor, Proceedings of the European Simulation Multiconference, June 1991.
[11] N.C. Brown and M.L. Smith. Representation and Implementation of CSP and VCR Traces. In Communicating Process Architectures 2008, volume 66 of Concurrent Systems Engineering Series, pages 329–345, Amsterdam, September 2008. IOS Press. ISBN 978-1-58603-907-3.
[12] G.H. Hilderink. Managing complexity of control software through concurrency. PhD thesis, Control Engineering, University of Twente, May 2005.
[13] T.T.J. van der Steen. Design of animation and debug facilities for gCSP. Master’s thesis, Control Engineering, University of Twente, June 2008. URL http://purl.org/utwente/e58120.
[14] M.A. Groothuis, A.S. Damstra, and J.F. Broenink. Virtual Prototyping through Co-simulation of a Cartesian Plotter.
In Emerging Technologies and Factory Automation, 2008. ETFA 2008. IEEE International Conference on, number 08HT8968C, pages 697–700. IEEE Industrial Electronics Society, September 2008. ISBN 978-1-4244-1505-2. doi: 10.1109/etfa.2008.4638472.
[15] Controllab Products. 20-sim website, 2009.
[16] J.P.E. Sunter. Allocation, Scheduling & Interfacing in Real-Time Parallel Control Systems. PhD thesis, Control Engineering, University of Twente, 1994.
[17] M.M. Bezemer. Analysing gCSP models using runtime and model analysis algorithms. Master’s thesis, Control Engineering, University of Twente, November 2008. URL http://purl.org/utwente/e58499.
Communicating Process Architectures 2009 P.H. Welch et al. (Eds.) IOS Press, 2009 © 2009 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-60750-065-0-89
Relating and Visualising CSP, VCR and Structural Traces

Neil C. C. BROWN a and Marc L. SMITH b
a School of Computing, University of Kent, Canterbury, Kent, CT2 7NF, UK, [email protected]
b Computer Science Department, Vassar College, Poughkeepsie, New York 12604, USA, [email protected]

Abstract. As well as being a useful tool for formal reasoning, a trace can provide insight into a concurrent program’s behaviour, especially for the purposes of run-time analysis and debugging. Long-running programs tend to produce large traces which can be difficult to comprehend and visualise. We examine the relationship between three types of traces (CSP, VCR and Structural), establish an ordering and describe methods for conversion between the trace types. Structural traces preserve the structure of composition and reveal the repetition of individual processes, and are thus well-suited to visualisation. We introduce the Starving Philosophers to motivate the value of structural traces for reasoning about behaviour not easily predicted from a program’s specification. A remaining challenge is to integrate structural traces into a more formal setting, such as the Unifying Theories of Programming – however, structural traces do provide a useful framework for analysing large systems.

Keywords. traces, CSP, VCR, structural traces, visualisation
Introduction Hoare and Roscoe’s Communicating Sequential Processes (CSP) [1,2] is most well-known as a process calculus (or process algebra) for describing and reasoning about concurrent systems. As its name suggests, a concurrent system may be viewed as a composition of individual, communicating sequential processes. One of Hoare and Roscoe’s big ideas was that one may reason about a computation by reasoning about its history of observable events. This history must be recorded by someone, and so CSP introduces the notion of an observer to perform this task. Thus, CSP introduced traces as a means of representing a process’s computational history. The idea of observable events in concurrent computation is an extension of the idea from computability theory that one may observe only the input/output behaviour of processes when reasoning about properties of that process (e.g., will a given process halt on its input?). Communication (between two processes, or between a process and its environment) is another form of input/output behaviour, and therefore observable events are a reasonable basis for recording traces of computations. In mathematics, algebra is the study of equations, and calculus is the study of change. How do these areas of mathematics relate to processes? It follows that a process algebra provides a means for describing processes as equations, and a process calculus provides a means for reasoning about changes in a system as computation proceeds. CSP provides a rich process algebra that is used for process specification, and a process calculus based on possible traces of a process’s specification to help prove properties of a concurrent system. Cast in terms of change, a trace represents a system’s computational progress, or the changes in the state of a system during its computation. Modern CSP [2] considers more than just traces in
N.C.C. Brown and M.L. Smith / Relating and Visualising CSP, VCR and Structural Traces
its process calculus (i.e., possible failures, divergences, and refinements of a process), though in this paper we are concerned with investigating what further benefits traces can provide, not in the context of specification, but in the context of debugging large systems. The remainder of this introduction discusses the three different types of traces, a motivating example of distributed systems and the use of traces in debugging. CSP, VCR and Structural Traces In previous work [3], the authors extended the notion of what it means to record a trace, and described how the shapes of these differently recorded traces extend the original definition of a CSP trace. In this section, we briefly summarize this previous work. CSP traces are sequentially interleaved representations of concurrent computations. One way to characterize a CSP trace is that it abstracts away time and space from a concurrent computation. Due to interleaving in the trace, one cannot in general see a followed by b in the trace and know from the trace alone whether event a actually occurred long before event b, or whether they were observed at the same time and ordered arbitrarily in the trace. Likewise, if process P and process Q are both capable of engaging in event a, one cannot tell from the trace alone whether it was process P or process Q that engaged in event a, or whether a represents a synchronizing event for processes P and Q. VCR traces support View-Centric Reasoning [4], provide for multiple observers, and offer a means of representing multiple views of a computation. Rather than a sequence of events, a VCR trace is a sequence of event multisets, where the multisets represent independent events. Independent events are events that may be nondeterministically interleaved by the CSP observer. When new events are observed for a VCR trace, they are added to a new multiset iff a dependence relationship can be inferred between the new event and the events in the current multiset [3].
If this dependency cannot be inferred, the event is added to the current multiset in the trace. In other words, a VCR trace preserves some of the timing information of events, but still abstracts away space from a concurrent computation (i.e., what process engaged in each of the events). A VCR trace permits reasoning about all the possible interleavings of a particular computation, but does not preserve which process engaged in which event. Structural traces provide not only multiple observers, but also a means of associating those observers with the processes themselves! As the name suggests, structural traces reflect the structure of a concurrent system’s composition. That is, structural traces abstract away neither time nor space. A parallel composition of CSP processes is recorded as a parallel composition of the respective traces in the structural trace. Each process has a local observer who records a corresponding trace, for example the trace of a process P. If P contains a parallel composition of processes Q and R, then after both Q and R have completed, P’s observer records the traces of Q and R as a parallel composition in P’s trace. Besides the extra structure, the primary difference between structural traces and other types is that structural traces record an event synchronisation multiple times – once for each participant. The grammar of the three different trace types discussed above is given in Figure 1. As an example, given the CSP system:

(a → SKIP ||| b → SKIP) o9 (c → SKIP ||{c} c → SKIP)
The possible traces are shown below:

CSP: ⟨a, b, c⟩ or ⟨b, a, c⟩
VCR: ⟨{a, b}, {c}⟩
Structural: (a || b) → (c || c)

Note that the synchronising event c occurs twice in the parallel composition in the structural trace.
CSPTRACE ::= ⟨⟩ | ⟨ EVENT (, EVENT)∗ ⟩    (1)
VCRTRACE ::= ⟨⟩ | ⟨ EVENTBAG (, EVENTBAG)∗ ⟩    (2)
EVENTBAG ::= { EVENT (, EVENT)∗ }    (3)
STRUCTURALTRACE ::= () | SEQ    (4)
SEQ ::= ((EVENT | PAR) (→ SEQ)?) | (NATURAL ∗ SEQ)    (5)
PAR ::= SEQ || SEQ (|| SEQ)∗    (6)

Figure 1. The grammar for CSP traces (line 1), VCR traces (lines 2–3) and structural traces (lines 4–6). Literals are underlined. A superscript ‘∗’ represents repetition zero or more times, and a superscript ‘?’ represents repetition zero or one time (i.e. an optional item).
Distributed Systems We can consider the differences between the three trace types with a physical analogy. Physics tells us that light has a finite speed. We can imagine a galaxy of planets, light-years apart, observing each other’s visible transmissions. CSP calls for an omniscient, Olympian observer that is able to observe all events in the galaxy and order them correctly (aside from events that appear to have occurred simultaneously) – a strong requirement! VCR’s multisets of independent events allow us to capture some of the haziness of observation, by leaving events independent unless a definite dependence can be observed. Structural traces allow for each planet to observe its own events, thus removing any concerns about accuracy or observation delay. The trace of the whole galaxy is formed by assembling together, compositionally, the traces of the different planets. This analogy may appear somewhat abstract, but it maps directly to the practical problem of recording traces for distributed systems, where each machine has internal concurrency and communication, as well as the parallelism and communication of the multiple machines. Implementing CSP’s Olympian observer in a distributed system, that could accurately observe all internal machine events (from multiple machines) and networked events, is not feasible. The CSP observer must be present on one of the machines, rendering its observation of other machines inaccurate. The system could perhaps be implemented using synchronised high-resolution timers (and a later merging of traces), but what VCR and structural traces capture is that the ordering of some events does not matter. If machines A and B communicate while C and D also communicate, with no interaction between the two pairs, we can deduce that the ordering of the A-B and C-D communications was arbitrary as they could have occurred in either order. In VCR terms, they are independent events.
Trace Investigation We have already stated that CSP enables proof using traces (among other models) – should your program already be formally modelled, the need for further analysis is perhaps unclear. Two possible uses for trace investigation are given in this section. CSP offers both external choice (where the environment decides between several offered events) and internal choice (where the process itself decides which to offer). There is sometimes an implicit intention that the program be roughly fair in the choice. For example, if a server is servicing requests from two clients, neither client should be starved of attention from
the server. This may even be a qualitative judgement; hence human examination of the traces of a system (that capture its behaviour) may be needed. We give an example of starvation in section 2.2. Another reason is that the program may have capacity for error handling. For example, some events may be offered by a server (to prevent deadlock), but for well-formed messages from a remote client, these events should not occur. A spate of badly-formed messages from a client may be cause for investigation, and a trace can aid in this respect. Organisation It turns out that in some senses a structural trace generalizes both VCR and CSP traces, which we discuss in section 1. The generality of structural traces is important for two reasons. First, the CHP library [5] is capable of producing all three types of traces of a Haskell program, but producing structural traces is more efficient than producing the other two trace types. (As previously discussed by the authors [3], these traces are different from merely adding print statements to a program in that recording traces does not change the behaviour of the program.) From a structural trace, one can generate equivalent VCR or CSP traces (we give the algorithms in section 1). Second, structural traces provide a framework for visualisation of concurrent programs, which we discuss in section 2, and argue how trace visualisation is helpful for debugging. In section 3 we discuss our progress on characterizing structural traces more formally, and the challenges that remain.
1. Conversion Each trace type is a different representation of an execution of a system. It is possible to convert between some of the representations. One VCR trace can be converted to many CSP traces (see section 1.1). A VCR trace can thus be considered to be a particular equivalence class of CSP traces. One structural trace can be interleaved in different ways to form many VCR traces and (either directly or transitively) many CSP traces (see section 1.2). A structural trace can also be considered to be an equivalence class of VCR (or CSP) traces. This gives us an ordering on our trace representations: one structural trace can form many VCR traces, which can in turn form many CSP traces. These conversions are deterministic and yield finite sets of traces. In general, conversions in the opposite direction (e.g., CSP to Structural) can be non-deterministic and infinite – for example, the CSP trace ⟨a⟩ could have been generated by an infinite set of structural traces: a0, or (a0 || a0), or (a0 || a0 || a0), and so on. 1.1. Converting VCR Traces to CSP Traces One VCR trace can be transformed to many CSP traces. A VCR trace is a sequence of multisets; by forming all permutations of the different multisets and concatenating them, it is possible to generate the corresponding set of all possible CSP traces. A Haskell function for performing the conversion is given in Figure 2 – an example of this conversion is:

⟨{a, b}, {c}, {d, e}⟩ → {⟨a, b, c, d, e⟩, ⟨a, b, c, e, d⟩, ⟨b, a, c, d, e⟩, ⟨b, a, c, e, d⟩}

Note that the number of resulting CSP traces is easy to calculate: for all non-empty VCR traces tr, the identity Set.size (vcrToCSP tr) == foldl1 (∗) (map Set.size tr) holds. That is, the number of generated CSP traces is equal to the product of the sizes of each set in the VCR trace.
type CSPTrace = [Event]
type VCRTrace = [Set Event]   -- All sets must be non-empty.

cartesianProduct :: [a] -> [b] -> [(a, b)]
cartesianProduct xs ys = [(x, y) | x <- xs, y <- ys]

vcrToCSP :: VCRTrace -> Set CSPTrace
vcrToCSP []       = singleton []
vcrToCSP (s : ss) = fromList
  [ a ++ b
  | (a, b) <- cartesianProduct (permutations (toList s))
                               (toList (vcrToCSP ss)) ]

Figure 2. An algorithm for converting a VCR trace into a set of CSP traces.
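As a quick check of the counting identity above, the Figure 2 definitions can be restated over String events and run (a sketch; the concrete Event type and the example trace are our own additions):

```haskell
import qualified Data.Set as Set
import Data.Set (Set)
import Data.List (permutations)

type Event = String
type CSPTrace = [Event]
type VCRTrace = [Set Event]  -- all sets non-empty

cartesianProduct :: [a] -> [b] -> [(a, b)]
cartesianProduct xs ys = [(x, y) | x <- xs, y <- ys]

-- As in Figure 2: every permutation of each multiset, concatenated in order.
vcrToCSP :: VCRTrace -> Set CSPTrace
vcrToCSP []       = Set.singleton []
vcrToCSP (s : ss) = Set.fromList
  [ a ++ b
  | (a, b) <- cartesianProduct (permutations (Set.toList s))
                               (Set.toList (vcrToCSP ss)) ]

example :: VCRTrace
example = map Set.fromList [["a", "b"], ["c"], ["d", "e"]]

main :: IO ()
main = do
  print (Set.size (vcrToCSP example))                          -- 4
  print (Set.size (vcrToCSP example)
           == product (map Set.size example))                  -- True
```

Running this confirms the identity for the example trace: 2 * 1 * 2 = 4 CSP traces.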
1.2. Converting Structural Traces to VCR or CSP Traces Our originally recorded structural traces [3] could not be converted into CSP or VCR traces post hoc because it was not clear how events “lined up”. Consider the structural trace: (a → a) || a It is not clear which (if either!) of the events from the LHS of the parallel composition are the same synchronisation as the RHS. Therefore it is not clear whether this should become the CSP trace a, a, a or a, a. A structural trace can be converted into CSP or VCR traces if we record a little extra information. Specifically, we must record a sequence number with each communication event in the trace. In CSP terms, we effectively replace all synchronisations on each event a with a corresponding external choice:
(a → P) becomes (□ ai ∈ A : ai → P), where A = {ai | i ∈ N0}

All uses of the event a in sets for hiding and parallel composition must also be replaced by the events in A. We must then compose our existing system P in parallel with a new process for that event:

P becomes (P ||A SEQa,0), where SEQa,i = ai → SEQa,i+1
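A minimal sketch of this instrumentation in Haskell (the EventState type and function names here are hypothetical, not CHP's actual API): each event carries its own counter, which the synchronisation mechanism increments once per completed synchronisation, and every participant records the resulting (event, sequence-id) pair in its local trace.

```haskell
import Data.IORef

type Event = String
type SeqId = Integer

-- Hypothetical per-event state: each channel or barrier carries a counter.
data EventState = EventState Event (IORef SeqId)

newEventState :: Event -> IO EventState
newEventState e = EventState e <$> newIORef 0

-- Performed once per completed synchronisation; every participant records
-- the returned (event, sequence-id) pair in its local structural trace.
nextSync :: EventState -> IO (Event, SeqId)
nextSync (EventState e c) = do
  n <- readIORef c
  writeIORef c (n + 1)
  return (e, n)

main :: IO ()
main = do
  a <- newEventState "a"
  s0 <- nextSync a
  s1 <- nextSync a
  print (s0, s1)  -- (("a",0),("a",1))
```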
In the implementation, the sequence number becomes part of the data structure underlying channels and barriers, and adds no significant overhead to the cost of communications. With this sequence number, it is possible to match up the communications from different parts of the structural trace. Our earlier example would become (a0 → a1 ) || a0 , if the first synchronisation on the LHS was the one featured on the RHS. Furthermore, the format of the structural trace means that it was already possible to derive the process identifiers needed to form the VCR trace. These two aspects combined will allow us to generate CSP and VCR traces from structural traces. 1.2.1. Algorithm Description We define the algorithm for converting structural traces to CSP traces in Haskell, in Figure 3. First, we define our data structures, based around an Event type with hidden definition (see Figure 3, lines 1–7). A CSP trace is a list of events; a structural trace is either empty or a sequential trace. A sequential trace is an event synchronisation (an event identifier paired with a sequence identifier) or a parallel trace, followed by an optional further sequential
trace in a cons-fashion. A parallel trace is a collection of two or more sequential traces. This corresponds to the grammar in Figure 1. To convert a structural trace, we must first know how many processes were involved in each synchronisation. We thus define a prSeq function that builds up a map from an event and sequence identifier pair to the number of processes that synchronised on that event (see Figure 3, lines 9–16). The remainder of our algorithm can be implemented in a continuation-like style. The structural trace is explored, and a map is built up with all the events initially available (and the number of processes available to engage in them), as well as a function that, given an occurred event, will return the next map and function. This is done by the convSeq function (see Figure 3, lines 18–36). The final piece is a function that iteratively picks the next available event from the merged maps and records it in a new CSP trace, until the trace is complete (see Figure 3, lines 38–49). This picking of available events and then continuing the trace is strongly reminiscent of the execution of the CSP program. Effectively, our algorithm is the environment, picking arbitrarily from the available set of events and then continuing the computation. Of course, our structural trace differs from a typical CSP system in that it is finite, lacks any choice, and by definition our replaying of a real structural trace is deadlock-free. 1.2.2. Example We give here an example of converting a structural trace to its CSP and VCR forms:

(a0 → b0 → b1 → a1) || (c0 → b0 → (c1 || d0) → b1)
→ {⟨a, c, b, c, d, b, a⟩, ⟨a, c, b, d, c, b, a⟩, ⟨c, a, b, c, d, b, a⟩, ⟨c, a, b, d, c, b, a⟩}

(a0 → b0 → b1 → a1) || (c0 → b0 → (c1 || d0) → b1)
→ ⟨{a, c}, {b}, {c, d}, {b}, {a}⟩

It can be seen that the set of CSP traces produced directly is the same as would be produced by converting the VCR trace into CSP traces. 1.2.3.
Generalising and VCR Algorithm Although our algorithm creates one trace through picking arbitrarily, it would be trivial to modify it to generate all possible traces through exploring all options. From a more abstract perspective, the set of all traces is of interest: generating an arbitrary trace is a needless specialisation. From a practical perspective, the set of all traces may be overwhelming and thus it may be easier to explore an arbitrary single converted trace. The algorithm for converting structural traces to VCR traces is a combination of the algorithm given here for converting Structural to CSP traces, and the pre-existing algorithm for recording VCR traces based on process identifiers [3]. One interesting aspect is the choice of available events. In our conversion to CSP, we pick arbitrarily (Figure 3, line 49). In VCR, the choice (if we only wish to generate one trace) makes a noticeable difference to the resulting trace. Consider the structural trace: (a → b) || c
1  type CSPTrace = [Event]
2
3  type SeqId = Integer
4  data StructuralSeq = Seq (Either (Event, SeqId) StructuralPar)
5                           (Maybe StructuralSeq)
6  data StructuralPar = Par StructuralSeq StructuralSeq [StructuralSeq]
7  data StructuralTrace = TEmpty | T StructuralSeq
8
9  combine :: [Map (Event, SeqId) Integer] -> Map (Event, SeqId) Integer
10 combine = foldl (Map.unionWith (+)) Map.empty
11
12 prSeq :: StructuralSeq -> Map (Event, SeqId) Integer
13 prSeq (Seq (Left e) s)
14   = combine [Map.singleton e 1, maybe Map.empty prSeq s]
15 prSeq (Seq (Right (Par sA sB ss)) s)
16   = combine (maybe Map.empty prSeq s : prSeq sA : prSeq sB : map prSeq ss)
17
18 data Cont = Cont (Map (Event, SeqId) Integer) ((Event, SeqId) -> Cont)
19           | ContDone
20
21 convSeq :: StructuralSeq -> Cont
22 convSeq (Seq (Left e) s) = c
23   where c = Cont (Map.singleton e 1)
24               (\e' -> if e /= e' then c else maybe ContDone convSeq s)
25 convSeq (Seq (Right (Par sA sB ss)) s)
26   = merge (convSeq sA : convSeq sB : map convSeq ss)
27       `andThen` maybe ContDone convSeq s
28
29 andThen :: Cont -> Cont -> Cont
30 ContDone `andThen` r = r
31 Cont m f `andThen` r = Cont m (\e -> f e `andThen` r)
32
33 merge :: [Cont] -> Cont
34 merge cs = case [m | Cont m f <- cs] of
35   []  -> ContDone
36   ms  -> Cont (combine ms) (\e -> merge [f e | Cont m f <- cs])
37
38 structuralToCSP :: StructuralTrace -> CSPTrace
39 structuralToCSP TEmpty = []
40 structuralToCSP (T s) = iterate (convSeq s)
41   where
42     participants = prSeq s
43
44     iterate :: Cont -> CSPTrace
45     iterate ContDone = []
46     iterate (Cont m f) = fst e : iterate (f e)
47       where
48         es = Map.filter (\(n, n') -> n == n') (Map.intersectionWith (,) m participants)
49         e  = fst (Map.findMin es)  -- Arbitrary pick

Figure 3. The algorithm for converting Structural traces to CSP traces.
Our first choice may be a, giving us the partial VCR trace ⟨{a}⟩. If our next choice is b, the VCR trace becomes ⟨{a}, {b}⟩, but if our next choice is c, the VCR trace becomes ⟨{a, c}⟩. We could favour the second choice if we wanted more independence visible in the trace – which may aid understanding. We emphasise again that this is a moot point for generating the set of all traces, but we believe that for single traces, a trace with maximal
obvious independence (i.e. fewer but larger sets in the VCR trace) will be easier to follow. 1.3. Practical Implications In our previous work [3] we discussed implementing the recording of three different types of traces: CSP, VCR and Structural. A CSP trace is most straightforward to record, by using a mutex-protected sequence of events as a trace for the whole program. A VCR trace uses a mutex-protected data structure with additional process identifiers. Both of these recording mechanisms are inefficient and do not scale well, due to the single mutex-protected trace. With four, eight or more cores, the contention for the mutex may cause the program to slow down or alter its execution behaviour. This deficiency is not present with Structural traces, which are recorded locally without a lock, and thus scale well to more cores. We could also consider recording the traces of a distributed system. Implementing CSP’s Olympian observer on a distributed system (a parallel system with delayed messaging between machines) is a very challenging task. VCR’s concepts of imperfect observation and multiple observers would allow for a different style of observation. Structural traces could be recorded on different machines without modification to the recording strategy, and merged afterwards just as they normally are for parallel composition. The structural trace should also be easier to compress, due to regularity that is present in branches of the trace, but that is not present in the arbitrary interleavings of a CSP trace. Given that the structural trace is more efficient to record in terms of time and space (memory requirements), supports distributed systems well and can be converted to the other trace types post hoc, there is a strong case for recording a structural trace rather than a CSP or VCR trace. 2. Visual Traces It is possible to visualise the different traces in an attempt to gain more understanding of the execution that generated the trace.
For an example, we will use P:

P = (Q o9 Q) ||{b} ((b → b → SKIP) ||| (c → c → SKIP))
where Q = (a → SKIP) ||| (b → SKIP). One possible CSP trace of this system is: ⟨a, b, a, c, c, b⟩. This trace is depicted in Figure 4a. It can be seen that for CSP traces, this direct visualisation offers no benefits over the original trace. As an alternative, the trace is also depicted in Figure 4b. This diagram is visually similar to a finite state automaton, with each event occurrence being a state, and sequentially numbered edges showing the paths from event to event – but these edges must be followed in ascending order. A possible VCR trace of this system is: ⟨{a, b}, {a, c}, {c, b}⟩. This trace is depicted straightforwardly in Figure 4c. The structural trace of this system, with event sequence identifiers, is: ((a0 || b0) → (a1 || b1)) || ((b0 → b1) || (c0 → c1)). A straightforward depiction is given in Figure 4d. An alternative is to merge together the nodes that represent the same event and sequence number, and use edges with thread-identifiers to join them together: this is depicted in Figure 4e.
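Depictions such as Figure 4d can be generated mechanically. A sketch of a Graphviz DOT emitter for structural traces follows (the types are a simplified variant of Figure 3's, with a plain list of parallel branches; all names here are our own, not part of the CHP library):

```haskell
type Event = String
type SeqId = Integer

-- Simplified structural-trace types: parallel composition as a plain list.
data SSeq = SSeq (Either (Event, SeqId) SPar) (Maybe SSeq)
data SPar = SPar [SSeq]

node :: Int -> String -> String
node n lbl = "  n" ++ show n ++ " [label=\"" ++ lbl ++ "\"];"

edge :: Int -> Int -> String
edge a b = "  n" ++ show a ++ " -> n" ++ show b ++ ";"

-- Render a sequential trace: (entry node, exit node, statements, next id).
renderSeq :: SSeq -> Int -> (Int, Int, [String], Int)
renderSeq (SSeq hd tl) i0 =
  let (en, ex, ss, i1) = case hd of
        Left (e, n) -> (i0, i0, [node i0 (e ++ show n)], i0 + 1)
        Right p     -> renderPar p i0
  in case tl of
       Nothing -> (en, ex, ss, i1)
       Just s  -> let (en', ex', ss', i2) = renderSeq s i1
                  in (en, ex', ss ++ ss' ++ [edge ex en'], i2)

-- Render a parallel trace with dummy FORK/JOIN nodes, as in Figure 4d.
renderPar :: SPar -> Int -> (Int, Int, [String], Int)
renderPar (SPar bs) i0 =
  let fork = i0
      join = i0 + 1
      step (acc, i) b = let (en, ex, ss, i') = renderSeq b i
                        in (acc ++ ss ++ [edge fork en, edge ex join], i')
      (body, iN) = foldl step ([], i0 + 2) bs
  in (fork, join, node fork "FORK" : node join "JOIN" : body, iN)

toDot :: SSeq -> String
toDot s = unlines ("digraph trace {" : ss ++ ["}"])
  where (_, _, ss, _) = renderSeq s 0

-- The structural trace (a0 -> b0) || c0, built by hand for illustration.
example :: SSeq
example = SSeq (Right (SPar [ SSeq (Left ("a", 0))
                                   (Just (SSeq (Left ("b", 0)) Nothing))
                            , SSeq (Left ("c", 0)) Nothing ]))
               Nothing

main :: IO ()
main = putStr (toDot example)
```

The output can be piped straight into Graphviz's dot tool to produce a diagram.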
(a) A simple rendering of the CSP trace: ⟨a, b, a, c, c, b⟩.
(b) A more complicated rendering of the CSP trace from (a): event synchronisations are rendered as state nodes, and ascending transition identifiers label the edges.
(c) A simple rendering of the VCR trace: ⟨{a, b}, {a, c}, {c, b}⟩
(d) A rendering of the structural trace: ((a0 || b0 ) → (a1 || b1 )) || ((b0 → b1 ) || (c0 → c1 )). Event synchronisations are represented by nodes with the same name, and the edges are labelled threads of execution joining together the event synchronisations.
(e) A rendering of the structural trace: ((a0 || b0 ) → (a1 || b1 )) || ((b0 → b1 ) || (c0 → c1 )). Event synchronisations are represented by a single node, and the edges are labelled threads of execution joining together the event synchronisations. Figure 4. Various graphical representations of traces that could be produced by the CSP system P = (Q o9 Q) ||{b} ((b → b → SKIP) ||| (c → c → SKIP)) where Q = (a → SKIP) ||| (b → SKIP).
2.1. Discussion of Visual Traces Merely graphing the traces directly adds little power to the traces (Figures 4a, 4c and 4d). The contrast between the CSP, VCR and structural trace diagrams is interesting. The elegance and simplicity of the CSP trace (Figure 4a) can be contrasted with the more information
[Figure 5 diagram: a Petri net with transitions a, b and c (each occurring twice), FORK and JOIN transitions, and unlabelled places along the edges.]

Figure 5. A Petri net formed from Figure 4e by using the nodes as transitions and the edges as places (unlabelled in this graph). Being from a trace, the Petri net is finite and non-looping; it terminates successfully when a token reaches the right-most place.
provided by the VCR trace of independence (Figure 4c) and the more complex structure of the structural trace (Figure 4d). The merged version of the structural trace (Figure 4e) is a simple change from the original (Figure 4d), but one that enables the reader to see where the different threads of execution “meet” as part of a synchronisation. Figure 4e strongly resembles the dual of a graphical representation of a Petri net [6], a model of true concurrency. The nodes in our figure (FORK, JOIN and events) are effectively transitions in a standard Petri net, and the edges are places in a standard Petri net. Thus we can mechanically form a Petri net equivalent, as shown in Figure 5. Figure 4e is also interesting for its relation to VCR traces. Our definition of dependent events [3] is visually evident in this style of graph. An event b is dependent on a iff a path can be formed in the directed graph from a to b. Thus, in Figure 4e, it can be seen that the c events are mutually independent of the a and b events, and that the right-most a event is dependent on the left-most b event, and so on.
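This reachability check is easy to mechanise. A sketch follows; the edge list is written out by hand for the Figure 4e example, whereas in practice it would be derived from the merged graph:

```haskell
import qualified Data.Map as Map
import Data.Map (Map)
import Data.Maybe (fromMaybe)

type Node = String

-- Event-to-event edges of the merged graph of Figure 4e, written by hand.
graph :: Map Node [Node]
graph = Map.fromList
  [ ("a0", ["a1", "b1"]), ("b0", ["a1", "b1"])
  , ("a1", []), ("b1", [])
  , ("c0", ["c1"]), ("c1", []) ]

-- An event occurrence y is dependent on x iff a directed path leads from x to y.
dependent :: Map Node [Node] -> Node -> Node -> Bool
dependent g x y = go (fromMaybe [] (Map.lookup x g)) []
  where
    go [] _ = False
    go (n : ns) seen
      | n == y        = True
      | n `elem` seen = go ns seen
      | otherwise     = go (fromMaybe [] (Map.lookup n g) ++ ns) (n : seen)

main :: IO ()
main = do
  print (dependent graph "a0" "b1")  -- True: b1 depends on a0
  print (dependent graph "a0" "c1")  -- False: the c events are independent
```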
2.2. The Starving Philosophers
CSP models (coupled with tools such as FDR) allow formally specified properties to be proved about programs: for example, freedom from deadlock, or satisfaction of a specification. Some properties of a system can be difficult to capture in this way – we present here an example involving starvation, based on Peter Welch’s “Wot, No Chickens?” example [7]. Our example involves three philosophers. All the philosophers repeatedly attempt to eat chicken (no time for thinking in this example!). They are coupled with a chef, who can serve two portions of chicken at a time. In pseudo-CSP¹, our example is:
¹ The ampersand denotes conjunction [8]: the process a & b is an indivisible synchronisation on a and b.
Figure 6. An example structural trace of five iterations of the starving philosophers example. The chef is collapsed, as it could be in an interactive visualisation of the trace. It is clear from this zoomed-out view that one philosopher (shown in the third row) has far fewer synchronisations, revealing that it is being starved.
PHILOSOPHER(n) = chicken.n → PHILOSOPHER(n)

CHEF = ((chicken.0 & chicken.1)
     □ (chicken.0 & chicken.2)
     □ (chicken.1 & chicken.2)) o9 CHEF

SYSTEM = (PHILOSOPHER(0) ||| PHILOSOPHER(1) ||| PHILOSOPHER(2))
         ||{chicken.0, chicken.1, chicken.2} CHEF
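Starvation of this kind can also be detected from a recorded flat CSP trace by counting synchronisations per event; a sketch (the trace fragment here is illustrative, not a trace recorded from the real system):

```haskell
import qualified Data.Map as Map

type Event = String

-- Occurrences of each event in a recorded CSP trace; a large imbalance
-- between the chicken.n events suggests that a philosopher is starving.
eventCounts :: [Event] -> Map.Map Event Int
eventCounts = foldr (\e -> Map.insertWith (+) e 1) Map.empty

main :: IO ()
main = do
  -- Illustrative trace fragment in which philosopher 2 never eats:
  let tr = concat (replicate 5 ["chicken.0", "chicken.1"])
  print (eventCounts tr)  -- fromList [("chicken.0",5),("chicken.1",5)]
```

The absence (or low count) of chicken.2 in the resulting map is the numerical counterpart of the sparse third row in Figure 6.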
Our intention is that the three philosophers should eat with roughly the same frequency. However, the direct implementation of this system using the CHP library [5] leads to starvation of the third philosopher. Adding additional philosophers does not alleviate the problem: the new philosophers would also starve. In the CHP implementation, the chef always chooses the first option (to feed the first two philosophers) when multiple options are ready. This is a behaviour that is allowed by the CSP specification but that nevertheless we would like to avoid. Attempting to ensure that the three philosophers will eat with approximately the same frequency is difficult formally, but by performing diagnostics on the system we can see if it is happening in a given implementation. By recording the trace and visualising it (as shown in the structural trace in Figure 6) we are able to readily spot this behaviour and attempt to remedy it. 2.3. Interaction The diagrams already presented are static graphs of the traces. One advantage of the structural traces is that they preserve, to some degree, the organisation of the original program. The amenability of process-oriented programs to visualisation and visual interaction has long been noted – Simpson and Jacobsen provide a comprehensive review of previous work in this area [9]. The tools reviewed all focused on the design of programs, but we could use similar ideas to provide tools for interpreting traces of programs. We wish to support examination of one trace of a program, in contrast to the exploration of all traces of a program that tools such as PRoBE provide [2]. One example of an interactive user interface is shown in Figure 8. This borrows heavily from existing designs for process-oriented design tools – and indeed, it would be possible to display the code for the program alongside the trace in the lower panel. 
This could provide a powerful integrated development environment by merging the code (what can happen) with the trace (what did happen). An alternative example is given in Figure 9. This shows the process hierarchy vertically, rather than the previous nesting strategy. Synchronising events are shown horizontally, connecting the different parts of parallel compositions. What this view makes explicit is that for any given event, there is an expansion level in the tree such that all uses of that event are solely contained within one process. For a, b and c, this process is numbers; for d it is the root of the tree.
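Finding that expansion level can be sketched as a walk over a process tree (the PTree type and the owner function are our own illustrative constructions, not part of the CHP library):

```haskell
type Event = String

-- A hypothetical process tree: the process name, the events the process
-- itself engages in, and its child processes.
data PTree = PNode String [Event] [PTree]

-- Number of uses of event e within a subtree.
count :: Event -> PTree -> Int
count e (PNode _ es cs) = length (filter (== e) es) + sum (map (count e) cs)

-- The deepest process whose subtree contains every use of e: the expansion
-- level at which e becomes wholly internal.
owner :: Event -> PTree -> Maybe String
owner e root
  | total == 0 = Nothing
  | otherwise  = Just (go root)
  where
    total = count e root
    go (PNode nm _ cs) = case [c | c <- cs, count e c == total] of
      (c : _) -> go c
      []      -> nm

-- The CommsTime hierarchy from Figure 7.
commsTime :: PTree
commsTime = PNode "CommsTime" []
  [ PNode "numbers" []
      [ PNode "delta"  ["b", "c", "d"] []
      , PNode "prefix" ["a", "b"]      []
      , PNode "succ"   ["a", "c"]      [] ]
  , PNode "recorder" ["d"] [] ]

main :: IO ()
main = do
  print (owner "a" commsTime)  -- Just "numbers"
  print (owner "d" commsTime)  -- Just "CommsTime"
```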
CommsTime = (numbers(d) ||{d} recorder(d)) \ {d}

numbers(out) = ((delta(b, c, out) ||{b} prefix(a, b)) ||{a,c} succ(c, a)) \ {a, b, c}

delta(in, outA, outB) = in?x → (outA!x → SKIP ||| outB!x → SKIP) o9 delta(in, outA, outB)
prefix(in, out) = out!0 → id(in, out)
id(in, out) = in?x → out!x → id(in, out)
succ(in, out) = in?x → out!(x + 1) → succ(in, out)
recorder(in) = in?x → recorder(in)

(a) The CSP specification for the CommsTime network, CommsTime.

(numbers: (delta:42 ∗ (b? → (c! || d!))) || (prefix:(42 ∗ (b! → a?) → b!)) || (succ:42 ∗ (c! → a?)) || (recorder:42 ∗ d?))

(b) The structural trace for the CommsTime network. Note that the labels come from code annotations.

Figure 7. The specification and trace of the CommsTime example network. Note that the hiding of events – important for the formal specification – is disregarded for recording the trace in our implementation in order to provide maximum information about the system’s behaviour.
[Figure 8 diagrams: the process network numbers (containing delta, prefix and succ) composed with recorder, connected by channels a, b, c and d, with the selected process's trace shown beneath.]

(a) The selected process is numbers; its trace is 42*(b? -> (c! || d!)) || (42*(b! -> a?) -> b!) || 42*(c! -> a?).

(b) The selected process is delta; its trace is 42*(b? -> (c! || d!)).

Figure 8. An example user interface for exploring the trace (from Figure 7) of a process network. The top panel displays the process network, with its compositionality reflected in the nesting. Processes can be selected, and their trace displayed in the bottom panel. Inner processes could be hidden when a process is not selected (for example, recorder could have internal concurrency that is not displayed while it is not being examined).
3. Relations and Operations on Structural Traces

In their Unifying Theories of Programming [10], Hoare and He identified theories of programming by the sets of healthiness conditions that they satisfy. Many of these healthiness conditions involve the following relations and operations on traces: prefix (≤), concatenation (ˆ) and quotient (−). For CSP the definitions are obvious: prefix is the sequence-prefix relation, concatenation joins two sequences, and quotient subtracts the given prefix from a sequence. VCR traces have previously been set within the UTP framework [11], and we should consider the same for structural traces. However, the definition of these three relations and operations
[Figure 9 diagram: the process network drawn as a tree, with numbers and recorder joined horizontally by their shared event d. In (a) the recorder process is selected, numbers is collapsed, and the displayed trace is 42*d?. In (b) the succ process is selected, numbers is expanded to reveal delta, prefix and succ (with events a, b and c), and the displayed trace is 42*(c? -> a!).]
Figure 9. An example user interface for exploring the trace (from Figure 7) of a process network. The top panel shows a tree-like view for the process network. Processes can be selected and their trace displayed in the bottom panel. Notably, the internal concurrency of processes (and the associated events) can be hidden, as in the left-hand example for numbers. On the right-hand side, the numbers process is expanded to reveal the inner concurrency.
Figure 10. A partial structural trace (nodes without identifiers are dummy fork/join nodes). Concatenation and quotient need to be defined such that the whole trace minus the shaded area can later be concatenated with the shaded area to re-form the entire trace.
is difficult for structural traces. Implicitly, traces must obey the following law:

∀ tr, tr′ : (tr ≤ tr′) ⇒ (tr ˆ (tr′ − tr) = tr′)

Intuitively, the problem is that whereas a CSP trace has one end to append events to, a structural trace typically has many different ends that events could be added to: with an open parallel composition, the new events could be added to any branch of the open composition, or after the parallel composition (thus closing it). We must find a suitable representation to show where the new events must be added. This is represented in graphical form in Figure 10 – the problem is how to represent the shaded part of the tree so that it can be concatenated with the non-shaded part to form the whole trace. We leave this for future work, but it shows that meshing structural traces with existing trace theory is difficult or unsuitable. Hence, at this time we place our emphasis on the use of structural traces for practical investigation and understanding, rather than on theoretical and formal reasoning.
4. Conclusions

We have examined three trace types: CSP, VCR and Structural. We have made a case for structural traces being the most general form of the three and have shown that a given structural trace can be converted to a set of VCR traces, and that a given VCR trace can be converted to a set of CSP traces, giving us an ordering for the different trace types.

When developing concurrent systems, it might be instructive to examine the traces of a program under development, to verify that the current behaviour is as expected. This ongoing verification can be in addition to having used formal tools such as FDR to prove the program correct. Even after development is complete, it can be beneficial to check the traces of the system running for a long time under load, to check for patterns of behaviour such as starvation of processes. Additionally, it may be necessary to examine the program’s behaviour following recovery from a hardware fault (e.g., replacement of a network router) or security intrusion to ensure the system continues to operate correctly – situations that might be beyond the scope of a model checker.

Structural traces are the most general, the most efficient to record, and require the fewest assumptions for their recording strategy. Their efficiency arises from recording traces locally within each process, whereas recording CSP and VCR traces involves contending for access to a single centralised trace. They are also the most amenable to compression, as the repeating behaviour of an individual process is captured in the structural trace. Regardless of which trace type is used for investigation, structural traces can be used for the recording.

4.1. Future Work

Currently, two main areas of future work remain: visualisation and formalisation. We have explored visualising the different trace types, especially structural traces, which provide the most useful information for visualising a trace.
For large traces, structural traces provide a means for implementing a trace browser that permits focusing on particular processes. Such a tool would provide the visualisation capabilities necessary to locate causes of problems recorded in the trace. Figure 8 provides examples of what the user interface of such a trace browser might look like, but this application has yet to be implemented.

While structural traces contain useful structure that permits visualisation and conversion to VCR and CSP traces, this same structure makes them less amenable to formal manipulation. We have explained how operations (concatenation and quotient) that are straightforward on CSP and VCR traces are much more difficult to define on structural traces. Trying to define these operations has given us a greater appreciation of the elegance of CSP traces. Despite the challenge that defining concatenation and quotient presents, we remain motivated to solve this problem and ultimately draw structural traces into the Unifying Theories of Programming.

4.2. Availability

The CHP library [5] remains the primary implementation of the trace recording discussed here. We hope to soon release an update with the latest changes described in this paper and the accompanying programs for automatically generating the graphs.

Acknowledgements

We remain grateful to past reviewers for helping to clarify our thoughts on VCR and structural traces, especially for recasting the notion of parallel events as independent events. We are also grateful to Ian East, for his particularly inspiring “string-and-beads” diagram [12] that paved the way for much of the visualisation in this paper.
References

[1] C. A. R. Hoare. Communicating Sequential Processes. Prentice-Hall, 1985.
[2] A. W. Roscoe. The Theory and Practice of Concurrency. Prentice-Hall, 1997.
[3] Neil C. C. Brown and Marc L. Smith. Representation and Implementation of CSP and VCR Traces. In Communicating Process Architectures 2008, pages 329–345, September 2008.
[4] Marc L. Smith, Rebecca J. Parsons, and Charles E. Hughes. View-Centric Reasoning for Linda and Tuple Space computation. IEE Proceedings–Software, 150(2):71–84, April 2003.
[5] Neil C. C. Brown. Communicating Haskell Processes: Composable explicit concurrency using monads. In Communicating Process Architectures 2008, September 2008.
[6] W. Reisig. Petri Nets – An Introduction, volume 4 of EATCS Monographs on Theoretical Computer Science. Springer-Verlag, Berlin, New York, 1985.
[7] Peter H. Welch. Java Threads in the Light of occam/CSP. In Architectures, Languages and Patterns for Parallel and Distributed Applications, volume 52 of Concurrent Systems Engineering Series, pages 259–284. WoTUG, IOS Press, 1998.
[8] Neil C. C. Brown. How to make a process invisible. In Communicating Process Architectures 2008, page 445, 2008. Talk abstract.
[9] Jonathan Simpson and Christian L. Jacobsen. Visual process-oriented programming for robotics. In Communicating Process Architectures 2008, pages 365–380, September 2008.
[10] C. A. R. Hoare and Jifeng He. Unifying Theories of Programming. Prentice-Hall, 1998.
[11] Marc L. Smith. A unifying theory of true concurrency based on CSP and lazy observation. In Communicating Process Architectures 2005, pages 177–188, September 2005.
[12] Ian East. Parallel Processing with Communicating Process Architecture. Routledge, 1995.
Communicating Process Architectures 2009 P.H. Welch et al. (Eds.) IOS Press, 2009 © 2009 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-60750-065-0-105
Designing a Mathematically Verified I2C Device Driver Using ASD

Arjen Klomp a,1, Herman Roebbers b, Ruud Derwig c and Leon Bouwmeester a
a Verum B.V., Laan v Diepenvoorde 32, 5582 LA, Waalre, The Netherlands
b TASS B.V., P.O. Box 80060, 5600 KA, Eindhoven, The Netherlands
c NXP, High Tech Campus 46, 5656 AE, Eindhoven, The Netherlands

Abstract. This paper describes the application of the Analytical Software Design methodology to the development of a mathematically verified I2C device driver for Linux. A model of an I2C controller from NXP is created, against which the driver component is modelled. From within the ASD tool the composition is checked for deadlock, livelock and other concurrency issues by generating CSP from the models and checking these models with the CSP model checker FDR. Subsequently C code is automatically generated which, when linked with a suitable Linux kernel runtime, provides a complete defect-free Linux device driver. The performance and footprint are comparable to handwritten code.

Keywords. ASD, CSP, I2C, Device Driver, Formal Methods, Linux kernel
Introduction

In this Analytical Software Design [1] [2] (ASD) project, NXP’s Intellectual Property and Architecture group successfully used Verum’s ASD:Suite to model the behaviour of an I2C [3] device driver. ASD had been demonstrated in other case studies [4][5] to save time and cost for software teams developing and maintaining large and complex systems, but this was the first time the tool had been applied to driver software. NXP is a leading semiconductor company founded by Philips and, working with Verum, undertook a project to evaluate the benefits of using ASD:Suite for upgrading its device driver software.

1. Background

Analytical Software Design (ASD) combines the practical application of software engineering mathematics and modelling with specification methods that avoid difficult mathematical notations and remain understandable to all project stakeholders. In addition, it uses advanced code generation techniques. From a single set of design specifications, the necessary mathematical models and program code are generated automatically.

ASD uses the Sequence Based Specification Method [6] to specify functional requirements and designs as black box functions. These specifications are traceable to the original (informal) requirements specifications and remain completely accessible to the critical project stakeholders. In turn, this allows the stakeholders and domain experts to play a key role and fully participate in verifying ASD specifications; this is an essential requirement for successfully applying such techniques in practice.

1 Corresponding Author: [email protected]
A. Klomp et al. / A Mathematically Verified I2C Device Driver
[Figure 1 diagram: functional requirements and used component interfaces are captured as functional specifications and black box functions (BB: S* → R), each checked by inspection; CSP models generated from the black box specification and from the design (refined via BSDM) are composed in parallel and model checked with FDR; from the verified design, program code and test cases are generated, alongside hand-written code.]
Figure 1. An overview of Analytical Software Design.
At the same time, these specifications provide the degree of rigour and precision necessary for formal verification. Within ASD, one can apply the Box Structured Development Method (BSDM) [7], following the principles of stepwise refinement, to transform the black box design specifications into state box specifications from which program code is typically derived. Figure 1 summarizes the main elements of ASD.

The functional specification is analysed using the Sequence Based Specification method, extended to enable nondeterminism to be captured. This enables the externally visible behaviour of the system to be specified with precision and guarantees completeness. Next, the design is specified using Sequence-Based Specification. This still remains a creative, inventive design activity requiring skill and experience combined with domain knowledge. With ASD, however, the design is typically captured with much more precision than is usual with conventional development methods, raising many issues, such as illegal behaviour, deadlocks and race conditions, early in the life cycle and resolving them before implementation has started.

The ASD code generator automatically generates mathematical models from the black box and state box specifications and designs. These models are currently generated in CSP [8], which is a process algebra for describing concurrent systems and formally reasoning about their behaviour. This enables the analysis to be done automatically using the ASD model checker (based on FDR [9]). For example, we can use the model checker to verify whether a design satisfies its functional requirements and whether the state box specification (used as a programming specification) is behaviourally equivalent to the corresponding black box design. In most cases, a design cannot be verified in isolation; it depends on its execution environment and the components it uses for its complete behaviour.
In ASD, used component interfaces are captured as under-specified sequence based specifications. The corresponding CSP models are automatically generated, with under-specified behaviour
being modelled by the introduction of non-determinism. These models are then combined with those of the design, and the complete system is verified for compliance with the specification. Defects detected during the verification phase are corrected in the design specification, leading to the generation of new CSP models and the verification cycle being repeated (this is typically a very rapid cycle). After the design has been verified, the ASD code generator can be used to generate program source code in C++, C or other similar languages.

Section 2 introduces the case study and Section 3 describes the method used. Section 4 gives a more detailed overview of how ASD techniques were applied in practice for this particular case, and Section 5 presents the resulting analysis.

2. Case-study: a Mathematically Verified I2C Driver

2.1 Introduction

NXP’s IP and Architecture group was developing code for an I2C device driver for next generation systems. I2C drivers had been in the marketplace for a number of years. In theory, the existing software can be reused for new generations of the hardware, but in practice there were timing and concurrency issues. Therefore they also wanted to know whether the ease of maintenance of the driver could be improved.

To learn more about the capabilities of the ASD:Suite and its underlying technology, the group carried out a joint project with Verum in which ASD was applied to model the behaviour of the device driver. The main objectives of the case study were to:

• Verify the benefits of ASD. The most important benefits of interest to NXP were:
  o Delivering a higher quality product with the same or reduced amount of development effort.
  o Reducing costs of future maintenance.
  o Achieving at least equivalent performance to the existing product.
• Determine if ASD modelling helps in verifying the correctness of device driver software.
• Assess whether the ASD:Suite is a useful and practical tool for developing device drivers.
This paper presents an overview of how the ASD method and tool suite were applied to the development of a Linux version of the IP_3204 I2C driver software, hosted on NXP’s Energizer II chip, and draws conclusions based upon these objectives.

2.2 Current Situation

I2C devices are used particularly in larger systems on chips (SoCs). These are typically employed in products such as digital TVs, set-top boxes, automotive radios and multimedia equipment, as well as in more general purpose controllers for vending machines etc. Quality is a major concern, especially for the automotive market, where end user products are installed in high-end, expensive vehicles.
The I2C driver software had been ported and updated many times, both to keep pace with the evolving hardware platform and to cater for the requirements of new and upgraded target operating systems. The existing device driver had in the past suffered from timing and concurrency issues that caused problems in development and testing, largely stemming from an incomplete definition of the hardware and software interfaces.

3. Method

ASD:Suite had been demonstrated in other projects to shorten timescales and to reduce costs when used in the development and maintenance of large and complex systems. This was, however, the first time it had been applied to device driver software. The project covered three key activities:

• Modelling of the device driver software.
• Automatic generation of C code.
• Integration into the Linux kernel.

NXP provided the domain knowledge for the project. Verum did the actual modelling of the driver specification and design. TASS provided input for the C code generation process and assisted in the implementation of the Linux kernel Operating System Abstraction Layer (OSAL), including interrupt management.

3.1 Modelling the Hardware and Software Interfaces

The dynamic behaviour of the original driver was largely unspecified and unclear in the original design. The application of ASD clarified the interfaces, which resulted in a better understanding of the behaviour. For example, the software interface, which NXP calls the hardware API (HWAPI), was assumed to be stateless. However, ASD modelling revealed HWAPI state behaviour that had not previously been documented.

3.2 Improving the Design

The design of the original I2C driver software was different: more work was done in the interrupt service routine context, instead of in kernel threads. Using ASD it was possible to do the majority of the driver work in a kernel thread and as little as possible in the interrupt service routine.
There is a trade-off here: doing more in the kernel thread will improve overall system response but can reduce the performance of the driver itself. ASD will, however, ensure that the behaviour of the driver is complete and correct.

3.3 ASD Methodology

At all stages of a project, ASD technology is applied to the various parts of the device driver software, and can also be applied to the hardware. In summary the approach, visualized in Figure 2, is:

1. Using the ASD:Suite, gather and document requirements, capturing them in the ASD model.
Figure 2. The triangular relation between ASD Model, CSP and source code.
2. Balance the design for optimal performance (e.g. by introducing asynchronous operation).
3. Generate a formal model of system behaviour and verify the correctness of the design (before any code has been generated).
4. Generate the source code, guaranteed to be defect free and functionally equivalent to the formal model.

3.4 Key Benefits

When a software developer has gained familiarity with the ASD:Suite, it is possible to improve the efficiency of the development process by reducing dependency on testing, cutting development timescales by typically 30%. Maintenance becomes easier because the generated code is operating system independent, and changes are made to the model, not the code, from which the code can then be easily regenerated. ASD:Suite delivers high performance software with a small code footprint and low resource requirements, which is well suited to deeply embedded systems. Most importantly, the quality of the software improves, because code defects are eliminated.

4. Integrating ASD Generated Code with the Linux Kernel

In general the structure of an ASD application can be described as depicted in Figure 3 below. The ASD clients call ASD components, which communicate with their environment through calls to the ASD Run Time Environment (RTE). This RTE uses an OS Abstraction Layer (OSAL) to make calls to some underlying OS; this should provide easy porting from one OS to another. The OSAL comprises a set of calls to implement thread based scheduling and condition monitors, as well as time related functions used by the RTE. In the case of a normal application program, the OSAL is directly mapped onto POSIX threading primitives, mutexes and condition variables in order to provide the required scheduling and resource protection. Currently only the C code generator uses the OSAL interface.

In the case of the I2C driver for Linux there is more to do than just have a standard POSIX implementation for the ASD Run Time Environment, as this now has to operate in
kernel space and implement a device driver. This means that we need to have a standard driver interface (i.e. open, read, write, ioctl, close) on top of the ASD components, and that the ASD client needs to use the corresponding system calls to communicate with the ASD components from user space. Furthermore we need to talk to the hardware directly, which means that we need to deal with interrupts in a standard way. This is realized by an ASD foreign component. A foreign component is the ASD term for a component whose implementation is handwritten code derived from the ASD interface model.

4.1 Execution Semantics of ASD Components

In order to understand the requirements of the RTE, one needs to consider the execution semantics of ASD components.

4.1.1 ASD Terminology

We introduce the concept of durative and non-durative actions. A durative action takes place when a client synchronously invokes a method at a server. Until this method invocation returns, the client remains blocked and it cannot invoke other methods at the same or other servers. After the method has returned to the client, the server processes the durative part and eventually informs the client asynchronously through a callback. A non-durative action takes place when a client synchronously invokes a method at a server and remains blocked until the server has completely processed the request.

An ASD component can offer two kinds of interfaces: synchronous and asynchronous. The synchronous interfaces are client interfaces and used interfaces; the asynchronous interfaces are the callback interfaces. A component can offer several client interfaces, use several server interfaces, and offer as well as use several callback interfaces. In order to be able to deal with asynchronous interfaces there is a separate Deferred Procedure Call (DPC) thread per callback interface.
Making a call to a callback interface is implemented as posting a callback packet in a DPC queue serviced by a DPC thread, and notifying this DPC thread that there is work to do. The component code executes in the thread of the caller (it is just a method/subroutine call). Having the DPC threads execute as soon as they are notified may create a problem when there is shared data between the component thread and the DPC thread. As there always is shared data (state machine state), this data needs to be protected from concurrent access using some form of resource protection.
Figure 3. General overview of an ASD system.
For ASD this amounts to run-to-completion semantics. This means that the stimulus and all responses are guaranteed to have been processed completely, in the specified order, and all predicates updated, before the state transition is made. This implies that only one client may be granted access to the component at any time, and that DPC access can only occur after completion of a synchronous client call. In other words: an ASD component has monitor semantics. Furthermore, DPC threads must execute and empty their callback queues before new client access to the component is granted. This is realized by the Client-DPC synchronization.

Finally, special precautions must be taken for the following scenario: a client invokes a non-durative action on the interface of component A, where run-to-completion must be observed. If this invocation results in the invocation of a durative method, and the client can only return when a callback for the durative action is invoked, we would get into trouble if the DPC server thread were to block on the synchronization between client and server thread. In order to prevent this scenario there is a conditional wait after the release of the client-DPC mutex. When the client leaves the component state machine it needs to pass the conditional wait, which is released by the DPC server thread after it has finished processing the expected callback.

The following pseudo code enforces the client side sequence of events with the minimum amount of thread switching, and is depicted in Figure 4:

    Get ClientMutex
    Get ClientDPCMutex
    CallProcessing
    Release ClientDPCMutex
    ConditionalWait (DPC callback)
    Release ClientMutex
[Figure 4 diagram: a client call enters component A through the client synchronisation and then the client-DPC synchronisation (the latter needed only if callbacks are used) before reaching the state machine; a DPC server thread services the DPC queue, and the used interface connects component A to component B.]
Figure 4. Internal structure of an ASD component.
The DPC thread, shown inside the dotted area, executes the following pseudo code:

    while (true) {
        WaitEvent(wakeup, timeout)
        Get ClientDPCMutex
        CallProcessing
        Release ClientDPCMutex
        Signal DPC callback
    }
4.2 Structure of an ASD Linux Device Driver

When using an ASD component as a Linux device driver, some glue code is necessary. The standard driver interface must be offered to the Linux kernel in order that the driver functions may be called from elsewhere. Also, the implementation of the ASD execution semantics is not so straightforward, because we now need to consider hardware interrupts from the device we are controlling. As ASD cannot directly deal with interrupts, we need a way to convert hardware interrupts into events that ASD can cope with. In effect we need an OS Abstraction Layer implementation for Linux kernel space. This means that instead of user level threads we now need kernel level threads, and we need to decouple interrupts from the rest of the processing.

To this end we use a kernel FIFO, into which either the interrupt service routine or an event handler writes requests for the kernel level event handler. The kernel FIFO ensures that posting messages to and retrieving messages from the FIFO is interrupt safe. The first level event handler is implemented as a kernel work queue function scheduled by the interrupt handler, which reads messages from the kernel FIFO and then puts messages in a message queue serviced by a kernel DPC thread. This first level event handler is not allowed to block. The kernel DPC thread, signalled by the first level event handler, is allowed to block. Figure 5 explains this in more detail.
    /* interrupt service routine */
    i2c_isr() {
        event = deal_with_IRQ();
        kfifo_put(event);
        queue_work(interrupt_wq, &i2c_event_work);
    }

    /* workqueue function */
    i2c_event_handler() {
        while (msg_in_kfifo()) {
            kfifo_read(…, &intdata, …);
            /* put msg in DPC queue and schedule DPC thread */
            schedule_DPC_callback(&int_data);
        }
    }

Figure 5. Connecting interrupts to ASD callbacks.
Figure 6. Structure of the ASD device driver.
Figure 6 depicts the relation between kernel code, OSAL, ASD generated code and handwritten code. Starting from the upper layer downwards, we find a standard implementation of a device structure initialisation in the dev module, offering the open, close, read, write and ioctl calls. The I2C core module is also standard, calling the functions i2c_add_adapter, i2c_algo_control and i2c_algo_transfer implemented in the adapter file myc.c. This file makes the connection between the driver entry points and the ASD generated I2C component code.

In order that the component can be configured (e.g. set the correct I2C bus speed), there needs to be a way to hand configuration parameters down to a configuration function in the ASD file. ParamConverter converts between speed in bits/sec and the register values necessary to obtain the desired speed. I2cMessageIterator repeats the actions necessary for reading or writing a single byte until the requested number of bytes is reached.

For implementing the ASD execution semantics, the ASD C RunTime Environment is used. To fulfill its job, this RTE calls upon an OSAL implementation for Linux kernel space, which this project is the first to use. The OSAL uses the Linux kernel to implement the required scheduling, synchronisation and timing functionality. The I2CHwComponent implements the interface to the I2C hardware, supported by a Hardware Abstraction Layer (vHAL) implementation already available for this component. This hardware component implementation is based on an interface model derived from the IP hardware software interface documentation. This interface model of the hardware is used to verify the driver. The difficult bit is, of course, to make the model behave exactly the same as the hardware.
5. Results

NXP measured whether two of its key objectives had been met. Following completion of the project, they carried out:

1. Extensive stress testing, to check the stability and robustness of the product, and that the functionality was correct.
2. Performance analysis, in terms of speed of operation and footprint.

The results of these investigations are as follows. The code and data sizes of the original handwritten code can be seen in Table 1.

Table 1. Code and data size of original NXP driver code.

    Original code    text     data    bss    total
    built-in.o       12468    592     16     13076
This handwritten code is directly linked into the kernel. In order to facilitate testing, the ASD driver was implemented as a loadable kernel module. Using C code generated from the ASD models by the C code generator (beta version), combined with handwritten code to interface to the I2C hardware, we get the results shown in Table 2.

Table 2. Code and data sizes for ASD generated C code + driver + ASD OSAL.

    Type of code                              text     data    bss    total
    Handwritten ASD runtime lib incl. OSAL    4824     0       0      4824
    Handwritten code                          4876     612     864    6352
    ASD generated code                        12048    40      0      12088
    Total code in mymodule.o                  21748    652     864    23264
    Final kernel module for Linux mymodule.ko 21840    920     864    23624
We can see from these tables that the difference between the code sizes is about 10 Kbytes. This difference is constant, since the implementation of the OSAL and ASD RTE does not depend on the driver. From inspection of the handwritten I2CHwComponent it is to be expected that there is room for optimization, which could make its code size significantly smaller. Code size optimization was, however, not the goal of this project; it will be considered for further work.

Initial benchmarking has also shown that the performance of the code is acceptable, only minimally slower than the handwritten code. This is depicted in Table 3 below.

Table 3. Comparison of execution times.

    Execution time       Handwritten old driver    ASD generated code + OSAL
    Send of 2 bytes      380 microseconds          386 microseconds
    Time in interrupt    60 microseconds           20 microseconds
Several remarks can be made about these results. First, ASD components in general provide more functionality: they also capture “bad weather” behaviour that is not always captured, or not correctly captured, in conventional designs. In many cases this results in a small increase in code size. Second, the way the C code is generated from the model is very straightforward; work is underway to optimize the generated code so that it will be smaller. Even now, the larger code size is acceptable for NXP. Third, the time spent in interrupt context is higher in the existing handwritten case because more is done in interrupt context than in the ASD case, resulting in slightly faster performance. The ASD code blocks interrupts for a shorter time than the existing handwritten device driver code does, resulting in better overall system response.

Despite these positive findings, there are still some concerns:

• The ASD driver currently implements less functionality than the original driver (no multi-master, no chaining). Adding this functionality will have more impact on code size.
• Initially it did not survive a 3 hour stress test.
During integration and test of the ASD driver, a number of flaws were discovered. Some were related to mistakes in the handwritten code and some to modelling errors due to misinterpretation of the hardware-software interface. The remaining stress-test stability issue was determined to be caused by unexpected responses from the I2C EEPROM used during the stress test. The driver did not expect the response from the EEPROM when writing to a bad block; with a fresh EEPROM there are no problems. Thus, the model needs to be enhanced to cope with this situation, which should be a comparatively simple exercise.

6. Conclusions

NXP believes ASD:Suite can also provide major benefits for developing defect-free device drivers. The structured way of capturing complete requirements and design enables its software developers to model system behaviour correctly, verify the model and automatically generate defect-free code. They are already modelling hardware, and are looking at opportunities to combine this with ASD software models, since the hardware-software interface and the (lack of) specification of the corresponding protocols remain a source of errors. This project has clearly shown that ASD modelling helps in developing and verifying deeply embedded software, and that using the C code generator is beneficial and practical for device drivers. The biggest advantage seen is the rigorous specification process that is enforced with ASD. Software designers are forced to think before they implement, and ASD helps them ensure a complete and correct specification. It was not possible in this particular case to make a direct comparison of development effort with and without ASD, but other studies have shown that using ASD:Suite can reduce development time and cost by around 30%. Additional benefits include much easier and less costly maintenance. NXP's own investigations have demonstrated the quality of the product and the performance of the generated C code.
Even where there are requirements with timing constraints that cannot be modelled using ASD, race conditions and deadlocks due to unexpected interleaving of activities are prevented by the model checker, and it is a major advantage for developers to be able to perform manual timing checks on guaranteed defect-free code. Because the code generator produces verifiably correct code, the number of test cases developed and needed to run to gain confidence in the final product was considerably smaller than it would have been using a conventional approach to software development. The model checker revealed that there were more than 700,000 unique execution scenarios for the device driver. Without ASD, it would have required over 700,000 test cases to thoroughly test the software; thus a major reduction in testing effort was achieved.

7. Future Work

Some thoughts have been expressed as to whether the current OSAL interface models the ASD principles in the best way. Viewing an ASD component as a separate process, and using channels to implement interfaces, could be a more appropriate model for more modern OSes (QNX, OSEK, (μ-)velOSity, Integrity®), which offer higher-level message passing primitives. It would also make the system scalable and offer a more efficient implementation under this kind of OS. For OSes that do not offer message passing primitives, mutexes and condition variables can then be used to implement the required functionality. This thinking is, however, still conceptual, and not in the context of this NXP project.
Communicating Process Architectures 2009 P.H. Welch et al. (Eds.) IOS Press, 2009 © 2009 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-60750-065-0-117
Mobile Escape Analysis for occam-pi

Frederick R.M. BARNES
School of Computing, University of Kent, Canterbury, Kent, CT2 7NF, England
[email protected]

Abstract. Escape analysis is the process of discovering boundaries of dynamically allocated objects in programming languages. For object-oriented languages such as C++ and Java, this analysis leads to an understanding of which program objects interact directly, as well as what objects hold references to other objects. Such information can be used to help verify the correctness of an implementation with respect to its design, or provide information to a run-time system about which objects can be allocated on the stack (because they do not "escape" the method in which they are declared). For existing object-oriented languages, this analysis is typically made difficult by aliasing endemic to the language, and is further complicated by inheritance and polymorphism. In contrast, the occam-π programming language is a process-oriented language, with systems built from layered networks of communicating concurrent processes. The language has a strong relationship with the CSP process algebra, which can be used to reason formally about the correctness of occam-π programs. This paper presents early work on a compositional escape analysis technique for mobiles in the occam-π programming language, in a style not dissimilar to existing CSP analyses. The primary aim is to discover the boundaries of mobiles within the communication graph, and to determine whether or not they escape any particular process or network of processes. The technique is demonstrated by analysing some typical occam-π processes and networks, giving a formal understanding of their mobile escape behaviour.

Keywords. occam-pi, escape analysis, concurrency, CSP
Introduction

The occam-π programming language [1] is a highly concurrent process-oriented language, derived from classical occam [2], in which systems are built from layered networks of communicating processes. The semantics of classical occam are based largely on those of Hoare's Communicating Sequential Processes (CSP) [3], an algebra that can be used to reason about the concurrent behaviour of occam programs [4,5]. To occam, occam-π adds new mechanisms and language constructs for data, channel and process mobility, inspired by Milner's π-calculus [6]. In addition, occam-π offers a wealth of other features that allow the construction of dynamic and evolving software systems [7]. Some of these extensions, such as dynamic process creation, mobile barriers and channel-bundles, have already had CSP semantics defined for them [8,9,10], providing a basis for formal reasoning about them. These semantics are sufficient for reasoning about most occam-π programs in terms of interactions between concurrent components, typically to guarantee the absence of deadlock, or refinement of a specification. However, these semantics do not adequately deal with escape analysis of the various mobile types, i.e. knowing in advance the range of movement of mobiles between processes and process networks. The escape analysis information for an individual process or network of processes is useful in several ways:
118
F.R.M. Barnes / Mobile Escape Analysis for occam-pi
• For checking design-level properties of a system, e.g. ensuring that private mobile data in one part of a system does not escape.
• For the implementation, as it describes the components tightly coupled by mobile communication — relevant in shared-memory systems, where pointers are communicated between processes, and for the breakdown of concurrent systems in distributed execution.

The remainder of this paper describes an additional mobility analysis for occam-π programs, in a style similar to the well-known traces, failures and divergences analyses of CSP [11]. Section 1 provides a brief overview of occam-π and its mobility mechanisms, in addition to current analysis techniques for occam-π programs. Section 2 describes the additions for mobile escape analysis, in particular, a new mobility model. Section 3 describes how mobile escape analysis is performed for occam-π program code, followed by initial applications of this to occam-π systems in section 4. Related research is discussed in section 5, with conclusions and consideration of future work in section 6.

1. occam-π and Formal Analysis

The occam-π language provides a natural expression for concurrent program implementation, based on a communicating processes model as described by CSP. Whole systems are built from layered networks of communicating processes, which interact through a variety of synchronisation and communication mechanisms. The primary mechanism for process interaction is through channel communication, where two processes synchronise (with the semantics of CSP events), and communicate data. The occam-π "BARRIER" type provides synchronisation between any number of processes, but allows no communication (although barriers can be used to provide safe access to shared data [12]). The barrier type is roughly equivalent to the general CSP event, though our implementation does not support interleaving — synchronisation between subsets of enrolled processes.
There are four distinct groups of mobile types in the occam-π language, which cover all of the occam-π mobility extensions. These are mobile data, mobile channel-ends, mobile processes and mobile barriers. The operational semantics of these vary depending on the type of mobile (described below). Mobile variables, of all mobile types, are implemented primarily as pointers to dynamically allocated memory. To avoid the need for complex garbage collection (GC), strict aliasing rules are applied. For all mobile types, routines exist in the run-time system that allow these to be manipulated safely, including allocation, release, input, output, assignment and duplication.

1.1. Operational Semantics of Mobile Types

Mobile data exists largely for performance reasons. Ordinarily, data is communicated over occam-π channels using a copying semantics — i.e. the outputting process keeps its original data unchanged, and the inputting process receives a copy (overwriting a local variable or parameter). With large data (e.g. 100 KiB or more), the cost of this copy becomes significant compared with the cost of the synchronisation. With mobile data, only a reference to the actual data is ever copied — a small fixed overhead [13]. However, in order to maintain the aliasing laws of occam (and to avoid parallel race-hazards on shared data), the outputting process must lose the data it is sending — i.e. it is moved to the receiving process. A "CLONE" operator exists for mobile data that creates a copy, for cases where the outputting process needs to retain the data after the output.
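The move-versus-clone behaviour described above can be sketched in plain Python (the names and structure here are our own illustration, not part of the occam-π runtime): a plain send moves the reference out of the sender, while a CLONE-style send leaves the sender's copy intact.

```python
# A sketch (assumed names) of occam-pi mobile-data movement semantics:
# sending a mobile moves it out of the sender unless explicitly cloned.

import copy

class MobileVar:
    """A variable slot holding a mobile, or nothing (undefined)."""
    def __init__(self, data=None):
        self.data = data

def send(var, clone=False):
    """Return the payload for communication; a plain send empties the slot."""
    if clone:
        return copy.deepcopy(var.data)    # CLONE: sender keeps its copy
    payload, var.data = var.data, None    # move: sender loses the mobile
    return payload

x = MobileVar([1, 2, 3])
moved = send(x)
print(moved, x.data)      # [1, 2, 3] None

y = MobileVar([4, 5])
cloned = send(y, clone=True)
print(cloned, y.data)     # [4, 5] [4, 5]
```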
Mobile barriers allow synchronisation between arbitrary numbers of parallel processes. This has uses in a variety of applications, such as the simulation of complex systems [14], where barriers can be used to protect access to shared data (using a phased access pattern of global read then local write). When output by a process, a reference to a mobile barrier is moved, unless it is explicitly cloned, in which case the receiving process is enrolled on the barrier before the communication completes.

Mobile channel-ends refer to the end-points of mobile channel bundles. These are structured types that incorporate a number of ordinary channels. Unlike ordinary channels, however, these mobile channel-ends may be moved between processes — dynamically restructuring the process network. Mobile channel-ends may be shared or unshared. Unshared ends are always moved on output; shared channel-ends are always cloned on output. Communication on the individual channels inside a shared channel-end must be done within a "CLAIM" block, to ensure mutually exclusive access to those channels.

Mobile processes provide a mechanism for process mobility in occam-π [1]. Mobile processes are either active, meaning that they are connected to an environment and are running (or waiting for an event), or inactive, meaning that they are disconnected from any environment and are free to be moved between processes. Like mobile data, there is no concept of a shared mobile process, though a mobile process may contain other mobiles (shared and unshared) as part of its internal state.

The rules for mobile assignment follow those for communication — in line with the existing laws of occam. For example, assuming "x" and "y" are integer ("INT") variables, the two following fragments of code are semantically equivalent:

    CHAN INT c:
    PAR
      c ! y
      c ? x

    ≡    x := y
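The equivalence above can be sketched with Python threads (a rough illustration under assumed names; a one-slot `queue.Queue` is a buffered stand-in for occam's synchronous rendezvous, not a faithful model of it): a parallel output and input over a channel has the same net effect as the assignment.

```python
# A sketch of "PAR c!y / c?x" having the same net effect as "x := y".

import queue
import threading

def par_comm(y):
    c = queue.Queue(maxsize=1)   # stand-in for CHAN INT c
    result = {}

    def sender():                # c ! y
        c.put(y)

    def receiver():              # c ? x
        result['x'] = c.get()

    threads = [threading.Thread(target=sender),
               threading.Thread(target=receiver)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result['x']

print(par_comm(42))   # 42, the same net effect as x := y
```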
This rule must be preserved when dealing with mobiles, whose references are either moved or duplicated, depending on the mobile type used. The semantics of communication are also used when passing mobile parameters to dynamically created (forked) processes [15] — renaming semantics are used for ordinary procedure calls.

1.2. Analysis of occam-pi Programs

Starting with an occam-π process, it is moderately straightforward to construct a CSP expression that captures the process's behaviour [4,5]. Figure 1 shows the traditional "id" process and its implementation, that acts as a one-place buffer within a process network.

    PROC id (CHAN INT in?, out!)
      WHILE TRUE
        INT x:
        SEQ
          in ? x
          out ! x
    :

Figure 1. One place buffer process ("id", reading from channel in? and writing to channel out!).
If the specification is for a single place buffer, this code represents the most basic implementation — all other implementations meeting the same specification are necessarily equivalent. The parameterised CSP equation for this process is simply:

    ID(in, out) = in → out → ID(in, out)
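Under the recursive equation, the finite traces of ID are just the prefixes of the infinite alternating sequence. A small Python sketch (our own illustration, not part of the paper) unrolls this mechanically:

```python
# A sketch of unrolling ID(in, out) = in -> out -> ID(in, out)
# into its finite traces, up to a given length bound.

def id_traces(bound):
    """Prefixes of the infinite alternating trace <in, out, in, ...>."""
    full = ['in' if i % 2 == 0 else 'out' for i in range(bound)]
    return [tuple(full[:n]) for n in range(bound + 1)]

print(id_traces(3))   # [(), ('in',), ('in', 'out'), ('in', 'out', 'in')]
```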
This captures the behaviour of the process (interaction with its environment by synchronisation on "in" and "out" alternately), but makes no statements about individual data values. CSP itself provides only limited support for describing the stateful data of a system. Where such reasoning is required, it would be preferable to use related algebras such as Circus [16] or CSP‖B [17]. Using existing and largely mechanical techniques, the traces, failures and divergences of this "ID" process can be obtained:

    traces ID = {⟨⟩, ⟨in⟩, ⟨in, out⟩, ⟨in, out, in⟩, ...}
    failures ID = {(⟨⟩, {out}), (⟨in⟩, {in}), (⟨in, out⟩, {out}), (⟨in, out, in⟩, {in}), ...}
    divergences ID = {}

As described in [11], the traces of a process are the sequences of events that it may perform. For the ID process, this is ultimately an infinite trace containing "in" and "out" alternately. The failures of a process describe under what conditions a process will deadlock (behave as STOP). These are pairs of traces and event-sets, e.g. (X, E), which state that if a process has performed the trace X and only the events E are offered, then it will deadlock. For example, the first failure for the ID process states that if the process has not performed any externally visible events, and it is only offered "out", then it will deadlock — because the process is only waiting for "in". The divergences of a process are similar to failures, except that these describe the conditions under which a process will livelock (behave as div). The ID process is divergence free.

2. Mobility Analysis

The primary purpose of the extra analysis is to track the escape of mobile items from processes. With respect to mobile items, processes can:

• create new mobile items;
• transport existing mobiles through their interfaces; and
• destroy mobile items.

Unlike traces, failures and divergences, the mobility of a process cannot be derived from a CSP expression of an occam-π process alone — it requires either the original code from which we would generate a CSP expression, or an augmented version of CSP that provides a more detailed representation of program behaviour, specifically the mobile operations listed above. The remainder of this section describes the representation (syntax) used for mobility sequences, and some simple operations on these.

2.1. Representation

The mobility of a process is defined as a set of sequences of tagged events, where the events involved represent channels in the process's environment. For the non-mobile "id" process discussed in section 1.2, this would simply be the empty set:

    mobility ID = {}
For a version of the "id" process that transports mobile data items:

    mobility MID = {⟨in?^a, out!^a⟩}

The name "a" introduced in the mobility specification has scope across the whole set of sequences (though in this case there is only a single sequence) and indicates that the mobile data received from "in" is the same as that output on "out". The direction (input or output) is relevant, since escape is asymmetric. Processes that create or destroy mobiles instead of transporting them are defined in similar ways. The syntax for representing and manipulating mobility sequences borrows heavily from CSP [3,11], specifically the syntax associated with traces.

2.1.1. Shared Mobiles

For unshared mobile items, simple mobility sequences will have at most two items¹, reflecting the fact that a process acquires a mobile and then loses it — and therefore always in the order of an input followed by an output. For shared mobile items, mobility sequences may contain an arbitrary number of outputs, as a process can duplicate references to that mobile. Where there is more than one output, the order is unimportant — knowing that the mobile escapes is sufficient. Shared mobiles are indicated explicitly — decorated with a "+". For example, a version of the "id" process that transports shared mobiles has the model:

    mobility SMID = {⟨in?^a+, out!^a+⟩}

2.1.2. Client and Server Channel Ends

As described in section 1.1, mobile channel bundles are represented in code as pairs of connected ends, termed client and server. In practice these refer to the same mobile item, but for the purpose of analysis we distinguish the individual ends — e.g. for some mobile channel bundle "a", we use "a" for the client-end and "ā" for the server-end. A version of "id" that transports unshared server-ends of a particular channel-type would have the mobility model:

    mobility USMID = {⟨in?^ā, out!^ā⟩}

These are slightly different from other mobiles in that they can appear as both superscripts (mobile items) and channel-names (carrying other mobile items). Recursive mobile channel-end structures can also carry themselves, expressed as, e.g. ⟨a!^a⟩. Where there are multiple channels inside a mobile channel-end, the individual channels can be referred to by their index, e.g. ⟨a[0]?^x⟩, ⟨a[1]!^a⟩, to make clear which particular channel (for communication) is involved.

2.1.3. Undefinedness

In certain situations, that are strictly program errors, there is the potential for undefined mobile items to escape a process. Such an undefined mobile cannot be used in any meaningful way, but should be treated formally. A process that declares a mobile and immediately outputs it undefined, for example, would have the mobility model:

    mobility BAD = {⟨out!^γ⟩}

The absence of such things can be used to prove that a process, or process network, does not generate any undefined mobiles.

¹ Higher order operations, e.g. communicating channels over channels, can produce mobility sequences containing more than two items — see section 3.7.
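One possible machine encoding of this representation can be sketched in Python (the encoding is our own, not the paper's): an action is a (channel, direction, name) triple, a sequence is a tuple of actions, and a mobility model is a set of sequences, with shared mobiles marked by a '+' suffix on the bound name.

```python
# A sketch of encoding mobility models as Python data.

MID  = {(('in', '?', 'a'), ('out', '!', 'a'))}     # mobile-data id
SMID = {(('in', '?', 'a+'), ('out', '!', 'a+'))}   # shared-mobile id
BAD  = {(('out', '!', 'gamma'),)}                  # escaping undefined mobile

def names(mobility):
    """All bound mobile names appearing in a mobility set."""
    return {name for seq in mobility for (_, _, name) in seq}

def unshared_wellformed(seq):
    """Unshared sequences have at most two actions: an input then an output."""
    dirs = [d for (_, d, name) in seq if not name.endswith('+')]
    return dirs in ([], ['?'], ['!'], ['?', '!'])

print(names(MID))                                        # {'a'}
print(all(unshared_wellformed(s) for s in MID | BAD))    # True
```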
2.1.4. Alphabets

As is standard in CSP, we use sigma (Σ) to refer to the set of names on which a process can communicate. For mobility sequences, this can be divided into output channels (Σ!) and input channels (Σ?), such that Σ = Σ! ∪ Σ?. Ordinary mobile items (data, barriers) are not part of this alphabet; mobile channel-ends, however, are. The various channels that are in the alphabet of an occam-π process can also be grouped according to their type: Σt, where t is any valid occam-π protocol and T is the set of available protocols, such that t ∈ T. Following on, Σt = Σ!t ∪ Σ?t, and ∀ t : T · Σt ⊆ Σ. For referring to all channels that carry shared mobiles we have Σ+, with Σ+ = Σ!+ ∪ Σ?+.

2.2. Operations on Mobility Sequences

For convenience, the following operations are defined for manipulating mobility sequences. To illustrate these, the name S refers to a set of mobility sequences, S = {R1, R2, ...}, each of which is a sequence of mobile actions, R = ⟨X1, X2, ...⟩. Each mobile action is either an input, e.g. X1 = C?^x, or an output, X2 = D!^v.

2.2.1. Concatenation

For joining mobility sequences:

    ⟨X1, X2, ...⟩ ˆ ⟨Y1, Y2, ...⟩ = ⟨X1, X2, ..., Y1, Y2, ...⟩

2.2.2. Channel Restriction

Used to remove named channels from mobility sequences:

    ⟨X1, C!^x, ...⟩ − {C} = ⟨X1, ...⟩

Note that this is not quite the same as hiding, the details of which are described later.
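The two operations above are straightforward to prototype (a sketch under the same assumed encoding used earlier: actions as (channel, direction, name) triples):

```python
# A sketch of concatenation and channel restriction on mobility sequences.

def cat(r1, r2):
    """Concatenation: <X1, ...> ^ <Y1, ...> = <X1, ..., Y1, ...>."""
    return r1 + r2

def restrict(r, channels):
    """Channel restriction: remove every action on a named channel."""
    return tuple(act for act in r if act[0] not in channels)

r = (('in', '?', 'a'), ('c', '!', 'a'))
print(cat(r, (('out', '!', 'b'),)))   # three-action sequence
print(restrict(r, {'c'}))             # (('in', '?', 'a'),)
```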
3. Analysing occam-pi for Mobility

This section describes the specifics of extracting mobile escape information for occam-π processes. Where appropriate, the semantics of these in terms of CSP operators are given. A refinement relation over mobility sets is also considered.

3.1. Primitive Processes

The two primitive CSP processes STOP and SKIP are expressed in occam-π using "STOP" and "SKIP" respectively. Although "STOP" is often not used explicitly, it is implicit in certain occam-π constructs — for example, in an "IF" structure, if none of the conditions evaluate to true, or in an "ALT" with no enabled guards. Both SKIP and STOP have empty mobility models. Divergence and chaos, for which there is no exact occam-π equivalent, have undefined though legal mobility behaviours — and are able to do anything that an occam-π process might.
    mobility SKIP = mobility STOP = {}
    mobility div = mobility CHAOS =
        {⟨C!^a⟩ | C ∈ Σ!} ∪ {⟨D?^x⟩ | D ∈ Σ?} ∪
        {⟨C?^v, D!^v⟩ | ∀ t : T · (C, D) ∈ Σ?t × Σ!t}
The models of divergence and chaos specify that the process may output defined mobiles on any of its output channels, consume mobiles from any of its input channels, and forward mobiles from any of its input channels to any of its output channels (where the types are compatible). However, neither divergence nor chaos will generate (and output) undefined mobiles, but may forward undefined mobiles if these were ever received.

3.2. Input, Output and Assignment

Input and output are the basic building blocks of mobile escape in occam-π — they provide the means by which mobile items are moved. For example, a process that generates and outputs a mobile (which escapes):

    PROC P (CHAN MOBILE THING out!)
      MOBILE THING x:
      SEQ
        ...  initialise ‘x’
        out ! x
    :

    mobility P = {⟨out!^x⟩}

Correspondingly, a process that consumes a mobile:

    PROC Q (CHAN MOBILE THING in?)
      MOBILE THING y:
      SEQ
        in ? y
        ...  use ‘y’
    :

    mobility Q = {⟨in?^y⟩}

A similar logic applies to assignment, based on the earlier equivalence with communication. For example:

    PROC R (CHAN MOBILE THING in?, out!)
      MOBILE THING v, w:
      SEQ
        in ? v
        w := v
        out ! w
    :

    mobility R = {⟨in?^v, Lc!^v⟩, ⟨Lc?^w, out!^w⟩} \ {Lc}

The local channel-name Lc comes from the earlier model for assignment (as a communication between two parallel processes). The semantics for parallelism and hiding are described in the following sections. A compiler does not need to model assignment directly in this manner, however — it can track the movement of mobiles between local variables itself, and generate simpler (but equivalent) mobility sequences. For the above process "R":

    mobility R = {⟨in?^u, out!^u⟩}
3.3. Sequential Composition

Sequential composition provides one mechanism by which a mobile received on one channel can escape on another. In the case of the "id" process, whose mobility model is intuitively obvious (but best determined automatically by a compiler or other tool):

    SEQ
      in ? v
      out ! v

    mobility ID = {⟨in?^v, out!^v⟩}
In general, the mobility model for sequential processes, i.e. mobility(P ; Q), is formed by combining input sequences from mobility P with output sequences from mobility Q, matched by the particular mobile variable input or output. When combining processes in this and other ways, the individual variables representing mobile items may need to be renamed to avoid unintentional capture.

3.4. Choice

Programs may make choices either internally (e.g. with "IF" and "CASE") or externally (with an "ALT" or "PRI ALT"). The rules for internal and external choice are straightforward — simply the union of the sets representing the individual choice branches. For example:

    PROC plex.data (CHAN MOBILE THING in0?, in1?, out!)
      WHILE TRUE
        MOBILE THING v:
        ALT
          in0 ? v
            out ! v
          in1 ? v
            out ! v
    :

    mobility PD = {⟨in0?^a, out!^a⟩, ⟨in1?^b, out!^b⟩}

In general:

    mobility (P ⊓ Q) = (mobility P) ∪ (mobility Q)
    mobility (P □ Q) = (mobility P) ∪ (mobility Q)

3.5. Interleaving and Parallelism

Interleaving and parallelism, both specified by "PAR" in occam-π, have straightforward mobility models. For example, a "delta" process for SHARED mobile channel-ends, that performs its outputs in parallel:

    PROC chan.delta (CHAN SHARED CT.FOO! in?, out0!, out1!)
      WHILE TRUE
        SHARED CT.FOO! x:
        SEQ
          in ? x
          PAR
            out0 ! CLONE x
            out1 ! CLONE x
    :

    mobility CD = {⟨in?^a+, out0!^a+⟩, ⟨in?^b+, out1!^b+⟩}
This captures the fact that a mobile input on the “in” channel escapes to both the output channels, indistinguishable from a non-interleaving process that makes an internal choice about where to send the mobile. In general:
    mobility (P ∥ Q) = (mobility P) ∪ (mobility Q)
Interleaving (e.g. P ||| Q) is a special form of the more general alphabetised parallelism, and is therefore not of huge concern for mobile escape analysis.

3.6. Hiding

Hiding is used to model the declaration and scope of channels in occam-π. In particular, it is also responsible for collapsing mobility structures — by removing channel names from them. Where occam-π programs are concerned, channel declarations typically accompany "PAR" structures. For example:

    PROC network (CHAN MOBILE THING in?, out!)
      CHAN MOBILE THING c:
      PAR
        thing.id (in?, c!)
        thing.id (c?, out!)
    :

    mobility NET = {⟨in?^a, c!^a⟩, ⟨c?^b, out!^b⟩} \ {c}
This reduces to the set:

    mobility NET = {⟨in?^a, out!^a⟩}

The general rule for which is:

    mobility (P \ x) =
        {M ˆ N[α/β] | (M ˆ ⟨x!^α⟩, ⟨x?^β⟩ ˆ N) ∈ mobility P × mobility P}
      ∪ (mobility P) − ({F ˆ ⟨x!^α⟩ | F ˆ ⟨x!^α⟩ ∈ mobility P} ∪ {⟨x?^β⟩ ˆ G | ⟨x?^β⟩ ˆ G ∈ mobility P})
      ∪ {H | (H ˆ ⟨x!^α⟩) ∈ mobility P ∧ (⟨x?^β⟩ ˆ I) ∉ mobility P ∧ H ≠ ⟨⟩}
      ∪ {J | (⟨x?^β⟩ ˆ J) ∈ mobility P ∧ (J ˆ ⟨x!^α⟩) ∉ mobility P ∧ J ≠ ⟨⟩}

The above specifies the joining of sequences that end with outputs on the channel x with sequences that begin with inputs on the channel x. The matching sequences are removed from the resulting set; however, the starts of unmatched output sequences and the ends of unmatched input sequences are preserved.

3.7. Higher Order Communication

So far, only the transport of mobiles over static process networks has been considered. However, in many real applications, mobile channels will be used to set up connections between processes, which are later used to transport other mobiles (including other mobile channel-ends). Assuming that the "CT.FOO" channel-type contains a single channel named "c", itself carrying mobiles, we might write:
    PROC high.order.cli (CHAN CT.FOO! in?)
      CT.FOO! cli:
      MOBILE THING v:
      SEQ
        in ? cli
        ...  initialise ‘v’
        cli[c] ! v
    :

    mobility HOC = {⟨in?^a, a!^b⟩}

This captures the fact that the process emits mobiles on the bound name "a", which it received from its "in" channel. The type "CT.FOO!" specifies the client-end of the mobile channel². A similar process for the server-end of the mobile channel could be:

    PROC high.order.svr (CHAN CT.FOO? in?)
      CT.FOO? svr:
      MOBILE THING x:
      SEQ
        in ? svr
        svr[c] ? x
        ...  use ‘x’
    :

    mobility HOS = {⟨in?^c̄, c̄?^d⟩}

Connecting these in parallel with a generator process (that generates a pair of connected channel-ends and outputs them), and renaming for parameter passing:

    PROC foo.generator (CHAN CT.FOO! c.out!, CHAN CT.FOO? s.out!)
      CT.FOO? svr:
      CT.FOO! cli:
      SEQ
        cli, svr := MOBILE CT.FOO
        PAR
          c.out ! cli
          s.out ! svr
    :

    mobility FG = {⟨c.out!^x⟩, ⟨s.out!^x̄⟩}

    CHAN CT.FOO! c:
    CHAN CT.FOO? s:
    PAR
      foo.generator (c!, s!)
      high.order.cli (c?)
      high.order.svr (s?)

    mobility = {⟨c!^x⟩, ⟨s!^x̄⟩, ⟨c?^a, a!^b⟩, ⟨s?^c̄, c̄?^d⟩} \ {c, s}
             = {⟨x!^b⟩, ⟨x̄?^d⟩}

This indicates a system in which a mobile is transferred internally, but never escapes. As such, we can hide the mobile channel event "x" (also "x̄"), giving an empty mobility set — concluding that no mobiles escape this small system, as we would have expected.

3.8. Mobility Refinement

The previous sections have illustrated a range of mobility sets for various processes and their compositions. Within CSP and related algebras is the concept of refinement, which operates on the traces, failures and divergences of processes, and can in general be used to test whether a particular implementation meets a given specification. In general, we write P ⊑ Q to mean that P is refined by Q, or that Q is more deterministic than P.

² The variable "cli" is a mobile channel bundle containing just one channel (named "c"), identified by a record subscript syntax: cli[c].
For mobile escape analysis, it is reasonable to suggest that there may be a related mobility refinement, whose definition is:

    P ⊑M Q  ≡  mobility Q ⊆ mobility P
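A sketch of how such a check might be prototyped (our own illustration, under the assumed (channel, direction, name) encoding): since the subset relation must hold up to renaming of bound mobile names, each sequence is first canonicalised by numbering names in order of first appearance.

```python
# A sketch of the proposed mobility refinement check: Q refines P when
# every mobility sequence of Q appears in P, up to renaming of bound names.

def canon(seq):
    """Rename bound names to n0, n1, ... in order of first appearance."""
    mapping = {}
    out = []
    for chan, d, name in seq:
        if name not in mapping:
            mapping[name] = 'n%d' % len(mapping)
        out.append((chan, d, mapping[name]))
    return tuple(out)

def refines(p, q):
    """True if Q 'contributes less to mobile escape' than P."""
    return {canon(s) for s in q} <= {canon(s) for s in p}

spec = {(('in', '?', 'a'), ('out', '!', 'a'))}
impl = {(('in', '?', 'z'), ('out', '!', 'z'))}
print(refines(spec, impl))                    # True: matches up to renaming
print(refines(impl, {(('in', '?', 'a'),)}))   # False: consuming isn't in spec
```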
The interpretation of this is that Q "contributes less to mobile escape" than P, and where the subset relation takes account of renaming within sets. This is not examined in detail here (an item for future work), but on initial inspection appears sensible — e.g. to test whether a specific implementation meets a general specification.

4. Application

As previously discussed, the aim of this analysis is to determine what mobiles (if any) escape a particular network of occam-π processes, and if so, how they escape with respect to that process network (i.e. on which input and output channels). Two examples of the technique are discussed here, one for static process networks and one for dynamically evolving process networks. The former is more typical of small-scale systems, such as those used in small (and memory limited) devices.

4.1. Static Process Networks

Figure 2 shows a network of parallel processes and the code that implements it. The individual components have the following mobile escape models:

    mobility delta  = {⟨in?^a, out0!^a⟩, ⟨in?^b, out1!^b⟩}
    mobility choice = {⟨in?^a, out0!^a⟩, ⟨in?^b, out1!^b⟩}
    mobility gen    = {⟨out!^a⟩}
    mobility plex   = {⟨in0?^a, out!^a⟩, ⟨in1?^b, out!^b⟩}
    mobility sink   = {⟨in0?^a⟩, ⟨in1?^b⟩}
    PROC net (CHAN MOBILE THING A?, B?, X!, Y!)
      CHAN MOBILE THING p, q, r, s:
      PAR
        delta (A?, X!, p!)
        choice (B?, q!, r!)
        gen (s!)
        plex (p?, q?, Y!)
        sink (r?, s?)
    :

Figure 2. Parallel process network (processes delta, choice, gen, plex and sink, connected by internal channels p, q, r and s).
When combined, with appropriate renaming for parameter passing (and to avoid unintentional capture), this gives the mobility set:

    mobility Net = {A?a, X!a, A?b, p!b, B?c, q!c, B?d, r!d, s!e,
                    p?f, Y!f, q?g, Y!g, r?h, s?h} \ {p, q, r, s}
Applying the rule for hiding to the channels p, q, r and s gives:

    \{p} → {A?a, X!a, A?b, Y!b, B?c, q!c, B?d, r!d, s!e, q?g, Y!g, r?h, s?h}
    \{q} → {A?a, X!a, A?b, Y!b, B?c, Y!c, B?d, r!d, s!e, r?h, s?h}
    \{r} → {A?a, X!a, A?b, Y!b, B?c, Y!c, B?d, s!e, s?h}
    \{s} → {A?a, X!a, A?b, Y!b, B?c, Y!c, B?d}

The resulting mobility analysis indicates that mobiles input on A escape through output on X and Y, and that inputs received on B either escape through Y or are consumed internally. The fact that certain mobility sequences are not present in the result provides more information: mobiles input on A are never discarded internally, and the network does not itself generate escaping mobiles.

4.2. Dynamic Process Networks

In dynamically evolving systems, RMoX in particular [18,19], connections are often established within a system for the sole purpose of establishing future connections. An example of this is an application process that connects to the VGA framebuffer (display) device via a series of other processes, then uses that new connection to exchange mobile data with the underlying device. Figure 3 shows a snapshot of connected graphics processes within a running RMoX system.
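A simplified version of this hiding step can be run as a join over escape sequences: a sequence ending with an output on the hidden channel is spliced onto each sequence beginning with an input on it, renaming the received variable, and fully cancelled pairs vanish. The sketch below is our own simplification (it ignores channel-bundle subscripts and end markers), and under those assumptions reproduces the result above:

```python
def hide(mob, ch):
    """Hide channel ch in a mobility set (frozenset of event-tuple sequences).

    Simplified join: a producer sequence ending in (ch, '!', v) is spliced
    onto each consumer sequence starting with (ch, '?', w), with w renamed
    to v; sequences that cancel completely disappear (consumed internally).
    """
    producers = {s for s in mob if s[-1][:2] == (ch, "!")}
    consumers = {s for s in mob if s[0][:2] == (ch, "?")}
    joined = set()
    for p in producers:
        v = p[-1][2]
        for c in consumers:
            w = c[0][2]
            tail = tuple((cn, d, v if x == w else x) for (cn, d, x) in c[1:])
            if p[:-1] + tail:                 # empty means fully cancelled
                joined.add(p[:-1] + tail)
    return frozenset((mob - producers - consumers) | joined)

# mobility Net from section 4.1, before hiding p, q, r and s:
net = frozenset({
    (("A", "?", "a"), ("X", "!", "a")),
    (("A", "?", "b"), ("p", "!", "b")),
    (("B", "?", "c"), ("q", "!", "c")),
    (("B", "?", "d"), ("r", "!", "d")),
    (("s", "!", "e"),),
    (("p", "?", "f"), ("Y", "!", "f")),
    (("q", "?", "g"), ("Y", "!", "g")),
    (("r", "?", "h"),),
    (("s", "?", "h"),),
})

for ch in "pqrs":
    net = hide(net, ch)

# Final set: mobiles on A escape via X or Y; on B via Y or consumed internally.
print(net == frozenset({
    (("A", "?", "a"), ("X", "!", "a")),
    (("A", "?", "b"), ("Y", "!", "b")),
    (("B", "?", "c"), ("Y", "!", "c")),
    (("B", "?", "d"),),
}))   # True
```

Note how the gen-to-sink sequence (s!e joined with s?h) cancels away entirely, while B?d survives as a singleton, recording the internal consumption.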
Figure 3. RMoX driver connectivity.
Escape analysis allows for certain optimisations in process networks such as these. If the compiler (and associated tools) can determine that mobile data generated in "vga" or "vga.fb" is not discarded internally, nor escapes through the processes "gfx.core" and "application", then it will be safe to pass the real framebuffer (video) memory around for rendering. Without the guarantees provided by this analysis, there is a danger that parts of the video memory could escape into the general memory pool — with odd and often undesirable consequences3.

Assuming that framebuffer memory originates and is consumed within "vga.fb", we have an occam-π process with the structure:

PROC vga.fb (CT.DRV? link)
  CT.GUI.FB! fb.cli:
  CT.GUI.FB? fb.svr:
  SEQ
    fb.cli, fb.svr := MOBILE CT.GUI.FB        -- create channel-bundle
    ...  other initialisation and declarations
    PAR
      WHILE TRUE
        link[in] ? CASE
          CT.DRV.R! ret:
          open.device; ret                    -- request to open device
            IF
              DEFINED fb.cli
                ret[out] ! device; fb.cli     -- return bundle client-end
              TRUE
                ret[out] ! device.busy
          ...  other cases
      PLACED MOBILE []BYTE framebuffer AT ...:
      WHILE TRUE
        fb.svr[in] ? CASE                     -- request from connected client
          get.buffer
            fb.svr[out] ! buffer; framebuffer -- outgoing framebuffer
          put.buffer; framebuffer             -- incoming framebuffer
            SKIP
:

3 Mapping process memory (typically a process's workspace) into video memory, or vice-versa, does provide an interesting way of visualising process behaviour in RMoX, however.
This has the mobility model:

    mobility VFB = {link?r, r̄!a, ā[1]!b, ā[0]?c}

The escape information here indicates that mobiles are generated and consumed at the server-end of the channel bundle ā, whilst the client-end of this bundle, a, escapes through another channel bundle r that the process receives from its link parameter.

Instead of going into detail for the other processes involved, which would require a significant amount of space, the generic forwarding and use of connections is considered.

4.2.1. Client Processes

The mechanism by which dynamic connections to device-drivers and suchlike are established involves sending the client-end of a return channel-bundle along with the request. A client process (e.g. "application" from figure 3) therefore typically has the structure:

PROC client (SHARED CT.DRV! to.drv)
  CT.DRV.R! r.cli:
  CT.DRV.R? r.svr:
  CT.GUI.FB! guilink:
  SEQ
    r.cli, r.svr := MOBILE CT.DRV.R      -- create response channel-bundle
    CLAIM to.drv
      to.drv[in] ! open.device; r.cli    -- send request
    r.svr[out] ? CASE                    -- wait for response
      device.busy
        ...  fail gracefully
      device; guilink
        ...  use 'guilink'
:
This has the mobility model:

    mobility CLI = {to.drv!e, ē?f} ∪ M

where M is the mobility model for the part of the process that uses the "guilink" connection to the underlying service, and will communicate directly on the individual channels within f. Connecting this client and the "vga.fb" process directly, with renaming for parameter passing, gives the following mobility set:

    {Ā?r, r̄!a, ā[1]!b, ā[0]?c, A!e, ē?f} ∪ M
Hiding the internal link A (and Ā) gives:

    {ē!a, ā[1]!b, ā[0]?c, ē?f} ∪ M
If we take a well-behaved client implementation for M — i.e. one that inputs a mobile (framebuffer) from the underlying driver, modifies it in some way and then returns it, without destroying or creating these (M = {f[1]?x, f[0]!x}) — we get:

    {ē!a, ā[1]!b, ā[0]?c, ē?f, f[1]?x, f[0]!x}

Subsequently hiding e, which represents the "CT.DRV.R" link, causes f to be renamed to a, giving the set:

    {ā[1]!b, ā[0]?c, a[1]?x, a[0]!x}

Logically speaking, and for this closed system, b and c must represent the same thing — in this case, mobile framebuffers. Thus we have a guarantee that mobiles generated within the "vga.fb" process are returned there, for this small system.

On the other hand, a less well-behaved client implementation for M could be one that occasionally loses one of the framebuffers received, instead of returning it (i.e. M = {f[1]?x, f[0]!x, f[1]?y}). This ultimately gives the mobility set:

    {ā[1]!b, ā[0]?c, a[1]?x, a[0]!x, a[1]?y}

As before, b and c must represent the same mobiles, so the only mobiles received back must have been those sent. However, the presence of the sequence a[1]?y indicates that framebuffers can be received and then discarded by this client.

Another badly behaved client implementation is one that generates mobiles and returns these as framebuffers, in addition to the normal behaviour, e.g. M = {f[1]?x, f[0]!x, f[0]!z}. This gives the resulting mobility set:

    {ā[1]!b, ā[0]?c, a[1]?x, a[0]!x, a[0]!z}
In this case, b and c do not necessarily represent the same mobiles — as while x can only be b, c can be either x (and therefore b) or z. Thus there is the possibility that mobiles are returned to the "vga.fb" driver that did not originate there.

4.2.2. Infrastructure

Within RMoX, such client and server processes are normally connected through a network of processes that route requests around the system. From figure 3, this includes the "driver.core", "service.core" and "kernel" processes.

In earlier versions of RMoX [19], both requests and their responses were routed through the infrastructure. This is no longer the case — requests now include, as part of the request,
a mobile channel-end that is used for the response. This is a cleaner approach in many respects and is more efficient in most cases. From the client's perspective, a little more work is involved when establishing connections, since the return channel-bundle must be allocated.

Most of the infrastructure components within RMoX consist of a single server-end channel-bundle on which requests are received, whose client-end is shared between multiple processes, and multiple client-ends connecting to other server processes such as "vga.fb" and other infrastructure components. A very general implementation of an infrastructure component is:

PROC route (CT.DRV? in, CT.DRV! out.this, SHARED CT.DRV! out.next)
  WHILE TRUE
    in[in] ? CASE
      CT.DRV.R! ret:
      open.device; ret
        IF
          request.for.this
            out.this[in] ! open.device; ret
          NOT invalid
            CLAIM out.next!
              out.next[in] ! open.device; ret
          TRUE
            ret[out] ! no.such.device
      ...  other cases
:
The mobility model of this process is:

    mobility Rt = {in?a, out.this!a, in?b, out.next!b, in?c}

The last component indicates that this routing process may discard the request (and the response channel-end) internally — after it has reported an error back on the response channel, of course. With the "route" process as it is, there would need to be an additional process at the end of this chain that responds to all connection requests with an error, e.g.:

PROC end.route (CT.DRV? in)
  WHILE TRUE
    in[in] ? CASE
      CT.DRV.R! ret:
      open.device; ret
        ret[out] ! no.such.device
      ...  other cases
:

This has the mobility model:

    mobility ERt = {in?x}
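The "renaming for parameter passing" used when composing these models is mechanical: substitute actual channel names for the formal parameters (and fresh variable names where needed to avoid capture). A minimal sketch, in the same illustrative tuple encoding as before (bars on server ends are not modelled here):

```python
def rename(mob, chan_map, var_map=None):
    """Rename formal channel names (and optionally variables) throughout a
    mobility set, modelled as a frozenset of (channel, dir, var) sequences."""
    var_map = var_map or {}
    return frozenset(
        tuple((chan_map.get(c, c), d, var_map.get(v, v)) for (c, d, v) in seq)
        for seq in mob
    )

# mobility Rt, with the process instantiated as route(A?, C!, B!):
rt = frozenset({
    (("in", "?", "a"), ("out.this", "!", "a")),
    (("in", "?", "b"), ("out.next", "!", "b")),
    (("in", "?", "c"),),
})
bound = rename(rt, {"in": "A", "out.this": "C", "out.next": "B"})

print((("A", "?", "a"), ("C", "!", "a")) in bound)   # True: forwarded to C
print((("A", "?", "c"),) in bound)                   # True: discarded request
```

The renamed singleton A?c is exactly the component that records the internally discarded request noted above.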
Combining one “route” process and one “end.route” process with the existing “vga.fb” and “client” processes produces the network shown in figure 4. This has the following mobility model:
    {C̄?r, r̄!a, ā[1]!b, ā[0]?c, A!e, ē?f, B̄?x, Ā?a, C!a, Ā?b, B!b, Ā?c} ∪ M
Figure 4. RMoX routing infrastructure.
Hiding the internal links A, B and C gives:

    \{A} → {C̄?r, r̄!a, ā[1]!b, ā[0]?c, ē?f, B̄?x, C!e, B!e} ∪ M
    \{B} → {C̄?r, r̄!a, ā[1]!b, ā[0]?c, ē?f, C!e} ∪ M
    \{C} → {ē!a, ā[1]!b, ā[0]?c, ē?f} ∪ M

This system has a mobile escape model identical to that of the earlier directly connected "client" and "vga.fb" system. As such, we can still be sure that framebuffer mobiles generated by "vga.fb" are returned there.

5. Related Research

The use of escape analysis for determining various properties of dynamic systems stems from the functional programming community. One use there is for determining which parts of an expression escape a particular function, and whether they can therefore be allocated on the stack (i.e. they are local to the function) [20]. More recently, escape analysis has been used in conjunction with object-oriented languages, such as Java [21]. Here it can be used to determine the boundaries of object references within the object graph, for the purposes of stack allocation and other garbage collector (GC) optimisations [22]. With the increasing use of multi-core and multi-processor systems, this type of analysis is also used to discover which objects are local to which threads (known as thread escape analysis), allowing a variety of optimisations [23].

While escape analysis for functional languages is generally well understood, it gets extremely complex for object-oriented languages such as C++ and Java. Features inherent to object-oriented languages, inheritance and polymorphism in particular, have a significant impact on formal reasoning. The number of objects typically involved also creates problems for automated analysis (state-space explosion). The escape analysis described here is more straightforward, but is sufficient for determining the particular properties identified earlier.
The compositional nature of occam-π and CSP helps significantly, allowing analysis to be done in a divide-and-conquer manner, or enabling analysis to be performed on a subset of processes within a system (as shown in section 4.2.2).

6. Conclusions and Future Work

This paper has presented a straightforward technique for mobile escape analysis in occam-π, and its application to various kinds of process network. The analysis provides for the checking of particular design-time properties of a system and can permit certain optimisations in the
F.R.M. Barnes / Mobile Escape Analysis for occam-pi
133
implementation. At the top level of a system, this escape analysis can also provide hints towards efficient distribution of the system across multiple nodes — by identifying those parts interconnected through mobile communication (and whose efficiency of implementation is greatly increased with shared memory). Although the work here has focused on occam-π, the techniques are applicable to other process-oriented languages and frameworks.

The semantic model for mobility presented here is not quite complete. Some of the formal rules for process composition have yet to be specified, though we have a good informal understanding of their operation. Another aspect yet to be fully considered is that of mobile processes. These can contain other mobiles as part of their state (within local variables), and as such warrant special treatment. The analysis techniques shown provide a very general model for mobile processes — in practice this either results in a larger state-space (where mobiles within mobile processes are tracked individually), or a loss in accuracy (e.g. treating a mobile process as CHAOS). Once a complete semantic model has been established, it can be checked for validity, and the concept of mobility refinement investigated thoroughly.

For the practical application of this work, the existing occam-π compiler needs to be modified to analyse and generate machine-readable representations of mobile escape. Some portion of this work is already in place, discussed briefly in [24], where the compiler has been extended to generate CSP-style behavioural models (in XML) of individual occam-π PROCs. The mobile escape information obtained will be included within these XML models, incorporating attributes such as type. A separate, but not overly complex, tool will be required to manipulate and check particular properties of these — e.g. that an application process does not discard or generate framebuffer mobiles (section 4.2).
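As an illustration of the kind of check such a tool might perform (our own hypothetical sketch, not the actual tool), the "does not discard or generate framebuffer mobiles" property of section 4.2 can be phrased as a predicate over a client's mobility model M, in the same tuple encoding used earlier:

```python
def well_behaved(M, bundle="f"):
    """Check a client model M (frozenset of (channel, dir, var) sequences):
    every sequence touching the framebuffer bundle must input one mobile and
    return that same mobile -- i.e. no discards and no generated mobiles."""
    for seq in M:
        evs = [e for e in seq if e[0].startswith(bundle + "[")]
        if evs and not (len(evs) == 2 and evs[0][1] == "?"
                        and evs[1][1] == "!" and evs[0][2] == evs[1][2]):
            return False
    return True

ok      = frozenset({(("f[1]", "?", "x"), ("f[0]", "!", "x"))})
discard = ok | {(("f[1]", "?", "y"),)}     # section 4.2: loses a framebuffer
forge   = ok | {(("f[0]", "!", "z"),)}     # section 4.2: returns a foreign one

print(well_behaved(ok), well_behaved(discard), well_behaved(forge))
# True False False
```

The three test models are exactly the well-behaved and two misbehaving clients analysed in section 4.2.1.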
How such information can be recorded and put to use for compiler and run-time optimisations is an issue for future work.

Acknowledgements

This work was funded by EPSRC grant EP/D061822/1. The author would like to thank the anonymous reviewers for their input on an earlier version of this work.

References

[1] P.H. Welch and F.R.M. Barnes. Communicating mobile processes: introducing occam-pi. In A.E. Abdallah, C.B. Jones, and J.W. Sanders, editors, 25 Years of CSP, volume 3525 of Lecture Notes in Computer Science, pages 175–210. Springer Verlag, April 2005.
[2] Inmos Limited. occam 2.1 Reference Manual. Technical report, Inmos Limited, May 1995. Available at: http://wotug.org/occam/.
[3] C.A.R. Hoare. Communicating Sequential Processes. Prentice-Hall, London, 1985. ISBN: 0-13-153271-5.
[4] M.H. Goldsmith, A.W. Roscoe, and B.G.O. Scott. Denotational Semantics for occam2, Part 1. In Transputer Communications, volume 1 (2), pages 65–91. Wiley and Sons Ltd., UK, November 1993.
[5] M.H. Goldsmith, A.W. Roscoe, and B.G.O. Scott. Denotational Semantics for occam2, Part 2. In Transputer Communications, volume 2 (1), pages 25–67. Wiley and Sons Ltd., UK, March 1994.
[6] R. Milner. Communicating and Mobile Systems: the Pi-Calculus. Cambridge University Press, 1999. ISBN: 0-52165-869-1.
[7] P.S. Andrews, A.T. Sampson, J.M. Bjørndalen, S. Stepney, J. Timmis, D.N. Warren, and P.H. Welch. Investigating patterns for the process-oriented modelling and simulation of space in complex systems. In S. Bullock, J. Noble, R. Watson, and M.A. Bedau, editors, Artificial Life XI: Proceedings of the Eleventh International Conference on the Simulation and Synthesis of Living Systems, pages 17–24. MIT Press, Cambridge, MA, 2008.
[8] Frederick R.M. Barnes. Dynamics and Pragmatics for High Performance Concurrency. PhD thesis, University of Kent, June 2003.
[9] P.H. Welch and F.R.M. Barnes. Mobile Barriers for occam-pi: Semantics, Implementation and Application. In J.F. Broenink, H.W. Roebbers, J.P.E. Sunter, P.H. Welch, and D.C. Wood, editors, Communicating Process Architectures 2005, volume 63 of Concurrent Systems Engineering Series, pages 289–316, Amsterdam, The Netherlands, September 2005. IOS Press. ISBN: 1-58603-561-4.
[10] P.H. Welch and F.R.M. Barnes. A CSP model for mobile channels. In Proceedings of Communicating Process Architectures 2008. IOS Press, September 2008.
[11] A.W. Roscoe. The Theory and Practice of Concurrency. Prentice Hall, 1997. ISBN: 0-13-674409-5.
[12] F.R.M. Barnes, P.H. Welch, and A.T. Sampson. Barrier synchronisations for occam-pi. In Hamid R. Arabnia, editor, Proceedings of PDPTA 2005, pages 173–179, Las Vegas, Nevada, USA, June 2005. CSREA Press.
[13] F.R.M. Barnes and P.H. Welch. Mobile Data, Dynamic Allocation and Zero Aliasing: an occam Experiment. In Alan Chalmers, Majid Mirmehdi, and Henk Muller, editors, Proceedings of Communicating Process Architectures 2001. IOS Press, September 2001.
[14] P.H. Welch, F.R.M. Barnes, and F.A.C. Polack. Communicating Complex Systems. In Michael G. Hinchey, editor, Proceedings of the 11th IEEE International Conference on Engineering of Complex Computer Systems (ICECCS-2006), pages 107–117, Stanford, California, August 2006. IEEE. ISBN: 0-7695-2530-X.
[15] F.R.M. Barnes and P.H. Welch. Prioritised dynamic communicating and mobile processes. IEE Proceedings – Software, 150(2):121–136, April 2003.
[16] J.C.P. Woodcock and A.L.C. Cavalcanti. The Semantics of Circus. In ZB 2002: Formal Specification and Development in Z and B, volume 2272 of Lecture Notes in Computer Science, pages 184–203. Springer-Verlag, 2002.
[17] S. Schneider and H. Treharne. Communicating B Machines. In ZB 2002: Formal Specification and Development in Z and B, volume 2272 of Lecture Notes in Computer Science, pages 251–258. Springer-Verlag, January 2002.
[18] F.R.M. Barnes, C.L. Jacobsen, and B. Vinter. RMoX: a Raw Metal occam Experiment. In J.F. Broenink and G.H. Hilderink, editors, Communicating Process Architectures 2003, WoTUG-26, Concurrent Systems Engineering, ISSN 1383-7575, pages 269–288, Amsterdam, The Netherlands, September 2003. IOS Press. ISBN: 1-58603-381-6.
[19] C.G. Ritson and F.R.M. Barnes. A Process Oriented Approach to USB Driver Development. In A.A. McEwan, S. Schneider, W. Ifill, and P.H. Welch, editors, Communicating Process Architectures 2007, volume 65 of Concurrent Systems Engineering Series, pages 323–338, Amsterdam, The Netherlands, July 2007. IOS Press. ISBN: 978-1-58603-767-3.
[20] Y.G. Park and B. Goldberg. Higher order escape analysis: Optimizing stack allocation in functional program implementations. In Proceedings of ESOP ’90, volume 432 of LNCS, pages 152–160. Springer-Verlag, 1990.
[21] B. Joy, J. Gosling, and G. Steele. The Java Language Specification. Addison-Wesley, 1996. ISBN: 0-20-163451-1.
[22] B. Blanchet. Escape analysis for Java(TM): Theory and practice. ACM Transactions on Programming Languages and Systems, 25(6):713–775, 2003.
[23] K. Lee, X. Fang, and S.P. Midkiff. Practical escape analyses: how good are they? In Proceedings of VEE ’07, pages 180–190. ACM, 2007.
[24] F.R.M. Barnes and C.G. Ritson. Checking process-oriented operating system behaviour using CSP and refinement. In PLOS 2009. ACM, 2009. To appear.
Communicating Process Architectures 2009 P.H. Welch et al. (Eds.) IOS Press, 2009 © 2009 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-60750-065-0-135
New ALT for Application Timers and Synchronisation Point Scheduling
(Two excerpts from a small channel based scheduler)

Øyvind TEIG and Per Johan VANNEBO
Autronica Fire and Security1, Trondheim, Norway
{oyvind.teig, per.vannebo}@autronicafire.no

Abstract. During the design of a small channel-based concurrency runtime system (ChanSched, written in ANSI C), we saw that application timers (which we call egg and repeat timers) could be part of its supported ALT construct, even if their states live through several ALTs. There are no side effects on the ALT semantics, which enable waiting for channels, channel timeout and, now, the new application timers. Application timers are no longer busy-polled for timeout by the process. We show how the classical occam language may benefit from a spin-off of this same idea. Secondly, we wanted application programmers to be freed from their earlier practice of explicitly coding communication states at channel synchronisation points, which was needed by a layered in-house scheduler. This led us to develop an alternative to the non-ANSI C "computed goto" (found in gcc). Instead, we use a switch/case with goto line-number-tags in a synch-point-table for scheduling. We call this table, one for each process, a proctor table. The programmer does not need to manage this table, which is generated with a script and hidden within an #include file.

Keywords. application timers, alternative, synch-point scheduling
Introduction

This paper describes two ideas that have been implemented as part of a small runtime system (ChanSched), to be used in a forthcoming new product. ChanSched provides processes; synchronous, zero-buffered, rendezvous-type, one-way data channels; and asynchronous signal/timeout data-free channels. It started with the first author showing the second author a private "designer's note" [1], where the problem of mixing communication states and application states is discussed. Channel state machines are visible in the code, on a par with application states. This state mix had been incurred by an earlier runtime system, where synchronous channels were running on top of an asynchronous runtime system. With ChanSched, we now wanted to make it simpler for new programmers to understand and use these channels as a fundamental design and implementation paradigm. Even if the earlier system, described in [2] and [3], has worked fluently for years, the reasonable critique by readers was that the code was somewhat difficult to understand.
1 A UTC Fire & Security Company. NO-7483 Trondheim, Norway. See http://www.autronicafire.no
Ø. Teig and P.J. Vannebo / Application Timers and Synchronisation Point Scheduling
So, the second author was inspired to build ChanSched from scratch, based on this earlier experience. Added goals were to use it in systems with low power consumption and high formal product "approval level". During this development, which ping-ponged between the two of us, we solved two problems present with the previous system: how do we manage application timers without busy-polling, and how do we schedule to synchronisation points using only ANSI C?

1. Application Timers in a New ALT

1.1 The Problem and our Problem

The problem was that with the earlier system we had "busy-polled" application timers to trigger actions. The actions could be status checks once per hour or lights to blink in certain patterns. On porting this to ChanSched, we wanted to avoid polling. We had written this channel-based runtime system in C, where its ALT could either have channel components, or a single2 timer that could time out on silent channels, and nothing more. We called the latter an ALTTIMER, defined as a non-application timer. We also wanted several timers and channel handling to go on concurrently in the same ALT. Earlier experience with occam [4] was of some help when designing this, but when going through the referees' comments on this paper, we understood that that knowledge had withered. We were asked to show examples in occam-like syntax. Obeying this, we surprisingly (and problematically for the paper, we thought) saw that occam in fact does not need to build application timers by polling – its designers had included them in the ALT! Something had been lost in our translation to C. However, the surprise persisted when we saw that this exercise showed that even classical occam could benefit from our thoughts. Looking over our shoulders, maybe there was a reason why the occam timers had timed out for us.

1.2 Our ANSI C based ChanSched System and the New Flora of Timers

We start by showing a ChanSched process.
Then, we proceed with a classic occam example and finally to a suggestion for a new timer mechanism for occam. Then, we will come back and do a more thorough discussion of our implementation.

The code (Listing 1) is an "extended" version of the Prefix process in Commstime3 ([5] and Figure 1). Our cooperative, non-preemptive scheduler runs processes by calling the process by name, via its address, resolved once. So, on every scheduling and rescheduling, line 3 is entered. There, the process's context pointer is restored from the parameter g_CP (pointing to a place in the heap), filled in by the scheduler. In the second half of this paper we shall see how the PROCTOR_PREFIX causes cooperative synchronisation (or blocking) points in the code to be reached, by jumping over lines, even into the while loop. The consequence of this is that lines 5-7 are initialisation code. Knowing this, one could try to understand the code as one would occam: CHAN_OUT and gALT_END are synchronisation points, meaning that line 9 will run only when a communication has taken place, and line 18 when a communication or a timeout has happened. We will not go through every detail here, but concentrate on the timers.
2 Our implementation handled only a single timer in its ALT.

3 When we developed ChanSched, we used Commstime as our natural test case. Since it has no ALT timer handling in any of its processes, we decided to insert our timers in P_Prefix, for no concrete reason.
01 Void P_Prefix (void) // extended "Prefix"
02 {
03   Prefix_CP_a CP = (Prefix_CP_a)g_CP; // get process Context from Scheduler
04   PROCTOR_PREFIX() // jump table (see Section 2)
05   ... some initialisation
06   SET_EGGTIMER (CHAN_EGGTIMER, CP->LED_Timeout_Tick);
07   SET_REPTIMER (CHAN_REPTIMER, ADC_TIME_TICKS);
08   CHAN_OUT (CHAN_DATA_0, &CP->Data_0, sizeof(CP->Data_0)); // first output
09   while (TRUE)
10   {
11     ALT(); // this is the needed "PRI_ALT"
12     ALT_EGGREPTIMER_IN (CHAN_EGGTIMER);
13     ALT_EGGREPTIMER_IN (CHAN_REPTIMER);
14     gALT_SIGNAL_CHAN_IN (CHAN_SIGNAL_AD_READY);
15     ALT_CHAN_IN (CHAN_DATA_2, &CP->Data_2, sizeof (CP->Data_2));
16     ALT_ALTTIMER_IN (CHAN_ALTTIMER, TIME_TICKS_100_MSECS);
17     gALT_END();
18     switch (g_ThisChannelId)
19     {
20       ... process the guard that has been taken, e.g. CHAN_DATA_2
21       CHAN_OUT (CHAN_DATA_0, &CP->Data_0, sizeof (CP->Data_0));
22     };
23   }
24 }
Listing 1. EGGTIMER, REPTIMER and PROCTOR_PREFIX (ANSI C and macros). (See Figure 1 for process data-flow diagram)
Note that ALT guard preconditions are hidden – they are controlled with SET and CLEAR macros (none shown). Only the input macros beginning with 'g' check preconditions; the others do not waste time testing a constant TRUE value. As one may understand from the above, timers are seen as channels – one channel per timer. Listing 1 is discussed in more detail throughout this paper.

1.2.1 Flora of Timers: ALTTIMER

Line 16 is our ALTTIMER. As mentioned, when neither channel CHAN_SIGNAL_AD_READY4 nor CHAN_DATA_2 has communicated for the last 100 ms, the CHAN_ALTTIMER guard causes the ALT to be taken. (The ALT structure is said to be "taken" by the first ready guard.) When a channel guard is taken, the underlying timer associated with the ALTTIMER is stopped. It is restarted every time the ALT is entered.

4 In order to also test non-timer interrupts, we included an analogue input ready channel. The potentiometer value thus read was used to pulse-control a blinking LED. This way our example became rather complete.
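The ALTTIMER behaviour just described (restarted on every ALT entry, abandoned as soon as a channel guard is taken) resembles a select-with-timeout loop. A rough Python analogue follows; all names are illustrative, and the polling loop is only a stand-in for ChanSched's blocking implementation:

```python
import queue
import time

def alt_with_alttimer(chans, timeout_s):
    """One ALT entry: return the first ready channel guard, or the
    ALTTIMER guard if every channel stays silent for timeout_s."""
    deadline = time.monotonic() + timeout_s    # timer restarted on ALT entry
    while True:
        for name, q in chans.items():
            try:
                return name, q.get_nowait()    # a channel guard is taken;
            except queue.Empty:                # the underlying timer is dropped
                pass
        if time.monotonic() >= deadline:
            return "ALTTIMER", None            # channels silent: timeout guard
        time.sleep(0.001)                      # polling stand-in only

chans = {"CHAN_DATA_2": queue.Queue()}
chans["CHAN_DATA_2"].put(42)
print(alt_with_alttimer(chans, 0.1))   # ('CHAN_DATA_2', 42)
print(alt_with_alttimer(chans, 0.1))   # ('ALTTIMER', None), after ~100 ms
```

Because the deadline is recomputed on every call, the ALTTIMER never accumulates state between ALTs, which is exactly why it cannot serve as an application timer.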
1.2.3 Flora of Timers: EGGTIMER and REPTIMER In order to avoid application timers by polling, we decided to implement two new timer types and make them usable directly in the ALT. An EGGTIMER times out once (at a predefined time) and an REPTIMER times out repeatedly (at a pre-defined sequence of equally spaced times). Generically, we will call them EGGREPTIMERs here. These timers are initialised in application code before the ALT. After initialisation, they will be started the first time they are seen in an ALT – the next time the ALT is reached, they will continue to run (discussed later). They are not stopped when their ALT is taken by some other guard, only when they have timed out. So long as their ALT remains on the process execution path (e.g. in a loop), sooner or later they will time out and be taken. Even then, the REPTIMER will already have continued, with no skew and low jitter handling. However, EGGREPTIMERs may be stopped by application code before they have timed out. In this respect, the semantics of ALTTIMER and EGGREPTIMER differ – since an ALTTIMER has no meaning outside an ALT. 1.2.4 Arithmetic of Time Observe that no use of our timers makes reference to system time, or any derived value used to store some previous or future time. Therefore, no time arithmetic with values derived from system time may be done. We have in fact not yet seen any need for time arithmetic at process level5. If this for some reason is needed, we could easily add a function to read system time. 1.3 Timers in Classical occam As we were forced to rediscover: occam is able to handle any timer, including our application timers. In listing 2, there is an example of an ALTTIMER and a REPTIMER6. 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
25 PROC P_Listing2 (VAL INT n, CHAN INT InChan? OutChan!)  -- extended "Prefix"
26   INT Timeout_ALTTIMER, Timeout_REPTIMER:
27   TIMER Clock_ALTTIMER, Clock_REPTIMER:
28   SEQ
29     OutChan ! n
30     Clock_REPTIMER ? Timeout_REPTIMER
31     Timeout_REPTIMER := Timeout_REPTIMER PLUS half.an.hour
32     WHILE TRUE
33       Clock_ALTTIMER ? Timeout_ALTTIMER
34       PRI ALT
35         Clock_REPTIMER ? AFTER Timeout_REPTIMER
36           SEQ
37             ... process every 30 minutes
38             Timeout_REPTIMER := Timeout_REPTIMER PLUS half.an.hour  -- no skew, only jitter
39         INT Data:
40         InChan ? Data
41           ... process Data
42         Clock_ALTTIMER ? AFTER Timeout_ALTTIMER PLUS hundred.ms
43           ... MyChan pause
44           ... do background task (starvation possible)  -- skew and jitter
45 :
Listing 2. General timers in occam.

5 Note that the Consume process in a proper Commstime implementation in fact does use time arithmetic (for performance measurement). We measured consumed time with the debugger.

6 To free ourselves from the ChanSched ANSI C extended Commstime, the next two examples stand for themselves, reflecting only the additional timer aspects of P_Prefix.
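The "no skew, only jitter" comments in Listing 2 come down to one arithmetic choice: the next repeating deadline is computed from the previous deadline (Timeout PLUS period), not from the current time when the timeout is finally handled. A small numeric model of the difference (tick values are arbitrary, chosen for illustration):

```python
PERIOD, LATENCY = 100, 3      # timeout period and handling delay, in ticks

rep_deadline = PERIOD         # REPTIMER style: advance from previous deadline
naive_deadline = PERIOD       # naive style: restart from "now" when handled
for _ in range(5):
    rep_deadline += PERIOD                               # no skew, only jitter
    naive_deadline = (naive_deadline + LATENCY) + PERIOD # skew accumulates

print(rep_deadline)     # 600: still exactly on the 100-tick grid
print(naive_deadline)   # 615: drifted by 5 * LATENCY ticks
```

Each late handling leaves the REPTIMER's grid untouched (the lateness is only jitter), while the naive restart pushes every subsequent deadline further out.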
Scheduling is enabled AFTER the specified timeouts. Therefore both timer types will cause jitter, but the REPTIMER will be without skew, provided a timeout is handled before the next one is due. This is the same as for the code in Listing 1. In occam, it is the process code that does the necessary time arithmetic. Inputting system time from any timer into an INT (lines 30 and 33) happens immediately – no synchronisation is involved that may cause blocking. These absolute time values are used to calculate the next absolute timeout value (lines 31 and 42) and are used by the ALT guards (lines 35 and 42). The language requires the programmer to manage these values. We remember working with occam code. Should the TIMER or the INT have clock or time in its name (or neither)? And when reading other people's code: is now a TIMER or an INT? The INT holding those absolute time values obfuscated the thinking process.

1.4 New Timers for occam
46  PROC P_Listing3 (VAL INT n, CHAN INT InChan?, OutChan!) -- extended "Prefix"
47    TIMER My_ALTTIMER, My_REPTIMER: -- only timers, no variables
48    SEQ
49      OutChan ! n
50      SET_TIMER (REPTIMER, My_REPTIMER, 30, MINUTE, 24H)
51      SET_TIMER (ALTTIMER, My_ALTTIMER, 0, MILLISEC, 32BIT)
52      WHILE TRUE
53        PRI ALT
54          My_REPTIMER ? AFTER ()
55            ... process every 30 minutes (no timeout value to compute)
56            -- no skew, only jitter
57          INT Data:
58          InChan ? Data
59            ... process Data
60          My_ALTTIMER ? AFTER (100)
61            ... MyChan pause do background task (starvation possible)
62            -- skew and jitter
63  :
Listing 3. Concrete configurable timers in a new occam.
Listing 3 shows a suggestion of how to rectify this. Here, we don't need to do time arithmetic. No declaration like Line 26 is needed – we never see those INT values. We just have TIMERs, which now behave more like abstract data types: SET_TIMER is parameterised with unit (granularity) and length (max time). Line 50 sets a granularity of 1 minute and orders a tick every 30 minutes – thus it needs 60 x 24 = 1440 counts (i.e. INT16 is enough) to cover 24 hours. Line 51 enables timeouts up to 2^32 milliseconds – some 49 days (INT32). Lines 50 and 51 also define the type of timer: ALTTIMER or REPTIMER. So, we now have a set of configurable timers, mixable within the same ALT. A restriction is that there may be only one ALTTIMER, since it defines the handling of silent channels. The semantics of EGGREPTIMERs are that they are treated like channels: associated content is not touched when the ALT is taken. In this case it means that a stopped or running timer will stay stopped or running. Observe that we have not banned the possibility to read system time and do arithmetic with time. We are not concerned about precise language syntax here; enough to say that we are uncertain about the parameters to AFTER in Lines 54 and 60. Neither have usage rules and/or runtime checks for these timers been considered. Another point is that we suggest initialising an EGGREPTIMER with the SET-command, and starting it in the ALT – as we do in ChanSched (see below). It may start to run straight
away. However, in any case it will timeout in the ALT. It is outside the scope of this paper to discuss the precise semantics or semantic differences of these two possibilities, or the enforcement of usage rules by a compiler. Our choice compares to calculating a new timeout value (initialise) and then using (start) that timeout in a classical occam AFTER.

1.5 More on our ANSI C-based ChanSched System

1.5.1 Some Implementation Issues

Our implementation relies on a separate timer server process P_Timers_Handler, handling any number of timeouts (Figure 1). It delivers timeouts through asynchronous send-type channels carrying no data, called TIMER channels, one per EGGREPTIMER.
Figure 1. Our test system, called “extended Commstime”. (See Listing 1 for P_Prefix code)
An individual asynchronous non-blocking timer channel thus represents each timer, so any number of these may be signalled simultaneously. When a new EGGREPTIMER is added, the timer process is recompiled with a larger set of timers, controlled by a constant drawn from a table. So, once written and tested, the timer process requires no further editing. Initially, we did parameter handling of EGGREPTIMERs in the ALT, modelled on how the ALTTIMER is handled, where AFTER has a parameter. However, we soon realised that this was impractical, since start, restart and preemptive stop would often be easiest coded (and consequently, understood) away from the ALT. However, the macros/functions used to handle the start, restart and stop do fill the timer data structure, including the value for the next timeout. So, there should not be synchronising points between timer set-up and the ALT, as this could cause a wanted timeout to have passed before the ALT was reached. Section 1.4 discusses this in more detail for the new occam.
We use a function call and no real channel communication to set the EGGREPTIMER's parameters, used by the concurrent P_Timers_Handler. This does not violate exclusive usage of shared values, due to careful coding hidden from the user and the "run-to-completion" semantics of our runtime system. Therefore, no buffer process is needed. P_Timers_Handler takes its "tick updated" signal from a signal channel, sent directly from the system timer interrupt. Processor sleep continues until the next timeout, if there is no other pending work.

1.6 Discussion and Conclusion

The EGGTIMER and REPTIMER do not seem to interfere with the ALTTIMER, even if their states outlive the ALT. To outlive the ALT is not as unique as one may think: any channel would in a way outlive the ALT, since it is possible for a process to be first on a channel when the input ALT is not active. This is, in fact, a bearing paradigm. We have not done any research to find existing uses of this concept in different programming environments. We certainly suspect this could exist. Raising timers from "link level" ALTTIMER to "application level" EGGREPTIMERs is, we feel, a substantial move towards a more flexible programming paradigm for timers in our ANSI C based system. Now, none of the processes that need application timers needs to do any busy polling. It improves understanding, coding and battery life. Configurability of the timers has been shown. A new occam may benefit from these ideas as well. We have done no formal verification of EGGTIMERs or REPTIMERs.

2. Cooperative Scheduling via the Proctor Table

2.1 The Scheduler is Not as Transparent to User Code as we Thought

This section only considers Listing 1. As described in the introduction, the non-preemptive scheduler controlling an asynchronous message system with processes that have run-to-completion semantics had never been designed to reschedule to synchronisation points, since there are no synchronisation points in the paradigm.
So we built a layer on top of it ([2] and [3]), looking heavily to the SPoC [6] occam-to-C translator. We had learnt that the asynchronous scheduler worked like the synchronous scheduler of SPoC, which indeed had a rich set of synchronisation points. However, SPoC had occam source code on top. The SPoC scheduler had unique states for channel communication (i.e. synchronisation points), and the compiler flattened application states and communication states. This is the model we had used, where we did the flattening of the two state spaces (in ANSI C) by hand. Discovering the obvious – to make channel visibility be like many channel-based libraries – has been a long way to go [1]. The goal described in that note was to "send and send and then receive, including ALT" sequentially in code, with no visible communication states in between. So we decided to make a new cooperative scheduler from scratch, also motivated by the Safety Integrity Level (SIL) requirements as defined by IEC 61508, where arguing along the CSP line of thinking is appreciated [7]. Our main criterion for a new scheduler was that, in some way or another, it should be able to reschedule a process to the code immediately following the synchronisation point. This applies when the process had been first on a channel (or set of input channels) and should not proceed until the second contender arrived at the other end. This is the same functionality as described in [2] and [3] – but with invisible synchronisation points.
2.2 The Proctor Scheduling Table

Our solution was the "proctor" jump table: a name invented by us, illustrating that it takes care of scheduling and acts on behalf of the scheduler. It is generated by standard ANSI C pre-processor constructs, by hand coding or by a script. Errors in the table would cause the compiler either to issue an error about a missing label, or to warn about an unused label. We raised that warning to become an error, to make the scheme bullet proof. Listing 4 shows how the CHAN_OUT macro first stores the actual line number, then makes a label like SYNCH_8_L, which is the rescheduling point (in Listing 1). Observe that a C macro, no matter how many lines it may look like, is laid out as a single line by the preprocessor. Now, the system has a legal label to which it can reschedule, so a goto (if automatically generated) is a viable mechanism to use.
#define SCHEDULE_AT goto

#define CAT(a,b,c,d,e) a##b##c##d##e         // Concatenate to f.ex. "SYNCH_8_L"
#define SYNCH_LABEL(a,b,c,d,e) CAT(a,b,c,d,e) // Label for Proctor-table

#define PROC_DESCHEDULE_AND_LABEL() \
    CP->LineNo = __LINE__; \
    return; \
    SYNCH_LABEL(SYNCH,_,__LINE__,_,L):

#define CHAN_OUT(chan,dataptr,len) \
    if (ChanSched_ChanOut(chan,dataptr,len) == FALSE) \
    { \
        PROC_DESCHEDULE_AND_LABEL(); \
    } \
    g_ThisAltTaken = FALSE
Listing 4. Some macros used to build and use line number labels.
The proctor table takes us there: it switches on CP->LineNo – which is not on the stack but in the process context, and so has survived the return – and executes a goto (SCHEDULE_AT) to the corresponding label:
#define PROCTOR_PREFIX() \
    switch (CP->LineNo) \
    { \
        case 0: break; \
        case 8: SCHEDULE_AT SYNCH_8_L; \
        case 17: SCHEDULE_AT SYNCH_17_L; \
        case 21: SCHEDULE_AT SYNCH_21_L; \
        DEFAULT_EXIT \
    }
Listing 5. The proctor-table.
This is standard ANSI C. We avoid the extension called "computed goto" (address), which is available in gcc, a compiler we do not use for these applications [8]. We could call our solution "scripted goto" (label), just to differentiate. Listing 6 shows the output of our script, which generates the proctor table file for us:
In P_Commstime.c there were 4 processes, and 10 synchronisation points
In P_Timers_Handler.c there was 1 process, and 1 synchronisation point
There were a total of 2 files, 5 processes and 11 synchronisation points
Listing 6. Log from the ProctorPreprocessor script.
When the scheduler always schedules the process to the function start in Listing 1, the proctor table macro causes the process to re-schedule to the correct line. The code is truly invisible but available, since the macro body is contained in a separate #included file. Initially, a dummy CP->LineNo, set up by the run-time system, is set to zero. This takes the process through its initialising code: from the proctor table to the first synchronisation point, Line 8 of Listing 1.

2.3 Discussion and Conclusion

The complexities of a preemptive scheduler – and the fact that we do not need one – make this solution quite usable. It is safe and invisible to the user, who does not need to relate to link level states (also called communication states or synchronisation points). So, the user needs to relate only to application states. The code is portable, standard ANSI C. Local process variables that reside on the stack will not survive a synchronisation point, so the programmer has to place these in the process context. The overhead of the proctor jump table also includes storing the next line number at run-time, but this is small and acceptable for us. These points are less frequent than function calls, but are comparable in cycle count.

3. Conclusions

Section 1 shows that differentiating configurable types of timers in the ALT may raise timers to a higher and more portable level. Section 2 displays the use of a standard ANSI C feature wrapped into a jump (proctor) table, in service of a cooperative scheduler. Making ANSI C process scheduling with invisible channel communication and synchronisation states is a step forward for us. With EGGTIMERs and REPTIMERs, process application code is now easier to write, read and understand. We have also noted that there may be a need to add timer handling to an "extended Commstime", so that implementors could have a common platform for this as well.
Acknowledgement

We thank the management at Autronica Fire and Security's development department in Trondheim for allowing us to publish this work. [Øyvind Teig is Senior Development Engineer at Autronica Fire and Security. He has worked with embedded systems for more than 30 years, and is especially interested in real-time language issues (see http://www.teigfam.net/oyvind/pub for publications and contact information). Per Johan Vannebo is Technical Expert at Autronica Fire and Security. He has worked with embedded systems for 13 years.]
References

[1] Ø. Teig, A scheduler is not as transparent as I thought (Why CSP-type blocking channel state machines were visible, and how to make them disappear), in 'Designer's Notes' #18, at author's home page, http://www.teigfam.net/oyvind/pub/notes/18_A_scheduler_is_not_so_transparent.html
[2] Ø. Teig, From message queue to ready queue (Case study of a small, dependable synchronous blocking channels API – Ship & forget rather than send & forget), in 'ERCIM Workshop on Dependable Software Intensive Embedded Systems', in cooperation with the 31st EUROMICRO Conference on Software Engineering and Advanced Applications (SEAA), Porto, Portugal, 2005. IEEE Computer Press, ISBN 2-912335-15-9. Also at http://www.teigfam.net/oyvind/pub/pub_details.html#Ercim05
[3] Ø. Teig, No Blocking on Yesterday's Embedded CSP Implementation (The Rubber Band of Getting it Right and Simple), in 'Communicating Process Architectures 2006', P.H. Welch, J. Kerridge, and F.R.M. Barnes (Eds.), pp. 331-338, IOS Press, 2006. Also at http://www.teigfam.net/oyvind/pub/pub_details.html#NoBlocking
[4] Inmos Limited, The occam programming language. Prentice Hall, 1984. Also see http://en.wikipedia.org/wiki/occam_(programming_language)
[5] P.H. Welch and F.R.M. Barnes, Prioritised Dynamic Communicating Processes – Part I. In 'Communicating Process Architectures 2002', J. Pascoe, R. Loader and V. Sunderam (Eds.), pp. 321-352, IOS Press, 2002.
[6] M. Debbage, M. Hill, S. Wykes and D. Nicole, Southampton's Portable occam Compiler (SPoC). In R. Miles, A. Chalmers (Eds.), 'Progress in Transputer and occam Research', WoTUG-17, pp. 40-55, IOS Press, Amsterdam, 1994.
[7] IEC 61508, SIL: Safety Integrity Level, a safety-related metric used to quantify a system's safety level. See http://en.wikipedia.org/wiki/Safety_Integrity_Level
[8] Wikipedia, Computed goto. See http://en.wikipedia.org/wiki/Goto_(command)#Computed_GOTO
Communicating Process Architectures 2009
P.H. Welch et al. (Eds.)
IOS Press, 2009
© 2009 The authors and IOS Press. All rights reserved.
doi:10.3233/978-1-60750-065-0-145
Translating ETC to LLVM Assembly

Carl G. RITSON
School of Computing, University of Kent, Canterbury, Kent, CT2 7NF, England
[email protected]

Abstract. The LLVM compiler infrastructure project provides a machine independent virtual instruction set, along with tools for its optimisation and compilation to a wide range of machine architectures. Compiler writers can use LLVM's tools and instruction set to simplify the task of supporting multiple hardware/software platforms. In this paper we present an exploration of translation from stack-based Extended Transputer Code (ETC) to SSA-based LLVM assembly language. This work is intended to be a stepping stone towards direct compilation of occam-π and similar languages to LLVM's instruction set.

Keywords. concurrency, ETC, LLVM, occam-pi, occ21, tranx86
Introduction and Motivation

The original occam language toolchain supported a single processor architecture, that of the INMOS Transputer [1,2]. Following INMOS's decision to end development of the occam language, the sources for the compiler were released to the occam For All (oFA) project [3]. The oFA project modified the INMOS compiler (occ21), adding support for processor architectures other than the Transputer, and developed the basis for today's Kent Retargetable occam Compiler (KRoC) [4]. Figure 1 shows the various compilation steps for an occam or occam-π program. The occ21 compiler generates Extended Transputer Code (ETC) [5], which targets a virtual Transputer processor. Another tool, tranx86 [6], generates a machine object from the ETC for a target architecture. This is in turn linked with the runtime kernel CCSP [7] and other system libraries. Tools such as tranx86, octran and tranpc [8] have in the past provided support for IA-32, MIPS, PowerPC and Sparc architectures; however, with the progressive development of new features in the occam-π language, only IA-32 support is well maintained at the time of writing. This is a consequence of the development time required to maintain support for a large number of hardware/software architectures. In recent years the Transterpreter Virtual Machine (TVM), which executes linked ETC bytecode directly, has provided an environment for executing occam-π programs on architectures other than IA-32 [9,10]. This has been possible due to the small size of the TVM codebase, and its implementation in architecture independent ANSI C. Portability and maintainability are gained at the sacrifice of execution speed: a program executed in the TVM runs around 100 times slower than its equivalent tranx86 generated object code. In this paper we present a new translation for ETC bytecode, from the virtual Transputer instruction set to the LLVM virtual instruction set [11,12].
The LLVM compiler infrastructure project provides a machine independent virtual instruction set, along with tools for its optimisation and compilation to a wide range of machine architectures. By targeting a virtual instruction set that has a well developed set of platform backends, we aim to increase
146
C.G. Ritson / Translating ETC to LLVM Assembly
Figure 1. Flow through the KRoC and Transterpreter toolchains, from source to program execution. This paper covers developments in the grey box.
the number of platforms the existing occam-π compiler framework can target. LLVM also provides a pass-based framework for optimisation at the assembly level, with a large number of pre-written optimisation passes (e.g. dead-code removal, constant folding, etc.). Translating to the LLVM instruction set provides us with access to these ready-made optimisations, as opposed to writing our own, as has been done in the past [6]. The virtual instruction sets of the Java Virtual Machine (JVM) and .NET's Common Language Runtime (CLR) have also been used as portable compilation targets [13,14]. Unlike LLVM, these instruction sets rely on a virtual machine implementation and do not provide a clear path for linking with our efficient multicore language runtime [7]. This was a key motivating factor in choosing LLVM over the JVM or CLR. An additional concern regarding the JVM and CLR is that large parts of their code bases are concerned with language features not relevant for occam-π, e.g. class loading or garbage collection. Given our desire to support small embedded devices (section 3.7), it seems appropriate not to encumber ourselves with a large virtual machine. LLVM's increasing support for embedded architectures, XMOS's XCore processor in particular [15], provided a further motivation to choose it over the JVM or CLR. In section 1 we briefly outline the LLVM instruction set and toolchain. We describe the steps of our translation from ETC bytecode to LLVM assembly in section 2. Section 3 contains initial benchmark results comparing our translator's output via LLVM to that from tranx86. Finally, in section 4 we conclude and comment on directions for future work.

1. LLVM

In this section we briefly describe the LLVM project's infrastructure and its origins. Additionally, we give an introduction to LLVM assembly language as an aid to understanding the translation examples in section 2.
Lattner proposed the LLVM infrastructure as a means of allowing optimisation of a program not just at compile time, but throughout its lifetime [11]. This includes optimisation at compile time, link time and runtime, as well as offline optimisation, where offline optimisations may tailor a program for a specific system, or perhaps apply profiling data collected from previous executions.
define i32 @cube (i32 %x) {
  %sq = mul i32 %x, %x  ; multiply x by x
  %cu = mul i32 %sq, %x ; multiply sq by x
  ret i32 %cu           ; return cu
}

Figure 2. Example LLVM function which raises a value to the power of three.
The LLVM infrastructure consists of a virtual instruction set, a bytecode format for the instruction set, front-ends which generate bytecode from sources (including assembly), a virtual machine and native code generators for the bytecode. Having compiled a program to LLVM bytecode, it is then optimised before being compiled to native object code or JIT compiled in the virtual machine interpreter. Optimisation passes take bytecode (or its in-memory representation) as input, and produce bytecode as output. Each pass may modify the code or simply insert annotations to influence other passes, e.g. usage information. In this paper we discuss the generation of LLVM assembly language from ETC for use with LLVM's native code generators. The LLVM assembly language is strongly typed, and uses static single-assignment (SSA) form. It has neither machine registers nor an operand stack; rather, identifiers are defined when assigned to, and this assignment may occur only once. Identifiers have global or local scope; the scope of an identifier is indicated by its initial character. The example in Figure 2 shows a global function @cube which takes a 32-bit integer (given the local identifier %x), and returns it raised to the power of three. This example also highlights LLVM's type system, which requires all identifiers and expressions to have explicitly specified types. LLVM supports the separate declaration and definition of functions: header files declare functions, which have a definition at link time. The use of explicit functions, as opposed to labels and jump instructions, frees the programmer from defining a calling convention. This in turn allows LLVM code to function transparently with the calling conventions of multiple hardware and software ABIs. In addition to functions, LLVM provides a restricted form of traditional labels. It is not possible to derive the address of an LLVM label or assign a label to an identifier.
Furthermore, the last statement of a labelled block must be a branching instruction, either to another label or a return statement. These restrictions give LLVM optimisations a well-defined view of program control flow, but do present some interesting challenges (see section 2.2). In our examples we have, where appropriate, commented LLVM syntax; however, for a full definition of the LLVM assembly language we refer the reader to the project's website and reference manual [16].

2. ETC to LLVM Translation

This section describes the key steps in the translation of stack-based Extended Transputer Code (ETC) to the SSA-form LLVM assembly language.

2.1. Stack to SSA

ETC bases its execution model on that of the Transputer, a processor with a three register stack. A small set of instructions have coded operands, but the majority consume (pop) operands from the stack and produce (push) results to it. A separate data stack called the workspace provides the source or target for most load and store operations. Blind translation from a stack machine to a register machine can be achieved by designating a register for each stack position and shuffling data between registers as operands are
LDC 0   ; load constant 0
LDL 0   ; load workspace location 0
LDC 64  ; load constant 64
CSUB0   ; assert stack 1 < stack 0, and pop stack
LDLP 3  ; load a pointer to workspace location 3
BSUB    ; subscript stack 0 by stack 1
SB      ; store byte in stack 1 to pointer stack 0

Figure 3. Example ETC code which stores a 0 byte to an array. The base of the array is workspace location 3, and the offset to be written is stored in workspace location 0.
LDC 0   ; () => (reg 1)             STACK = <reg 1>
LDL 0   ; () => (reg 2)             STACK = <reg 2, reg 1>
LDC 64  ; () => (reg 3)             STACK = <reg 3, reg 2, reg 1>
CSUB0   ; (reg 3, reg 2) => (reg 2) STACK = <reg 2, reg 1>
LDLP 3  ; () => (reg 4)             STACK = <reg 4, reg 2, reg 1>
BSUB    ; (reg 4, reg 2) => (reg 5) STACK = <reg 5, reg 1>
SB      ; (reg 5, reg 1) => ()      STACK = <>

Figure 4. Tracing the stack utilisation of the ETC in Figure 3, generating a register for each unique operand.
[Figure 5 is a data flow graph: the instruction nodes LDC 0, LDL 0, LDC 64, CSUB0, LDLP 3, BSUB and SB are connected through the operand nodes reg_1 to reg_5.]

Figure 5. Data flow graph generated from the trace in Figure 4.
pushed and popped. The resulting translation is not particularly efficient, as it has a large number of register-to-register copies. More importantly, this form of blind translation is not possible with LLVM's assembly language, as identifiers (registers) cannot be reassigned. Instead we must trace the stack activity of instructions, creating a new identifier for each operand pushed, and associate it with each subsequent pop or reference of that operand. This is possible as all ETC instructions consume and produce constant numbers of operands. The process of tracing operands demonstrates one important property of SSA: it makes the data dependencies between instructions explicit. Figures 3, 4 and 5 show respectively: a sample ETC fragment, its traced form, and a data flow graph derived from the trace. Each generated identifier is a node in the data flow graph, connected to the nodes for its producers and consumers. From the example we can see that only the SB instruction depends on the first LDC, therefore it can be reordered to any point before the SB, or in fact constant folded. This direct mapping to the data flow graph representation is what makes SSA form desirable for pass-based optimisation. We apply this tracing process to the standard operand stack and the floating point operand
; load workspace offset 1
%reg_1 = load i32* (getelementptr i32* %wptr_1, i32 1)
; add 1
%reg_2 = add i32 %reg_1, 1
; store result to workspace offset 2
store i32 %reg_2, (getelementptr i32* %wptr_1, i32 2)
; load workspace offset 3
%reg_3 = load i32* (getelementptr i32* %wptr_1, i32 3)
; synchronise barrier
call void @kernel_barrier_sync (%reg_3)
; load workspace offset 1
%reg_4 = load i32* (getelementptr i32* %wptr_1, i32 1)
; add 2
%reg_5 = add i32 %reg_4, 2
; store result to workspace offset 2
store i32 %reg_5, (getelementptr i32* %wptr_1, i32 2)

Figure 6. LLVM code example which illustrates the dangers of optimisation across kernel calls.
stack. A data structure in our translator provides the number of input and output operands for each instruction. Additionally, we trace modifications to the workspace register, redefining it as required. Registers from the operand stack are typed as 32-bit integers (i32), and operands on the floating point stack as 64-bit double precision floating point numbers (double). The workspace pointer is an integer pointer (i32*). When an operand is used as a memory address it is cast to the appropriate pointer type. In theory, these casts may hinder certain kinds of optimisations, but we have not observed this in practice.

2.2. Process Representation

While the programmer's view of occam is one of all processes executing in parallel, this model is in practice simulated by one or more threads of execution moving through the concurrent processes. The execution flow may leave processes at defined instructions, re-entering at the next instruction. The state of the operand stack after these instructions is undefined. Instructions which deschedule the process, such as channel communication or barrier synchronisation, are implemented as calls to the runtime kernel (CCSP) [7]. In a low-level machine code generator such as tranx86, the generator is aware of all registers in use and ensures that their state is not assumed constant across a kernel call. Take the example in Figure 6: there is a risk the code generator may choose to remove the second load of workspace offset 1, and reuse the register from the first load. However, the value of this register may have been changed by another process which is scheduled by the kernel before execution returns to the process in the example. While the system ABI specifies which registers should be preserved by the callee if modified, the kernel does not know which registers will be used by other processes it schedules. If the kernel is to preserve the registers then it must save all volatile registers when switching processes.
This requires space to be allocated for each process's registers, something the occam compiler does not do, as the instruction it generated was clearly specified to leave the operand stack undefined. More importantly, the code to store a process's registers would have to be rewritten in the system assembly language for each platform to be supported. Given our goal of minimal maintenance portability this is not acceptable. Our solution is to break down monolithic processes into sequences of uninterruptable
[Figure 7 shows process A (functions f1 ... fn) and process B (functions g1 ... gm), with the runtime kernel between them.]

Figure 7. Execution of the component functions of processes A and B is interleaved by the runtime kernel.
functions which pass continuations [17]. Control flow is then restricted such that it may only leave or enter a process at the junctures between its component functions. The functions of the process are then mapped directly to LLVM function definitions, which gives LLVM an identical view of the control flow to that of our internal representation. LLVM's code generation backends will then serialise state at points where control flow may leave the process. Figure 7 gives a graphical representation of this, as the runtime kernel interleaves the functions f1 to fn of process A with g1 to gm of process B. In practice the continuation is the workspace (wptr), with the address of the next function to execute stored at wptr[-1]. This is very similar to the Transputer's mechanism for managing blocked processes, except that the stored address is a function and thus the dispatch mechanism is not a jump, but a call. Thus the dispatch of a continuation (wptr) is the tail call: wptr[-1](wptr). We implement the dispatch of continuations in the generated LLVM assembly. Kernel calls return the next continuation as selected by the scheduler, which is then dispatched by the caller. This removes the need for system specific assembly instructions in the kernel to modify execution flow, and thus greatly simplifies the kernel implementation. The runtime kernel can then be implemented as a standard system library. Figure 8 shows the full code listing of a kernel call generated by our translation tool. Two component functions of a process kroc.screen.process are shown (see section 2.5.1 for more details on function naming). The first constructs a continuation to the second, then makes a kernel call for channel input and dispatches the returned continuation.

2.3. Calling Conventions

When calling a process as a subroutine, we split the present process function and make a tail call to the callee, passing a continuation to the newly created function as the return address.
This process is essentially the same as that of the Transputer instructions CALL and RET. There are, however, some special cases, which we address in the remainder of this section. The occam language has both processes (PROC) and functions (FUNCTION). Processes may modify their writable (non-VAL) parameters, interact with their environment through channels and synchronisation primitives, and go parallel, creating concurrent subprocesses. Functions, on the other hand, may not modify their parameters, perform any potentially blocking operations or go parallel, but may return values (processes do not return values). While it is possible to implement occam's pure functions in LLVM using the normal call stack, we have not yet done so, for pragmatic reasons. Instead we treat function calls as process calls. Function returns are then handled by rewriting the return values into parameters to the continuation function. The main obstacle to supporting pure functions is that the occ21 compiler lowers functions to processes; this obscures functions in the resulting ETC output. It also allows some
C.G. Ritson / Translating ETC to LLVM Assembly
; Component function of process "kroc.screen.process"
define private fastcc void @O_kroc_screen_process_L0.3_0 (i8* %sched, i32* %wptr_1) {
  ; ... code omitted ...
  ; Build continuation
  ; tmp_6 = pointer to workspace offset -1
  %tmp_6 = getelementptr i32* %wptr_1, i32 -1
  ; tmp_7 = pointer to continuation function as byte pointer
  %tmp_7 = bitcast void (i8*, i32*)* @O_kroc_screen_process_L0.3_1 to i8*
  ; tmp_8 = tmp_7 cast to a 32-bit integer
  %tmp_8 = ptrtoint i8* %tmp_7 to i32
  ; store tmp_8 (continuation function pointer) to workspace offset -1
  store i32 %tmp_8, i32* %tmp_6
  ; Make kernel call
  ; The call parameters are reg_8, reg_7 and reg_6
  ; The next continuation is returned by the call as tmp_9
  %tmp_9 = call i32* @kernel_Y_in (i8* %sched, i32* %wptr_1, i32 %reg_8, i32 %reg_7, i32 %reg_6)
  ; Dispatch the next continuation
  ; tmp_10 = pointer to continuation offset -1
  %tmp_10 = getelementptr i32* %tmp_9, i32 -1
  ; tmp_12 = pointer to continuation function cast as 32-bit integer
  %tmp_12 = load i32* %tmp_10
  ; tmp_11 = pointer to continuation function
  %tmp_11 = inttoptr i32 %tmp_12 to void (i8*, i32*)*
  ; tail call tmp_11 passing the continuation (tmp_9) as its parameter
  tail call fastcc void %tmp_11 (i8* %sched, i32* %tmp_9) noreturn
  ret void
}

; Next function in the process "kroc.screen.process"
define private fastcc void @O_kroc_screen_process_L0.3_1 (i8* %sched, i32* %wptr_1) {
  ; ... code omitted ...
}

Figure 8. LLVM code example: actual output from our translation tool, showing a kernel call for channel input. This demonstrates continuation formation and dispatch.
kernel operations (e.g. memory allocation) within functions. Hence, to provide pure function support, the translation tool must reconstruct functions from processes, verify their purity, and have separate code generation paths for processes and functions. We considered such engineering excessive for this initial exploration; however, as the LLVM optimiser is likely to provide more effective inlining and fusion of pure functions, we intend to explore it in future work. In particular, the purity verification stage involved in such a translator should also be able to lift processes to functions, further improving code generation.

Another area affected by LLVM translation is the Foreign Function Interface (FFI). FFI allows occam programs to call functions implemented in other languages, such as C [18,19]. This facility is used to access the system libraries for file input and output, networking and graphics. At present the code generator (tranx86) must not only generate hardware-specific assembly, but also structure the call to conform to the operating-system-specific ABI. LLVM greatly simplifies the FFI call process as it abstracts away any ABI-specific logic. Hence
foreign functions are implemented as standard LLVM calls in our translator.

2.4. Branching and Labels

The Transputer instruction set has a relatively small number of control flow instructions:

• CALL: call subroutine (and a general call variant, GCALL),
• CJ: conditional jump,
• J: unconditional jump,
• LEND: loop end (a form of CJ which uses a counting block in memory),
• RET: return from subroutine.
In sections 2.2 and 2.3 we addressed the CALL and RET related elements of our translation; in this section we address the other instructions. The interesting aspect of the branching instructions J and CJ is their impact on the operand stack. An unconditional jump undefines the operand stack; this allows a process to be descheduled at certain jumps, which provided a preemption mechanism for long-running processes on the Transputer. The conditional jump instruction branches if the first element of the operand stack is zero, in doing so preserving the stack. If it does not branch then it instead pops the first element of the operand stack.

As part of operand tracing during the conversion to SSA form (see section 2.1), each encountered label within a process is tagged with the present stack operands. For the purposes of tracing, unconditional jumps undefine the stack, and conditional jumps consume the entire stack, outputting stackdepth − 1 new operands. Having traced the stack we compare the inputs of each label with the inferred inputs from the branch instructions which reference it, adjusting instruction behaviour as required. These adjustments can occur, for example, when the target of a conditional jump does not require the entire operand stack. While the compiler outputs additional stack depth information, this is not always sufficient, hence our introduction of an additional verification stage.

The SSA syntax of LLVM's assembly language adds some complication to branching code. When a label is the target of more than one branching instruction, φ nodes (phi nodes) must be introduced for each identifier which is dependent on the control flow. Figure 9 illustrates the use of φ nodes in a contrived code snippet generating 1/n, where the result is 1 when n = 0. The φ node selects a value for %fraction from the appropriate label's namespace, acting as a merge of values in the data flow graph.
In our translation tool we use the operand stack information generated for each label to build appropriate φ nodes for labels which are branch targets. Unconditional branch instructions are then added to connect these labels together, as LLVM's control flow does not automatically transition between labels.

Transputer bytecode is by design position independent; the arguments passed to start process instructions, loop ends and jumps are offsets from the present instruction. To support these offsets the occam compiler specifies the instruction arguments as label differences, Lt − Li, where Lt is the argument's target label and Li is a label placed before the instruction consuming the jump offset. While we can revert these differences to absolute label references by removing the subtraction of Li, LLVM assembly does not permit the derivation of label addresses. This prevents us passing labels as arguments to kernel calls such as start process (STARTP). We overcome this by lifting labels, for which the address is required, to function definitions. This is achieved by splitting the process in the same way as is done for kernel calls (see section 2.2). Adjacent labels are then connected by tail calls with the operand stack passed as parameters. There is no need to build continuations for these calls as control flow will not leave the process. Additionally, φ nodes are not required as the passing of the operand stack as parameters provides the required renaming.
; Compare n to 0.0
%is_zero = fcmp oeq double %n, 0.0
; Branch to the correct label:
; zero if is_zero = 1, otherwise not_zero
br i1 %is_zero, label %zero, label %not_zero

zero:
; Unconditionally branch to continue label
br label %continue

not_zero:
; Divide 1 by n
%nz_fraction = fdiv double 1.0, %n
; Unconditionally branch to continue label
br label %continue

continue:
; fraction depends on the source label:
; 1.0 if the source is zero
; nz_fraction if the source is not_zero
%fraction = phi double [ 1.0, %zero ], [ %nz_fraction, %not_zero ]

Figure 9. Example LLVM code showing the use of a phi node to select the value of the fraction identifier.
As an aside, earlier versions of our translation tool lifted all labels to function definitions to avoid the complexity of generating φ nodes, and to avoid tracking the use of labels as arguments. While it appeared that LLVM's optimiser was able to fuse many of these processes back together, it was felt that a layer of control flow was being obscured. In particular this created output which was often hard to debug. Hence, we redesigned our translator to lift labels only when required.

2.5. Odds and Ends

This section contains some brief notes on other interesting areas of our translation tool.

2.5.1. Symbol Naming

While LLVM allows a wide range of characters in symbol names, the generation of symbol names for processes is consistent with that used in tranx86 [6]. Characters not valid in a C function name are converted to underscores, and an O prefix is added. This allows ANSI C code to manipulate occam process symbols by name. Only processes marked as global by the compiler are exported; internally generated symbols are marked as private and tagged with the label name to prevent internal collisions. Declaration (declare) statements are added to the beginning of the assembly output for all processes referenced within the body. These declarations may include processes not defined in the assembly output; however, these will have been validated by the compiler as existing in another ETC source. The resulting output can then be compiled to LLVM bytecode or system assembly and the symbols resolved by the LLVM linker or the system linker as appropriate.

2.5.2. Arithmetic Overflow

An interesting feature of the occam language is that its standard arithmetic operations check for overflow and trigger an error when it is detected. In the ANSI C TVM, emulating these arithmetic instructions requires a number of additional logic steps and calculations [10]. This is inefficient on CPU architectures which provide flags for detecting overflow. The LLVM assembly language does not provide access to the CPU flags, but instead provides intrinsics for addition, subtraction and multiplication with overflow detection. We have used these intrinsics (@llvm.sadd.with.overflow, @llvm.ssub.with.overflow and @llvm.smul.with.overflow) to efficiently implement the instructions ADD, SUB and MUL.

2.5.3. Floating Point

occam supports a wide range of IEEE floating-point arithmetic and provides the ability to set the rounding mode in number space conversions. While an emulation library exists for this arithmetic, a more efficient hardware implementation was present in later Transputers and we seek to mirror this in our translator. However, we found that LLVM lacks support for setting the rounding mode of the FPU (this is still the case at the time of writing, with LLVM version 2.5). The LLVM assembly language specification defines all the relevant conversion instructions to truncate their results. While not ideal, we exploit this fact by adding or subtracting 0.5 before converting a value in order to simulate nearest rounding. We do not support plus and minus rounding modes as the compiler never generates the relevant instructions. We observed that the occ21 compiler only ever generates a rounding mode change instruction directly prior to a conversion instruction. Thus, instead of generating LLVM code for the mode change instruction, we tag the following conversion instruction with the new mode. Hence mode changes become static at the point of translation and can be optimised by LLVM, although this was not done for the purposes of optimisation.

3. Benchmarks

In this section we discuss preliminary benchmark results comparing the output of the existing tranx86 ETC converter to the output of our translation tool passed through LLVM's optimiser (opt) and native code generator (llc).
These benchmarks were performed using source code as-is from the KRoC subversion repository, revision 6002¹, with the exception of the mandelbrot benchmark, from which we removed the frame rate limiter. Table 1 shows the wall-clock execution times of the various benchmarks we will now discuss. All our benchmarks were performed on an eight-core Intel Xeon workstation composed of two E5320 quad-core processors running at 1.86GHz. Pairs of cores share 4MiB of L2 cache, giving a total of 16MiB L2 cache across eight cores.

Table 1. Benchmark execution times, comparing tranx86 and LLVM based compilations.

  Benchmark       tranx86 (s)   LLVM (s)   Difference (tranx86 → LLVM)
  agents 8 32        29.6         27.6        -7%
  agents 8 64        91.8         86.5        -6%
  fannkuch            1.29         1.33       +3%
  fasta               6.78         6.90       +2%
  mandelbrot         27.0          8.74      -68%
  ring 250000         3.84         4.28      +12%
  spectralnorm       23.1         14.3       -38%
3.1. agents

The agents benchmark was developed to compare the performance of the CCSP runtime [7] to that of other language runtimes. It is based on the occoids simulation developed as part of

¹ http://projects.cs.kent.ac.uk/projects/kroc/trac/log/kroc/trunk?rev=6002
the CoSMoS project [20]. A number of agent processes move over a two-dimensional torus avoiding each other, with their behaviour influenced by the agents they encounter. Each agent calculates forces between itself and other agents it can see, using only integer arithmetic. The amount of computation increases greatly with the density of agents, and hence we ran two variants for comparison: one with 32 initial agents per grid tile on an eight by eight grid, giving 2048 agents, and the other with double the density at 4096 agents on the same size grid. We see a marginal performance improvement in the LLVM version of this benchmark, which we attribute to LLVM's aggressive optimisation of the computation loops.

3.2. fannkuch

The fannkuch benchmark is based on a version from The Computer Language Benchmarks Game [21,22]. The source code involves large numbers of reads and writes to relatively small arrays of integers. We notice a very small decrease in performance in the LLVM version of this benchmark. This may be the result of tranx86 generating a more efficient instruction sequence for array bounds checking.

3.3. fasta

The fasta benchmark is also taken from The Computer Language Benchmarks Game. A set of random DNA sequences is generated and output; this involves array accesses and floating-point arithmetic. Again, like fannkuch, we notice a negligible decrease in performance and attribute this to array bounds checks.

3.4. mandelbrot

We modified the occam-π implementation of the mandelbrot set generator in the ttygames source directory to remove the frame rate limiter and used this as a benchmark. The implementation farms lines of the mandelbrot set image to 32 worker processes for generation, and buffers allow up to eight frames to be concurrently calculated. The complex number calculations for the mandelbrot set involve large numbers of floating point operations, and this benchmark demonstrates a vast improvement in LLVM's floating-point code generator over tranx86.
FPU instructions are generated by tranx86, whereas LLVM generates SSE instructions; the latter appear to be more efficient on modern x86 processors. Additionally, as we track the rounding mode at the source level (see section 2.5.3), we do not need to generate FPU mode change instructions, which may be disrupting FPU pipelining in tranx86 generated code.

3.5. ring

Another CCSP comparison benchmark, this sets up a ring of 256 processes. Ring processes receive a token, increment it, and then forward it on to the next ring node. We time 250,000 iterations of the ring, giving 64,000,000 independent communications. This allows us to calculate the communication times of tranx86 and our LLVM implementation at 60ns and 67ns respectively. We attribute the increase in communication time to the additional instructions required to unwind the stack when returning from kernel calls in our implementation. The tranx86 version of CCSP does not return from kernel calls (it dispatches the next process internally).

3.6. spectralnorm

The final benchmark from The Computer Language Benchmarks Game. This benchmark calculates the spectral norm of an infinite matrix. Matrix values are generated using floating-point arithmetic by a function which is called from a set of nested loops. The significant performance improvement with LLVM can be attributed to its inlining and more efficient floating-point code generation.

3.7. Code Size

Table 2. Binary text section sizes, comparing tranx86 and LLVM based compilations.

  Benchmark      tranx86 (bytes)   LLVM (bytes)   Difference (tranx86 → LLVM)
  agents             16410            36715          +124%
  fannkuch            3702             5522           +49%
  fasta               5134            10494          +104%
  mandelbrot          6098            12865          +111%
  ring                3453             6716           +94%
  spectralnorm        4065             6318           +55%

Table 2 shows the size of the text section of the benchmark binaries. We can see that the LLVM output is typically twice the size of the equivalent tranx86 output. It is surprising that this increase in binary size does not adversely affect performance, as it increases the cache pressure. As an experiment we passed a -code-model=small option to LLVM's native code generator; however, this made no difference to binary size. Some of the increase in binary size may be attributed to the continuation dispatch code, which is inlined within the resulting binary rather than being part of the runtime kernel as with tranx86. The fannkuch and spectralnorm benchmarks make almost no kernel calls, therefore contain very few continuation dispatches, and accordingly show the least growth. Another possibility is LLVM aggressively aligning instructions to increase performance. Further investigation is required to establish whether binary size can be reduced, as it is of particular concern for memory constrained embedded devices.
While our kernel call mechanism is approximately 10% slower, loop unrolling enhancements and dramatically improved floating-point performance offset this overhead. Typical applications are a mix of communication and computation, which should help preserve this balance. The occ21 compiler's memory bound model of compilation presents an underlying performance bottleneck to translation-based optimisations such as the one presented in this paper. This is a legacy of the Transputer's limited number of stack registers, and it is our intention to overcome this in the new Tock compiler. The ultimate aim of our work is to directly compile occam-π to LLVM assembly using Tock, bypassing ETC entirely.

Aside from the portability aspects of this work, access to an LLVM representation of occam-π programs opens the door to exploring concurrency-specific optimisations within an established optimisation framework. Interesting optimisations, such as fusing parallel processes using vector instructions and removing channel communications in linear pipelines, could be implemented as LLVM passes. LLVM's bytecode has also been used for various forms of static verification; a similar approach may be able to verify aspects of a compiled occam-π program, such as the safety of its access to mobile data. Going further, it is likely that LLVM's assembly language may benefit from a representation of concurrency, particularly for providing information to concurrency related optimisations.

Acknowledgements

This work was funded by EPSRC grant EP/D061822/1. We also thank the anonymous reviewers for comments which helped us improve the presentation of this paper.

References

[1] David A. P. Mitchell, Jonathan A. Thompson, Gordon A. Manson, and Graham R. Brookes. Inside The Transputer. Blackwell Scientific Publications, Ltd., Oxford, UK, 1990.
[2] INMOS Limited. The T9000 Transputer Instruction Set Manual. SGS-Thomson Microelectronics, 1993. Document number: 72 TRN 240 01.
[3] Michael D. Poole. occam-for-all – Two Approaches to Retargeting the INMOS occam Compiler. In Brian O'Neill, editor, Parallel Processing Developments – Proceedings of WoTUG 19, pages 167–178, Nottingham-Trent University, UK, March 1996. World occam and Transputer User Group, IOS Press, Netherlands.
[4] Kent Retargetable occam Compiler. (http://projects.cs.kent.ac.uk/projects/kroc/trac/).
[5] Michael D. Poole. Extended Transputer Code – a Target-Independent Representation of Parallel Programs. In Peter H. Welch and A.W.P. Bakkers, editors, Architectures, Languages and Patterns for Parallel and Distributed Applications, volume 52 of Concurrent Systems Engineering, April 1998. WoTUG, IOS Press.
[6] Frederick R.M. Barnes. tranx86 – an Optimising ETC to IA32 Translator. In Alan Chalmers, Majid Mirmehdi, and Henk Muller, editors, Communicating Process Architectures 2001, number 59 in Concurrent Systems Engineering Series, pages 265–282.
IOS Press, Amsterdam, The Netherlands, September 2001.
[7] Carl G. Ritson, Adam T. Sampson, and Frederick R. M. Barnes. Multicore Scheduling for Lightweight Communicating Processes. In John Field and Vasco Thudichum Vasconcelos, editors, Coordination Models and Languages, 11th International Conference, COORDINATION 2009, Lisboa, Portugal, June 9-12, 2009, Proceedings, volume 5521 of Lecture Notes in Computer Science, pages 163–183. Springer, June 2009.
[8] Peter H. Welch and David C. Wood. The Kent Retargetable occam Compiler. In Brian O'Neill, editor, Parallel Processing Developments – Proceedings of WoTUG 19, pages 143–166, Nottingham-Trent University, UK, March 1996. World occam and Transputer User Group, IOS Press, Netherlands.
[9] Christian L. Jacobsen and Matthew C. Jadud. The Transterpreter: A Transputer Interpreter. In Ian R. East, David Duce, Mark Green, Jeremy M. R. Martin, and Peter H. Welch, editors, Communicating Process Architectures 2004, volume 62 of Concurrent Systems Engineering Series, pages 99–106. IOS Press, September 2004.
[10] Christian L. Jacobsen. A Portable Runtime for Concurrency Research and Application. PhD thesis, Computing Laboratory, University of Kent, April 2008.
[11] Chris Lattner. LLVM: An Infrastructure for Multi-Stage Optimization. Master's thesis, Computer Science Dept., University of Illinois at Urbana-Champaign, Urbana, IL, December 2002.
[12] LLVM Project. (http://llvm.org).
[13] Rich Hickey. The Clojure Programming Language. In DLS '08: Proceedings of the 2008 Symposium on Dynamic Languages, pages 1–1, New York, NY, USA, 2008. ACM.
[14] Michel Schinz. Compiling Scala for the Java Virtual Machine. PhD thesis, Institut d'Informatique Fondamentale, 2005.
[15] David May. Communicating Process Architecture for Multicores. In Alistair A. McEwan, Wilson Ifill, and Peter H. Welch, editors, Communicating Process Architectures 2007, pages 21–32, July 2007.
[16] LLVM Language Reference Manual. (http://www.llvm.org/docs/LangRef.html).
[17] John C. Reynolds. The discoveries of continuations. Lisp and Symbolic Computation, 6(3-4):233–248, 1993.
[18] David C. Wood. KRoC – Calling C Functions from occam. Technical report, Computing Laboratory, University of Kent at Canterbury, August 1998.
[19] Damian J. Dimmich and Christian L. Jacobsen. A Foreign Function Interface Generator for occam-pi. In J. Broenink, H. Roebbers, J. Sunter, P. Welch, and D. Wood, editors, Communicating Process Architectures 2005, pages 235–248, Amsterdam, The Netherlands, September 2005. IOS Press.
[20] Paul Andrews, Adam T. Sampson, John Markus Bjørndalen, Susan Stepney, Jon Timmis, Douglas Warren, and Peter H. Welch. Investigating patterns for process-oriented modelling and simulation of space in complex systems. In S. Bullock, J. Noble, R. A. Watson, and M. A. Bedau, editors, Proceedings of the Eleventh International Conference on Artificial Life, Cambridge, MA, USA, August 2008. MIT Press.
[21] The Computer Language Benchmarks Game. (http://shootout.alioth.debian.org/).
[22] Kenneth R. Anderson and Duane Rettig. Performing Lisp analysis of the fannkuch benchmark. SIGPLAN Lisp Pointers, VII(4):2–12, 1994.
[23] Tock Compiler. (http://projects.cs.kent.ac.uk/projects/tock/trac/).
Communicating Process Architectures 2009
P.H. Welch et al. (Eds.)
IOS Press, 2009
© 2009 The authors and IOS Press. All rights reserved.
doi:10.3233/978-1-60750-065-0-159
Resumable Java Bytecode – Process Mobility for the JVM

Jan Bækgaard PEDERSEN and Brian KAUKE
School of Computer Science, University of Nevada, Las Vegas, Nevada, United States
[email protected], [email protected]

Abstract. This paper describes an implementation of resumable and mobile processes for a new process-oriented language called ProcessJ. ProcessJ is based on CSP and the π-calculus; it is structurally very close to occam-π, but its syntax is much closer to the imperative part of Java (with new constructs added for process orientation). One of the targets of ProcessJ is Java bytecode to be executed on the Java Virtual Machine (JVM), and in this paper we describe how to implement the process mobility features of ProcessJ with respect to the Java Virtual Machine. We show how to add functionality to support resumability (and process mobility) by a combination of code rewriting (adding extra code to the generated Java target code) and bytecode rewriting.
Introduction

In this paper we present a technique to achieve process resumability and mobility for ProcessJ processes executed in one or more Java Virtual Machines. ProcessJ is a new process-oriented language with syntax close to Java and semantics close to occam-π [20]. In the next subsection we briefly introduce ProcessJ.

We have developed a code generator (from ProcessJ to Java) and a rewriting technique for the Java bytecode (which is the result of compiling the Java code generated by the ProcessJ compiler) that alters the generated bytecode to save and restore state, as well as to support resuming execution in the middle of a code segment. This capability we call transparent mobility [16], which differs from non-transparent mobility in that the programmer does not need to be concerned about preserving the state of the system at any particular suspend or resume point. We do not, however, mean that processes may be implicitly suspended at arbitrary points in their execution.

ProcessJ

ProcessJ is a new process-oriented language. It is based on CSP [8] and the π-calculus [10]. Structurally it is very much like occam-π; it is imperative with support for synchronous communication through typed channels. Like occam-π, it supports mobility of processes. Syntactically it is very close to Java (but without objects), with added constructs needed for process orientation. ProcessJ currently targets the following execution platforms through different code generators (it produces source code which is then compiled using a compiler for the target language):

• Any platform that supports the KRoC [21,23] occam-π compiler. ProcessJ is translated to occam-π, and then passed to the KRoC compiler.
J.B. Pedersen and B. Kauke / Resumable Java Bytecode
• C and MPI [5], making it possible to write process-oriented programs for a distributed memory cluster or wide area network. ProcessJ is translated to C with MPI message passing calls and passed to a C compiler.
• Java with JCSP [17,19], which will run on any architecture supporting a JVM. ProcessJ is translated to JCSP, which is Java with library-provided CSP support, and passed to the Java compiler.

In this paper we focus on the process mobility of ProcessJ for Java/JCSP. As the JVM itself provides no support for continuations, and the Java language provides a restricted set of flow control constructs on which to build such functionality, it was initially not clear whether transparent process mobility could be usefully implemented on this platform.

Like any other process-oriented language, ProcessJ has the notion of processes (procedures executing in their own execution context), but since the translation is to Java, it is sometimes necessary to refer to methods when describing the generated Java code and Java bytecode. A simple example of a piece of ProcessJ code (without mobility) is a multiplexer that accepts input on two input channels (in1 and in2) and outputs on an output channel (out):

proc void mux(chan<int>.read in1, chan<int>.read in2, chan<int>.write out) {
  int data;
  while (true) {
    alt {
      data = in1.read() : out.write(data);
      data = in2.read() : out.write(data);
    }
  }
}
where chan<int>.read in1 declares in1 to be the reading end of a channel that carries integers, out.write(data) writes the value of data to the out channel, and alt represents an "alternation" between the two input guards guarding two channel write statements.

Other approaches such as [2,16] consider thread migration (which involves general object migration) in the Java language, but since ProcessJ is not object-oriented, we do not need to be concerned with object migration at the programmer level. We do use an encapsulation object at the translation level from ProcessJ to Java to hold the data that is transferred (this object serves as a continuation for the translated ProcessJ process). In addition, mobile processes, as in occam-π, are started, and resumed, by simply calling them as regular procedures (which translates into invoking them as regular non-static methods in the resulting Java code). In this way, we can interpret the suspended mobile as a continuation [7] represented by the object which holds the code, the saved state, and information about where to continue the code execution upon resumption.

1. Resumability

We start by defining the term resumability. We denote a procedure as resumable if it can be temporarily terminated by a programmer-inserted suspend statement, with control returned to the caller, and at some later time restarted at the instruction immediately following the suspend point and with the exact same local state, possibly in a different JVM located on a different machine (i.e. all local variables contain the same values as they did when the method
was terminated). In a process-oriented language, the only option a process has for communicating with its environment, that is, other processes, is through channel communication. This means that when the process is resumed, it might be with a different set of (channel, channel end, or barrier) parameters; in other situations a process might take certain parameters for initialization purposes [22], which must be provided with "dummy values" upon subsequent resumptions. Therefore, in this paper we consider resumability for procedures where the values of the local variables are saved and restored when the process is resumed, but where each process resumption can happen with a new set of actual parameters. This allows the environment to interact with the process. We start with a formal definition of resumability.

1.1. Formal Definition of Resumability

In this section we briefly define resumability for JVM bytecode in a more formal way (we disregard objects and fields as neither are used in ProcessJ). Each invocation of a bytecode method has its own evaluation stack; recall that the JVM is a stack-based architecture, and all arithmetic takes place on the evaluation stack, which we can model as an array s of values:

s = [e0, e1, . . . , ei]

In addition to an evaluation stack, each invocation has its own activation record (AR). (We consider non-static methods, but static methods are handled in a similar manner; the only difference is that a reference to this is stored at address 0 in the activation record for non-static method invocations.) We can also represent a saved activation record as an array:

A = [this, p1, . . . , pn, v1, . . . , vm]

where this is a reference to the current object, the pi are parameters, and the vi are local variables (p1 = A[1], . . . , pn = A[n], v1 = A[n + 1], . . . , vm = A[n + m]), where A[i] denotes the value of the parameter/local variable stored at address i.
We do not need to store this in the saved activation record, as it is automatically placed at address 0 of the activation record for every invocation of a method, but we include it here because there are instructions that refer to address 0, where this is always stored for non-static methods. It is worth mentioning that the encapsulating object used in the ProcessJ-to-Java translation uses non-static methods and fields; this is necessary since a ProcessJ program might have multiple mobile processes based on the same procedure. We can now define the semantic meaning of executing a basic block of bytecode instructions by considering the effect it has on the stack and the activation record. Only the last instruction of such a block can be a jump, so we are working with a block of code that will be executed completely. At this point, it is worth mentioning that at the end of a method invocation the stack is always empty; in addition, a ProcessJ suspend statement will translate to a return instruction, and at these return (suspend) points, the evaluation stack will also be empty. We consider a semantic function EJVM[[B]](s, A), where B = i0 i1 ... ik is a basic block of bytecode statements, and define:

EJVM[[i0 i1 ... ik]](s, A) = EJVM[[i1 ... ik]](s', A'), where (s', A') = EJVM[[i0]](s, A)

We shall not give the full semantic specification for the entire instruction set of the Java Virtual Machine, as it would take up too much space in this paper, but most of the instructions are straightforward. A few examples are (we assume non-static invocations here):
162
J.B. Pedersen and B. Kauke / Resumable Java Bytecode
EJVM[[iload 1]]([...], A) = ([..., a1], A),
    where A = [this, a1, ..., an+m]

EJVM[[istore 1]]([..., ek−1, ek], A) = ([..., ek−1], [this, ek, a2, ..., an+m]),
    where A = [a0, a1, ..., an+m]

EJVM[[goto X]]([...], A) = EJVM[[B']]([...], A),
    where B' is the basic block that starts at address X

EJVM[[ifeq X]]([e0, e1, ..., ek], A) = EJVM[[B']]([e0, e1, ..., ek−1], A), if ek = 0,
    where B' is the basic block that starts at address X

EJVM[[ifeq X]]([e0, e1, ..., ek], A) = EJVM[[B']]([e0, e1, ..., ek−1], A), if ek ≠ 0,
    where B' is the basic block that immediately follows ifeq X

EJVM[[invokevirtual f]]([..., q0, q1, ..., qj], A) = ([..., r], A),
    where r is the return value of EJVM[[Bf]]([ ], A'), A' = [q0, q1, ..., qj, ⊥, ..., ⊥],
    q0 is an object reference, f(q1, ..., qj) is the actual invocation, and Bf is the
    code for a non-void method f
where ⊥ represents the undefined value. This is more of a semantic trick than reality, as no activation record entry is ever left undefined at the actual use of its value (the Java compiler assures this); here we simply wish to denote that the locals might not yet have been assigned a value by the user code. Now let B = i0 i1 ... ij−1 ij ij+1 ... ik be a basic block of instructions (from the control flow graph associated with the code we are executing), and let ij represent an imaginary suspend instruction (as mentioned, it eventually becomes a return):

EJVM[[B]]([ ], A) = EJVM[[ij+1 ... ik]](s', A'), where (s', A') = EJVM[[i0 i1 ... ij−1]]([ ], A)    (1)

or equivalently:

EJVM[[i0 i1 ... ij−1 ij ij+1 ... ik]](s, A) = EJVM[[i0 i1 ... ij−1 ij+1 ... ik]](s, A)

that is, simply ignore the suspend instruction ij. Naturally, if the code is evaluated in two stages as in the first semantic definition, the invoking code must look something like this (assuming B is the body of a method foo()):

foo(..); // Execute i0 ... ij−1
foo(..); // Execute ij+1 ... ik

We call this form of resumability "resumability without parameter change", since (1) uses A' and not an A'' where A'' has the same local variables as A' but different parameter values (i.e., the parameters passed to foo are exactly the same for both calls). Resumability without parameter changes is not particularly interesting from a mobility standpoint in a process-oriented
language; typically we wish to be able to supply different parameters (most often channels, channel ends, and barriers) to the process when it is resumed (especially because the parameters could be channel ends which allow the process to interact with a new environment, that is, the process receiving the mobile). It turns out that if we can implement resumability without parameter change in the JVM (i.e., devise a method of restoring activation records between invocations), then the more useful type of resumability with parameter change comes totally free of charge! For completeness, let us define this as well. Consider again the basic block B = i0 i1 ... ij−1 ij ij+1 ... ik, where again ij represents a suspend instruction that returns control to the caller, and assume that the code in B is invoked by the calls foo(v1, ..., vn) and foo(v1', ..., vn') respectively:

EJVM[[i0 ... ij−1 ij ij+1 ... ik]](s, A) = EJVM[[ij+1 ... ik]](s', A''), where
    A   = [a0 = this, a1 = v1, a2 = v2, ..., an = vn, an+1 = ⊥, ..., an+m = ⊥]
    (s', A') = EJVM[[i0 i1 ... ij−1]](s, A)
    A'  = [a0', ..., an+m']
    A'' = [a0 = this, a1 = v1', a2 = v2', ..., an = vn', an+1 = an+1', ..., an+m = an+m']

We call this "resumability with parameter changes". The above extends to loops (through multiple basic block code segments) and to code blocks with more than one suspend instruction. As we can see from the semantic function EJVM, the activation record must "survive" between invocations/suspend-resumptions: local variables are saved and restored, while parameters are not stored and change according to each invocation's actual parameters. Naturally, we must assure that the locations in the activation record holding the locals are restored before they are referenced again.

2. Target Bytecode Structure

All the extra code needed to save and restore state upon suspension and resumption can be generated by the ProcessJ code generator; only the code associated with resuming execution in the middle of a code block will require bytecode rewriting. Let us consider a very simple example with a single suspend statement (the following is a snippet of legal ProcessJ code):

type proc mobileFooType();

mobile proc void foo() implements mobileFooType {
    int a;
    a = 0;
    while (a == 0) {
        a = a + 1;
        suspend;
        a = a - 1;
    }
}
The resulting bytecode would look something like this:

public void foo();
  Code:
    0: iconst_0
    1: istore_1       ; a = 0;
    2: iload_1        ; while (a == 0) {
    3: ifne 20
    6: iload_1
    7: iconst_1
    8: iadd
    9: istore_1       ; a = a + 1;
   10: ???            ; suspend handled here
   13: iload_1
   14: iconst_1
   15: isub
   16: istore_1       ; a = a - 1;
   17: goto 2         ; }
   20: return
}
Since the suspend is handled in line 10 by inserting a return instruction, we need to store the local state before the return; upon resuming the execution, control must be transferred to line 13 rather than starting at line 0 again, and the state must be restored before executing line 13. This requires three new parts inserted into the bytecode:

1. Code to save the local state (in the above example the local variable a) before the suspend statement in line 10.
2. Code to restore the local state before resuming execution of the instructions after the previous suspend statement, that is, after line 1 and before line 13.
3. Code to transfer control to the right point of the code depending on which suspend was most recently executed (before line 0).

Thus the goal is to automate the generation of such code. Parts 1 and 2 can be done completely in Java by the ProcessJ code generator, and part 3 can be done by a combination of Java code and bytecode rewriting. Before turning to this, let us first mention a few restrictions that mobile processes have in ProcessJ: processes have no return type (the equivalent in Java is a void method), and mobile processes cannot be recursive. The semantics of a recursive mobile process are not yet clear, and we do not see any obvious need for recursion of mobiles at this time.

3. Source Code Rewriting

As mentioned, the ProcessJ code generator emits Java source code, which is then compiled using the Java compiler, and the resulting bytecode is subsequently rewritten. Let us describe the Java code emitted by the ProcessJ compiler first. To transform the foo method from the previous section into a resumable process, we encapsulate it in a Java object that contains two auxiliary fields as well as the process rewritten as a Java method and two dummy placeholder methods.

1. The method is encapsulated in a new class:

public class Foo {
    private Object[] actRec;
    private static void suspend() { }
    private static void resume() { }
    private int jumpTarget = 0;

    public void foo() {
        ... switch statement that jumps to resume point.
        int a;
        a = 0;
        while (a == 0) {
            a = a + 1;
            ... code to save the current state.
            suspend();
            resume();
            ... code to restore the previous state.
            a = a - 1;
        }
    }
}
where actRec represents a saved activation record. The suspend and resume methods are just dummy methods added to satisfy the compiler (more about these later). Finally, a field jumpTarget has been added; jumpTarget holds nonnegative values: 0 if the execution is to start from the beginning, and 1, 2, ... if the execution is to resume from somewhere within the code (i.e., not from the start).

2. The code for foo must also be rewritten to support resumability:

• Support must be added for saving and restoring the local variable part of the JVM activation record; this is done through the Object array actRec.
• A lookupswitch JVM instruction [9] must be added; based on the jumpTarget field it will jump to the instruction following the last suspend executed. A simple Java switch statement that switches on jumpTarget will translate to such a lookupswitch instruction.

3.1. Saving Local State

A Java activation record consists of two or three parts: local variables, parameters, and, for non-static methods, a reference to this stored at address 0 in the activation record. The layout is illustrated in Figure 1. Recall that we need the encapsulated method to be non-static. Since this never changes for an object, and since each resumption of the method provides a new set of parameters, all we have to save is the set of locals. As we rely on the JVM invocation instructions, each invocation of a method creates its own new JVM activation record that contains this, the provided parameters, and room for the locals. The first step in resuming the method is to restore the locals to the state they were in when the method was suspended. We use an array of Objects to store the m locals.
If the field jumpTarget has value 0, indicating that the method starts from the top (this is the initial invocation of the process), no restoration of locals is necessary, as execution starts from the beginning of the code (and the ProcessJ and Java compilers have assured that no path to a use of an uninitialized variable exists). On subsequent resumptions, the saved array of locals must be restored, and the value of the field jumpTarget determines from where execution should continue (immediately after the return instruction that suspended the previous activation of the method).
Figure 1. JVM Activation Records.
If, for example, a method has locals a, b, and c of integer type, we can save an Object array with their values in the following way, using the auto-boxing feature provided by the Java compiler:

actRec = new Object[] { a, b, c };
jumpTarget = ...;
and they can be restored in the following manner:

a = (Integer) actRec[0];
b = (Integer) actRec[1];
c = (Integer) actRec[2];
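The same pattern extends to locals of mixed types, as the following sketch of ours illustrates (the variable names are ours, not generated output): reference-typed locals are stored directly without boxing, and each restore casts back to the wrapper class of the declared primitive type, with auto-unboxing recovering the primitive value.

```java
public class SaveRestoreDemo {
    public static void main(String[] args) {
        int a = 3;
        double d = 2.5;
        String s = "hi";

        // save: primitives are auto-boxed, references stored directly
        Object[] actRec = new Object[] { a, d, s };

        // restore: cast back to the wrapper (or reference) type
        int a2 = (Integer) actRec[0];
        double d2 = (Double) actRec[1];
        String s2 = (String) actRec[2];

        System.out.println(a2 + " " + d2 + " " + s2); // prints "3 2.5 hi"
    }
}
```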
Both of these code blocks are generated by the ProcessJ code generator, the former before the suspend and the latter after.

3.2. Resuming Execution

When a method is resumed (by being invoked), the jumpTarget field determines where in the code execution should continue, namely immediately after the return that suspended the previous invocation. We cannot achieve this with Java code translated by the Java compiler; to do so we would need a goto instruction (as well as support for labels), and although goto is a reserved word in Java, it is not currently in use. To achieve this objective, we must turn to bytecode rewriting. We need to insert a lookupswitch instruction that switches on the jumpTarget field and jumps to the address of the instruction following the return that suspended the previous invocation. We can generate parts of the code with help from the Java compiler; we insert code like this at the very beginning of the generated Java code:

switch (jumpTarget) {
    case 1: break;
    default: break;
}
There will be as many cases as there are suspends in the ProcessJ code. We get bytecode like this:

 4: lookupswitch { 1: 24;
                   default: 27 }
24: goto 27
27: ...
In the rewriting of the bytecode, all we have to do is replace the absolute addresses (24 and 27) in the switch by the addresses of the resume points. The addresses of the resume points can be found by keeping track of the values assigned to the jumpTarget field in the generated Java code, or by inspecting the bytecode as explained below. Since we replaced the suspend keyword by calls to the dummy suspend() and resume() methods, we can look for the static invocation of resume():

52: aload_0
53: iconst_1
54: putfield     #5;  // Field jumpTarget:I
57: invokestatic #18; // Method suspend:()V
60: invokestatic #19; // Method resume:()V
63: ...
(here found in line 60), and the two instructions immediately before the suspend call reveal the jumpTarget value that the address (60) should be associated with. The instruction in line 53 will be one of the iconst_X (X = 1, 2, 3, 4, 5) instructions or a bipush instruction. For the above, the lookupswitch should be rewritten as:

 4: lookupswitch { 1: 60;
                   default: 27 }
24: goto 27
27: ...
Furthermore, lines 57 and 60 must be rewritten to be a return and a nop respectively (this cannot be done at the Java source level, as the Java compiler would complain about the code following the return being unreachable). Alternatively, the resume method can be removed, making the jump target the instruction following the suspend call.

4. Example

Let us rewrite the previous example to obtain this new Foo class:

public class Foo {
    private Object[] actRec;
    private static void suspend() { }
    private static void resume() { }
    private int jumpTarget = 0;

    public void foo() {
        int a;
        switch (jumpTarget) {            // Begin: jump
            case 1: break;
            default: break;
        }                                // End: jump
        a = 0;
        while (a == 0) {
            a = a + 1;
            actRec = new Object[] { a }; // Begin: save state
            jumpTarget = 1;              // End: save state
            suspend();
            resume();
            a = (Integer) actRec[0];     // restore state
            a = a - 1;
        }
        jumpTarget = 0;                  // Reset jumpTarget
    }
}
Note that jumpTarget should be set to 0 before each original return statement to assure that the next time the process is resumed, it will start from the beginning. This is very close to the code we really want, and best of all, it actually compiles. Note also that the line saving local state must include all locals in scope. If the rewriting were done solely in bytecode, this would require an analysis of the control flow graph (CFG) associated with the code – like the approach taken in the Southampton Portable Occam Compiler (SPOC) [12]. But since we generate the store code as part of the code generation from
the ProcessJ compiler, we have access to all scope information. It is further simplified by the fact that the scoping rules of ProcessJ follow those of Java (when removing fields and objects). Let us look at the generated bytecode. Because of the incomplete switch statement, every invocation of foo will always execute the a = 0 statement (i.e., start from the beginning):

public void foo();
  Code:
    0: aload_0
    1: getfield jumpTarget I                 // switch (jumpTarget) {
    4: lookupswitch { 1: 24;                 //   case 1: ...
                      default: 27 }          //   default: ...
   24: goto 27                               // }
   27: iconst_0
   28: istore_1                              // a = 0;
   29: iload_1                               // while (a == 0) {
   30: ifne 83
   33: iload_1
   34: iconst_1
   35: iadd
   36: istore_1                              // a = a + 1;
   37: aload_0
   38: iconst_1
   39: anewarray java/lang/Object
   42: dup
   43: iconst_0
   44: iload_1
   45: invokestatic java/lang/Integer.valueOf(I)Ljava/lang/Integer;
   48: aastore
   49: putfield actRec [Ljava/lang/Object;   // actRec = new Object[]{ a };
   52: aload_0
   53: iconst_1
   54: putfield jumpTarget I                 // jumpTarget = 1;
   57: invokestatic suspend()V               // suspend;
   60: invokestatic resume()V                // resume point
   63: aload_0
   64: getfield actRec [Ljava/lang/Object;
   67: iconst_0
   68: aaload
   69: checkcast java/lang/Integer
   72: invokevirtual java/lang/Integer.intValue()I
   75: istore_1                              // a = (Integer) actRec[0];
   76: iload_1
   77: iconst_1
   78: isub
   79: istore_1                              // a = a - 1;
   80: goto 29                               // }
   83: aload_0
   84: iconst_0
   85: putfield jumpTarget I                 // jumpTarget = 0;
   88: return
}
Lines 0–24 represent the switch statement, 38–54 the save state code, 57–60 the suspend/resume placeholder method calls, 63–75 the restore state code, and 83–88 the rewritten original return code.
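Although the jump into the middle of the loop can only be achieved by the bytecode rewriting described here, its observable effect can be approximated in plain Java for experimentation by restructuring the control flow. All class, method, and field names below are ours, not generated by the ProcessJ tool chain; the sketch also illustrates resumability with parameter change, where each invocation may receive a fresh parameter while the local survives via actRec.

```java
// Hand-written approximation (not generated code) of a resumable
// process: the local "count" survives between invocations via actRec,
// while the parameter "channel" may differ on every call.
public class ResumableSketch {
    private Object[] actRec;
    private int jumpTarget = 0;
    int lastChannel = -1;
    int lastCount = -1;

    public void step(int channel) {      // parameter: fresh per call
        int count;                        // local: restored from actRec
        if (jumpTarget == 1) {
            count = (Integer) actRec[0];  // restore saved state
        } else {
            count = 0;                    // initial invocation
        }
        count = count + 1;
        actRec = new Object[] { count };  // save state
        jumpTarget = 1;
        lastChannel = channel;            // observed for demonstration
        lastCount = count;
        // a real suspend would be a rewritten return instruction here
    }

    public static void main(String[] args) {
        ResumableSketch p = new ResumableSketch();
        p.step(7);   // first invocation, "channel" 7
        p.step(9);   // resumed with a different parameter
        System.out.println(p.lastCount + " " + p.lastChannel); // prints "2 9"
    }
}
```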
As pointed out above, this code is not correct; a number of things still need to be changed:

• Line 4 is the jump table, which must be filled with correct addresses. If the field jumpTarget equals 1, execution continues at the invocation of the dummy resume() method – line 60. The default label is already correct and can be left unchanged.
• Line 57, the dummy suspend() invocation, should be replaced by a return instruction (we could not simply place a Java return statement in the source code because the compiler would complain about the code following the return statement being unreachable).
• Line 60, the dummy resume() invocation, should be replaced by a nop. It only serves as a placeholder; theoretically we could have used address 63 in the lookupswitch.

An example of use in ProcessJ could be this:

proc void sender(chan<mobileFooType>.write ch) {
    // create mobile
    mobileFooType mobileFoo = new mobile foo;
    // invoke foo (1st invocation)
    mobileFoo();
    // send to different process
    ch.write(mobileFoo);
}

proc void receiver(chan<mobileFooType>.read ch) {
    mobileFooType mobileFoo;
    // receive mobileFooType process
    mobileFoo = ch.read();
    // invoke foo (2nd invocation)
    mobileFoo();
}

proc void main() {
    chan<mobileFooType> ch;
    par {
        sender(ch.write);
        receiver(ch.read);
    }
}
The resulting Java/JCSP code looks like this:

import org.jcsp.lang.*;

public class PJtest {

    public static void sender(ChannelOutput ch_write) {
        Foo mobileFoo = new Foo();
        mobileFoo.foo();
        ch_write.write(mobileFoo);
    }

    public static void receiver(ChannelInput ch_read) {
        Foo mobileFoo;
        mobileFoo = (Foo) ch_read.read();
        mobileFoo.foo();
    }

    public static void main(String args[]) {
        final One2OneChannel ch = Channel.one2one();
        new Parallel(new CSProcess[] {
            new CSProcess() {
                public void run() { sender(ch.out()); }
            },
            new CSProcess() {
                public void run() { receiver(ch.in()); }
            }
        }).run();
    }
}
One small change is still needed to support mobility across a network. Since the generated Java code is a class, it can be made serializable by having the generated classes implement the Serializable interface. An object of such a class can then be serialized and sent across a network; Welch et al. [18] provide such a mechanism in their jcsp.net package as well. Since the rewriting described encapsulates the mobile process in a new class, objects of that class can be sent as data across the network, and the mobile process inside such an object can be resumed by invoking the method that encapsulates it (mobileFoo.foo() above).
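As a sketch of this serialization step (class and field names are ours, and a real deployment would use jcsp.net channels rather than a byte array), a suspended process object can be round-tripped through standard Java serialization and resumed with its saved state intact:

```java
import java.io.*;

// Hypothetical mobile process (our own, not generated code): its saved
// state (actRec, jumpTarget) survives Java serialization.
class MobileCounter implements Serializable {
    private Object[] actRec;
    private int jumpTarget = 0;
    int observed = -1;

    public void run() {
        int a;
        if (jumpTarget == 1) {
            a = (Integer) actRec[0];     // restore saved local
        } else {
            a = 0;                        // initial invocation
        }
        a = a + 1;
        actRec = new Object[] { a };      // save local before "suspend"
        jumpTarget = 1;
        observed = a;
    }
}

public class SerializeDemo {
    public static void main(String[] args) throws Exception {
        MobileCounter m = new MobileCounter();
        m.run();   // first invocation: a becomes 1

        // "send across the network": serialize to bytes and back
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(bos);
        oos.writeObject(m);
        oos.close();
        MobileCounter received = (MobileCounter) new ObjectInputStream(
            new ByteArrayInputStream(bos.toByteArray())).readObject();

        received.run();   // resumes with restored state: a becomes 2
        System.out.println(received.observed); // prints 2
    }
}
```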
5. Related Work and Other Approaches

Approaches to process mobility can be categorized as either transparent or non-transparent, sometimes termed strong and weak migration (mobility), respectively [2,6]. With non-transparent mobility, the programmer must explicitly provide the logic to suspend and resume the mobile process whenever necessary. Existing systems such as jcsp.mobile [3,4] already provide this functionality. Transparent mobility significantly eases the task of the programmer, but requires support from the run-time system which does not exist within the Java Virtual Machine. Some early approaches to supporting resumable programs in Java involved modification of the JVM itself [2]. In our view, however, one of the most important advantages of targeting the JVM is portability across the large installed base of Java runtime environments; therefore any approach that extends the JVM directly is of limited utility. Some success has been demonstrated using automated transformation of Java source code [6]. Due to the language's lack of support for labeled gotos, this approach suffers from a proliferation of conditional guards and a corresponding increase in code size. Bytecode-only transformation methods targeting general thread resumability in Java are explored in [1] and [16]. These approaches require control flow analysis of the bytecode in order to generate code for the suspend point. Alternatively, the Kilim [15] actor-based framework uses a CPS bytecode transformation to support cooperatively scheduled lightweight threads (fibers) within Java. Another example of bytecode state capture can be found in the Java implementation of the object-oriented language Python (Jython [13]), where it is used to support generators, a limited form of coroutines [11]. This is perhaps the most similar to our implementation, even though generator functions are somewhat different in concept and application from ProcessJ procedures. We wish, however, to be able to utilize the existing Java compilers to produce optimized bytecode with our back-end. The process-oriented nature of ProcessJ allows us to adopt a simple hybrid approach that combines Java source and bytecode methods.

6. Conclusion

In this paper we have shown that a compiler for a process-oriented language can provide transparent mobility using the existing Java compiler tool chain with minimal modification. We developed a simple way to generate Java source code and rewrite Java bytecode to support resumability and ultimately process mobility for the ProcessJ language. We described the Java source code generated by the ProcessJ compiler, and also demonstrated how to rewrite the Java bytecode to save and restore local state between resumptions, as well as how to assure that execution continues with the same local state (but with possibly new parameter values) at the instruction following the previous suspension point.

7. Future Work

A number of interesting issues remain to be addressed. For ProcessJ, where we have channels, an interesting problem arises when assigning a parameter of channel-end type to a local variable. If a local variable holds a reference to a channel end, and the process is suspended and sent to a different machine, the end of the channel now lives on a different physical machine. This is not a simple problem to solve; for occam-π, the pony [14] system addresses this problem.
One way to approach this problem is to include a channel server, much like the one found in JCSP.net [18], that keeps track of where channel ends are located; this is the approach we are working with for the MPI/C code generator. Mobile channels can be handled in the same way, but are outside the scope of this paper. Other issues that need to be addressed include how resource management is to be handled: if a mobile process contains references to, for example, open files that are not available on the JVM to which the process is sent, accessing such a file becomes impossible. We may wish to enforce certain kinds of I/O restrictions on mobile processes in order to more clearly define their behavior under mobility. Finally, with a little effort, the saving and restoring of state could be gathered at the beginning and the end of the method, saving some code/instructions, but for clarity we used the approach presented in this paper.

8. Acknowledgments

This work was supported by the UNLV President's Research Award 2008/2009. We would like to thank the reviewers, who did a wonderful job in reviewing this paper. Their comments and suggestions have been valuable in producing a much stronger paper.
References

[1] Sara Bouchenak. Techniques for Implementing Efficient Java Thread Serialization. In ACS/IEEE International Conference on Computer Systems and Applications (AICCSA03), pages 14–18, 2003.
[2] Sara Bouchenak and Daniel Hagimont. Pickling Threads State in the Java System. In Third European Research Seminar on Advances in Distributed Systems, 2000.
[3] Kevin Chalmers and John Kerridge. jcsp.mobile: A Package Enabling Mobile Processes and Channels. In Jan Broenink, Herman Roebbers, Johan Sunter, Peter Welch and David Wood, editors, Communicating Process Architectures 2005, pages 109–127, 2005.
[4] Kevin Chalmers, John Kerridge, and Imed Romdhani. Mobility in JCSP: New Mobile Channel and Mobile Process Models. In Alistair McEwan, Steve Schneider, Wilson Ifill and Peter Welch, editors, Communicating Process Architectures 2007, pages 163–182, 2007.
[5] Jack Dongarra. MPI: A Message Passing Interface Standard. The International Journal of Supercomputers and High Performance Computing, 8:165–184, 1994.
[6] Stefan Fünfrocken. Transparent Migration of Java-based Mobile Agents – Capturing and Reestablishing the State of Java Programs. In Mobile Agents, pages 26–37. Springer Verlag, 1998.
[7] R. Hieb and R.K. Dybvig. Continuations and Concurrency. ACM Sigplan Notices, 25:128–136, 1990.
[8] C. A. R. Hoare. Communicating Sequential Processes. Communications of the ACM, 21(8):666–677, August 1978.
[9] Tim Lindholm and Frank Yellin. The Java Virtual Machine Specification, 2nd Edition. Prentice Hall PTR, 1999.
[10] Robin Milner. Communicating and Mobile Systems: the π-Calculus. Cambridge University Press, 1999.
[11] Ana Lúcia De Moura and Roberto Ierusalimschy. Revisiting Coroutines. ACM Transactions on Programming Languages and Systems, 31:1–31, 2009.
[12] D.A. Nicole, M. Debbage, M. Hill, and S. Wykes. Southampton's Portable Occam Compiler (SPOC). In A.G. Chalmers and R. Miles, editors, Proceedings of WoTUG 17: Progress in Transputer and Occam Research, volume 38 of Concurrent Systems Engineering, pages 40–55, Amsterdam, The Netherlands, April 1994. IOS Press. ISBN: 90-5199-163-0.
[13] Samuele Pedroni and Noel Rappin. Jython Essentials. O'Reilly Media, Inc., 2002.
[14] Mario Schweigler and Adam T. Sampson. pony – The occam-π Network Environment. In Peter Welch, Jon Kerridge, and Fred Barnes, editors, Communicating Process Architectures 2006, volume 64 of Concurrent Systems Engineering Series, pages 77–108, Amsterdam, The Netherlands, September 2006. IOS Press.
[15] S. Srinivasan and A. Mycroft. Kilim: Isolation-Typed Actors for Java. In Proceedings of the European Conference on Object-Oriented Programming (ECOOP), pages 104–128. Springer, 2008.
[16] Eddy Truyen, Bert Robben, Bart Vanhaute, Tim Coninx, Wouter Joosen, and Pierre Verbaeten. Portable Support for Transparent Thread Migration in Java. In ASA/MA, pages 29–43. Springer Verlag, 2000.
[17] Peter H. Welch. Process Oriented Design for Java: Concurrency for All. In Hamid R. Arabnia, editor, Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications, volume 1, pages 51–57, Las Vegas, Nevada, USA, June 2000. CSREA Press. ISBN: 1-892512-52-1.
[18] Peter H. Welch, Jo R. Aldous, and Jon Foster. CSP Networking for Java (JCSP.net). Lecture Notes in Computer Science, 2330:695–708, 2002.
[19] Peter H. Welch and Paul D. Austin. Communicating Sequential Processes for Java (JCSP) Home Page. Systems Research Group, University of Kent, http://www.cs.kent.ac.uk/projects/ofa/jcsp.
[20] Peter H. Welch and Frederick R.M. Barnes. Communicating Mobile Processes: Introducing occam-π. In Ali E. Abdallah, Cliff B. Jones, and Jeff W. Sanders, editors, 25 Years of CSP, volume 3525 of Lecture Notes in Computer Science, pages 175–210. Springer Verlag, April 2005.
[21] Peter H. Welch, Jim Moores, Frederick R. M. Barnes, and David C. Wood. The KRoC Home Page. http://www.cs.kent.ac.uk/projects/ofa/kroc/.
[22] Peter H. Welch and Jan B. Pedersen. Santa Claus – with Mobile Reindeer and Elves. In Proceedings of Communicating Process Architectures, 2008.
[23] Peter H. Welch and David C. Wood. The Kent Retargetable occam Compiler. In Brian O'Neill, editor, Parallel Processing Developments, volume 47 of Concurrent Systems Engineering, pages 143–166, Amsterdam, The Netherlands, March 1996. World occam and Transputer User Group, IOS Press.
Communicating Process Architectures 2009
P.H. Welch et al. (Eds.)
IOS Press, 2009
© 2009 The authors and IOS Press. All rights reserved.
doi:10.3233/978-1-60750-065-0-173
173
OpenComRTOS: A Runtime Environment for Interacting Entities

Bernhard H.C. SPUTH, Oliver FAUST, Eric VERHULST and Vitaliy MEZHUYEV
Altreonic; Gemeentestraat 61A bus 1; 3210 Linden, Belgium
{bernhard.sputh, oliver.faust, eric.verhulst, vitaliy.mezhuyev}@altreonic.com

Abstract. OpenComRTOS is one of the few Real-Time Operating Systems for embedded systems that was developed using formal modelling techniques. The goal was to obtain a proven dependable component with a clean architecture that delivers high performance on a wide variety of networked embedded systems, ranging from a single processor to distributed systems. The result is a scalable, reliable communication system with real-time capabilities. Moreover, a rigorous formal verification of the kernel algorithms led to an architecture with several properties that enhance the safety and real-time properties of the RTOS. The code size in particular is very small, typically 10 times smaller than that of a typical equivalent single-processor RTOS. The small code size allows a much better use of the on-chip memory resources, which increases the speed of execution due to the reduction of wait states caused by the use of external memory. To date we have ported OpenComRTOS to the MicroBlaze processor from Xilinx, the Leon3 from ESA, the ARM Cortex-M3, the Melexis MLX16, and the XMOS. In this paper we concentrate on the MicroBlaze port, an environment where OpenComRTOS competes with a number of different operating systems, including the standard operating system, the Xilinx Micro Kernel. This paper reports code size figures of OpenComRTOS on a MicroBlaze target. We found that this code size is considerably smaller than the published code sizes of other operating systems.

Keywords. OpenComRTOS, embedded systems, system engineering, RTOS
Introduction

Real-Time Operating Systems (RTOSs) are a key software module for embedded systems, often requiring properties of high reliability and safety. Unfortunately, most commercial as well as open source implementations cannot be verified or even certified, e.g. according to the DO-178B [1] or IEC 61508 [2] standards. Similarly, software engineering is often done in a non-systematic way, although well defined and established Systems Engineering Processes exist [3,4]. The software is rarely proven to be correct, even though this is possible with formal model checkers [5]. In the context of a unified systems engineering approach [6], we undertook a research project in which we followed a stricter methodology, including formal model checking, to obtain a network-centric RTOS which can be used as a trusted component. The history of this project goes back to the early 1990's, when a distributed real-time RTOS called Virtuoso (Eonic Systems) [7] was developed for the INMOS transputer [8]. This processor had built-in support for concurrency as well as interprocess communication, and was enabled for parallel processing by way of four communication links. Virtuoso allowed such a network of processors to be programmed in a topology-transparent way. Later, the software evolved and was ported from single chip micro-controllers to systems with over
B.H.C. Sputh et al. / OpenComRTOS
a thousand Digital Signal Processors, until the technology was acquired by Wind River, which removed it from the market after a few years. The OpenComRTOS project was motivated by the lessons learned from developing three Virtuoso generations. These lessons became part of the requirements. We list the most important ones:

• Scalability: The RTOS should support very small single-processor systems, as well as widely distributed processing systems interconnected through external networks like the internet. To achieve that, the software components must be independent of the execution environment. In other words, it must be possible to map the software components onto the network topology.
• Heterogeneity: The RTOS should support systems which consist of multiple nodes with different CPU architectures. Naturally, different link technologies should be usable as well, ranging from low speed links such as RS232 up to high speed Ethernet links.
• Efficiency: The essence of multi-processor systems is communication. The challenge, from an RTOS point of view, is keeping the latency to a minimum while at the same time maximizing the performance. This is achieved when most of the critical code resides in the limited amount of on-chip memory.
• Small code size: This has a double benefit: a) performance and b) less complexity. Less complex systems have fewer potential sources of errors and side-effects.
• Dependability: As testing of distributed systems becomes very time consuming, it is mandatory that the system software can be trusted from the start. As errors typically occur in “corner cases”, the use of formal methods was deemed necessary.
• Maintainability and ease of development: The code needs to be clear and simple to facilitate the development of e.g. drivers; drivers have often been the weak point in system software.

OpenComRTOS provides a runtime environment which supports these requirements.
The remainder of this paper focuses on this runtime environment and its execution on a MicroBlaze target. Before we discuss OpenComRTOS in greater detail, we deduce two general points from the list of requirements. The scalability requirement imposes that data communication is central in the RTOS architecture. The trustworthiness and maintainability aspects are addressed in the context of a Systems Engineering methodology. The use of common semantics during all activities is crucial, because only common semantics enable us to generate most of the implementation code from the modelling and simulation phase. Generated code is more trustworthy than handwritten code. Using an “Interacting Entities” paradigm requires a runtime environment that natively supports concurrency and synchronization/communication between concurrent entities. OpenComRTOS is this runtime environment.

1. OpenComRTOS Architecture

Even with the problems mentioned above, Virtuoso was a successful product, and the goal was to improve on its weaknesses. Its architecture delivered high performance, but was very hard to port and to maintain. Hence, for OpenComRTOS we adopted a layered architecture based on semantic layering. The lowest functionality level is limited to priority-based preemptive multitasking. On this level Tasks exchange standardized Packets using an intermediate entity we call a Port. Two Tasks rendezvous by one Task sending a ‘put’ request and the other Task sending a ‘get’ request to the Port. The Port behaves similarly to the JCSP Any2AnyChannel [9]. Hence, Tasks can synchronise and communicate using Packets and Ports. The Packets are the essential workhorse of the system. They have header and data fields and are exclusively used for all services, rather than performing function calls or using jump tables. Hence, it becomes straightforward to provide services that operate in a transparent way across processor boundaries. In fact, for a Task it makes no difference in semantics whether a Port-Hub exists locally or on another node; the kernel takes care of routing the Packet to the node which holds the Port-Hub. Furthermore, Packets are very efficient, because kernel operations often come down to shuffling Packets around (using handlers) between system-level data structures. At the next semantic level we added more traditional RTOS services like events, semaphores, etc. (see Table 2 for the included RTOS services). Finally, the architecture was kept simple and modular by developing kernel and drivers as Tasks. All these Tasks have a ‘Task input Port’ for accepting Packets from other Tasks. This has some unusual consequences, such as: a) the possibility to process interrupts received on one processor on another processor, b) the kernel having a lower priority than the drivers, or even c) having multiple kernel Tasks on a single node.

1.1. Systems Engineering Approach

Figure 1. Open License Society: the unified view.

The Systems Engineering approach from the Open License Society [6], outlined in Figure 1, is a classical one as defined in [3,4], but adapted to the needs of embedded software development. It is first of all an evolutionary process using continuous iterations. In such a process, much attention is paid to an incremental development requiring regular review meetings by several of the stakeholders. On an architectural level, the system or product under development is defined under the paradigm of “Interacting Entities”, which maps very well onto an RTOS-based runtime system. Applied to the development of OpenComRTOS, the process started by elaborating an initial set of requirements and specifications. Next, an initial architecture was defined. From this point onwards, two groups started to work in parallel.
The first group worked out an architectural model, while the second group developed initial formal models using TLA+/TLC [10]. These models were incrementally refined. Note that no real attempt was made to model the complete system at once. This is not possible in a generic way, because formal TLA models cannot be parametrised: one must model a specific set of tasks and services, which very quickly leads to a state-space explosion that limits the achievable complexity of such models. Hence, we modelled only specific parts, e.g. a model was built for each class of services (Ports, Events, Semaphores, etc.). This was sufficient and has the benefit of yielding very clean, orthogonal models. Due to
the orthogonality of the models there is no need to model the complete system, which has the big advantage that the models can be developed by different teams. At each review meeting between the software engineers and the formal modelling engineer, more details were added to the models, the models were checked for correctness, and a new iteration was started. This process stopped when the formal models were deemed close enough to the implementation architecture. Next, a simulation model was developed on a PC (using Windows NT as a virtual target). This code was then ported to a real 16-bit microcontroller, the MLX16 from Melexis, who at the time were sponsoring the development of OpenComRTOS. The MLX16 is a proprietary microcontroller used by Melexis to develop application-specific ICs; it has up to 2 KiB RAM and 32 KiB Flash. On this target a few specific optimizations were performed during the implementation, while fully maintaining the design and architecture. The software was written in ANSI C and verified for safe coding practices with a MISRA rule checker [11].

1.2. Lessons Learnt from Using Formal Modelling

The goal of using formal techniques is the ability to prove that the software is correct. This is an often-heard statement from the formal techniques community. A first surprise was that each model gave no errors when verified by the TLC model checker. This is due to the iterative nature of the model development process, and is partly its strength: starting from an initially rather abstract model, successive models are developed by checking them with the model checker, and hence each model is correct when the model checker finds no illegal states. As such, model checkers cannot prove that the software is correct; they can only prove that the formal model is correct. For a complete proof of the software, the whole programming chain as well as the target hardware would have to be modelled and verified.
This is an unachievable goal due to its complexity and the resulting state-space explosion; nevertheless, it was attempted in the Verisoft project [12]. Such a model would be many times larger than the developed software itself. This indicates that if verified target processors and verified programming language compilers were available, model checking would become practical, because it could be limited to modelling the application. Other issues related to formal modelling were also discovered. A first issue is that the TLC model checker treats every action as a critical section, whereas e.g. in the case of an RTOS many components operate concurrently, and real-time performance dictates that on a real target the critical sections are kept as short as possible. This forced us to avoid shared data structures. However, it would be helpful to have formal model assistance that indicates the required critical sections.

1.3. Benefits Obtained from Using Formal Modelling

As outlined above, the use of formal modelling was found to result in a much better architecture. This benefit results from the successive iteration and review of the models. Another reason for the better architecture is that formal model checkers provide a higher level of abstraction than the implementation. In the project we found that the semantics associated with specific programming terms involuntarily influence choices made by the architecture engineer. An example was the use of both waiting lists and Port buffers, which is one of the main concepts of OpenComRTOS. A waiting list is associated with just one waiting action, but one easily overlooks the fact that it also provides buffering behaviour. Hence, one waiting list is sufficient, resulting in a smaller and cleaner architecture. Formal modelling and abstraction have helped to introduce, define and maintain orthogonal architectural concepts. Orthogonality is key to small and safe, i.e. reliable, designs.
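For illustration, a counting semaphore in the spirit of the per-service models described above might be specified in TLA+ as follows. This is only a sketch written for this discussion, not one of the actual project models, which also had to cover waiting lists, Packets and Task states:

```tla
---- MODULE CountingSemaphore ----
EXTENDS Naturals
VARIABLE count

Init == count = 0

\* Signalling is always enabled.
Signal == count' = count + 1

\* Testing is only enabled when a signal is pending.
Test == /\ count > 0
        /\ count' = count - 1

Next == Signal \/ Test

Spec == Init /\ [][Next]_count
====
```

TLC can exhaustively check such a specification against invariants (e.g. count is never negative), which is the style in which the orthogonal per-service models mentioned above were verified.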
Similarly, even though there was a short learning curve to master the mathematical notation of TLA, with hindsight this was an advantage over e.g. SPIN [13], which uses a C-like syn-
Figure 2. OpenComRTOS-L0 view.
tax. The latter leads automatically to thinking in terms of implementation code with all its details, whereas the abstraction of TLA helps to think in more abstract terms. This also highlights the importance of specifying first, before implementation is started. A final observation is that using formal modelling techniques turned out to be a much more creative process than the mathematical framework suggests. TLA/TLC as such was primarily used as an architectural design tool, aiding the team in formulating ideas and testing them in a rather abstract way. This proved to be teamwork with lots of human interaction between the team members. The formal verification of the RTOS itself was basically a side-effect of building and running the models. Hence, this project has shown how a combination of teamwork with extensive peer review, formal modelling support and a well defined goal can result in a “correct-by-design” product.

1.4. Novelties in the Architecture

OpenComRTOS has a semantically layered architecture. Table 1 provides an overview of the available services at the different levels. At the lowest level, the minimum set of Entities provides everything that is needed to build a small networked real-time application. The Entities needed are Tasks (having a private function and workspace) and Interacting Entities, called Ports, used to synchronize and communicate between the Tasks (see Figure 2). Ports act like channels in the tradition of Hoare’s CSP [14], but they allow multiple waiters and asynchronous communication. One of the Tasks is a Kernel Task which schedules the other Tasks in order of priority and manages Port-based services. Driver Tasks handle inter-node communication. Pre-allocated as well as dynamically allocated Packets are used as carriers for all activities in the RTOS, such as service requests to the kernel, Port synchronization, data communication, etc. Each Packet has a fixed-size header and a data payload with a user-defined but global data size.
This significantly simplifies the Packet management, particularly at the communication layer. A router function transparently forwards Packets in order of priority between the network nodes. The priority of a Packet is the same as the priority of the Task from which it originates. At the next semantic level, services and Entities were added similar to those found in most RTOSs: Boolean events, counting semaphores, FIFO queues, resources,
memory pools, etc. The formal modelling led to the definition of all these Entities as semantic variants of a common, generic entity type. We called this generic entity a “Hub”. In addition, the formal modelling also helped to define “clean” semantics for such services, whereas ad-hoc implementations often have side-effects. Table 2 summarises the semantics.

Table 1. Overview of the available Entities on the different Layers.

Layer  Available Entities
L0     Task, Port
L1     Task, Hub based implementations of: Port, Boolean Event, Counting Semaphore, FIFO Queue, Resource, Memory Pool
L2     Mobile Entities: all L1 entities moveable between Nodes
Table 2. Semantics of L1 Entities.

L1 Entity           Semantics
Event               Synchronisation on a Boolean value.
Counting Semaphore  Synchronisation with counter allowing asynchronous signalling.
Port                Synchronisation with exchange of a Packet.
FIFO queue          Buffered communication of Packets. Synchronisation when queue is full or empty.
Resource            Event used to create a logical critical section. Resources have an owner Task when locked.
Memory Pool         Linked list of memory blocks protected with a resource.
Table 3. Service synchronisation variants.

“Single-phase” services:
  NW  Non Waiting: when the matching filter fails, the Task returns with a RC Failed.
  W   Waiting: when the matching filter fails, the Task waits until such an event happens.
  WT  Waiting with a time-out: waiting is limited in time, defined by the time-out value.

“Two-phase” services:
  Async  Asynchronous: when the entity is compatible with it, the Task continues independently of success or failure and will resynchronize later on. This class of services is called “two-phase” services.
The services are offered in a non-blocking variant (NW), a blocking variant (W), a blocking-with-time-out variant (WT), and an asynchronous variant (Async) for services where this is applicable (currently in development). All services are topology transparent and there is no restriction in the mapping of Tasks and kernel Entities onto the network. See Tables 2 and 3 for details on the semantics. Using a single generic entity leads to more code reuse; as a result the code size is at least 10 times smaller than for an RTOS with a more traditional architecture. One could of course remove all such application-oriented services and just use Hub-based services. Unfortunately, this has the drawback that services lose their specific semantic richness; e.g. resource locking clearly expresses that the Task enters a critical section in competition with other Tasks. Also, erroneous runtime conditions, like raising an event twice (with loss of the previous event), are easier to detect at the application level than when only a generic Hub is used. During the formal modelling process we also discovered weaknesses in the traditional way priority inheritance is implemented in most RTOSs. Fortunately, we found a way to re-
duce the total blocking time. In single-processor RTOS systems this is less of an issue, but in multi-processor systems all nodes can originate service requests and resource locking is a distributed service. Hence, the waiting lists can grow longer and lower-priority Tasks can block higher-priority ones while waiting for the resource. This was solved by postponing the resource assignment until the rescheduling moment. Finally, by generalization, memory allocation has also been approached like a resource-locking service. In combination with the Packet Pool, this opens new possibilities for safe and secure memory management; e.g. the OpenComRTOS architecture is free from buffer overflow by design. For the third semantic layer (L2), we plan to add dynamic support, such as mobility of code and of kernel Entities. A potential candidate is a light-weight virtual machine supporting capabilities as modelled in pi-calculus [15]. This is the subject of further investigation and will be reported in subsequent papers.

1.5. Inherent Safety Support

By its architecture, the L1 semantic layer is statically linked, hence an application-specific image is generated by the tool-chain during the compilation process. As we do not consider security risks for the moment, our concern is limited to verifying whether or not the code is inherently safe. A first level of safety is provided by the formal modelling approach. Each service was intensively modelled and verified, with most “corner cases” detected at design time, prior to writing code. A second level is provided by the kernel services themselves. All services have well defined semantics. Even when used asynchronously, the services become synchronous when resources become depleted. At such moments a Task is forced to wait, which allows other Tasks to proceed and free up resources (like Packets, buffer space, etc.). Hence, the system becomes “self-throttling”.
A third level is provided by the data structures, mostly based on Packets. All single-phase services use statically allocated Packets which are part of the Task context. These Packets are used for service requests, even when going across processor boundaries, and they also carry return values. For two-phase services, Packets must be allocated from a Packet Pool. When the Pool is empty, the system starts to throttle until Packets are released. Another specific architectural feature is that the system can never run out of space to store requests, because there is a strict limit on how many requests there can be in the system (the number of Packets). All queues are represented by linked lists, and each Packet contains the necessary header information; therefore no buffers are required to handle requests, and these cannot overflow. In the worst case, the application programmer has defined insufficient Packets in the Pool, and the buffers will stop growing when all Packets are in use. A last level is the programming environment. All Entities are defined statically, so they are generated together with all other system-level data structures by a tool; hence no Entities can be created at runtime. Of course, dynamic support at L2 will require extra support. However, this can only be achieved reliably with hardware support, e.g. to provide protected memory spaces. The same applies to the stack spaces. In OpenComRTOS, interrupts are handled on a private, separate stack, so that the Tasks’ stack spaces are not affected. On the MLX16 such a space can be protected, but it is clear that such an inexpensive mechanism should become the norm for all embedded processors. A full MMU is not only too complex and too large, it is simply not necessary. The kernel has various threshold detectors and provides support for profiling, but the details are outside the scope of this paper.

2. OpenComRTOS on Embedded Targets

Porting OpenComRTOS to the MicroBlaze soft processor was the first major work done by Altreonic. One reason for choosing the MicroBlaze CPU as a first target was the prior expe-
Figure 3. Hardware setup of the test system.
rience of the new team with the MicroBlaze environment [16,17]. This section compares the MicroBlaze port with the port of OpenComRTOS to the MLX16; it also gives performance and code size figures for the other available ports of OpenComRTOS. The MicroBlaze soft processor is realised in a Field Programmable Gate Array (FPGA). FPGAs are emerging as an interesting design alternative for system prototyping and implementation for critical applications when the production volume is low [18]. We realised the target architecture with the Xilinx Embedded Development Kit 9.2 and synthesized it with Xilinx ISE version 9.2 on an ML403 board with a Virtex-4 XC4VFX12 FPGA clocked at 100 MHz. Our architecture, shown in Figure 3, is composed of one MicroBlaze processor connected to a Processor Local Bus (PLB). The PLB enables access to the TIMER and GPIO peripherals. The TIMER is used to measure the time it takes to execute context switches. The GPIO was used for basic debugging. The processor uses local memory to store the code and data of the Tasks it runs. This memory is implemented through Block RAMs (BRAMs). The MicroBlaze Debug Module (MDM) enables remote debugging of the MicroBlaze processor.

2.1. Code Size Figures

This section reports the code size figures of OpenComRTOS on the MicroBlaze target. To put these figures into perspective we did two things. First, the OpenComRTOS code size figures on the MicroBlaze target are compared with those on the other targets. So far we have ported OpenComRTOS to the MicroBlaze processor from Xilinx [19], the Leon3 as used by ESA [20], the ARM Cortex-M3 [21], and the XMOS XS1-G4 [22]. The second comparison concerns the code size figures for a simple semaphore example, implemented using a) the Xilinx Micro-Kernel (XMK) and b) OpenComRTOS. The latter comparison is more important, because we can show that the OpenComRTOS version needs only 75% of the code size (.text segment) of the XMK version to achieve the same functionality.
Table 4 reports the code size figures for the individual L1 services for all the targets we support. The total code size, ‘Total L1 Services’, is the sum of the individual code sizes. The entry ‘L1 Hub shared’ represents the code necessary to achieve the functionality of the Hub, upon which all other L1 services depend. This explains why adding the Port functionality requires only 4–8 bytes of additional code. In general the code size figures are lower for the MLX16, ARM Cortex-M3 and XMOS due to their 16-bit instruction sets; both the MicroBlaze and the Leon3, in contrast, use a 32-bit instruction set. Even among the targets with 16-bit instruction sets we can see vast differences in code size. One reason for this is the number of registers these targets have. The MLX16 has only four registers which need to be saved during a context switch, whereas the XMOS port has to save 13 registers. This also has an impact on the performance figures, which are shown in Table 6.

2.1.1. Comparing OpenComRTOS Against the Xilinx Micro-Kernel

The Xilinx Micro-Kernel (XMK) is an embedded operating system from Xilinx for its MicroBlaze and PPC405 cores. In this section we compare the size of a comparable application example between OpenComRTOS and XMK for the MicroBlaze target. A complete comparison of code size figures between XMK and OpenComRTOS is not possible, because these
Table 4. OpenComRTOS L1 code size figures (in Bytes) obtained for our different ports.

Service             MLX16  MicroBlaze  Leon3  ARM   XMOS
L1 Hub shared         400        4756   4904  2192  4854
L1 Port                 4           8      8     4     4
L1 Event               70          88     72    36    54
L1 Semaphore           54          92     96    40    64
L1 Resource           104          96     76    40    50
L1 FIFO               232         356    332   140   222
L1 PacketPool          NA         296    268   120   166
Total L1 Services    1048        5692   5756  2572  5414
Figure 4. Semaphore loop example project.
operating systems offer different services. However, to give an indication of the code size efficiency, we implemented a simple application based on two services both OSs offer. Figure 4 shows two tasks (T1, T2) which exchange messages and synchronise on two semaphores (S1, S2). In both cases 1 KiB stacks were defined, which is the default size for XMK; in the case of the OpenComRTOS implementation, 512 bytes would have been more than enough. Table 5 shows that the complete OpenComRTOS program requires about 15% less memory than the XMK version. This is an important result, because with OpenComRTOS there is more RAM available for user applications. This is particularly important when, either for speed reasons or for PCB size constraints, the complete application has to run in internal (BRAM) memory.

Table 5. XMK vs. OpenComRTOS code size in bytes.

OS            .text  .data  .bss  total
XMK           12496    348  7304  20148
OpenComRTOS    9400   1092  6624  17116
2.2. Performance Figures

The performance figures were evaluated by measuring the loop time. We define the loop time as the time a particular target takes to complete one loop of the semaphore loop example. The resulting measurements allow us to compare the performance of OpenComRTOS on different target platforms. OpenComRTOS abstracts the hardware from the application programmer; therefore the application source code executed by the individual targets stays the same. To show how compact OpenComRTOS application code is, Listings 1 and 2 show the source code of the semaphore loop example used to measure the loop time figures. Listing 1 shows the code of the function T1, which represents task T1. The arguments of the function call are not used. Line 2 defines three variables of type 32-bit unsigned integer. All the work is done within the infinite loop starting at Line 3. In Line 4 the number of elapsed processor cycles is stored in the start variable. The code block from Line 5 to 8 signals semaphore 1 (S1) 1000 times and tests semaphore 2 (S2) also 1000 times; for the semantics of L1_SignalSemaphore and L1_TestSemaphore see Table 2. In Line 9 the elapsed processor cycles are stored in the stop variable.
 1  void T1(L1_TaskArguments Arguments) {
 2      L1_UINT32 i = 0, start = 0, stop = 0;
 3      while (1) {
 4          start = L1_getElapsedCycles();
 5          for (i = 0; i < 1000; i++) {
 6              L1_SignalSemaphore_W(S1);
 7              L1_TestSemaphore_W(S2);
 8          }
 9          stop = L1_getElapsedCycles();
10      }
11  }

Listing 1. Source code for task T1.
1  void T2(L1_TaskArguments Arguments) {
2      while (1) {
3          L1_TestSemaphore_W(S1);
4          L1_SignalSemaphore_W(S2);
5      }
6  }

Listing 2. Source code for task T2.
For completeness, Listing 2 shows the source code of the function T2, which represents task T2. As with T1, the task is represented by a function whose parameters are not used. Within the while-loop, from Line 2 to 5, semaphore S1 is tested before semaphore S2 is signalled. Both calls are blocking, as indicated by the postfix ‘_W’, see Table 3. After having obtained the start and stop values for all targets, we use the following equation to calculate the loop time:

    Loop time = (stop − start) / (Clock speed × 1000)    (1)
This equation does not take into account the overhead of reading the elapsed clock cycles or of the loop implementation; this overhead is negligible compared with the processing time for signalling and testing the semaphores. Table 6 reports the measured loop times for the different targets. Each run of the loop requires eight context switches, because the Semaphores are accessed in the kernel context: any access to a Semaphore requires a switch into the kernel context and afterwards a switch back to the requesting task.

Table 6. OpenComRTOS loop times obtained for our different ports.

                 MLX16      MicroBlaze  Leon3       ARM         XMOS
Clock speed      6 MHz      100 MHz     40 MHz      50 MHz      100 MHz
Context size     4 × 16bit  32 × 32bit  32 × 32bit  16 × 32bit  14 × 32bit
Memory location  internal   internal    external    internal    internal
Loop time        100.8 μs   33.6 μs     136.1 μs    52.7 μs     26.8 μs
The loop times expose the differences between the individual architectures. What sticks out is the performance of the MLX16¹, which despite its low clock speed of only 6 MHz is faster than the Leon3 running at more than six times that clock frequency. One of the main reasons is that the MLX16 only has to save and restore 4 16-bit registers during a context switch, compared to 32 32-bit registers in the case of the Leon3. Furthermore, the Leon3 uses only external memory, whereas all other targets use internal memory.

3. Conclusions

The OpenComRTOS project has shown that formal modelling works very well, even for software domains which are often associated with ‘black art’ programming. The resulting software is not only very robust and maintainable but also respectably compact and fast, and it is inherently safer than standard implementation architectures. Its use, however, must be integrated with a global systems engineering approach, because the process of incremental development and modelling is as important as using the formal model checker itself. The use of formal modelling has resulted in many improvements of the RTOS properties. The previous section analysed two distinct RTOS properties, namely code size and speed. With a code size as low as 1 KiB, a stripped-down version of OpenComRTOS fits in the memory of most embedded targets. When more memory is available, the full kernel fits in less than 10 KiB on many targets. Compared with the Xilinx Micro-Kernel, OpenComRTOS has about 75% of the code size. The loop time measurements brought out the differences between the individual target architectures. In general, however, the measured loop times confirm that OpenComRTOS performs well on a wide variety of targets.

Acknowledgments

The OpenComRTOS project is partly funded under an IWT project for the Flemish Government in Belgium. The formal modelling activities were provided by the University of Gent.

References

[1] RTCA.
DO-178B Software Considerations in Airborne Systems and Equipment Certification, January 1992.
[2] ISO/IEC. TR 61508 Functional Safety of electrical / electronic / programmable electronic safety-related systems, January 2005.
[3] The International Council on Systems Engineering (INCOSE) aims to advance the state of the art and practice of systems engineering. www.incose.org.
[4] Andreas Gerstlauer, Haobo Yu, and Daniel D. Gajski. RTOS Modeling for System Level Design. In DATE03, page 10130, Washington, DC, USA, 2003. IEEE Computer Society.
[5] Formal Systems (Europe) Ltd. Failures-Divergence Refinement: FDR Manual. http://www.fsel.com/fdr2_manual.html.
[6] The Open License Society researches and develops a systematic systems engineering methodology based on interacting entities and trustworthy components. www.openlicensesociety.org.
[7] Eonic Systems. Virtuoso: The Virtual Single Processor Programming System, User Manual. Available at: http://www.classiccmp.org/transputer/microkernels.htm.
[8] M. D. May, P. W. Thompson, and P. H. Welch, editors. Networks, Routers and Transputers: Function, Performance and Applications. IOS Press, Amsterdam, Netherlands, 1993.
[9] P. H. Welch. Process Oriented Design for Java: Concurrency for All. In H. R. Arabnia, editor, Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA 2000), volume 1, pages 51–57. CSREA Press, June 2000.
¹ Stripped down version of OpenComRTOS.
[10] Leslie Lamport. Specifying Systems: The TLA+ Language and Tools for Hardware and Software Engineers. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2002.
[11] The MISRA Guidelines provide important advice to the automotive industry for the creation and application of safe, reliable software within vehicles. http://www.misra.org.
[12] Eyad Alkassar, Mark A. Hillebrand, Dirk Leinenbach, Norbert W. Schirmer, and Artem Starostin. The Verisoft Approach to Systems Verification. In Jim Woodcock and Natarajan Shankar, editors, VSTTE 2008, Lecture Notes in Computer Science, Toronto, Canada, October 2008. Springer.
[13] Gerard J. Holzmann. The SPIN Model Checker: Primer and Reference Manual. Addison-Wesley Professional, September 2003.
[14] C. A. R. Hoare. Communicating sequential processes. Commun. ACM, 21(8):666–677, 1978.
[15] Robin Milner. Communicating and Mobile Systems: the Pi-Calculus. Cambridge University Press, June 1999.
[16] Bernhard Sputh, Oliver Faust, and Alastair R. Allen. Portable CSP Based Design for Embedded Multi-Core Systems. In Communicating Process Architectures 2006, September 2006.
[17] Bernhard Sputh, Oliver Faust, and Alastair R. Allen. A Versatile Hardware-Software Platform for In-Situ Monitoring Systems. In Alistair A. McEwan, Wilson Ifill, and Peter H. Welch, editors, Communicating Process Architectures 2007, pages 299–312, July 2007.
[18] Antonino Tumeo, Marco Branca, Lorenzo Camerini, Marco Ceriani, Matteo Monchiero, Gianluca Palermo, Fabrizio Ferrandi, and Donatella Sciuto. A Dual-Priority Real-Time Multiprocessor System on FPGA for Automotive Applications. In DATE08, pages 1039–1044. IEEE, 2008.
[19] Xilinx. MicroBlaze Processor Reference Guide. http://www.xilinx.com.
[20] Gaisler Research AB. SPARC V8 32-bit Processor LEON3 / LEON3-FT Companion Core Data Sheet. http://www.gaisler.com/cms/.
[21] ARM. An Introduction to the ARM Cortex-M3 Processor. http://www.arm.com/.
[22] XMOS. XS1-G4 Datasheet 512BGA. http://www.xmos.com/.
Communicating Process Architectures 2009 P.H. Welch et al. (Eds.) IOS Press, 2009 © 2009 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-60750-065-0-185
Economics of Cloud Computing: a Statistical Genetics Case Study

Jeremy M. R. MARTIN a, Steven J. BARRETT a, Simon J. THORNBER a, Silviu-Alin BACANU a, Dale DUNLAP b and Steve WESTON c

a GlaxoSmithKline R&D Ltd, New Frontiers Science Park, Third Avenue, Harlow, Essex, CM19 5AW, UK
b Univa UD, 9737 Great Hills Trail, Suite 300, Austin, TX 78759, USA
c REvolution Computing, One Century Tower, 265 Church Street, Suite 1006, New Haven, CT 06510, USA

Abstract. We describe an experiment which aims to reduce significantly the costs of running a particular large-scale grid-enabled application using commercial cloud computing resources. We incorporate three tactics into our experiment: improving the serial performance of each work unit, seeking the most cost-effective computation cycles, and making use of an optimized resource manager and scheduler. The application selected for this work is a genetics association analysis and is representative of a common class of embarrassingly parallel problems.

Keywords. Cloud Computing, Grid Computing, Statistical Genetics, R
Introduction

GlaxoSmithKline is a leading multinational pharmaceutical company with a substantial need for high-performance computing resources to support its research into creating new medicines. The emergence of affordable, on-demand, dynamically scalable hosted computing capabilities (commonly known as 'Cloud Computing' [1]) is of interest to the company: potentially it could allow adoption of an 'elastic' grid model whereby the company runs a reduced internal grid with the added capability to expand seamlessly on demand by harnessing external computing infrastructure. As these clouds continue to accumulate facilities comprising powerful parallel scientific software [2] and key public databases for biomedical research [3], they will become increasingly attractive places to support large, collaborative, in silico projects. This will improve as efforts [4,5] to evaluate and identify the underlying benefits continue within the commercial and public science sectors.

In this article we explore the economic viability of transferring a particular Statistical Genetics application from the GSK internal PC grid to an external cloud. The application is written in the R programming language. One run of this application requires 60,000 hours of CPU time on the GSK desktop grid, split across 250,000 jobs which require approximately fifteen minutes of CPU time each. Each job uses approximately 50MB of memory and performs negligible I/O. The GSK desktop grid comprises the entire fleet of R&D desktop PCs within GSK: these are Pentium machines with an average speed of about 3 GHz. At any given time up to 1500 of these devices may be used concurrently, harvested for spare compute cycles using the Univa UD Grid MP system. Based on raw performance, each node in this grid is roughly 3 times faster than a 'small' instance on Amazon EC2 [6].
This suggests that the entire application would require approximately 180,000 hours of CPU time on the Amazon EC2 cloud which implies that the naïve cost for running the
application on EC2, without any optimization, would be around $18,000 ($0.10 per CPU hour). This figure is too high to be attractive when compared with the running costs of the internal GSK grid infrastructure. However, if we were able to reduce that price by an order of magnitude it would open up new possibilities for GSK to run an economical elastic grid.

In order to tackle this problem we have established a three-way collaboration between GSK, Univa UD, a pioneering company in the area of grid and cloud computing resource management and scheduling, and REvolution Computing, which specialises in high-performance and parallel implementations of the R programming language. Our cost reduction strategy is based on three tactics:

• Pursuit of low-cost computing resources
• Reduction of serial execution time
• Efficient scheduling and resource management
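The $18,000 starting estimate follows from simple arithmetic on the figures quoted above. The following Python sketch (purely illustrative; all values are taken from the text) reproduces the calculation:

```python
# Figures quoted in the text: 60,000 CPU hours for one run on the
# internal grid, each grid node roughly 3x faster than an EC2 'small'
# instance, and EC2 'small' priced at $0.10 per CPU hour.
gsk_cpu_hours = 60_000
slowdown = 3
ec2_price_per_hour = 0.10

ec2_cpu_hours = gsk_cpu_hours * slowdown          # ~180,000 hours on EC2
naive_cost = ec2_cpu_hours * ec2_price_per_hour   # ~$18,000
per_job_minutes = gsk_cpu_hours * 60 / 250_000    # ~15 minutes per job

print(f"{ec2_cpu_hours} CPU hours -> ${naive_cost:,.0f}")
print(f"average {per_job_minutes:.1f} min per job")
```

The per-job figure also confirms the "approximately fifteen minutes" quoted for each of the 250,000 work units.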
The rest of the paper is organized as follows. In section 1 we describe the Statistical Genetics grid application which we are using for this experiment. Section 2 gives a brief introduction to the R programming language. In section 3 we review the current status of cloud computing. Section 4 provides a detailed explanation of our methods, and the results are presented in section 5. Finally we review the results and make predictions for the future uptake of cloud computing in science and technology.

1. Development of Methods for Genetic Analysis Using Simulation

Discovery of genetic variants responsible for certain diseases or adverse drug reactions is currently a high priority for both governmental and private organizations. As described in Bacanu et al [7], the tools for discovery consist of statistical methods for detecting association between a phenotype of interest and biomarkers (genotypes in this case). While these methods are general in spirit, the type of data they are applied to is unfortunately very specific. For instance, the data for a very common "genome scan" consists of around one million or more variables (markers) that can be highly correlated regionally while, at the same time, the correlation structure is highly irregular and its magnitude varies widely between even adjacent genomic regions. Due to the size and irregularity of genetic data, the performance of association methods has to be assessed using very large simulation designs with very time-consuming individual simulations. The computational complexity of the problem is further compounded when the number of available subjects is small; under these circumstances asymptotic statistical theory cannot be used and statistical significance has to be assessed via computationally-intensive permutation tests. Consequently, to successfully develop and apply methods, statistical genetics researchers need extensive computational power such as very large clusters, grids or cloud computing.
Association analysis is the statistical technique for linking the presence of a particular genetic mutation in subjects to susceptibility to a particular disease. To assess the presence of associations, subjects in study cohorts are divided into cases (those with the disease) and controls (those without). The DNA of each subject is genotyped at particular markers where relatively common mutations are present. The combination of genotype and phenotype (disease symptom) data is then analyzed to arrive at a statistical significance (for association) at each studied marker. A typical design for testing association methods involves running many instances (sometimes more than 250,000) of an identical program, each with different input data or, if
permutations are necessary, a randomly permuted copy of the input data. Each instance produces, for each marker, a single data point within a simulated probability distribution. Once all the instances have completed, the resultant probability distribution is used to approximate the statistical significance of the actual data, and hence whether there is a significant link between a particular genetic marker and the disease under study.
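As a toy illustration of this permutation scheme (a Python sketch with made-up data, not the project's actual R code), the following estimates the significance of an association between a binary marker and case/control status by comparing the observed case–control difference against differences computed from randomly permuted labels:

```python
import random

random.seed(1)

# Hypothetical toy cohort: 1 = subject carries the marker allele.
cases    = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]   # 8/10 carriers
controls = [0, 0, 1, 0, 0, 1, 0, 0, 0, 1]   # 3/10 carriers

def assoc_stat(a, b):
    # Difference in carrier frequency between the two groups.
    return sum(a) / len(a) - sum(b) / len(b)

observed = assoc_stat(cases, controls)
pooled = cases + controls
n_cases = len(cases)

# Each permutation is independent, so on a grid these iterations
# would be split across many separate work units.
n_perm = 10_000
extreme = 0
for _ in range(n_perm):
    random.shuffle(pooled)
    if assoc_stat(pooled[:n_cases], pooled[n_cases:]) >= observed:
        extreme += 1

p_value = (extreme + 1) / (n_perm + 1)
print(f"observed difference = {observed:.2f}, permutation p ~= {p_value:.3f}")
```

Because every permutation is independent of the others, this computation is embarrassingly parallel, which is what makes the application a natural fit for a desktop grid or cloud.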
2. The R Programming Language

The R programming language [8] is a popular public-domain tool for statistical computing and graphics. R is a very high level function-based language and is supported by a vast online library of hundreds of validated statistical packages [9]. Because R is both a highly productive environment and freely available, it appeals both to academics who are teaching the next generation of statisticians and to others who are developing new 'state-of-the-art' statistical algorithms, which are becoming increasingly complex as methods become progressively more sophisticated. The flip-side of this usefulness and programming productivity is that R programs may take a long time to execute when compared with equivalent programs written in low-level languages like C. However, the additional learning and effort required to develop complex numerical and statistical applications using C is not a realistic option for most statisticians, so there have been many initiatives to make R programs run faster. These fall into three general categories:

1. Task farm parallelisation: running a single program many times in parallel with different data across a grid or cluster of computers. This is the approach that has been used for the genetics application of this project – using the GSK desktop grid, a program which would take seven years to run sequentially can complete in a few days.

2. Explicit parallelisation in the R code using MPI or parallel loop constructs (e.g. R/Parallel [10]). The drawback of this approach is that it requires the R programmer to have some understanding of concurrency in order to use the parallel constructs or functions within the program.

3. Speeding up the performance of particular functions by improved memory handling, or by using multithreaded or parallelised algorithms 'beneath the hood' to accelerate particular R functions, e.g. REvolution R, Parallel R [11] or SPRINT [12].
With this approach the programmer does not need to make any changes to the code – the benefits are automatic. However, at present only certain R functions have been optimised in this way, so this approach will only work for codes which make significant use of those functions.

For the purpose of this experiment we use a combination of approach 1 with approach 3. The code has already been enabled to run as a collection of 250,000 independent work units (as described above). By also aiming to reduce the serial execution time of the code (by improved memory handling rather than by multithreaded or parallelised algorithms) we hope to reduce our overall bill for use of cloud virtual machines.
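Approach 1 can be sketched in a few lines of Python (a stand-in for the Grid MP / cloud task farm; the workload and names here are invented for illustration): each work unit is an independent function call farmed out to a worker pool, with no communication between units:

```python
from concurrent.futures import ThreadPoolExecutor

def work_unit(job_id):
    # Stand-in for one ~15-minute simulation job; each unit is
    # completely independent of all the others.
    return sum(i * i for i in range(job_id))

# 100 toy jobs instead of 250,000; a real deployment farms the units
# out to separate machines, not local threads.
job_ids = list(range(1, 101))
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(work_unit, job_ids))
print(len(results), "jobs completed")
```

The key property is that `pool.map` imposes no ordering or data dependencies between units, which is exactly what lets a scheduler keep every core busy until the queue drains.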
3. Cloud Computing

Cloud computing is a style of computing in which dynamically scalable and virtualized resources are provided as a service over the Internet. Major cloud computing vendors, such as Amazon, HP and Google, can provide a highly reliable, pay-on-demand, virtual infrastructure to organizations. This is attractive to small commercial organizations which do not wish to invest heavily in internal IT infrastructure, and which need to adjust the power of their computing resources dynamically, according to demand. It may also be attractive to large global companies, such as pharmaceutical companies [13], who will potentially be able to set their internal computing capacity to match the demands of day-to-day operations, with an elastic capability to expand on to clouds to deal with spikes of activity. In either case the economics of cloud computing will be crucial to the likelihood of uptake [14].

As well as economic considerations, cloud computing comes with a number of additional complications, and there exists a certain amount of skepticism in the computing world about its true potential. Armbrust et al [15] list ten obstacles to the adoption of cloud computing, as follows:

1. Availability of Service. Risks of system downtime might be unacceptable for business-critical applications.

2. Data Lock-In. Each cloud vendor currently has its own proprietary API and there is little interoperability, which prevents customers from easily migrating applications between clouds.

3. Data Confidentiality and Auditability. Cloud computing offerings are essentially public rather than private networks and so are more exposed to security attacks than internal company resources. There are also requirements for system auditing in certain areas of business (such as pharmaceuticals) which need to be made feasible on cloud virtual machines.

4. Data Transfer Bottlenecks. Migrating data to clouds is currently costly (around $150 per terabyte) and may also be very slow.
5. Performance Unpredictability. Significant variations in performance have been observed between cloud virtual machine instances of the same type from the same vendor. This is a common problem with virtualisation and can be a particular issue for high-performance computing applications which are designed to be load-balanced across homogeneous compute clusters.

6. Scalable Storage. Providing high-performance, shared storage across multiple cloud virtual machines is still considered to be an unsolved problem.

7. Bugs in Large Distributed Systems. Debugging distributed applications is notoriously difficult. Adding the cloud dimension increases the complexity and will require the development of specialised tools.

8. Scaling Quickly. Cloud vendors generally charge according to the quantity of resources that are reserved: virtual CPUs, memory and storage, even if they are not currently in use. Therefore a system which can scale the reserved resources up and down quickly according to actual usage, hence minimizing costs, would be very attractive to customers.
9. Reputation Fate Sharing. One customer's bad behaviour could potentially tarnish the reputation of all customers of a cloud, possibly resulting in blacklisting of IP addresses.

10. Software Licensing. Current licensing agreements tend to be node-locked, assigned to named users, or floating concurrent licenses. What is needed for cloud computing is for software vendors to charge by the CPU hour.

Each of these matters needs to be carefully reviewed when a company makes a decision whether to adopt cloud computing. They are all subject to ongoing technical research within the cloud computing community. Scientific software standards also need to be consistently adhered to, and specific issues such as random number generation addressed. For the genetics application described here, GSK is able to avoid data confidentiality issues by taking steps to make the data anonymous. Data transfer bottlenecks and scalable storage are also not major issues here, as this problem is highly computationally bound. The concern of data lock-in is to be handled by Univa UD's UniCloud product, which provides a standard interface whichever cloud vendor is used. The R programming environment is open source and so there are no associated license costs.

4. Tactics for Reducing Cost

The main objective of our experiment is to reduce the cost of running the GSK Statistical Genetics application by an order of magnitude from our starting estimate of $18K. We are using three separate, largely independent mechanisms to achieve this.

4.1 Pursuit of Low-Cost Computing Resources

Cloud vendors usually offer several different varieties of virtual computer, with differing levels of resources and correspondingly different costs. We will be trialing a number of these from two vendors, Amazon and Rackspace, as listed in Tables 1 and 2. (Note that R is platform neutral: it runs on both Windows and Linux.)

Table 1. Amazon EC2 instance types.
Instance          Memory  EC2 Compute Units  Storage  Architecture  Price per CPU hour (Linux)
Standard Small    1.7GB   1                  160GB    32-bit        $0.10
Standard Large    7.5GB   4                  850GB    64-bit        $0.40
Standard XL       15GB    8                  1690GB   64-bit        $0.80
High CPU Medium   1.7GB   5                  350GB    32-bit        $0.20
High CPU XL       7GB     20                 1690GB   64-bit        $0.80
Table 2. Rackspace cloud instance types.
Memory  Storage  Price per CPU hour (Linux)
256MB   10GB     $0.015
512MB   20GB     $0.03
1024MB  40GB     $0.06
2048MB  80GB     $0.12
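Dividing Table 1's Linux prices by the advertised EC2 compute-unit ratings gives a rough price-per-unit-of-compute comparison across instance types (a back-of-envelope sketch in Python using the table's figures, not a benchmark):

```python
# Advertised compute-unit ratings and Linux prices from Table 1.
ec2 = {
    "Standard Small":  (1,  0.10),
    "Standard Large":  (4,  0.40),
    "Standard XL":     (8,  0.80),
    "High CPU Medium": (5,  0.20),
    "High CPU XL":     (20, 0.80),
}
per_unit = {name: price / units for name, (units, price) in ec2.items()}
for name, cost in per_unit.items():
    print(f"{name}: ${cost:.3f} per compute-unit hour")
```

On paper, the two 'High CPU' types work out at $0.04 per compute-unit hour against $0.10 for the 'Standard' types, which foreshadows the results below: for a compute-bound workload the High CPU instances are the more cost-effective EC2 choice.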
Note that the CPU power of each Amazon EC2 instance is given as a multiple of their so-called 'compute unit', which is claimed to be equivalent to a 1.0–1.2 GHz 2007 Opteron processor [6]. The CPU performance of Rackspace cloud instances, however, is not provided on their website [16] and so for the purpose of our project has been determined by running benchmarks.

4.2 Reduction of Serial Execution Time

Given our starting estimate of $18K to run our application on a cloud, it would be well worth investing effort in optimizing the serial performance of each of the 250,000 work units. This can be achieved in two ways: either by performance tuning the R code of the application directly, or by optimizing the R run-time environment. REvolution Computing's version of the R runtime environment contains optimized versions of certain underlying mathematical functions with improved memory and cache utilization. Programs which make heavy use of the optimized functions may show a significant speed-up in serial execution time. In section 5 we describe the results of using REvolution's tools in this case study, as well as transforming the R code directly using application profiling.

Another potential approach to improving the serial performance would be to rewrite the code in an efficient low-level language such as C or Fortran. However, this would be a very costly undertaking because of the complexity of the R packages used in the program. Also, the R program used for our genetics analysis is subject to frequent revisions and improvements, and the flexibility to make frequent changes would disappear with this approach.

4.3 Efficient Scheduling and Resource Management

In order to achieve the highest throughput in a cloud environment, we need to be able to quickly provision a large environment, and also efficiently schedule a large number of jobs within that environment. An HPC cloud is essentially a cluster of a large number of virtual machines.
To be useful, we need the ability to quickly create a large clustered environment, consisting of dozens, or possibly hundreds or thousands of virtual cores. Ideally, this process should be highly scalable: requiring little (if any) additional effort to build a 1,000 node cluster, as opposed to a 2-node cluster. In our experiment, we used Univa UD’s UniCloud product to provision the cluster. UniCloud works as follows: one virtual machine (“installer node”) is created manually within the cloud environment (for example, Amazon EC2). This machine is installed with CentOS Linux 5.2. The customers’ account information is configured in a special file (which varies by cloud vendor), and this allows UniCloud to use the cloud vendor’s API to create additional virtual machines programmatically. A UniCloud command (“ngedit”) is used to define the configuration for additional nodes, and the “addhost” command is then
used to actually provision the nodes. This includes building the operating system, setting up ssh keys for password-less login, and installation of optional software packages. One of the optional packages installed by UniCloud is Sun Grid Engine [17], which is used as the batch job scheduler. This allows us to submit all 250,000 jobs at once. The jobs are queued by SGE and then farmed out to different machines as they become available. Without a scheduler, it would be impossible to fully utilize all the compute resources.

5. Experimental Results

We ran a representative subset of jobs on three flavours of cloud VM, two from Amazon and one from Rackspace, using the UniCloud scheduler and resource manager in order to achieve an efficient throughput of jobs. Because of the large number of individual jobs run in our experiment, the scheduler was able to achieve near-perfect efficiency in utilization of processor cores. We used these results to make the cost and execution time predictions in Table 3.

Table 3. Initial performance results.
                                 EC2 Standard XL  EC2 High CPU XL  Rackspace 256MB
Average Runtime Per Job          29.1 min         17.9 min         19.6 min
Jobs Per Hour Per Instance       8.25             26.82            12.23
Total Number of Jobs             250,000          250,000          250,000
Wall Clock Time (200 instances)  151.6 hr         46.6 hr          102.2 hr
Cost per hr                      $0.80            $0.80            $0.015
Total Cost                       $24,250.00       $7,458.33        $306.72
Using the Rackspace cloud works out over twenty times less expensive than the lowest cost seen with Amazon. However, there are fundamental differences between the two environments that make direct comparison difficult. An instance at Amazon EC2 provides a guaranteed minimum performance, as well as a certain amount of memory and disk space. Each Rackspace VM provides access to four processing cores and a specific amount of memory (in our case, 256MB). Each physical machine at Rackspace actually has two dual-core processors and 16GB of memory. There could be up to 64 separate 256MB VMs sharing this hardware. Each VM can use up to the full performance of the machine if the other VMs are idle, but it is possible (although unlikely) that the performance will be significantly less. Rackspace's provisioning algorithm attempts to avoid this problem by distributing new VMs instead of stacking them, and most applications already deployed on the Rackspace cloud are not at all CPU-hungry.

In our testing, the Rackspace 256MB instance was the most cost-efficient, because each of our jobs requires about 50MB of memory, so four concurrent jobs can fully utilize one of these virtual machines without swapping. The scheduler was able to keep each of these cores constantly busy until all the jobs had been processed. Note also that there is currently a limit of 200 concurrent VMs that a Rackspace user may hire in a single session, so Rackspace would appear to be significantly less scalable for HPC than Amazon. The best execution time on the Rackspace cloud is 102 hours. Amazon can do the analysis in 46 hours using 200 VM instances, and could potentially go much faster than this by creating 1000 or more concurrent VMs.
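Table 3's wall-clock and cost figures follow directly from the measured per-instance throughput. A sketch of the arithmetic (numbers taken from Table 3; results agree with the table to rounding):

```python
jobs, instances = 250_000, 200
measured = {
    # instance type: (jobs per hour per instance, $ per instance-hour)
    "EC2 Standard XL": (8.25,  0.80),
    "EC2 High CPU XL": (26.82, 0.80),
    "Rackspace 256MB": (12.23, 0.015),
}
summary = {}
for name, (throughput, price) in measured.items():
    wall_clock = jobs / (throughput * instances)   # hours, all 200 VMs busy
    total_cost = wall_clock * instances * price    # dollars
    summary[name] = (wall_clock, total_cost)
    print(f"{name}: {wall_clock:.1f} hr, ${total_cost:,.2f}")
```

Note that with a fixed number of jobs the instance count cancels out of the cost (it only changes the wall-clock time), which is why cost per job is driven entirely by throughput per dollar.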
5.1 R Code Optimisation

Our performance optimization for the test R program was largely focused in two areas: language-level and processor optimizations. The R language is a rich, easy to use, high-level programming language. The flexibility of the language makes it easy to express computational ideas in many ways, often simplifying their implementation. However, that flexibility also admits many less-than-optimal implementations: it is possible for several mathematically-equivalent R programs to have widely different performance characteristics. Fortunately, R code is easy to profile for time and memory utilization, allowing us to identify bottlenecks in programs and compare alternate implementations. The Rprof function produces a detailed log of the time spent in every function call over the course of a running program, as well as optional detailed memory consumption statistics. There exist several excellent open-source profile analysis and visualization packages that work with Rprof, including 'proftools' and 'profr'. These are essential tools for R performance tuning.

Processor optimizations represent the other main approach to R performance optimization. Unlike language-level optimizations, processor optimizations are usually implemented through low-level numeric libraries. The R language relies on low-level libraries for basic computational kernels like linear algebra operations.

Running the Statistical Genetics R code using REvolution R showed a marginal performance improvement (6%) over running with the public-domain version of R. In order to explore further, we used Rprof to analyse the code. This shows a breakdown of how much time the program spends in each function; for our program the Rprof output begins as follows:

Each sample represents 0.02 seconds. Total run time: 1774.31999999868 seconds. Total seconds: time spent in function and callees. Self seconds: time spent in function alone.
% total   total seconds   % self   self seconds   name
 100.00         1774.32     0.00           0.00   "source"
 100.00         1774.32     0.00           0.02   "eval.with.vis"
  99.99         1774.18     0.00           0.00   "t"
  99.99         1774.18     0.34           6.12   "sapply"
  99.99         1774.18     0.00           0.00   "replicate"
  99.99         1774.14     1.09          19.34   "lapply"
  99.99         1774.10     9.79         173.74   "FUN"
  99.98         1774.02     0.09           1.68   "getpval"
  64.55         1145.30     1.51          26.88   "lm"
  32.21          571.48     0.03           0.60   "anova"
  32.17          570.88     0.02           0.40   "anova.lm"
  30.21          536.02     1.22          21.72   "eval"
  29.92          530.80     0.18           3.20   "anova.lmlist"
This shows that most of the execution time was within the ‘lm’ function (and related functions) for fitting linear models. The test program runs a large number of linear models, which are ultimately based on the QR matrix factorization. The particular QR factorization
used by R has not yet been adapted to take advantage of tuned numeric libraries. This explains why the speedup obtained was small. (However, optimization of the QR factorization code is work in progress and we expect it to provide a further 5 to 10% speedup.) Fortunately, it turns out that our program's performance may be significantly improved by a code transformation. The original code contains the following 'for' loop construct:

for(i in 1:ncov){
  if(i==1){
    model0<-paste(model0,"covs[,",i,"]")
    model1<-paste(model1,"+","covs[,",i,"]")
  } else {
    model0<-paste(model0,"+","covs[,",i,"]")
    model1<-paste(model1,"+","covs[,",i,"]")
  }
  mod0<-lm(as.formula(model0))
  mod1<-lm(as.formula(model1))
  pvalcov[[i]]<-anova(mod0,mod1)[2,"Pr(>F)"]
}
Loops are notoriously inefficient in R. In this case we noticed that each iteration of the loop is independent, so we may safely transform it by moving the loop body into a function and replacing the 'for' statement with a more efficient 'apply' operation (equivalent to a functional 'map'). We produced an optimized version that runs significantly faster (even using the public-domain version of R):

xxmod <- function(i) {
  if(i==1){
    model0<-paste(model0,"covs[,",i,"]")
    model1<-paste(model1,"+","covs[,",i,"]")
  } else {
    model0<-paste(model0,"+","covs[,",i,"]")
    model1<-paste(model1,"+","covs[,",i,"]")
  }
  mod0<-lm(as.formula(model0))
  mod1<-lm(as.formula(model1))
  anova(mod0,mod1)[2,"Pr(>F)"]
}
pvalcov <- sapply(1:ncov, xxmod)
Table 4 shows the results of re-running the experiment using the optimized version of the R program (but still using the public domain version of R). We see a further 20% cost reduction. Table 4. Performance results following code optimisation.
                                 EC2 Standard XL  EC2 High CPU XL  Rackspace 256MB
Average Runtime Per Job          19.2 min         13.5 min         15.8 min
Jobs Per Hour Per Instance       12.50            35.56            15.24
Total Number of Jobs             250,000          250,000          250,000
Wall Clock Time (200 instances)  100.0 hr         35.2 hr          82.0 hr
Cost per hr                      $0.80            $0.80            $0.015
Total Cost                       $16,000.00       $5,625.00        $246.09
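The quoted "further 20% cost reduction" can be checked directly against the totals in Tables 3 and 4 (a quick Python sketch; the figures are copied from the tables):

```python
# Total costs from Table 3 (before) and Table 4 (after optimisation).
before = {"EC2 Standard XL": 24250.00, "EC2 High CPU XL": 7458.33, "Rackspace 256MB": 306.72}
after  = {"EC2 Standard XL": 16000.00, "EC2 High CPU XL": 5625.00, "Rackspace 256MB": 246.09}

reduction = {k: 1 - after[k] / before[k] for k in before}
for k, r in reduction.items():
    print(f"{k}: {r:.0%} cheaper after code optimisation")
```

On Rackspace the reduction is close to 20%, matching the text; on the EC2 instances the optimisation bought somewhat more (roughly 25–34%), presumably because their measured per-job runtimes fell further.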
So we have brought down the total cost of running our application from $18,000 to around $250 using the methods described in this paper. The bottom line is that the reason Rackspace is such a good deal is that it perfectly matches the jobs to be run: a 256MB instance is very cheap, and it just so happens that four of these jobs fit perfectly in 256MB. This instance only has 10GB of disk, but our job mix does not need much disk. (The two EC2 instances that we tested provided 15GB and 7GB of RAM respectively and 1690GB of storage space, most of which was unused.) The UniCloud tool made it possible to use the Rackspace cloud instance efficiently, and the R profiling tools also provided a reduction in serial execution time, applicable across every work unit.

An important caveat to the Rackspace lowest cost is that it is a best-case scenario for that cloud infrastructure, obtained when it is relatively lightly loaded with high-performance jobs. Since the performance of cloud instances is potentially highly variable, the cloud brokerage layer would ideally be set up to constantly monitor the performance of the resources under its control, with the ability to feed information back to the customer and to automatically shut down badly performing VMs.

Note that the cost figures presented above are based purely on cloud computing resource usage – they do not include any software license fees for UniCloud or REvolution R. REvolution R is an open source product and UniCloud is based on the open source Sun Grid Engine product, so it would be possible for a programmer to implement the solution described in this article without incurring any software license costs. However, the provision of a commercial cloud brokerage system offering simple standardized access to a variety of cloud offerings, allowing its user to shop around for the most suitable virtual infrastructure, will be attractive to organizations such as GSK.

6. Conclusions

Use of commercial clouds for grid computing may seem surprisingly expensive at first sight and may appear off-putting to companies that require HPC services. However, by taking a significant example from a major pharmaceutical company, we have shown how to reduce these costs using a combination of cloud brokerage, efficient scheduling, and application tuning.

The application described here was computationally bound, having negligible memory and I/O requirements. If we were to look at different classes of applications, such as those which use large-scale databases, then we would have to consider a wider spectrum of economic variables, such as costs for cloud storage and cloud data transfer. Applications requiring large amounts of memory would not show the same cost benefits in using the Rackspace cloud as we have achieved here. Another important variable is the level of data security: customers such as GSK will require watertight guarantees of security from vendors before sending any commercially sensitive data to a cloud environment.

Going forward, we foresee a model of cloud brokerage emerging whereby a layer of middleware is provided to help satisfy customers' constraints in utilising software services, based on factors such as cost or execution time. In our example we made cost the primary factor, which led us to choose the Rackspace cloud offering. Had execution time been our driver, we would have opted for Amazon.

The issue of 'data lock-in' is particularly relevant to the experiment described here. At present there is no recognised standard for connecting to cloud resources and each vendor provides its own unique programming interface. By using a middleware layer, such as UniCloud, a cloud customer is insulated from these differences and is easily able to exploit
potential economic or performance advantages that we have shown to be possible by careful choice of cloud supplier. Without that additional layer, the additional programming costs required for moving applications between different clouds could be restrictive.

Once the brokerage infrastructure is in place to allow customers dynamic access to powerful and economically attractive computing resources, and major security issues have been ironed out, we foresee major uptake of cloud computing within the scientific community, both for collaborative activities and for metered use of commercial software and database services. At present there is still a great deal of work to do for this vision to become reality. In the meantime, if you can find a vendor that has an instance that perfectly matches your workload (or if it is possible to vary your workload to perfectly match an inexpensive instance), cloud computing can be very inexpensive.

References

[1] Rajkumar Buyya, Chee S. Yeo, Srikumar Venugopal, James Broberg and Ivona Brandic. Cloud computing and emerging IT platforms: Vision, hype, and reality for delivering computing as the 5th utility. Future Generation Computer Systems, Vol. 25, No. 6 (2009), pp. 599–616. Elsevier.
[2] Michael Schatz. CloudBurst: Highly Sensitive Short Read Mapping with MapReduce. http://sourceforge.net/apps/mediawiki/cloudburst-bio/index.php?title=CloudBurst
[3] Amazon Web Services for Biology. http://developer.amazonwebservices.com/connect/kbcategory.jspa?categoryID=246
[4] A. Bateman and M. Wood. Cloud computing. Bioinformatics, 25(12):1475, June 2009. Oxford University Press, Oxford, UK.
[5] Brian D. Halligan, Joey F. Geiger, Andrew K. Vallejos, Andrew S. Greene and Simon N. Twigger. Low Cost, Scalable Proteomics Data Analysis Using Amazon's Cloud Computing Services and Open Source Search Algorithms. J. Proteome Res., 2009, 8(6), pp. 3148–3153. ACS Publications.
[6] Amazon Elastic Compute Cloud (EC2). http://aws.amazon.com/ec2/
[7] S.A. Bacanu, M.R. Nelson and M.G. Ehm. Comparison of association methods for dense marker data (2008). Genet. Epidemiol. 32(8):791–799. Wiley-Liss, Inc.
[8] Sun Grid Engine. http://www.sun.com/software/sge/
[9] R Development Core Team (2008). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.
[10] Gonzalo Vera, Ritsert C. Jansen and Remo L. Suppi. R/parallel – speeding up bioinformatics analysis with R. BMC Bioinformatics 2008, 9:390.
[11] REvolution Computing. http://www.revolution-computing.com/
[12] Jon Hill, Matthew Hambley, Thorsten Forster, Muriel Mewissen, Terence M. Sloan, Florian Scharinger, Arthur Trew and Peter Ghazal. SPRINT: A new parallel framework for R. BMC Bioinformatics 2008, 9:558.
[13] Rick Mullin. The New Computing Pioneers. Chemical and Engineering News, Volume 87, Number 21, pp. 10–14. ACS Publications.
[14] Derrick Kondo, Bahman Javadi, Paul Malecot, Franck Cappello and David P. Anderson. Cost-benefit analysis of Cloud Computing versus desktop grids. Proceedings of the IEEE International Symposium on Parallel & Distributed Processing 2009, pp. 1–12.
[15] Michael Armbrust, Armando Fox, Rean Griffith, Anthony D. Joseph, Randy H. Katz, Andrew Konwinski, Gunho Lee, David A. Patterson, Ariel Rabkin, Ion Stoica and Matei Zaharia. Above the Clouds: A Berkeley View of Cloud Computing. Technical Report No. UCB/EECS-2009-28, Electrical Engineering and Computer Sciences, University of California at Berkeley. http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-28.html
[16] The Rackspace Cloud. http://www.rackspacecloud.com/cloud_hosting_products/servers
[17] Markus Schmidberger, Martin Morgan, Dirk Eddelbuettel, Hao Yu, Luke Tierney and Ulrich Mansmann. State of the Art in Parallel Computing with R. Journal of Statistical Software, Vol. 31, No. 1 (June 2009), pp. 1–27.
Communicating Process Architectures 2009 P.H. Welch et al. (Eds.) IOS Press, 2009 © 2009 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-60750-065-0-197
An Application of CoSMoS Design Methods to Pedestrian Simulation Sarah CLAYTON1, Neil URQUHART and Jon KERRIDGE School of Computing, Edinburgh Napier University, Edinburgh, EH10 5DT Abstract. In this paper, we discuss the implementation of a simple pedestrian simulation that uses a multi agent based design pattern developed by the CoSMoS research group. Given the nature of Multi Agent Systems (MAS), parallel processing techniques are inevitably used in their implementation. Most of these approaches rely on conventional parallel programming techniques, such as threads, Message Passing Interface (MPI) and Remote Method Invocation (RMI). The CoSMoS design patterns are founded on the use of Communicating Sequential Processes (CSP), a parallel computing paradigm that emphasises a process oriented rather than object oriented programming perspective. Keywords. JCSP, CoSMoS, pedestrian simulation, multi-agent systems
Introduction

The realistic simulation of pedestrian movement is a challenging problem: a large number of individual pedestrians may be included in the simulation. Rapid decisions must be made about the trajectory of each pedestrian in relation to the environment and other pedestrians in the vicinity. Parallel processing has allowed such simulations to include tens of thousands of pedestrian processes travelling across large areas. The multi-agent systems paradigm has recently been used to significant effect within pedestrian simulation; typically, each pedestrian within the simulation equates to a specific agent. Examples of these are described in [1,2]. In [2], the parallel aspect of the simulation is implemented using MPI. The Complex Systems Modelling and Simulation infrastructure (CoSMoS) research project [3], a successor to the TUNA [4] research project, is an attempt to develop reusable engineering techniques that can be applied across a whole range of MAS and simulations. Examples include three dimensional simulations of blood clot formation in blood vessels, involving many millions of processes running across a number of networked computers [5,6]. The aims of the CoSMoS project are to provide a ‘massively-concurrent and distributed’ [7] infrastructure based on a process-oriented programming model. This process-oriented paradigm derives from the process algebras of CSP [8] and π-calculus [9]. The purpose of these methods is the elimination of perennial problems in concurrency arising from conventional parallel techniques, such as threads and locks. These problems include deadlock, livelock and race hazards. Initial demonstrations produced by the CoSMoS research group are implemented using occam-π [10], a language developed to enable the direct implementation of concurrent systems complying with the principles of CSP and π-calculus. To this end, the Kent Retargetable occam Compiler (KRoC) [11] was developed at the University of Kent Computing Laboratory.
1 Corresponding Author: Sarah Clayton, School of Computing, Edinburgh Napier University, Merchiston Campus, Edinburgh EH10 5DT, UK. Tel: +44 131 455 2477. E-mail: [email protected].
Similar capabilities are also provided by the Java Communicating Sequential Processes
(JCSP) [12] packages, also developed at the University of Kent. As a library for the mainstream Java language, JCSP is more accessible to general programmers. It provides a means of implementing the semantics of CSP, using the underlying Java threading model, without the developer needing to be concerned with the details. The work described here has been created using JCSP.

1. CSP and the π-calculus

In this section, we give a brief summary of the main concepts within CSP and the π-calculus. They are part of a rich set of semantics for concurrent programming, made directly available to programmers through the occam-π programming language and the JCSP library for Java. CSP and the π-calculus are formalisms for designing and reasoning about concurrent systems. Development environments that allow the application of these formalisms encourage a process-oriented, rather than object-oriented, approach to programming.

1.1. Processes

Processes have their own thread of control, entirely encapsulate all their data and maintain their own state. These cannot be altered by other processes [13]. Other processes may send a message to a process requesting that its state be updated; such messages may be refused by the receiving process until it is in a correct state to deal with them. This gives two important wins for concurrent design: firstly, state never changes without the process owning that state making the change and, secondly, a process never needs to manage data for requests with which it cannot deal. No locks are needed. Of course, care must be taken to avoid deadlocks caused by the refusal of messages.

1.2. Networks, Channels and Barriers

Processes combine in parallel to form networks. Processes interact through synchronising on shared events. Instead of being propagated by listeners, events in process oriented systems are built upon channels and barriers.
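The rendezvous nature of channel synchronisation can be illustrated in plain Java. The sketch below uses java.util.concurrent.SynchronousQueue as a stand-in for a JCSP One2OneChannel: the writer blocks until the reader takes the value. This is only an analogy of the semantics, not the JCSP implementation, and the names are illustrative.

```java
import java.util.concurrent.SynchronousQueue;

// Plain-Java analogy of an unbuffered CSP channel: SynchronousQueue has
// rendezvous semantics, so put() blocks until a matching take() occurs.
public class RendezvousDemo {
    static Integer exchange(final int value) throws InterruptedException {
        final SynchronousQueue<Integer> channel = new SynchronousQueue<>();
        Thread writer = new Thread(() -> {
            try {
                channel.put(value);  // blocks until the reader is ready
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        writer.start();
        Integer received = channel.take();  // completes the rendezvous
        writer.join();
        return received;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(exchange(42));
    }
}
```

In JCSP itself the same synchronisation is provided directly by channel objects, without the developer touching threads or queues.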
A network of processes is itself a process, so layered architectures – reflecting the layered structures of real life – are simply modelled. Barriers are a fundamental part of Bulk Synchronous Processing (BSP) [14]. They are directly modelled by multiway events in CSP. They allow multiple and heterogeneous processes to synchronise their activities, and enforce lockstep parallelism between them. When a process synchronises on a Barrier, it waits until all other processes enrolled on the Barrier have also synchronised before continuing processing. Processes communicate privately with each other using channels. Channels are synchronous: any process writing to a channel will block until the reader is able to process the message. Although it is possible to implement buffering in these channels (JCSP supports this directly), no buffered channels are used here. One2OneChannels allow point to point communication between processes, and Any2OneChannels allow many processes to communicate with a single process. Other channel types are available but are not discussed here.

2. Implementation

The structure of the simulation is based partly on the server structure described in [2] and the description of space given by Andrews et al. in [7]. The simulation uses fine grained parallelism, and is scalable. It has three main elements: Agents, Sites, and an overall SiteServer. These are implemented as processes rather than passive objects, and communicate with each other using channels, rather than through method calls. This allows processes
and their data to be completely encapsulated. Any change of state or control information is communicated, as necessary, to other processes through channels. The space to be simulated is modelled by multiple Site processes, in a way similar to [7]. The SiteServer process acts as a server to all its client Sites. A Site process acts as a server to an ever-changing set of mobile Agent clients – see Figure 1. The use of a client-server architecture here eliminates the dangers of deadlock [15].
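The client-server discipline described above can be sketched in plain Java: a client sends a request and the server always answers before doing anything else. The queues below stand in for JCSP channels; the class and method names are illustrative only.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch of the client-server pattern: the server is guaranteed to respond
// to every request it accepts, which is the property the deadlock-freedom
// argument in [15] rests on. Queues model channels here.
public class ClientServerDemo {
    static String roundTrip(String request) throws InterruptedException {
        BlockingQueue<String> toServer = new ArrayBlockingQueue<>(1);
        BlockingQueue<String> toClient = new ArrayBlockingQueue<>(1);
        Thread server = new Thread(() -> {
            try {
                String req = toServer.take();   // accept the request...
                toClient.put("ACK:" + req);     // ...and always reply
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        server.start();
        toServer.put(request);                  // client sends request
        String reply = toClient.take();         // client waits for reply
        server.join();
        return reply;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(roundTrip("position"));
    }
}
```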
Figure 1. Server layers based on Quinn et al. [2].
Agents are mobile across sites: they may deregister themselves from any given Site and migrate to another. Each Site has a register channel for this purpose, where it receives registration requests from Agents, in the form of their server channel ends. This allows communication to be immediately established between the Agent as client and the Site as server. The process of registration (by a SiteServer) is described in the code in Listing 1:

private void register() {
    boolean polling = true;
    timer.setAlarm(timer.read() + timeout);
    Vector<ServerChannelEnd> servers = new Vector<ServerChannelEnd>();
    while (polling) {
        int index = alternative.select(); // wait for timeout or channel
        if (index == TIMER) {
            polling = false;              // no more registrations: stop polling
        } else if (index == CHAN) {
            ServerChannelEnd s = (ServerChannelEnd) register.read();
            servers.add(s);
            timer.setAlarm(timer.read() + timeout); // restart the timeout
        }
    }
    for (ServerChannelEnd newserver : servers) {
        newserver.read();          // take the client's initial request
        newserver.write("Hello");  // acknowledge the registration
        server.add(newserver);     // add to the set of served clients
    }
}

Listing 1. Code for client registration.
The SiteServer operates in a manner conceptually similar to that described in [2]. Agents communicate their current location to the Site with which they are currently registered. Sites then engage in a client-server communication with the SiteServer, which aggregates all this information. In the next phase of the communication, the SiteServer returns this global information to each Site, which then passes it on to each Agent. Agents then act on this information and alter their current position. This sequence is described in Table 1, which compares the three main processes of the simulation. The implementation of the pedestrian simulation is illustrated in Figure 1.

2.1. Discover and Modify

In order to ensure that all processes are updated and modified in parallel, two Barriers are used: discover and modify. During the discover phase, all Sites are updated by the SiteServer with the global coordinates of every Agent. Each Site then updates all Agents that are registered with it. As explained in [16,6], autonomous software agents perceive their environment and then act on it. This creates a two phase process for each step of the simulation, discovery and modification, that all processes comply with. These phases are enforced by the barriers described above. The tasks carried out by each type of process for each step of the simulation are described in Table 1.

Table 1. Processing Sequence.

Agent                          Site                         SiteServer
------------------- Synchronise on discover barrier -------------------
Request global coordinates  →  Receive requests
                               Request update            →  Receive requests
Receive global coordinates  ←  Send global coordinates
                               Receive update            ←  Send global coordinates
------------------- Synchronise on modify barrier ---------------------
Modify state
Send state                  →  Receive state
Receive ACK                 ←  Send ACK
                               Send updates              →  Receive updates
                               Receive ACK               ←  Send ACK
                                                            Aggregate updates into
                                                            global coordinates
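The two-phase lockstep enforced by the discover and modify barriers can be sketched in plain Java, with java.util.concurrent.CyclicBarrier standing in for a JCSP Barrier. This is an analogy only; the names are illustrative.

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of one simulation step: every process synchronises on 'discover'
// before observing shared state, then on 'modify' before updating it.
public class TwoPhaseDemo {
    static int runStep(int processes) throws Exception {
        CyclicBarrier discover = new CyclicBarrier(processes);
        CyclicBarrier modify = new CyclicBarrier(processes);
        AtomicInteger modifications = new AtomicInteger();
        Thread[] workers = new Thread[processes];
        for (int i = 0; i < processes; i++) {
            workers[i] = new Thread(() -> {
                try {
                    discover.await();  // phase 1: everyone observes together
                    // (reads of shared state would happen here)
                    modify.await();    // phase 2: everyone updates together
                    modifications.incrementAndGet();
                } catch (InterruptedException | BrokenBarrierException e) {
                    Thread.currentThread().interrupt();
                }
            });
            workers[i].start();
        }
        for (Thread w : workers) w.join();
        return modifications.get();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(runStep(4));
    }
}
```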
All communications between processes are on a client-server basis. In effect, a clientserver relationship involves the client sending a request to the server, to which the server is guaranteed to respond [17,15]. Processes at the same level do not communicate directly with each other, only with the process at the next level up. As stated in [6]: “such communication patterns have been proven to be deadlock free”. See [15] for a proof.
2.2. Description of Space

The division of space between sites allows for a simulation that is scalable and robust, separating out the management of agents between many processes. The Site processes themselves have no knowledge of how they are situated. Each Agent class has a Map object that provides information about the area that a Site is associated with and the means with which the Agent can register with this Site. In this way, Site processes can be associated with spaces of any shape or size. These spaces can range from triangles and simple co-planar two dimensional areas, through complex three dimensional shapes, to higher dimensions with dynamically forming and shifting worm-holes. At the edges of each space, an Agent may either migrate to the next Site, or encounter a hard boundary, requiring it to change direction. This is determined by whether a reference to the adjacent Site exists. This is a reference to the Site’s register channel, an Any2OneChannel, which allows many writers (the Agents seeking to register) and only one reader (the destination Site). The registration process happens in two phases. First, the Agent must inform its current Site that it wishes to deregister, during the discovery phase. During the modify phase, before any other operation or communication is carried out, the Agent writes a reference to its communication channels to the register channel of the new Site, and waits for an acknowledgement. In this way, while an arbitrary number of Agents may wish to migrate from Site to Site at any one time, these attempts will always succeed. An image from the software is shown in Figure 2.
Figure 2. Pedestrian agents in the application showing the arc of their field of vision.
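The registration scheme described above — many migrating Agents writing to a single register channel read only by the destination Site — can be sketched in plain Java, with a shared queue standing in for the Any2OneChannel. This is an analogy; all names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch of the many-writers/one-reader register channel: any number of
// Agents may attempt to register concurrently and every attempt succeeds.
public class RegisterDemo {
    static int register(int agents) throws InterruptedException {
        BlockingQueue<String> registerChannel = new LinkedBlockingQueue<>();
        List<Thread> writers = new ArrayList<>();
        for (int i = 0; i < agents; i++) {
            final int id = i;
            Thread t = new Thread(() -> registerChannel.add("agent-" + id));
            writers.add(t);
            t.start();
        }
        for (Thread t : writers) t.join();
        // The destination Site is the only reader: it drains the channel.
        int registered = 0;
        while (registerChannel.poll() != null) registered++;
        return registered;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(register(5));
    }
}
```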
The current work implements simple reactive agents. These contain little in the way of intelligence in making their choices. Their field of view replicates that of humans: the span of human vision is 160 degrees, of which central vision occupies only 60 degrees, with peripheral vision on each side occupying 50 degrees [18]. The minimum distance between agents is delimited by the inner arc of their field of view. Should any other Agent approach this, it will react by choosing a different direction.
3. Results

A number of test runs were performed to evaluate how the simulation performed as the number of agents was incremented. This was done in order to demonstrate the scalability of the system. The test was carried out with one Site process, and the number of Agents was incremented by ten for each run. The results are summarised in Table 2.

Table 2. Results from test runs.

Number of agents   Total time   Avg time per step (ms)   Avg time per Agent (ms)
       10              151             15.11                    1.51
       20              412             20.62                    1.03
       30              845             28.17                    0.94
       40             1394             34.86                    0.87
       50             2057             41.13                    0.82
       60             2853             47.55                    0.79
       70             3741             53.44                    0.76
       80             5035             62.94                    0.79
       90             6207             68.97                    0.77
      100             7817             78.17                    0.78
      110             9401             85.46                    0.78
      120            11111             92.59                    0.77
      130            12923             99.41                    0.76
      140            15243            108.88                    0.78
      150            17454            116.36                    0.78
      160            20030            125.19                    0.78
      170            22251            130.89                    0.77
      180            25443            141.35                    0.79
      190            28694            151.02                    0.79
      200            31104            155.52                    0.78
As can be seen from Table 2, the time to update each Agent during each step of the simulation is more or less constant. This is illustrated in Figure 3. These average times tend to decrease as the number of Agents increases, reflecting the overhead of setting up support processes, such as the display processes. Thereafter, the average times per Agent tend to settle at around 0.78 ms. However, beyond a certain point, the Java Virtual Machine (JVM) is no longer able to allocate any more threads and throws an exception. At the same time, as discussed below, it is unlikely that the number of Agents will exceed thirty in this application.
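The two average columns in Table 2 are related by the agent count: the per-Agent figure is the per-step figure divided by the number of agents. A quick check against the first and last rows of the table:

```java
// Relation between the averages in Table 2:
//   avg time per Agent = avg time per step / number of agents
public class TableCheck {
    static double perAgent(double perStepMs, int agents) {
        return perStepMs / agents;
    }

    public static void main(String[] args) {
        System.out.println(perAgent(15.11, 10));   // ≈ 1.51 (first row)
        System.out.println(perAgent(155.52, 200)); // ≈ 0.78 (last row)
    }
}
```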
4. Conclusion

In this paper, the application of CoSMoS design patterns in the development of MAS simulations has been discussed. The principles of concurrent processing using non-conventional techniques based on CSP and π-calculus have been explained. The client-server pattern that guarantees livelock- and deadlock-free concurrency has also been discussed. This offers a firm foundation for future work using MAS to simulate pedestrian behaviour.
Figure 3. Average update times per Agent (ms) by test run.
Figure 4. Pedestrian trajectory recorded using infra-red sensors along a 15 × 4m corridor.
5. Future Work

Although there is an upper limit to the number of threads the JVM can allocate, it would be possible to increase the number of Agents by distributing the application across a number of JVMs and networked computers. This can be done using the NetBarrier feature of the development version of JCSP. However, this would require a redesign of the structure of the application, more in line with the occoids example described by Andrews et al. in [7] than with the layered server based structure of Quinn et al.’s work in [2]. Although many simple agents have been used to simulate emergent behaviour [19], the purpose of the TRAMP project [20] is the simulation of human behaviour derived from data collected by infra-red sensors. As shown in Figure 4, actual human movements, when navigating across a space, are described by elegant and coherent curves. This is difficult to replicate using simple agents. In order to achieve this aim, agents trained using Learning Classifier Systems (LCS) [21] will be developed, and their interactions studied. The training data will allow the creation of agents that display realistic behaviours. The aim of the TRAMP research project is to simulate simple interactions between individuals, such as overtaking, group behaviour and obstacle avoidance, at the microscopic level. Issues of emergence, at the macroscopic level, are not dealt with. The environment being studied is relatively small: a straight corridor measuring 15 × 4 metres. The density of pedestrians passing through rarely exceeds 30 people. However, this provides a rich set of data on microscopic behaviours.
References [1] J. Dijkstra, H.J.P. Timmermans, and A.J. Jessurun. A Multi-Agent Cellular Automata System for Visualising Simulated Pedestrian Activity. In S. Bandini and T. Worsch, editors, Theoretical and Practical Issues on Cellular Automata - Proceedings on the 4th International Conference on Cellular Automata for research and Industry, pages 29–36, October 2000. [2] M.J. Quinn, R.A. Metoyer, and K. Hunter-Zaworski. Parallel Implementation of the Social Forces Model. Pedestrian and Evacuation Dynamics, pages 63–74, 2003. [3] S. Stepney, P.H. Welch, J. Timmis, C. Alexander, F.R.M. Barnes, M. Bates, F.A.C. Polack, and A. Tyrrell. CoSMoS: Complex Systems Modelling and Simulation infrastructure, April 2007. EPSRC grants EP/E053505/1 and EP/E049419/1. URL: http://www.cosmos-research.org/. [4] S. Stepney, P.H. Welch, F.A.C. Pollack, J.C.P. Woodcock, S. Schneider, H.E. Treharne, and A.L.C. Cavalcanti. TUNA: Theory Underpinning Nanotech Assemblers (Feasibility Study), January 2005. EPSRC grant EP/C516966/1. Available from: http://www.cs.york.ac.uk/nature/tuna/index.htm. [5] P.H. Welch, B. Vinter, and F. Barnes. Initial Experiences with occam-pi Simulations of Blood Clotting on the Minimum Intrusion Grid. In H.R. Arabnia, editor, Proceedings of the 2005 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA’05), pages 201–207, Las Vegas, Nevada, USA, June 2005. [6] C.G. Ritson and P.H. Welch. A Process-Oriented Architecture for Complex System Modelling. In A.A. McEwan, W. Ifill, and P.H. Welch, editors, Communicating Process Architectures 2007, pages 249–266, July 2007. [7] P. Andrews, A.T. Sampson, J.M. Bjørndalen, S. Stepney, J. Timmis, D. Warren, P.H. Welch, and J. Noble. Investigating Patterns for the Process-Oriented Modelling and Simulation of Space in Complex Systems. In S. Bullock, J. Noble, R. Watson, and M.A. 
Bedau, editors, Artificial Life XI: Proceedings of the Eleventh International Conference on the Simulation and Synthesis of Living Systems, pages 17–24, 2008. [8] C.A.R. Hoare. Communicating Sequential Processes. Prentice Hall, 1985. ISBN: 0-13-153271-5. [9] R. Milner. Communicating and Mobile Systems: the π-calculus. Cambridge University Press, 1999. [10] P.H. Welch and F. Barnes. Communicating Mobile Processes: Introducing occam-pi. In A.E. Abdallah, C.B. Jones, and J.W. Sanders, editors, 25 Years of CSP, volume 3525 of Lecture Notes in Computer Science, pages 175–210, April 2005. [11] P.H. Welch, J. Moores, F.R.M. Barnes, and D.C. Wood. The KRoC Home Page, 2000. Available at: http://www.cs.kent.ac.uk/projects/ofa/kroc/. [12] P.H. Welch, N.C. Brown, J. Moores, K. Chalmers, and B. Sputh. Integrating and Extending JCSP. In A.A. McEwan, W. Ifill, and P.H. Welch, editors, Communicating Process Architectures 2007, pages 349–369, July 2007. [13] P.B. Hansen. Java’s Insecure Parallelism. ACM SIGPLAN Notices, 34:38–45, April 1999. [14] W.F. McColl. Scalable Computing. In Computer Science Today: Recent Trends and Developments, pages 46–61, 1996. [15] J.M.R. Martin and P.H. Welch. A Design Strategy for Deadlock-free Concurrent Systems. Transputer Communications, 3(4):215–232, October 1996. [16] M. Wooldridge. An Introduction to MultiAgent Systems. John Wiley and Sons, 2002. ISBN 9780470519462. [17] P.H. Welch, G.R.R. Justo, and C.J. Willcock. Higher-Level Paradigms for Deadlock-Free High-Performance Systems. In R. Grebe, J. Hektor, S.C. Hilton, M.R. Jane, and P.H. Welch, editors, Transputer Applications and Systems ’93, Proceedings of the 1993 World Transputer Congress, volume 2, pages 981–1004, Aachen, Germany, September 1993. [18] V. Bruce, P.R. Green, and M.A. Georgeson. Visual Perception: Physiology, Psychology, and Ecology. Psychology Press, New York, 1996. ISBN 1-84169-238-7. [19] V.J. Blue, M.J. Embrechts, and J.L. Adler.
Cellular Automata Modeling of Pedestrian Movements. In Computational Cybernetics and Simulation, 1997 IEEE International Conference on Systems, Man, and Cybernetics, volume 3, pages 2320–2323, Orlando, FL, 1997. IEEE. [20] S. Clayton, N. Urquhart, and J.M. Kerridge. Tracking and Analysis of the Movement of Pedestrians. In Third Annual Scottish Transport Applications and Research Conference, March 2007. [21] J.H. Holland. Studying Complex Adaptive Systems. Journal of Systems Science and Complexity, 19:1–8, March 2006.
Communicating Process Architectures 2009 P.H. Welch et al. (Eds.) IOS Press, 2009 © 2009 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-60750-065-0-205
An Investigation into Distributed Channel Mobility Support for Communicating Process Architectures Kevin CHALMERS and Jon KERRIDGE School of Computing, Edinburgh Napier University, Edinburgh, EH10 5DT {k.chalmers, j.kerridge}@napier.ac.uk Abstract. Localised mobile channel support is now a feature of Communicating Process Architecture (CPA) based frameworks, from JCSP and C++CSP to occam-π. Distributed mobile channel support has also been attempted in JCSP Networking and occam-π via the pony framework, although the capabilities of these two separate approaches are limited and have not led to the widespread usage of distributed mobile channel primitives. In this paper, an initial investigation into possible models that can support distributed channel mobility is presented, and these models are analysed for features such as transmission time, robustness and reachability. The goal of this work is to discover a set of models which can be used for channel mobility and also supported within the single unified protocol for distributed CPA frameworks. From the analysis presented in this paper, it has been determined that there are models which can be implemented to support channel end mobility within a single unified protocol and which provide suitable capabilities for certain application scenarios. Keywords. mobile channels, distributed computing, protocol support.
Introduction

Recent work in Communicating Sequential Processes for Java (JCSP) Networking has focused on refining the underlying architecture and protocol, as well as providing support for distributed mobility of processes and channels. Last year [1], a universal protocol to support distributed operations across all CPA frameworks was introduced. The initial version of the protocol was designed to reduce resource usage within JCSP Networking, as well as promote interoperability between the other CPA frameworks by having a well defined set of primitive network messages that can be understood by languages as diverse as occam-π and Python. The next stage in this work is to also provide channel end mobility support in the protocol, such that channel ends can be passed between, for example, a JCSP application and an occam-π application. The work presented in this article is an initial investigation into models to support distributed channel mobility within the CPA network protocol. This will lead to dynamic topology support, which is useful in fields such as mobile agents [2], complex systems [3] and pervasive computing [4]. The rest of this paper is broken down as follows. In Section 1, a discussion on distributed mobility in CPAs is presented, looking at the requirements to support such functionality. Section 2 presents potential models to support distributed channel mobility, and Section 3 analyses certain attributes of these models. Section 4 discusses possible protocol integration for these models. Section 5 presents future work and Section 6 provides conclusions.
1. Distributed Mobility in CPAs

Distributed mobility in CPAs refers to the ability to migrate a process or channel end in a distributed CPA application from one network node to another, in a manner that is transparent to the application (this is referred to as logical mobility [2]). Localised mobility support has been possible in JCSP since the initial version, due to the pass-by-reference semantics of Java. Mobility support for occam was introduced with occam-π [5], the emphasis being on providing correct mobility support. Distributed mobile processes have also been implemented in both JCSP [6] and occam-π [3], the former having further support for code mobility. Distributed channel end mobility has also been implemented in JCSP [6], and the pony framework supported distributed channel mobility for occam-π [7]. Trap [3] is a successor to pony that currently has no support for channel mobility. There are difficulties in implementing distributed channel and process mobility in a manner that still exhibits the behaviour that we would expect from both localised and distributed mobility.

1.1 Difficulties with Distributed Mobility against Localised Mobility

Previous work [6] examined the challenges of mobility in CPA frameworks; these are summarised in Table 1.

Table 1. Complexity of mobility.
Mobile Primitive      Local Mobility    Distributed Mobility
Input Channel End     Simple            Difficult
Output Channel End    Simple            Simple
Simple Process        Simple            Moderate
Complex Process       Simple            Very Difficult
On a single machine, mobility of channel ends and processes is relatively simple, requiring the passing of a reference from one process to another, occam-π hiding this underlying transaction from the developer. For distributed mobility, the implementation is more difficult. Output Channel End mobility is relatively simple, as it normally only requires the transmission of an address to send messages to. Input Channel End mobility is difficult, as it requires informing any Output Channel End(s) connected to the Input Channel End. Simple Process mobility refers to single processes, and the moderate difficulty refers to the inclusion of a code mobility system to support transparent process mobility. Complex Process mobility requires the suspension and subsequent resumption of a process network which has internal communication between the migrating processes. As the table indicates, the difficult problems to solve are Input Channel End mobility and Complex Process mobility. Output Channel End mobility is solved based on the chosen Input Channel End mobility solution, and Simple Process mobility has been solved in JCSP via code mobility support [6]. The focus of this article is Input Channel End mobility, which helps enable Complex Process mobility as discussed in Section 1.3.

1.2 Code Mobility

Logical mobility is discussed within the context of code mobility [8]. The code mobility paradigm discusses various models of mobile software components (e.g. mobile agents and client-server). Code mobility is also categorised into strong and weak mobility, the difference lying in the movement of active or passive components. An active component is one that has its own thread of control, whereas a passive component does not. Weak code
mobility requires non-stateful movement of a component from one networked node to another. Strong code mobility requires capturing the current execution state of an active component and transferring this to the new location. Both approaches include passive state capture (e.g. attributes of an object) and mobility of code. Execution state can be considered as the instruction pointer and call stack of an individual thread that is to be transferred. Strong code mobility is related to complex process mobility as discussed in Section 1.1. The difficulty in a platform such as Java is that the application developer does not have access to the internal instruction pointer or call stack of a thread, and therefore state capture of active components is difficult. Attempts have been made to overcome this limitation (for example, see [9,10,11]), although they require modified Java Virtual Machines (JVMs) or compilers. The code mobility viewpoint of logical mobility has limitations when analysed within software architecture models, as shall be discussed in the following two sub-sections.

1.2.1 Software Architecture

Generally, software architectures are defined by components and the connections between the components. For example, with CPA there are process components and channel connectors, and for object-orientation there are objects and the references between them. A system can be defined architecturally by the set of components and the connection relationships between them. Architectural elements can be further analysed by defining the ports (the inputs and outputs of a component) and the connection ends (inputs to a connection and the outputs from it). In CPAs, connection ends correspond to channel ends, although these are classified from the process point of view. Therefore a channel output end is the output from a process into a channel, and not the output from a channel. Ports can be considered as the set of events which a process operates on.
This definition is illustrated in Figure 1.
Figure 1. CPA architecture.
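The architectural vocabulary above — components with ports, connections with ends, and channel ends classified from the process's point of view — can be sketched as a few types. These names are illustrative and not taken from any CPA framework.

```java
// Sketch of the architectural terms: an end's direction is relative to the
// process, so an OUTPUT end is the process's output INTO a channel.
public class ArchitectureDemo {
    enum EndKind { INPUT, OUTPUT }

    static class ChannelEnd {
        final EndKind kind;
        ChannelEnd(EndKind kind) { this.kind = kind; }
    }

    static ChannelEnd outputEnd() { return new ChannelEnd(EndKind.OUTPUT); }
    static ChannelEnd inputEnd()  { return new ChannelEnd(EndKind.INPUT); }

    public static void main(String[] args) {
        System.out.println(outputEnd().kind);
    }
}
```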
1.2.2 Limitations of the Code Mobility View

Code mobility has a limitation from a software architecture point of view in that connection mobility is not considered. This leads to the situation where a mobile component in a code mobility system can be viewed as an isolated piece of data, an isolated component (which may have internal components) or a whole application with all the internal components and connectors persisted. A component does not take its external connections with it when it migrates. Initial work on the π-Calculus [12] considered that process mobility was enabled by channel mobility, whereas code mobility has not considered this approach in depth. There has been some discussion on coordination mobility support within logical mobility. Roman et al. [13] have argued that coordination and location are the most important factors for logical mobility, as coordination mobility enables the decoupling of components. Roman also argues that coordination mobility should be considered separately from component mobility. Phillips et al. [14] have argued for better modelling of communication between mobile components. Therefore, the current focus on distributed CPA mobility is on connection mobility to support component mobility.
K. Chalmers and J. Kerridge / Distributed Channel Mobility
1.3 Component Mobility

We define a more concise model of component mobility which overcomes the limitations of the code mobility model. A mobile element in a code mobility system can be considered to have the following structure:

• Code – the code defining the structure and behaviour of the mobile element. This is required in a code mobility system.
• State – the current state of the mobile element. This is further categorised into:
  o Passive state – the data attributes of the mobile component. This is required in a code mobility system.
  o Active state – the execution state of the mobile element. This is required for a strong code mobility system.
Our view of a mobile component has the following structure:

• Type – the type of the component. This describes its structure and behaviour. The type is required for interpretation at the receiving node in a distributed application. Further, the type may also include:
  o Code – the code, which may have to be loaded at the receiving end to allow interpretation of the mobile component. This is not a requirement for component mobility, particularly if we want to allow component mobility between diverse frameworks.
• State – the current state of the mobile component. This has three sub-elements:
  o Connection state – any connections to external components that the mobile component may have. This is a requirement for strong component mobility.
  o Data state – the attributes of the mobile component. This is required for any mobile component.
  o Behaviour state – the current execution state of the component. This is a requirement for strong code mobility.
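As a sketch, the component structure above might be represented by the following Go type. All field names and types here are illustrative assumptions, not part of JCSP or any existing framework.

```go
package main

import "fmt"

// ConnectionState holds the external channel ends a component carries with
// it when it migrates (hypothetical representation).
type ConnectionState struct {
	Ends []string // addresses of external channel ends
}

// MobileComponent mirrors the structure described in the text: a required
// type, optional code, and connection/data/behaviour state.
type MobileComponent struct {
	Type      string          // required: identifies structure and behaviour
	Code      []byte          // optional: only needed if the receiver must load code
	Conn      ConnectionState // required for strong component mobility
	Data      map[string]int  // data state: required for any mobile component
	Behaviour []byte          // execution state: required for strong code mobility
}

// IntMessage shows the most primitive mobile component, a plain integer
// message: only Type and Data are populated, with no code attached.
func IntMessage(v int) MobileComponent {
	return MobileComponent{Type: "int32", Data: map[string]int{"value": v}}
}

func main() {
	m := IntMessage(42)
	fmt.Println(m.Type, m.Data["value"])
}
```

The point of the sketch is that mobile data is just the degenerate case of the general structure, as the text argues.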
Our model of component mobility allows for the full definition of any mobile element that a system may have. For example, as only the type and data state are required for the most primitive form of mobile component, we can define mobile data (a simple message) within the mobile component structure.

1.4 Comparing Component Mobility to Code Mobility

In code mobility, strong and weak mobility are distinguished by the capturing and sending of current execution state with the mobile element. Component mobility requires both connection state and behaviour state for strong mobility. Unlike code mobility, there is no express requirement in component mobility to transfer code with the mobile element. The reason for taking this view is twofold. Firstly, we wish to be able to map primitive (well known) data messages within our definition. For example, 32-bit integers and strings are standard data types with no functionality (code) associated with them. Secondly, we want to acknowledge the ability to send a mobile component from one framework to another. It might become feasible, for instance, to have strong component mobility from a JCSP application to an occam-π application, provided the sent message contains no cyclic references. A uniform method of connection mobility between frameworks is required to support inter-framework component mobility.
The main addition that component mobility brings is the inclusion of connection state. This is not the internal connectivity of the mobile component but its external connected interface. Retaining this state allows the component to migrate in a manner that is transparent to other components in the system, as communication between components remains intact. Adequate connection state migration therefore enables transparent component mobility. In CPAs, channels are treated as first class, thereby decoupling a component from its connections. This is important for enabling strong mobility of both component and connection.

1.5 Difficulties in First Class Mobility of Object-Oriented Applications

Object-orientation does not exhibit first class component and connection mobility together. When running on a single machine, an object-oriented application passes references to objects during method invocation, and thus only connection mobility is evident. For a distributed application, the reverse is evident, with an object being serialized and copied from one networked node to another. There is no concept of passing an object reference from one application to another. There is a definite machine boundary in an object-oriented application which separates the distributed from the localised. Because of this limitation of object-orientation, mobility support in CPA can lead to more transparent mobile applications. The following section describes seven different models that can support distributed channel end mobility, and Section 3 analyses some of the properties of these models. A more in-depth discussion is provided elsewhere [15].

2. Models of Distributed Connection Mobility

Through examination of other techniques to support connection mobility, seven possible models to support channel mobility have been identified. These models are described in the following sections. This is not an exhaustive collection of models, although we have surveyed the available work within reason.
2.1 One-to-One Networked Channel

Networked channels are Any-to-One in that any number of output ends may connect to an input end. As it is unknown how many output ends may be connected to an input end, informing output ends of the movement of an input end is not a one-to-one communication. The One-to-One model is illustrated in Figure 2.

Figure 2. One-to-One networked channel.
Muller [16] has presented a mobile channel protocol that supports One-to-One communication. Channel end states are used and vary based on whether the end is locally or remotely connected, and each channel end knows the location of its corresponding
partner. When a channel end migrates, it informs its companion of the new location once it has arrived. Mobility is easier than in the standard Any-to-One model, as it can be guaranteed that the companion channel end has been notified of the new location.

2.2 Name Server

Storing mobile channel locations on a server is the approach taken by pony [7,17]. Each channel is allocated an identifier unique to the application context (the set of networked nodes that make up a single pony application). Identifiers are managed by a server which tracks the current location of the channel. When the channel end is migrated, the location is updated on the server. An output end connected to an input end can resolve this location and then connect directly to the input end. If the input end should later move, the output end retrieves the new location from the central server. This model is basically an extension of the common broker architecture used in distributed systems, and is illustrated in Figure 3.
Figure 3. Name server.
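A minimal sketch of the name server lookup, assuming a simple in-memory map from channel identifiers to host addresses; identifiers and addresses are illustrative only, not pony's actual wire format.

```go
package main

import "fmt"

// nameServer tracks the current host address of each mobile channel,
// keyed by an application-wide unique identifier.
type nameServer struct {
	location map[string]string // channel id -> current host address
}

// register records (or updates) the location of a channel input end.
func (ns *nameServer) register(id, addr string) { ns.location[id] = addr }

// resolve returns the current address of the input end; an output end
// then connects directly to this address.
func (ns *nameServer) resolve(id string) string { return ns.location[id] }

func main() {
	ns := &nameServer{location: map[string]string{}}
	ns.register("chan-1", "nodeA:4000")
	fmt.Println(ns.resolve("chan-1")) // output end connects directly here

	// The input end migrates: only the server entry changes, and a failed
	// send causes the output end to resolve the location again.
	ns.register("chan-1", "nodeB:4000")
	fmt.Println(ns.resolve("chan-1"))
}
```

The design point is that migration touches only the server entry; output ends keep their direct connections until a send fails and forces a re-resolution.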
All the other models may use a name server for channel end resolution, although this is not a requirement; a channel can be connected to using only its address. The name server model itself requires a name server, and also adds functionality to the server to support channel end mobility.

2.3 Message Box

Message boxes are the approach used within mobile agent frameworks [18], and the model previously proposed for JCSP Networking channel mobility [6]. The node declaring the input channel end creates a message box process, which allows the output end to send to a static address, and the input channel end to request the next message from the message box. The message box model is illustrated in Figure 4.
Figure 4. Message box.
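The message box process can be sketched directly with Go channels: producers send to a fixed address, and the (possibly migrated) input end pulls the next buffered message on request. Channel names are illustrative, not JCSP Networking's actual interface.

```go
package main

import "fmt"

// messageBox buffers messages from producers (in) and hands one out per
// consumer request (req), delivering on out. It sits at a static address
// so that output ends never need to track the input end's location.
func messageBox(in <-chan string, req <-chan struct{}, out chan<- string) {
	var queue []string
	pending := 0 // consumer requests waiting for a message
	for {
		select {
		case m, ok := <-in:
			if !ok {
				return
			}
			if pending > 0 {
				pending--
				out <- m
			} else {
				queue = append(queue, m)
			}
		case <-req:
			if len(queue) > 0 {
				out <- queue[0]
				queue = queue[1:]
			} else {
				pending++ // reply when a message arrives
			}
		}
	}
}

func main() {
	in := make(chan string)
	req := make(chan struct{})
	out := make(chan string)
	go messageBox(in, req, out)

	in <- "hello"     // producer sends to the static box address
	req <- struct{}{} // migrated input end requests the next message
	fmt.Println(<-out)
}
```

Note how every post-migration transmission costs two hops (producer to box, box to consumer), matching the Mn = 2·tmsg + taddr + tproto entry derived later in Table 3.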
2.4 Message Box Server

The message box model can be combined with a server, allowing creation of message boxes on the server instead of locally on a node [19]. Apart from the requirement of server-side creation, the operation of the server-controlled message box is identical to the standard message box model. This model is illustrated in Figure 5.
Figure 5. Message box server.
2.5 Chain

The chain model [20] requires each previous location of a channel end to forward any message on to the next location until the message reaches the current location of the input end. When an input end arrives at a new location, it informs the previous location of its new location. When an output end moves, the previous location is sent with the migration message, and is used to send to the previous output end location. Thus a chain of connections is formed, and any message must traverse the entire length of the chain. The model is illustrated in Figure 6.
Figure 6. Chain.
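The chain's forwarding behaviour, and its cost, can be sketched as a simple linked list of previous hosts; the structure is illustrative only.

```go
package main

import "fmt"

// link is one host in the chain: every previous location of the input end
// keeps a pointer to the next location and forwards messages along it.
type link struct {
	next  *link    // nil at the current location of the input end
	inbox []string // messages delivered at the current location
}

// deliver forwards a message down the chain to the current location,
// returning the number of hops taken.
func deliver(l *link, msg string) int {
	hops := 1
	for l.next != nil {
		l = l.next
		hops++
	}
	l.inbox = append(l.inbox, msg)
	return hops
}

func main() {
	// The input end has migrated twice: original -> a -> current.
	current := &link{}
	a := &link{next: current}
	original := &link{next: a}

	hops := deliver(original, "data")
	fmt.Println(hops, current.inbox[0]) // every message traverses the whole chain
}
```

Each migration adds one hop to every subsequent transmission, which is exactly why the Chain row of Table 3 grows as n·tmsg + n·tproto.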
As networked channels are Any-to-One, there will be chains of various lengths in operation. The length from the original input location to the current input location is determined by the number of migrations that have been made by the input end. The length of the chain for an output end depends on how far the output end has moved from its original location. Thus, as different output ends may traverse different distances, there will be multiple chain lengths in operation.

2.6 Reconfiguring Chain

To overcome the loop and transmission problems of the chain model [21], the chain can reconfigure itself by finding shortcuts to a previous link. Any loop is thereby removed, and transmission time may be reduced whenever the chain is shortened. The reconfiguring chain model is illustrated in Figure 7.
212
K. Chalmers and J. Kerridge / Distributed Channel Mobility
Figure 7. Reconfiguring chain.
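The shortcut-finding behaviour above can be sketched as follows: the migrating input end carries the list of its previous locations and, on arrival, reconnects to the oldest one it can still reach. Reachability is modelled here with a simple set, and all names are illustrative.

```go
package main

import "fmt"

// shortcut attempts to reconnect the newly arrived input end to the oldest
// reachable previous location, bypassing (and so removing) intermediate
// links and any loops in the chain.
func shortcut(previous []string, reachable map[string]bool) (string, bool) {
	// previous is ordered oldest first; trying the oldest link first makes
	// the resulting chain as short as possible.
	for _, loc := range previous {
		if reachable[loc] {
			return loc, true
		}
	}
	return "", false
}

func main() {
	previous := []string{"nodeA", "nodeB", "nodeC"} // oldest to newest
	reachable := map[string]bool{"nodeB": true, "nodeC": true}

	link, ok := shortcut(previous, reachable)
	fmt.Println(link, ok) // reconnects to nodeB, bypassing nodeC
}
```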
To achieve reconfiguration, a migrating channel end takes all previous locations in the chain with it. On arrival, the locations are iterated through and reconnection is attempted to the oldest possible link in the chain. Loops are removed, as a node can always shortcut to itself. Transmission time for messages can be reduced, as the most direct route between two nodes is used instead of the total distance travelled by the mobile end.

2.7 Mobile IP

Mobile IP [22] is used for physical device mobility within IP-based networks. Connections are registered with a home agent responsible for forwarding messages on to the current location of the input end. When a connection migrates, it informs the home agent, which buffers messages until the new location is resolved. The new location address is generated by the foreign agent within the domain of the channel end's new location. The home agent forwards received messages to the foreign agent, which forwards messages to the channel end's new location. Whenever the mobile end moves, the foreign agent informs the home agent, and the same migration process occurs. This model is illustrated in Figure 8.
Figure 8. Mobile IP.
To enable mobility between network sub-domains, tunnelling is used to allow messages to be sent to the new foreign agent. Tunnelling can be reproduced in a mobile channel context by utilizing a chain of agents that forward messages to the respective channel end location or next agent. The difference between an agent chain and a normal chain is that the agent chain is a fixed architecture which only grows when contact with a new domain occurs. This creates a hybrid model of chaining, server and message box. The agents act as both gateways between domains and routers of messages.
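The agent backbone described above can be sketched as a chain of forwarding agents, with the home agent as the producer's fixed point of contact; all names are illustrative.

```go
package main

import "fmt"

// agent is one node in the fixed backbone: the home agent or a foreign
// agent. Messages are tunnelled along forwardTo until they reach the
// agent local to the consumer's current domain.
type agent struct {
	forwardTo *agent   // next agent in the backbone, nil at the end
	delivered []string // messages handed to the local channel end
}

func (a *agent) send(msg string) {
	if a.forwardTo != nil {
		a.forwardTo.send(msg) // tunnel towards the current domain
		return
	}
	a.delivered = append(a.delivered, msg)
}

func main() {
	foreign := &agent{}
	home := &agent{forwardTo: foreign}

	home.send("data") // producers always address the home agent
	fmt.Println(foreign.delivered[0])

	// When the consumer moves again only the agents reconfigure; the
	// producer is unaffected and still addresses the home agent.
	foreign2 := &agent{}
	foreign.forwardTo = foreign2
	home.send("more")
	fmt.Println(foreign2.delivered[0])
}
```

Unlike the plain chain, this backbone is fixed and only grows on contact with a new domain, so producers are fully insulated from consumer migrations.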
2.8 Advantages and Disadvantages

Each of these models has certain advantages and disadvantages in comparison to the others. These are summarized in Table 2, and are of interest as they highlight where some of the models are more suitable than others in certain application scenarios.

Table 2. Advantages and disadvantages of mobility models.

One-to-One
  Advantages: direct connection; simple model.
  Disadvantages: no support for Any-to-One connections.

Name server
  Advantages: direct connection.
  Disadvantages: requires a name server.

Message box
  Advantages: all transmissions require only one hop.
  Disadvantages: requires the origin node to host the message box.

Message box server
  Advantages: all transmissions require only one hop.
  Disadvantages: requires a server to host the message box; the server may become overloaded.

Chain
  Advantages: channel ends can travel freely.
  Disadvantages: requires all previous nodes to support the chain; transmission time increases with each migration; a single node failure can break multiple chains; loops may exist in the chain.

Reconfiguring chain
  Advantages: channel ends can travel freely.
  Disadvantages: reconfiguring the chain takes time; some of the chain disadvantages may still exist.

Mobile IP
  Advantages: channel ends can travel freely.
  Disadvantages: requires a backbone of agents to support mobility; loops may exist.
3. Analysis of Connection Mobility Models

For analysis of the different models, a restricted addressing layout of standard TCP/IP-based communication networks is used. A network domain may consist of several sub-domains, which in turn consist of sub-domains, and so on. At the root of the domain tree is the global domain. Each node in the tree can be allocated an identifier to represent the domain in the hierarchy to which it belongs. Messages are sent between members of domains. Figure 9 presents an example domain tree. This layout is not a representation of the physical network layout, but rather the logical domain addressing mechanism in place. Each node in the tree has an identifier based on its domain branch. For example, leaf E has identifier G.A.C.E. A simplistic viewpoint of connectivity is taken, in that members of a sub-domain may connect to members of the same sub-domain and members of parent domains. Thus, any leaf in the tree can connect to any domain further up its branch until the global domain root node is reached. For example, a member of G.A.C.E can connect to a member of G.A.C, G.A, and G. This form of connectivity will be called addressability, implying that members of the node can address members in a given domain unambiguously.
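The addressability rule just described can be computed directly from a dotted identifier: a node can address every domain further up its own branch. A small sketch, with identifiers as plain strings:

```go
package main

import (
	"fmt"
	"strings"
)

// addressable returns the domains a node can address, i.e. every domain
// further up its own branch of the domain tree, nearest first. For
// example, G.A.C.E can address G.A.C, G.A and G.
func addressable(id string) []string {
	parts := strings.Split(id, ".")
	var domains []string
	for i := len(parts) - 1; i >= 1; i-- {
		domains = append(domains, strings.Join(parts[:i], "."))
	}
	return domains
}

func main() {
	fmt.Println(addressable("G.A.C.E")) // [G.A.C G.A G]
}
```

Note the asymmetry the text goes on to discuss: the result never includes sibling sub-domains such as G.A.C.F, so connections into a sub-domain must be initiated from below.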
Figure 9. Domain tree.
This view of addressability is taken to represent the fact that members of a given sub-domain may be given addresses which are also used in another sub-domain. For example, domain G.A.C.E may provide members with IP addresses in the standard local domain form 192.168.x.x. Domain G.A.C.F may also use this local domain addressing mechanism. Thus, a member of G.A.C.E may have the IP address 192.168.1.1, and so might a member of G.A.C.F. The domain tree structure ensures that this is not a problem.

As a sub-domain may address its parent domain, it becomes obvious that a member of a parent domain may be connected to a member of a sub-domain. However, this connection must be initiated by the member of the sub-domain; connectivity is allowed down the domain tree but not addressability. For the purposes of discussion, messages can travel up or down the tree but not both in a single operation. A message travelling up or down must be received by a domain member before being sent in the other direction. This is normally handled by routers within normal network architectures but, as mobile channels are logical connections, an equivalent logical router is needed to redirect the message.

The analysis presented considers input channel end mobility, as this is the most complicated to achieve. For an input channel end to be migrated, the architecture of the described model usually requires reconfiguration to ensure that messages are still received at the new input end location. For an output end, the majority of models permit the address or some other representation of the input end to be sent and a new output end to be created, effectively copying the output end at a new location. This is due to the Any-to-One architecture of a networked channel, where multiple output ends can connect to a single input end. Adding a new output end is trivial, so output end mobility involves adding a new output end and destroying the old one.

To aid in analysis, a number of values are defined.
These are standard message types used in the underlying protocol to support CPA networking [15]:

• proto – a protocol message without any data. Acknowledgement messages are also considered protocol messages. As these messages should be of fixed size, communication time is constant.
• addr – the size of a channel location structure. These structures are used to permit the output end of a channel to connect to an input end. addr may vary based on implementation, though communication time is considered constant.
• msg – a data message sent from one domain member to another. The size of msg is variable, and therefore communication time depends on message size.
To represent mobility, Mn is used, where n is the number of movement operations that have occurred since initial channel creation, with M0 representing a channel end that has not migrated. Four properties of these models are investigated: transmission time, reconfiguration time, reachability and robustness.

3.1 Transmission Time

Transmission time is the time taken for a sent data message to arrive at its destination. This is an important Quality of Service (QoS) property in any distributed application, and is therefore an important value to analyse. The time taken to transfer a message of a particular type is expressed by the function t and is based on the amount of data sent. For the purposes of discussion, for a single communication between two domain members (even members in different domains of a branch), t is not affected by the actual distance travelled up or down the domain tree. A summary of these values is presented in Table 3. For simplicity, we assume that the transmission time for a message is independent of other messages being sent. In all cases, a data message requires a subsequent acknowledgement, hence the msg and proto terms within these equations.

Table 3. Transmission time.
One-to-One
  Mn = tmsg + tproto
  Connections are always direct.

Name Server
  Mn = tmsg + tproto [+ tmsg + tproto]
  Connections are normally direct, although a connection may move, thus requiring a resend.

Message Box
  M0 = tmsg + tproto
  Mn = 2·tmsg + taddr + tproto
  The first transmission is always direct. Subsequent messages require a send to the message box and a request from the message box.

Message Box Server
  Mn = 2·tmsg + taddr + tproto
  As message box, although all sends are through the server.

Chain
  M0 = tmsg + tproto
  Mn = n·tmsg + n·tproto
  All messages travel the length of the chain.

Reconfiguring Chain
  M0 = tmsg + tproto
  tmsg + tproto ≤ Mn ≤ n·tmsg + n·tproto
  With no reconfiguration, messages travel the entire length of the chain. If reconfigured, there is the possibility of direct connections.

Mobile IP
  Mn = (up + down)·tmsg + (up + down)·tproto
  Messages travel through the domain agents up and down the domain tree.
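As a worked example of the expressions in Table 3, the following sketch evaluates three of the formulas under illustrative assumed costs (tmsg = 10 time units, tproto = taddr = 1); these values are not measurements.

```go
package main

import "fmt"

// directTime is the One-to-One cost: one message plus one acknowledgement.
func directTime(tmsg, tproto float64) float64 { return tmsg + tproto }

// chainTime is the Chain cost after n migrations: message and
// acknowledgement are repeated at every hop of the chain.
func chainTime(n, tmsg, tproto float64) float64 { return n*tmsg + n*tproto }

// boxTime is the Message Box cost after any migration: a send to the box
// plus an addressed request from the box, then an acknowledgement.
func boxTime(tmsg, taddr, tproto float64) float64 { return 2*tmsg + taddr + tproto }

func main() {
	tmsg, tproto, taddr := 10.0, 1.0, 1.0
	// After four migrations the chain costs four times the direct route,
	// while the message box stays at a constant two-hop cost.
	fmt.Println(directTime(tmsg, tproto), chainTime(4, tmsg, tproto), boxTime(tmsg, taddr, tproto))
}
```

This makes the trade-off concrete: chain cost grows linearly with migrations, while the message box pays a fixed premium over a direct connection.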
3.2 Reconfiguration Time

Reconfiguration time is the time taken to reconfigure the communication architecture to permit the new communication path created by the migration of a channel. The time to reconfigure the architecture is another important QoS consideration and will affect
transmission time. Reconfiguration complexity is represented by the parameter r, which takes three values: easy, for an architecture requiring little reconfiguration; mod, for an architecture that requires some extra functionality and link creation; and hard, for an architecture that requires a great deal of extra functionality and link creation to permit channel mobility. The time represented by r will generally be small in comparison to the time taken to transfer messages between nodes to allow reconfiguration. Transfer time is taken into consideration for message transfer and acknowledgement. The channel transfer time for all models is either a protocol message or an address message, except for the reconfiguring chain, which takes all previous addresses with it. Table 4 summarises these values.

Table 4. Reconfiguration time.
One-to-One
  Mn = reasy + 2·taddr + 2·tproto [+ tmsg]
  The sent mobile channel structure consists of an addr and acknowledgement, and this must also be sent to the companion channel end. A waiting message may also be sent with the channel.

Name Server
  Mn = reasy + 6·tproto + 2·taddr
  The input end must send the new address to the server (acknowledged), and the client requests and receives this address (both acknowledged).

Message Box
  Mn = reasy + taddr + tproto
  Reconfiguration is simply sending the address to the new location with an acknowledged message.

Message Box Server
  Mn = reasy + taddr + tproto
  As message box.

Chain
  Mn = reasy + 2·taddr
  The channel send contains the address and is acknowledged with the new address.

Reconfiguring Chain
  reasy + 2·taddr ≤ Mn ≤ rhard + (n – 1)·taddr
  In the worst case the channel end contains all previous addresses and must contact each to try to reconfigure. The best case is as chain.

Mobile IP
  Mn = rmod + 2·(up + down)·2·taddr + (up + down)·tproto
  The channel send contains two addresses (channel address and old address) and requires the new location to be sent back, which also contains two addresses. The send must then be acknowledged.
3.3 Reachability

Reachability is the set of domains where a channel input end can be hosted such that a channel output end can still successfully communicate with the input end within the defined model. This value is of interest as, in theory, we wish to send a channel end anywhere within a network and still provide connectivity between the input and output end. The problem lies in the domain architecture presented in Figure 9. For an output end to successfully connect to the input end, addressability must be possible. As addressability is only possible up a branch of the domain tree, supporting architecture is normally required to provide full connectivity across the entire domain tree. To discuss reachability, three sets of domains are defined:

• SUB_TREE – the domain in which the input end of the channel is located, and all the sub-domains of this domain
• BRANCH – the set of domains within the same branch as the input end, implying both up and down traversal of the domain tree
• GLOBAL – the set of all domains
As it is possible for a node within a domain to connect up the tree, any model that allows such a connection is deemed to permit an output channel end that has migrated over an existing connection to be connected to an input channel end down the tree via that existing connection; the One-to-One model is the exception, as shall be highlighted. Table 5 summarises reachability for the given models.

Table 5. Reachability.
One-to-One
  BRANCH (first); SUB_TREE ∩ BRANCH (subsequent)
  The first interaction is always between connected domain members. Further migrations can only occur within the domain of the input end.

Name Server
  SUB_TREE ∩ BRANCH
  Although the server is SUB_TREE reachable, the input end must also be reachable from the output end. If the input end migrates to a separate branch, the output end cannot reach the input end, so reach is restricted to a branch of the domain.

Message Box
  SUB_TREE ∪ BRANCH
  The host of the message box can be told to connect up the tree to the new input location, and thus reachability is the union of SUB_TREE and BRANCH.

Message Box Server
  SUB_TREE
  As the server is a dedicated entity, it would normally be connected to by both the output and input end, and due to addressability this gives reachability of SUB_TREE.

Chain
  GLOBAL
  The chain can stretch anywhere, providing GLOBAL reachability.

Reconfiguring Chain
  GLOBAL
  As chain.

Mobile IP
  GLOBAL
  The chain of agents can stretch through all domains, providing GLOBAL reachability.
3.4 Robustness

The robustness of a model defines how strong the individual connections are between the input end and the output end of a channel. Robustness is defined by the reliance on external elements. A server (e.g. in the message box server model) is a stronger element than a normal networked node in the application, due to the server being dedicated to supporting the network. Robustness is another key QoS property: if the connections between input and output ends are unreliable, there is more chance that an application will fail. For robustness there are three elements to consider:

• conn – a connection between two nodes
• node – a normal node in the network
• server – a server node in the network

The robustness of a model depends on how many of these individual elements are required to support channel mobility. The fewer elements required by the model, the stronger the model. Table 6 summarises the robustness of the various models.

Table 6. Robustness.
One-to-One
  Mn = conn
  There is only the connection between the input and output end. The connection between the sender and receiver is always direct.

Name Server
  Mn = conn
  A name server is required, but does not affect the robustness of the individual connection.

Message Box
  M0 = conn
  Mn = node + 2·conn
  Initially, the model permits direct connection from sender to receiver. Thereafter, connections from the individual ends to the message box are required.

Message Box Server
  Mn = server + 2·conn
  As message box, although a server is considered more robust than a hosting node. There is also no initial direct connection.

Chain
  M0 = conn
  Mn = n·(node + conn)
  Initially the chain is directly connected. All subsequent migrations require the previous nodes to stay operational to provide connectivity.

Reconfiguring Chain
  M0 = conn
  conn ≤ Mn ≤ n·(node + conn)
  The reconfiguring chain may be as weak as the normal chain in many regards, although a reconfiguration may result in a direct connection.

Mobile IP
  M0 = conn
  Mn = (up + down)·(conn + server)
  Initially, the model provides a direct connection. Although a chain of agents is then required for connectivity, these are considered dedicated entities in the architecture and thus provide moderate robustness to the connection backbone.
3.5 Summary

Table 7 summarises the different mobile channel models by placing them in order from best to worst under the respective property headings. Taking these attributes together, we can come to some firm categorisations of each of the models. These are not specific to certain hardware configurations, but concern the attributes that an application scenario might have as key considerations.

• Best (if Any-to-One is not required) – the One-to-One networked channel model provides the best transmission and reconfiguration times, as well as the strongest connectivity model. This comes at the cost of having restricted channel architectures, which can be especially problematic for applications requiring a server-type solution where multiple clients connect to a single server. Reachability is also poor.

• Cluster – the name server model provides good transmission time, reconfiguration time and robustness. Reachability is poor, but a cluster is in a centralised domain. If no Any-to-One connections are required, then the One-to-One model provides a better solution.

• Global connectivity – only three models provide global migration of channels while still allowing connectivity between input and output channel ends. Of these, the two chain models are not robust and have high transmission times. Therefore, if global connectivity is required, the Mobile IP model is best. This comes at the cost of having a backbone of agents to handle routing and reconfiguration.

Table 7. Summary of mobile channel models.
Transmission Time: One-to-One; Name server; Message box; Message box server; Reconfiguring chain; Mobile IP; Chain.

Reconfiguration Time: One-to-One; Message box server; Message box; Chain; Name server; Mobile IP; Reconfiguring chain.

Reachability: Chain; Reconfiguring chain; Mobile IP; Message box; Message box server; One-to-One; Name server.

Robustness: One-to-One; Message box server; Name server; Mobile IP; Message box; Reconfiguring chain; Chain.
It can be seen, then, that no single model ideally suits all scenarios. As part of this work is to implement messages within the underlying network protocol to allow channel mobility between diverse frameworks, this becomes a problem, as different frameworks generally have different application scenarios in mind. Further investigation into the required protocol messages is provided in the following section.
4. Protocol Integration

In this section only a brief discussion is presented; a full discussion of the individual states and protocol messages required to support the various models is provided elsewhere [15]. In general, there are two important operations that must be supported by the protocol. The first is the migration of an input channel end (MIGRATE_INPUT), and the second is the migration of an output channel end (MIGRATE_OUTPUT). These provide the two most fundamental message types for each model. Subsequent to these two messages, there is a requirement for informing another entity of the arrival of a channel end at a new location, usually in the form of the address of the new channel location. This depends on the type of model being used. Beyond these most primitive message types, each model requires its own set of messages for the reconfiguration of the underlying network architecture to support the migration of a channel end. Table 8 summarises the required protocol messages.

4.1 Summary

An analysis of the required protocol messages shows a number of commonalities between the separate models, which are in fact the three basic messages defined at the beginning of this section. These three messages (MIGRATE_INPUT, MIGRATE_OUTPUT and MOVED) are common to the majority of models, and are the only messages in four of the models. Although adding these messages to the underlying network protocol may allow the separate models to be used transparently, further work is required to discover whether these models can all be supported within the protocol. Some of the messages require extra information within them to support the necessary level of functionality, and the different states and underlying architectures may cause a problem. Therefore further examination in these areas is required.

5. Future Work

More analysis work in this area is required.
The goal of this work is to provide an initial examination of these models with regard to their suitability for supporting channel end mobility. From these attributes, scenarios can be developed that can be examined further within the context of the models presented. Initially, simulation of each of the individual models in a suitable simulation environment is required. There are a number of network simulation tools available, and implementing each of these models within a simulator can help determine whether any further messages or channel states are required to support the mobile channel architecture. Using a network simulator to simulate the underlying network protocol and architecture more generally would also be advantageous, although some verification work has already been undertaken on JCSP Networking [15]. The individual models have yet to be implemented to examine the practical usage of each in real situations. This is one piece of work that must be carried out to determine whether or not the models are individually capable of supporting channel end mobility in a manner that is transparent to the user. The implementations can then have actual QoS properties measured and compared against the anticipated values. The actual required states and protocol messages can also be determined. Furthermore, suitable application models can be tested for suitability within each model.
Table 8. Protocol messages.

MIGRATE_INPUT [All models]: Sent from the current hosting node of an input channel end to the new host node when an input channel end is migrated. Essentially a SEND of the input channel end.

MIGRATE_OUTPUT [All models]: As MIGRATE_INPUT but for an output channel end.

MOVED (a) [One-to-One]: Sent to the companion channel end to inform it of a location change.

MOVED (b) [Name Server]: Sent from the previous location of an input end to inform the output end that it should resolve the new location of the input end from the name server.

MOVED (c) [Chain, Reconfiguring Chain, Mobile IP]: Sent from a node to a previous link to inform it of the new location of the input channel end and that messages should be forwarded to this location. For the Mobile IP model this message is sent between the routing agents to reconfigure the channel path.

MOVING [Name Server]: Sent by the host of the input end to inform the name server that the channel end is about to move and that subsequent address resolutions should be buffered.

ARRIVED [Name Server]: Sent by the receiver of an input end to inform the server that the input end has a new address and any pending resolutions may complete.

RESOLVE [Name Server]: Sent to the server to request the current address of a given channel.

RESOLVE_REPLY [Name Server]: Sent from the server as a response to the address resolution message.

CHECK [Message Box, Message Box Server]: Sent by the input end to check if any messages are waiting in the message box. This is required for guarded operations on the input channel.

CHECK_RESPONSE [Message Box, Message Box Server]: Sent in reply to a CHECK request. The response is immediate, although a later response may occur when a message appears in the message box. The message is dropped if the guarded operation completed prior to this.

REQUEST [Message Box Server]: Request the next available message in the message box.
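The three common messages lend themselves to a very small wire format. The sketch below is purely illustrative (the type and field names are ours, not those of the protocol definition in [15]), but it shows the minimal content each message needs: a message type, the identity of the mobile channel, and, for a MOVED notification, the address of the new location.

```python
# Hypothetical sketch of the three common protocol messages; all type and
# field names here are invented for illustration, not taken from [15].
from dataclasses import dataclass
from enum import Enum, auto

class MessageType(Enum):
    MIGRATE_INPUT = auto()   # migrate an input channel end to a new host
    MIGRATE_OUTPUT = auto()  # migrate an output channel end to a new host
    MOVED = auto()           # report a channel end's new location

@dataclass
class ChannelMessage:
    msg_type: MessageType
    channel_id: int          # hypothetical: identifies the mobile channel
    new_location: str        # hypothetical: address of the end's new node

# A MOVED notification carrying the address of the channel's new host:
msg = ChannelMessage(MessageType.MOVED, channel_id=42, new_location="node-b:4000")
print(msg.msg_type.name, msg.new_location)
```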
Verification work is also required beyond the general simulation of the mobile channel models. It is important to examine the models to ensure that they exhibit the expected behaviour, as well as to check them for fundamental problems such as deadlock and livelock. More verification work on the new network protocol and architecture is also required to analyse behaviour. Finally, work is still ongoing on implementing the common protocol and a supporting architecture within the various CPA based distributed application frameworks. Work is ongoing with PyCSP [23], and the protocol is set to be implemented in Communicating Haskell Processes [24] and occam-π in the future. Work on a reduced JCSP version for small devices is also underway.

6. Conclusions

In this paper, we have presented a number of models that could support distributed channel mobility in CPA based frameworks. Each of these models shows promise in supporting the required functionality, but when analysed against critical attributes such as message transmission time and robustness, not all are fit for application scenarios that require reliability or a certain level of Quality of Service. However, a number of models do show potential for supporting particular application scenarios very well, in particular the name server approach for cluster computing work. This highlights that pony [7] did use the correct model for its application area. The problem lies in finding a model that supports as many application scenarios as possible, which may be difficult. To further support channel mobility, it is likely that a set of models is required, supported transparently by the underlying protocol. An analysis of the required protocol messages has highlighted three messages that are generally required, and a number of models that can be supported by a small number of protocol messages.
Potentially, this means that the protocol can have channel mobility built in, and the underlying application architecture can support mobility in the manner best fitting the application scenario. Further work is required to analyse this potential.

References

[1] K. Chalmers, J. Kerridge, and I. Romdhani, "A Critique of JCSP Networking," in Communicating Process Architectures 2008, P. H. Welch et al., Eds. Amsterdam, The Netherlands: IOS Press, 2008, pp. 271-291.
[2] G. P. Picco, "Mobile Agents: An Introduction," Microprocessors and Microsystems, 25(2), pp. 65-74, 2001.
[3] F. A. C. Polack, P. S. Andrews, and A. T. Sampson, "The Engineering of Concurrent Simulations of Complex Systems," in 2009 IEEE Congress on Evolutionary Computation. IEEE Press, 2009, pp. 217-224.
[4] M. Satyanarayanan, "Pervasive Computing: Vision and Challenges," IEEE Personal Communications, 8(4), pp. 10-17, 2001.
[5] P. H. Welch and F. R. M. Barnes, "Communicating Mobile Processes - Introducing occam-pi," in Communicating Sequential Processes: The First 25 Years - Symposium on the Occasion of 25 Years of CSP, A. E. Abdallah, C. B. Jones, and J. W. Sanders, Eds. Berlin / Heidelberg, Germany: Springer, 2005, pp. 175-210.
[6] K. Chalmers, J. Kerridge, and I. Romdhani, "Mobility in JCSP: New Mobile Channel and Mobile Process Models," in Communicating Process Architectures 2007, A. A. McEwan et al., Eds. Amsterdam, The Netherlands: IOS Press, 2007, pp. 163-182.
[7] M. Schweigler and A. T. Sampson, "pony - The occam-pi Network Environment," in Communicating
Process Architectures 2006, P. H. Welch, J. Kerridge, and F. R. M. Barnes, Eds. Amsterdam, The Netherlands: IOS Press, 2006, pp. 77-108.
[8] A. Fuggetta, G. P. Picco, and G. Vigna, "Understanding Code Mobility," IEEE Transactions on Software Engineering, 24(5), pp. 342-361, 1998.
[9] J. Howell, "Straightforward Java Persistence Through Checkpointing," in Proceedings of the 3rd International Workshop on Persistence and Java (PJW3): Advances in Persistent Object Systems, D. Kotz and F. Mattern, Eds. Morgan Kaufmann Publishers, Inc., 1999, pp. 322-334.
[10] D. Weyns, E. Truyen, and P. Verbaeten, "Serialization of Distributed Execution-state in Java," in Objects, Components, Architectures, Services, and Applications for a Networked World: International Conference NetObjectDays, NODe 2002, M. Aksit, M. Mezini, and R. Unland, Eds. Berlin / Heidelberg, Germany: Springer, 2003, pp. 41-61.
[11] W. Zhu, C.-L. Wang, W. Fang, and F. C. M. Lau, "A New Transparent Java Thread Migration System Using Just-In-Time Recompilation," in The 16th IASTED International Conference on Parallel and Distributed Systems: PDCS 2004, T. Gonzalez, Ed. ACTA Press, 2004, pp. 766-771.
[12] R. Milner, J. Parrow, and D. Walker, "A Calculus of Mobile Processes, I," Information and Computation, 100(1), pp. 1-40, 1992.
[13] G.-C. Roman, G. P. Picco, and A. L. Murphy, "Software Engineering for Mobility: A Roadmap," in Proceedings of the Conference on the Future of Software Engineering. ACM Press, 2000, pp. 241-258.
[14] A. Phillips, N. Yoshida, and S. Eisenbach, "A Distributed Abstract Machine for Boxed Ambient Calculi," in Programming Languages and Systems: 13th European Symposium on Programming, ESOP, D. Schmidt, Ed. Berlin / Heidelberg, Germany: Springer, 2004, pp. 155-170.
[15] K. Chalmers, "Investigating Communicating Sequential Processes for Java to Support Ubiquitous Computing," Edinburgh Napier University, Edinburgh, PhD Thesis 2009.
[16] H. Muller and D.
May, "A Simple Protocol to Communicate Channels over Channels," in Proceedings 4th International Euro-Par Conference: Euro-Par'98 Parallel Processing, D. Pritchard and J. Reeve, Eds. Berlin / Heidelberg, Germany: Springer, 1998, pp. 591-600.
[17] M. Schweigler, "A Unified Model for Inter- and Intra-Process Concurrency," The University of Kent, Canterbury, PhD Thesis 2006.
[18] X. Zhong and C.-Z. Xu, "A Reliable Connection Migration Mechanism for Synchronous Transient Communication in Mobile Codes," in International Conference on Parallel Processing 2004. IEEE Computer Society, 2004, pp. 431-438.
[19] A. R. Silva, D. D. Ramao, and M. M. da Silva, "Towards a Reference Model for Surveying Mobile Agent Systems," Autonomous Agents and Multi-Agent Systems, 4(3), pp. 187-231, 2001.
[20] J. M. Molina, J. M. Corchado, and J. Bajo, "Ubiquitous Computing for Mobile Agents," in Issues in Multi-Agent Systems, A. Moreno and J. Pavon, Eds. Birkhauser Basel, 2007, pp. 33-57.
[21] F. Baude, D. Caromel, F. Huet, and J. Vayssiere, "Communicating Active Mobile Objects in Java," in High Performance Computing and Networking: 8th International Conference, HPCN Europe 2000, M. Bubak et al., Eds. Berlin / Heidelberg, Germany: Springer, 2000, pp. 633-643.
[22] C. E. Perkins, "Mobile IP," IEEE Communications Magazine, 40(5), pp. 66-82, 2002.
[23] J. M. Bjorndalen, B. Vinter, and O. Anshus, "PyCSP - Communicating Sequential Processes for Python," in Communicating Process Architectures 2007, A. A. McEwan et al., Eds. Amsterdam, The Netherlands: IOS Press, 2007, pp. 229-248.
[24] N. C. C. Brown, "Communicating Haskell Processes: Composable Explicit Concurrency using Monads," in Communicating Process Architectures 2008, P. H. Welch et al., Eds. Amsterdam, The Netherlands: IOS Press, 2008, pp. 67-83.
[25] L. Bass, P. Clements, and R. Kazman, Software Architecture in Practice, 2nd ed. Addison Wesley, 2003.
Communicating Process Architectures 2009 P.H. Welch et al. (Eds.) IOS Press, 2009 © 2009 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-60750-065-0-225
Auto-Mobiles: Optimised Message-Passing

Neil C. C. BROWN
School of Computing, University of Kent, Canterbury, Kent, CT2 7NF, England
[email protected]

Abstract. Some message-passing concurrent systems, such as occam 2, prohibit aliasing of data objects. Communicated data must thus be copied, which can be time-intensive for large data packets such as video frames. We introduce automatic mobility, a compiler optimisation that performs communications by reference and deduces when these communications can be performed without copying. We discuss bounds for speed-up and memory use, and benchmark the automatic mobility optimisation. We show that in the best case it can transform an operation from being linear with respect to packet size into constant-time.

Keywords. message-passing, mobility, optimisation
Introduction
Aliasing in concurrent systems can lead to data race-hazards when two or more concurrently executing processes are able to modify the same data without restriction. The non-deterministic order of the modifications affects the program's subsequent behaviour. Process-oriented programming eliminates this aliasing of mutable objects in favour of message-passing: self-contained processes communicate data between each other without any sharing.

In implementations of the occam 2 programming language, all communicated data is copied, which can be expensive for large messages (for example, video frames). The concept of mobility was introduced in occam-π [1], which (amongst other things) allows data to be mobile – that is, held by reference [2]. Aliasing is prevented by a transferral or movement semantics that guarantees that the reference is never held by more than one variable. Movement semantics state that after an assignment or communication, the source variable becomes undefined, because the reference has moved to the destination variable. Communication of mobile data is much more efficient, as only the reference to the data need be communicated. (In this paper we assume a common, uniform memory architecture.)

Data mobility increases the burden on the programmer by requiring the addition of annotations to designate data as mobile, and requiring the use of the simple but unusual movement semantics. Thus, the efficiency of mobile data comes at a price to the programmer and can be a barrier for newcomers learning the language.

In this paper we present a technique for gaining the efficiency advantages of data mobility without any deviation from the standard copy semantics and unadorned syntax of occam 2. We call this technique automatic mobility. Automatic mobility is a compiler optimisation for occam 2 programs that requires no change to the original code and neither syntactic nor semantic changes to the occam 2 language.
Automatic mobility also generalises to other message-passing languages; it is not a technique specific to occam 2.
N.C.C. Brown / Auto-Mobiles
(a)

PROC id (CHAN LARGE.DATA in, CHAN LARGE.DATA out)
  LARGE.DATA s:
  SEQ i = 0 FOR 1000
    SEQ
      in ? s
      out ! s
:

(b)

PROC id (CHAN MOBILE LARGE.DATA in, CHAN MOBILE LARGE.DATA out)
  INITIAL MOBILE LARGE.DATA s IS MOBILE LARGE.DATA:
  SEQ i = 0 FOR 1000
    SEQ
      in ? s
      out ! CLONE s
:

(c)

[Flow graph of the process, with nodes: "scope in: s"; "in ? s"; "SEQ i = .."; "out ! s"; "scope out: s"; "done".]

(d)

PROC id (CHAN MOBILE LARGE.DATA in, CHAN MOBILE LARGE.DATA out)
  INITIAL MOBILE LARGE.DATA s IS MOBILE LARGE.DATA:
  SEQ i = 0 FOR 1000
    SEQ
      in ? s
      out ! s
:
Figure 1. An example of converting a variant of the occam 2 identity process (a), first into its exact mobile occam-π semantic equivalent (b). The flow graph of the process (c) is then used to turn this into the auto-mobile version (d), based on the observation that the data is not needed in between the output and being overwritten. This can be observed by following the dotted lines in the flow graph (c), which lead to an overwrite via an input or to going out of scope.
1. Automatic Mobility

Automatic mobility comprises two compiler transformations. The first, the mobilisation transformation, is to store all large¹ data items in the program's shared heap, rather than the usual occam 2 implementation of storing them in the workspace/stack. If these arrays are allocated at the point of declaration, freed when they go out of scope, and copied on assignment and output, the occam 2 semantics are directly preserved by this transformation. This transformation can be thought of as turning all data mobile, but always CLONEing² it rather than moving it. An example of converting a slightly modified identity process is given in Figure 1a (the original) and Figure 1b (the transformed version).

The key insight of automatic mobility is that once all the inputting processes are expecting to receive a reference, the outputting process has the option either to allocate a new copy of the data and send that (i.e. to send a CLONE), or to send the original reference. If the outputting process will not use the data again after the output, it can send the original reference and discard it. This is the second transformation: the copy/move decision. We continue the earlier example, with the control-flow graph in Figure 1c and the final transformed version in Figure 1d. The details of the move/copy decision are discussed in section 2.

We will consider in this paper how to mobilise arrays in the most general case. Records can be conceived of as arrays (albeit heterogeneous), and other data (e.g. integers) can be considered to be an array with one element. In occam 2, arrays are the typical way to store large amounts of data.
¹ What we determine to be large will be guided by the benchmarks in section 8.
² CLONE is the occam-π syntax for making a copy of an item of mobile data.
2. Mobility Rule

Our rule for deciding whether an outputting process should move or copy is simple: an array can be moved iff no array element can possibly be subsequently read from, before being overwritten or the array going out of scope. Otherwise, it should be copied.

We will first consider implementing the mobility rule for operations that involve the entire array, such as reading an array from a channel; operations on individual elements are discussed in section 5. The first step to implementing the mobility rule is to generate a control-flow graph for the occam PROCedure. This graph is processed to calculate two sets for each node: the sequentially-later set of variables, and the in-parallel set of variables.

Discovering information about sequentially-later uses of a variable involves an iterative data-flow algorithm [3, p. 231]. A set of variables is calculated at each node by taking the union of all the variables read at that node and the variable-sets from future-sequential nodes, minus the set of all variables written to at that node. The algorithm iterates to a fixed point.

The occam compiler enforces a CREW (Concurrent-Read, Exclusive-Write) safety rule, but this permits concurrent reads. Therefore it is possible that even though a variable will not be read from by code that is sequentially later in the flow-graph, it may be read from by a node in parallel that happens to execute later. We deal with this by also finding all nodes that are in parallel with each other (trivial from the abstract syntax tree of a program) and recording the read-from variables.

Determining whether an array is read from after it is sent on a channel is a matter of examining the two sets (sequentially-later and in-parallel reads) at the corresponding node. If the variable is in neither set, it can be moved. If it is in either or both sets, it must be copied. The analysis for the move/copy decision is performed solely by examining the writer process for a particular channel-end.
No information is known nor assumed about the reading process. This means that the analysis is robust in the face of separate compilation, provided all compilation units have the automatic mobility feature enabled, since the mobilisation transformation will then have been performed on the reader.
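The analysis above can be sketched compactly. The following Python fragment is our own illustration (not Tock's implementation, and the node names are invented): it iterates the sequentially-later read sets to a fixed point over a control-flow graph, then applies the move/copy decision at an output node, using the identity process's two-node loop as an example.

```python
# Sketch of the iterative data-flow pass described above (our own code,
# not Tock's): later[n] = (reads(n) | union of later over successors) - writes(n),
# iterated to a fixed point.

def read_later_sets(nodes, succ, reads, writes):
    """For each CFG node, the variables that may be read sequentially later."""
    later = {n: set() for n in nodes}
    changed = True
    while changed:                       # iterate to a fixed point
        changed = False
        for n in nodes:
            new = set(reads[n])
            for s in succ[n]:
                new |= later[s]
            new -= writes[n]             # a write to the variable kills it
            if new != later[n]:
                later[n] = new
                changed = True
    return later

def can_move(var, node, later, succ, parallel_reads):
    """An output of `var` at `node` may be a move iff no sequentially-later
    node may read it and no node running in parallel reads it."""
    seq_later = set()
    for s in succ[node]:
        seq_later |= later[s]
    return var not in seq_later and var not in parallel_reads.get(node, set())

# The identity process: a loop of  in ? s  /  out ! s
nodes = ["input", "output"]
succ = {"input": ["output"], "output": ["input"]}
reads = {"input": set(), "output": {"s"}}
writes = {"input": {"s"}, "output": set()}
later = read_later_sets(nodes, succ, reads, writes)
print(can_move("s", "output", later, succ, {}))
```

With an empty in-parallel set the output qualifies as a move; passing a parallel-reads set containing s (as in the delta process of section 4.2) forces a copy instead.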
3. Allocations

For occam 2 the semantics are that an array is available from the point of its definition, with undefined contents, and continues to be available for reads and writes until the last use. This can be emulated with dynamic arrays by allocating an array at the point of definition and following the automatic mobility rules. Allocating an array at the point of definition can be inefficient, however. For example, consider:

1 [64]INT array:
2 SEQ
3   in ? array
The array would be allocated on line 1, and then immediately deallocated when the new array is received from the channel on line 3. To avoid this, we use the control-flow graph. If an array is not used before it is written-to in its entirety (for example, by reading from a channel), the array does not need to be allocated. In all other cases, it must still be allocated.
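This allocation-elision check can also be phrased over the control-flow graph. The sketch below (our own illustration; the node names are invented) walks forward from the declaration and reports that an allocation is needed only if some path can use the array before a node that overwrites it in its entirety.

```python
# Our own sketch of the allocation-elision rule: allocate at the declaration
# only if the array can be used before being entirely overwritten.

def needs_allocation(var, decl, succ, uses, full_writes):
    """Depth-first walk from the declaration node: a use before a full
    overwrite means the declaration must allocate."""
    seen, stack = set(), list(succ[decl])
    while stack:
        n = stack.pop()
        if n in seen:
            continue
        seen.add(n)
        if var in uses.get(n, set()):
            return True              # used before a full overwrite
        if var in full_writes.get(n, set()):
            continue                 # entirely overwritten: stop this path
        stack.extend(succ[n])
    return False

# The example above:  [64]INT array: / SEQ / in ? array
succ = {"decl": ["recv"], "recv": []}
print(needs_allocation("array", "decl", succ, uses={}, full_writes={"recv": {"array"}}))
```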
4. Examples

In this section we present several example processes and explain the automatic mobility transformation's effect on the processes.

4.1. Identity Process

The simplest example is that of the identity process that forwards values from one channel to another:
 4 PROC id (CHAN [64]INT in, CHAN [64]INT out)
 5   [64]INT s:    -- Definition outside the loop,
 6   WHILE TRUE
 7     [64]INT s:  -- or inside the loop
 8     SEQ
 9       in ? s
10       out ! s   -- becomes a move
11 :
The flow analysis is subtly different if the array is defined outside the loop, but the result is the same. With the definition inside the loop, s is analysed as never being used again after its output. With the definition outside the loop, s is analysed as not being used again before being completely overwritten. Either way, s is moved during the output, and will also not be allocated at the point of definition (in both cases, it will be overwritten entirely by the input). The same results apply to transformation processes, such as this amplifier process:
12 PROC amp (CHAN [64]INT in, VAL INT factor, CHAN [64]INT out)
13   WHILE TRUE
14     [64]INT s:
15     SEQ
16       in ? s
17       SEQ i = 0 FOR 64
18         s[i] := s[i] * factor
19       out ! s   -- becomes a move
20 :
Automatic mobility is not just for communications (although that is the most common case), but also works for assignments. Thus if we reimplement the identity process as follows: 21 22 23 24 25 26 27 28
21 PROC id.2 (CHAN [64]INT in, CHAN [64]INT out)
22   WHILE TRUE
23     [64]INT s, t:
24     SEQ
25       in ? s
26       t := s    -- becomes a move
27       out ! t   -- becomes a move
28 :
Both the assignment and the output become moves, and thus this process is no more expensive than the original identity process, even though there is an extra assignment of the array.

4.2. Delta Process

A delta process is one that reads in data from one channel and sends out the same data on several channels. Its definition is:
29 PROC delta (CHAN [64]INT in, []CHAN [64]INT out!)
30   WHILE TRUE
31     [64]INT s:
32     SEQ
33       in ? s
34       PAR i = 0 FOR SIZE out
35         out[i] ! s   -- becomes a copy
36 :
This will result in cloning for all the outputs, because they are in parallel with each other. Knowing which process is the last to output (and could thus perform a move) is difficult when the outputs are in parallel – but this would be a worthwhile goal, especially if there are only two output channels, as is commonly the case. With two output channels, two clones could become one clone and one move. There are several possible solutions. One is to make the outputs sequential, at which point the compiler could unroll the last loop iteration and turn that into a move. Another solution would be to pull out the copying of the data to outside the PAR³:

37 PROC delta (CHAN [64]INT in, []CHAN [64]INT out!)
38   [SIZE out][64]INT ss:
39   WHILE TRUE
40     SEQ
41       in ? ss[0]
42       SEQ i = 1 FOR (SIZE out) - 1
43         ss[i] := ss[0]
44       PAR i = 0 FOR SIZE out
45         out[i] ! ss[i]
46 :
For the compiler to spot the necessary optimisations, it must be able to handle the array indexing to spot that the assignments (line 43) must be clones, but that the communications (line 45) can be moves. In this case, it is straightforward as the array is not used again (in whole or in part) after the communications before an overwrite. It is possible that this transformation could be performed by the compiler, where it spots a delta-like pattern in a PROC.

4.3. Merging Process

A merging process is one that reads in data from several channels, and somehow turns them into a single output using a folding operation. For example, this sum process zips together many arrays using addition:

47 PROC sum ([]CHAN [64]INT ins, CHAN [64]INT out)
48   WHILE TRUE
49     [64]INT acc, s:
50     SEQ
51       SEQ i = 0 FOR 64
52         acc[i] := 0
53       SEQ j = 0 FOR SIZE ins
54         SEQ
55           ins[j] ? s
56           SEQ i = 0 FOR 64
57             acc[i] := acc[i] + s[i]
58       out ! acc
59 :
³ Note that the dynamic array dimension is not legal occam 2.1, but see the appendix.
This process receives many arrays, and will deallocate them all before the next is read. The new accumulated total will be allocated on each iteration of the loop and sent out with a move (as with the identity process, this is true regardless of whether the accumulator is declared inside or outside the loop). There is an opportunity to prevent this allocation, by re-using one of the incoming arrays. This can be done as follows (for brevity, we ignore the possibility that ins is size zero):
60 PROC sum ([]CHAN [64]INT ins, CHAN [64]INT out)
61   WHILE TRUE
62     [64]INT acc, s:
63     SEQ
64       ins[0] ? acc
65       SEQ j = 1 FOR (SIZE ins) - 1
66         SEQ
67           ins[j] ? s
68           SEQ i = 0 FOR 64
69             acc[i] := acc[i] + s[i]
70       out ! acc
71 :
Note that this is also more efficient than the original even without automatic mobility, as it avoids the initialisation of acc with zeroes, and avoids the first loop execution of additions.

5. Individual Elements

We have so far considered how to implement automatic mobility for entire arrays. We can determine, for example, that the output should be a movement in:
72 SEQ
73   out ! array.x
74   array.x := some.other.array
We will now consider the code:
75 SEQ
76   out ! array.x
77   SEQ i = 0 FOR SIZE array.x
78     array.x[i] := some.other.array[i]
One option would be to have an optimisation rule to transform lines 77 and 78 into an assignment of the entire array. However, for now we will consider the general principle of what must be done with individual array accesses such as these. There are two options for the rules we could adopt to transform this code with automatic mobility.

5.1. Copying

The code could be transformed by changing the output to a copy when individual elements are written-to after the output:
79 SEQ
80   out ! CLONE array.x
81   SEQ i = 0 FOR SIZE array.x
82     array.x[i] := some.other.array[i]
This comprises one allocation (during the CLONE on line 80), and two copies of the array (one as part of the CLONE on line 80, one from the loop on lines 81 and 82).
5.2. Moving and Allocating

If the individual array elements needed to be read afterwards we would have to make the output a copy as explained. However, an alternative rule would be to make the output a move, and then allocate a fresh array if the array elements are only written-to.

1 SEQ
2   out ! array.x
3   array.x := MOBILE ARRAY.X.TYPE
4   SEQ i = 0 FOR SIZE array.x
5     array.x[i] := some.other.array[i]
This comprises one allocation (line 3), and one copy (lines 4 and 5), and is therefore more efficient than the first option. This rule can only be applied if all the array elements are written-to before being read-from (and thus none of the data present before the output is required after the output). This relies upon the compiler being able to detect that no elements of the array are read from before being written to.

5.3. Dynamic Index Reasoning

Detecting whether array elements are read-from before being written-to in the presence of dynamic indices can be done with the Omega Test [4]. To explain this, we will consider a more subtle problem:

1 SEQ
2   out ! array.x
3   SEQ i = 0 FOR (SIZE array.x / 2)
4     array.x[2*i] := some.other.array[2*i]
5   SEQ j = 0 FOR SIZE array.x
6     IF
7       j \ 2 == 1
8         array.x[j] := array.x[j-1]
9       TRUE
10        SKIP
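The property the compiler must establish for this code is that no element of array.x is read before being written after the output. Before formalising that as constraints, it can be sanity-checked by exhaustive enumeration for concrete sizes; the sketch below is our own illustration and is not the Omega Test itself, which works symbolically.

```python
# Brute-force cross-check (our own sketch) of the index reasoning for the
# example above: every element read by the second loop should already have
# been written by the first loop.

def written_first_loop(size):
    # First loop:  array.x[2*i] := some.other.array[2*i],  0 <= i < size/2
    return {2 * i for i in range(size // 2)}

def read_second_loop(size):
    # Second loop: array.x[j] := array.x[j-1] for odd j,  0 <= j < size
    return {j - 1 for j in range(size) if j % 2 == 1}

def reads_covered(size):
    """True iff all indices read in loop 2 were written in loop 1."""
    return read_second_loop(size) <= written_first_loop(size)

print(all(reads_covered(n) for n in range(1, 65)))
```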
The portion of the array written-to by the first loop can be described by the index 2i and the compiler-derived inequalities:

  0 ≤ i < SIZE array.x / 2

The portion of the array read-from in the second loop can be described by the index j − 1 and the compiler-derived inequalities:

  0 ≤ j < SIZE array.x
  j \ 2 = 1

The modulo expression is transformed into further inequalities, and the whole system can then be solved by the Omega Test to show that every index read in the second loop satisfies 2i = j − 1 for some valid i, i.e. that none of the portion read-from in the second loop is left unwritten by the first loop.

6. Efficiency Bounds

We can examine the bounds for the speed-up with automatic mobility. Given the number of communications of same-size data in a process network with consistent communication behaviour, we can determine what proportion, M, are mobile (0 ≤ M ≤ 1).
If we assume that copying data is expensive, but that allocation of memory is potentially cheap enough to be negligible, the maximum speed-up of communications in the mobile version is:

  1 / (1 − M)    (1)

Therefore, if no communications are mobile the maximum speed-up is 1× (i.e. no speed-up), if 3/4 of the communications are mobile the maximum speed-up is 4×, and if all of the communications are mobile, the maximum speed-up is infinite (i.e. unbounded). This gives us an optimistic maximum speed-up bound.

However, memory allocation is unlikely to be negligible. In fact, the improvement in performance of the automatic mobility technique is reliant on memory allocation being cheaper than copying. We can instead assume that the time taken to allocate a block of memory of size S is a(S) and label the time to copy a block of the same size as c(S). Under automatic mobility, the time for copying operations is a(S) + c(S), while the time for movement operations is a constant V. The speed-up of automatic mobility (all data being size S) can thus be more accurately estimated as:

  c(S) / (MV + (1 − M)(a(S) + c(S)))

The denominator is the time for a move multiplied by the proportion of moves (MV) plus the time for allocation and copying multiplied by the proportion of copies ((1 − M)(a(S) + c(S))); the numerator is the time for copying everything. If we assume that V (the time for communicating a reference) approximates zero, and consider when this speed-up factor will be greater than one, we can rearrange to:

  M / (1 − M) > a(S) / c(S)    (2)
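These estimates are easy to explore numerically. The following sketch (our own code, with invented cost numbers) implements the speed-up estimate and the break-even condition just derived.

```python
# Our own numeric sketch of the efficiency-bound formulas; the cost values
# used below are invented for illustration.

def speedup(M, copy_cost, alloc_cost, move_cost=0.0):
    """Estimated speed-up: c(S) / (M*V + (1-M)*(a(S) + c(S))),
    where M is the proportion of communications that become moves."""
    return copy_cost / (M * move_cost + (1 - M) * (alloc_cost + copy_cost))

def worthwhile(M, copy_cost, alloc_cost):
    """Break-even condition, taking V ~ 0: M/(1-M) > a(S)/c(S)."""
    return M / (1 - M) > alloc_cost / copy_cost

# With negligible allocation and move costs, equation (1) is recovered:
print(speedup(0.75, copy_cost=1.0, alloc_cost=0.0))   # 4x when M = 3/4

# 1/10 of communications mobile: allocation must be much cheaper than copying.
print(worthwhile(0.1, copy_cost=10.0, alloc_cost=1.0))
```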
So automatic mobility gives a speed-up iff the factor by which the mobile communications outweigh the non-mobile ones is greater than the saving in time of allocation over copying. If only 1/10 of communications in a system were mobile, automatic mobility would only be worthwhile if allocating blocks was at least 9 times faster than copying them. If at least half of the communications in a system are mobile, automatic mobility will give a speed-up if allocation is faster than copying.

7. Memory Usage

One problem with occam 2's static allocations, in the absence of automatically growing stacks, is that memory must be allocated in a process's stack/workspace ready for the maximum memory use of the process rather than its current use. Thus an identity process that copies values of 1MB will always have 1MB of stack allocated to it, even if it is idle for most of its lifetime. In contrast, the same process under automatic mobility will have a small stack, and the space for the data is only allocated if the identity process is currently holding a data packet.

More generally, data in occam 2 lives only in the stacks of processes. Data is never held in a channel; the synchronous communications are simply a direct copy from one process's stack to another's. Thus if P processes all have a stack variable of size S, the total memory allocated is slightly larger than PS. In an automatic mobility setting, all large data lives in the heap, not on the stack. If D items of data of size S are allocated in the heap, the memory use is slightly larger than DS. It can be seen that in a non-deadlocking system with synchronous channels, the number of
Figure 2. Timings for the ring benchmark, with N communications in a ring. Each line illustrates timings for a different value of N , where the overall number of communications is held constant. The X-axis is the size of each data packet, and the Y-axis is time; both axes are logarithmic. Times are shown for the original copying version (a) and the mobile version (b).
data items allocated, D, must be less than or equal to the number of processes in the system that might be communicating those values, P. Since D ≤ P and S is positive, DS ≤ PS, and thus the system will always have better or equal memory use under automatic mobility than with fixed stacks.

8. Benchmarks

To investigate the speed-up that automatic mobility can provide, we benchmarked several programs on an AMD Athlon 64 3000+ with 512KB cache, using our Tock compiler to generate CCSP code compiled by GCC.

8.1. Ring

To estimate the maximum speed-up that could be achieved through automatic mobility, we first benchmark simple rings. Each ring has one prefix and one recorder, connected by N − 2 identity processes. The prefix process sends out one array and then acts like an identity process. The recorder acts like an identity process, but times the running of the network. In a classic setting, there will be N copy-communications of the data per iteration. In an automatic mobility setting, there will be N move-communications. Thus the difference in times will reveal the difference in cost between the two communication types.

We benchmarked the system with several sizes of data and lengths of ring. The results are given in Table 1a and depicted in Figure 2. It can easily be seen from the graph that the mobile version operates in constant time (w.r.t. the data size) whereas the copy version is linearly proportional to the number of bytes being communicated (once the cache size is exceeded). Thus, the speed-up is also linearly proportional to the number of bytes being communicated. Referring to our earlier measure in equation 1, here M = 1, and thus the potential speed-up is unbounded. For this benchmark, automatic mobility provides a benefit even at the lowest size of 64 bytes per data packet.

8.2. Twin Pipeline

To investigate speed-up in a program involving copying and movement, we benchmarked a program with two pipelines.
The first pipeline has a producer, followed by N delta processes.
N.C.C. Brown / Auto-Mobiles
Table 1. Times for the (a) ring benchmark, (b) twin benchmark and (c) oak benchmark, showing the means and standard deviations (S.D.) for 30 runs of each condition, measured in milliseconds, given to 4 significant figures, for each number of processes N and packet size. Note: KB = 2^10 bytes, MB = 2^20 bytes.

(a) Ring benchmark

N    Size     Plain Mean   Plain S.D.   Mobile Mean   Mobile S.D.
2    64       2564         94.34        1394          15.15
2    256      4586         20           1405          61.15
2    1KB      13020        108.1        1389          8.678
2    4KB      47190        257.3        1389          8.161
2    16KB     223900       551.8        1389          8.257
2    64KB     1134000      2119         1395          22.26
2    256KB    5722000      61350        1404          29.05
2    1MB      25570000     67420        1391          10.12
8    64       1419         82.62        784.3         26.86
8    256      2948         39.79        778.2         6.344
8    1KB      8066         28.88        780.7         13.21
8    4KB      29600        257.7        778           5.776
8    16KB     157300       509.4        778.2         6.023
8    64KB     818300       6986         780.2         7.383
8    256KB    3716000      18510        778.5         7.154
8    1MB      16190000     31240        779.7         7.203
64   64       1216         40.58        660           7.532
64   256      2554         249.5        659.5         7.907
64   1KB      8341         15.37        658           7.118
64   4KB      33300        24.83        660           6.411
64   16KB     173600       846.9        660.4         7.297
64   64KB     788800       1662         659.3         8.559
64   256KB    3429000      12320        660.5         7.415
64   1MB      14830000     23350        660.9         8.138

(b) Twin benchmark

N    Size     Plain Mean   Plain S.D.   Mobile Mean   Mobile S.D.
1    64       636.7        27.56        993.3         16.74
1    256      1237         12.68        1087          11.43
1    1KB      4148         15.26        1520          12.75
1    4KB      15310        441.8        3195          29.05
1    16KB     65710        186.7        17620         280.6
1    64KB     325100       689.7        81480         1360
1    256KB    1668000      8667         957900        3764
1    1MB      6830000      13660        5581000       59210
1    4MB      27360000     41890        22540000      254100
1    16MB     109500000    131200       90240000      1039000
8    64       453.3        243.1        497           9.471
8    256      770.6        8.01         537.5         10.1
8    1KB      2385         75.21        723.6         14.9
8    4KB      10180        512.2        1818          8.528
8    16KB     43920        350.3        7991          124.8
8    64KB     277300       4678         77420         1336
8    256KB    1139000      25980        362800        8817
8    1MB      4568000      68190        2285000       64340
8    4MB      18180000     231200       9099000       238800
8    16MB     71930000     234700       35740000      959700
64   64       201.5        6.344        230.5         7.442
64   256      362.9        4.582        250           11.07
64   1KB      1421         32.08        375.8         6.274
64   4KB      5039         53.51        1053          172.2
64   16KB     27550        2302         7058          2462
64   64KB     134500       21980        35830         6002
64   256KB    527100       1843         137700        2227
64   1MB      2113000      4969         865200        15760
64   4MB      8267000      54730        3396000       59160

(c) Oak benchmark

Size     Plain Mean   Plain S.D.   Mobile Mean   Mobile S.D.
64       1161         114.7        1446          84.64
256      2371         597.3        2056          90.95
1KB      7492         634.6        5182          103.5
4KB      34570        230.4        22860         155.7
16KB     153700       1325         95890         1400
64KB     860000       9057         593000        9443
256KB    3552000      38880        2434000       33080
1MB      14450000     162300       11800000      221300
4MB      57410000     566300       46580000      817900
The second pipeline has a producer, followed by N merge processes. The delta processes are connected to the merge processes. The two pipelines both feed into a recorder process responsible for timing. The benchmark is depicted in Figure 3. There will be 2 + 3N communications in the benchmark, of which 2 + N will be copies and 2N will be moves. In the case where N = 1 (one delta and one merge process), M = 2/5 in equation 1, and thus the maximum speed-up is 5/3. As N increases, the proportion M will tend to 2/3. Therefore the maximum speed-up for larger pipelines is 3. We benchmarked the system with several sizes of data and lengths of pipeline. The results are given in Table 1b and depicted in Figure 4. It can be seen that for packet sizes up
[Figure 3 diagram: Producer → Delta0 → Delta1 → … → DeltaN−1 → Consumer; Producer → Merge0 → Merge1 → … → MergeN−1]
Figure 3. The twin pipeline benchmark: two producers, a dual consumer, and N connected delta and merge processes in between.
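The speed-up figures quoted above (5/3 for N = 1, tending to 3 for long pipelines, and the unbounded speed-up for the ring) can be checked numerically. The following is a minimal Python sketch, assuming that the bound from equation 1 (which appears earlier in the paper) takes the closed form 1/(1 − M), where M is the proportion of move-communications:

```python
from fractions import Fraction

def max_speedup(m):
    """Maximum speed-up when a proportion m of the communications become
    moves (assumed closed form of equation 1: 1 / (1 - m))."""
    return float("inf") if m == 1 else 1 / (1 - m)

def twin_move_proportion(n):
    """Twin pipeline: 2 + 3N communications, of which 2N are moves."""
    return Fraction(2 * n, 2 + 3 * n)

# N = 1 gives M = 2/5 and a bound of 5/3; as N grows, M tends to 2/3
# and the bound tends to 3, matching the figures quoted in the text.
bound_small = max_speedup(twin_move_proportion(1))
bound_limit = max_speedup(Fraction(2, 3))
ring_bound = max_speedup(1)  # ring benchmark: M = 1, unbounded speed-up
```

The oak benchmark's M = 34/55 gives a bound of 55/21 ≈ 2.6 under the same assumed form, consistent with the modest speed-ups reported for it.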
[Figure 4 plot: Y-axis "Plain time divided by mobile time", 0 to 6; logarithmic X-axis "Array Size (bytes)", 64 bytes to 16MB; one line per pipeline length N ∈ {1, 8, 64}.]
Figure 4. Speed-up factors for the twin pipeline benchmark, depicted in Figure 3. Each line illustrates the speed-up for a different value of N, where the overall number of communications is held constant. The logarithmic X-axis is the size of each data packet and the Y-axis is speed-up (a factor greater than one indicates the mobile version is faster). The data point for a pipeline of 64 with 16 megabyte packets could not be measured due to address-space limitations.
to 256KB, the speed-up is slightly chaotic, and larger than our theoretical upper bound! This is due to the processor cache, which can be better taken advantage of in the mobile version, due to the smaller amounts of data allocated in the program. For sizes of 1MB upwards (i.e. those larger than the cache) the speed-up is stable. In this benchmark, automatic mobility provides a benefit for sizes of 256 bytes and upwards.

8.3. Occam Audio Kit

The occam audio kit (oak) is a library of useful processes for performing audio generation and manipulation in occam 2, written by Adam Sampson. Almost all the processes use channels of blocks of audio data. Aside from common utility processes such as delta and the sum-like mixer process, there are processes for generating sine waves, process networks for performing simple feedback effects, and amplifier processes. The oak library thus represents a real use case of communicating occam arrays, in a stream-processing setting. It is not a purely linear pipeline, and has diverging and merging processes. We benchmark a music-generating program, with the final passing-to-hardware step replaced with a timing process, and some of the more expensive floating-point operations removed to avoid a confound.

Profiling revealed that each iteration has 55 communications, of which 11 are from producer processes (and thus involve allocations and copies), and a further 10 of which are copy-communications as part of a delta process. The remaining 34 are all move-communications, meaning that M is 34/55.

The results of the benchmark are given in Table 1c and graphed in Figure 5. This benchmark, unlike the other two, features a computational component alongside the communication, and as would be expected, the speed-up is fairly low. However, the speed-up is still larger than one; it does improve the speed of the program, for packets of 256 bytes and larger.

[Figure 5 plot: Y-axis "Plain time divided by mobile time", 0 to 2; logarithmic X-axis "Array Size (bytes)", 64 bytes to 4MB.]

Figure 5. Speed-up factors (Y-axis) for the occam audio kit (oak) benchmark, timing a fixed number of blocks with varying block sizes (the logarithmic X-axis). A speed-up factor greater than one indicates that the mobile version is faster.

9. Alternative Approaches

The analysis is currently performed on individual procedures. This means that pulling out common code into sub-procedures can interfere with the analysis. For example, this modified identity procedure is currently not optimised, whereas the normal identity procedure is:
PROC send (CHAN [64]INT out, VAL [64]INT x)
  out ! x
:

PROC foo (CHAN [64]INT in, CHAN [64]INT out)
  [64]INT x:
  WHILE TRUE
    SEQ
      in ? x
      send (out, x)
:
In future it would be better to perform whole-system analysis, which could avoid this problem – and also allow the optimisations to take into account the behaviour of the reader, not just the writer (a more complex topic).

An alternative to the approach described here is the use of copy-on-write references. Instead of sending a reference on a channel that the reader then owns (as in this paper), some run-time support could be added for sending read-only references. If the reader needs to write to the data, it must make a modified copy of the data. This could potentially allow even fewer copies than the current conservative approach, as copies would only be made when actually necessary at run-time. The cost of copy-on-write references is that coordination is also required in the run-time to destroy data at the right time. Once a read-only reference has been created to a piece of data (in addition to the original reference), there must be some way of ensuring that the data is not destroyed until all the references to it have been overwritten or gone out of scope. In a concurrent system, the cost of this coordination can be high, as it must involve locks or atomic operations. We have therefore chosen the simplest approach, which relies only on static analysis and not on costly run-time support.
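A minimal sketch of the copy-on-write alternative discussed above, in Python rather than the compiler's run-time (the class and its locking scheme are illustrative, not the paper's implementation). The lock around the reference count is exactly the run-time coordination cost the text mentions:

```python
import threading

class CowRef:
    """Illustrative copy-on-write reference: readers share one buffer; a
    writer first clones it if anyone else still holds a reference."""
    def __init__(self, data):
        self._data = data
        self._refs = [1]               # shared reference count
        self._lock = threading.Lock()  # guards the count

    def share(self):
        """Hand out another read-only view: no copy is made here."""
        with self._lock:
            self._refs[0] += 1
        other = CowRef.__new__(CowRef)
        other._data, other._refs, other._lock = self._data, self._refs, self._lock
        return other

    def write(self, index, value):
        """Copy only when actually necessary: if others still share the
        buffer, drop out of the shared group and modify a private copy."""
        with self._lock:
            if self._refs[0] > 1:
                self._refs[0] -= 1
                self._data = list(self._data)
                self._refs = [1]
                self._lock = threading.Lock()
        self._data[index] = value

a = CowRef([0, 0, 0])
b = a.share()      # no copy yet
b.write(0, 42)     # copy made only now; a's data is unaffected
```

Even this toy version shows the trade-off: every write and every share must take a lock, whereas the static-analysis approach adopted in the paper pays nothing at run-time.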
9.1. Linked Lists

This paper has focused on simple blocks of data, such as arrays. An alternative data structure is linked lists. In languages without automatic run-time memory management, linked lists have O(n) preserving concatenation (copying the two lists into a new list, leaving the old lists intact) but O(1) destructive concatenation (transferring the two lists into a new list, and joining the head of one to the tail of the other) if both head and tail pointers are maintained. Consider this pseudo-code for occam with linked-lists:

PROC merger (CHAN LIST INT inA, CHAN LIST INT inB, CHAN LIST INT out)
  WHILE TRUE
    LIST INT a, b:
    SEQ
      PAR
        inA ? a
        inB ? b
      out ! (a ++ b)
:
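The complexity difference described above can be sketched outside occam. This illustrative Python linked list keeps both head and tail pointers, so preserving concatenation copies every node while destructive concatenation touches only the join point:

```python
class Node:
    def __init__(self, value, nxt=None):
        self.value, self.next = value, nxt

class LinkedList:
    """Singly linked list maintaining both head and tail pointers."""
    def __init__(self, values=()):
        self.head = self.tail = None
        for v in values:
            node = Node(v)
            if self.tail is None:
                self.head = self.tail = node
            else:
                self.tail.next = node
                self.tail = node

    def to_list(self):
        out, node = [], self.head
        while node:
            out.append(node.value)
            node = node.next
        return out

def concat_preserving(a, b):
    """O(n): copy both lists, leaving the originals intact."""
    return LinkedList(a.to_list() + b.to_list())

def concat_destructive(a, b):
    """O(1): join a's tail to b's head; a and b must not be used again
    (the situation the move-communication optimisation establishes)."""
    result = LinkedList()
    if a.head is None:
        result.head, result.tail = b.head, b.tail
    else:
        a.tail.next = b.head
        result.head = a.head
        result.tail = b.tail if b.head is not None else a.tail
    return result
```

The destructive version is only safe when the analysis can prove the operands are dead afterwards, which is precisely what the sequentially-later sets of Section 10.2 establish.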
With a standard implementation, this process would receive the two lists, concatenate them in an O(n) operation and send out this new copy on the channel. Using our automatic mobility optimisation, this O(n) concatenation could be transformed into an O(1) destructive concatenation, destroying a and b, because they are not used again after the output. As concatenation is a very common operation on linked lists, automatic mobility may be even more beneficial for linked lists than it is for arrays.

10. Conclusions

We have introduced automatic mobility, an optimisation for concurrent message-passing systems. Under automatic mobility, data items of 256 bytes or larger are allocated on the heap. They are communicated by reference wherever possible, and by cloning (allocating a new copy) otherwise. This should typically provide a speed-up of 1–2×, but in the ideal case it can turn a program that is linear in the size of the data being communicated into one that executes in constant time.

Automatic mobility requires no changes to occam 2 programs, and is simply an optimisation flag in our Tock compiler. Wherever it provides a speed-up it should be enabled, as it has no disadvantages. However, automatic mobility is not supported on embedded systems that lack support for dynamically allocating memory. Based on our results, we expect that for most programs automatic mobility will make the program faster than the original. The speed-up is particularly dependent on the speed of the memory allocator, so fast memory allocators would be worth investigating in future. Automatic mobility also has the potential to reduce the overall memory allocation of a program.

10.1. Dynamically-Sized Arrays

The occam language was originally designed for the Transputer hardware. Each process had a statically allocated workspace. The size of the workspace needed for an occam process could be determined at compile-time. This was possible because all array bounds were constant.
For embedded applications this is still a useful feature of occam programs, but for modern desktop or server hardware it is a cumbersome, prohibitive restriction. The occam-π language allows dynamic arrays through mobiles that are allocated on the heap. With our new automatic mobility transformation, which also stores arrays on the heap, we can offer dynamically sized arrays. The problem is no longer one of implementation, but one of language design, which is beyond the scope of this paper.
10.2. Algorithms Summary

The first step in auto-mobilising a program is to convert all data items beyond a threshold size (we recommend 256 bytes, based on benchmarks) to being mobile. At this point they become allocated at the point of declaration, and all the communications become clone communications. All the subsequent steps refer solely to these mobilised variables.

The next step is to perform program analysis. A control-flow graph must be derived from the program. This is then used as part of a backwards iterative data-flow algorithm [3, p. 231]. Each node begins with an associated empty set of variables. The algorithm then processes each directly connected node pair A and B (where A is followed by B). The new value for A is the union of three sets (its old value, B's current value and all variables read from at B), minus the set of all variables written to at B. The algorithm repeatedly processes all directly connected node pairs until the values for the nodes no longer change. The resulting value for each node is the set of all variables that are read afterwards: the sequentially-later set. The program is also analysed to form, for each node, the set of variables used in parallel (trivial from the abstract syntax tree). Also generated is a used-before-overwrite set, which is almost identical to the sequentially-later set except that the union of three sets becomes a union of four: it also adds any variables that are partially written to at B.

The next step is to process the declarations. Each declaration is checked against the used-before-overwrite set at that node. If the variable is in the set, it must remain allocated at the point of declaration. If it is not in the set (and thus is entirely overwritten before being accessed), the allocation of memory can be removed in favour of leaving it undefined (i.e. a null reference).

Finally, each output and assignment where the source (right-hand side) is a mobile variable is checked.
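The backwards iterative data-flow step described above can be sketched as follows. This is an illustrative Python reading of the update rule (with the subtraction applied to the successor's contribution), taking the control-flow graph, reads and writes as plain dictionaries:

```python
def sequentially_later(succs, reads, writes):
    """Backwards iterative data-flow analysis: compute, for each node,
    the set of variables read afterwards (the sequentially-later set).
    For each edge A -> B, A's set grows by (B's current set plus the
    variables read at B) minus the variables written at B, iterated to
    a fixed point."""
    value = {n: set() for n in succs}
    changed = True
    while changed:
        changed = False
        for a, bs in succs.items():
            for b in bs:
                new = value[a] | ((value[b] | reads[b]) - writes[b])
                if new != value[a]:
                    value[a], changed = new, True
    return value

# Hypothetical two-node identity loop:  n0: in ? x    n1: out ! x
succs  = {"n0": {"n1"}, "n1": {"n0"}}
reads  = {"n0": set(),  "n1": {"x"}}
writes = {"n0": {"x"},  "n1": set()}
later = sequentially_later(succs, reads, writes)
```

Here x is sequentially later at n0 (it is read by the output), but not at n1: after the output, x is entirely overwritten by the next input before being read again, which is exactly the condition under which the output may become a movement.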
If that source variable is in either the in-parallel or sequentially-later sets at that node, it remains a clone. If it is in neither set, the output/assignment is modified to become a movement rather than a clone.

Acknowledgements

The author is grateful to the anonymous reviewers for their comments on this paper, and also to Peter Welch, who supported this idea from the outset.

References

[1] Peter H. Welch and Fred R. M. Barnes. Communicating Mobile Processes: introducing occam-pi. In 25 Years of CSP, volume 3525 of Lecture Notes in Computer Science, pages 175–210. Springer Verlag, 2005.
[2] Fred R. M. Barnes and Peter H. Welch. Mobile Data, Dynamic Allocation and Zero Aliasing: an occam Experiment. In Communicating Process Architectures 2001, pages 243–264. IOS Press, September 2001.
[3] Steven S. Muchnick. Advanced Compiler Design & Implementation. Morgan Kaufmann Publishers, 1997.
[4] William Pugh. The Omega Test: a fast and practical integer programming algorithm for dependence analysis. Communications of the ACM, 35(8):102–114, August 1992.
Communicating Process Architectures 2009
P.H. Welch et al. (Eds.)
IOS Press, 2009
© 2009 The authors and IOS Press. All rights reserved.
doi:10.3233/978-1-60750-065-0-239
A Denotational Study of Mobility

Joël-Alexis BIALKIEWICZ and Frédéric PESCHANSKI
UPMC Paris Universitas – LIP6, 104, avenue du Président Kennedy, 75016 Paris, France
{Joel-Alexis.Bialkiewicz, Frederic.Peschanski}@lip6.fr

Abstract. This paper introduces a denotational model and refinement theory for a process algebra with mobile channels. Similarly to CSP, process behaviours are recorded as trace sets. To account for branching-time semantics, the traces are decorated by structured locations that are also used to encode the dynamics of channel mobility in a denotational way. We present an original notion of split-equivalence based on elementary trace transformations. It is first characterised coinductively using the notion of split-relation. Building on the principle of trace normalisation, a more denotational characterisation is also proposed. We then exhibit a preorder underlying this equivalence and motivate its use as a proper refinement operator. At the language level, we show refinement to be tightly related to a construct of delayed sums, a generalisation of non-deterministic choices.

Keywords. mobility, denotational semantics, refinement
Introduction

Mobile calculi such as the π-calculus [1] provide a suitable abstraction to model and reason about the dynamics of concurrent systems. In the spirit of CCS, they generally adopt a purely operational point of view: a syntax is elaborated, and then operational semantics rules are figured out. Proof principles, generally based on bisimulation, are proposed on top of these. In our opinion there is little room in such an approach for high-level reasoning principles such as the ones available in the world of CSP: fixed-point characterisations, refinement, etc. On the other hand, the denotational point of view makes the constructs of the language simple syntactic sugar for natural operators found at the semantic level. The syntax is a derivative of the semantics, and not the converse. Our objective is thus to elaborate solid foundations for mobile calculi from a denotational point of view. Unsurprisingly, we adopt the same basic construction as CSP: a model of trace semantics.

There are various difficulties in designing a trace model that encompasses the features and expressivity of mobile calculi such as the π-calculus. First, standard trace models do not take the branching structure of process behaviours into account. Instead of relying on stable failures, we use an alternative approach, introduced in [2], of enriching trace sets with structured locations that record at the same time when and where actions are performed. The interest of this approach is that, beyond the adequate and well-integrated characterisation of non-determinism, the location model also provides a solution for the mobile features of the π-calculus, in particular name passing and the important issue of freshness. The idea is to relate the events concerning names (e.g. the creation of fresh names or the extrusion of their scope) to the locations where these events take place.

The contributions of the paper are as follows.
First, the proposed trace model underlies a family of equivalences, most notably a notion of split-equivalence that is satisfying in that it is both observational and compositional. Regarding the proof techniques, we introduce
original principles of trace transformation and normalisation that allow process behaviours to be equated in an easily mechanisable way. The second contribution of the paper relates to refinement [3,4]. At the language level, we show that the refinement ordering is tightly related to a construct of delayed sums, a strict generalisation of the standard choice operators. As an illustration, we show that the refinement order underlies a complete lattice structure of least fixed points used as foundations for the characterisation of recursive behaviours.

The outline of the paper is as follows. In Section 1, we describe the main characteristics of the proposed denotational model. The construction of the trace semantics of process behaviours is discussed in Section 2. In Section 3 we present a language with its syntactic constructs built above the trace semantics. Then, in Section 4, we present and discuss the family of behavioural equivalences we use to distinguish trace sets in complementary ways. We insist on the original notion of split-equivalence that accounts for branching-time behaviours. The refinement order and its complete lattice structure are discussed in Section 5. This is followed by a panorama of related work, the conclusion and bibliographical references. In the paper we omit a few auxiliary definitions and proof details, as well as the complete axiomatisation of split-equivalence. These can be found in a companion technical report [5].

1. The Trace Model

1.1. Observations and Locations

The goal of a denotational model for a process algebra is to characterise precisely the external, or observational, part of process behaviours, abstracting from the details of their internal computations. Our characterisation is based on trace models, largely inspired by the CSP semantics. A trace is a sequence of observations – or observable actions – recorded from a process behaviour.

Definition 1.
An action is either an output c!d of subject (channel) c and object d, an input c? of subject c, or a termination. The subject of an action α is denoted subj(α) and its object obj(α).

Note that the input action does not involve any bound variable. The binder is in fact implicitly defined by the location where and when the input is observed. Another important remark is that, unlike CSP, the characterisation of mobility requires channels to be considered as first-class citizens¹. The data passed along channels are names, which on occasion may identify other channels. We distinguish the plain names, e.g. a, b, . . . (which are known in the global scope), received names ρl (received at location l) and escaping names νl (escaped at location l). Note that, in order to give proper semantics to name equality and extract adequate laws (cf. Section 4.3), the occurrences of names within traces are in fact equivalence classes of names. For convenience, the singleton set {n} with n a name is simply denoted n.

Figure 1 depicts two processes that are only distinguishable by their branching-time behaviour. In the left process there is a non-deterministic choice between two possible continuations starting with an α action. In the right process the action α is first performed, and then an external choice is made between the action β on the left and γ on the right. In standard trace semantics these two behaviours are not distinguished, and this is one of the main arguments for relying on bisimulations to compare process behaviours. Below these two examples, we give the characterisation of these behaviours in the proposed framework. Ignoring for now the details of this encoding, the bottom part of the picture provides an operational

¹ The integration of CSP-like events in the model is not very difficult. The idea is to consider events as observations and modify the means of synchronisation between events. But events cannot be used for channel mobility, so we omit them in the paper.
J.-A. Bialkiewicz and F. Peschanski / A Denotational Study of Mobility
[Figure 1 diagram: on the left, a process branching at the root into α.β and α.γ; on the right, a process performing α and then branching into β and γ; below each, its localised trace-set encoding, e.g. {α::21, β::...; α::22, γ::...} versus {α::..., β::21; α::..., γ::22}.]
Figure 1. Examples of branching behaviours.
interpretation. We can see that the two behaviours are distinguished in this interpretation. The important aspect is that the notion of observation is tightly related to the notion of location in the proposed model.

Definition 2. An observation is the adjunct α::l of an action α and a location l.

The characterisation of the branching structure is not the only problem we have to face. The semantics of mobile calculi such as the π-calculus and its variants introduce history dependence [6]: the semantics of a name depends on what happened before the considered occurrence. A first example is when some data is received from the environment, e.g. in a prefix c?x where x must be bound to "something" we do not really know about. For example, if this input is followed by a match [x = y] (comparison between names x and y) then we must "remember" that x was bound and also assume now that it is equal to y. Another example is when a private name n is emitted to the environment, e.g. in a prefix c!n under a restriction ν(n). Now the name n is not private anymore, because it can be received by external processes, but neither is it public, because it can only be known by those external processes which actually receive the name. Once again this introduces a history dependence in the behaviour, since we have to remember that n escaped the process, and also when it escaped. Moreover, this name must be guaranteed fresh, i.e. unique up to any context in which the behaviour can be observed. This freshness guarantee is difficult to model except in a symbolic way using scope extrusion laws [1]. The other issue we have to deal with is the interpretation of match and mismatch (or any combination of these), which is quite easy in symbolic terms [7] but much less so when considering a denotational interpretation.

Interestingly, we use the very same idea of location to solve most of these issues at once. Of course, we need a slightly more structured notion of location than e.g. [2].

Definition 3.
Let i, j be integers such that 1 ≤ i ≤ j. A locator is either a strong locator, a weak locator, or the origin locator. A (relative) location l is defined by the following grammar, where s is a strong locator, w a weak locator and ϕ a logical guard on channel names:

  l ::= origin | λ̃
  λ̃ ::= λ | (ϕ, w).λ̃
  λ ::= (ϕ, s)

As a convenience, we denote λ the location (true, λ). Locations share many features with term positions in term algebras [8]. Since the location of a given observation is relative to its predecessors, the origin locator is necessary to be able to reconstruct absolute locations. We denote l the absolute location leading to the relative location l. For each atomic location (ϕ, λ) the formula ϕ corresponds to the guard "protecting" the locator λ, and thus the observation made there (i.e. the observation really occurs only when ϕ is true). Since guards protect
locators, it is necessary to be able to extract the combined guard of a location. This is the role of the grd function.

Definition 4. Let ϕn range over guards and ln over locators:

  grd((ϕ1, l1) . . . (ϕn, ln)) = ϕ1 ∧ . . . ∧ ϕn
  grd() = true

A split location (either strong or weak) expresses a branching, or a non-deterministic choice, in the behaviour. A branching locator ji (with 1 ≤ i ≤ j) corresponds to the i-th branch within a choice among j distinct branches. The weak variant is used to describe non-deterministic choices due to internal actions². The locator 11 (resp. its weak variant) describes the absence of a choice, which we call a strong (resp. weak) next locator, each denoted by its own symbol for the sake of readability.

1.2. Sequences and Trace Sets

Definition 5. A sequence is an ordered collection of properly decorated observations; a non-empty sequence is written α1::l1, α2::l2, . . ., and the empty sequence is denoted by a dedicated symbol.

The absolute location of observation αn within sequence α1::l1, α2::l2, . . . , αn::ln, . . . is the concatenation l1.l2 . . . ln−1.ln. It is equivalently denoted ln. For instance, in the sequence α:: , β::83 , γ::22 , the absolute location of β is 83.

Notation 1. We use a few standard notations for sequences. The prefixing of a sequence S by a decorated observation α::l is denoted α::l.S. The concatenation of sequences S1 and S2 is denoted S1S2. Sequence S1 is a prefix of sequence S2, denoted S1 ≤ S2, if and only if ∃S′, S2 = S1S′.

A few operators on sequences are introduced to deal with locations.

Definition 6. The pre-sequence (resp. post-sequence) of a sequence S at a location l, denoted S ↑ l (resp. S ↓ l), is defined inductively as follows:

  (α::l.S) ↑ l.L = α::l.(S ↑ L)
  (α::l.S) ↑ l   = α::l
  S ↑ L          = the empty sequence, otherwise

  (α::l.S) ↓ l.L = S ↓ L
  (α::l.S) ↓ l   = S
  (α::lm.S) ↓ l  = α::m.S
  S ↓ L          = the empty sequence, otherwise
Informally, computing the pre-sequence of a sequence S consists in following the path l and extracting the prefix of S at this point. Conversely, the post-sequence extracts the suffix after that point. For instance, let us consider the trace set T = {α:: , β::21 , α:: , γ::22 }. Here the pre-sequence of T after the initial absolute location, denoted T ↑ , will be {α:: } and the corresponding post-sequence T ↓ will be {β::21 , γ::22 }. Note that the pre-trace set and the post-trace set do not partition a trace set in the general case: they instead isolate a given branching point in the behaviour.

Definition 7. A substitution of x by y in sequence S, denoted S{y/x}, consists in the sequence S where all the occurrences of x are replaced by y. A generic substitution of any x by any y with respect to ϕ in sequence S, denoted S{y/x | ϕ}, where ϕ is a property on x and y, corresponds to applying the substitution to any pair of names satisfying ϕ.

² The internal actions are not recorded in traces, but their effect on the branching structure has to be recorded, hence the introduction of weak split locations.
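The pre- and post-sequence operators of Definition 6 can be sketched concretely for the simplified case where every observation carries a single, guard-free locator. This is an illustrative Python rendering (the composite-location clause α::lm.S ↓ l is omitted, and locators are plain strings):

```python
def pre_sequence(seq, path):
    """S ↑ L: follow the path and keep the prefix up to that point.
    `seq` is a list of (action, locator) pairs, `path` a list of locators."""
    if not seq or not path or seq[0][1] != path[0]:
        return []                      # path not followed: empty sequence
    if len(path) == 1:
        return [seq[0]]                # (α::l.S) ↑ l = α::l
    return [seq[0]] + pre_sequence(seq[1:], path[1:])

def post_sequence(seq, path):
    """S ↓ L: the suffix strictly after the path (same simplifications)."""
    if not seq or not path or seq[0][1] != path[0]:
        return []
    if len(path) == 1:
        return seq[1:]                 # (α::l.S) ↓ l = S
    return post_sequence(seq[1:], path[1:])

# The running example: T = {<a::o, b::2.1>, <a::o, c::2.2>}
T = [[("a", "o"), ("b", "2.1")],
     [("a", "o"), ("c", "2.2")]]
pre  = [pre_sequence(s, ["o"]) for s in T]   # both collapse to <a::o>
post = [post_sequence(s, ["o"]) for s in T]  # the two branches b and c
```

As the text notes, pre and post do not partition the trace set: they isolate the branching point after the common prefix.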
Trace sets are built from sets of sequences with a few constraints. Unless otherwise stated, all sequence operators naturally extend to trace sets by simply applying to all sequences within them. We define below some useful operators on localised trace sets.

Definition 8. The relocation of a trace set T from location l1 to location l2 at location l, denoted T{l1 l2}l, corresponds to the set {S{l1 l2}l | S ∈ T} with:

  α::kl1m.S {l1 l2}k = α::kl2m.S{Vl2n/Vl1n | n ∈ L, V ∈ {ρ, ν}} if ∃ϕ, L, kl2m = L(ϕ, ni), and ::kl2m otherwise
  α::l.S {l1 l2}l.l′ = α::l.(S{l1 l2}l′)
  S {l1 l2}l = S otherwise

Relocation is a very important operator, a little technical but conceptually quite simple. The idea is to update a trace set so that one of its locations is renamed. But such a local change has a non-local impact on the trace. First, the locations are related in a prefix ordering, so all successors must be updated in consequence. Moreover, because trace sets are closed under prefixing, a potentially infinite number of sequences can be concerned by the relocation. The ρ and ν elements in the definition are related to the fresh names generated by the model, which are uniquely characterised by the absolute location where they were created, and thus also have to be updated whenever that location is relocated. Their exact role will be clarified later on.

As an illustration of the relocation process, we take again the previous example T = {α:: , β::21 , α:: , γ::22 }. The relocation T{21 31}{22 32} yields {α:: , β::31 , α:: , γ::32 }. The relocated set of sequences we obtain is not well-formed, because there is no third branch involved in the behaviour. However, this partial trace set can be used by higher-level operators (e.g. for choice or parallel compositions) to recombine correct trace sets.
Definition 9. T{(ϕ, l1) ↔ (ψ, l2)}l = T{(ϕ, l1) (ϕ, •)}l {(ψ, l2) (ψ, l1)}l {(ϕ, •) (ϕ, l2)}l

Not all sets of sequences are valid trace sets; it is thus important to characterise precisely the structure of the set T of all possible trace sets. Technically, this is a setoid characterised as follows:

Definition 10. A trace set T is a set of sequences of the setoid (T, =) with the following properties:

  [fin]  ∀S ∈ T, S is finite
  [pref] ∀S ∈ T, ∀S′ ≤ S, S′ ∈ T
  [move] T{(ϕ, ni) ↔ (ψ, nj)}l = T

The axioms [fin] and [pref] are identical to their CSP counterparts. The axiom [move] allows arbitrary commutations of locators: the order among the particular branches of a given location is not significant³. For the sake of readability, the trace sets presented in the paper are abbreviated as plain, non-prefixed sets of sequences, but we of course assume the trace set axioms unless stated otherwise.

³ The removal of the axiom [move] from the model leads to a notion of prioritised trace semantics that could be worth studying.
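The [fin] and [pref] axioms are straightforward to check mechanically. A small illustrative Python sketch, with sequences represented as tuples ([move] is omitted here, since checking it would need the full relocation machinery):

```python
def prefixes(seq):
    """All prefixes of a sequence, including the empty one."""
    return [seq[:i] for i in range(len(seq) + 1)]

def is_prefix_closed(trace_set):
    """Axiom [pref]: every prefix of every sequence is in the set.
    [fin] holds trivially for finite tuples."""
    return all(p in trace_set for s in trace_set for p in prefixes(s))

T_good = {(), ("a",), ("a", "b")}
T_bad = {("a", "b")}          # missing ("a",) and the empty sequence
assert is_prefix_closed(T_good)
assert not is_prefix_closed(T_bad)
```

Prefix closure is what makes relocation potentially touch infinitely many sequences, as noted in the discussion of Definition 8.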
2. Trace Semantics
The key aspect of CSP-like trace semantics is the possibility of constructing arbitrarily complex process behaviours from a reduced set of elementary behaviours and composition operators.
Definition 11. The empty behaviour is represented by the empty trace set {}, which we sometimes denote ∅ for brevity.
Definition 12. The termination behaviour is the trace set {::, }.
The main operator for sequential behaviour is the prefixing of a trace set T by an action α, which we denote α::l.T (because the action must be located somewhere). In the case of an output, every sequence in the trace set is prefixed by the action decorated by (true, ). The locator , which is synonymous with 11, corresponds to a step forward in time; and since observing that prefix is unconditional, its condition is true. We remind the reader that the subjects and objects of observations are name sets rather than names; however, whenever there is no possible confusion the brackets may be omitted for the sake of brevity.
Definition 13. c!a.T = {{c}!{a}::(true, ).S{ } | S ∈ T}
Note that the sequences following the output must be relocated after the initial . As hinted previously, a key aspect of our encoding is the absence of any form of symbolism, in particular binders, since that would break the denotational nature of the model. For instance, there is no variable or binder attached to input prefixes: whatever is present at the current branching point will be received. If the data is received by an input occurring at absolute location L, it will be known thereafter as ρL. That absolute location will be built within the trace set by use of the relocation operator.
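With locations elided, the prefixing operator α::l.T can be sketched in Python as follows; the (true, ·) decoration and the relocation of the continuations are omitted, and action strings are illustrative only.

```python
def prefix(action, traces):
    """alpha::l.T sketch: prefix every trace of a prefix-closed trace
    set by `action`; the locator, guard decoration and relocation of
    the continuations are all elided."""
    return {()} | {(action,) + s for s in traces}
```

Note that the {()} member keeps the result prefix-closed.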
Definition 14. c?x.T = {{c}?::(true, ).S{ }{ρ /x} | S ∈ T}
A silent or internal action τ obviously cannot be observed; this is the main idea underlying the notion. However, since it may have an effect on the branching structure, it is recorded as a weak location attached to the initial action in the continuation.
Definition 15. τ.T = {α::(true, )l.S{ } | α::l.S ∈ T}
Note that an observation may have any number of weak locations, but it has at most one strong location, which corresponds to the point where the actual observation occurs. Guarding a trace set by a condition is simply done by guarding the head location of all the initials of its individual sequences, so that their initials may only be observed if that condition is true. However, to take into account the match conditions, e.g. [a = b], we allow equivalence classes of names to be formed, e.g. replacing both a and b by {a, b} in the remainder of the trace set conditioned by the match⁴.
Definition 16. [G]T = {α::(G ∧ ϕ, l0)l.S | α::(ϕ, l0)l.S ∈ T {x ∪ y/x | G =⇒ x = y}}
Restriction corresponds to declaring a name as private, thus disallowing communication using it as a subject. However, if it is used as the object of an output, it will escape its scope and become visible to the outside world. This is the core of mobility, and also, in our opinion, the most involved aspect of the model. The restriction of a name n is recorded in a trace set as an “escape”. The effect of restriction is to cut the sequences short from the point where it is no longer possible to interact using the restricted name (starting
⁴ The idea of implementing match conditions by equivalence classes of names is developed in [9].
[Figure 2 diagram: the scope of the private channel s initially encloses T2 and T3; after s is sent along the public channel c, the scope extends to T1, and d is then communicated along s.]
T = sync(T1, (νs)sync(T2, T3))
T1 = {c?:: , ρ !d:: , ::}   T2 = {c!s:: , ::}   T3 = {s?:: , ::}
after the communication of s:   T1 = {ρ !d:: , ::}   T2 = {::}   T3 = {s?:: , ::}
Figure 2. Mobile behaviour illustrated.
from any action whose subject is an occurrence of the restricted name n that is not escaped) and by finding the actions where the restricted name indeed escapes. The formal definitions, relatively technical, are given below.
Definition 17.
En(T) = {En(S) | S ∈ T}
En() = 
En(α::l.S) =
  α::l{false/(n = x)}.S     if grd(l) =⇒ n = x
  α::(false ∧ ϕ, λ)L        if subj(α) = n and l = (ϕ, λ)L
  F_l^n(α::l.S){ν_l/n}      if α = c!n and c ≠ n
  α::l.En(S)                otherwise
F_L^n() = 
F_L^n(α::l.S) =
  α::l{false/(n = x)}.S     if grd(l) =⇒ n = x and not (x = ρ_m with L ≤ m)
  α::l′.F_L^n(S)            otherwise, with l′ = l{true/(n = x)} if not (x = ρ_m with L ≤ m), and l′ = l otherwise
Each escape is effected by replacing all the free occurrences of n by a name ν_l generated from this point on. Since absolute locations are unique by construction, this name is guaranteed fresh. Figure 2 illustrates how the scope of a restricted channel may evolve over time. At the beginning of the execution, only T2 and T3 are in the scope of s. However, T2 sends s along the public channel c, as illustrated in the second step. This allows s to escape its scope, which now also includes T1. Since the name was received at absolute location , it is known as ρ  in the continuation of the trace set. The third step is the communication of d along channel s, which is now known to T1 (as ρ ). The choice between behaviours T1 and T2 is denoted T1 ⊕ T2. It corresponds to a disjoint union of the two behaviours. All locators are renumbered in order to create a new, compound trace set.
Definition 18. T1 ⊕ T2 = T1 {(ϕ, s_i^n) ↦ (ϕ, s_i^{n+m})} ∪ T2 {(ψ, s_j^m) ↦ (ψ, s_{j+n}^{n+m})}
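A minimal sketch of the renumbering performed by ⊕, modelling a trace as a tuple whose head is its top-level branch index and eliding all other location structure (this representation is an assumption of the sketch, not the paper's encoding):

```python
def choice(t1, t2):
    """T1 (+) T2, simplified: disjoint union in which the top-level
    branch indices of the second operand are shifted past those of the
    first, so the compound numbering stays consistent."""
    n = len({s[0] for s in t1 if s})          # number of branches in t1
    shift = lambda s: (s[0] + n,) + s[1:] if s else s
    return t1 | {shift(s) for s in t2}
```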
The interleaving of T1 and T2 is denoted ileave(T1, T2). It corresponds to interleaving the sequences of both behaviours.
Definition 19.
ileave(T1, T2) = ⋃ { ileave(S1, S2) | S1 ∈ T1, S2 ∈ T2 }
ileave(α1::l1.S1, α2::l2.S2) = α1::l1.ileave(S1, α2::l2.S2) ⊕ α2::l2.ileave(α1::l1.S1, S2)
The pure synchronisation between T1 and T2 is denoted sync(T1, T2). It corresponds to allowing all possible communications between the two behaviours to occur without any interaction with the environment, much like the CSP parallel operator does.
Definition 20.
sync(T1, T2) = ⋃ { sync(S1, S2) | S1 ∈ T1, S2 ∈ T2 }
sync(a!d::l1.S1, b?::l2.S2) = sync(b?::l2.S2, a!d::l1.S1) = sync(S1, S2{d/ρ_l2}){ ↦ (a = b, )}
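Both operators admit direct recursive sketches. In the Python below, actions are tuples such as ("out", "c", "d") and ("in", "c", "x"); locations, guards and the termination marker are elided, and the received datum is substituted for its placeholder in the spirit of S2{d/ρl2} (the representation is an assumption of the sketch).

```python
def ileave(s1, s2):
    """All interleavings of two finite sequences (Definition 19)."""
    if not s1:
        return {s2}
    if not s2:
        return {s1}
    return ({(s1[0],) + r for r in ileave(s1[1:], s2)} |
            {(s2[0],) + r for r in ileave(s1, s2[1:])})

def subst(seq, var, val):
    """Substitute the received datum for its placeholder in a sequence."""
    return tuple(tuple(val if x == var else x for x in act) for act in seq)

def sync(s1, s2):
    """Pure synchronisation (Definition 20): the heads must form a
    matching output/input pair on the same channel; communications are
    internal, so only fully synchronised runs survive, as empty residues."""
    results = set()
    for a, b in ((s1, s2), (s2, s1)):
        if not a and not b:
            results.add(())
        elif a and b and a[0][0] == "out" and b[0][0] == "in" \
                and a[0][1] == b[0][1]:
            datum, var = a[0][2], b[0][2]
            results |= sync(a[1:], subst(b[1:], var, datum))
    return results
```

Note how, in the test pair below, the second synchronisation only succeeds because the name d received for x is itself used as a channel afterwards: a trace-level glimpse of mobility.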
Since internal synchronisations cannot be observed, the trace set of pure communications between processes, if it is computable, can only contain sequences where the only observation is the termination, decorated by a chain of conditioned weak locations. Both notions of parallelism can be combined into a universal parallel operator, which accounts for both interleaving and communication between behaviours. It is defined using mutually recursive modifications of ileave and sync, denoted interleave and intersync.
Definition 21.
T1 ∥ T2 = interleave(T1, T2) ⊕ intersync(T1, T2)
interleave(T1, T2) = ⋃ { interleave(S1, S2) | S1 ∈ T1, S2 ∈ T2 }
intersync(T1, T2) = ⋃ { intersync(S1, S2) | S1 ∈ T1, S2 ∈ T2 }
interleave(α1::l1.S1, α2::l2.S2) = α1::l1.(S1 ∥ α2::l2.S2) ⊕ α2::l2.(α1::l1.S1 ∥ S2)
intersync(a!d::l1.S1, b?::l2.S2) = intersync(b?::l2.S2, a!d::l1.S1) = (S1 ∥ S2{d/ρ_l2}){ ↦ (a = b, )}

3. The Language
As explained in the introduction, we think that an important characteristic of CSP is that the syntax of the language follows the denotation, and not the converse. The syntactic constructs that, in our opinion, naturally emerge from the denotation proposed in the previous sections are summarised in Table 1.
Table 1. Syntax of the language.
P, Q, . . . ::= VOID | END | α.P | P + Q | P ||| Q | P · Q | P ∥ Q | (νn)P | μX.Q | X
α, . . . ::= τ | c!a | c?x | [ϕ]α
ϕ, ψ, . . . ::= a = b | a ≠ b | ϕ ∧ ψ | ϕ ∨ ψ | ¬ϕ
With first-class channels, a dynamic restriction operator and a generalised choice operator, the language is on the surface closer to the π-calculus than to CSP. But since we share with CSP the same philosophy (i.e. “denotation speaks”) and also many semantic concepts, we rather see this language as a hybrid. To reflect this we adopted, when possible, the syntactic style of CSP. Note that there is no natural equivalent of the choice operators □ and ⊓ of CSP in our denotation, because these relate to stable failures whereas we use locations instead. The generalised choice + is in fact neither deterministic nor non-deterministic: it is deterministic whenever possible, and non-deterministic otherwise. A purely internal choice can be encoded by weak locations (inserted by explicit τ prefixes in the syntax). Mainly because it has a simpler denotation and axiomatisation, we also prefer explicit guarding of processes over the if-then-else construct. But it is possible to encode P <|ϕ|> Q as [ϕ]P + [¬ϕ]Q. We can now connect the syntax to the semantics.
Definition 22. The trace set of a process P is ⟦P⟧, calculated according to Table 2.
The semantic encoding of Table 2 illustrates, in our opinion quite demonstratively, the fact that the proposed syntactic constructs naturally emerge from the semantics: each construct has a dedicated operator applying at the semantic level. Recursive processes are, as a first approximation, encoded as a simple unfolding. A subtlety is that we introduce a silent action “before” each unfolding. This has the advantage of taking into account the computational cost of unfolding, and it makes divergences observable (as unbounded sequences of weak locations). A more denotational characterisation of recursion as fixed points requires a proper refinement model. This is proposed in Section 5.
Table 2. Semantics of the language.
⟦VOID⟧ = ∅
⟦END⟧ = {::}
⟦α.P⟧ = α.⟦P⟧
⟦[G]P⟧ = [G]⟦P⟧
⟦P + Q⟧ = ⟦P⟧ ⊕ ⟦Q⟧
⟦(νn)P⟧ = En(⟦P⟧)
⟦P ||| Q⟧ = ileave(⟦P⟧, ⟦Q⟧)
⟦P · Q⟧ = sync(⟦P⟧, ⟦Q⟧)
⟦P ∥ Q⟧ = ⟦P⟧ ∥ ⟦Q⟧
⟦μX.P⟧ = ⟦P{τ.μX.P/X}⟧
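As an illustration of the encoding, the following Python sketch computes trace sets for a fragment of the language (prefix, sum, guard), with locations, locators and symbolic guard conditions all elided; the action strings and the "tick" termination marker are illustrative assumptions of the sketch.

```python
END = {(), ("tick",)}                 # termination marker is illustrative

def prefix(action, traces):
    """alpha.P: prefix every trace (locations elided)."""
    return {()} | {(action,) + s for s in traces}

def choice(t1, t2):
    """P + Q: with locators elided, the sum degenerates to a union."""
    return t1 | t2

def guard(phi, traces):
    """[phi]P with a closed boolean guard: false disables everything."""
    return set(traces) if phi else {()}

def ite(phi, p, q):
    """The encoding P <|phi|> Q = [phi]P + [not phi]Q."""
    return choice(guard(phi, p), guard(not phi, q))

# Example 1 of Table 3, up to locators:
ex1 = choice(prefix("tau", prefix("a!b", END)), prefix("c!d", END))
```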
To illustrate the calculation of trace sets, we provide a few examples in Table 3. The first example illustrates a sum, recorded as a combination of branches built using the operator ⊕. Note that the combined branches are correctly renumbered. The second example illustrates the treatment of “binders”, i.e. received names recorded together with the absolute location of their reception. The third one is about parallel composition, which composes the possible interleavings and communications of the operand processes. Interleavings are computed by the function interleave and communications by the function intersync, and these behaviours are joined together by ⊕, like a process sum. The next three examples are about restriction, and illustrate that restriction behaves as expected, which can easily be checked by following the restriction/escaping function En. Example 4 gives a situation in which no interaction is possible. In Example 5, the process behaves very much like [a = z]τ.b!c, since the only possible interaction is an internal communication. Example 6 is an example of the escape of a private name sent over a public channel. The last example illustrates the unfolding of recursion in a very simple case.
4. Split-equivalence
The proposed denotational semantics is not as simple or “beautiful” as we would like it to be. Locations represent at the same time its strength and its weakness from this point of view. On
Table 3. Examples illustrating trace set construction.
1. τ.a!b.END + c!d.END = {a!b:: 2_1 , ::, c!d:: 2_2 , ::}
2. a?x.b?y.[x = y]c!x.END = {a?:: , b?:: , c!{ρ , ρ }::(ρ = ρ , ), ::}
3. a!b.END ∥ c?x.x!d.END = {a!b:: 4_1 , c?:: , ρ4_1!d:: , ::, c?:: 4_2 , a!b:: , ρ4_2!d:: , ::, c?:: 4_3 , ρ4_3!d:: , a!b:: , ::, b!d::(a = c, 4_4) , ::}
4. (νa)a!b.END = {::}
5. (νa)(a!b.END ∥ z?x.x!c.END) = {b!c::(a = z, ) , ::}
6. (νa)c!a.a?x.x!m.END = {c!ν :: , ν ?:: , ρ !m:: , ::}
7. μP.(νn)a!n.P = {a!ν :: , a!ν :: , a!νe :: , a!ν :: , a!νe :: , a!νee :: , . . .}
the positive side, they offer a well-integrated encoding of the branching structure of process behaviours and, perhaps most importantly when it comes to mobility, an adequate (i.e. compositional) characterisation of freshness. But in return, they are also quite fine-grained and, not unlike de Bruijn indices in the λ-calculus, awkward to deal with in formal definitions⁵. It is thus very important to develop proof principles and techniques that allow us to abstract away from the technical details. The basic step towards that objective is the development of a proper notion of semantic equivalence. The trace model presented in the previous sections naturally underlies an equivalence relation based on the setoid identity of Definition 10. This so-called localised trace equivalence is denoted P =L Q and holds if and only if ⟦P⟧ = ⟦Q⟧. It is not, however, a satisfying equivalence in all situations. For instance, it does not preserve such a basic property as P + P = P (because there is a supplementary split location on the left-hand side of the equality). A first, obvious but nevertheless useful, way to loosen the comparison is simply to forget about locations altogether. This results in the well-known trace equivalence, denoted =T, which does not take the branching structure of processes into account. Trace equivalence is enough to address safety issues, but is too imprecise as a general equivalence since it forgets about non-deterministic choices: e.g. it equates processes such as α.(P + Q) and α.P + α.Q. The trace-based equivalence developed in this section, split-equivalence, can be seen as an intermediate between the localised and plain forms of trace equivalence. On the one hand it preserves the observational contents captured by =T, and on the other hand it weakens the constraints imposed by =L on locations.
4.1. Trace Transformations
The general idea is to start from the localised equivalence =L but allow a certain number of transformations that preserve the branching structure and observational semantics.
Table 4. The transformations on trace sets.
[merge] ∀s, s′ ∈ { , }, T {(ϕ, s_a^n) ↦ (ϕ ∨ ψ, s_b^n)}l {(ξ_i, s_i^n)l_i ↦ (ξ_i, s_{i−1}^{n−1})l_i ∀i > a}L {(ξ_i, s_i^n)l_i ↦ (ξ_i, s_i^{n−1})l_i ∀i < a}l   if T ↓ l(ϕ, s_a^n) = T ↓ l(ψ, s_b^n)
[perco] T {(ϕ ∧ ψ, l1) ↦ (ψ, l1)}l   if grd(l) =⇒ ϕ
[weak false] T {lf ↦ (ψ, s^{k,n})le}l   where grd(lf) = false, hd(lf) = (ϕ, s_k^n) and f(lf) = (ψ, s^{i,j})le ∀i > k
[strong false] T \ (T ↓ l.lf) {(ϕ, s_i^n) ↦ (ϕ, s_i^{n−1}) ∀i < k}l {(ϕ, s_i^n) ↦ (ϕ, s_{i−1}^{n−1}) ∀i > k}l, s ∈ { , }   where lf = (ψ, s_k^n)lf′, α::lf.S ∈ T ↓ l, grd(lf) = false and f(lf) = ∅
⁵ Despite the apparent complexity of the proposed formulations on paper, most of the proposed definitions are designed as simple recursive functions that are easily implementable.
Definition 23.
f(∅) = ∅
f((ϕ, l).L) = ∅ if ϕ ⇐⇒ false; l.f(L) otherwise
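Under the simplifying assumption that conditions are closed booleans, the function f of Definition 23 can be sketched as:

```python
def f(loc_chain):
    """Definition 23, sketch: collect the locators of a location chain
    up to (and excluding) the first condition equivalent to false;
    since f((phi, l).L) = empty when phi <=> false, the chain is
    truncated at that point."""
    out = []
    for phi, l in loc_chain:
        if phi is False:          # stands for "phi <=> false"
            return tuple(out)
        out.append(l)
    return tuple(out)
```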
These transformations are described in Table 4. Note that they rely heavily on trace relocation. Despite their somewhat technical definitions, the transformations are conceptually simple:
Merge This transformation merges two distinct branches (identified by two different split locators s_i^n and s_j^n, strong or weak) at a given location l whenever they exhibit the same behaviour (wrt. =L). Put in other terms, one of them is deleted, and all the other branches at location l are renumbered so that split locator numbering remains consistent. In terms of processes, merge transforms P + P into P, compositionally.
Perco Guards are logical conditions that control whether an action may occur. Since actions are treated sequentially, a guard is implicitly in conjunction with all the previous guards of its prefix sequence. For that reason, any guard that is already implied by those of the absolute location l where it appears can be removed. In terms of processes, perco transforms [G]α.[G]P into [G]α.P.
Weak false If the full (absolute) guard of an observation’s strong locator is false, then the observation will never be reached, and it does not belong to the process behaviour. However, if at least one of the internal actions included in the observation (as weak locators) is reachable, then a deadlock condition exists, which must be recorded properly. In terms of processes, weak false transforms τ.[false]P + Q into τ.END + Q. Note that in this case END is used to record the possibility of branching into a deadlock, which clearly differs from the usual CSP semantics.
Strong false This transformation complements the previous one when every single locator of an observation is unreachable, in which case the whole observation must be removed. This consists in removing all the branches starting at the disabled location, up to the renumbering of split locators for consistency. In terms of processes, strong false transforms [false]P + Q into Q.
Definition 24. A split-relation R is a symmetric binary relation on trace sets such that T1 R T2 if and only if T1 = T2 or there exists a couple of transformations U, V such that U(T1) R V(T2). The split-equivalence = is the union ⋃ {R | R is a split-relation}.
We extend the notation to process expressions, considering two processes P and Q as split-equivalent, denoted P = Q, if and only if their trace sets are split-equivalent. The rationale is that two processes are equivalent if and only if their trace sets are either identical (according to =L, which means the axioms of Definition 10 hold) or can be transformed (using the transformations of Table 4) an arbitrary number of times so as to be made identical (still according to =L). The equivalence itself is defined in coinductive terms, which means it encompasses infinite behaviours (by allowing the transformations to be applied infinitely many times).
4.2. Normalisation Techniques
The definition of split-equivalence is concise and conceptually simple. Unfortunately, the coinductive proof technique is relatively cumbersome to deal with in practice. It only works well either for very general and simple relations (e.g. showing that =L is a split-relation and thus included in split-equivalence), or on very simple behaviours (in which case exhibiting a split-relation is easy). In general this is not a very practical proof technique. The first reason is that it says nothing about which transformation to choose in a given context. Also, it
gives an operational feel to the semantics, because one has to consider each transformation individually. Moreover, the straightforward approach does not terminate even for some finite systems, potentially requiring an infinite number of transformations to be applied (e.g. to relate μX.(τ.X + P) and μY.(P + τ.Y)). Indeed, the transformation rules could be added as axioms of the trace setoid. There is however a reason why we keep these rules outside the setoid: trace normalisation. The technique we now discuss is based on the idea of rewrite systems [8].
Definition 25. A trace rewrite is a triple (T, U::l, T′) with T′ the result of applying the transformation U (excluding identity) on the subtrace of T at the absolute location l. We also use the notation T −U::l→ T′. If the considered rewrite is not possible at the given location, we denote this T −U::l↛. An arbitrary rewrite (at an arbitrary location) is denoted T → T′. If a trace set T is such that T ↛ then T is said to be in normal form.
The rewrite rules give a directed and localised interpretation of the transformations of Table 4. The definition of normal forms is also important in that it provides an alternative characterisation of split-equivalence.
Proposition 1. P = Q iff the normal forms of ⟦P⟧ and ⟦Q⟧ are identical (wrt. =L).
This follows naturally from the fact that normalisation itself is a split-relation. Now, in order to prove that two processes P and Q are split-equivalent, we can compute the normal forms of their trace sets and equate the latter using =L. A useful lemma shows that normal forms are unique up to =L.
Lemma 1. Let T be a trace set. Suppose T1 and T2 in normal form such that T →∗ T1 and T →∗ T2. Then T1 = T2.
The proof of this lemma requires a diamond property, relatively technical, which is detailed in the technical report [5]. For the moment, we do not gain much by using the normalisation technique to prove split-equivalence. It is possible, however, to take advantage of the finitely-branching structure of behaviours as well as the well-foundedness of the prefix ordering on locations to uncover a weak termination property of the normalisation process. For this we must introduce a higher-level notion of parallel rewrite, which consists in applying simultaneously all the independent rewrites that can be applied on a given trace set. Two rewrites are strongly independent if they apply at unrelated locations (with respect to the prefix ordering on locations), and weakly independent if their locations are comparable but they can be applied in an arbitrary order.
Definition 26. A single parallel simplification of a trace set T is a triple (T, Υ, T′) with Υ the set of all the independent rewrites applicable on T. The triple is denoted T =Υ⇒ T′.
By Lemma 1, we know that the order of application of the individual rewrites U::l ∈ Υ is not significant, so if we consider such a parallel application as atomic, the relation ⇒ enjoys a decisive weak termination lemma.
Lemma 2. The parallel simplification of a trace set T is terminating, i.e. the chain T =Υ1⇒ T′ =Υ2⇒ · · · =Υn⇒ T″ terminates (i.e. n is finite).
This can be demonstrated by induction on the structure of locations in the trace set T. The important step is the fact that a parallel rewrite Υk+1 can only be performed at locations that are prefixes of the locations of Υk, and there is no infinite descending chain of location prefixes.
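The normalisation strategy can be sketched as exhaustive rule application up to a fixpoint. In the Python below, a trace is a tuple of (guard, action) pairs with closed boolean guards; only a simplified [strong false] rule and the degenerate set-based [merge] are shown, and both representations are assumptions of the sketch.

```python
def strong_false(t):
    """Simplified [strong false]: drop every trace whose head guard is
    False (the renumbering of sibling branches is elided)."""
    return frozenset(s for s in t if not (s and s[0][0] is False))

def merge(t):
    """[merge] degenerates to set identity here: syntactically equal
    branches collapse automatically in a set representation."""
    return frozenset(t)

def normalise(t, rules):
    """Apply the rewrite rules until none of them changes the trace
    set, i.e. until a normal form in the sense of Definition 25."""
    t = frozenset(t)
    changed = True
    while changed:
        changed = False
        for rule in rules:
            t2 = rule(t)
            if t2 != t:
                t, changed = t2, True
    return t
```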
4.3. Laws
Equipped with adequate proof principles, we can now discuss a certain number of laws about the language constructors, interpreted in terms of trace properties. The technical report [5] contains more properties of the model, with more thorough proof details. Most notably, it provides a complete axiomatisation of split-equivalence. We first discuss the compositional nature of split-equivalence (i.e. it is a congruence for all language constructors).
Lemma 3. Let α::l be a located action and T, T′ a couple of trace sets. Then T = T′ =⇒ α::l.T = α::l.T′.
Proof. This is simple: α::l.T = {α::l.S | S ∈ T}, α::l.T′ = {α::l.S′ | S′ ∈ T′} and T = T′, so that ∀S ∈ T, ∃S′ ∈ T′ with S = S′ and ∀S′ ∈ T′, ∃S ∈ T with S′ = S, which trivially gives ∀α::l.S ∈ α::l.T, ∃α::l.S′ ∈ α::l.T′ with α::l.S = α::l.S′, and conversely.
Lemma 4. Let σ be an injective substitution of names by other names and T, T′ trace sets. Then T = T′ =⇒ Tσ = T′σ. The same holds for σ an injective substitution of locations by other locations, provided the cosupport of σ only contains fresh split or weak split locations.
Proof. This is trivial by considering Tσ = {Sσ | S ∈ T} and T′σ = {S′σ | S′ ∈ T′}; using a proof scheme similar to that of Lemma 3, we can match all the Sσ’s in Tσ to S′σ’s in T′σ and vice versa, which is enough to conclude.
Lemma 5. For any single-hole context C: P = Q =⇒ C[P] = C[Q].
Proof. Let C be a single-hole context. We proceed by case analysis on C. The common hypothesis in all cases is that P = Q, i.e. there exist Υ1 and Υ2, a couple of simplifications, such that Υ1(P) = Υ2(Q). In all cases, we can separate the issue into first exhibiting a couple of simplifications Υ1′ and Υ2′ to recover a comparison up to =L, and then secondly discussing the observable properties of the modified contexts. In many cases, it is enough to delay Υ1 and Υ2 to obtain the simplifications we may apply on the contexts.
• Case C = τ.[.]: we take Υ1′ = Υ1::(true, ) and Υ2′ = Υ2::(true, ). We have τ.P = P{ ↦ (true, )} (see Table 2) and also τ.Q = Q{ ↦ (true, )}. Moreover, by Definition 25, we obtain Υ1::(true, )(P{ ↦ (true, )}) = Υ1(P){ ↦ (true, )} and Υ2::(true, )(Q{ ↦ (true, )}) = Υ2(Q){ ↦ (true, )}, and we conclude the case by Lemma 3.
• Case C = α.[.]: we take Υ1′ = Υ1::(true, ) and Υ2′ = Υ2::(true, ). We have α.P = α::(true, ).P (see Table 2) and also α.Q = α::(true, ).Q. Moreover, by Definition 25, we obtain Υ1::(true, )(α::(true, ).P) = α::(true, ).Υ1(P) and Υ2::(true, )(α::(true, ).Q) = α::(true, ).Υ2(Q), and we conclude the case by Lemma 3.
• Case C = [ϕ][.] where ϕ is a guard: we take Υ1′ = Υ1 and Υ2′ = Υ2. The modification of the observational contents makes the head locations of both P and Q guarded by ϕ, and if α::(ψ, l) = β::(ξ, l) then of course α::(ϕ ∧ ψ, l) = β::(ϕ ∧ ξ, l).
• Case C = (νn)[.]: we also take Υ1′ = Υ1 and Υ2′ = Υ2. The function En will be applied on each sequence of P and Q. It is obvious that S = S′ implies En(S) = En(S′), but here the trace sets are not strictly equal; however, their normalisations are. We now
have to prove that, for each possible individual simplification θ, {En(S) | S ∈ T} = {En(S′) | S′ ∈ θ(T)}, and the case is then proved by transitivity on simplifications. We only consider the sequences possibly modified by each kind of rewrite rule. If θ is a compression, the conclusion is obvious by Lemma 4 on both locations and names. If θ is a merge, it may transform two sequences into one whose head guard is the disjunction of the two previous guard conditions. If En changed both sequences to ::l because of the head guard conditions, it will do the same to the result sequence. If it did so on neither sequence, the condition cannot become true on the combined sequence. If it did on one sequence and not on the other, it will obviously not remove the combined sequence, because (G1 =⇒ X) ∧ ¬(G2 =⇒ X) =⇒ ¬(G1 ∨ G2 =⇒ X), so we may conclude.
• Case C = [.] + R where R is a process expression. Here we generalise the context by considering the case of delayed sums (cf. Definition 28). So the goal becomes first P +σl R = Q +σl R, where l is a location and σ a substitution from names (public names and place-names) to place-names. We have P +σl R = P ⊕ R and Q +σl R = Q ⊕ R if l = ε and σ is the identity, and in this case the conclusion is a simple fact: Υ1(P) ⊕ R = Υ2(Q) ⊕ R. Now if l > ε then we have P +σl R = P ⊕σl R and, complementarily, Q +σl R = Q ⊕σl R. We may now apply the adequate simplification Υ1 (resp. Υ2) for each occurrence of P (resp. Q) with the adequate delay and obtain Υ1(P) ⊕σl R = Υ2(Q) ⊕σl R. Since +σl is not symmetric, we also need to consider the second goal R +σl P = R +σl Q, whose proof follows obviously from this one.
• Case C = μ(X).CX, which is the solution of the equation Y = P{Y/X}. It is easy to show that if Y = Z then P{Y/X} = P{Z/X} (by a simple induction on the contexts for Y and Z). Thus, μ(X).P is a fixed point of the function f such that P{Y/X} = f(Y). The proof thus relies on the existence of such a fixed point for f, which we ensure by the least fixed point lemma (Lemma 11) and the monotonicity of the language constructors (cf. Lemma 1).
• Case C = [.] ∥ R where R is a process expression. This case is subsumed by the other cases if we apply the expansion law.
This concludes the congruence proofs for =.
Lemma 6.
SUM1 P + Q = Q + P
SUM2 P + (Q + R) = (P + Q) + R
SUM3 P + END = P
SUM4 [ϕ]P + [ψ]P = [ϕ ∨ ψ]P
Proof.
SUM1 P + Q = P ⊕ Q = Q ⊕ P by Definition 10.
SUM2 P + (Q + R) = P ⊕ (Q ⊕ R) = (P ⊕ Q) ⊕ R by Definition 10.
SUM3 P + END = P ⊕ END. We know that END = {::l}, so we may conclude.
SUM4 If P = α::(ξ, l1)L.S then [ϕ]P + [ψ]P = α::(ϕ ∧ ξ, l1)L.S ∪ α::(ψ ∧ ξ, l1)L.S −merge→ α::((ϕ ∨ ψ) ∧ ξ, l1)L.S = [ϕ ∨ ψ]P
Lemma 7.
RES1 (νn)END = END
RES2 (νn)[ϕ]P = [ϕ](νn)P if n ∉ ϕ
RES3 (νn)(νm)P = (νm)(νn)P
RES4 (νn)α.P = α.(νn)P if n ∉ α
RES5 (νn)α.P = END when n = subj(α)
RES6 (νn)(P + Q) = (νn)P + (νn)Q
Proof.
RES1 (νn)END = {En(S) | S ∈ END} = {En(), En(::l)} = {, ::l} = END
RES2 (νn)[ϕ]P = {En(S) | S ∈ [ϕ]P}. Let us examine the cases for En. If subj(hd(S)) = n or S = , we have En(S) = ::s(l) (resp. ), so in this case we do have En([ϕ]S) = [ϕ]En(S). If hd(S) = c!n and c ≠ n, we have En(S) = F_∅^n(S){νl/n}; but since n ∉ ϕ, the substitution {νl/n}, where l is the absolute location of the action that causes the escape of name n, does not affect ϕ, and the semantic function F_L^n behaves as the identity on S, so we have En([ϕ]S) = ([ϕ]S){νl/n} = [ϕ](S{νl/n}) = [ϕ]En(S). Otherwise, En(S) = hd(S).En(tl(S)) and we have En([ϕ]S) = En(S′) = hd(S′).En(tl(S′)) = [ϕ]En(S). This allows us to conclude that ∀n ∉ ϕ, En([ϕ]S) = [ϕ]En(S).
RES3 Since all the conditions of the En (resp. Em) function depend on n (resp. m), the order of application of the two restriction functions cannot have any influence on their effect, so En ◦ Em = Em ◦ En.
RES4 (νn)α.P = {En(S) | S ∈ α.P} = {En(S) | S ∈ α::l′.P}. Except for  and ::l, all the sequences in α::l′.P begin with α::l′. We know that En is the identity for  and ::l, so those two sequences pose no problem. The other sequences are of the form S = α::l′.S′; since n ∉ α, we have En(S) = α::l′.En(S′).
RES5 (νn)α.P = {En(S) | S ∈ α.P} = {En(S) | S ∈ α::l′.P}. Except for  and ::l, all the sequences in α::l′.P begin with α::l′. As before, En behaves as the identity on those. The other sequences are of the form S = α::l′.S′, but here n = subj(α), so En(S) = ::l by definition.
RES6 (νn)(P + Q) = {En(S) | S ∈ P + Q} = {En(S) | S ∈ P ∪ Q} = {En(S) | S ∈ P} ∪ {En(S) | S ∈ Q} = (νn)P ∪ (νn)Q = (νn)P + (νn)Q
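The truncating behaviour of restriction used in RES4 and RES5 can be sketched as follows; the escape of n through an output (and the fresh ν-name it generates) is deliberately omitted, and the ("out"/"in", channel, object) action encoding is an assumption of the sketch.

```python
def restrict(n, traces):
    """(nu n)P, truncation only: cut every trace at the first action
    whose subject (channel) is the restricted name n."""
    out = set()
    for t in traces:
        cut = t
        for i, (_kind, chan, _obj) in enumerate(t):
            if chan == n:
                cut = t[:i]      # nothing after a blocked action survives
                break
        out.add(cut)
    return out
```

On a prefix-closed input, the result remains prefix-closed, matching the intuition that restriction "cuts the sequences short".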
Lemma 8.
GRD1 [false]P = END
GRD2 [true]P = P
GRD3 [ϕ]P = [ψ]P if ϕ ⇐⇒ ψ
GRD4 [ϕ](P + Q) = [ϕ]P + [ϕ]Q
GRD5 (νa)[a = b]P = (νa)P if b = a
GRD6 [ϕ]α.P = [ϕ]α.[ϕ]P if bn(α) ∉ ϕ
GRD7 [ϕ]α.P = [ϕ]α{a/b}.P if ϕ =⇒ a = b
GRD8 [ϕ][ψ]P = [ϕ ∧ ψ]P
Proof.
GRD1 [false]P = {α::(false)L.S | α::lL.S ∈ P} −φ→ {::l}, where φ is the application of the false transformation from Table 4.
GRD2 If P = α::(ϕ, l1)L.S then [true]P = α::(true ∧ ϕ, l1)L.S = P
GRD3 Our guards being logical expressions, all the laws of first-order logic apply, so ϕ and ψ are considered the same object.
GRD4 If P = α::(ξ, l1)L.S and Q = β::(η, m1)M.S′ then [ϕ](P + Q) = α::(ϕ ∧ ξ, l1)L.S ⊕ β::(ϕ ∧ η, m1)M.S′ = [ϕ]P + [ϕ]Q
GRD5 (νa)[a = b]P = {Ea(S) | S ∈ [a = b]P} = {Ea(α::(a = b ∧ ϕ, l) . . . .S′)} = {α::(ϕ{true/a = b}, l) . . . .Ea(S′)} = (νa)P
GRD6 An application of the perco rewrite rule removes guards that have already been enforced earlier in the sequence.
GRD7 From Table 2, any occurrence of a or b will be replaced by {a, b}, which allows us to conclude.
GRD8 From Table 2, when calculating the trace set of a guarded process, the guard is put in conjunction with that of the head location of the head observation in all sequences, and the substitutions are composed. The result is trivial by associativity of conjunction and of function composition.
Lemma 9 (expansion law). P ∥ Q = Σ αi::li.(Pi ∥ Σ βj::lj.Qj) + Σ βj::lj.(Σ αi::li.Pi ∥ Qj) + Σ [c = d]τ.(Pi ∥ Qj{d/x}), where αi = c!d and βj = d?x, or the converse.
Proof. Soundness of the expansion law is quite easily proved by induction on sequences. The induction hypothesis is that if the property holds for Si ∥ Sj, then it holds for Σi,j αi::li.Si ∥ βj::lj.Sj. Except if one of the processes can only terminate immediately, in which case the other one is the result of the parallel composition (which provides a fixpoint for our induction), we have
P ∥ Q = {αi::li.Si} ∥ {βj::lj.Sj} where αi::li.Si ∈ P, βj::lj.Sj ∈ Q
= {interleave(αi::li.Si, βj::lj.Sj)} ⊕ {interleave(βj::lj.Sj, αi::li.Si)} ⊕ {intersync(αi::li.Si, βj::lj.Sj)} ⊕ {intersync(βj::lj.Sj, αi::li.Si)}
= {αi::li.(Si ∥ βj::lj.Sj)} ⊕ {βj::lj.(Sj ∥ αi::li.Si)} ⊕ {γz::wawb(ca = db ∧ ϕa ∧ ψb, )lz.Tz | {αi::li.Si} ∥ {βj::lj.Sj}{ea/ρlj} = Σ_{k=1}^p γk::lk.Tk} ⊕ . . .
= Σ αi::li.(Pi ∥ βj::lj.Qj) ⊕ Σ βj::lj.(αi::li.Pi ∥ Qj) ⊕ Σ [c = d]τ.(Pi ∥ Qj{d/x}), where αi = c!d and βj = d?x, or the converse
= Σ αi::li.(Pi ∥ βj::lj.Qj) + Σ βj::lj.(αi::li.Pi ∥ Qj) + Σ [c = d]τ.(Pi ∥ Qj{d/x})
For the sake of readability, the above proof elides as “. . .” the converse case of the communication condition, since it behaves exactly the same except that the sending and receiving processes are reversed.
5. Refinement
One advantage of manipulating prefix-closed sets of sequences as trace sets is that set inclusion then provides a simple yet powerful means of behavioural refinement.
Definition 27. A process P refines a process Q, which we denote Q ⊑ P, iff ∃T ⊆ ⟦Q⟧ such that ⟦P⟧ = T. The relation is equivalently denoted P ⊆/= Q.
Before investigating the main properties of the order, we introduce a syntactic construction that, indeed, properly characterises the notion of refinement in the proposed model. We recall that for a given trace set T, its pretrace set T ↑ l (resp. its postrace set T ↓ l) corresponds to the subtrace containing all the prefixes (resp. suffixes) of T before (resp. after) l. A
J.-A. Bialkiewicz and F. Peschanski / A Denotational Study of Mobility
255
further notation is that of the trace complement T (l) which is defined as T \ (T ↑ lT ↓ l). These are all the sequences that do not go “through” l. It is maybe easier to remind the invariant: T = T (l) ∪ T ↑ lT ↓ l. Note that the complement set may be empty and thus may not be a valid trace set, i.e. its codomain is T % {∅}. We may now introduce the notion of delayed sum, a strict generalisation of the sum operator, as follows: Definition 28. Let P and Q be arbitrary processes, l a location and σ a substitution from names (a, b, ρl , . . .) to place names only (ν , ρe , . . .). The delayed sum operator at l, denoted P +σl Q, is defined as follows: def
P +^σ_l Q  def=  P ⊕^σ_l Q

T1 ⊕^σ_l T2  def=  T1(l) ∪ T1 ↑ l (T1 ↓ l {s^n_i → s^{n+m}_i} ∪ T2 {s^m_j → s^{n+m}_{j+n}} {V_{lm}/V_m | V ∈ {ρ, ν}, V_m ∉ support(σ)} σ)
Notation 2. Let Id be the identity substitution. For the sake of brevity, P +^Id_l Q is denoted P +_l Q.

Delayed sums are similar to ordinary sums, except that the effect of the operator – the branching point – is delayed until the path l has been followed in the left operand. For example, α.P +_l Q = α.(P + Q) where l is the location of α. An explicit substitution must be provided when the right-hand process refers to names that have been created before (wrt. the prefix order on locations) the location where it is inserted in the left-hand process behaviour. When the substitution is not necessary it may be omitted. It is important to note that the operator is not symmetric^6. The delayed sum operator is clearly a generalisation of ordinary sum.

Proposition 2. P +_ε Q = P + Q

Proof. Here we assert that an ordinary sum is the same as a delayed sum with the empty delay ε. If we apply Definition 28, then we have P +_ε Q = P(ε) ∪ (P ↑ ε (P ↓ ε {s^n_i → s^{n+m}_i} ∪ Q{s^m_j → s^{n+m}_{j+n}})). By Definition 6 we have, for any trace set T, T ↑ ε = {ε} and T ↓ ε = T. The trace set complement P(ε) is thus the empty set, which gives P +_ε Q = P{s^n_i → s^{n+m}_i} ∪ Q{s^m_j → s^{n+m}_{j+n}} = P ⊕ Q = P + Q.
There is a tight connection between delayed sums and refinement, as characterised by the following lemma:

Lemma 10. P ⊒ Q ⟺ ∃RL = ∪^n_{i=1}{(R_i, l_i, σ_i)} s.t. P = Q +^{σ_1}_{l_1} R_1 ... +^{σ_n}_{l_n} R_n

Proof. For the if part, let l̃ be an arbitrary location and T ⊆ P{l̃} a trace set such that T = Q{l̃}. Such a trace set exists by hypothesis and Definition 27. Now we consider T' def= P{l̃} \ T, the set of all sequences that are in the behaviour of P but not in Q. Note that if T' is empty then we are finished: P and Q exactly match through reflexivity. We now take all the maximal locations l ∈ l̃ such that T' ↑ l ≠ ∅. The set is maximal with respect to location prefixing in that if l_1, l_2 ∈ l̃ then l_1 ≰ l_2 and l_2 ≰ l_1. For each such location l, we identify a process R_l such that P{l̃} ↓ l = T ↓ l{s^n_i → s^{n+m}_i} ∪ R_l{l̃}{s^m_j → s^{n+m}_{j+n}}{V_{lm}/V_m | V ∈ {ρ, ν}, V_m ∉ support(σ)}σ. Such a process exists since T' ↓ l is not empty and strictly contained in P{l̃} by definition. Hence, P{
6 We consider the dual notion of "premature sums" also worth studying, but this requires a notion of "reversible locations" that we have yet to investigate.
P def= (νa)b!a.τ.([c = d]v!a.VOID + c!v.d?x.x!a.VOID + d?x.(c!v.x!a.VOID + x!a.c!v.VOID))
Q def= (νa)b!a.τ.(c!v.d?x.x!a.VOID + d?x.c!v.x!a.VOID)
Figure 3. Illustrating delayed sums (1). [Tree diagrams and trace-set listings of P and Q omitted.]
l̃} = ∪_{l∈l̃} (P{l̃} ↑ l P{l̃} ↓ l{s^n_i → s^{n+m}_i}) ∪ T'{s^m_j → s^{n+m}_{j+n}}{V_{lm}/V_m | V ∈ {ρ, ν}, V_m ∉ support(σ)}σ, which we may finally rephrase as P = Q +^{σ_1}_{l_1} R_1 ... +^{σ_n}_{l_n} R_n.
For the only if part we suppose P = Q +^{σ_1}_{l_1} R_1 ... +^{σ_n}_{l_n} R_n, and for each l_i ∈ l̃, Q +_{l_i} R_{l_i} = Q_{l_i}(ε) ∪ (Q_{l_i} ↑ (Q_{l_i} ↓ {s^n_i → s^{n+m}_i} ∪ R_{l_i}{s^m_j → s^{n+m}_{j+n}}{V_{lm}/V_m | V ∈ {ρ, ν}, V_m ∉ support(σ)}σ)) (Definition 28), which trivially implies that Q{l̃} ⊆ Q +^{σ_i}_{l_i} R_i{l̃} and, following the hypothesis, Q{l̃} ⊆/= P{l̃}. We thus conclude P ⊒ Q.
We now illustrate the delayed sum characterisation of refinement. Consider the processes of Figure 3. Intuitively, it should be the case that P ⊒ Q (i.e. Q refines P) because P may at least perform all the actions and non-deterministic choices of Q, but can of course do even more. However, it is not the case that P = P + Q (see Fig. 4), so the (standard) sum operator does not characterise refinement in a complete way. As Fig. 5 makes clear, Lemma 10 guarantees the existence of a delay l̃, σ and a delayed sum of processes R such that P = Q +^σ_l̃ R.
Refinement, as characterised by delayed sums, is a proper ordering relation with (parametrised) monotonicity properties on the language constructors.
P + Q def= [tree diagram omitted]
Figure 4. Illustrating delayed sums (2).
Q +^{ν/a} R1 +^{ρ.../a, ν/b} R2 def= [delay locations and tree diagram omitted]
R1 = [c = d]v!a.VOID    R1 = {v!a::(c = d, ...)}
R2 = a!b.c!v.VOID       R2 = {a!b::..., c!v::...}
Figure 5. Illustrating delayed sums (3).
Theorem 1.
• P ⊒ P
• P ⊒ Q ∧ Q ⊒ R ⟹ P ⊒ R
• P ⊒ Q ∧ Q ⊒ P ⟹ P = Q
• P ⊒ Q ⟹ ∃l̃, C_l̃[P] ⊒ C_l̃[Q] for any language context C
All these properties use a similar proof schema that consists in rephrasing each property in terms of the split equivalence and delayed sums.
Proof.
• (reflexivity) By definition P + P ⊒ P, and since P +_ε P = P + P (by Lemma 2) and P + P = P, we conclude P ⊒ P.
• (transitivity) We have P ⊒ Q and Q ⊒ R, so ∃l_1...l_n, A_1...A_n s.t. Q +_{l_1} A_1 ... +_{l_n} A_n = P and ∃l'_1...l'_m, A'_1...A'_m s.t. R +_{l'_1} A'_1 ... +_{l'_m} A'_m = Q. Since split-equivalence is transitive, R +_{l'_1} A'_1 ... +_{l'_m} A'_m +_{l_1} A_1 ... +_{l_n} A_n = P, which allows us to conclude.
• (antisymmetry) We have P ⊒ Q, so ∃l_1...l_n, A_1...A_n s.t. Q +_{l_1} A_1 ... +_{l_n} A_n = P. Also by hypothesis Q ⊒ P, so ∃l'_1...l'_m, B_1...B_m s.t. P +_{l'_1} B_1 ... +_{l'_m} B_m = Q. In terms of trace sets, it is easy to derive that ∪^n_{i=1}{A_i} = ∪^m_{i=1}{B_i} and so P = Q.
• (monotonicity) The property can be rewritten as follows: ∃l_1...l_n, R_1...R_n s.t. Q = P +_{l_1} R_1 ... +_{l_n} R_n ⟹ C[Q] = C[P +_{l_1} R_1 ... +_{l_n} R_n], whose proof is a particular case of the congruence property for delayed sums.
This concludes the proofs for Theorem 1.

Note that, unsurprisingly, the congruence result is relative to a given set of location delays: it does not follow from P ⊒ Q that P +^σ_l R ⊒ Q +^σ_l R, since P and Q may have differently ordered branches. Instead, if P ⊒ Q and ∃P' = P such that P' ↑ l = Q ↑ l, then P' +_l R ⊒ Q +_l R.

The refinement ordering is trivially bounded by process VOID on one side, and by process RUN def= T on the other side. A simple fact is that for any process P we have RUN ⊒ P ⊒ VOID. A much more general result about ⊒ is the following one:

Theorem 2. ⊒ is a complete lattice.

The property is relatively easy to exhibit if we interpret it in terms of trace sets. It says, in fact, that (T, ⊆/=) (i.e. the subset relation for the equivalence classes with respect to =) itself possesses a complete lattice structure. Simple set-theoretic arguments (using the generalised intersection and union operators with respect to T/=) then suffice to establish the property. This leads to the most important result of the section:

Lemma 11.
Let φ be a function from process expressions to process expressions. If φ is monotone with respect to ⊒ then it admits a least fixed point with respect to =, i.e. φ(P) = P for some process P. Moreover, if φ is continuous with respect to ⊒, then the least fixed point is the least upper bound of {φ^n(VOID) | n ∈ ℕ}.

Both properties correspond to transpositions of Tarski's lemmas into the realm of the proposed framework, considering the complete lattice structure of the refinement order. Thanks to Lemma 11, we may now introduce a general rule for recursion as follows:

[rec] μ(X).P is the least fixed point solution of P{Y/X} = f(Y) with f such that Y = P{Y/X}

Note that the existence of f and the existence or unicity of its least fixed point are not always guaranteed. An example is unguarded recursion (e.g. μ(X).X), for which nothing is recorded. The explicit recording of divergences would allow for a more thorough treatment of unguarded recursion, which is left as future work.
6. Related Work

To our knowledge, there are very few investigations aiming at developing mobile extensions for CSP. In [10] the authors propose to encode mobile channels as processes. This makes sense from the point of view of execution environments and closed-world semantics, but channel mobility has an important impact on the theory, and thus something must be proposed at that level to be able to reason about such mobile extensions. In a recent yet unpublished paper [11], an interesting proposition is made for an encoding of both a channel-passing version of CSP and of the π-calculus within CSP+, the language of CSP enriched by a construction for exceptional behaviours [12]. The channel-passing variant of the parallel construct is more about dynamic alphabets and the explicit manipulation of read-write access rights on channels, but it is not mobility in the sense of the π-calculus. In particular, the scope of names remains static in this variant. Concerning the encoding of the π-calculus itself, the proposition remains mostly informal as of today, but the general idea is to encode the effect of binders as non-deterministic choices among the (potentially infinite) possibilities of name substitutions involved. For the input binder this idea clearly relates to the early semantics of the π-calculus, and it is shown that for finite-state problems the choices are also finite (using open bisimilarity). However, we do not convey the idea of early semantics in our model because it has a significant cost when conducting proofs or developing verification algorithms: one has to consider all the possible substitutions of names in order to solve the problem, which can be infinite (early case) or restricted to the finite number of names actually used in the process being analysed (open case). Moreover, this does not solve the compositionality issue, because channel names that are not bound are not considered by the substitution.
In our case we provide a single, uniquely defined name — attached to the absolute location of the considered observation — to serve the same purpose (and more). Another advantage of our denotation is that it is not parametric, unlike [11], whose denotation is parametric because of the explicit manipulation of infinite replacement sets to capture freshness; in our case a single location must be recorded instead. Unlike our proposition, the model also seems to suffer from the same compositionality issues as the "real" π-calculus. A very positive point of the model of [11] is the natural switch from the operational to the denotational characterisation and vice-versa. In our case we had to exhibit a non-trivial axiomatisation of the denotation, which is quite an involved process (especially the completeness part of the adequacy theorem, cf. our technical report). The positive point is that the axiomatisation gives us a minimal set of laws for the proposed language constructs. In comparison with the standard failure-divergence (FD) model of CSP, an interesting characteristic of the trace model we propose is that it integrates the concepts of (standard) trace sets, the encoding of the branching structure and the mobile features well. There is no need for separate specifications (traces, failures and divergences), and the resulting denotation conveys the complete lattice structure of the plain trace semantics. This is mostly thanks to the notion of location. In return, the manipulation of the trace sets must ensure the correct (re-)location of the observations, which makes the definitions more intricate than those of FD, even if we remove the mobility part. We show, however, that thanks to trace normalisation we are able to abstract away from the fine-grained nature of the locations. The proof techniques we propose beyond the denotation itself do not suffer from the fine-grained nature of locations.
Along the same lines, we provide in [13] a first sketch of a CSP-like predicate logic built on the present model. The logic allows practical reasoning about mobile systems without having to deal directly with e.g. explicit locations, escape functions or split-relations. The study of π-calculus semantics from a denotational point of view has also been investigated in [14,15], with the objective of characterising full abstraction lemmas wrt. testing preorders. Most interpretations consider set-theoretic trace models built from the operational semantics. In this paper, we adopt the complementary point of view of building trace models
directly from process expressions, with the goal of providing proof principles and techniques directly applicable on trace models, as in CSP. While [14,15] relate trace-based denotations, so-called acceptance traces, with tests based on the operational semantics in order to provide full abstraction lemmas, we address complementary questions at the intersection of denotational and axiomatic semantics. Nevertheless, there are some affinities between the two approaches, most notably the development of behavioural preorder relations, and set inclusion as the general driving principle. The trace model of [15] is also different in that it implements early commitments — ours are late in comparison. Must-testing equivalence is weaker than split-equivalence, with tau-laws relaxing even more constraints than weak late congruence [16]. This obviously raises compositionality issues.

7. Conclusion and Future Work

In this paper we have shown that it is possible to model mobility in an observational and compositional way. The resulting model is not as concise and elegant as the standard CSP model, but this is, in our opinion, the price to pay for the characterisation of mobility. There is, however, a form of minimalism in the model: the single concept of location plays quite a versatile role. Locations are used to encode the branching structure of the process within trace sets (which is why they were introduced at first in [2]), but they are also used to give fresh identities to dynamic names, which are in our opinion the central characteristic of mobile — dynamic — behaviours. The issue with locations is that, similarly to de Bruijn indices, we have to ensure their consistency, especially when composing trace sets. From a mathematical point of view, there may exist cleaner foundations where, for example, permutations of branches and trace transformations would come "for free".
But this is not certain, because concurrent systems are not "pure" mathematical objects to start with. What we propose in this paper is the simplest model we were able to develop. Our intuition is that its complexity is inherent to the phenomena we try to characterise. A lesson we learnt from experience is that it is much better to go from the denotation to the language, as in CSP, than the converse, although it is of course necessary to have some intuitions about the language at first. Initially we tried the other way around (starting from the π-calculus directly) and it generally led to dead ends. From a practical point of view, the trace normalisation principles of the model appear quite attractive. We are now developing algorithmic principles based on normalisation that will hopefully lead to the development of an equivalence and refinement checking tool for the proposed language.

References

[1] Robin Milner. Communicating and Mobile Systems: The π-Calculus. Cambridge University Press, 1999.
[2] Frédéric Peschanski. On Linear Time and Congruence in Channel-Passing Calculi. In Communicating Process Architectures 2004, pages 39–54. IOS Press, 2004.
[3] C. A. R. Hoare. Communicating Sequential Processes. Prentice Hall, 1985.
[4] A. W. Roscoe. The Theory and Practice of Concurrency. Prentice Hall, 1997.
[5] Frédéric Peschanski and Joël-Alexis Bialkiewicz. A denotational model for mobile processes. Technical report, LIP6, http://www-poleia.lip6.fr/~pesch/data/tracepitr08.pdf, 2008.
[6] Ugo Montanari and Marco Pistore. History-dependent automata: An introduction. In Marco Bernardo and Alessandro Bogliolo, editors, SFM, volume 3465 of Lecture Notes in Computer Science, pages 1–28. Springer, 2005.
[7] M. Hennessy and H. Lin. Symbolic bisimulations. Theor. Comput. Sci., 138(2):353–389, 1995.
[8] Franz Baader and Tobias Nipkow. Term Rewriting and All That. Cambridge University Press, 1998.
[9] Frédéric Peschanski and Joël-Alexis Bialkiewicz.
Modelling and verifying mobile systems using pi-graphs. In Mogens Nielsen, Antonín Kučera, Peter Bro Miltersen, Catuscia Palamidessi, Petr Tůma, and
Frank D. Valencia, editors, SOFSEM, volume 5404 of Lecture Notes in Computer Science, pages 437–448. Springer, 2009.
[10] Frederick R. M. Barnes and Peter H. Welch. A CSP Model for Mobile Channels. In Frederick R. M. Barnes, Jan F. Broenink, Alistair A. McEwan, Adam Sampson, G. S. Stiles, and Peter H. Welch, editors, Communicating Process Architectures 2008, pages –, sep 2008.
[11] A. W. Roscoe. On the expressiveness of CSP. http://www.comlab.ox.ac.uk/publications/publication2766-abstract.html.
[12] A. W. Roscoe. The three platonic models of divergence-strict CSP. In Proceedings of the 5th International Colloquium on Theoretical Aspects of Computing, pages 23–49, Berlin, Heidelberg, 2008. Springer-Verlag.
[13] J.-A. Bialkiewicz and F. Peschanski. Logic for mobility: a denotational approach. In Logic, Agents and Mobility (LAM'09), pages 44–59. Technical report (Durham University), 2009.
[14] Michele Boreale and Rocco De Nicola. Testing equivalence for mobile processes. Inf. Comput., 120(2):279–303, 1995.
[15] Matthew Hennessy. A fully abstract denotational semantics for the pi-calculus. Theor. Comput. Sci., 278(1-2):53–89, 2002.
[16] Huimin Lin. Complete inference systems for weak bisimulation equivalences in the pi-calculus. Inf. Comput., 180(1):1–29, 2003.
Communicating Process Architectures 2009 P.H. Welch et al. (Eds.) IOS Press, 2009 © 2009 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-60750-065-0-263
263
PyCSP Revisited
Brian VINTER a,1, John Markus BJØRNDALEN b and Rune Møllegaard FRIBORG a
a Department of Computer Science, University of Copenhagen
b Department of Computer Science, University of Tromsø
Abstract. PyCSP was introduced two years ago and has since been used by a number of programmers, especially students. The original motivation behind PyCSP was a conviction that both Python and CSP are tools that are especially well suited for programmers and scientists in fields other than computer science. Working under this premise, the original PyCSP was very similar to JCSP and the motivation was simply to provide CSP to the Python community in the JCSP tradition. After two years we have concluded that PyCSP is indeed a usable tool for the target users; however, many of them have raised some of the same issues with PyCSP as with JCSP. The many channel types, the lack of output guards, and external choice wrapped in the select-then-execute mechanism were frequent complaints. In this work we revisit PyCSP and address the issues that have been raised. The result is a much simpler PyCSP with only one channel type, support for output guards, and external choice that is closer to that of occam than JCSP.
Keywords. Python, CSP, PyCSP, alternation, concurrency
Introduction When PyCSP was introduced in 2007 [1] it was a CSP [2] library in the JCSP [3,4,5] tradition and primarily targeted pedagogical purposes. After having worked with PyCSP for another two years the authors decided to evaluate the experiences made and decide on the future of PyCSP. The outcome of the evaluation had three potential conclusions: 1. PyCSP was a nice exercise but of little or no practical use and the project should be stopped 2. PyCSP is a success as it is and no further work is needed, thus the research should be stopped 3. PyCSP has shown potential but needs more work and/or alternative approaches Having included PyCSP in the Extreme Multiprogramming class at the University of Copenhagen three years in a row with a combined number of students in excess of 200, we did have a sizable set of inputs on PyCSP. On the upside the students claimed to like PyCSP for a number of reasons: • It is Python and thus perceived to be easier to work with than most other languages2 • The fact that PyCSP channels are type indifferent3 is convenient when changing the functionality in an application On the downside, a number of students also had reservations with PyCSP: 1 Corresponding Author: Brian Vinter, Department of Computer Science, University of Copenhagen, DK-2100 Copenhagen, Denmark. Tel.: +45 3532 1421; Fax: +45 3521 1401; E-mail: [email protected]. 2 This may only be true in the context of this class where the focus is on scientific applications. 3 The type indifference is easy because Python is dynamically typed.
264
B. Vinter et al. / PyCSP Revisited
• The many channel types make code less intuitive, and any-to-any was the de-facto choice even though it does not support external choice
• No real parallelism unless functionality is written in C

Going through the final reports for the last exam, we discovered that more than 80% of the students had chosen PyCSP for their solution; second was JCSP, then followed C++CSP. A single report used occam. While some of the success of PyCSP is bound to be due to veneration for a locally developed system, there is little doubt that the students do like PyCSP: Java is the usual language of choice in other classes. It was especially interesting that students with a non-CS background, such as math, physics, nano-science and biology, all chose PyCSP, which indicates that our original intention of making a system for multi-core programming for scientists is within reach.

We decided that option 3, "PyCSP has shown potential but needs more work and/or alternative approaches", was the conclusion of our evaluation and went on to address the input we have gotten from the many users. The most frequent comment we received was disappointment that true parallelism could not be obtained using pure Python code. This is because Python uses a global interpreter lock, the GIL, which means that threads in Python are useful only if a thread calls outside Python or to handle asynchronous events. To address this, the new implementation supports operating system processes in addition to threads. Strictly speaking this could be done with no changes to PyCSP, and a new process-based implementation could transparently replace the old one. However, a number of other comments we received addressed the syntax and semantics of PyCSP and we thus decided to revisit the design. The work on the new implementations is presented in another paper [6]. It was also decided to make changes to the PyCSP API.
Originally, PyCSP had been inspired by the other CSP libraries – most importantly JCSP – but it was evident that many students found the compact expressions in occam, especially the representation of external choice, attractive. While students easily understood why external choice on output channels is not needed in CSP, they still, rightly, claimed that output guards would be convenient. Finally, termination through poisoning was easily understood but also claimed to be inconvenient. Thus we decided to change PyCSP in four major ways:

1. There should be only one channel type, any-to-any, and it must support external choice
2. The channels should support both input and output guards for external choice
3. PyCSP should provide a mechanism for joining and leaving a channel, with support for automatic poisoning of a network
4. The expressive power in Python should be used to make PyCSP look more like occam where possible

In the following we describe the new PyCSP library based on the above four design criteria. The result is a PyCSP implementation that follows these decisions and, in our own opinion, makes PyCSP programs even more readable and maintainable.

1. The New PyCSP

1.1. Processes

Just as in the original PyCSP, processes are wrapped in a process decorator, i.e. they are not merely implementations of a Process class as in JCSP or C++CSP. The advantage of this approach is partly that processes will be easily recognizable in the source code, and partly that it gives great flexibility for the PyCSP runtime environment to handle processes in different ways.
B. Vinter et al. / PyCSP Revisited
265
The constructor used is @process, and a hello world example could look like the following:

@process
def hello_world(msg):
    print "Hello world, this is my message " + msg
Usually one or more channel ends will be part of the parameters for a process. Defining a process as above will not instantiate or execute any code: it is simply defined as a process to be used in a network at a later time.

1.2. Process Sets

Once a process is defined, a set of processes may be instantiated and executed using the Parallel or Sequential constructs, similar to the old version. However, in order to accommodate variable-size networks, a process set may now include lists of processes as well as individual processes.

Parallel(source(), [worker() for i in range(10)], sink())
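The mixing of scalars and lists of processes can be made concrete with a small sketch. This is plain Python, not PyCSP code: the Parallel function, flatten helper and the lambda "processes" are all invented for illustration, and the sketch runs the callables sequentially rather than concurrently.

```python
# Toy sketch of how Parallel might flatten mixed scalars and lists of
# processes before running them (illustrative only, not PyCSP's code).
def flatten(plist):
    out = []
    for p in plist:
        if isinstance(p, list):
            out.extend(flatten(p))   # recurse into nested lists
        else:
            out.append(p)
    return out

def Parallel(*processes):
    # Real PyCSP runs these concurrently and returns when all have
    # terminated; here we just run the flattened callables in order.
    for p in flatten(list(processes)):
        p()

log = []
Parallel(lambda: log.append("source"),
         [lambda i=i: log.append("worker%d" % i) for i in range(3)],
         lambda: log.append("sink"))
print(log)   # prints: ['source', 'worker0', 'worker1', 'worker2', 'sink']
```

The point of the sketch is only the argument shape: one scalar, one list built by a comprehension, one scalar, exactly as in the example above.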
In the above example source, worker and sink have all been defined as processes, and the parallel construct will run one source, ten workers and one sink process in parallel, returning once all processes have terminated. Naturally the example makes little sense without the use of channels for communication; these will be introduced below. Apart from the support for mixing scalars and vectors of processes, the Parallel and Sequential constructs work as in the previous version and should be intuitive to anybody with any CSP experience.

1.3. Channels

PyCSP originally based much of its design on JCSP, continuing the use of specialized channel types: One2One, One2Any, Any2One and Any2Any. The type names designate how many writer and reader processes are allowed to be attached to the respective channel ends. The main reason for the specialized channel types was that the implementation of the Alternative construct, which allowed external choice, was based on the JCSP version and placed strict limitations on the use of channels: only one process could safely use an Alternative construct with a given channel end. To safeguard against misuse, only the reading end of channel types that were restricted to one reader could be used as guards in an external choice. Limitations such as these can be cumbersome to work around when designing your CSP application, and even more so for newcomers to PyCSP.

1.3.1. New Channel Type

There is only one channel type in the new PyCSP. The channel is similar to the previous Any2Any channel, but with the difference that both input and output channel ends support external choice. The use of external choice is described in section 1.4. Retrieving channel ends for use in processes has also changed in PyCSP. Previously, a programmer would grab a channel end by calling the read() or write() method of the channel. This has been replaced with the channel.reader() and channel.writer() functions, which also have a role in channel poisoning, described below.
As an experiment, a shorthand for channel.reader() and channel.writer() is introduced as -channel and +channel; whether this more compact notation introduces more confusion than it is worth is left to future observations.
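The end-retrieval idea can be mimicked in plain Python. The sketch below is a toy stand-in, not the PyCSP implementation: it only models the notion that reader()/writer() hand out callable channel ends, and it assumes the unary shorthand is realised via __neg__/__pos__ (an assumption; the paper does not say how it is implemented). The buffered queue is a simplification, since a real PyCSP channel is synchronous.

```python
try:
    import queue            # Python 3
except ImportError:
    import Queue as queue   # Python 2

class ToyChannel(object):
    """Toy stand-in for a PyCSP channel: buffered, not synchronous."""
    def __init__(self):
        self._q = queue.Queue()

    def reader(self):
        # In real PyCSP this also joins the channel for retire-counting.
        return lambda: self._q.get()

    def writer(self):
        return lambda msg: self._q.put(msg)

    # The experimental shorthand: -chan and +chan.
    def __neg__(self):
        return self.reader()

    def __pos__(self):
        return self.writer()

chan = ToyChannel()
cout, cin = +chan, -chan
cout("Hello world")
print(cin())   # prints: Hello world
```

Note that a channel end is just something callable: writing is cout(msg) and reading is cin(), matching the call style used in the examples below.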
1.3.2. Channel Poison

The concept of poisoning channels with the purpose of shutting down an application was introduced in C++CSP [7] and later investigated in some detail by Bernhard Sputh [8]. A channel is poisoned, and all subsequent reads or writes on this channel will throw an exception. This exception can be caught and used as a shut-down procedure, or just to shut down that single channel. In the following example we create two processes, source and sink, and a channel to connect them. The source process finally poisons the channel to terminate the network, which will happen since the sink process does not catch the exception.

@process
def source(chan_out):
    for i in range(10):
        chan_out("Hello world")
    poison(chan_out)

@process
def sink(chan_in):
    while True:
        print chan_in()

chan = Channel()
Parallel(source(chan.writer()), sink(chan.reader()))
Since all channels now support multiple readers and writers it is easy to add more readers and writers:

Parallel(source(chan.writer()), source(chan.writer()), source(chan.writer()),
         source(chan.writer()), source(chan.writer()),
         sink(chan.reader()), sink(chan.reader()), sink(chan.reader()),
         sink(chan.reader()), sink(chan.reader()))

or

Parallel([source(chan.writer()) for i in range(5)],
         [sink(chan.reader()) for i in range(5)])
Both versions produce five source and five sink processes; however, the created network will not do what the user may intuitively think it does. One of the sources is bound to finish first, and it will then poison the channel, which will terminate the network before all the expected messages have been printed. The problem is extremely common in producer-consumer class applications, and users end up with complex solutions for terminating the network. To address this we introduce a poison mechanism similar to reference counting. Creating channel ends and retiring from them updates a counter of how many readers or writers a channel has, and the leave operation may perform automatic poisoning when no readers or no writers are left. The reader() and writer() methods automatically join the respective ends of a channel, returning a unique reference to that channel end. A new function, retire(), is used to leave a channel end. All subsequent requests to this channel end reference will raise an exception. When all readers or writers have retired from a channel, the other end of the channel is also retired. This is similar to how poison is propagated in the previous versions of PyCSP, but with one important difference: with a poisoned channel, any reference to that channel will trigger a ChannelPoisonException, which is caught in the Process class that wraps all PyCSP processes. The exception handler then poisons all the other channels that were passed to the
process upon initialization. With a retired channel, the ChannelRetireException is thrown instead and the other channel ends are retired rather than poisoned; implementation-wise the two are identical apart from the name of the exception that is raised. This can remove some potential race conditions when terminating networks, as seen in the Monte Carlo Pi example in section 3 and in the example below. The following code demonstrates how the retire expression can be used instead of the poison expression. The network will now be poisoned by the last source process to finish, rather than the first. This feature hugely simplifies many networks.

@process
def source(chan_out):
    for i in range(10):
        chan_out("Hello world")
    retire(chan_out)
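The reference-counting behaviour behind retire can be sketched in plain, single-threaded Python. This is a toy model of the described semantics, not PyCSP code: each writer() call bumps a counter, retire() decrements it, and only when the last writer has retired does the reading end start raising ChannelRetireException (once the buffer is drained).

```python
class ChannelRetireException(Exception):
    pass

class ToyRetireChannel(object):
    """Toy model of retire-counting; real PyCSP channels are synchronous."""
    def __init__(self):
        self._writers = 0
        self._retired = False
        self._buf = []

    def writer(self):
        self._writers += 1          # a new writer joins the channel
        def write(msg):
            if self._retired:
                raise ChannelRetireException()
            self._buf.append(msg)
        write.retire = self._writer_retired
        return write

    def _writer_retired(self):
        self._writers -= 1
        if self._writers == 0:
            self._retired = True    # last writer gone: channel retires

    def reader(self):
        def read():
            if self._buf:
                return self._buf.pop(0)
            if self._retired:
                raise ChannelRetireException()
            raise RuntimeError("toy model cannot block; no data yet")
        return read

def retire(chan_end):
    chan_end.retire()

chan = ToyRetireChannel()
w1, w2 = chan.writer(), chan.writer()
r = chan.reader()
w1("a"); retire(w1)      # first writer leaves: channel stays open
w2("b"); retire(w2)      # last writer leaves: channel is now retired
print(r() + " " + r())   # prints: a b
```

The key difference from poison is visible in the driver lines: retiring w1 does not shut anything down; only when w2 also retires does the channel close, so every buffered message is still delivered.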
1.4. External Choice

One of the criticisms that the original PyCSP attracted was the way external choice was implemented, which had more in common with UNIX socket programming using select than with the more compact occam ALT operation. After executing an external choice (Alternative), you were required to read from the selected channel; failing to do so would break the rules for the choice construct in CSP. Thus we decided to simplify the usage of Alternative by combining select with a custom-defined action on the guard, similar to the occam ALT. Based on this, we introduce a new choice named Alternation. Alternation has changed significantly from Alternative, partly to make it more like occam, partly to support output guards. A guard set is now represented as a list of Python dictionaries whose keys can be channels from which to read, or two-tuples where the first entry is a channel and the second the value that should be written to that channel. The value of each dictionary entry is a function of type choice which may be executed if the guard becomes true. If the guard is an input guard then the choice function will always have the parameter __channel_input available, which is the value that was read from the channel. Alternation also supports other guard types, inheriting from a common Guard class. Alternation has two calls:

• Execute – waits for a guard to complete and then executes the associated choice function, similar to the occam ALT instruction.
• Select – returns a two-tuple: the guard that was chosen by Alternation and, if the guard was an input guard, the message that was read. This is equivalent to the original Alternative.

Note that the execute call in Alternation always performs the guard that was chosen, i.e. channel input or output is executed within the alternation, so even the empty choice, or a choice function whose results are simply ignored, still performs the guarded input or output.
The code that is executed within a guard may be specified in two ways: either as a function defined using a choice decorator, similar to processes, or as a string containing code to be executed. The latter is easy to use but becomes quite slow, since runtime compilation is required. A choice function is defined as follows:

@choice
def action(__channel_input=None):
    print __channel_input
It is not possible to change the name of __channel_input in a choice function since it is passed as a keyword argument when it is the result of a selected input guard. Once the choice is
defined, a process may perform an alternation on a set of channel ends. In the following example the same guarded code, action, is called independently of which channel becomes ready. This is an option but naturally not a requirement.

@process
def par_reader(cin1, cin2, cin3, cin4):
    Alternation([
        {cin1: action()},
        {cin2: action()},
        {cin3: action()},
        {cin4: action()}
    ]).execute()
An action might alternatively be passed as a string. This string is then evaluated with a copy of the current namespace. All mutable types can be updated from the evaluation of this string; in Python, the list, dict and set types are built-in mutable types.

@process
def counter(cin0, cin1):
    try:
        cnt = [0, 0]  # use mutable type
        while True:
            Alternation([
                {cin0: 'cnt[0] += 1'},
                {cin1: 'cnt[1] += 1'}
            ]).execute()
    except ChannelPoisonException:
        print 'Counted:', cnt
Guards are prioritized in the order they occur in the guard list, while guards within a dictionary are unordered. This gives us the option to model both an ordinary external choice and a prioritized external choice. It is important to note that priority only makes real sense when more than one guard is ready as the alternation is entered; which guard is woken first when no guard is immediately available may come down to a race condition or to the priority order of another guard statement. Ordinary external choice is obtained by a list with just one dictionary holding the guards. The entries in the dictionary are then treated in a non-prioritized way:

[{cin1: action(), cin2: action(), cin3: action()}]
On the other hand, prioritized external choice is obtained by providing a list of dictionaries with guards. These are then prioritized in the order the dictionaries appear in the list.

[
    {cin1: action()},
    {cin2: action()},
    {cin3: action()}
]
It is entirely possible to mix the two models, i.e. a prioritized list of dictionaries of non-prioritized guards. This option should only be used for special purposes.
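The priority rule can be sketched in a few lines of plain Python (an illustrative model, not the PyCSP implementation): the alternation scans the list tier by tier, and each dictionary forms one unordered tier.

```python
def choose(guard_tiers, ready):
    """Pick a guard from the first tier that contains a ready guard.

    guard_tiers: list of dicts mapping guard -> action, as in Alternation.
    ready: the set of guards that could complete right now.
    """
    for tier in guard_tiers:
        # All entries within one dict have equal priority.
        candidates = [g for g in tier if g in ready]
        if candidates:
            return candidates[0]
    return None

# Prioritized: 'cin1' wins even though both guards are ready.
print(choose([{'cin1': None}, {'cin2': None}], {'cin1', 'cin2'}))  # cin1
# Single tier: whichever guard is ready may be chosen.
print(choose([{'cin1': None, 'cin2': None}], {'cin2'}))  # cin2
```

The guard names are placeholders; in PyCSP the keys would be channel ends or (channel, value) tuples.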
PyCSP provides four built-in guard types to use with external choice. The first three of them are well known to the CSP community:

• Channel input
• Timeout – a counter relative to the current time; when it expires, the guard becomes true and allows the alternation to complete
• Skip – always true, and often used to define a default alternative
• Channel output

The fourth is new in PyCSP, although thoroughly discussed over the years and previously seen in Communicating Java Threads [9]. It is well understood by most programmers who use process algebras that output guards are not needed from a CSP point of view, and one may with relative ease construct equivalences for any type of output guard using only input guards. However, output guards are convenient for the user of PyCSP, and the equivalences are hard to construct for users who are not professional programmers; thus we provide the output guard as a primitive in PyCSP. All supported guard types can be interrupted by channel poisoning or retiring. PyCSP channels may be guarded at both ends, i.e. an output guard can be matched by an input guard. The following code example shows how non-blocking writes and input with timeouts can be modelled using the new Alternation construct:

# Non-blocking write
Alternation([
    {(cout, datablock): None},      # Try to write to a channel
    {Skip(): "print 'skipped!'"}    # Skip the alternation
]).execute()

# Input with timeout
Alternation([
    {cin: "print __channel_input"},
    {Timeout(seconds=1): "print 'timeout!'"}
]).execute()
2. Implementation

This section introduces only highlights of the implementation. An in-depth description of the implementation details of PyCSP may be found in [6]. The only non-trivial implementation detail is the support for output guards and channels with multiple processes at either end. The implementation is quite complex and uses more than a hundred lines of Python code. The overall design is based on each alternation being represented by a request structure, called a handle in the pseudocode below, that includes a lock which ensures mutual exclusion. When a new alternation is activated it traverses the guards in the choice list by priority, and for each guard it looks for a waiting handle that matches the handle of the alternation, i.e. a read matches a write and vice versa. If no match is found, the handle is added to the set of waiting handles for that channel. Please note that the pseudocode is heavily simplified and the actual implementation relies on a global ordering of events to avoid livelocks; for details refer to [6].

handle = new_request_handle()
for guard in choice:
    lock(guard.channel)
    if handle matches a registered_handle in guard.channel:
        perform communication
        make_active(handle, registered_handle)
    else:
        guard.channel.registered_handles.add(handle)
    unlock(guard.channel)
waitfor active(handle)
This is the procedure for every possible channel communication. Whenever a match is tried, two locks are required: one owned by the reading end and one owned by the writing end. In the case of alternation, this lock is shared between all guards to ensure the integrity of the alternation. A diamond design where every process alternates on an input and an output end could look similar to the example code below.

@process
def P(id, c1, c2):
    while True:
        Alternation([{(c1, True): None, c2: None}]).select()

c = [Channel(str(i)) for i in range(4)]
Parallel(P(1, c[0].writer(), c[1].reader()),
         P(2, c[1].writer(), c[2].reader()),
         P(3, c[2].writer(), c[3].reader()),
         P(4, c[3].writer(), c[0].reader()))
Without acquiring the locks in a consistent order, this code eventually results in a deadlock: two processes have each acquired one of the alternation-owned locks and are waiting to acquire the other, in opposite order. To acquire the locks in order, we always acquire the lock with the lowest memory address first. This ensures the same lock order for all processes. We synchronize input and output guards without the Oracle process used in JCSP [10]. The Oracle process was introduced in JCSP to handle external choice on barriers and output guards; its purpose is to ensure that an offer is still active when it is matched. The new PyCSP views all communication requests as offers, and every offer is protected by an individual lock. One lock per offer eliminates the need for an Oracle process, because it is guaranteed that an offer is only matched while it is active.
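The address-ordering rule can be demonstrated with plain Python locks, using id() as a stand-in for the memory address. This is an illustrative sketch of the idea only, not PyCSP's internals:

```python
import threading

def acquire_pair(lock_a, lock_b):
    # Always take the lock with the lower id() first, so every thread agrees
    # on one global acquisition order and no deadlock cycle can form.
    first, second = sorted((lock_a, lock_b), key=id)
    with first:
        with second:
            pass   # critical section: perform the matched communication

l1, l2 = threading.Lock(), threading.Lock()
# Two threads grab the same pair in opposite argument order; with the
# ordering rule this cannot deadlock.
t1 = threading.Thread(target=lambda: [acquire_pair(l1, l2) for _ in range(1000)])
t2 = threading.Thread(target=lambda: [acquire_pair(l2, l1) for _ in range(1000)])
t1.start(); t2.start()
t1.join(); t2.join()
print('no deadlock')
```

Removing the sorted() call reintroduces exactly the cyclic wait described above.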
3. Examples

Some of the changes in PyCSP concern either performance and implementation or the purely syntactic presentation of the concepts. In the following, we show two examples that motivate the three semantic changes that have been introduced: retire as an alternative to channel poisoning, output guards, and support for alternation with channels that have multiple readers and/or writers. The purpose of the examples is to demonstrate why PyCSP becomes easier to use after the introduced changes.

3.1. Monte Carlo Pi

The original PyCSP did not have the retire feature, which meant that most producer-consumer programs tended to look like the example in figure 1.
Figure 1. Producer-consumer program forwarding termination criteria between producer and consumer.
from pycsp import *
from random import random

@process
def producer(term_out, job_out, bagsize, bags):
    term_out(bags)
    for i in range(bags):
        job_out(bagsize)
    poison(job_out)

@process
def worker(job_in, result_out):
    try:
        while True:
            cnt = job_in()  # Get task
            sum = reduce(lambda x, y: x + (random()**2 + random()**2 < 1.0),
                         range(cnt))
            result_out((4.0 * sum) / cnt)  # Forward result
    except ChannelPoisonException:
        pass  # When done, _don't_ forward poison

@process
def consumer(term_in, result_in):
    cnt = term_in()  # Get number of results
    sum = 0
    for i in range(cnt):
        sum += result_in()  # Get result
    print sum / cnt  # We are done - print result

jobs = Channel()
results = Channel()
term = Channel()
Parallel(producer(term.writer(), jobs.writer(), 1000, 10000),
         [worker(jobs.reader(), results.writer()) for i in range(10)],
         consumer(term.reader(), results.reader()))
Listing 1. Implementation of a producer-consumer program with an explicit termination channel between producer and consumer.
A simple Monte Carlo simulation of the design in figure 1 is implemented in listing 1. This approach is simple to implement but suffers from two complexities: first, the termination criterion (the number of bags) must be sent from the producer to the consumer, bypassing the workers. Second, the workers must explicitly avoid forwarding the channel poison in
the network, as the consumer would otherwise die before all results are received and processed. The complexity of the solution grows even further if the setup has multiple producers or consumers. With the new retire operation, termination of the network can be handled in a more straightforward way. When the producer retires the job_out channel end, the channel is not poisoned; instead the retire operation is forwarded to the other end of the channel in a similar way to poisoning. The main difference is that when the channel is retired and job_in() terminates a worker process, the worker's channels are retired rather than poisoned. This delays termination propagation along those channels until all workers have retired, so the consumer is not prematurely terminated. Figure 2 shows the new network, which no longer needs to forward a termination criterion between the producer and consumer process. Multiple producers can also be plugged into the network without changing more than the parallel construct. Note that the consumer in listing 2 catches a ChannelRetireException, which allows it to terminate cleanly and print out the results before terminating.
Figure 2. Producer-consumer program using retire, avoiding termination criteria forwarding between producer and consumer.

from pycsp import *
from random import random

@process
def producer(job_out, bagsize, bags):
    for i in range(bags):
        job_out(bagsize)
    retire(job_out)

@process
def worker(job_in, result_out):
    while True:
        cnt = job_in()  # Get task
        sum = reduce(lambda x, y: x + (random()**2 + random()**2 < 1.0),
                     range(cnt))
        result_out((cnt, sum))  # Forward result

@process
def consumer(result_in):
    cnt, sum = 0, 0
    try:
        while True:
            c, s = result_in()  # Get result
            cnt, sum = cnt + c, sum + s
    except ChannelRetireException:
        print 4.0 * sum / cnt  # We are done - print result
jobs = Channel()
results = Channel()
Parallel(producer(jobs.writer(), 1000, 10000),
         [worker(jobs.reader(), results.writer()) for i in range(10)],
         consumer(results.reader()))

Listing 2. Producer-consumer using retire, avoiding termination criteria forwarding between producer and consumer.
3.2. Branch-and-bound

Branch-and-bound algorithms in CSP are easy [11] but involve some complex decisions regarding how and when to update the bound variable. The challenge in the bound update is to balance communication against work: if the bound variable is updated too rarely, the parallel workers will perform more work than necessary; if the bound is updated too often, the result is too frequent communication. Basically, three approaches exist:

1. Update the bound only when a job is finished. Submitting a bound equals requesting a job.
2. Update the bound as soon as you find it. A special bound value identifies a job request.
3. Update the bound as soon as you find it. Jobs are requested independently.

The first approach is simple and a common choice; however, the infrequent update of the bound results in slower overall execution. The second approach is complex and requires parsing of the input to determine whether an incoming message requires an outgoing job. The third is easy but requires output guards. To keep this section from growing too large we only present the code required for receiving bound variables and passing on jobs. Solution 1 is the trivial case, where the master sends a new job back when receiving a result from a worker. We do not need an alternation in this case since all channels are any-to-any. When there are no more jobs to be executed, the master retires the job channel, which terminates workers trying to read from it. The master continues to receive results until all workers have retired, which retires the result channel. This in turn throws a ChannelRetireException in the master, which can be caught to print out the final result.

bound = 10e10
while jobs:
    next = jobs.pop()
    bid = results_in()
    bound = best(bid, bound)   # best is an optimization-specific function
    jobs_out((next, bound))

# Without retire the code becomes even more complex
retire(jobs_out)
try:
    while True:
        bound = best(results_in(), bound)
except ChannelRetireException:
    print bound
Listing 3. Solution 1.
Solution 2 allows workers to submit new bound variables before a job is finished by sending an update message over the request channel. This allows the bound variable to be updated faster and thus potentially reduces the total work that must be done. The solution is almost identical to solution 1, except that messages from workers are parsed to determine whether the message is an update. Updates must not trigger a blocking write of a new job to the jobs channel, so a job is only popped when one is actually sent. Termination is identical to solution 1.

bound = 10e10
while jobs:
    request, bid = results_in()
    if request == 'Update':        # Update means don't send a new job
        bound = best(bid, bound)   # best is an optimization-specific function
    else:
        jobs_out((jobs.pop(), bound))

# Without retire the code becomes even more complex
retire(jobs_out)
try:
    while True:
        bound = best(results_in()[1], bound)
except ChannelRetireException:
    print bound
Listing 4. Solution 2.
Solution 3 uses output guards to eliminate parsing of the incoming messages. Instead, the alternation accepts either an incoming result or an outgoing job to a worker. Once there are no more jobs, the solution terminates like the other solutions. It should be noted that this design, which is as simple as solution 1 and as efficient as solution 2, also provides simpler initialization of the workers, since they are not required to submit a bogus result to trigger the delivery of the first job.

# A Python limitation requires a mutable type here
my_locals = {
    'bound': 10e10,
    'next': jobs.pop()   # We require at least two jobs to start with!
}
while jobs:
    Alternation([{
        results_in:
            "my_locals['bound'] = best(__channel_input, my_locals['bound'])",
        (jobs_out, (my_locals['next'], my_locals['bound'])):
            "my_locals['next'] = jobs.pop()"
    }]).execute()

# Without retire the code becomes even more complex
retire(jobs_out)
try:
    while True:
        my_locals['bound'] = best(results_in(), my_locals['bound'])
except ChannelRetireException:
    print my_locals['bound']
Listing 5. Solution 3.
If one wishes the workers to update their knowledge of the bound, this may easily be done in solution 3 by adding a dedicated channel for propagating the bound and letting the workers do a blocking input from that channel as frequently as desired. The server then adds another output guard that is at any time ready to write the best known bound.

4. Future Work

The new version of PyCSP provides a very convenient means of writing concurrent applications for non-computer scientists, allowing them to use CSP for parallel and concurrent programming. Initial response to the new version has been quite positive and we thus plan to continue the work. An extension to alternation in JCSP is the concept of fairSelect, which can be used to avoid starvation. This would also be of interest to PyCSP users, probably as the default, with the truly random alternation becoming an option for special purposes. Network construction is still fairly complex in PyCSP, and the only improvement that the new version offers is the option of mixing single processes and lists of processes in one Parallel constructor. We are working on a library of network constructors that will allow users to easily specify networks of processes in rings, meshes, fully interconnected topologies and other common process-oriented design patterns. The previous version of PyCSP was extended with a module that allowed processes to be executed on Grid, provided that the channel communication can be represented as a synchronous event, i.e. input; execute; output. Grid-enabled processes cannot support all channel communications: alternation, or patterns such as input; execute; input; execute; output, cannot be used, but classic client-server patterns fit well with Grid execution. This feature is desired for very demanding jobs and would be relevant to reintroduce in the new version of PyCSP. While the type indifference of channels in PyCSP is highly praised by students, there are scenarios where type matching is equally attractive.
Future plans include adding support for type-checking channels.

5. Conclusions

The original PyCSP borrowed heavily from JCSP to get semantics and functionality correct while still attempting to make the solution native to Python. It was quite well received, especially amongst students and scientists who often find Python a productive programming environment. After exposing more than 200 students to PyCSP, we did however receive some negative feedback. One of the central complaints was about the many channel types, and especially the hardship of changing between them in an existing application. Another frequent complaint was the lack of support for output guards and for channels with multiple readers and/or writers in alternation. In addition to the feedback from the users, the authors identified two shortcomings in the original version of PyCSP: first, students frequently demonstrated race conditions when terminating a network by use of poisoning, and second, it was desirable to make PyCSP look more like occam. The complaints and identified shortcomings resulted in an evaluation that confirmed the need for the following major changes to PyCSP. All channels are now any-to-any, which greatly simplifies design changes since a user may add more readers or writers to a channel that previously had only one. Since external choice is central to CSP, these any-to-any channels are naturally supported in the alternation implementation of external choice. PyCSP external choice now supports output guards in addition to input guards; this works with multiple readers and writers on a channel. The use of output guards is a heavily debated issue in CSP, as they are clearly not needed and are not trivial to implement. However, it is
evident that the users of PyCSP find output guards a very convenient feature, and considerable work has been put into supporting output guards in the alternation implementation in PyCSP. External choice has also been modified to more closely mimic occam, so that a guard and the associated code can be expressed in one statement. This brings PyCSP much closer to conventional CSP than the previous model, where a ready guard was first identified and then read from. In order to reduce the risk of race conditions when using poison to terminate a CSP network, this version of PyCSP introduces the concept of retirement from a channel. When all processes on one end of a channel retire their channel ends, the channel becomes retired. The effect is that the propagation of the retire signal is activated upon the termination of the last process at a given channel end, rather than the first as with the poison operation. Overall, the changes to PyCSP are well integrated, and we believe that using PyCSP is now easier for unsophisticated users than the previous version was. The newest version may be found as PyCSP on Google Code [12].

References

[1] John Markus Bjørndalen, Brian Vinter, and Otto Anshus. PyCSP - Communicating Sequential Processes for Python. In A.A. McEwan, S. Schneider, W. Ifill, and P. Welch, editors, Communicating Process Architectures 2007, pages 229–248, July 2007.
[2] C.A.R. Hoare. Communicating Sequential Processes. Communications of the ACM, 21(8):666–677, August 1978.
[3] Communicating Sequential Processes for Java. http://www.cs.kent.ac.uk/projects/ofa/jcsp/.
[4] Jim Moores. Native JCSP: the CSP-for-Java library with a Low-Overhead CSP Kernel. In P.H. Welch and A.W.P. Bakkers, editors, Communicating Process Architectures 2000, volume 58 of Concurrent Systems Engineering, pages 263–273. WoTUG, IOS Press (Amsterdam), September 2000.
[5] Peter H. Welch. Process Oriented Design for Java: Concurrency for All. In H.R.
Arabnia, editor, Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA'2000), volume 1, pages 51–57. CSREA, CSREA Press, June 2000.
[6] Rune Møllegaard Friborg, John Markus Bjørndalen, and Brian Vinter. Three Unique Implementations of Processes for PyCSP. In Communicating Process Architectures 2009. WoTUG, IOS Press.
[7] Neil C.C. Brown and Peter H. Welch. An Introduction to the Kent C++CSP Library. Communicating Process Architectures 2003, September 2003.
[8] Bernhard H.C. Sputh and Alastair R. Allen. JCSP-Poison: Safe Termination of CSP Process Networks. Communicating Process Architectures 2005, September 2005.
[9] Gerald H. Hilderink, Jan F. Broenink, Wiek Vervoort, and André W. P. Bakkers. Communicating Java Threads. In André W. P. Bakkers, editor, Proceedings of WoTUG-20: Parallel Programming and Java, pages 48–76, March 1997.
[10] Peter H. Welch, Neil C.C. Brown, Jim Moores, Kevin Chalmers, and Bernhard Sputh. Integrating and Extending JCSP. In Steve Schneider, Alistair A. McEwan, Wilson Ifill, and Peter H. Welch, editors, Communicating Process Architectures 2007, volume 65 of Concurrent Systems Engineering, pages 349–370, Amsterdam, The Netherlands, July 2007. WoTUG, IOS Press.
[11] Peter H. Welch and Brian Vinter. Cluster Computing and JCSP Networking. In James Pascoe, Roger Loader, and Vaidy Sunderam, editors, Communicating Process Architectures 2002, pages 203–222, September 2002.
[12] PyCSP distribution. http://code.google.com/p/pycsp.
Communicating Process Architectures 2009 P.H. Welch et al. (Eds.) IOS Press, 2009 © 2009 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-60750-065-0-277
Three Unique Implementations of Processes for PyCSP

Rune Møllegaard FRIBORG a,*, John Markus BJØRNDALEN b and Brian VINTER a
a Department of Computer Science, University of Copenhagen
b Department of Computer Science, University of Tromsø

Abstract. In this work we motivate and describe three unique implementations of processes for PyCSP: process, thread and greenlet based. The overall purpose is to demonstrate the feasibility of Communicating Sequential Processes as a framework for different application types and target platforms. The result is a set of three implementations of PyCSP with identical interfaces, to the point where a PyCSP developer need only change which implementation is imported to switch to any of the other implementations. The three implementations have different strengths: processes favor parallel processing, threads favor portability, and greenlets favor many processes with frequent communication. The paper includes examples of applications in all three categories.

Keywords. Python, CSP, PyCSP, concurrency, threads, processes, co-routines
Introduction

The original PyCSP [1] implemented processes as threads, motivated by an application domain with scientific users and the assumption that these applications would spend most of their time in external C calls. While the original PyCSP was well received, users often aired two common complaints. First and foremost, programmers were disappointed that pure Python applications would not show actual parallelism on shared-memory machines, most frequently multi-core machines, because of Python's Global Interpreter Lock. The second common disappointment was the limited number of threads supported, typically an operating-system limitation on the number of threads per process, and the overhead of switching between the threads. In this paper we present a new version of PyCSP that addresses these issues using three different implementations of its concurrency primitives.

PyCSP

The PyCSP library presented in this paper is based on the version of PyCSP presented in [2], which we believe reduces the complexity for the programmer significantly. It is a new implementation of CSP constructs in Python that replaces the original PyCSP implementation from [1]. This new PyCSP uses threads like the original PyCSP, but introduces four major changes and uses a better and simpler approach to handle the internal synchronization of channel communications. The four major changes are: simplification to one channel type, input and output guards, automatic poisoning of CSP networks, and making the produced Python code look more like occam where possible.

* Corresponding Author: Rune Møllegaard Friborg, Department of Computer Science, University of Copenhagen, DK-2100 Copenhagen, Denmark. Tel.: +45 3532 1421; Fax: +45 3521 1401; E-mail: [email protected].
R.M. Friborg et al. / Three Unique Implementations
When we refer to the threads implementation of PyCSP, we mean the new PyCSP presented in [2], referenced in this paper as pycsp.threads. This is used as our base to implement the alternatives to threading presented in this paper.

1. Motivation

We have looked at three underlying mechanisms for managing tasks and concurrency: co-routines, threads and processes. Each provides a different level of parallelism that comes with increasing overhead. All of them are available in different forms, and in this paper we define them as follows:

• Co-routines provide concurrency similar to user-level threads and are scheduled and executed by a user-level runtime system. One of their main advantages is very low overhead.
• Threads are kernel-level threads scheduled by the operating system; each has a separate execution stack, but they share a global address space.
• Processes are operating-system processes, and data can only be shared through explicit system calls.

When programming a concurrent application, it is necessary to choose one or several of the above. If the choice turns out to be wrong, the application needs to be rewritten. A rewrite is not a simple task, since the mechanisms are very different by design. Using Python and PyCSP, we want to simplify moving between the three implementations. The intended users are scientists who are able to program in Python and who want to create concurrent applications that can utilize several cores. Python is a popular programming language among scientists because of its simple and readable syntax and the many scientific modules available. It is also easy to extend with code written in C or Fortran and does not require explicit compilation.

1.1. Release of GIL to Utilize Multi-Core Systems

Normally PyCSP is limited to execution on a single core. This is a limitation within the CPython¹ interpreter and is caused by the Global Interpreter Lock (GIL) that ensures exclusive access to Python objects.
It is very difficult to achieve any speedup in Python from running multiple threads unless the actual computation is performed in external modules that release the GIL. Instead of releasing and acquiring the GIL in external modules, it is possible to use multiple processes that run separate CPython interpreters with separate GILs. In Python 2.6 we can use the new multiprocessing module [3] to handle processes, enabling us to compare threads to processes. The comparison in Table 1 shows the result of computing Monte Carlo pi in parallel using threads and processes.

Table 1. Comparison of threads and multiprocessing on a dual-core system with Python 2.6.2.

Workers   Threads   Processes
1         0.98s     1.01s
2         1.52s     0.57s
3         1.56s     0.54s
4         1.55s     0.54s
10        1.57s     0.56s
The GIL is to blame for the poor thread performance illustrated in Table 1. It is possible to obtain good performance with threads, but to do so you must compute in an external module and manually release the GIL. The unladen-swallow project [4] aims to remove the Global Interpreter Lock entirely from CPython.
¹ CPython is the official Python interpreter.
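The process-based approach from Table 1 can be reproduced in modern Python with the multiprocessing module. The sketch below (function names are our own, not from PyCSP) farms the Monte Carlo pi sampling out to worker processes, each running its own interpreter with its own GIL:

```python
from multiprocessing import Pool
from random import random

def sample(n):
    # Count the points that fall inside the unit quarter-circle.
    return sum(random()**2 + random()**2 < 1.0 for _ in range(n))

if __name__ == '__main__':
    # Two worker processes split the sampling; neither contends for a GIL.
    with Pool(2) as pool:
        hits = sum(pool.map(sample, [50000, 50000]))
    print(4.0 * hits / 100000)   # roughly 3.14
```

Replacing the Pool with threads would serialize the pure-Python loop in sample(), reproducing the flat thread timings in Table 1.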
1.2. Maximum Threads Available

On an operating system with a standard configuration, the maximum number of threads in a single application is limited to around 1000. In PyCSP, every CSP process is implemented as a thread, so there can be no more CSP processes than the maximum number of threads. We want to overcome this and give PyCSP the ability to handle CSP networks consisting of more than 100000 CSP processes, by using co-routines. We thus decided to address these issues by providing two additional implementations: one that provides real parallelism for multi-core machines, and one that does not expose the processes to the operating system. All versions should implement the exact same interface, and a programmer should need only to change the code that imports PyCSP to switch between the three different versions. Having a common interface for three implementations of PyCSP has another purpose besides being a fast and effective method for changing the concurrent execution platform: it is also an easy way for students to learn the consequences of running a specific PyCSP application with co-routines, threads or processes. PyCSP is often chosen by students in the Extreme Multiprogramming Class, a popular course at the University of Copenhagen teaching Communicating Sequential Processes [5].

2. Three Implementations of PyCSP

The three implementations of concurrency in PyCSP – pycsp.threads, pycsp.processes and pycsp.greenlets – are packaged together in the pycsp module. Although packaged together, these are completely separate implementations sharing a common API. It is possible to combine the implementations to produce a heterogeneous application with threads, processes and greenlets, but the support is limited, since the choice (Alternation) construct does not work with channels from separate implementations, and when communicating between implementations only channels from the processes implementation are supported. The primary purpose of packaging the three implementations in one module is to motivate the developer to switch between them as needed. A common API is used for all implementations, making it trivial to switch between them, as shown in Listing 1. A summary of the advantages and limitations of each implementation is given at the end of this section.
The primary purpose of packaging the three implementations in one module is to motivate the developer to switch between them as needed. A common API is used for all implementations, making it trivial to switch between them, as shown in Listing 1. A summary of advantages and limitations for each of the implementations is given at the end of this section.

# Use threads
from pycsp.threads import *

# Use processes
from pycsp.processes import *

Listing 1. Switching between implementations of the PyCSP API.
When switching to another implementation, the PyCSP application may execute very differently, as processes may be scheduled in another order and less fairly. Hidden latencies may also become more apparent when all other processes are waiting to be scheduled. In the following sections we present an overview of the implementations in order to understand how they affect the execution of a PyCSP application.

2.1. pycsp.threads

This implementation uses the standard threading module in Python, which provides kernel-level threads. All threads access the same memory space, so when communicating data only the reference to the data is copied. If the data is a mutable Python type it can be updated from multiple threads in parallel, though doing so is not recommended since it might cause unexpected data corruption and does not fit the CSP programming model. pycsp.threads is a remake of the original PyCSP [1]; details are presented in [2].
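The reference-passing behaviour of pycsp.threads, as opposed to the copy-on-communication behaviour of pycsp.processes described later, can be illustrated with plain Python. This is a hedged sketch with the channel transfer simulated in-process; no PyCSP code is involved.

```python
import pickle

data = {"values": [1, 2, 3]}

# pycsp.threads: the channel hands over the reference, so sender and
# receiver share the same object -- mutations are visible on both sides.
received_by_thread = data
received_by_thread["values"].append(4)
assert data["values"] == [1, 2, 3, 4]

# pycsp.processes: the message crosses an interpreter boundary via
# pickle, so the receiver gets an independent copy of the data.
received_by_process = pickle.loads(pickle.dumps(data))
received_by_process["values"].append(5)
assert data["values"] == [1, 2, 3, 4]   # the sender's copy is unchanged
assert received_by_process["values"] == [1, 2, 3, 4, 5]
```

The second half is exactly the serialization side effect noted for pycsp.processes in Section 2.4: received data can never alias the sender's object.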
2.2. pycsp.greenlets

Greenlets are lightweight (user-level) threads that all execute in the same kernel thread. A simple scheduler has been created to handle new greenlets, dying greenlets and greenlets that are rescheduled after blocking on communication. The scheduler has a simple FIFO policy and will always choose the first greenlet among those ready to run.

The PyCSP API has been extended with an @io decorator that can wrap blocking IO operations and run them in a separate thread. In pycsp.threads and pycsp.processes this decorator has no function, while in pycsp.greenlets an Io object is created. It is necessary to introduce this construct because the greenlets all run in one thread: if one greenlet blocks without yielding control to the scheduler, all greenlets in this thread are blocked. For threads and processes this is not a problem, because the operating system can yield on IO and use time slices to interrupt execution, thus scheduling other threads or processes. Greenlets are never forced to yield to another greenlet; instead, they must yield execution control themselves.

Invoking the __call__ method on the Io object creates a separate thread running the wrapped function. After the separate thread has been started, the greenlet yields control to the scheduler in order to schedule a new greenlet. Listing 2 provides an example of how to use @io. Without @io, the greenlet would not yield, thus blocking all other greenlets ready to be scheduled. This would serialize the processes, and the total runtime of Listing 2 would be around 50 seconds instead of the expected 10 seconds.

@io
def wait(seconds):
    time.sleep(seconds)

@process
def delay_output(msg, seconds):
    wait(seconds)
    print msg

Parallel([delay_output('%d second delay' % i, i) for i in range(1, 11)])

Listing 2. Yielding on blocking IO operations.
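The thread-delegation idea behind @io can be sketched with the standard threading module. This simplified decorator merely runs the wrapped call in a helper thread; the real pycsp.greenlets version additionally yields to the scheduler while the thread runs. All names here are illustrative, not PyCSP's internals.

```python
import threading

def io(func):
    """Simplified @io: run func in a helper thread, return its result.

    The real decorator yields control to the greenlet scheduler after
    starting the thread, so other CSP processes can run meanwhile.
    """
    def wrapper(*args, **kwargs):
        result = {}

        def run():
            result["value"] = func(*args, **kwargs)

        worker = threading.Thread(target=run)
        worker.start()
        # A greenlet would switch to the scheduler here instead of
        # blocking; join() stands in for "wait until rescheduled".
        worker.join()
        return result["value"]

    return wrapper
```

With this sketch, a blocking call such as time.sleep inside the wrapped function blocks only the helper thread, which is exactly the property the greenlets implementation depends on.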
Communicating on channels from outside a PyCSP greenlet process is not supported, since the scheduler must be running in a greenlet process to coordinate channel communication. This means that you cannot communicate with the main greenlet at the top-level environment. Calls to pycsp.greenlets functions from an @io thread will fail for the same reason. Calls to the pycsp.threads or pycsp.processes implementations should be wrapped with the @io decorator, as they could otherwise block the scheduler and cause a deadlock.

2.3. pycsp.processes

This implementation uses the multiprocessing module available in Python 2.6+. Processes started with the multiprocessing module are executed in separate instances of the Python interpreter. On systems supporting the UNIX system call fork, starting separate Python interpreters with a copy of all objects is trivial. On Microsoft Windows this is much more challenging for the multiprocessing module, since no equivalent of fork is available. The multiprocessing module simulates the fork system call by starting a new Python interpreter, loading all necessary modules, serializing / unserializing objects and initiating the requested
function. This is very slow compared to fork, but it works in the absence of a better alternative on Windows.

When an application is written in pure Python and PyCSP, it is now possible with pycsp.processes to utilize multi-core CPUs. In most cases a PyCSP application will run without any changes, but if the data communicated does not support serialization, the application will fail. An example of such data is an object containing pointers initialized by external modules; fortunately, this type of data is not very common in Python applications.

pycsp.processes uses shared memory pointers internally and must allocate everything before any processes are forked. For this reason, it might in extreme cases be necessary to tweak a set of constants for pycsp.processes. To do this, a singleton Configuration class is instantiated as shown in Listing 3. New constants must be set before any other use of pycsp.processes, since everything is allocated on first use.

from pycsp.processes import *
Configuration().set(PROCESSES_SHARED_CONDITIONS, 50)
Configuration().get(PROCESSES_SHARED_CONDITIONS)  # returns 50

Listing 3. Example of setting and getting a constant.
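The singleton behaviour relied upon by Configuration can be sketched as follows. This is a toy stand-in mirroring the usage in Listing 3; the real class in pycsp.processes knows the actual constants and allocation sizes.

```python
class Configuration:
    """Toy singleton mirroring the usage in Listing 3."""
    _instance = None

    def __new__(cls):
        # Every instantiation returns the same shared instance.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._settings = {}
        return cls._instance

    def set(self, key, value):
        self._settings[key] = value

    def get(self, key):
        return self._settings[key]
```

Because every call to Configuration() yields the same object, constants set before start-up remain visible wherever the class is instantiated later.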
Using this configuration class it is possible to change the size of shared memory and the number of shared locks and conditions allocated on initialization. The allocated shared memory is used as buffers for channel communication, which means that the size of data communicated on channels at any given time can never exceed the size of the buffer. The default size of the shared memory buffer is 100MB, but it can easily be increased by setting the constant PROCESSES_ALLOC_MSG_BUFFER.

2.4. Summary of Advantages and Limitations

The following is a summary of the advantages (+) and limitations (-) of the individual implementations before moving on to the Implementation and Experiments sections.

Threads:
+ Only references to data are passed by channel communication.
+ Other Python modules usually expect only threads.
+ Compatible with all platforms supporting Python 2.6+.
- Limited by the Global Interpreter Lock (GIL), resulting in very poor performance for code not releasing the GIL.
- Limited in the maximum number of CSP processes.

Greenlets:
+ More efficient switching between CSP processes, since context switches can be limited to the points where processes block. Performance does not decrease with more CSP processes competing for execution.
+ Very small footprint per CSP process, making it possible to run a large number of processes, limited only by the amount of memory available.
+ Fast channel communication (≈ 20μs).
- No utilization of more than one CPU core.
- Unfair execution, since execution control is only yielded when a CSP process blocks on a channel.
- Requires the developer to wrap blocking IO operations in an @io decorator to yield execution to another CSP process.
Processes:
+ Can utilize multiple cores without requiring the developer to release the GIL.
- Fewer processes possible than with pycsp.threads and pycsp.greenlets.
- Windows support is limited because of the lack of the fork system call.
- All data communicated is serialized, which requires the data type to be supported by the pickle module.
+ A positive side-effect of serializing data is that data is copied when communicated, making it impossible to edit the received data from the sending process.

3. Implementation

When processes communicate through external choice at both the reading and writing end, a number of challenges must be addressed to avoid live-lock and deadlock problems; this is well researched in [6,7,8]. The PyCSP solution introduces what we believe to be a new algorithm for this problem. The algorithm is very simple and quite fast in the common case.

Every channel has two queues associated with it: one for pending read-operations and one for pending write-operations. Every active choice (Alternation) is represented by a request structure. The request has a lock, to ensure mutual exclusion on changes to the request, a unique id, a status field, and the actual operation, i.e. read or write with associated data. When an Alternation is run, a reference to the request structure is added to the queue it belongs to, i.e. input-requests (IR) or output-requests (OR), on every channel in the choice. Then all requests are tested against all potentially matching requests on all involved channels. When a match is found, the state of the request structure is changed to Done to ensure that the request is matched only once. When the arbitration function comes across an inactive request structure, it is evicted from the queue.

def double_lock(req_1, req_2):
    if req_1.id < req_2.id:
        lock(req_1.lock)
        lock(req_2.lock)
    else:
        lock(req_2.lock)
        lock(req_1.lock)

Listing 4. The double lock operation in pseudocode.
Live-lock is avoided by using blocking locks only, so if a legal match exists it will always be found the first time it is available. Deadlock is avoided by using the unique id of a request to sort the order in which locks are acquired. Thus we have an operation, double lock (Listing 4), that acquires two individual locks in order and returns once both locks are obtained. If two threads attempt to lock the same requests, they will always do so in the same order and thus never deadlock.

for w in write_queue:
    for r in read_queue:
        double_lock(w, r)
        match(w, r)
        unlock(w, r)

Listing 5. The arbitration algorithm.
The arbitration algorithm in Listing 5 then performs the protected matching by acquiring locks with the double lock operation. For every Alternation, read or write action there is
exactly one request, and this request is always enqueued on the destination channel queues before the arbitration algorithm is run. This may seem unnecessarily expensive at first glance, but it is important to remember that if we do not enqueue the request before matching against potential matches, then there exists a scenario where a read-operation and a matching write-operation may be arbitrated in lock-step without detecting each other. An example of a correctly committed Alternation is shown in Figure 1.
Figure 1. Snapshot of synchronization with two channels and four communicating processes. Channel B has found a match between two request structures; one in the input request queue (IR) and one in the output request queue (OR). Next, channel A will match the two active requests on channel A’s request queues.
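Listings 4 and 5 can be fleshed out into a runnable sketch using threading.Lock. The Request class and the match step here are illustrative reconstructions from the description above, not PyCSP's actual code.

```python
import itertools
import threading

class Request:
    """A pending read or write: a lock, a unique id and a state."""
    _ids = itertools.count(1)

    def __init__(self, data=None):
        self.lock = threading.Lock()
        self.id = next(Request._ids)    # orders lock acquisition
        self.state = "active"
        self.data = data

def double_lock(req_1, req_2):
    # Listing 4: acquire both request locks in id order, so two
    # threads locking the same pair of requests never deadlock.
    first, second = (req_1, req_2) if req_1.id < req_2.id else (req_2, req_1)
    first.lock.acquire()
    second.lock.acquire()
    return first, second

def unlock(first, second):
    second.lock.release()
    first.lock.release()

def match(w, r):
    # Commit the communication once: both requests must still be active.
    if w.state == "active" and r.state == "active":
        r.data = w.data
        w.state = r.state = "done"

def arbitrate(write_queue, read_queue):
    # Listing 5: test every pending write against every pending read
    # under the protection of the double lock.
    for w in write_queue:
        for r in read_queue:
            first, second = double_lock(w, r)
            match(w, r)
            unlock(first, second)
```

Because match only fires while both states are "active", a request that has already been matched (state Done) can never be committed a second time, which is the invariant the algorithm depends on.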
The presented algorithm for handling synchronization in PyCSP is relevant for pycsp.threads and pycsp.processes, while pycsp.greenlets does not need it to ensure correctness. The algorithm is a main feature of the new PyCSP; descriptions of other features of pycsp.threads can be found in [2]. Next we focus on the implementation details of pycsp.processes and pycsp.greenlets.

3.1. pycsp.greenlets

For co-routines, the greenlet module [9] was chosen because it is a very small module, is easy to install, provides full control (no internal scheduler) and allows yielding from nested functions. Python's own generators, which make it possible to create a co-routine-like API, do not allow yielding from nested functions and would therefore not allow us to yield when blocked on a channel communication. Another option was to use Stackless Python [10] for our implementation. Stackless Python, from which the greenlet module was originally derived, has matured over the years. It is slightly faster than the greenlet module and allows a larger number of allocated co-routines. However, having to install an extra Python interpreter to make the co-routine implementation run was found unacceptable, leaving the greenlet module as the only remaining choice.

A limitation of co-routines is that everything runs in a single thread, which means that a blocking call will block all other co-routines as well. This is especially a problem for IO operations, since the blocking action might happen in a system call, which we are not able to detect in the Python environment. The @io decorator attempts to solve this by wrapping a function into a run method on an Io thread object. This Io thread object is created on-the-fly and yields execution to the scheduler after starting the thread. When the thread's run method finishes, the return value is saved and the calling co-routine is moved onto the scheduler's
next queue. Wrapping a function in an @io decorator introduces the overhead of starting and stopping a thread. We carried out a test to see whether this overhead could be reduced by using a thread worker pool. The overhead was found to be similar to the time needed to start and stop a thread, so the idea of a thread worker pool was abandoned.

The idea of delegating a blocking system call to a separate thread was presented by Barnes [11] for the Kent Retargetable occam-π Compiler. occam-π implements a set of channels, keyboard and screen, that can be used to communicate with processes reserved for these IO operations. This could also be an option for PyCSP, but it was decided that the @io decorator would provide more flexibility for the programmer.

The channel communication overhead is much lower for greenlets than for the other two implementations, because we can avoid the conditions and locks when synchronizing. To optimize for fast switching on channel communication, a central queue of blocked greenlets is not used when handling synchronization. Whenever a greenlet blocks on channel communication, it saves a self-reference together with the channel communication request. Since channel communication requests are located in queues on channels, these can be viewed as wait queues from which a request is matched with an offer for communication. It is then the responsibility of another greenlet that matches this channel communication request to place the blocked greenlet on the scheduler's next queue. The scheduler uses a simple FIFO policy, choosing the first element of the next queue for execution. The next queue is usually short, as most greenlets will be blocked on channel communication.

# Reschedule, without putting this process on either
# the next[] or a blocking[] list.
def wait(self):
    while self.state == ACTIVE:
        self.s.getNext().greenlet.switch()

Listing 6. Blocking and scheduling a new greenlet CSP process.
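The FIFO next-queue policy that the wait method in Listing 6 relies on can be modelled with plain Python. Generators stand in for greenlets in this toy round-robin model; the real PyCSP scheduler also manages channel wait queues, Timeout guards and @io threads.

```python
from collections import deque

class ToyScheduler:
    """FIFO next queue: always run the first ready 'process'."""

    def __init__(self):
        self.next = deque()

    def schedule(self, proc):
        self.next.append(proc)

    def run(self):
        trace = []
        while self.next:
            proc = self.next.popleft()      # FIFO: first ready element
            try:
                trace.append(next(proc))    # run until it yields
                self.next.append(proc)      # still active: requeue
            except StopIteration:
                pass                        # process finished
        return trace
```

Processes are interleaved strictly by queue position, so a process that yields goes to the back of the line, just as a greenlet placed on the next queue runs after those already queued.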
When switching, we switch directly from CSP process to CSP process without spending any time on switching to a scheduler process. The code in Listing 6 is the wait method, which is executed when a CSP process blocks on channel communication. The method is responsible for scheduling the next CSP process. The self.s attribute is a reference to the scheduler, which is implemented as a singleton class. If the next and new queues are empty, getNext() returns a reference to the scheduler greenlet, which is then activated. The scheduler greenlet investigates whether there are any current Timeout guards or @io threads active. If all queues are empty it terminates, since everything must have been executed.

3.2. pycsp.processes

Using processes instead of threads requires that we run separate Python interpreters. For fast communication we can choose among several existing inter-process communication techniques, including message passing, synchronization and shared memory. Which techniques are available, and how they are implemented, differs between platforms. In order to have cross-platform support, we construct the pycsp.processes implementation on top of the multiprocessing module available in Python 2.6. The multiprocessing module presents a uniform method of creating processes, shared values and shared locks. When Python objects are communicated through shared values, they are serialized using the pickle module [12]. Some Python objects cannot be serialized; shared values and locks are two examples. This requires us to initialize everything at startup, so that references can be passed to new processes as arguments. A singleton ShmManager class maintains all references to shared values and locks. This instance is automatically located in the memory address space of newly created processes.

Every channel instance requires a lock to protect critical regions, and every channel communication requires a condition linked to the channel request offered to processes, to ensure that this request is updated in a critical region and can be signaled when updated. This usage of locks and conditions can be a problem when there are many processes and channels: the total number of available locks and conditions in shared memory is much lower for the multiprocessing module than for the threading module. The solution was to let the ShmManager class maintain a small pool of shared conditions and locks. The size of the lock pool needs to be large enough to prevent a delay when entering a critical region. Likewise, the size of the condition pool should be large enough to avoid waking up too many false processes, causing an overhead in context switches. 20 locks and 20 conditions seem to be enough for most situations possible with pycsp.processes, though a small performance increase is possible in the micro benchmark experiments by using more conditions.

Sending data around in a CSP network requires a method to actually transfer data from one process to another. Since all references to shared memory have to be initialized and allocated at startup, a message buffer is allocated in shared memory. Unfortunately, Python only supports allocating shared memory through the multiprocessing module, so we have to handle the memory management in PyCSP by calling get and set methods on objects allocated using the multiprocessing module. A large shared string buffer is allocated and partitioned into blocks of a static size. To handle the allocation of the required number of blocks for a channel communication, and freeing them again afterwards, a dynamic memory allocator is implemented.
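The block allocator, whose strategy is described next, can be sketched in pure Python. This is an illustrative reconstruction with made-up names; the real allocator works on a shared-memory string buffer under a shared lock.

```python
class NextFitAllocator:
    """Free-list allocator over a buffer of fixed-size blocks."""

    def __init__(self, total_blocks):
        # init: one free entry covering the entire message buffer,
        # stored as (first block index, number of blocks).
        self.free = [(0, total_blocks)]

    def _find(self, nblocks):
        for i, (_, length) in enumerate(self.free):
            if length >= nblocks:
                return i
        return None

    def alloc(self, nblocks):
        # alloc: cut the needed blocks from the first entry large enough.
        i = self._find(nblocks)
        if i is None:
            self._coalesce()            # merge adjacent free entries
            i = self._find(nblocks)
            if i is None:
                raise MemoryError("message buffer exhausted")
        start, length = self.free[i]
        if length == nblocks:
            del self.free[i]
        else:
            self.free[i] = (start + nblocks, length - nblocks)
        return start

    def release(self, start, nblocks):
        # free: append an entry; coalescing is deferred until needed.
        self.free.append((start, nblocks))

    def _coalesce(self):
        # Merge neighbouring free entries into larger ones.
        self.free.sort()
        merged = []
        for start, length in self.free:
            if merged and merged[-1][0] + merged[-1][1] == start:
                merged[-1] = (merged[-1][0], merged[-1][1] + length)
            else:
                merged.append((start, length))
        self.free = merged
```

Deferring coalescing keeps the common case cheap: most allocations are served from the front of the free list, and the more expensive merge step only runs when a request cannot otherwise be satisfied.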
The memory allocator uses a simple strategy that resembles the next-fit strategy:

init: A list of free blocks is initialized with one entry that covers the entire message buffer.
alloc: Any size is allocated by searching the list of free blocks for an entry that has enough space. The needed blocks are then cut from this entry and an index to the first block is returned.
free: Allocated blocks are freed by appending an entry containing the index and size to the free blocks list.

Every new allocation fragments the message buffer into smaller sections. If at some point we cannot find a partitioned area large enough, a step combining free blocks is executed. This solution makes it possible to send both large and very small messages. If necessary, the buffer and block size can be tweaked using the Configuration().set() functionality. We do not expect this dynamic memory allocator to affect the performance of parallelism in general, even though the allocation of a buffer is protected by a shared lock. The amortized cost of allocating buffers is low, since most allocations will be able to allocate blocks from the first entry in the list of free blocks, and while the rarer and more expensive action of reassembling blocks introduces a delay, it will not affect the overall execution much. In the micro benchmarks (Section 5.1) and in the Mandelbrot experiment (Section 5.3) we successfully communicate small and larger data sizes.

4. Related Work

Communicating Sequential Processes (CSP) was defined by Hoare [5] in 1978, but it is during the last decade that we have seen numerous new libraries and compilers for CSP. Several implementations are optimized for multi-core CPUs, which are becoming the de-facto standard even for small desktop computers. occam-π [13] and C++CSP2 [14] are two CSP
implementations which stand out by being able to utilize multiple cores while using user-level threads for fast context switching. User-level threads are more efficient and provide greater flexibility than kernel-level threads. They exist only in user space and can be made to use very little memory. Because the scheduler is in user code and the operating system is not involved, it is possible to optimize the scheduling of threads to fit the internal priorities of the application. occam-π implements processes as user-level threads and uses a very robust and optimized scheduler that can handle millions of processes. The utilization of multiple cores is handled automatically by the scheduler and is described in detail in [15]. This differs from C++CSP2, where it is necessary to specify whether processes should run as user-level threads or kernel-level threads.

Several libraries exist for Python that enable the Python programmer to manage tasks or threads, but they do not enable the programmer to easily change from threads to co-routines. Among these libraries are Stackless Python [10], Fibra [16] and the multiprocessing module [3]; they provide an abstraction using the concepts of processes and channels, resembling a subset of the constructs available in the CSP algebra. Stackless Python is a branch of the standard CPython interpreter and provides very small and efficient co-routines (tasklets), bidirectional channels and a round-robin scheduler. Fibra is based on Python generators, which are similar to co-routines, but it is impossible to hide the fact that a co-routine is a Python generator, since the keyword yield is the only method to switch between generators. In Fibra, co-routines communicate through tubes by yielding values to a scheduler. The multiprocessing module in Python 2.6 provides a method of using operating system processes, shared memory and pipes for buffered communication.
Operating system processes are heavyweight, requiring a large amount of memory, but contrary to threads they are not affected by the Global Interpreter Lock (GIL). However, none of these libraries provide the functionality of the choice construct, which makes it possible to program with non-deterministic behaviour in the communication between processes.

5. Experiments

We have run three different experiments to show the strengths and weaknesses of the PyCSP implementations. The first experiment consists of two micro benchmarks: one shows how the implementations handle an increasing number of processes until reaching the maximum possible, and the other shows how well an implementation copes with an increasing number of concurrent communications in a network of static size. After the micro benchmarks, we generate primes using a simple PyCSP application, as a case where it is convenient to be able to switch from threads or processes to co-routines. Finally, a benchmark computing the Mandelbrot set is used to compare speedup on an 8-core system. The Mandelbrot set is computed twice using two different strategies, producing two very different speedup plots: one releases the Global Interpreter Lock (GIL) during computation by computing in an external module, and one computes using the numpy module [17]. All benchmarks are executed on a computer with 8 cores: two Intel Xeon E5310 Quad Core processors and 8 GB RAM, running Ubuntu 9.04.

5.1. Micro Benchmarks

The results of these micro benchmarks provide a detailed view of how the implementations behave when stressed. The benchmarks are designed to measure the channel communication time, including the time required to context switch. Extra unnecessary context switches may be added by the operating system, depending on the PyCSP implementation used.
Figure 2. Ring of variable size.
Figure 3. Micro benchmark measuring the channel communication time including the overhead of context switching for an increasing number of CSP processes.
Using the ring design in Figure 2, we run a benchmark that sends a token around a ring of increasing size. The ring benchmark was inspired by a similar micro benchmark in [15]. N elements are connected in a ring, and every element passes a token from the previous element to the next. This challenges the PyCSP implementations' ability to handle an increasing number of processes and channels. The time measurements do not include startup and shutdown time, and each measured run is divided by the size of the ring to compute an average channel communication time. The test system has been tweaked to allow a larger number of threads and processes than the default.

The results for our test system (Figure 3) show that we can reach 512, 16384 and 262144 CSP processes, depending on the PyCSP implementation used. It is obvious that pycsp.processes should only be used for applications with few CSP processes, because of the exponential increase in latency, though it is possible to configure pycsp.processes using Configuration().set(PROCESSES_SHARED_CONDITIONS, 50) and achieve marginally better performance. As expected, pycsp.greenlets is able to handle a large number of CSP processes with only a small decrease in performance.

Investigating the performance from a different perspective, we use four rings of static size
N and then send 1 to N-1 tokens around concurrently. In the previous benchmark there was only one communication at a time, which is a rare situation for an actual application. With this benchmark we see that pycsp.processes performs much better, since it can now utilize more cores. Based on the results in Figure 4, we can conclude that pycsp.processes has a higher throughput of channel communications than pycsp.threads when enough concurrent communications can utilize several cores.
Figure 4. Micro benchmarks measuring the average channel communication time including the overhead of context switching for an increasing number of concurrent tokens in four rings of size 8, 16, 32 and 64.
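The token-ring structure of these benchmarks can be imitated with standard-library threads and bounded queues standing in for PyCSP channel ends. This is a hedged analogue for illustration only; the paper's measurements are of PyCSP channels, not queue.Queue.

```python
import queue
import threading
import time

def ring_benchmark(n, rounds):
    """Send one token around a ring of n elements, rounds times,
    and return the average per-hop communication time in seconds."""
    chans = [queue.Queue(maxsize=1) for _ in range(n)]

    def element(i):
        for _ in range(rounds):
            token = chans[i].get()             # receive from predecessor
            chans[(i + 1) % n].put(token)      # pass to successor

    threads = [threading.Thread(target=element, args=(i,))
               for i in range(n)]
    for t in threads:
        t.start()
    start = time.perf_counter()
    chans[0].put("token")                      # initiate the token
    for t in threads:
        t.join()                               # token has circled 'rounds' times
    elapsed = time.perf_counter() - start
    return elapsed / (n * rounds)              # average per channel hop
```

As in the paper's benchmark, the total time is divided by the number of hops (ring size times rounds) to obtain an average channel communication time.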
Looking at the results for the four rings of size N in Figure 4, an interesting pattern is observed whenever the number of concurrent tokens comes close to N. For N-1 concurrent tokens, the performance of pycsp.threads is almost equal to the performance with one concurrent token. The reason for this behaviour is the blocking nature of CSP: when all processes but one have a token, only that one process is able to receive. This mimics the behaviour of the test with one token and explains why the results in Figure 4 are mirrored around the center.

From these micro benchmarks we can see that pycsp.threads performs consistently in both benchmarks. pycsp.processes does poorly in Figure 3, where the cost of adding more processes is high, but performs better in Figure 4, where a number of concurrent tokens are added. Finally, pycsp.greenlets has proved capable of fast switching and many processes, regardless of the number of concurrent tokens.

5.2. Primes

This is a simple and inefficient implementation of prime number generation, found in [18]. The CSP design of the implementation is shown in Figure 5. It adds one CSP process for every computed prime, which sets a limit on how many primes can be calculated using this design. The maximum number of primes equals the maximum number of CSP processes
or channels possible. The latency involved in spawning new CSP processes and performing context switches varies when swapping between threads, processes and greenlets.
Figure 5. Primes CSP design
Figure 6. Results of primes experiment.
We run a benchmark computing primes, plotting the runtime results in Figure 6. The processes implementation failed with the message "maximum recursion depth exceeded" after creating 90 processes. This is a limitation in the Python multiprocessing module, which only becomes apparent when spawning new processes from child processes. This primes benchmark does not compare with a simple implementation in pure Python, which would be orders of magnitude faster than the implementation using PyCSP. The benchmark is meant as a method to compare one aspect of the PyCSP implementations, and it shows why greenlets is an important alternative to threads and processes. Running for an entire day (86400 s) would produce ≈ 16000 primes using the threads implementation and ≈ 60000 primes using the greenlets implementation. Moreover, 16384 threads is close to the upper limit for threads, while greenlets has no real upper limit on the number of greenlets.
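The design in Figure 5 can be mimicked with plain Python generators, where each new filter stage plays the role of a spawned worker process. This is an illustrative analogue, not the PyCSP code from [18].

```python
def naturals(start=2):
    # Producer: the natural number generator.
    n = start
    while True:
        yield n
        n += 1

def sieve(prime, upstream):
    # Worker: skip multiples of this stage's prime, pass the rest on.
    for n in upstream:
        if n % prime != 0:
            yield n

def primes(count):
    stream = naturals()
    found = []
    while len(found) < count:
        prime = next(stream)           # first survivor is the next prime
        found.append(prime)
        # "Spawn a new worker": chain another filter onto the pipeline.
        stream = sieve(prime, stream)
    return found
```

As in the CSP design, each prime adds a pipeline stage, so the per-number work grows with the number of primes found, which is one reason the runtime in Figure 6 grows so steeply.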
5.3. Computing the Mandelbrot Set

This experiment is a producer-consumer-worker example that tests PyCSP's ability to utilize multiple cores. It produces the image in Figure 7 at a requested resolution. The image requires up to 5000 iterations for some pixels and is located in the Mandelbrot set at the coordinates:

xmin = -1.6744096758873175
xmax = -1.6744096714940624
ymin = 0.00004716419197284976
ymax = 0.000047167062611931696

The simple CSP design in Figure 7 communicates jobs from the producer-consumer to the workers using the Alternation in Listing 7. Workers can request and submit jobs in any order they like.

while len(jobs) > 0 or len(results) < jobcount:
    if len(jobs) > 0:
        Alternation([{
            workerIn: received_job,
            (workerOut, jobs[-1]): send_job
        }]).execute()
    else:
        received_job(workerIn())

Listing 7. Producer-Consumer: Delegating and receiving jobs.
Figure 7. The Mandelbrot CSP design and the computed Mandelbrot set.
The experiment is divided into two runs, which differ in the implementation of the worker process. One releases the GIL during computation by using the ctypes module [19] to call compiled code contained in an operating-system-specific dynamic library. Executing external code via ctypes is more involved, but it also provides a performance improvement over the other method, which uses the numpy module [17] to manipulate and compute on matrices. The numpy module is a package used for scientific computing and provides an N-dimensional array object, including tools to manipulate this array object. The numpy module also releases the GIL on every call, but at a much finer granularity than the coarse-grained release and acquire used with the ctypes module, so a larger overhead is expected for the numpy module. The results in Figure 8 clearly show that pycsp.processes is superior in this application, attaining a good speedup in both runs. It is interesting that pycsp.processes is able to compete with pycsp.threads when using the ctypes worker, since every communication in pycsp.processes carries the extra overhead of serializing data to a string format, allocating a message buffer, copying the string data to the message buffer, retrieving the string data from the message buffer, freeing the message buffer and finally deserializing the string data into a copy of the original data. As expected, we have no multi-core speedup at all from using pycsp.greenlets. We could have wrapped the computation in the @io decorator and
[Figure 8 plots omitted: two panels ("Ctypes" and "Numpy") showing speedup against 1-8 worker CSP processes for pycsp.threads, pycsp.processes and pycsp.greenlets.]
Figure 8. Speedup plots of computing the Mandelbrot set displayed in Figure 7. The resolution is 1000 ∗ 1000 and the work is divided in 100 jobs. The run time for the case with a single worker is used as the base for the speedup calculation and was 592.5 seconds for the numpy benchmark and 10.6 seconds for the ctypes benchmark.
gained a speedup for the ctypes benchmark, but this is not the purpose of the @io decorator and would encourage wrong usage of the new PyCSP library.

Based on the experiments performed, the three implementations have different strengths: pycsp.processes favors parallel processing; pycsp.threads favors portability and applications that release the GIL; and pycsp.greenlets favors many processes and frequent communication.

6. Conclusions

With the PyCSP version presented in this paper, any application written in Python and using PyCSP can change the concurrent execution model from threads to co-routines or processes just by changing which module is imported. Depending on the domain and application, a user can choose to circumvent the Global Interpreter Lock by using processes, provided that the application does not create more than the maximum number of processes allowed by the operating system. Alternatively, a user may want to speed up the communication time by a factor of ten by using greenlets. If the application is changed further and the user then wants to return to using threads, this is a simple task that does not require the user to transfer code changes to an older revision.

Using pycsp.processes it is now possible to utilize all cores of an 8-core system without requiring the computation to take place in an external module. This is important for programmers who want to utilize more cores when the performance of pycsp.threads is limited by the Global Interpreter Lock. Additionally, running more than 262144 processes in a single PyCSP application is made possible using pycsp.greenlets. This number is smaller than what is possible with occam-π [13] or C++CSP2 [14], but it does open up the possibility of developing more fine-grained CSP designs using PyCSP. PyCSP is available at Google Code under the project name pycsp [20].

6.1. Future Work

The obvious next step would be to create pycsp.net, a distributed version of PyCSP that connects processes by networked channels. pycsp.net would be required to be fully compatible with the current API, so that any PyCSP application can be transformed into a distributed application just by changing the imported module. Channels could be given names so that they could be registered on a nameserver and identified from different hosts. Using pycsp.net and running the Mandelbrot benchmark application from the Experiments section would allow us to utilize multiple machines. The producer-consumer would be
started on one host, and starting additional worker processes on other hosts would be trivial, since they would request the correct channel reference from the nameserver by a known name and automatically start requesting jobs.

References

[1] John Markus Bjørndalen, Brian Vinter, and Otto J. Anshus. PyCSP - Communicating Sequential Processes for Python. In Alistair A. McEwan, Wilson Ifill, and Peter H. Welch, editors, Communicating Process Architectures 2007, pages 229-248, July 2007.
[2] Brian Vinter, John Markus Bjørndalen, and Rune Møllegaard Friborg. PyCSP Revisited. In Communicating Process Architectures 2009. WoTUG, IOS Press.
[3] Python multiprocessing module. http://docs.python.org/library/multiprocessing.html.
[4] unladen-swallow distribution. http://code.google.com/p/unladen-swallow/.
[5] C. A. R. Hoare. Communicating sequential processes. Commun. ACM, 21(8):666-677, 1978.
[6] Peter H. Welch, Neil C.C. Brown, Jim Moores, Kevin Chalmers, and Bernhard Sputh. Integrating and Extending JCSP. In Steve Schneider, Alistair A. McEwan, Wilson Ifill, and Peter H. Welch, editors, Communicating Process Architectures 2007, volume 65 of Concurrent Systems Engineering, pages 349-370, Amsterdam, The Netherlands, July 2007. WoTUG, IOS Press.
[7] Peter H. Welch. Tuna: Multiway synchronisation outputs. http://www.cs.york.ac.uk/nature/tuna/outputs/mm-sync/, 2006.
[8] Alastair R. Allen, Oliver Faust, and Bernhard Sputh. Transfer Request Broker: Resolving Input-Output Choice. In Frederick R.M. Barnes, Jan F. Broenink, Alistair A. McEwan, Adam T. Sampson, G. S. Stiles, and Peter H. Welch, editors, Communicating Process Architectures 2008, September 2008.
[9] greenlet distribution. http://pypi.python.org/pypi/greenlet.
[10] Christian Tismer. Continuations and Stackless Python. Proceedings of the 8th International Python Conference, January 2000.
[11] Frederick R.M. Barnes. Blocking system calls in KRoC/Linux. In P.H. Welch and A.W.P. Bakkers, editors, Communicating Process Architectures 2000, volume 58 of Concurrent Systems Engineering Series, pages 155-178. Computing Laboratory, University of Kent, IOS Press, September 2000.
[12] Python pickle module. http://docs.python.org/library/pickle.html.
[13] occam-pi distribution. http://www.cs.kent.ac.uk/projects/ofa/kroc/.
[14] Neil C.C. Brown. C++CSP2: A many-to-many threading model for multicore architectures. Communicating Process Architectures 2007: WoTUG-30, page 23, Jan 2007.
[15] Carl G. Ritson, Adam T. Sampson, and Frederick R.M. Barnes. Multicore Scheduling for Lightweight Communicating Processes. In John Field and Vasco Thudichum Vasconcelos, editors, Coordination Models and Languages, 11th International Conference, COORDINATION 2009, Lisboa, Portugal, June 9-12, 2009. Proceedings, volume 5521 of Lecture Notes in Computer Science, pages 163-183. Springer, June 2009.
[16] Fibra distribution. http://code.google.com/p/fibra/.
[17] numpy distribution. http://numpy.scipy.org/.
[18] Donald E. Knuth. The Art of Computer Programming - Volume 2 - Seminumerical Algorithms. Addison-Wesley, third edition, 1998.
[19] Python ctypes module. http://docs.python.org/library/ctypes.html.
[20] PyCSP distribution. http://code.google.com/p/pycsp.
Communicating Process Architectures 2009 P.H. Welch et al. (Eds.) IOS Press, 2009 © 2009 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-60750-065-0-293
CSP as a Domain-Specific Language Embedded in Python and Jython

Sarah MOUNT¹, Mohammad HAMMOUDEH, Sam WILSON and Robert NEWMAN
School of Computing and I.T., University of Wolverhampton, U.K.

Abstract. Recently, much discussion has taken place within the Python programming community on how best to support concurrent programming. This paper describes a new Python library, python-csp, which implements synchronous, message-passing concurrency based on Hoare's Communicating Sequential Processes. Although other CSP libraries have been written for Python, python-csp has a number of novel features. The library is implemented both as an object hierarchy and as a domain-specific language, meaning that programmers can compose processes and guards using infix operators, similar to the original CSP syntax. The language design is intended to be idiomatic Python and is therefore quite different to other CSP libraries. python-csp targets the CPython interpreter and has variants which reify CSP processes as Python threads and operating system processes. An equivalent library targets the Jython interpreter, where CSP processes are reified as Java threads. jython-csp also has Java wrappers which allow the library to be used from pure Java programs. We describe these aspects of python-csp, together with performance benchmarks and a formal analysis of channel synchronisation and choice, using the model checker SPIN.

Keywords. CSP, domain-specific languages, dynamic languages, Python
Introduction

Python is a lexically scoped, dynamically typed language with object-oriented features, whose popularity is often said to be due to its ease of use. The rise of multi-core processor architectures and web applications has turned attention in the Python community to concurrency and distributed computing. Recent versions of Python have language-level or standard library support for coroutines², (system) threads³ and process⁴ management, the latter two largely in the style of the POSIX thread library. This proliferation of concurrency styles is somewhat in contrast to the "Zen" of Python [1], which states that "There should be one—and preferably only one—obvious way to do it."

One reason for adding coroutines (and therefore the ability to use "green" threads) and operating-system processes is the performance penalty of using Python threads, which is largely due to the presence of the global interpreter lock (GIL) in the C implementation of the Python interpreter.

¹ Corresponding Author: Sarah Mount, School of Computing and I.T., University of Wolverhampton, Wulfruna St., Wolverhampton, WV1 1SB, U.K. Tel.: +44 1902 321832; Fax: +44 1902 321478; E-mail: [email protected].
² Python PEP 342: Coroutines via Enhanced Generators. http://www.python.org/dev/peps/pep-0342/
³ http://docs.python.org/library/threading.html#module-threading
⁴ http://docs.python.org/library/multiprocessing.html#module-multiprocessing

The GIL is implemented as an operating system semaphore or condition variable which is acquired and released in the interpreter every time the running thread blocks
for I/O, allowing the operating system to schedule a different thread. A recent presentation by Dave Beazley⁵ contained benchmarks of the following CPU-bound task:

    def count(n):
        while n > 0:
            n -= 1
and found that a parallel execution of the task in threads performed 1.8 times slower than a sequential execution, and that performance improved if one (of two) CPU cores was disabled. These counter-intuitive results are often the basis for developers to call for the GIL to be removed. The Python FAQ⁶ summarises why the GIL is unlikely to be removed from the reference implementation of the interpreter, essentially because alternative implementations of thread scheduling have caused a performance penalty to single-threaded programs. The current solution, therefore, is to provide programmers with alternatives: either to write single-threaded code, perhaps using coroutines for cooperative multitasking, or to take advantage of multiple cores and use processes and IPC in favour of threads and shared state.

A second solution is to use another implementation of Python, apart from the CPython interpreter. Stackless Python [2] is an implementation which largely avoids the use of the C stack and has green threads (called "tasklets") as part of its standard library. Google's Unladen Swallow⁷ is still in the design phase, but aims to improve on the performance of CPython five-fold and intends to eliminate the GIL in its own implementation by 2010.

This paper describes another alternative: to augment Python with a higher-level abstraction for message-passing concurrency, python-csp, based on Hoare's Communicating Sequential Processes [3]. The semantics of CSP are relatively abstract compared with libraries such as pthreads, and so the underlying implementation of CSP "processes" as either system threads, processes or coroutines is hidden from the user. This means that the user can choose an implementation which is suitable for the interpreter in use or the context of the application they are developing (for example, processes where a multi-core architecture is expected to be used, threads where one is not).
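Beazley's observation is easy to reproduce. The following sketch (recast in modern Python, with an illustrative workload size rather than the one used in the talk) times the same count function run twice sequentially and then in two threads; on a CPython interpreter with a GIL the threaded run is typically no faster, and often slower:

```python
import threading
import time

def count(n):
    # CPU-bound busy loop; never releases the GIL voluntarily.
    while n > 0:
        n -= 1

N = 2_000_000  # illustrative size, much smaller than the original benchmark

start = time.perf_counter()
count(N)
count(N)
sequential = time.perf_counter() - start

start = time.perf_counter()
threads = [threading.Thread(target=count, args=(N,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded = time.perf_counter() - start

# No assertion about which run wins: the outcome depends on the interpreter.
print('sequential: %.3fs, threaded: %.3fs' % (sequential, threaded))
```

On a GIL-free interpreter (or when the workload releases the GIL) the threaded run can win, which is exactly the distinction the rest of this paper turns on.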
Also, CSP was designed specifically to help avoid well-known problems with models such as shared memory concurrency (such as deadlocks and race conditions) and to admit formal reasoning. Both properties assist the user by encouraging program correctness.

The authors have a particular interest in using Python to implement complex tasks requiring coordination between many hosts. The SenSor simulator and development tool [4,5] provided facilities for prototyping algorithms and applications for wireless sensor networks in pure Python, using shared-memory concurrency. The burden of managing explicit locking in an already complex environment made the implementation of SenSor difficult. A new version of the tool is currently under development and will be built on the python-csp and jython-csp libraries.

To deal with the different performance characteristics of threads and processes in the current implementations of Python, the python-csp library currently has a number of different implementations:

• the csp.cspprocess module, which contains an implementation of python-csp based on operating system processes, as managed by the multiprocessing library; and
• the csp.cspthread module, which contains an implementation of python-csp based on system threads.

There is also a version of the library called jython-csp that targets Jython, a version of Python which runs on the Java VM. jython-csp uses Java threads (which are also system threads), rather than Python threads. Jython allows the user to mix Java and Python code in programs (almost) freely. As such, jython-csp allows users to use any combination of Python and Java, including being able to write pure Java programs using wrappers for jython-csp types. In general, in this paper we will refer to python-csp; however, the syntax and semantics of the libraries can be assumed to be identical, unless stated otherwise.

⁵ http://blip.tv/file/2232410
⁶ http://www.python.org/doc/faq/library/
⁷ http://code.google.com/p/unladenswallow/

The remainder of this paper describes the design and implementation of python-csp and jython-csp. Section 1 gives an overview of the design philosophy of the library and its syntax and (informal) semantics. Section 2 describes a longer python-csp program and gives a discussion of the design patterns used in the implementation. Section 3 begins an evaluation of python-csp by describing benchmark results using the Commstime program and comparing our work with similar libraries. python-csp and jython-csp have been benchmarked against PyCSP [6], another realisation of CSP in Python, and JCSP [7], a Java library. Section 4 outlines ongoing work on model checking channel synchronisation and non-deterministic selection in the python-csp implementation. Section 5 concludes and describes future work.

1. python-csp and jython-csp: Syntax and Semantics

The design philosophy behind python-csp is to keep the syntax of the library as "Pythonic" and familiar to Python programmers as possible. In particular, two things distinguish this library from others such as JCSP [7] and PyCSP [6]. Where languages such as Java have strong typing and sophisticated control over encapsulation, Python has a dynamic type system, often using so-called "duck typing" (which means that an object is said to implement a particular type if it shares enough data and operations with the type to be used in the same context as the type).
Where an author of a Java library might expect users to rely on the compiler to warn of semantic errors in the type-checking phase, Python libraries tend to trust the user to manage their own encapsulation and use run-time type checking. Although Python is a dynamically typed language, the language is helpful in that few, if any, type coercions are implicit. For example, where in Java a programmer could concatenate a String and an int type when calling System.out.println, the equivalent expression in Python would raise an exception. In general, the Python type system is consistent, and this is largely because every Python type is reified as an object. Java differs from Python in this respect, as primitive Java types (byte, short, int, long, char, float, double, boolean) do not have fields and are not created on the heap. In Python, however, all values are (first-class) objects, including functions and classes. Importantly for the work described here, operators may be overloaded for new types as each Python operator has an equivalent method inherited from the base object, for example:

    >>> 1 + 2
    3
    >>> (1).__add__(2)
    3
    >>> [1] * 3
    [1, 1, 1]
    >>> ([1]).__mul__(3)
    [1, 1, 1]
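This operator protocol is what makes CSP-style infix sugar possible. As a toy illustration (hypothetical class, not python-csp's actual implementation), a process-like object can give `>` and `&` new meanings simply by defining the corresponding special methods:

```python
class Proc:
    """Toy stand-in for a process, showing how CSP-style infix sugar
    can be built on Python's operator special methods."""
    def __init__(self, name):
        self.name = name

    def __gt__(self, other):
        # P > Q : sequential composition
        return Proc('(%s ; %s)' % (self.name, other.name))

    def __and__(self, other):
        # P & Q : parallel composition
        # Note: & binds tighter than > in Python's precedence rules.
        return Proc('(%s || %s)' % (self.name, other.name))

seq = Proc('P') > Proc('Q')
par = Proc('P') & Proc('Q')
assert seq.name == '(P ; Q)'
assert par.name == '(P || Q)'
```

Here the operators merely build a composite name; in a real library they would construct and run composite processes.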
Lastly, Python comes with a number of features familiar to users of functional programming languages such as ML that are becoming common in modern, dynamically typed languages. These include generators, list comprehensions and higher-order functions. CSP [3] contains three fundamental concepts: processes, (synchronous) channel communication and non-deterministic choice. python-csp provides two ways in which the user may create and use these CSP object types: one method where the user explicitly creates
instances of types defined in the library and calls the methods of those types to make use of them; and another where users may use syntactic sugar implemented by overriding the Python built-in infix operators. Operator overloading has been designed to be as close to the original CSP syntax as possible and is as follows:

    Syntax    Meaning                                   CSP equivalent
    P > Q     Sequential composition of processes       P; Q
    P & Q     Parallel composition of processes         P || Q
    c1 | c2   Non-deterministic choice                  c1 [] c2
    n * A     Repetition                                n • A
    A * n     Repetition                                n • A
    Skip()    Skip guard, always ready to synchronise   Skip

where:

• n is an integer;
• P and Q are processes;
• A is a non-deterministic choice (or ALT); and
• c1 and c2 are channels.
The following sections describe each of the python-csp features in turn.

1.1. python-csp Processes

In python-csp a process can be created in two ways: either explicitly by creating an instance of the CSPProcess class or, more commonly, by using the @process decorator⁸. In either case, a callable object (usually a function) must be created that describes the run-time behaviour of the process. Listing 1 shows the two ways to create a new process, in this case one which opens a sensor connected to the USB bus of the host and continuously prints out a transducer reading every five minutes. Whichever method is used to create the process, P, a special keyword argument _process must be passed in with a default value. When the process P is started (by calling its start method), _process is dynamically bound to an object representing the underlying system thread or process which is the reification of the CSPProcess instance. This gives the programmer access to values such as the process identifier (PID) or thread name, which may be useful for logging and debugging purposes. When the start method of a CSPProcess object has returned, the underlying thread or process will have terminated. Once this has happened, accessing the methods or data of the corresponding _process variable will raise an exception.

⁸ A "decorator" in Python is a callable object "wrapped" around another callable. For example, the definition of a function fun decorated with the @mydec decorator will be replaced with fun = mydec(fun).

    # Using the CSPProcess class:
    def print_rh_reading():
        rhsensor = ToradexRH()  # Oak temp/humidity sensor
        rhsensor.open()
        while True:
            data = rhsensor.get_data()
            print 'Humidity %%g : Temp : %gC' % data[1:]
            dingo.platform.gumstix.sleep(60 * 5)  # 5 min

    P = CSPProcess(print_rh_reading, _process=None)
    P.start()

    # Using the @process decorator:
    @process
    def print_rh_reading(_process=None):
        rhsensor = ToradexRH()  # Oak temp/humidity sensor
        rhsensor.open()
        while True:
            data = rhsensor.get_data()
            print 'Humidity %%g : Temp : %gC' % data[1:]
            dingo.platform.gumstix.sleep(60 * 5)  # 5 min

    P = print_rh_reading()
    P.start()

Listing 1. Two ways to create a python-csp process.
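The decorator pattern of Listing 1 is straightforward to sketch in plain Python. The following is an assumed shape, not python-csp's actual implementation: calling a decorated function builds a process object rather than running the body, and start() binds _process to the underlying thread before running it (here start() also waits for termination, matching the behaviour described above):

```python
import threading

class CSPProcess:
    """Minimal sketch of a CSPProcess-like wrapper over a system thread."""
    def __init__(self, fn, *args, **kwargs):
        self._fn, self._args, self._kwargs = fn, args, kwargs
        self._thread = None

    def start(self):
        def run():
            # Bind _process to the reifying thread, as the text describes.
            self._kwargs['_process'] = threading.current_thread()
            self._fn(*self._args, **self._kwargs)
        self._thread = threading.Thread(target=run)
        self._thread.start()
        self._thread.join()  # start() returns once the process has terminated

def process(fn):
    """Decorator: calling the function now builds a process instead of
    executing the body."""
    def factory(*args, **kwargs):
        return CSPProcess(fn, *args, **kwargs)
    return factory

results = []

@process
def greeter(msg, _process=None):
    results.append((msg, _process is not None))

greeter('hello').start()
assert results == [('hello', True)]
```

A real implementation would defer the join (start() returning only after termination) to a Par/Seq combinator, but the factory-plus-reification structure is the same.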
1.2. python-csp Parallel and Sequential Execution

CSP processes can be composed either sequentially or in parallel. In sequential execution each process starts and terminates before the next in the sequence begins. In parallel execution all processes run "at once" and therefore the order of any output they effect cannot be guaranteed. Parallel and sequential execution can be implemented in python-csp either by instantiating Par and Seq objects or by using the overloaded & or > operators. In general, using the overloaded infix operators results in clear, simple code where there are a small number of processes. Listing 2 demonstrates sequential and parallel process execution in python-csp.

    >>> def P(n, _process=None):
    ...     print n
    ...
    >>> # In sequence, using syntactic sugar
    >>> P(1) > P(2)
    1
    2
    >>> # In sequence, using objects
    >>> Seq(P(1), P(2)).start()
    1
    2
    >>> # In parallel, using syntactic sugar
    >>> P(1) & P(2)
    2
    1
    >>> # In parallel, using objects
    >>> Par(P(1), P(2)).start()
    1
    2

Listing 2. Two ways to run processes sequentially and in parallel.
1.3. python-csp Channels

In CSP communication between processes is achieved via channels. A channel can be thought of as a pipe (similar to UNIX pipes) between processes. One process writes data down the channel and the other reads. Since channel communication in CSP is synchronous, the writing channel can be thought of as offering data which is only actually written to the channel when another process is ready to read it. This synchronisation is handled entirely by the language, meaning that details such as locking are invisible to the user. Listing 3 shows how channels can be used in python-csp.
In its csp.cspprocess implementation, python-csp uses an operating system pipe to transmit serialised data between processes. This has a resource constraint, as operating systems limit the number of file descriptors that each process may have open. This means that although python-csp programs can create any number of processes (until available memory is saturated), a limited number of channels can be created. In practice this is over 600 on an Ubuntu Linux PC.

To compensate for this, python-csp offers a second implementation of channels called FileChannel. A FileChannel object behaves in exactly the same way as any other channel, except that it uses files on disk to store data being transferred between processes. Each read or write operation on a FileChannel opens and closes the operating system file, meaning that the file is not open for the duration of the application running time. Programmers can use FileChannel objects directly, or, if a new Channel object cannot be instantiated, then Channel will instead return a FileChannel object.

A third class of channel is provided by python-csp, called a NetworkChannel. A NetworkChannel transfers data between processes via a socket listener which resides on each node in the network for the purpose of distributing channel data. By default, when the Python csp.cspprocess package is imported, a socket server is started on the host (if one is not already running).

    @process
    def send_rh_reading(cout, _process=None):
        rhsensor = ToradexRH()  # Oak temp/humidity sensor
        timer = TimerGuard()
        rhsensor.open()
        while True:
            data = rhsensor.get_data()
            cout.send('Humidity %%g : Temp : %gC' % data[1:])
            timer.sleep(60 * 5)  # 5 min
            timer.read()         # Synchronise with timer guard.

    @process
    def simple_sensor(_process=None):
        ch = Channel()
        Printer(ch) & send_rh_reading(ch)

Listing 3. Two processes communicating via a channel.
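The "OS pipe carrying serialised data" mechanism described above can be sketched in a few lines. This is an illustration only (the class name and framing are invented, and python-csp's real Channel additionally implements the synchronisation handshake): each message is pickled and sent over an os.pipe with a 4-byte length prefix so the reader knows how much to consume:

```python
import os
import pickle
import struct
import threading

class PipeChannel:
    """Sketch of a one-way channel over an OS pipe with pickled payloads."""
    def __init__(self):
        self._r, self._w = os.pipe()

    def write(self, obj):
        data = pickle.dumps(obj)
        # Length-prefix the pickle so the reader can frame messages.
        os.write(self._w, struct.pack('I', len(data)) + data)

    def read(self):
        size = struct.unpack('I', os.read(self._r, 4))[0]
        data = b''
        while len(data) < size:
            data += os.read(self._r, size - len(data))
        return pickle.loads(data)

ch = PipeChannel()
t = threading.Thread(target=ch.write, args=({'temp': 21.5},))
t.start()
msg = ch.read()
t.join()
assert msg == {'temp': 21.5}
```

Because each channel consumes two file descriptors, this sketch also makes the descriptor-limit constraint mentioned above concrete.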
1.4. Non-Deterministic Selection and Guards in python-csp

Non-deterministic selection (called "select" in JCSP and "ALT" or "ALTing" in occam) is an important feature of many process algebras. Select allows a process to choose between a number of waiting channel reads, timers or other "guards" which are ready to synchronise. In python-csp this is achieved via the select method of any guard type. To use select, an Alt object must be created and should be passed any number of Guard instances. See Listing 4 for an example. The select method can be called on the Alt object and will return the value returned from the selected guard. To implement a new guard type, users need to subclass the Guard class and provide the following methods:

• enable, which should attempt to synchronise;
• disable, which should roll back from an enable call to the previous state;
• is_selectable, which should return True if the guard is able to complete a synchronous transaction and False otherwise;
• select, which should complete the synchronous transaction and return the result (note this semantics is slightly different from that found in JCSP, as described by Welch [7], where an index to the selected guard in the guard array is returned by select);
• poison, which should be used to finalize and delete the current guard and gracefully terminate any processes connected with it [8,9].

Since Python has little support for encapsulation, the list of guards inside an Alt object is available to any code which has access to the Alt.

    @process
    def printAvailableReading(cins, _process=None):
        alt = Alt(cins)
        while True:
            print alt.select()

    @process
    def simpleSensorArray(_process=None):
        chans, procs = [], []
        for i in range(NUMSENSORS):
            chans.append(Channel())
            procs.append(sendRHReading(chans[-1]))
        procs.append(printAvailableReading(chans))
        Par(*procs).start()

Listing 4. Servicing the next available sensor reading with non-deterministic selection.
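The four-method guard protocol listed above can be exercised with a toy guard and a minimal selection loop. Everything here is illustrative (the classes are invented stand-ins, and real guards synchronise with channel ends rather than holding values directly):

```python
import random

class ValueGuard:
    """Toy guard following the enable/disable/is_selectable/select
    protocol: it is ready whenever it holds a value."""
    def __init__(self, value=None):
        self.value = value
        self.enabled = False

    def enable(self):
        self.enabled = True

    def disable(self):
        self.enabled = False

    def is_selectable(self):
        return self.enabled and self.value is not None

    def select(self):
        # Complete the "transaction" by consuming the value.
        v, self.value = self.value, None
        return v

def alt_select(guards):
    """Enable all guards, pick one ready guard at random, roll back."""
    for g in guards:
        g.enable()
    ready = [g for g in guards if g.is_selectable()]
    result = random.choice(ready).select()  # assumes at least one is ready
    for g in guards:
        g.disable()
    return result

assert alt_select([ValueGuard(None), ValueGuard(42)]) == 42
```

A real Alt would block until some guard becomes selectable instead of assuming one already is, but the enable/choose/disable shape is the same.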
Like many other implementations of CSP, python-csp implements a number of variants of non-deterministic selection:

• select enables all guards and either returns the result of calling select on the first available guard (if only one becomes available) or randomly chooses an available guard and returns the result of its select method. The choice is truly random and determined by a random number generator seeded by the urandom device of the host machine.
• priority_select enables all guards and, if only one guard becomes available, returns the result of its select method. If more than one guard becomes available, the first guard in the list passed to Alt is selected.
• fair_select enables all guards and, if only one guard becomes available, returns the result of its select method. Alt objects keep a reference to the guard which was selected on the previous invocation of any select method, if there has been such an invocation and the guard is still in the guards list. If fair_select is called and more than one guard becomes available, then fair_select gives lowest priority to the guard which was returned on the previous invocation of any of the select methods. This idiom is used to reduce the likelihood of starvation, as every guard is guaranteed that no other guard will be serviced twice before it is selected.

There are two forms of syntactic sugar that python-csp implements to assist in dealing with Alt objects: a choice operator and a repetition operator. Using the choice operator, users may write:

    result = guard_1 | guard_2 | ... | guard_n
which is equivalent to:

    alt = Alt(guard_1, guard_2, ..., guard_n)
    result = alt.select()

To repeatedly select from an Alt object n times, users may write:

    gen = n * Alt(guard_1, guard_2, ..., guard_n)
or:

    gen = Alt(guard_1, guard_2, ..., guard_n) * n

This construct returns a generator object which can be iterated over to obtain results from Alt.select method calls. Using generators in this way is idiomatic in Python and will be familiar to users. The following is a typical use case:

    gen = Alt(guard_1, guard_2, ..., guard_n) * n
    while True:
        ...
        gen.next()
        ...
Each time gen.next() is called within the loop, the select() method of the Alt object is called and its result returned. In addition to channel types, python-csp implements two commonly used guard types: Skip and TimerGuard. Skip is the guard which is always ready to synchronise. In python-csp its select method always returns None, which is the Python null value. TimerGuards are used either to suspend a running process (by calling their sleep method) or as part of a synchronisation where the guard will become selectable after a timer has expired:

    @process
    def alarm(cout, _process=None):
        alt = Alt(TimerGuard())
        t0 = alt.guard[0].read()       # Fetch current time
        alt.guard[0].set_alarm(5)      # Selectable 5 secs from now
        alt.select()
        duration = alt.guard[0].read() - t0  # In seconds
        cout.write(duration)
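The repetition sugar described above (multiplying an Alt by n to obtain a generator of select results) is easy to sketch with the same operator-overloading machinery. The class below is a toy with an invented deterministic select, not python-csp's Alt; the point is only the __mul__/__rmul__ shape:

```python
class RepeatableAlt:
    """Toy Alt-like object: multiplying by n yields a generator that
    performs n select() calls lazily."""
    def __init__(self, *values):
        self._values = list(values)
        self._i = 0

    def select(self):
        # Deterministic stand-in for guard selection: cycle the values.
        v = self._values[self._i % len(self._values)]
        self._i += 1
        return v

    def __mul__(self, n):
        # Each iteration of the generator triggers one select() call.
        return (self.select() for _ in range(n))

    __rmul__ = __mul__  # supports both `alt * n` and `n * alt`

gen = 3 * RepeatableAlt('x', 'y')
assert list(gen) == ['x', 'y', 'x']
```

Because the body is a generator expression, no selection happens until the caller iterates, which is what makes the `while True: gen.next()` idiom above work.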
1.5. Graceful Process Termination in python-csp

Terminating a parallel program without leaving processes running in deadlock is a difficult problem. The most widely implemented solution to this problem was invented by Peter Welch [10] and is known as "channel poisoning". The basic idea is to send a special value down a channel which, when read by a process, is then propagated down any other channels known to that process before it terminates. In python-csp this can be effected by calling the poison method on any guard. A common idiom in python-csp, especially where producer-consumer patterns are implemented, is this:

    alt = Alt(*channels)
    for i in xrange(len(channels)):
        alt.select()
Here, it is intended that each guard in channels be selected exactly once. Once a guard has been selected, its associated writer process(es) will have finished its computation and terminate. In order to support this idiom efficiently, python-csp implements a method called poison on Alt objects which serves to poison the writer process(es) attached to the last selected guard and remove that guard from the list, used as follows:

    a = Alt(*channels)
    for i in xrange(len(channels)):
        a.select()
        a.poison()
By shortening the list of guards less synchronisation is required on each iteration of the for loop, reducing the computational effort required by the select method.
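The poisoning scheme itself (a special value propagated by each process before it terminates) can be illustrated with plain Python threads and queues standing in for python-csp processes and channels; this is an analogue for illustration, not the library itself:

```python
import queue
import threading

POISON = object()  # sentinel standing in for python-csp's poison value

def relay(cin, cout):
    # A process that copies values from cin to cout; when it reads the
    # poison value, it propagates the poison downstream and terminates.
    while True:
        value = cin.get()
        cout.put(value)
        if value is POISON:
            return

a, b = queue.Queue(), queue.Queue()
worker = threading.Thread(target=relay, args=(a, b))
worker.start()
a.put(42)
a.put(POISON)   # analogous to poisoning the channel
worker.join()   # the relay has terminated gracefully
first, second = b.get(), b.get()
```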
S. Mount et al. / CSP as a Domain-Specific Language Embedded in Python and Jython
301
1.6. Built-In Processes and Channels

python-csp comes with a number of built-in processes and channels, aimed at speeding development. These include all of the names defined in the JCSP "plugnplay" library. In addition to these and other built-in processes and guards, python-csp comes with analogues of every unary and binary operator in Python. For example, the Plus process reads two values from channels and then writes the sum of those values to an output channel. An example implementation9 of this might be:

@process
def Plus(cin1, cin2, cout, _process=None):
    while True:
        in1 = cin1.read()
        in2 = cin2.read()
        cout.write(in1 + in2)

Listing 5. A simple definition of the built-in Plus process.
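As footnote 9 explains, the production library builds such processes from generic factory functions. A minimal hypothetical sketch of that idea, with plain deques standing in for channels and a bounded loop in place of a real process's infinite one:

```python
import operator
from collections import deque

def _applybinop(binop, doc=''):
    # Build a Plus-style process body for an arbitrary binary operator.
    # Hypothetical sketch: deque popleft/append stand in for channel
    # read/write, and the loop stops when input is exhausted.
    def proc(cin1, cin2, cout):
        while cin1 and cin2:          # a real process would loop forever
            cout.append(binop(cin1.popleft(), cin2.popleft()))
    proc.__doc__ = doc                # attach documentation, as the footnote describes
    return proc

Plus = _applybinop(operator.add, 'Reads two values, writes their sum.')

a, b, out = deque([1, 2]), deque([10, 20]), deque()
Plus(a, b, out)
```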
1.7. jython-csp Implementation and Integration with Pure Java

jython-csp is a development of python-csp for integration with Jython. Jython has similar semantics to Python but runs on the Java runtime environment (JRE), which allows access to the large number of Java libraries, such as Swing, which are useful, platform-independent and well optimised. jython-csp works in much the same way as the original python-csp, with the ability to use any class from the standard Java and Python class libraries. jython-csp uses Java threads and would be expected to perform similarly to other CSP implementations based on Java threads, such as JCSP (e.g. [7]). A comparison between the jython-csp and python-csp implementations of the Monte Carlo approximation to π is shown in Table 1.

Table 1. Results of testing Java threads against Python threads and OS processes.

Thread Library                 Running time of π approximation (seconds)
jython-csp (Java Threads)      26.49
python-csp (Python Threads)    12.08
python-csp (OS Processes)       9.59
As we shall see in Section 3, the JCSP library performs channel communication very efficiently, so one might expect that jython-csp would also execute quickly. One speculation as to why jython-csp (using Java threads) performs poorly compared to the other CSP implementations is slow Java method dispatch within Jython. In addition to the change of threading library, jython-csp also takes advantage of Java locks and semaphores from the java.util.concurrent package. jython-csp has no dependency on non-standard packages; the library will work with any JRE which is compatible with Jython 2.5.0final.

9 In fact, these processes are implemented slightly differently, taking advantage of Python's support for reflection. Generic functions called _applybinop and _applyunaryop are implemented, then Plus may be defined simply as Plus = _applybinop(operator.__add__). The production version of this code is slightly more complex, as it allows documentation for each process to be provided whenever _applybinop and _applyunaryop are called.
1.8. Java-csp

Java-csp is the integration of jython-csp with Java, allowing Java applications to use the flexibility of jython-csp. The Java CSP implementation attempts to emulate the built-in Java thread library (java.lang.Thread) with a familiar API. As with Java threads, there are two ways to use threads in an application:
• by extending the Thread class and overriding the run method; or
• by implementing the Runnable interface.
The java-csp implementation has a similar usage:
• extending JavaCspProcess and overriding the target method; or
• implementing the JavaCspProcessInterface interface.
jython-csp and python-csp use the pickle library as a means of serialising data down a channel. Pickle takes a Python/Jython object and returns a sequence of bytes; this approach only works on Python/Jython objects and is unsuitable for native Java objects. As a solution, java-csp implements a wrapped version of Java object serialization, which allows Jython to write pure Java objects that implement the Serializable interface to a channel. In addition, Python/Jython objects can be written down a channel if they extend the PyObject class.

2. Mandelbrot Generator: an Example python-csp Program

2.1. Mandelbrot Generator

Listing 6 shows a longer example of a python-csp program as an illustration of typical coding style. The program generates an image of a Mandelbrot set which is displayed on screen using the PyGame library10 (typically used for implementing simple, 2D arcade games). The Mandelbrot set is an interesting example, since it is embarrassingly parallel – i.e. each pixel of the set can be calculated independently of the others. However, calculating each pixel of a large image in a separate process may be a false economy: since channel communication is likely to be expensive relative to the computation of a single pixel, the resulting code is likely to be I/O bound.
The code in Listing 6 is structured in such a way that each column in the image is generated by a separate "producer" process. Columns are stored in memory as a list of RGB tuples, which are then written to a channel shared by the individual producer process and a single "consumer" process. The main work of the consumer is to read each image column via an Alt object and write it to the display surface. This producer-consumer architecture is common to many parallel and distributed programs. It is not, however, necessarily the most efficient structure for this program. Since some areas of the set are essentially flat and so simple to generate, many producer processes are likely to finish their computations early and spend much of their time waiting to be selected by the consumer. If this is the case, it may be better to use a "farmer" process to task a smaller number of "worker" processes with generating a small portion of the image, and then re-task each worker after it has communicated its results to the farmer. Practically, if efficiency is an important concern, these options need to be prototyped and carefully profiled in order to determine the most appropriate solution to a given problem.

from csp.cspprocess import *

@process
def mandelbrot(xcoord, (width, height), cout,
               acorn=-2.0, bcorn=-1.250, _process=None):
    # Generate image data for column xcoord ...
    cout.write((xcoord, imgcolumn))
    _process._terminate()

@process
def consume(IMSIZE, filename, cins, _process=None):
    # Create initial pixel data.
    pixmap = Numeric.zeros((IMSIZE[0], IMSIZE[1], 3))
    pygame.init()
    screen = pygame.display.set_mode((IMSIZE[0], IMSIZE[1]), 0)
    # Wait on channel data.
    alt = ALT(*cins)
    for i in range(len(cins)):
        xcoord, column = alt.select()
        alt.poison()  # Remove last selected guard and producer.
        # Update column of blit buffer.
        pixmap[xcoord] = column
        # Update image on screen.
        pygame.surfarray.blit_array(screen, pixmap)

def main(IMSIZE, filename):
    channels, processes = [], []
    for x in range(IMSIZE[0]):  # Producer + channel per column.
        channels.append(Channel())
        processes.append(mandelbrot(x, IMSIZE, channels[x]))
    processes.insert(0, consume(IMSIZE, filename, channels))
    mandel = PAR(*processes)
    mandel.start()

Listing 6. Mandelbrot set in python-csp (abridged).

10 http://www.pygame.org
The worker-farmer architecture shows an improvement in performance over the producer-consumer architecture. The code in Listing 7 is structured in such a way that each column in the image is computed by a single process. Initially, a set of workers is created and seeded with the value of a column number; the pixel data for that column is generated and then written to a channel. When the data has been read, the "farmer" assigns a new column for the worker to compute. If there are no remaining columns to be generated, the farmer writes a terminating value and the worker terminates. This is required to instruct the "worker" that there are no more tasks to be performed. The consumer has the same function as before, although with a smaller set of channels to choose from. Workers which have completed their assigned columns have a shorter time to wait until the Alt object selects them.

@process
def mandelbrot(xcoord, (width, height), cout,
               acorn=-2.0, bcorn=-1.250, _process=None):
    # Generate image data for column xcoord ...
    cout.write((xcoord, imgcolumn))
    xcoord = cout.read()
    if xcoord == -1:
        _process._terminate()

@process
def consume(IMSIZE, filename, cins, _process=None):
    global SOFAR
    # Create initial pixel data.
    pixmap = Numeric.zeros((IMSIZE[0], IMSIZE[1], 3))
    pygame.init()
    screen = pygame.display.set_mode((IMSIZE[0], IMSIZE[1]), 0)
    # Wait on channel data.
    alt = Alt(*cins)
    for i in range(IMSIZE[0]):
        xcoord, column = alt.pri_select()
        # Update column of blit buffer.
        pixmap[xcoord] = column
        # Update image on screen.
        pygame.surfarray.blit_array(screen, pixmap)
        if SOFAR < IMSIZE[0]:
            alt.last_selected.write(SOFAR)
            SOFAR = SOFAR + 1
        else:
            alt.last_selected.write(-1)

Listing 7. Mandelbrot set in python-csp (abridged) using the "farmer"/"worker" architecture.
The improvement in performance can be seen in Figure 1: using a smaller number of processes reduces the run time of the program.
Figure 1. Run times of “farmer” / “worker” Mandelbrot program with different numbers of CSP processes.
The graph shows a linear characteristic, which would be expected as the select method in Alt is O(n).

3. Performance Evaluation and Comparison to Related Work

The Commstime benchmark was originally implemented in occam by Peter Welch at the University of Kent at Canterbury and has since become the de facto benchmark for CSP implementations such as occam-π [11], JCSP [7] and PyCSP [6]. Table 2 shows the results of running the Commstime benchmark on JCSP version 1.1rc4, PyCSP version 0.6.0 and python-csp. To obtain fair results, the implementation of Commstime used in this study was taken directly from the PyCSP distribution, with only syntactic changes made to ensure that the tests run correctly and are fairly comparable. In each case, the type of "process" used has been varied and the default channel implementation has been used. In the case of python-csp, channels are reified as UNIX pipes. The JCSP implementation uses the One2OneChannelInt class. The PyCSP version uses the default PyCSP Channel class for each process type. All tests were run on an Intel Pentium dual-core 1.73 GHz CPU with 1 GB RAM, running Ubuntu 9.04 (Jaunty Jackalope) with Linux kernel 2.6.28-11-generic. Version 2.6.2 of the CPython interpreter was used, along with the Sun Java(TM) SE Runtime Environment (build 1.6.0_13-b03) and Jython 2.5.0.

Table 2. Results of testing various CSP libraries against the Commstime benchmark. In each case the default Channel class is used.

CSP implementation   Process reification   min (μs)   max (μs)   mean (μs)   standard deviation
JCSP                 JVM thread             15         29         23.8        4.29
PyCSP                OS process            195.25     394.97     330.34      75.82
PyCSP                Python thread         158.46     311.2      292.2       47.21
PyCSP                Greenlet               24.14      25.37      24.41       0.36
python-csp           OS process             67.6      155.97     116.75      35.53
python-csp           Python thread         203.05     253.56     225.77      17.51
jython-csp           JVM thread            135.05     233        157.8       30.78
The results show that channel operations in jython-csp are faster than channel operations between python-csp processes reified as Python threads, but slower than the OS-process version of python-csp. This is a surprising result, as Java threads are better optimised than Python threads (because of the way the Python GIL is implemented) and, as the results for JCSP show, it is possible to implement CSP channels very efficiently in pure Java. The loss of performance is likely to be due to the way in which methods are invoked in Jython. Rather than compiling all Jython code directly to Java bytecode (as was possible when Jython supported the jythonc tool), Jython wraps Python objects in Java at compile time and executes pure Python code in an instance of a Python interpreter. Mixing Python and Java code, as the jython-csp library does, can therefore result in poor performance. It may be possible to ameliorate these problems by implementing more of the jython-csp library in pure Java code. It is also possible that future versions of Jython will improve the performance of method invocation and/or provide a means of compiling Python code directly to Java bytecode. The difference between the python-csp and PyCSP libraries is also surprising. python-csp implements channel synchronisation in a simple manner, with two semaphores protecting each channel and two reentrant locks to guard against conflicts between multiple readers and/or writers. PyCSP has a very different architecture and synchronisation strategy, which may account for the difference in performance. More detailed profiling of the two libraries, together with the completion of the model checking work described in Section 4 (to ensure that python-csp is not simply faster because it is somehow incorrect), will form part of our future work.

3.1. Related Work

There are some differences between the implementation of python-csp and other realisations of CSP, such as occam-π, JCSP [7] and PyCSP [6].
In particular, any channel object may have multiple readers and writers. There are no separate channel types such as JCSP's One2OneChannel. This reflects the simplicity that Python programmers are used to and the PEP 20 [1] maxim that "There should be one—and preferably only one—obvious way to do it." Also, when a Channel object (or variant of such) is instantiated, the instance itself is returned to the caller. In contrast, other systems return a "reader" and a "writer" object, often implemented as the read and write methods of the underlying channel. This is similar to the implementation of operating system pipes in many libraries, where a reader and a writer for the pipe are returned by a library call, rather than an abstraction of the pipe. The authors of these other CSP realisations would argue that their design is less error prone and that
they are protecting the error-prone programmer from inadvertently reading from a channel that is intended to be an "output" channel of the given process, or writing to a channel that is intended to be an "input" to the process. However, python-csp comes with strong tool support which may ameliorate some of these potential errors, and the profusion of channel types in some systems may confuse rather than simplify. Similarly, Alt.select methods return the value read by a guard rather than the index of the selected guard (as in JCSP) or a reference to the selected guard (as in PyCSP). The last guard selected is stored in the field Alt.last_selected and so is available to users. Some pragmatic concessions to the purity of python-csp have been made. In particular, the three default channel types (Channel, FileChannel and NetworkChannel) have very different performance and failure characteristics, and so are implemented separately and conspicuously to the user. The chances of a network socket failing, and the potential causes of such a failure, differ greatly from those of an operating system pipe. Equally, a process which times out waiting for a channel read will need to wait considerably longer for a read on a FileChannel than on a Channel. These are non-trivial differences in semantics, and it seems beneficial to make them explicit.

4. Correctness of Synchronisation in python-csp

To verify the correctness of synchronisation in python-csp, a formal model was built in PROMELA (a PROcess MEta LAnguage), a high-level language for specifying system descriptions. This choice is motivated by convenience, since a large number of PROMELA models are available in the public domain, and some features of the SPIN (Simple PROMELA INterpreter) tool environment [12], which interprets PROMELA, greatly facilitate our static analysis.
PROMELA is a non-deterministic language, loosely based on Dijkstra's guarded command language notation and borrowing the notation for I/O operations from Hoare's CSP. The model was divided into two primary processes: the reader and the writer. Synchronisation between several readers and several writers was modelled by semaphores in PROMELA. Process type declarations consist of the keyword proctype, followed by the name of the process type and an argument list. For example, n instances of the process type read are defined as follows:

active [n] proctype read() {
  do
  :: (rlock) -> rlock = false;   /* obtain rlock */
     atomic {                    /* wait(sem)... acquire available */
       available > 0 -> available--
     }
     c_r ? msg;                  /* get data from pipe ... critical section */
     taken++;                    /* release taken */
     rlock = true;               /* release rlock */
  od;
}
The do-loop (terminated by od) is an infinite loop. The body of the loop obtains the rlock flag to protect from races between multiple readers, then blocks until an item becomes available in the pipe, then gets the item from the pipe (in the critical section), then announces the item has been read, then releases the rlock flag. The :: symbol indicates the start of a command sequence block. In a do-loop, a non-deterministic choice will be made among the command sequence blocks. In this case, there is only one to choose from. The write process is declared in a similar style to the read process. The body of the do-loop obtains the wlock flag
S. Mount et al. / CSP as a Domain-Specific Language Embedded in Python and Jython
307
to protect from races between multiple writers, then places an item in the pipe, then makes the item available for the reader, then blocks until the item has been read, and finally releases the wlock.

active [n] proctype write() {
  do
  :: (wlock) -> wlock = false;   /* obtain wlock */
     s_c ! msg;                  /* place item in pipe ... critical section */
     available++;                /* release available */
     atomic {                    /* acquire taken */
       taken > 0; taken--
     }
     wlock = true;               /* release wlock */
  od;
}
The channel algorithm used in this model is defined as:

chan s_c = [0] of { mtype };  /* rendezvous channel */
chan c_r = [0] of { mtype };

active proctype data() {
  mtype m;
  do
  :: s_c ? m -> c_r ! m
  od
}
The first two statements declare s_c and c_r to be channels. These provide communication to write into or read from the pipe, with no buffering capacity (i.e., communication is via rendezvous), carrying messages consisting of a byte (mtype). The body of the do-loop retrieves a received message and stores it in the local variable m on the receiver side. Data is always stored in empty channels and is always retrieved from full channels. Firstly, the model was run in SPIN's random simulation mode. The SPIN simulator enables users to gain early feedback on their system models, which helps develop the designer's understanding of the design space before advancing to any formal analysis. However, SPIN provides a limited form of support for verification in terms of assertion checking, i.e. the checking of local and global system properties during simulation runs. For example, a process called monitor was devoted to asserting that a read process will not be executed if the buffer is empty, and that the buffer cannot be overwritten by a write process before the read process is executed.

proctype monitor() {
  do
  :: assert(taken < 2);
  :: assert(available < 2);
  od
}
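The python-csp synchronisation that this model abstracts (two semaphores per channel plus read and write locks, as described in Section 3) can be sketched with standard Python threading primitives; this is an illustrative analogue, not the python-csp source:

```python
import threading

class Channel:
    def __init__(self):
        self._item = None
        self._available = threading.Semaphore(0)  # an item is ready to read
        self._taken = threading.Semaphore(0)      # the item has been read
        self._rlock = threading.RLock()           # serialise competing readers
        self._wlock = threading.RLock()           # serialise competing writers

    def write(self, obj):
        with self._wlock:
            self._item = obj
            self._available.release()   # signal the reader
            self._taken.acquire()       # block until the item is consumed

    def read(self):
        with self._rlock:
            self._available.acquire()   # block until a writer offers an item
            obj = self._item
            self._taken.release()       # let the writer proceed
            return obj

chan = Channel()
results = []
reader = threading.Thread(target=lambda: results.append(chan.read()))
reader.start()
chan.write('hello')   # rendezvous: returns only after the item has been read
reader.join()
```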
Figure 2 shows the Message Sequence Chart (MSC) panel. Each process is associated with a vertical line, where the "start of time" corresponds to the top of the MSC, with vertical distance representing the relative time between different temporal events. Message passing is represented by the relative ordering of arrows between process execution lines.
Figure 2. Message sequence chart of SPIN model checker showing two readers and two writers.
The SPIN verification mode is used to verify liveness and safety properties such as deadlock detection, invariants and code reachability. Verification parameters were set to enable the check for "invalid end states" in the model. The verification output shows nothing referring to invalid end states, which means that the verification has passed.

5. Conclusions and Future Work

python-csp provides a "Pythonic" library for the structuring of concurrent programs in a CSP style. This provides an alternative to the event-driven style which has become prevalent with the increasing popularity of object oriented methods. python-csp realises the three fundamental concepts of CSP (processes, synchronous channel communication and non-deterministic choice), for use both explicitly and with appropriate syntactic sugar, to give program texts much more of the "look and feel" of CSP. python-csp has a number of distinct realisations. One notable implementation is jython-csp which, as a result of Jython's reliance on the Java Virtual Machine, yields a platform-independent implementation. As an example of a real program parallelised using python-csp, a Mandelbrot generator has been presented. Both a producer-consumer and a worker-farmer implementation have been described, and the worker-farmer version shows a linear performance relationship with the number of processes used (running on a dual-core computer). The correctness of channel synchronisation in python-csp has been demonstrated using a model checker. Future work will include a study of the correctness of non-deterministic
selection in python-csp. Evaluation of the performance of python-csp shows that it performs slightly faster than equivalent implementations in PyCSP (significantly faster for the OS process version). The motivation for the construction of python-csp was to provide a syntactically natural and semantically robust framework for the design and implementation of large scale, distributed, parallel systems, in particular wireless sensor networks. It is hoped that such systems will one day be grounded in a secure theory of communication and concurrency. python-csp has provided such a framework, but is so far limited to a shared-memory, single-machine implementation. The next stages in this work are to extend the synchronised communications to operate over inter-machine communications links. In some ways, CSP communications, being already modelled on a "channel", are ideal for such a purpose. On the other hand, real communications channels, particularly wireless ones, have quite different characteristics from the instantaneous and reliable CSP channel. Finding efficient means of duplicating the semantics of the CSP channel using real communications remains a challenge for the authors.

Acknowledgements

The authors wish to acknowledge the Nuffield Foundation's support for this research through an Undergraduate Research Bursary (URB/37018), which supported the work of the third author.

References

[1] Tim Peters. PEP 20: the Zen of Python. http://www.python.org/dev/peps/pep-0020/, August 2004.
[2] Christian Tismer. Continuations and Stackless Python or "how to change a paradigm of an existing program". In Proceedings of the 8th International Python Conference, January 2000.
[3] C.A.R. Hoare. Communicating Sequential Processes. Prentice-Hall, London, 1985. ISBN: 0-13-153271-5.
[4] S.N.I. Mount, R.M. Newman, and E.I. Gaura. A simulation tool for system services in ad-hoc wireless sensor networks.
In Proceedings of NSTI Nanotechnology Conference and Trade Show (Nanotech'05), volume 3, pages 423–426, Anaheim, California, USA, May 2005.
[5] S. Mount, R.M. Newman, E. Gaura, and J. Kemp. Sensor: an algorithmic simulator for wireless sensor networks. In Proceedings of Eurosensors 20, volume II, pages 400–411, Gothenburg, Sweden, 2006.
[6] John Markus Bjørndalen, Brian Vinter, and Otto J. Anshus. PyCSP - Communicating Sequential Processes for Python. In Alistair A. McEwan, Wilson Ifill, and Peter H. Welch, editors, Communicating Process Architectures 2007, pages 229–248, July 2007.
[7] Peter H. Welch. Process Oriented Design for Java: Concurrency for All. In H.R. Arabnia, editor, Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA'2000), volume 1, pages 51–57. CSREA Press, June 2000.
[8] Neil C.C. Brown and Peter H. Welch. An introduction to the Kent C++CSP library. In J.F. Broenink and G.H. Hilderink, editors, Communicating Process Architectures 2003, volume 61 of Concurrent Systems Engineering Series, pages 139–156, Amsterdam, The Netherlands, September 2003. IOS Press.
[9] Bernhard H.C. Sputh and Alastair R. Allen. JCSP-Poison: Safe termination of CSP process networks. In Jan F. Broenink, Herman W. Roebbers, Johan P.E. Sunter, Peter H. Welch, and David C. Wood, editors, CPA, volume 63 of Concurrent Systems Engineering Series, pages 71–107. IOS Press, 2005.
[10] Peter H. Welch. Graceful Termination – Graceful Resetting. In Applying Transputer-Based Parallel Machines, Proceedings of OUG 10, pages 310–317, Enschede, Netherlands, April 1989. Occam User Group, IOS Press, Netherlands. ISBN 90 5199 007 3.
[11] Peter H. Welch and Fred R.M. Barnes. Communicating mobile processes: introducing occam-pi. In A.E. Abdallah, C.B. Jones, and J.W. Sanders, editors, 25 Years of CSP, volume 3525 of Lecture Notes in Computer Science, pages 175–210. Springer Verlag, April 2005.
[12] Gerard J. Holzmann.
The model checker SPIN. IEEE Transactions on Software Engineering, 23:279–295, 1997.
Communicating Process Architectures 2009 P.H. Welch et al. (Eds.) IOS Press, 2009 © 2009 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-60750-065-0-311
311
Hydra: A Python Framework for Parallel Computing Waide B. TRISTRAM, Karen L. BRADSHAW Department of Computer Science, Rhodes University, Grahamstown, South Africa [email protected], [email protected] Abstract. This paper investigates the feasibility of developing a CSP to Python translator using a concurrent framework for Python. The objective of this translation framework, developed under the name of Hydra, is to produce a tool that helps programmers implement concurrent software easily using CSP algorithms. This objective was achieved using the ANTLR compiler generator tool, Python Remote Objects and PyCSP. The resulting Hydra prototype takes an algorithm defined in CSP, parses and converts it to Python and then executes the program using multiple instances of the Python interpreter. Testing has revealed that the Hydra prototype appears to function correctly, allowing simultaneous process execution. Therefore, it can be concluded that converting CSP to Python using a concurrent framework such as Hydra is both possible and adds flexibility to CSP with embedded Python statements. Keywords. concurrency, CSP, language translation, Python
Introduction Parallel architectures started making an appearance from as early as the mid-1960s and continue to be the primary design for high performance computing systems. This is particularly evident in modern supercomputers, such as IBM’s Roadrunner and Blue Gene/L, which make use of thousands of processors to achieve their astonishing computational power. However, these systems are only available to a select few scientists and researchers and it was not until recently that multi-processor computers started becoming readily available to consumers. The availability of dual and quad core CPUs targeted at the consumer, workstation and server markets has created a problem in the field of software development. Multi-core computers have the power and potential to greatly outperform their single-core counterparts, but this potential can only be realised if the software is able to make use of multiple processors [1]. Both consumers and researchers stand to gain from the performance increases afforded by multi-core CPUs and parallel software. Researchers are now able to construct small high performance computing systems for their data processing needs by combining a number of relatively cheap multi-core CPU systems. The Hydra project aims to provide a Python framework for parallel execution based on Communicating Sequential Processes (CSP). The focus of this paper is our investigation into the feasibility of translating CSP to Python code for the Hydra project. 1. Background and Related Work This paper investigates the feasibility of a concurrency framework for Python based around a CSP to Python translator. There is a significant amount of research in the related fields of language translation, concurrency and parallel computing, and Communicating Sequential Processes. However, a detailed review of the work is beyond the scope of this paper.
312
W.B. Tristram and K.L. Bradshaw / Hydra: A Python Framework for Parallel Computing
Therefore, only the relevant aspects of the related fields are briefly presented and discussed in this section. These discussions focus primarily on work that has facilitated the development of our concurrent framework for Python based on CSP. 1.1. Language Parsing and Translation The typical translation process involves a number of stages, ranging from identifying the syntactic constructs to constraint analysis and finally, producing the output code [2,3]. While constructing parsers can be done by hand, a tool known as a compiler generator is typically used to produce the translator. Compiler generators accept the target grammar and generate the various components of the compiler [3]. 1.2. Parallel Computing and CSP Communicating Sequential Processes was first introduced in 1978 by Hoare. A number of operations and constructs were identified as the primary methods for structuring computer programs [4]. Hoare identified input and output operations as being important but noted that these were not well understood. He also noted that the repetitive, alternative and sequential constructs were well understood, whereas there was less agreement on other constructs such as subroutines, monitors, procedures, processes and classes [4]. Processor development at the time was such that multiprocessor systems and increased parallelism were required to improve computation speed. However, Hoare noted that this parallelism was being hidden from the programmer as a deterministic, sequential machine. He saw that a more effective approach would be to introduce this parallelism at the programming level by defining communication and synchronization methods [4]. It is this approach that we are attempting to incorporate into Python using Hydra. 1.2.1. The CSP Programming Notation The programming language or notation specified by Hoare is based on a number of fundamental proposals.
The first of these is the use of the alternative command in conjunction with guarded commands, and the related guards, as a sequential control structure and a means to control non-determinism. Associated with the guarded and alternative commands is the repetitive command, which loops until all its guards terminate. Secondly, the parallel command specifies a means to start parallel execution of a number of processes or commands by starting them simultaneously, and synchronizing on termination of each of the parallel processes. Parallel processes may not communicate directly, except through the use of message passing [4]. To support the message passing concept, input and output commands are specified. These commands enable communication between processes. Essentially, a channel is created and used for synchronous communication when a source process names a destination process for output and the destination process names the source process for input. This effectively introduces the rendezvous as the primary method of synchronization [4]. 1.2.2. The CSP Meta-Language Hoare continued to refine CSP and it evolved substantially compared to the notation described in his earlier paper. CSP had become a process algebra that allows for the formal description and verification of interactions in a concurrent system [4]. The new notation consists of two primitives, namely the process and the event, and a number of algebraic operators. Concurrent and sequential systems can then be defined through a combination of these operators and primitives. An important addition to CSP is the introduction of traces, which allow for the description of each possible behaviour in a system as a sequence of actions [4].
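The parallel command and the rendezvous described above map naturally onto Python threads. The sketch below uses a capacity-one queue to approximate a synchronous channel (illustrative only, using the standard library rather than Hydra's generated code; a true CSP rendezvous would also require the writer to block until the read has completed):

```python
import queue
import threading

# A queue of capacity one approximates a synchronous channel:
# at most one message can be outstanding between the processes.
channel = queue.Queue(maxsize=1)
received = []

def source():
    channel.put('ping')              # output command: send to the sink

def sink():
    received.append(channel.get())   # input command: receive from the source

# The parallel command: start both processes simultaneously and
# synchronise on the termination of each.
procs = [threading.Thread(target=source), threading.Thread(target=sink)]
for p in procs:
    p.start()
for p in procs:
    p.join()
```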
W.B. Tristram and K.L. Bradshaw / Hydra: A Python Framework for Parallel Computing
1.3. The Python Programming Language

Python is a powerful, very high level programming language, supporting multiple programming paradigms [5,6]. Python has a strong, dynamic typing system and robust automatic memory management. It is very well suited for use both as a scripting language, much like Perl, and as a general purpose programming language. Python places a great deal of emphasis on programmer productivity and supports this via its expansive standard library and support for third-party extensions.

1.3.1. Features and Benefits

There are numerous benefits and features that make Python a very attractive language both to beginner programmers and for advanced scientific programming [6,7]. It has high-level built-in data types, such as the dictionary, list and tuple. Python has strong introspection capabilities and provides easy-to-use object orientation features. There is also plenty of support and readily available documentation. Python supports full modularity and hierarchical packages for extending functionality, and can also be embedded within applications as a scripting interface. This makes it very useful for linking together previously unrelated modules [7]. Python can therefore be used for the rapid prototyping of algorithms, with any performance-critical modules being rewritten in C and added as extensions. All of the above factors, along with the availability of science-oriented packages such as SciPy and NumPy, have aided in the acceptance of Python in the computational science community.

1.3.2. Limitations

As an interpreted language, Python's performance is not as good as that of compiled languages such as C++, but the performance is sufficient for most applications. Python's greatest limitation is its global interpreter lock. The Python VM makes use of a global interpreter lock (GIL) to ensure that only one thread runs in the VM at any time [8].
So, while Python supports multi-threading, these threads are time-sliced instead of executing in a truly parallel fashion. Attempts to remove the GIL, such as Greg Stein's "free threading" patches, resulted in an overall drop in performance [8]. This performance decrease for non-threaded programs was unacceptable, so the patches were abandoned and no further attempts to remove the GIL were made [8]. However, there are ways to circumvent the GIL limitation and achieve multiple processor usage. The first of the suggested methods is to make use of C extensions: the C extension can release the GIL and keep the executing thread within the C code [8]. The second method is to divide the tasks between multiple Python interpreter processes, which must be spawned with appropriate communication and synchronization mechanisms [8]. There are frameworks and tools that provide functionality for communication between distinct Python processes, such as River [9], Trickle [10] and PYRO [11].

2. Methodology

As stated above, the aim of the Hydra project is to provide a Python framework for parallel execution based on Communicating Sequential Processes. However, the scope of this initial research was restricted to investigating the feasibility of translating CSP algorithms to concurrent Python code. The approach taken was that of research through design and development, requiring a working prototype for use in further research. However, this prototyping approach led to a number of compromises in the development of the framework.
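The second GIL workaround mentioned in Section 1.3.2 — dividing work between separate interpreter processes — can be sketched with the standard subprocess module (a minimal, hypothetical illustration in present-day Python, not Hydra code):

```python
import subprocess
import sys

# Each child runs in its own interpreter and therefore has its own GIL.
# The parent splits the task (summing a range) and combines the partial
# results read from each child's standard output.
CHILD = ("import sys; lo, hi = int(sys.argv[1]), int(sys.argv[2]); "
         "print(sum(range(lo, hi)))")

def spawn(lo, hi):
    # Launch a worker interpreter computing sum(range(lo, hi))
    return subprocess.Popen([sys.executable, "-c", CHILD, str(lo), str(hi)],
                            stdout=subprocess.PIPE)

workers = [spawn(0, 500), spawn(500, 1000)]   # may run truly in parallel
total = sum(int(w.communicate()[0].decode()) for w in workers)
print(total)
```

The explicit combining step stands in for the "appropriate communication and synchronization mechanisms" that frameworks such as River, Trickle and PYRO provide in a more general form.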
2.1. Approach
A number of issues needed to be addressed before a CSP algorithm could be converted into a working, concurrent Python program. These issues are highlighted and briefly described in this section, while the decisions regarding them are fully discussed in Section 3. First, an appropriate grammar for CSP was defined. There are a few variations of the original CSP grammar introduced by Hoare and of the later CSP process algebra or CSP meta-language [4]. While the CSP meta-language allows one to verify a concurrent algorithm through the use of process algebra [4], the original CSP notation provides a more suitable syntax for programming. Therefore, a grammar was devised based on the original CSP notation. The decision to develop our own CSP syntax, instead of using an existing dialect such as that used by FDR, was influenced by the prototype nature of the project. In order to keep the language to be implemented small, it was decided to use a novel syntax so that much of the type support, expression and pattern matching elements of CSP could be ignored initially, allowing us to concentrate on the concurrent aspects. We also wanted to allow the programmer to embed Python expressions in their CSP, thus making the language much more powerful. At a later stage, a more widely accepted dialect could be supported with some modifications to the front-end of the parser. Suitable compiler generators were then identified. There are a number of requirements that the compiler generator needed to meet before being considered for use. Firstly, its parsing technique needs to be powerful enough to deal with the CSP notation. It must be able to cope with ambiguity, providing backtracking and lookahead sufficient to handle any ambiguous cases. Secondly, it must provide functionality for generating target code, as opposed to merely returning the identified tokens. Thirdly, the compiler generator must provide a clear and easy-to-use method for defining the grammar.
This is to ensure that the grammar is maintainable and extensible. Finally, it would be beneficial for the generated parser to be coded in Python for easier integration into the Hydra framework. Good error handling mechanisms would also be beneficial. The final choice of compiler generator was then made, based on its capabilities and shortcomings according to the above criteria. The availability of documentation and the level of development activity were also investigated, to ensure that issues or bugs could be easily resolved. Once the compiler generator had been identified, the CSP parser was implemented using the selected tool. Due to the prototyping approach, the implementation of semantic checks and error reporting was kept rather basic and incomplete. The code generation aspect required the identification of suitable libraries and frameworks on which to build the parallel constructs. Once the libraries had been chosen, the appropriate code segments were designed to represent the CSP constructs as closely as possible using the features of the selected libraries. This laid the groundwork for the actual code generation process. Once the underlying concurrent framework was complete, the actual code generator was developed. This step tied in closely with the parser generation and made use of the features provided by the compiler generator. The code generator was designed to take the abstract syntax tree (AST) returned by the parser and generate an equivalent Python program. Finally, sample programs were developed and converted using Hydra. The output code was analysed by hand to identify any glaring errors, and the program was run and checked for correct execution. Success is indicated by the correct execution and functioning of the Hydra-based program and its communication channels, and by whether or not multiple processors are used, as shown by the CPU load and process metrics.
However, more rigorous testing is required to fully validate the correctness of the conversion process and execution.
2.2. Translation to an Intermediate CSP Implementation

While there are many projects that add CSP features to existing programming languages, there have been very few attempts to convert directly from CSP to executable code [12]. JCSP and CTJ provide CSP features for Java [13,14]. CCSP and C++CSP provide similar CSP features for C and C++, respectively [15,16]. PyCSP introduces CSP features to Python and is discussed further in Section 3.2.2 [17]. From the list of modern-language CSP implementations mentioned above, it would appear that no further work is required to expose CSP to programmers. However, these implementations require the programmer to convert their CSP code into the appropriate form for the implementation they wish to use. For small programs, this task is relatively easy. But once programs become larger and more complex, the process becomes more difficult and error-prone, particularly with regard to the correct naming and use of channels [12]. The time taken to develop and verify the CSP algorithm for a complex system can often be rivalled by the time taken to convert and debug the program written for one of the above-mentioned CSP implementations [12]. Clearly, this is not ideal, and a means of translating the original CSP directly to executable code is more desirable.

3. Hydra Framework

3.1. Compiler Generators

A number of parser generators and parsing frameworks for Python were investigated. The strengths and weaknesses of each compiler generator were assessed, and ANTLR [18] was chosen as the most suitable candidate for use in Hydra.

3.1.1. ANTLR

ANTLR (ANother Tool for Language Recognition) is a parser generator that automates the construction of lexers and parsers [18]. ANTLR generates language recognisers that use a fast and powerful LL(*) parsing technique, an extension of LL(k) that uses arbitrary lookahead to make decisions.
Ambiguity is handled by ANTLR's backtracking functionality, which allows the parser to work out the correct course of action at runtime, and partial memoization of results means that this can be achieved with linear time complexity [18]. The code generation features of ANTLR are also quite advanced, with formal abstract syntax tree construction rules allowing custom ASTs to be constructed. Additionally, ANTLR's tight integration with StringTemplate enables the generation of structured text such as source code [18]. These features make the code generator easily retargetable with minimal changes to the front-end. ANTLR also has a grammar development IDE, named ANTLRWorks, which allows for the visualisation and debugging of parsers generated in any of ANTLR's supported target languages, which include Python among others [18]. ANTLR is also actively supported, with ongoing development, mailing lists, an updated project website, and plenty of documentation and examples.

3.2. Concurrent Framework Modules

A code generator was required to interpret the source language and produce an equivalent version in the target language [3]. The complexity of the code generator is often dependent on the complexity of the target language or architecture. One approach involves developing all the necessary constructs and underlying framework from scratch. A more practical approach is to find and use existing frameworks for the target architecture, and add custom code only for the functionality that is missing or incomplete [10]. Therefore, the back-end concurrent framework for Hydra is built on top of two existing Python frameworks, namely PYRO and PyCSP.
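Before describing these modules, the rendezvous semantics that a CSP channel must provide — a write blocks until the matching read occurs, and vice versa — can be modelled with standard threading primitives (an illustrative toy only; Hydra's actual channels are PyCSP/PYRO objects):

```python
import threading

class RendezvousChannel(object):
    """Toy one-to-one channel: write() blocks until a matching read()."""
    def __init__(self):
        self._write_lock = threading.Lock()       # one writer at a time
        self._item_ready = threading.Semaphore(0)
        self._item_taken = threading.Semaphore(0)
        self._item = None

    def write(self, value):              # CSP output command: ch ! value
        with self._write_lock:
            self._item = value
            self._item_ready.release()   # announce the offered value
            self._item_taken.acquire()   # block until the reader takes it

    def read(self):                      # CSP input command: ch ? var
        self._item_ready.acquire()       # block until a writer arrives
        value = self._item
        self._item_taken.release()       # release the waiting writer
        return value

# Producer and consumer synchronise on every transfer (a rendezvous).
ch = RendezvousChannel()
writer = threading.Thread(target=lambda: [ch.write(i) for i in range(3)])
writer.start()
received = [ch.read() for _ in range(3)]
writer.join()
print(received)
```

PYRO's contribution, described next, is to expose an object with this kind of read/write interface across process boundaries, so that the two sides of the rendezvous can live in separate interpreters.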
3.2.1. Python Remote Objects

Python Remote Objects (PYRO) is a simple yet powerful framework for working with distributed objects written in Python. PYRO essentially handles all the network communication between objects, allowing remote objects to appear as local ones [11]. Additionally, PYRO provides remote method invocation functionality, which allows methods on remote objects to be called locally. PYRO can be used over a network, allowing processes to be distributed between a number of separate computers, or it can be used purely on the local machine to provide a convenient inter-process communication mechanism [11]. PYRO includes a special nameserver component that provides functionality for registering and retrieving remote objects. Client code is then able to register named objects with the PYRO nameserver and retrieve these objects using the specified name [11]. This remote object framework provides all the necessary functionality to implement CSP channels. Each communication channel between processes can be implemented as a remote Channel object with read and write methods. As such, PYRO plays a critical role in the implementation of the concurrent Hydra back-end.

3.2.2. PyCSP

PyCSP is a Python module that provides a number of CSP constructs such as channels, channel poisoning, basic guards, skip guards, input guards, processes, and the alternative, parallel and sequential constructs [17]. The biggest drawback of PyCSP is that the current implementation (version 0.3.0 at the time this research was conducted) makes use of Python's threading library, which is limited by the GIL [8,17]. One solution to this problem is to make use of network channels for communication between multiple local or remote operating system processes, which is the approach we have taken. The PyCSP Process construct is used by simply instantiating an object of the Process class, which extends Python's Thread class [17].
The instantiated Process object does not begin execution until it is used in either the Parallel or Sequential construct [17]. Communication via Channels is handled by simply passing the read and write methods of a Channel object as arguments to a Process's constructor. PyCSP Channels allow any object to be sent across them, including Processes [17]. This is a useful feature that allows for easy distribution of work, as well as the relaxation of type limitations. PyCSP provides network channel functionality using PYRO [19]. With the appropriate custom framework code, this functionality can be leveraged to overcome PyCSP's reliance on Python's threading library. PyCSP has already implemented Python versions of most of the CSP constructs, such as the process, channel, guard and alternative commands, thus alleviating the need to develop these from scratch.

3.3. CSP Grammar

While Hoare indicated that programs expressed in the original notation should be implementable, he also made it clear that the notation was not suitable for use as a programming language [4]. However, CSP provides a convenient notation for defining the architecture and communication channels of the processes used by the program. For this reason, a number of compromises and changes have been made to the grammar to allow for the integration of CSP into Python as a means of describing parallel communication. The grammar used in the Hydra prototype is presented below. A number of simplifications have been made to ease the construction of the prototype, as mentioned in Section 2.1. The goal production, or starting point, of the grammar is the program production, which accepts zero or more Python import statements and a list of commands, as represented further on in the grammar. There is only one method for defining parallel processes, but this is expanded upon in Section 3.4.2.
    program      = ( PYIMPRT )* command_list ;
    parallel     = '[[' process ( '||' process )* ']]' ;
    process      = ( proc_label )? command_list ;
    proc_label   = ID '::' ;
    declaration  = ID ( ',' ID )* ':' type ';' ;
    int_const    = simple_expr ;
    range        = ( ID ':' )? int_const '..' int_const ;
    type         = ( '(' INT '..' INT ')' basictype ) | basictype ;
    basictype    = 'integer' | 'boolean' | 'char' ;
    command_list = declaration* command+ ;
    command      = ( simple_cmd | struct_cmd ) ';' ;
    simple_cmd   = assignment | input_cmd | output_cmd | 'SKIP' | PYEXPR ;
    struct_cmd   = alternative | repetitive | parallel ;
    assignment   = target_var ':=' expression ;
    subscripts   = simple_expr ( ',' simple_expr )* ;
    target_var   = ID ( '[' int_const ']' )? | struct_targ ;
    struct_targ  = ID? '(' var_list? ')' ;
    var_list     = target_var ( ',' target_var )* ;
    simple_expr  = ID ( '[' int_const ']' )? | INT | BOOL | CHR | PYEXPR ;
    struct_expr  = ID? '(' expr_list? ')' ;
    expr_list    = expression ( ',' expression )* ;
    expression   = simple_expr | struct_expr ;
    input_cmd    = ID '?' target_var ;
    output_cmd   = ID '!' expression ;
    repetitive   = '*' alternative ;
    alternative  = '[' guarded ( '[]' guarded )* ']' ;
    guarded      = ( '(' range ')' )? guard '->' command_list ;
    guard        = guardlist | input_cmd | nullcmd ;
    guardlist    = guard_elem ( ';' guard_elem )* ( ';' input_cmd )? ;
    guard_elem   = simple_expr | declaration ;
A significant change that warrants discussion is the removal of expression operators such as the arithmetic and Boolean operators. In their place, the ability to use Python expressions has been added, allowing for much greater flexibility when it comes to expressions. The Python code is enclosed in braces and can be any valid Python expression. To support functionality from Python's vast module collection, the ability to add Python import statements to the beginning of the program was added. These import statements are preceded by "_include" and are enclosed in braces. The rationale behind this rather significant change is that Python already has the ability to evaluate expressions, and implementing them in the grammar would just duplicate existing functionality. This removes the burden of parsing and evaluating expressions and essentially gets the Python interpreter to do this on behalf of the parser. This feature also allows for the use of all of Python's data types, bypassing the limited data type support natively provided by the parser. Examples of the use of these constructs are given in Section 4. The Hydra CSP grammar supports single-line comments, starting with a double hyphen and ending at a newline. Since common keyboards have no keys for the → and □ symbols used by the guarded statement, "->" and "[]" are used in their place. The Hydra lexer supports four basic expression types, namely identifiers, characters, integers and booleans. Identifiers start with a lowercase letter of the alphabet, and can be followed by any combination of uppercase and lowercase letters, digits and the underscore character. Characters can be any valid ASCII character, denoted between single quotes. Integers are simply defined as a series of digits. Finally, Boolean expressions are denoted by either "True" or "False" and are case-sensitive.
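Putting these lexical conventions together, a small hypothetical fragment in the Hydra dialect (constructed for illustration, not taken from the prototype's test programs) might read:

```
_include { from math import sqrt }     -- braced Python import
[[
  calc ::
    n : integer;                       -- declaration with a basic type
    n := {int(sqrt(16))};              -- any Python expression in braces
    [ {n == 4} -> SKIP;                -- "->" stands in for the arrow
   [] {n != 4} -> SKIP;                -- "[]" stands in for the choice box
    ];
]];
```

Each guard here is a braced Python expression, so all arithmetic and comparison work is delegated to the Python interpreter, exactly as described above.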
3.4. Hydra Framework Implementation

3.4.1. Using Hydra

Before going into the implementation of the framework, it would be beneficial to describe the manner in which CSP is used within a Python program. The mechanism chosen is fairly simple, although somewhat less than ideal. The process is described below, along with a very simple example, which defines a process, declares an integer variable and assigns it a value.

    from Hydra.csp import cspexec

    code = """
    [[
      prod ::
        data : integer;
        data := 4;
    ]];
    """

    cspexec(code, progname='simple')
Firstly, the Hydra csp module must be imported. The csp module provides the cspexec method, which takes the string containing the CSP algorithm as an argument, along with an optional program name argument. The cspexec method is responsible for converting the algorithm and managing the execution of the processes. Since the CSP algorithm is represented as a string, it is possible to specify the algorithm inline as a triple-quoted string, or the algorithm can be specified in a separate file, which is then read in and supplied to the cspexec method. The program is then run by simply executing the Python program as usual.

3.4.2. Implementation Decisions

The parallel production, although very simple in appearance, is of paramount importance as it defines the concurrent architecture of the program. It takes a list of one or more processes to be executed in parallel. During execution, these processes are spawned asynchronously and may execute in parallel, thus achieving one of the project goals: execution of code over multiple processors. However, the prototype implementation exhibits two distinct behaviours. For the top-level parallel construct, it generates the appropriate code for executing over multiple Python interpreters, but for any nested parallel statements, PyCSP's Parallel method is used. The rationale behind this is that current desktop computers have at most eight processor cores; therefore, implementing every process in a new Python interpreter instance is not likely to scale adequately. In the future, this distinction could be made explicitly controllable by the programmer. Another important set of CSP constructs is the input and output commands. These essentially define the channels of communication between processes and provide a synchronisation mechanism in the form of a rendezvous.
Channels are named according to their source and destination processes, and are carefully tracked and recorded so that they can be correctly registered with the PYRO nameserver before execution. The input and output commands generate simple read and write method calls on the appropriate Channel objects. While it is possible to communicate with external code through these channels by acquiring the appropriate Channel object via PYRO, such actions are not recommended as they can affect the correct operation of the Hydra program. The process construct is represented as a PyCSP Process. The necessary care is taken to retrieve the relevant Channel objects from the PYRO nameserver using the getNamedChannel method, based on the recorded channels. Since CSP allows for the definition of anonymous processes, a technique for handling and defining these processes was devised. The technique is fairly simple and involves giving each process an internal name. It
is worth noting that, since anonymous processes have no user-defined name, it is not possible to use input and output commands within these processes. The repetitive, alternative and guarded statements are implemented using appropriately constructed Python while and if-else statements. To simplify code generation, all while blocks start with an if statement whose condition is always false. This means that only elif statements need to be generated when translating the alternative construct. Input guards are implemented using PyCSP's Alternative class and the priSelect method, which uses order of appearance as an indicator of priority. Array declarations need to be treated specially by the declaration code, as CSP permits the declaration of array bounds, which can lead to out-of-bounds errors if the programmer attempts to reference an uninitialised list variable. Therefore, arrays are declared as a Python list with the appropriate number of elements, all set to None. Python has a number of keywords that cannot be used as method or variable names. Therefore, all user-defined identifiers are sanitised by simply prefixing an underscore to any identifiers that clash with known keywords. Python expressions and statements embedded in the CSP are handled by simply emitting them as normal Python code. This allows the programmer to use external modules and leverage the power of Python within their CSP code.

3.5. Process Distribution and Execution

Once the programmer has defined their Hydra CSP-based program within a Python file, this file can be executed as a normal Python program (this runs in its own Python interpreter), which then calls the cspexec method of the Hydra framework. The cspexec method controls the entire execution process: receiving the CSP algorithm, having the ANTLR-based translator convert it, registering the appropriate channels, and finally starting the execution of the concurrent program.
A relatively simple approach was taken to bootstrapping and executing the relevant processes once code generation was complete. One of the problems encountered with using PyCSP's network channel functionality is that all channels need to be registered with the PYRO nameserver before the processes are able to retrieve the remote Channel objects. There is no easy way to add this registration process to the generated program without encountering situations where one process requests a channel that has not yet been registered. This problem was addressed by registering all the necessary channels beforehand in the cspexec method of the Hydra.csp module. The translation process returns a list of channels and processes that need to be configured and executed, which is then processed by a simple loop that registers the appropriate channel names with the PYRO nameserver. This can be seen in the Python code snippet below.

    # Iterate through required channels
    for i, chan in enumerate(outpt.channels):
        cn = One2OneChannel()   # Create new channel
        chans.append(cn)        # Keep track of the channel
        # Register the channel with the PYRO nameserver
        registerNamedChannel(chans[i], chan)
Since this happens before process execution, there is no chance of channels being unregistered or multiple registrations occurring for the same channel name, thus breaking interprocess communication. Once the channels have been registered, the processes are asynchronously executed by spawning a new Python interpreter using a loop and Python threads. The cspexec method then waits for the processes to finish executing and allows the user to view the results before ending the program. This process can be seen in the following Python code snippet.
    class runproc(Thread):
        def __init__(self, procname, progname):
            Thread.__init__(self)
            self.procname = procname
            self.progname = progname

        def run(self):
            os.system('python ' + self.progname + '.py ' + self.procname)

    def cspexec(cspcode, progname='hydraexe', debug=False):
        # Translation and channel registration occurs here
        proclist = []
        # Iterate through the list of defined processes
        for proc in outpt.procs:
            # Create a new Python interpreter for each top-level
            # process and start it in a new thread.
            newproc = runproc(proc, progname)
            proclist.append(newproc)
            newproc.start()
        # Wait for processes to finish before terminating
        for proc in proclist:
            proc.join()
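The spawn-and-join pattern above can be exercised in isolation. The self-contained sketch below (hypothetical class and names, not Hydra's actual runproc) captures each child interpreter's output instead of sharing the console, which also makes the behaviour easy to check:

```python
import subprocess
import sys
from threading import Thread

class InterpreterThread(Thread):
    """Runs a code string in a fresh Python interpreter, much as runproc
    runs one top-level CSP process per interpreter."""
    def __init__(self, code):
        Thread.__init__(self)
        self.code = code
        self.output = None

    def run(self):
        out = subprocess.check_output([sys.executable, "-c", self.code])
        self.output = out.decode().strip()

procs = [InterpreterThread("print('proc%d done')" % i) for i in range(3)]
for p in procs:
    p.start()     # spawn all interpreters asynchronously
for p in procs:
    p.join()      # wait for every process before terminating

print([p.output for p in procs])
```

Using subprocess rather than os.system is a minor design variation that avoids relying on a `python` binary being on the PATH and lets the parent collect each child's result.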
An overview of the Hydra translation and execution process is shown in Figure 1.
Figure 1. Hydra translation and execution process.
Since all the processes are defined within the same Python file, it was necessary to provide some means for the new Python interpreters to execute the correct process method. During code generation, simple command-line argument handling is added to the output file that allows for the correct method to be executed based on the supplied argument. This is shown in the Python code snippet below.
    # Snippet from generated Python code
    def __program(_proc_):
        # PyCSP processes omitted
        # Process selection
        if _proc_ == "producer":
            Sequence(producer())
        elif _proc_ == "consumer":
            Sequence(consumer())
        else:
            print 'Invalid process specified.'

    __program(sys.argv[1])
In the above snippet, PyCSP's Sequence construct is used to start the appropriate Process. This is because PyCSP Processes are only executed when they are used in a PyCSP Parallel or Sequence construct.

4. Analysis

Two forms of testing were performed on the Hydra prototype. First, the system was tested with a number of sample programs; the resulting output code was then manually inspected to determine whether it is an accurate representation of the CSP algorithm. Secondly, the code was executed, and the operating system's process and CPU load monitoring tools were used to determine whether or not the program was executing over multiple processor cores. Since this implementation is merely a prototype for investigating the feasibility of using CSP within Python, the actual performance of the framework was not considered. The primary focus was on enabling multiprocessor execution in Python. Once that had been achieved, further work could focus on refining the framework and optimising for performance. All testing was performed on the system configuration specified in Table 1.

Table 1. Testing platform configuration.

    Component          Specification
    CPU                AMD Opteron 170 (2 cores @ 2.0GHz)
    Motherboard        ASUS A8R32-MVP Deluxe
    Memory             2x1GB G.Skill DDR400
    Hard Disk          Seagate 320GB 16MB Cache
    Network            Marvel Gigabit On-board Network Card
    Operating System   Microsoft Windows 2003 Server SP2
    Python             Version 2.5.2
4.1. Generated Code Analysis One of the example Hydra CSP algorithms can be seen in the code listing below. This is a simple program with two processes (only the producer is shown). The producer process outputs the value of x to the consumer process 10000 times and the consumer process simply inputs the value received from producer and stores it in y. The value of x is updated with the current time expressed in seconds from a fixed starting time. The example also highlights the use of Python import statements and the use of Python statements within CSP.
    from Hydra.csp import cspexec

    prodcons = """
    _include { from time import time }
    [[
      -- producer process: sends the value of x to consumer
      producer ::
        x : integer;
        x := 1;
        *[ {x <= 10000} ->
             {print "prod: x = " + str(x)};
             consumer ! x;
             x := {time()};
        ];
    ||
      consumer ::
        -- code omitted
    ]];
    """

    cspexec(prodcons, progname='prodcons')
One of the resulting Python processes can be seen in the listing below.

    import sys
    from pycsp import *
    from pycsp.plugNplay import *
    from pycsp.net import *
    from time import time

    def __program(_proc_):
        @process
        def producer():
            __procname = 'producer'
            __chan_consumer_out = getNamedChannel("producer->consumer")
            x = None
            x = 1
            __lctrl_1 = True
            while (__lctrl_1):
                if False:
                    pass
                elif x <= 10000:
                    print "prod: " + str(x)
                    __chan_consumer_out.write(x)
                    x = time()
                else:
                    __lctrl_1 = False

        @process
        def consumer():
            # code omitted
Looking at the output code, it is clear that Hydra has generated both processes and defined them correctly, with correct channel initialisation and variable declarations. The repetitive command is present in the form of a while loop with the appropriate control variable and alternative code. The guarded commands can also be seen in the form of the elif statements, with expressions and statement blocks correctly represented. The output command can be seen with the write method call on the channel object. This example, while simple, is able to show many of the CSP constructs and their respective representations in Python using Hydra. The resulting Hydra program was then executed and the Windows Task Manager was used to monitor the python.exe interpreter processes and overall CPU usage. All unnecessary programs and services were closed to ensure the least possible interference with the CPU
load measurements. To demonstrate the parallel execution effectively, the guard conditions for the producer and consumer processes were changed to True, thus creating infinite loops. Additionally, complex mathematical calculations were added to the producer and consumer processes to increase CPU load. This provided enough time to demonstrate multi-core usage effectively. Table 2 shows the resulting average CPU loads, which clearly indicate multi-core execution.

Table 2. Average CPU loads.

    Core       Average Load
    CPU 0      83%
    CPU 1      79%
    Combined   81%

The number of Python processes was also verified. Four processes were found: two for the CSP processes, one for the Hydra framework and one for the PYRO nameserver. Therefore, the correct number of processes is being spawned. However, further testing is required on larger, more complex programs to confirm their successful functioning, as well as the correct functioning of the communication channels.

5. Conclusions

The goal of the Hydra project is the creation of a concurrent framework for Python based on CSP. This framework is responsible for converting CSP code into concurrent Python code. The process involved the development of a parser for CSP using ANTLR and the creation of a code generator using ANTLR and StringTemplate, which takes the AST produced by the parser and generates the required Python code. Finally, basic testing was conducted to determine whether or not the Hydra framework was capable of meeting its objectives, and it was confirmed that the Hydra prototype appears to execute the target program correctly, with correct channel initialisation and communication. It also spawns the correct number of processes and executes over multiple processors. However, as stated previously, more rigorous testing and evaluation is required to validate the correctness of the Hydra program with certainty.
The Hydra prototype has demonstrated that it is possible to take a CSP algorithm, convert it into concurrent Python code using the method described in this paper, and have the concurrent program execute over multiple CPU cores. The objective of developing a flexible parser and translator was also achieved, thanks to ANTLR's powerful parsing and code generation functionality. A recent update to the PyCSP project (version 0.6.1) has added support for creating CSP processes as operating system processes instead of threads, similar to what is done in this prototype, although the underlying implementation differs.

Acknowledgements

The authors would like to acknowledge the financial support of Telkom, Comverse, Stortech, Tellabs, Amatole Telecom Services, Mars Technologies, Bright Ideas 39, and THRIP through the Telkom Centre of Excellence in the Department of Computer Science at Rhodes University.
Communicating Process Architectures 2009
P.H. Welch et al. (Eds.)
IOS Press, 2009
© 2009 The authors and IOS Press. All rights reserved.
doi:10.3233/978-1-60750-065-0-325
Extending CSP with Tests for Availability

Gavin LOWE
Oxford University Computing Laboratory, Wolfson Building, Parks Road, Oxford, OX1 3QD, UK
[email protected]

Abstract. We consider the language of CSP extended with a construct that allows processes to test whether a particular event is available (without actually performing the event). We present an operational semantics for this language, together with two congruent denotational semantic models. We also show how this extended language can be simulated using standard CSP, so as to be able to analyse systems using the model checker FDR.

Keywords. CSP, tests for availability, semantic models
Introduction

Many languages for message-passing concurrency allow programs to test whether a channel is ready for communication, without actually performing that communication. For example, in JCSP [1,2], the input and output ends of channels have a method pending(), to test whether there is data ready to be read, or whether there is a reader ready to receive data, respectively. Java InputStreams have a method available() that returns the number of bytes that are available to be read. Andrews [3] gives a number of examples using such a construct. In this paper, we study the effect of adding such tests to the process algebra CSP [4]. In particular, we add a single new construct to the language: the process

  if ready a then P else Q        (1)
tests whether the event a is ready for communication, and then acts like either P or Q, appropriately. More precisely, the process tests whether all other processes that have a in their alphabet (and so who must synchronise on a) are ready to perform a. We assume that within constructs of the form of (1), the test for the readiness of a is carried out only once: if the event becomes available or unavailable after the test is performed, that does not affect the branch that is selected. We allow processes to test for the readiness of events outside their own alphabet. In this paper we investigate the effect of adding the construct (1) to semantic models for CSP. In the next section, we give a brief overview of the syntax and standard semantics of CSP. In Section 2 we give some examples using the new construct, both to illustrate its potential usefulness, and to highlight some implications for the semantic models. In Section 3 we give an operational semantics to the language; then in Section 4 we give congruent denotational models, analogous to the traces and stable failures models of CSP [4]. In Section 5 we show how this extended language can be simulated using standard CSP, so as to be able to analyse systems using a model checker such as FDR [5,6]. We sum up in Section 6.
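For intuition, the kind of channel-readiness test discussed above can be sketched in Python. The Channel class and its pending method below are our own illustrative names (a toy model in the spirit of JCSP's pending(), not its actual implementation):

```python
import queue

class Channel:
    """A toy channel whose readiness test mimics the spirit of JCSP's
    pending(): it reports whether data is already available, without
    consuming it.  Class and method names are ours, for illustration."""

    def __init__(self):
        self._buf = queue.Queue()

    def write(self, value):
        self._buf.put(value)

    def read(self):
        return self._buf.get()

    def pending(self):
        # Non-destructive readiness test: is there data waiting?
        return not self._buf.empty()

c = Channel()
before = c.pending()   # no writer has communicated yet
c.write(42)
after = c.pending()    # a value is now available
value = c.read()
```

Note that, exactly as for the construct (1), the test reports availability only at the instant it is made: a writer may arrive (or data may be consumed by another reader) immediately afterwards.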
1. A Brief Overview of CSP

In this section we give a brief overview of the syntax of CSP; for simplicity and brevity, we consider a fragment of the language in this paper. We also give a brief overview of the traces and stable failures models of CSP. For more details, see [7,4].

CSP is a process algebra for describing programs or processes that interact with their environment by communication. Processes communicate via atomic events, from some set Σ. Events are often passed on channels; for example, the event c.3 represents the value 3 being passed on channel c. The notation {| c |} represents the set of events over channel c.

The simplest process is STOP, which represents a deadlocked process that cannot communicate with its environment. The process div represents a divergent process that can only perform internal events. The process a → P offers its environment the event a; if the event is performed, it then acts like P. The process c?x → P is initially willing to input a value x on channel c, i.e. it is willing to perform any event of the form c.x; it then acts like P (which may use x).

A standard conditional is written as if b then P else Q, where b is a boolean condition on variables within the process (such as variables that hold values previously input)¹. The process b & P is equivalent to if b then P else STOP: P is enabled only if the boolean guard b is true. For convenience, we extend this notation to readiness tests and define:

  ready a & P       as shorthand for if ready a then P else STOP,
  notReady a & P    as shorthand for if ready a then STOP else P.
The tests now act as guards upon P, so that P can be performed only if a is available or not available, respectively.

The process P □ Q can act like either P or Q, the choice being made by the environment: the environment is offered the choice between the initial events of P and Q. By contrast, P ⊓ Q may act like either P or Q, with the choice being made internally, and not under the control of the environment. The process P ▷ Q represents a sliding choice or timeout: the process initially acts like P, but if no event is performed then it can internally change state to act like Q.

The process P A∥B Q runs P and Q in parallel; P is restricted to performing events from A; Q is restricted to performing events from B; the two processes synchronise on events from A ∩ B. In this paper we will take the alphabets A and B to comprise just standard events, as opposed to actions corresponding to readiness tests. As noted above, we allow processes to test for the readiness of events outside their alphabets, e.g. (ready b & a → STOP) {a}∥{b} (b → STOP ▷ STOP). In examples, we will tend to omit the alphabets when they are clear from the context. The process P ∥_A Q runs P and Q in parallel, synchronising on events from A. The process P ||| Q interleaves P and Q, i.e. runs them in parallel with no synchronisation.

The process P \ A acts like P, except the events from A are hidden, i.e. turned into internal, invisible events, denoted τ, which do not need to synchronise with the environment. The process P[[R]] represents P where events are renamed according to the relation R, i.e., P[[R]] can perform an event b whenever P can perform an event a such that a R b. The relation R is often presented as a substitution; for example P[[b/a, c/a]] represents P, with the event a renamed to both b and c, and all other events unchanged.

¹ We use the same syntax for both standard conditionals and readiness tests, but they are semantically different constructs.
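The difference between external and internal choice can be pictured with a small Python sketch of each operator's initial offers (an informal toy model, with names of our own choosing):

```python
# A toy illustration of initial offers under external choice (P [] Q)
# versus internal choice (P |~| Q): externally the environment sees the
# union of both branches' offers; internally only one branch's offers
# survive, and which branch is not under the environment's control.
# Names and encoding are ours.

def initial_offers_external(p_offers, q_offers):
    # the environment is offered the initial events of both branches
    return p_offers | q_offers

def initial_offers_internal(p_offers, q_offers):
    # the process itself picks a branch; both outcomes are possible
    return {frozenset(p_offers), frozenset(q_offers)}

ext = initial_offers_external({"a"}, {"b"})
internal = initial_offers_internal({"a"}, {"b"})
```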
Recursive processes may be defined equationally, or using the notation μ X • P, which represents a process that acts like P, where each occurrence of X represents a recursive instantiation of μ X • P. Prefixing (→) and guarding (&) bind tighter than each of the binary choice operators, which in turn bind tighter than the parallel operators.

CSP can be given both an operational and denotational semantics. The denotational semantics can either be extracted from the operational semantics, or defined directly over the syntax of the language; see [4]. It is more common to use the denotational semantics when specifying or describing the behaviours of processes, although most tools act on the operational semantics. A trace of a process is a sequence of (visible) events that a process can perform. We write traces(P) for the traces of P. If tr is a trace, then tr ↾ A represents the restriction of tr to the events in A, whereas tr \ A represents tr with the events from A removed; concatenation is written "⌢"; A∗ represents the set of traces with events from A. A stable failure of a process P is a pair (tr, X), which represents that P can perform the trace tr to reach a stable state (i.e. where no internal events are possible) where X can be refused, i.e., where none of the events of X is available. We write failures(P) for the stable failures of P.

2. Examples

In this section we consider a few examples, firstly to illustrate the usefulness of the new construct, and then to highlight some aspects of the semantics. Being able to detect readiness on channels can be useful in a number of circumstances. For example, the construct:

  a → P
  □ notReady a & b → Q

gives priority to a over b: the event b can be performed only if the environment is not willing to perform a (at the point at which the test is made).
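The priority idiom above can be mimicked by an imperative sketch, assuming some way of sampling availability at the moment of the test (a hypothetical helper of our own, not a semantics-preserving translation of the CSP):

```python
# Toy model of the priority construct  a -> P [] (notReady a & b -> Q):
# b may be chosen only when a is unavailable at the instant the test is
# made.  The availability flags stand in for the environment; this is
# an illustration, not a translation of CSP.

def priority_choice(a_ready, b_ready):
    if a_ready:
        return "a"          # a has priority whenever it is available
    if b_ready:
        return "b"          # b only when a was not ready at test time
    return None             # blocked: neither branch can proceed

r1 = priority_choice(True, True)
r2 = priority_choice(False, True)
r3 = priority_choice(False, False)
```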
Note, though, that if the environment withdraws its willingness to communicate a after the notReady a test is performed, then the above construct will be blocked, even if b is available: the construct makes the assumption about the environment that a is not withdrawn in this way.

As a slightly larger example, consider the classic readers and writers problem [8]. Here collections of readers and writers share a database. In order to maintain consistency, readers may not use the database at the same time as writers, and at most one writer may use the database at a time. The following guard process supports this: readers (resp. writers) gain entry to the database by performing the event startRead (resp. startWrite) and perform endRead (resp. endWrite) when they are finished. The parameters r and w record the number of readers and writers currently using the database, and satisfy the invariant w ≤ 1 ∧ (r > 0 ⇒ w = 0).

  Guard(r, w) =   w = 0 & startRead → Guard(r + 1, w)
                □ endRead → Guard(r − 1, w)
                □ r = 0 ∧ w = 0 & startWrite → Guard(r, w + 1)
                □ endWrite → Guard(r, w − 1).

The problem with the above design is that writers may be permanently locked out of the database if there is always at least one reader using the database (even if no individual reader uses the database indefinitely). The following version gives priority to writers, by not allowing a new reader to start using the database if there is a writer waiting:
  Guard(r, w) =   w = 0 & notReady startWrite & startRead → Guard(r + 1, w)
                □ endRead → Guard(r − 1, w)
                □ r = 0 ∧ w = 0 & startWrite → Guard(r, w + 1)
                □ endWrite → Guard(r, w − 1).

This idea can be extended further, to achieve fairness to both types of process; the parameter priRead records whether priority should be given to readers.²

  Guard(r, w, priRead) =
      w = 0 ∧ priRead & startRead → Guard(r + 1, w, false)
    □ w = 0 & notReady startWrite & startRead → Guard(r + 1, w, false)
    □ endRead → Guard(r − 1, w, false)
    □ r = 0 ∧ w = 0 ∧ ¬priRead & startWrite → Guard(r, w + 1, true)
    □ r = 0 ∧ w = 0 & notReady startRead & startWrite → Guard(r, w + 1, true)
    □ endWrite → Guard(r, w − 1, true).

We now consider a few examples in order to better understand aspects of the semantics of processes with readiness tests: it turns out that some standard algebraic laws no longer hold. Throughout these examples, we omit alphabets from the parallel composition operator where they are obvious from the context.

Example 1 Consider P ∥ Q where P = a → STOP and Q = if ready a then b → STOP else error → STOP. Clearly, it is possible for Q to detect that a is ready and so perform b. Could Q detect that a is not ready, and so perform error? If P makes a available immediately then clearly the answer is no. However, if it takes P some time to make a available, then Q could test for the availability of a before P has made it available. We believe that any implementation of prefixing will take some time to make a available: for example, in a multi-threaded implementation, scheduling decisions will influence when the a becomes available; further, the code for making a available will itself take some time to run. This is the intuition we follow in the rest of the paper. This decision has a considerable impact on the semantics: it will mean that all processes will take some time to make events available (essentially since all the CSP operators maintain this property).
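The Guard processes above can be animated as a small transition function over the state (r, w); in the sketch below (encoding entirely ours) the writer_waiting flag stands in for the notReady startWrite test, and the enabling conditions r > 0 and w > 0 on the end events are ours:

```python
# A small executable sketch of the readers-and-writers Guard process
# (writer-priority version).  The state is the pair (r, w); the
# writer_waiting flag stands in for the  notReady startWrite  test.
# The encoding and the r > 0 / w > 0 enabling conditions are ours.

def guard_step(r, w, event, writer_waiting=False):
    """Return the successor state (r, w) if 'event' is enabled, else None."""
    if event == "startRead" and w == 0 and not writer_waiting:
        return (r + 1, w)
    if event == "endRead" and r > 0:
        return (r - 1, w)
    if event == "startWrite" and r == 0 and w == 0:
        return (r, w + 1)
    if event == "endWrite" and w > 0:
        return (r, w - 1)
    return None  # event not enabled in this state

def invariant(r, w):
    # w <= 1  and  (r > 0 implies w = 0)
    return w <= 1 and (r == 0 or w == 0)
```

Every step from a state satisfying the invariant reaches a state that again satisfies it, mirroring the claim made for the CSP guard.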
Returning to Example 1, in the combination P ∥ Q, Q can detect that a is not available initially and so perform error.

Example 2 (if ready a then P else Q) \ {a} = P \ {a}: the hiding of a means that the ready a test succeeds, since there is nothing to prevent a from happening.

Example 3 External choice is not idempotent. Consider P = a → STOP ⊓ b → STOP and Q = ready a & ready b & error → STOP. Then P ∥ Q cannot perform error, but (P □ P) ∥ Q can, if the two nondeterministic choices are resolved differently. We do not allow external choices to be resolved by ready or notReady tests: we consider these tests to be analogous to evaluation of standard boolean conditions in if statements, or boolean guards, which are evaluated internally.

Example 4 The process R = ready a & P □ notReady a & Q is not the same as if ready a then P else Q, essentially since the former checks for the readiness of a twice, but the latter checks
² In principle one could merge the first two branches, by using a guard w = 0 ∧ (priRead ∨ notReady startWrite); however, allowing complex guards that mix booleans with readiness testing would complicate the semantic definitions.
only once. When in the presence of the process a → STOP, R can evolve to the state P □ Q (if the ready a test is made after the a becomes available, and the notReady a test is made before the a becomes available) or to STOP □ STOP = STOP (if the ready a test is made before the a becomes available, and the notReady a test is made after the a becomes available); neither of these is, in general, a state of if ready a then P else Q.

Example 5 ready a & ready b & P is not the same as ready b & ready a & P. Consider P = error → STOP and Q = a → STOP ▷ b → STOP. Then (ready a & ready b & P) ∥ Q can perform error, but (ready b & ready a & P) ∥ Q cannot. Similar results hold for notReady, and for a mix of ready and notReady guards.

The above example shows why we do not allow more complex guards, such as ready a ∧ ready b & P: any natural implementation of this process would have to test for the availability of a and b in some order, but the order in which those are tested can make a difference.

3. Operational Semantics

In this section we give operational semantics to the language of CSP extended with tests for the readiness or non-readiness of events. For simplicity, we omit interleaving, the ∥_A form of parallel composition, and renaming from the language we consider.

As normal, we write P −a→ P′, for a ∈ Σ ∪ {τ} (where Σ is the set of visible events, and τ represents an internal event), to indicate that P performs the event a to become P′. In addition, we include transitions to indicate successful readiness or non-readiness tests:

• We write P −ready a→ P′ to indicate that P detects that the event a is ready, and evolves into P′;
• We write P −notReady a→ P′ to indicate that P detects that the event a is not ready, and evolves into P′.

Note the different fonts between ready and notReady, which are part of the syntax, and ready and notReady, which are part of the semantics. Define, for A ⊆ Σ:

  ready A = {ready a | a ∈ A},          notReady A = {notReady a | a ∈ A},
  A† = A ∪ ready A ∪ notReady A,        A†τ = A† ∪ {τ}.

Transitions, then, will be labelled by elements of Σ†τ. We think of the −ready a→ and −notReady a→ transitions as being internal in the sense that they cannot be directly observed by any parallel peer. We refer to elements of Σ†τ as actions, and restrict the word events to elements of Στ. Below we use standard conventions, writing, e.g., P −a→ for ∃ P′ • P −a→ P′, and P −a↛ for ¬(∃ P′ • P −a→ P′).

Recall our intuition that a process such as a → P may not make the a available immediately. We model this by a τ transition to a state where the a is indeed available. It turns out that this latter state is not expressible within the syntax of the language (this follows from Lemma 11, below). Within the operational semantic definitions, we will write this state as aˇ → P. We therefore define the semantics of prefixing by the following two rules:

  a → P −τ→ aˇ → P,        aˇ → P −a→ P.

We stress, though, that the aˇ → . . . notation is only for the purpose of defining the operational semantics, and is not part of the language.
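The two prefix rules can be made executable in a toy term encoding of our own design, where the intermediate state aˇ → P is represented by an "avail" term:

```python
# An executable sketch of the two prefix rules: a -> P first performs a
# tau to an intermediate state (the paper's "a-checked" state), from
# which the visible event a can occur.  Terms are modelled as nested
# tuples; this encoding is ours, purely for illustration.

STOP = ("STOP",)

def prefix(a, p):
    return ("prefix", a, p)          # the process  a -> P

def transitions(proc):
    """Return the set of (action, successor) pairs of a term."""
    kind = proc[0]
    if kind == "prefix":             # a -> P  --tau-->  a^ -> P
        _, a, p = proc
        return {("tau", ("avail", a, p))}
    if kind == "avail":              # a^ -> P  --a-->  P
        _, a, p = proc
        return {(a, p)}
    return set()                     # STOP has no transitions

P = prefix("a", STOP)
step1 = transitions(P)                       # only tau is possible at first
step2 = transitions(("avail", "a", STOP))    # now the a itself is possible
```

Note that, as in the paper's intuition, no visible event is possible from the initial state of a → P: only the τ step that makes a available.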
The following rules for normal events are completely standard. For brevity, we omit the symmetrically equivalent rules for external choice (□) and parallel composition (A∥B). The identifier a ranges over visible events.

  P −τ→ P′  ⟹  P □ Q −τ→ P′ □ Q          P −a→ P′  ⟹  P □ Q −a→ P′
  P −τ→ P′  ⟹  P ▷ Q −τ→ P′ ▷ Q          P −a→ P′  ⟹  P ▷ Q −a→ P′
  P ▷ Q −τ→ Q          P ⊓ Q −τ→ P          P ⊓ Q −τ→ Q
  P −α→ P′, α ∈ (A − B) ∪ {τ}  ⟹  P A∥B Q −α→ P′ A∥B Q
  P −a→ P′, Q −a→ Q′, a ∈ A ∩ B  ⟹  P A∥B Q −a→ P′ A∥B Q′
  P −a→ P′, a ∈ A  ⟹  P \ A −τ→ P′ \ A
  P −α→ P′, α ∈ (Σ − A) ∪ {τ}  ⟹  P \ A −α→ P′ \ A
  μ X • P −τ→ P[μ X • P / X]          div −τ→ div

The following rules show how the tests for readiness operate.

  if ready a then P else Q −ready a→ P,
  if ready a then P else Q −notReady a→ Q.
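The two readiness rules can likewise be transcribed directly; note that a ready a transition and a notReady a transition are always available as alternatives to one another (term encoding ours):

```python
# A self-contained sketch of the readiness rules: the construct
# "if ready a then P else Q" has exactly one ready-a transition (to P)
# and one notReady-a transition (to Q).  The term encoding is ours.

def transitions(proc):
    kind = proc[0]
    if kind == "if_ready":                   # if ready a then P else Q
        _, a, p, q = proc
        return {(("ready", a), p), (("notReady", a), q)}
    return set()

term = ("if_ready", "a", ("P",), ("Q",))
succs = transitions(term)                    # both branches as alternatives
```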
The remaining rules show how the readiness tests are promoted by various operators. We omit symmetrically equivalent rules for brevity. The rules for the choice operators are straightforward.

  P −ready a→ P′  ⟹  P □ Q −ready a→ P′ □ Q
  P −notReady a→ P′  ⟹  P □ Q −notReady a→ P′ □ Q
  P −ready a→ P′  ⟹  P ▷ Q −ready a→ P′ ▷ Q
  P −notReady a→ P′  ⟹  P ▷ Q −notReady a→ P′ ▷ Q
The rules for parallel composition are a little more involved. A ready a action can occur only if all processes with a in their alphabet are able to perform a.

  P −ready b→ P′, Q −b→, b ∈ B  ⟹  P A∥B Q −ready b→ P′ A∥B Q
  P −ready a→ P′, a ∉ B  ⟹  P A∥B Q −ready a→ P′ A∥B Q

A notReady a action requires at least one parallel peer with a in its alphabet to be unable to perform a. In this case, the action is converted into a τ.

  P −notReady b→ P′, Q −b↛, b ∈ B  ⟹  P A∥B Q −τ→ P′ A∥B Q
  P −notReady b→ P′, Q −b→, b ∈ B  ⟹  P A∥B Q −notReady b→ P′ A∥B Q
  P −notReady a→ P′, a ∉ B  ⟹  P A∥B Q −notReady a→ P′ A∥B Q

Note that in the second rule, the notReady b may yet be blocked by some other parallel peer.

If a ready a action can be performed in a context where a is then hidden, then all relevant parallel peers are able to perform a; hence the transition can occur; the action is converted into a τ.

  P −ready a→ P′, a ∈ A  ⟹  P \ A −τ→ P′ \ A
  P −α→ P′, α ∈ ready(Σ − A) ∪ notReady(Σ − A)  ⟹  P \ A −α→ P′ \ A

Note that there is no corresponding rule for notReady a: in the context P \ A, if P can perform notReady a (with a ∈ A) then all parallel peers with a in their alphabet are able to perform a, and so the a is available; hence the notReady a action is blocked for P \ A.

The following two lemmas can be proved using straightforward structural inductions. First, ready a and notReady a actions are available as alternatives to one another.

Lemma 6 For every process P:

  (∃ Q • P −ready a→ Q) ⇔ (∃ Q′ • P −notReady a→ Q′).

Informally, the two transitions correspond to taking the two branches of a construct of the form if ready a then R else R′. The if construct may be only part of the process P above, and so R and R′ may be only part of Q and Q′ above.

Initially, each process can perform no standard events. This is a consequence of our assumption that a process of the form a → P cannot perform the a from its initial state.

Lemma 7 For every process P expressible using the syntax of the language (so excluding the aˇ → . . . construct), and for every standard event a ∈ Σ, P −a↛.

Of course, P might have a τ transition to a state where visible events are available.

4. Denotational Semantics

We now consider how to build a compositional denotational semantic model for our language. We want the model to record at least the traces of visible events performed by processes: any coarser model is likely to be trivial. In order to consider what other information is needed in the model, it is useful to consider (informally) a form of testing: we will say that test T distinguishes processes P and Q if P ∥ T and Q ∥ T have different traces of visible events. In this case, the denotational model should also distinguish P and Q.

We want to record within traces the ready and notReady actions that are performed. For example, the processes b → STOP and ready a & b → STOP are distinguished by the test STOP (with alphabet {a}); we will distinguish them denotationally by including the ready a action in the latter's trace.

Further, we want to record the events that were available as alternatives to those events that were actually performed. For example, the processes a → STOP □ b → STOP and
a → STOP ⊓ b → STOP can be distinguished by the test ready a & b → STOP; we will distinguish them denotationally by recording that the former offers a as an alternative to b. We therefore add actions offer a and notOffer a to represent that a process is offering or not offering a, respectively. These actions will synchronise with ready a and notReady a actions. We write

  offer A = {offer a | a ∈ A},          notOffer A = {notOffer a | a ∈ A},
  A‡ = A† ∪ offer A ∪ notOffer A,       A‡τ = A‡ ∪ {τ}.
A trace of a process will, then, be a sequence of actions from Σ‡. We can calculate the traces of a process in two ways: by extracting them from the operational semantics, and by giving compositional rules. We begin with the former. We augment the operational semantics with extra transitions as follows:

• We add offer a loops on every state P such that P −a→;
• We add notOffer a loops on every state P such that P −a↛.

Formally, we define a new transition relation −↠ by:

  P −α↠ Q ⇔ P −α→ Q,    for α ∈ Σ†τ,
  P −offer a↠ P ⇔ P −a→,
  P −notOffer a↠ P ⇔ P −a↛.

Appendix A gives rules for the −↠ relation that can be derived from the rules for the −→ relation and the above definition. We can then extract the traces (of Σ‡ actions) from the operational semantics (following [4, Chapter 7]):

Definition 8 We write P −tr→ Q, for tr = ⟨α1, . . . , αn⟩ ∈ (Σ‡τ)∗, if there exist P0 = P, P1, . . . , Pn = Q such that Pi −αi+1↠ Pi+1 for i = 0, . . . , n−1. We write P =tr⇒ Q, for tr ∈ (Σ‡)∗, if there is some tr′ such that P −tr′→ Q and tr = tr′ \ τ.

The traces of process P can then be defined to be the set of all tr such that P =tr⇒. The following lemma states some healthiness conditions concerning this set.

Lemma 9 For all processes P expressible using the syntax of the language (so excluding the aˇ → . . . construct), the set T = {tr | P =tr⇒} satisfies the following conditions:

1. T is non-empty and prefix-closed.
2. T includes (notOffer Σ)∗, i.e., the process starts in a state where no standard events are available.
3. offer and notOffer actions can always be removed from or duplicated within a trace: tr⌢⟨α⟩⌢tr′ ∈ T ⇒ tr⌢⟨α, α⟩⌢tr′ ∈ T ∧ tr⌢tr′ ∈ T, for α ∈ offer Σ ∪ notOffer Σ.
4. ready a and notReady a actions are available as alternatives to one another: tr⌢⟨ready a⟩ ∈ T ⇔ tr⌢⟨notReady a⟩ ∈ T.
5. Either an offer a or a notOffer a action is always available: tr⌢tr′ ∈ T ⇒ tr⌢⟨offer a⟩⌢tr′ ∈ T ∨ tr⌢⟨notOffer a⟩⌢tr′ ∈ T.
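Conditions of this kind can be spot-checked on finite, bounded trace sets. The sketch below (with traces encoded as tuples of action strings, an encoding of our own) checks prefix closure and condition 4 for a hand-built fragment of the traces of if ready a then STOP else STOP:

```python
# Spot-check two healthiness conditions on a small finite trace set.
# Traces are tuples of action strings such as "notOffer.a" or
# "ready.a"; the set T below is a hand-built, length-bounded fragment
# of the traces of  if ready a then STOP else STOP.  Encoding ours.

SIGMA = ("a",)

def prefix_closed(T):
    return all(t[:i] in T for t in T for i in range(len(t)))

def condition4(T, max_len):
    """tr^<ready a> in T  iff  tr^<notReady a> in T (bounded check)."""
    ok = True
    for t in T:
        if len(t) < max_len:
            for a in SIGMA:
                ok = ok and (((t + ("ready." + a,)) in T) ==
                             ((t + ("notReady." + a,)) in T))
    return ok

T = {(), ("notOffer.a",), ("ready.a",), ("notReady.a",),
     ("notOffer.a", "notOffer.a"), ("notOffer.a", "ready.a"),
     ("notOffer.a", "notReady.a"), ("ready.a", "notOffer.a"),
     ("notReady.a", "notOffer.a")}
```

Removing the ready or notReady branch from the set breaks condition 4, as the test below confirms.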
Proof: (Sketch)
1. This follows directly from the definition of =⇒.
2. This follows from Lemma 7 and the definition of −↠.
3. This follows directly from the definition of −↠: offer and notOffer transitions always form self-loops.
4. This follows directly from Lemma 6.
5. This follows directly from the definition of −↠: each state has either an offer a or a notOffer a loop. □

4.1. Compositional Traces Semantics

We now give compositional rules for the traces of a process. The semantics for each process will be an element of the following model.

Definition 10 The Readiness-Testing Traces Model contains those sets T ⊆ (Σ‡)∗ that satisfy conditions 2–5 of Lemma 9.

We write tracesR[[P]] for the traces of P³. Below we will show that these are congruent to the operational definition above. STOP and div are equivalent in this model: they can perform no standard events; they can only signal that they are not offering events.

  tracesR[[STOP]] = tracesR[[div]] = (notOffer Σ)∗.

The process a → P can initially signal that it is not offering events; it can then signal that it is offering a but not offering other events; it can then perform a, and then continue like P.

  tracesR[[a → P]] = Init ∪ {tr⌢⟨a⟩⌢tr′ | tr ∈ Init ∧ tr′ ∈ tracesR[[P]]}
  where Init = {tr⌢tr′ | tr ∈ (notOffer Σ)∗ ∧ tr′ ∈ ({offer a} ∪ notOffer(Σ − {a}))∗}.

The process if ready a then P else Q can initially signal that it is not offering events; it can then either detect that a is ready and continue like P, or detect that a is not ready and continue like Q.

  tracesR[[if ready a then P else Q]] =
      (notOffer Σ)∗
    ∪ {tr⌢⟨ready a⟩⌢tr′ | tr ∈ (notOffer Σ)∗ ∧ tr′ ∈ tracesR[[P]]}
    ∪ {tr⌢⟨notReady a⟩⌢tr′ | tr ∈ (notOffer Σ)∗ ∧ tr′ ∈ tracesR[[Q]]}.

The process P ▷ Q can either perform a trace of P, or can perform a trace of P with no standard events, and then (after the timeout) perform a trace of Q. The process P ⊓ Q can perform traces of either of its components.
  tracesR[[P ▷ Q]] = tracesR[[P]] ∪
      {trP⌢trQ | trP ∈ tracesR[[P]] ∧ trP ↾ Σ = ⟨⟩ ∧ trQ ∈ tracesR[[Q]]},
  tracesR[[P ⊓ Q]] = tracesR[[P]] ∪ tracesR[[Q]].

Before the first visible event, the process P □ Q can perform an offer a action if either P or Q can do so; it can perform a notOffer a action if both P and Q can do so. Therefore, P

³ We include the subscript "R" in tracesR[[P]] to distinguish this semantics from the standard traces semantics, traces[[P]].
and Q must synchronise on all notOffer actions before the first visible event. Let tr ∥_{notOffer Σ} tr′ be the set of ways of interleaving tr and tr′, synchronising on all notOffer actions (this operator is a specialisation of the ∥_X operator defined in [4, page 70]). The three sets in the definition below correspond to the cases where (a) neither process performs any visible events (so the two processes synchronise on notOffer actions throughout the execution), (b) P performs at least one visible event (after which, Q is turned off), and (c) the symmetric case where Q performs at least one visible event.

  tracesR[[P □ Q]] =
      {tr | ∃ trP ∈ tracesR[[P]], trQ ∈ tracesR[[Q]] •
          trP ↾ Σ = trQ ↾ Σ = ⟨⟩ ∧ tr ∈ trP ∥_{notOffer Σ} trQ}
    ∪ {tr⌢⟨a⟩⌢trP′ | ∃ trP⌢⟨a⟩⌢trP′ ∈ tracesR[[P]], trQ ∈ tracesR[[Q]] •
          trP ↾ Σ = trQ ↾ Σ = ⟨⟩ ∧ a ∈ Σ ∧ tr ∈ trP ∥_{notOffer Σ} trQ}
    ∪ {tr⌢⟨a⟩⌢trQ′ | ∃ trP ∈ tracesR[[P]], trQ⌢⟨a⟩⌢trQ′ ∈ tracesR[[Q]] •
          trP ↾ Σ = trQ ↾ Σ = ⟨⟩ ∧ a ∈ Σ ∧ tr ∈ trP ∥_{notOffer Σ} trQ}.
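The interleaving operator used above, which synchronises on notOffer actions, can be computed recursively for finite traces (string-based action encoding ours):

```python
# An executable sketch of the operator  tr ||_{notOffer Sigma} tr' :
# all interleavings of two finite traces that synchronise on notOffer
# actions.  Traces are tuples of action strings; encoding ours.

def is_not_offer(action):
    return action.startswith("notOffer.")

def sync_interleave(t1, t2):
    """Set of interleavings of t1 and t2, synchronised on notOffer."""
    if not t1 and not t2:
        return {()}
    out = set()
    # a notOffer action must be performed jointly by both traces
    if t1 and t2 and t1[0] == t2[0] and is_not_offer(t1[0]):
        out |= {(t1[0],) + r for r in sync_interleave(t1[1:], t2[1:])}
    # any other action may be performed by either side independently
    if t1 and not is_not_offer(t1[0]):
        out |= {(t1[0],) + r for r in sync_interleave(t1[1:], t2)}
    if t2 and not is_not_offer(t2[0]):
        out |= {(t2[0],) + r for r in sync_interleave(t1, t2[1:])}
    return out
```

As expected of a synchronised interleaving, a notOffer action present on only one side yields the empty set of combined traces.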
In order to give a semantic equation for parallel composition, we define a relation to capture how traces of parallel components are combined⁴. We write (trP, trQ) −A∥B→ tr if the traces trP of P and trQ of Q can lead to the trace tr of P A∥B Q. Let

  privateA = (A − B) ∪ offer(A − B) ∪ notOffer A ∪ ready(Σ − B) ∪ notReady(Σ − B);

these are the actions that the process with alphabet A can perform without any cooperation from the other process. Let

  syncA,B = (A ∩ B) ∪ offer(A ∩ B);

these are the actions that the two processes synchronise upon. The relation −A∥B→ is defined by:

  (⟨⟩, ⟨⟩) −A∥B→ ⟨⟩;

  if (trP, trQ) −A∥B→ tr and b ∈ B, then

    (⟨α⟩⌢trP, trQ) −A∥B→ ⟨α⟩⌢tr,    for α ∈ privateA,
    (⟨α⟩⌢trP, ⟨α⟩⌢trQ) −A∥B→ ⟨α⟩⌢tr,    for α ∈ syncA,B,
    (⟨ready b⟩⌢trP, ⟨offer b⟩⌢trQ) −A∥B→ ⟨ready b⟩⌢tr,
    (⟨notReady b⟩⌢trP, ⟨notOffer b⟩⌢trQ) −A∥B→ tr,
    (⟨notReady b⟩⌢trP, ⟨offer b⟩⌢trQ) −A∥B→ ⟨notReady b⟩⌢tr;

  together with the symmetric equivalents of the above cases.

In the second clause: the first case corresponds to P performing a private action; the second case corresponds to P and Q synchronising on a shared action; the third case corresponds to a readiness test of P detecting that Q is offering b; the fourth case corresponds to a non-readiness test of P detecting that Q is not offering b; the fifth case corresponds to a non-readiness test of P detecting that Q is offering b. The reader might like to compare this definition with the corresponding operational semantics rules for parallel composition.

⁴ One normally defines a set-valued function to do this, but in our case it is more convenient to define a relation, since this leads to a much shorter definition.

The semantics of parallel composition is then as follows; note that each component is restricted to its own alphabet, and that the composition can perform arbitrary notOffer(Σ − A − B) actions:
tracesR[[P A‖B Q]] = {tr | ∃ trP ∈ tracesR[[P]], trQ ∈ tracesR[[Q]] •
    trP |` ((Σ − A) ∪ offer(Σ − A) ∪ notOffer(Σ − A)) = ⟨⟩ ∧
    trQ |` ((Σ − B) ∪ offer(Σ − B) ∪ notOffer(Σ − B)) = ⟨⟩ ∧
    (trP, trQ) −A‖B→ tr \ notOffer(Σ − A − B)}.
The semantic equation for hiding of A captures that notReady A and offer A actions are blocked, A and ready A actions are internalised, and arbitrary notOffer A actions can occur:

  tracesR[[P \ A]] = {tr | ∃ trP ∈ tracesR[[P]] •
      trP |` (notReady A ∪ offer A) = ⟨⟩ ∧ trP \ (A ∪ ready A) = tr \ notOffer A}.

We now consider the semantics of recursion. Our approach follows the standard method using complete partial orders; see, for example, [4, Appendix A.1].

Lemma 11 The Readiness-Testing Traces Model forms a complete partial order under the subset ordering ⊆, with tracesR[[div]] as the bottom element.

Proof: That tracesR[[div]] is the bottom element follows from item 2 of Lemma 9. It is straightforward to see that the model is closed under arbitrary unions, and hence is a complete partial order. □

The following lemma can be proved using precisely the same techniques as for the standard traces model; see [4, Section 8.2].

Lemma 12 Each of the operators is continuous with respect to the ⊆ ordering.

Hence, by Tarski's Theorem, each mapping F definable using the operators of the language has a least fixed point, given by ⋃n≥0 Fⁿ(div). This justifies the following definition.

  tracesR[[μ X • F(X)]] = the ⊆-least fixed point of the semantic mapping corresponding to F.

The following theorem shows that the two ways of capturing the traces are congruent.

Theorem 13 For all traces tr ∈ (Σ‡)*: tr ∈ tracesR[[P]] iff P =tr⇒.

Proof: (Sketch) By structural induction over the syntax of the language. We give a couple of cases in Appendix B. □

Theorem 14 For all processes, tracesR[[P]] is a member of the Readiness-Testing Traces Model (i.e., it satisfies conditions 2–5 of Lemma 9).

Proof: This follows directly from Lemma 9 and Theorem 13. □

We can relate the semantics of a process in this model to the standard traces semantics. Let φ be the function that replaces readiness tests by nondeterministic choices, i.e., φ(if ready a then P else Q) = φ(P) ⊓ φ(Q), and φ distributes over all other operators (e.g. φ(P A‖B Q) = φ(P) A‖B φ(Q)). The standard traces of φ(P) are just the projection onto standard events of the readiness-testing traces of P.

Theorem 15 traces[[φ(P)]] = {tr |` Σ | tr ∈ tracesR[[P]]}.
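The projection in Theorem 15 is simple to state operationally. The following Python sketch (illustrative only, with my own trace representation: standard events as strings, availability actions as tagged pairs) recovers the standard traces from a set of readiness-testing traces by erasing the offer / notOffer / ready / notReady actions.

```python
def project(traces_R):
    """tr |` Sigma for each trace tr: keep only the standard events
    (modelled here as plain strings)."""
    return {tuple(a for a in tr if isinstance(a, str)) for tr in traces_R}

# e.g. some readiness-testing traces of a -> STOP:
example = {
    (),
    (("notOffer", "b"),),
    (("offer", "a"),),
    (("offer", "a"), "a"),
    (("offer", "a"), "a", ("notOffer", "a")),
}
```

Applied to the example set, project yields just the standard traces ⟨⟩ and ⟨a⟩.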
4.2. Failures

We now consider how to refine the semantic model, to make it analogous to the stable failures model [4], i.e. to record information about which events can be stably refused. The refusal of events seems, at first sight, to be very similar to those events not being offered, as recorded by notOffer actions. The difference is that refusals are recorded only in stable states, i.e. where no internal events are available: this means that if an event is stably refused, it will continue to be refused (until a visible event is performed); on the other hand, notOffer actions can occur in any state, and the corresponding events may subsequently become available. So, for example:

• a → STOP ⊓ STOP is equivalent to a → STOP in the Readiness-Testing Traces Model, since the traces of STOP are included in the initial traces of a → STOP; but a → STOP ⊓ STOP can stably refuse a initially, whereas a → STOP cannot.

• a → STOP ▷ STOP ▷ a → STOP has the trace ⟨offer a, notOffer a, offer a⟩ (where the notOffer a action is from the intermediate STOP state) whereas a → STOP does not; but neither process can stably refuse a before a is performed.

Recall that in the standard model, stable failures are of the form (tr, X), where tr is a trace and X is a set of events that are stably refused. For the language in this paper, should refusal sets contain actions other than standard events? Firstly, we should not consider states with ready or notReady transitions to be stable: recall that we consider these actions to be similar to τ events, in that they are not externally visible. We define:

  stable P ⇔ ∀ α ∈ ready Σ ∪ notReady Σ ∪ {τ} • ¬ P −α→.

Therefore such actions are necessarily unavailable in stable states, so there is no need to record them in refusal sets. There is also no need to record the refusal of an offer a action, since this will happen precisely when the event a is refused.

It turns out that including notOffer actions within refusal sets can add to the discriminating power of the model. Consider

  P = a → STOP ▷ STOP,    Q = (a → STOP ▷ STOP) ⊓ a → STOP.

Then P and Q have the same traces, and have the same stable refusals of standard events. However, Q can, after the empty trace, stably refuse {b, notOffer a} (i.e., stably offer a and stably refuse b), whereas P cannot. We therefore have a choice as to whether or not we include notOffer actions within refusal sets. We choose not to, because the distinctions one can make by including them do not seem useful, and excluding them leads to a simpler model: in particular, the refusal of notOffer actions does not contribute to the performance or refusal of any standard events. I suspect that including notOffer actions within refusal sets would lead to a model similar in style to the stable ready sets model [9,10]. Hence we define, for X ⊆ Σ:

  P ref X ⇔ stable P ∧ ∀ x ∈ X • ¬ P −x→.

We then define the stable failures of a process in the normal way:

  (tr, X) ∈ failuresR[[P]] ⇔ ∃ Q • P =tr⇒ Q ∧ Q ref X.    (2)
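The stability and refusal definitions can be prototyped over an explicit transition system. The following Python sketch (my own representation, not from the paper) checks them against a hand-built transition system for a → STOP ⊓ STOP, where "aHat" names the state in which a is on offer, and the initial internal choice contributes two τ transitions.

```python
# Labels: "tau", ("ready", e), ("notReady", e), ("offer", e),
# ("notOffer", e), or a standard event e (a plain string).

def stable(lts, s):
    """stable s <=> no tau, ready or notReady transition from s."""
    return all(
        lab != "tau" and not (isinstance(lab, tuple) and lab[0] in ("ready", "notReady"))
        for lab, _ in lts[s]
    )

def refuses(lts, s, X):
    """s ref X <=> s is stable and performs no event of X."""
    return stable(lts, s) and all(lab not in X for lab, _ in lts[s])

# P = a -> STOP |~| STOP, as a hand-built transition system:
P = {
    "P":    {("tau", "aPre"), ("tau", "STOP")},
    "aPre": {("tau", "aHat"), (("notOffer", "a"), "aPre"), (("notOffer", "b"), "aPre")},
    "aHat": {("a", "STOP"), (("offer", "a"), "aHat"), (("notOffer", "b"), "aHat")},
    "STOP": {(("notOffer", "a"), "STOP"), (("notOffer", "b"), "STOP")},
}
```

Here the initial state and "aPre" are unstable (they have τ transitions), while "STOP" stably refuses {a, b} and "aHat" stably refuses b but not a, matching the first bullet above.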
Definition 16 The Readiness-Testing Stable Failures Model contains those pairs (T, F) where T ⊆ (Σ‡ )∗ , F ⊆ (Σ‡ )∗ × P Σ, T satisfies conditions 2–5 of the Readiness-Testing Traces Model, and also
6. If (tr, X) ∈ F then tr ∈ T.

Below, we give compositional rules for the stable failures of a process. Since the notion of refusal is identical to that in the standard stable failures model, the refusal components are calculated precisely as in that model, and so the equations are straightforward adaptations of the rules for traces. The only point worth noting is that in the construct if ready a then P else Q, no failures are recorded before the if is resolved.

failuresR[[div]] = {},

failuresR[[STOP]] = {(tr, X) | tr ∈ (notOffer Σ)* ∧ X ⊆ Σ},

failuresR[[a → P]] = {(tr, X) | tr ∈ Init ∧ a ∉ X}
    ∪ {(tr ⟨a⟩ tr′, X) | tr ∈ Init ∧ (tr′, X) ∈ failuresR[[P]]}
  where Init = {tr tr′ | tr ∈ (notOffer Σ)* ∧ tr′ ∈ ({offer a} ∪ notOffer(Σ − {a}))*},

failuresR[[if ready a then P else Q]] =
    {(tr ⟨ready a⟩ tr′, X) | tr ∈ (notOffer Σ)* ∧ (tr′, X) ∈ failuresR[[P]]}
    ∪ {(tr ⟨notReady a⟩ tr′, X) | tr ∈ (notOffer Σ)* ∧ (tr′, X) ∈ failuresR[[Q]]},

failuresR[[P ▷ Q]] = {(tr, X) | (tr, X) ∈ failuresR[[P]] ∧ tr |` Σ ≠ ⟨⟩}
    ∪ {(trP trQ, X) | trP ∈ tracesR[[P]] ∧ trP |` Σ = ⟨⟩ ∧ (trQ, X) ∈ failuresR[[Q]]},

failuresR[[P ⊓ Q]] = failuresR[[P]] ∪ failuresR[[Q]],

failuresR[[P □ Q]] =
    {(tr, X) | ∃ (trP, X) ∈ failuresR[[P]], (trQ, X) ∈ failuresR[[Q]] •
      trP |` Σ = trQ |` Σ = ⟨⟩ ∧ tr ∈ trP ‖notOffer Σ trQ}
    ∪ {(tr ⟨a⟩ trP′, X) | ∃ (trP ⟨a⟩ trP′, X) ∈ failuresR[[P]], trQ ∈ tracesR[[Q]] •
      trP |` Σ = trQ |` Σ = ⟨⟩ ∧ a ∈ Σ ∧ tr ∈ trP ‖notOffer Σ trQ}
    ∪ {(tr ⟨a⟩ trQ′, X) | ∃ trP ∈ tracesR[[P]], (trQ ⟨a⟩ trQ′, X) ∈ failuresR[[Q]] •
      trP |` Σ = trQ |` Σ = ⟨⟩ ∧ a ∈ Σ ∧ tr ∈ trP ‖notOffer Σ trQ}

(where trP ‖notOffer Σ trQ denotes the set of interleavings of trP and trQ that synchronise on the notOffer Σ actions),

failuresR[[P A‖B Q]] =
    {(tr, Z) | ∃ (trP, X) ∈ failuresR[[P]], (trQ, Y) ∈ failuresR[[Q]] •
      trP |` ((Σ − A) ∪ offer(Σ − A) ∪ notOffer(Σ − A)) = ⟨⟩ ∧
      trQ |` ((Σ − B) ∪ offer(Σ − B) ∪ notOffer(Σ − B)) = ⟨⟩ ∧
      (trP, trQ) −A‖B→ tr \ notOffer(Σ − A − B) ∧ Z ∩ (A ∪ B) = X ∩ A ∪ Y ∩ B},

failuresR[[P \ A]] = {(tr, X) | ∃ (trP, X ∪ A) ∈ failuresR[[P]] •
    trP |` (notReady A ∪ offer A) = ⟨⟩ ∧ trP \ (A ∪ ready A) = tr \ notOffer A},

failuresR[[μ X • F(X)]] = the ⊆-least fixed point of the semantic mapping corresponding to F.

The fixed-point definition for recursion can be justified in a similar way to that for traces. The congruence of the above rules to the operational definition of stable failures (i.e., equation (2)) can be proved in a similar way to Theorem 13. Conditions 2–5 of the Readiness-Testing Stable Failures Model are satisfied, because of the corresponding result for traces
(Theorem 14). Condition 6 follows directly from the definition of a stable failure, and the congruence of the operational and denotational semantics. The following theorem relates the semantics of a process in this model to the standard stable failures semantics. Theorem 17 failures[[φ(P)]] = {(tr |` Σ, X) | (tr, X) ∈ failuresR [[P]]}.
5. Model Checking

In this section we illustrate how one can use a standard CSP model checker, such as FDR [5,6], to analyse processes in the extended language of this paper. We just give an example here, in order to give the flavour of the translation; we discuss prospects for generalising the approach in the concluding section of the paper.

We consider the following solution to the readers and writers problem.

  Guard(r, w) =
      w = 0 ∧ r < N & notReady startWrite & startRead → Guard(r + 1, w)
    □ r > 0 & endRead → Guard(r − 1, w)
    □ r = 0 ∧ w = 0 & startWrite → Guard(r, w + 1)
    □ w > 0 & endWrite → Guard(r, w − 1).

This is the solution from Section 2 that gives priority to writers, except we impose a bound of N upon the number of readers, and add guards to the second and fourth branches, in order to keep the state space finite. We will show that this solution is starvation-free as far as the writers are concerned: i.e. if a writer is trying to gain access then one such writer eventually succeeds.

We will simulate the above guard process using standard CSP, in particular simulating the ready, notReady, offer and notOffer actions by fresh CSP events on channels ready, notReady, offer and notOffer. Each process is translated into a form that uses these channels, following the semantics presented earlier; the simulation will have transitions that correspond to the −↠ transitions of the original, except it will have a few additional τ transitions that do not affect the failures-divergences semantics. More precisely, let α̂ be the event used to simulate the action α; for example, if α = ready e then α̂ = ready.e. Then each process P is simulated by a translation trans(P), where if P −α↠ Q then trans(P) −α̂→ (−τ→)* trans(Q), and vice versa. In particular, for each standard event e, we must add an offer.e or notOffer.e loop to each state.
For convenience, and to distinguish between the source and target languages, we present the simulation using prettified machine-readable CSP.⁵ The standard events and the channels to simulate the non-standard actions are declared as follows:

  channel startWrite, endWrite, startRead, endRead
  E = {startWrite, endWrite, startRead, endRead}
  channel ready, notReady, offer, notOffer : E

We start by defining some helper processes. The following process is the translation of STOP: it can only signal that it is not offering standard events.

  STOPT = notOffer?e → STOPT

⁵ The CSP text below is produced (almost) directly from the machine-readable CSP using LaTeX macros.
The following process is the translation of e → P: initially it can signal that it is not offering standard events; it can then timeout into a state where e is available, after which it acts like P; in this latter state it can also signal that it is offering e but no other standard events.⁶

  Prefix(e,P) = notOffer?e1 → Prefix(e,P) ▷ Prefix1(e,P)
  Prefix1(e,P) = e → P
    □ offer.e → Prefix1(e,P)
    □ notOffer?e1:diff(E,{e}) → Prefix1(e,P)
The reader might like to compare these with the −↠ semantics in Appendix A.

In order to simulate the Guard process, we simulate each branch as a separate parallel process: the branches of Guard synchronise on notOffer actions before the choice is resolved, so the processes simulating these branches will synchronise on appropriate notOffer events. The first branch is simulated as below:

  Branch(1,r,w) = if w==0 and r
We explain the Restart process below. Note how the notReady startWrite test is simulated by the notReady.startWrite and ready.startWrite events. Note also how the process signals which standard events are and are not available in the different states. The other branches are slightly simpler, as they do not include readiness tests.

  Branch(2,r,w) = if r>0 then Prefix(endRead, Restart(2,r−1,w)) else STOPT
  Branch(3,r,w) = if r==0 and w==0 then Prefix(startWrite, Restart(3,r,w+1)) else STOPT
  Branch(4,r,w) = if w>0 then Prefix(endWrite, Restart(4,r,w−1)) else STOPT
When one branch executes and reaches a point corresponding to a recursion within the Guard process, all the other branch processes need to be restarted, with new values for r or w. We implement this by the executing branch signalling on the channel restart.

  R = {0..N}        -- possible values of r
  W = {0..1}        -- possible values of w
  BRANCH = {1..4}   -- branch identifiers
  channel restart : BRANCH.R.W

  Restart(i,r,w) = restart!i.r.w → Branch(i,r,w)
Each branch can receive such a signal from another branch, as an interrupt, and restart with the new values for r and w.⁷

  Branch'(i,r,w) = Branch(i,r,w) △ restart?j:diff(BRANCH,{i})?r'.w' → Branch'(i,r',w')

⁶ The operator diff represents set difference.
⁷ The △ is an interrupt operator; the left hand side is interrupted when the right hand side performs an event.
Below we will combine these Branch' processes in parallel so as to simulate Guard. We will need to be able to identify which branch performs certain events. For events e other than notOffer events, we rename e performed by branch i to c.i.e. We rename each notOffer.e event performed by branch i to both itself and notOffer1.i.e: the former will be used before the choice is resolved (synchronised between all branch processes), and the latter will be used after the choice is resolved (privately to branch i):⁸

  EE = union(E, {|ready, notReady, offer|})  -- events other than notOffer
  channel c : BRANCH.EE   -- c.i.e represents event e done by Branch(i,_,_)
  channel notOffer1 : BRANCH.E

  Branch''(i,r,w) = Branch'(i,r,w)
    [[ e ← c.i.e | e ← EE ]]
    [[ notOffer.e ← notOffer.e, notOffer.e ← notOffer1.i.e | e ← E ]]

  alpha(i) = {|c.i, restart, notOffer, notOffer1.i|}  -- alphabet of branch i
Below we will combine the branch processes in parallel, together with a regulator process Reg that, once a branch has done a standard event to resolve the choice, blocks all events of the other branches until a restart occurs; further, it forces processes to synchronise on notOffer events before the choice is resolved, and subsequently allows the unsynchronised notOffer1 events.⁹

  Reg = c?i?e → (if member(e,E) then Reg'(i) else Reg)
    □ notOffer?_ → Reg
  Reg'(i) = c.i?_ → Reg'(i)
    □ restart.i?_?_ → Reg
    □ notOffer1.i?_ → Reg'(i)
We build the guard process by combining the branches and regulator in parallel, hiding the restart events, and reversing the above renaming.¹⁰

  Guard0(r,w) = (‖ i : BRANCH • [alpha(i)] Branch''(i,r,w))
    [| {|c,restart,notOffer,notOffer1|} |] Reg
  Guard(r,w) = (Guard0(r,w) \ {| restart |})
    [[ c.i.e ← e | e ← EE, i ← BRANCH ]]
    [[ notOffer1.i.e ← notOffer.e | e ← E, i ← BRANCH ]]
We can check the simple safety property that the guard allows at most one active writer at a time, and never allows both readers and writers to be active.

  Spec(r,w) = w==0 and r<N & startRead → Spec(r+1,w)
    □ r>0 & endRead → Spec(r−1,w)
    □ r==0 and w==0 & startWrite → Spec(r,w+1)
    □ w>0 & endWrite → Spec(r,w−1)

  internals = {|ready, notReady, offer, notOffer, writerTrying|}

  assert Spec(0,0) ⊑T Guard(0,0) \ internals

⁸ union represents the union operation.
⁹ member tests for membership of a set.
¹⁰ The ‖ is an indexed parallel composition, indexed by i; here the ith component has alphabet alpha(i). The notation [| A |] is the machine-readable CSP version of ‖A.
This test succeeds, at least for small values of N.

In order to verify a liveness property, we need to model the readers and writers themselves. Each reader alternates between performing startRead and endRead, or may decide to stop (when not reading). Each writer is similar, but, for later convenience, we add an event writerTrying to indicate that it is trying to perform a write. It is important that the startWrite event becomes available immediately after the writerTrying event, in order for the liveness property below to be satisfied; hence we have the following form.

  channel writerTrying
  Reader = Prefix(startRead, Prefix(endRead, Reader)) ⊓ STOPT
  Writer = (writerTrying → Prefix1(startWrite, Prefix(endWrite, Writer)) ⊓ STOPT)
    □ notOffer.startWrite → Writer
Following the semantic definitions, we need to synchronise the notOffer.startWrite events of the individual writers, and we need to synchronise the offer.startWrite and notOffer.startWrite events of the writers with the ready.startWrite and notReady.startWrite events of the guard, respectively. For convenience, we block the remaining offer and notOffer events, since we make no use of them, and processes do not change state when they perform such an event.

  Readers = ||| r:{1..N} • Reader
  Writers = ([| {notOffer.startWrite} |] w:{1..N} • Writer)
    [[ offer.startWrite ← ready.startWrite, notOffer.startWrite ← notReady.startWrite ]]
  ReadersWriters = Readers ||| Writers
  System = let SyncSet = union(E, {ready.startWrite, notReady.startWrite})
    within (Guard(0,0) [| SyncSet |] ReadersWriters) [| {|offer,notOffer|} |] STOP
We now consider the liveness property that the guard is fair to the writers, in the sense that if at least one writer is trying to gain access, then one of them eventually succeeds. Testing for this property is not easy: the only way to test that a startWrite event eventually becomes available is to hide the readers' events, and to check that startWrite becomes available without a divergence (so after only finitely many readers' events); however, hiding all the readers' events will lead to a divergence when no writer is trying to gain access (at which point refinement tests do not act in the way we would like). What we therefore do is use a construction that has the effect of hiding the readers' events when a writer is trying to gain access, but leaving the startRead events visible when no writer is trying to gain access. We then test against the following specification (where all other irrelevant events are hidden).

  WLSpec(n) = n<N & writerTrying → WLSpec(n+1)
    □ n>0 & startWrite → WLSpec(n−1)
    □ n==0 & (startRead → WLSpec(n) ⊓ STOP)
The parameter n records how many writers are currently trying; when n>0 (i.e. at least one writer is trying), this process insists that a writer can start after a finite amount of (hidden) activity by the readers; when n==0 (i.e. no writer is trying), the process allows arbitrary startRead events.
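The guard structure just described can be paraphrased as a small transition function. The following Python sketch is illustrative only (with N fixed to 3, and my own encoding); it captures which events WLSpec(n) permits for each n, though not the nondeterministic STOP branch that lets the specification refuse startRead.

```python
N = 3  # an assumed bound on the number of writers

def wlspec_moves(n):
    """Events WLSpec(n) may offer, mapped to the successor value of n."""
    moves = {}
    if n < N:
        moves["writerTrying"] = n + 1   # another writer starts trying
    if n > 0:
        moves["startWrite"] = n - 1     # a trying writer must be able to start
    if n == 0:
        moves["startRead"] = n          # readers visible only when no writer waits
    return moves
```

In particular, for n > 0 the only visible moves are writerTrying and startWrite, which is what forces a writer to make progress once one is trying.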
The way to implement the state-dependent hiding described above is to rename startRead events both to themselves and to a new event startRead', put the result in parallel with a regulator that allows startRead events when no writer is trying and startRead' events when at least one writer is trying, and then hide startRead' and other irrelevant events.

  channel startRead'
  System' = (System [[ startRead ← startRead, startRead ← startRead' ]]
      [| {writerTrying, startWrite, startRead, startRead'} |] Reg1(0))
    \ {|endRead, endWrite, ready, notReady, startRead'|}
  Reg1(n) = n<N & writerTrying → Reg1(n+1)
    □ n>0 & startWrite → Reg1(n−1)
    □ n==0 & startRead → Reg1(n)
    □ n>0 & startRead' → Reg1(n)

We can then use FDR to verify

  assert WLSpec(0) ⊑FD System'
In particular, this means that the right hand side is divergence-free, so when n>0 the startWrite events will become available after a finite number of the hidden events.

6. Discussion

In this paper we have considered an extension of CSP that allows processes to test whether an event is available. We have formalised this construct by giving an operational semantics and congruent denotational semantic models. We have illustrated how we can use a standard model checker to analyse systems in this extended language. In this final section we discuss some related work and possible extensions to this work.

6.1. Comparison with Standard Models

There have been several different denotational semantic models for CSP. Most of these are based on the standard syntax of CSP, with the standard operational semantics. It is not possible to compare these models directly with the models in this paper, since we have used a more expressive language, with the addition of readiness tests; however, we can compare them with the sub-language excluding this construct.

For processes that do not use readiness tests, the two models of this paper are more distinguishing than the standard traces and stable failures models, respectively. Theorems 15 and 17 show that our models make at least as many distinctions as the standard models. They distinguish processes that the standard models identify, such as

  a → STOP ▷ (a → STOP ⊓ b → STOP)   and   a → STOP ⊓ b → STOP:

the former, but not the latter, has the trace ⟨offer a, b⟩.

In [10], Roscoe gives a survey of the denotational models based on the standard syntax. The most distinguishing of those models based on finite observations (and so not modelling divergences) is the finite linear model, FL. This model uses observations of the form ⟨A0, a0, A1, a1, . . . , an−1, An⟩, where each ai is an event that is performed, and each Ai is either (a) a set of events, representing that those events are offered in a stable state (and so ai ∈ Ai), or (b) the special value • representing no information about what events are stably offered (perhaps because the process did not stabilise).
For processes with no readiness tests, FL is incomparable with the models in this paper. Our models distinguish processes that FL identifies, essentially because the latter records the availability of events only in stable states, whereas our models record this information also in unstable states. For example, the processes

  (b → STOP ▷ a → STOP) ⊓ b → STOP   and   a → STOP ⊓ b → STOP

are distinguished in our models since just the former has the trace ⟨offer b, a⟩; however they are identified in FL (and hence all the other finite observation models from [10]) because this b is not stably available. Conversely, FL distinguishes processes that our models identify, such as

  a → b → STOP ⊓ (a → STOP ▷ STOP)   and   a → (b → STOP ⊓ STOP) ▷ STOP,

since the former has the observation ⟨{a}, a, •, b, •⟩, but the latter does not, since its a is performed from an unstable state; however, they are identified by our failures model (and hence our traces model), since this records stability information only at the end of a trace. I believe it would be straightforward to extend our models to record stability information throughout the trace in the way FL does.

Roscoe also shows that each of the standard finite observation models can be extended to model divergences in three essentially different ways. I believe it would be straightforward to extend the model in this paper to include divergences following any of these techniques.

6.2. Comparison with Other Prioritised Models

As we described in the introduction, the readiness tests can be used to implement a form of priority. There have been a number of previous attempts to add priority to CSP.

Lawrence [11] models priorities by representing a process as a set of triples of the form (tr, X, Y), meaning that after performing trace tr, if a process is offered the set of events X, then it will be willing to perform any of the events from Y. For example, a process that initially gives priority to a over b would include the triple (⟨⟩, {a, b}, {a}).

Fidge [12] models priorities using a set of "preferences" relations over events. For example, a process that gives priority to a over b would have the preferences relation {a → a, b → b, a → b}.

In [13,14], I modelled priorities within timed CSP using an order over the sets (actually, multisets) of events that the process could do at the same time. For example, a process that gives priority to a over b (and would rather do either than nothing) at time 0 would have the ordering (0, {a}) ≻ (0, {b}) ≻ (0, {}).

All the above models are rather complex: I would claim that the model in this paper is somewhat simpler, and has the advantage of allowing a translation into standard CSP, in order to use the FDR model checker.

One issue that sometimes arises when considering priority is what happens when two processes with opposing priorities are composed in parallel. For example, consider P ‖ Q where P and Q give priority to a and b respectively:

  P = a → P1 □ notReady a & b → P2,
  Q = b → Q1 □ notReady b & a → Q2.

There are essentially three ways in which this parallel composition can behave (in a context that doesn't block a or b):

• If P performs its notReady test before Q, the test succeeds; if, further, P makes the b available before Q performs its notReady test, then Q's notReady test will fail, and so the parallel composition will perform b;
• The symmetric opposite of the previous case, where Q performs its notReady test and makes a available before P performs its notReady test, and so the parallel composition performs a;

• If both processes perform their notReady tests before the other process makes the corresponding event available, then both tests will succeed, and so both events will be possible.

I consider this to be an appropriate solution. By contrast, the model in [11] leads to this system deadlocking; in [13,14] I used a prioritised parallel composition operator to give priority to the preferences of one component, thereby avoiding this problem, but at the cost of considerable complexity.

6.3. Readiness Testing for Channels

Most CSP-like programming languages make use of channels that pass data. In such languages, one can often test for the availability of channels, rather than of individual events. Note, though, that there is an asymmetry between the reader and writer of the channel: the reader will normally want to test whether there is any event on the channel that is ready for communication (i.e. the writer is ready to communicate), whereas the writer will normally want to test whether all events on the channel are ready for communication (i.e. the reader is ready to communicate). It would be interesting to extend the language of this paper with constructs

  if ready all A then P else Q   and   if ready any A then P else Q,

where A is a set of events (e.g. all events on a channel), to capture this idea.

6.4. Model Checking

In Section 5, we gave an indication as to how to simulate the language of this paper using standard CSP, so as to use a model checker such as FDR. This technique works in general. This can be shown directly, by exhibiting the translation. Alternatively, we can make use of a general result from [15], where Roscoe shows that any operator with an operational semantics that is "CSP-like" (essentially, that the operator can turn arguments on, interact with arguments via visible events, promote τ events of arguments, and maybe turn arguments off) can be simulated using standard CSP operators. The −↠ semantics of this paper is "CSP-like" in this sense, so we can use those techniques to simulate this semantics.

We intend to automate the translation. One difficulty is that the translation from [15] can produce processes that are infinite state, because they branch off extra parallel processes at each recursion of the simulated process. In Section 5 we avoided this problem by restarting the Branch processes that constituted the guard at each recursion (this uses an idea also due to Roscoe), whereas Roscoe's approach would branch off new processes at each recursion. I believe the technique in Section 5 can be generalised and probably automated.

6.5. Full Abstraction

The denotational semantic models we have presented turn out not to be fully abstract with respect to may-testing [16]. Consider the process if ready a then P else P. This is denotationally distinct from P, since its traces have ready a or notReady a events added to the traces of P. Yet there seems to be no way to distinguish the two processes by testing: i.e., there is no good reason to consider those processes as distinct.

We believe that one could form a fully abstract semantics as follows. Consider the relation ∼ over sets of traces defined by
  (S ∪ {tr tr′}) ∼ (S ∪ {tr ⟨ready a⟩ tr′, tr ⟨notReady a⟩ tr′}),
    for all S ∈ P(Σ*), tr, tr′ ∈ Σ*, a ∈ Σ.

In other words, two sets are related if one is formed from the other by adding ready a and notReady a actions in the same place. Let ≈ be the transitive reflexive closure of ∼. This relation essentially abstracts away irrelevant readiness tests. We conjecture that P and Q are testing equivalent iff tracesR[[P]] ≈ tracesR[[Q]], and that it might be possible to produce a compositional semantics corresponding to this equivalence. It is not clear that the benefits of full abstraction are worth this extra complexity, though.

Acknowledgements

I would like to thank Bill Roscoe and Bernard Sufrin for interesting discussions on this work. I would also like to thank the anonymous referees for a number of very useful suggestions.

References

[1] Peter Welch, Neil Brown, James Moores, Kevin Chalmers, and Bernhard Sputh. Integrating and extending JCSP. In Communicating Process Architectures, pages 48–76, 2007.
[2] Peter Welch and Neil Brown. Communicating sequential processes for Java (JCSP). http://www.cs.kent.ac.uk/projects/ofa/jcsp/, 2009.
[3] Gregory R. Andrews. Foundations of Multithreaded, Parallel, and Distributed Programming. Addison-Wesley, 2000.
[4] A. W. Roscoe. The Theory and Practice of Concurrency. Prentice Hall, 1997.
[5] A. W. Roscoe. Model-checking CSP. In A Classical Mind, Essays in Honour of C. A. R. Hoare. Prentice-Hall, 1994.
[6] Formal Systems (Europe) Ltd. Failures-Divergence Refinement—FDR 2 User Manual, 1997.
[7] C. A. R. Hoare. Communicating Sequential Processes. Prentice Hall, 1985.
[8] P. J. Courtois, F. Heymans, and D. L. Parnas. Concurrent control with "readers" and "writers". Communications of the ACM, 14(10):667–668, 1971.
[9] E. R. Olderog and C. A. R. Hoare. Specification-oriented semantics for communicating processes. Acta Informatica, 23(1):9–66, 1986.
[10] A. W. Roscoe. Revivals, stuckness and the hierarchy of CSP models. Journal of Logic and Algebraic Programming, 78(3):163–190, 2009.
[11] A. E. Lawrence. Triples. In Proceedings of Communicating Process Architectures, pages 157–184, 2004.
[12] C. J. Fidge. A formal definition of priority in CSP. ACM Transactions on Programming Languages and Systems, 15(4):681–705, 1993.
[13] Gavin Lowe. Probabilities and Priorities in Timed CSP. DPhil thesis, Oxford, 1993.
[14] Gavin Lowe. Probabilistic and prioritized models of Timed CSP. Theoretical Computer Science, 138:315–352, 1995.
[15] A. W. Roscoe. On the expressiveness of CSP. Available via http://web.comlab.ox.ac.uk//files/1383/complete(3).pdf, 2009.
[16] R. de Nicola and M. C. B. Hennessy. Testing equivalences for processes. Theoretical Computer Science, 34:83–133, 1984.
A. Derived Operational Semantics

The definition of the −↠ relation, and the operational semantic rules for the −→ relation, can be translated into the following defining rules for −↠.

  STOP −notOffer a↠ STOP,  for a ∈ Σ

  a → P −notOffer b↠ a → P,  for b ∈ Σ
  a → P −τ↠ aˇ → P
  aˇ → P −offer a↠ aˇ → P
  aˇ → P −a↠ P
  aˇ → P −notOffer b↠ aˇ → P,  for b ≠ a
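The rules above for STOP and prefix can be transcribed directly as a transition function. The following Python sketch uses my own term representation (not the paper's notation): ("prefix1", a, P) plays the role of the intermediate term aˇ → P, in which a is on offer.

```python
SIGMA = {"a", "b"}  # an assumed two-event alphabet

def steps(p):
    """The derived transitions of p, as a set of (action, successor) pairs."""
    if p[0] == "STOP":
        return {(("notOffer", x), p) for x in SIGMA}
    if p[0] == "prefix":                  # a -> P: not yet offering a
        _, a, cont = p
        return ({(("notOffer", x), p) for x in SIGMA} |
                {("tau", ("prefix1", a, cont))})
    if p[0] == "prefix1":                 # a is on offer
        _, a, cont = p
        return ({(a, cont), (("offer", a), p)} |
                {(("notOffer", x), p) for x in SIGMA - {a}})
```

For example, a → STOP first signals notOffer for every event, times out (τ) into the offering state, and from there can signal offer a, perform a, or signal notOffer b for b ≠ a, but can no longer signal notOffer a.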
if ready a then P else Q if ready a then P else Q if ready a then P else Q
ready a
−−− P notReady a
−−− Q
notOffer b
−−− if ready a then P else Q
a
for b ∈ Σ
τ
P −− P
P −− P
a
τ
P 2 Q −− P 2 Q
P 2 Q −− P ready a
notReady a
P −−− P
P −−− P
ready a
notReady a
P 2 Q −−− P 2 Q
P 2 Q −−− P 2 Q notOffer a
P −−− P
offer a
P −−− P offer a
notOffer a
Q −−− Q
P 2 Q −−− P 2 Q
notOffer a
P 2 Q −−− P 2 Q
P −a→ P′  implies  P Q −a→ P′
P −τ→ P′  implies  P Q −τ→ P′ Q
P −ready a→ P′  implies  P Q −ready a→ P′ Q
P −notReady a→ P′  implies  P Q −notReady a→ P′ Q
P −offer a→ P′  implies  P Q −offer a→ P′ Q
P −notOffer a→ P′  implies  P Q −notOffer a→ P′ Q

(together with the symmetric rules with the roles of P and Q swapped)
P −α→ P′  implies  P \ A −α→ P′ \ A,  for α ∈ (Σ − A) ∪ {τ} ∪ ready(Σ − A) ∪ notReady(Σ − A) ∪ offer(Σ − A) ∪ notOffer(Σ − A)

P −α→ P′  implies  P \ A −τ→ P′ \ A,  for α ∈ A ∪ ready A

P \ A −notOffer a→ P \ A,  for a ∈ A
P −α→ P′  implies  P A‖B Q −α→ P′ A‖B Q,  for α ∈ privateA ∪ {τ}

P −α→ P′ and Q −α→ Q′  implies  P A‖B Q −α→ P′ A‖B Q′,  for α ∈ syncA,B

P −ready b→ P′ and Q −offer b→ Q′  implies  P A‖B Q −ready b→ P′ A‖B Q′,  for b ∈ B

P −notReady b→ P′ and Q −notOffer b→ Q′  implies  P A‖B Q −τ→ P′ A‖B Q′,  for b ∈ B

P −notReady b→ P′ and Q −offer b→ Q′  implies  P A‖B Q −notReady b→ P′ A‖B Q′,  for b ∈ B

P A‖B Q −notOffer d→ P A‖B Q,  for d ∈ Σ − A − B

(together with the symmetric rules with the roles of P and Q swapped)
B. Congruence of the Operational Semantics

In this appendix, we prove some of the cases in the proof of Theorem 13: tr ∈ tracesR[[P]] iff P =tr⇒.

Hiding

We prove the case of hiding in Theorem 13 using the derived rules in Appendix A.

(⇒) Suppose tr ∈ tracesR[[P \ A]]. Then there exists some trP ∈ tracesR[[P]] such that trP ↾ (notReady A ∪ offer A) = ⟨⟩ and trP \ (A ∪ ready A) = tr \ notOffer A. By the inductive hypothesis, P =trP⇒, i.e., P −α1→ … −αn→ for some α1, …, αn such that trP = ⟨α1, …, αn⟩ \ {τ}. From the derived operational semantics rules, P \ A has the same transitions but with each αi ∈ A ∪ ready A replaced by a τ, i.e., transitions corresponding to the trace trP \ (A ∪ ready A). Further, using the third derived rule for hiding, arbitrary notOffer A self-loops can be added to the transitions, giving transitions corresponding to the trace tr. Hence P \ A =tr⇒.

(⇐) Suppose P \ A =tr⇒. Consider the transitions of P that lead to this trace. By consideration of the derived rules, we see that P =trP⇒ for some trace trP such that trP ↾ (notReady A ∪ offer A) = ⟨⟩ and trP \ (A ∪ ready A) = tr \ notOffer A. By the inductive hypothesis, trP ∈ tracesR[[P]]. Hence tr ∈ tracesR[[P \ A]].

Parallel Composition

We prove the case of parallel composition in Theorem 13 using the derived rules from Appendix A.

(⇒) Suppose tr ∈ tracesR[[P A‖B Q]]. Then there exist trP ∈ tracesR[[P]] and trQ ∈ tracesR[[Q]] such that trP ↾ ((Σ − A) ∪ offer(Σ − A) ∪ notOffer(Σ − A)) = ⟨⟩, trQ ↾ ((Σ − B) ∪ offer(Σ − B) ∪ notOffer(Σ − B)) = ⟨⟩, and (trP, trQ) −A‖B→ tr \ notOffer(Σ − A − B). By the inductive hypothesis, P =trP⇒ and Q =trQ⇒. So P −α1→ … −αn→ and Q −β1→ … −βm→ for some α1, …, αn, β1, …, βm such that trP = ⟨α1, …, αn⟩ \ {τ} and trQ = ⟨β1, …, βm⟩ \ {τ}. We then have that P A‖B Q =tr \ notOffer(Σ−A−B)⇒, since each event implied by (trP, trQ) −A‖B→ tr \ notOffer(Σ − A − B) has a corresponding transition implied by the operational semantics rules (formally, this is a case analysis over the clauses of −A‖B→, combined with a straightforward induction on m + n). Further, using the final derived rule for parallel composition, arbitrary notOffer(Σ − A − B) self-loops can be added to the transitions, giving transitions corresponding to the trace tr. Hence P A‖B Q =tr⇒.

(⇐) Suppose P A‖B Q =tr⇒. Then by item 3 of Lemma 9, P A‖B Q =tr \ notOffer(Σ−A−B)⇒. Consider the transitions of P and Q that lead to this trace according to the operational semantics rules, say P −α1→ … −αn→ and Q −β1→ … −βm→. Let trP = ⟨α1, …, αn⟩ \ {τ} and trQ = ⟨β1, …, βm⟩ \ {τ}; so P =trP⇒ and Q =trQ⇒. By the inductive hypothesis, trP ∈ tracesR[[P]] and trQ ∈ tracesR[[Q]]. Also, by consideration of the operational semantics rules, trP ↾ ((Σ − A) ∪ offer(Σ − A) ∪ notOffer(Σ − A)) = ⟨⟩ and trQ ↾ ((Σ − B) ∪ offer(Σ − B) ∪ notOffer(Σ − B)) = ⟨⟩. Further, (trP, trQ) −A‖B→ tr \ notOffer(Σ − A − B), since each transition implied by the operational semantics rules has a corresponding event implied by the definition of −A‖B→ (formally, this is a case analysis over the operational semantics rules, combined with a straightforward induction on m + n). Hence tr ∈ tracesR[[P A‖B Q]].
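The two conditions on trP used in the hiding case can be checked mechanically for finite traces. The following is a hypothetical Python sketch, not from the paper: ordinary events are represented as strings, and readiness/offer actions as tagged pairs such as `("ready", "a")`.

```python
def minus(tr, drop):
    """tr \\ drop: remove every action in the set drop from the trace."""
    return [x for x in tr if x not in drop]

def restrict(tr, keep):
    """tr |` keep: keep only the actions in the set keep."""
    return [x for x in tr if x in keep]

def hiding_condition(tr_p, tr, A):
    """Relate a trace tr_p of P to a candidate trace tr of P \\ A:
    tr_p |` (notReady A u offer A) must be empty, and
    tr_p \\ (A u ready A) must equal tr \\ notOffer A.
    (The arbitrary notOffer A self-loops of P \\ A are ignored here.)"""
    ready_A    = {("ready", a) for a in A}
    notReady_A = {("notReady", a) for a in A}
    offer_A    = {("offer", a) for a in A}
    notOffer_A = {("notOffer", a) for a in A}
    return (restrict(tr_p, notReady_A | offer_A) == [] and
            minus(tr_p, set(A) | ready_A) == minus(tr, notOffer_A))
```

For instance, trP = ⟨a, ready b, b⟩ is related to tr = ⟨a⟩ when b is hidden, but a trace containing offer b is related to no trace of P \ {b}.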
Communicating Process Architectures 2009
P.H. Welch et al. (Eds.)
IOS Press, 2009
© 2009 The authors and IOS Press. All rights reserved.
doi:10.3233/978-1-60750-065-0-349
Design Patterns for Communicating Systems with Deadline Propagation

Martin KORSGAARD and Sverre HENDSETH
Department of Engineering Cybernetics, Norwegian University of Science and Technology

Abstract. Toc is an experimental programming language based on occam that combines CSP-based concurrency with integrated specification of timing requirements. In contrast to occam's strict round-robin scheduling, the Toc scheduler is lazy and does not run a process unless there is a deadline associated with its execution. Channels propagate deadlines to dependent tasks. These differences from occam necessitate a different approach to programming, where a new concern is to avoid dependencies and conflicts between timing requirements. This paper introduces client-server design patterns for Toc that allow the programmer precise control of timing. It is shown that if these patterns are used, the deadline propagation graph can be used to provide sufficient conditions for schedulability. An alternative definition of deadlock in deadline-driven systems is given, and it is demonstrated how the use of the suggested design patterns allows the absence of deadlock to be proven in Toc programs. The introduction of extended rendezvous into Toc is shown to be essential to these patterns.

Keywords. real-time programming, Toc programming language, design patterns, deadlock analysis, scheduling analysis
Introduction

The success of a real-time system depends not only on its computational results, but also on the time at which those results are produced. Traditionally, real-time programs have been written in C, Java or Ada, where parallelism and timeliness rely on the concepts of tasks and priorities. Each timing requirement from the functional specification of the system is translated into a periodic task, and the priority of each task is set based on a combination of the task's period and importance. Tasks are implemented as separate threads running infinite loops, with a suitable delay statement at the end to enforce the task period. Shared-memory communication is the most common choice for communication between threads. With this approach, the temporal behaviour of the program is controlled indirectly, through the choice of priorities and synchronization mechanisms.

The occam programming language [1] has a concurrency model that only supports synchronous communication, and where parallelism is achieved using structured parallel constructs instead of threads. Control of timeliness is only possible with a prioritized parallel construct, which is insufficient for many real-time applications [2]. The occam-π language [3] improves on this by allowing control of absolute priorities. The occam process model is closely related to the CSP process algebra [4,5], and CSP is well suited for formal analysis of occam programs.

Toc [6,7] is an experimental programming language based on occam that allows the specification of timing requirements directly in the source code. The goal of Toc is to create a

1 Corresponding Author: Martin Korsgaard, Department of Engineering Cybernetics, O.S. Bragstads plass 2D, 7491 Trondheim, Norway. Tel.: +47 73 59 43 76; Fax: +47 73 59 45 99; E-mail: [email protected].
language where the temporal response is specified explicitly rather than indirectly, and where the programmer is forced to consider timing as a primary concern, rather than something to be solved when the computational part of the program is complete. A prototype compiler and run-time system has been developed.

In Toc, the timing requirements from the specification of the system can be expressed directly in source code. This is achieved with a dedicated deadline construct and synchronous channels that also propagate deadlines to dependent tasks. A consequence of deadline propagation is an implicit deadline inheritance scheme. The Toc scheduler is lazy in that it does not execute statements without an associated deadline. Server processes or other shared resources do not execute on their own, but must be driven by deadlines implicitly given by their access channels. The laziness means that temporal requirements cannot be ignored when programming, and tasks that traditionally would be implemented as deadline-less “background tasks” must now be given explicit deadlines.

occam programs use a strict, round-robin scheduler that closely matches CSP's principle of maximum progress. This is one reason why CSP is well suited to model occam programs. However, laziness is not closely matched by a principle of maximum progress, and that makes CSP-based analysis of Toc programs harder.

For all parallel systems that use synchronization there is a risk of deadlock. In occam, the client-server design paradigm can be used to avoid deadlocks by removing the possibility of a circular wait [8,9]. Here, each process acts as either a client or a server in each communication, and each communication is initiated by a request and terminated by a reply. If the following three criteria are met, then the system is deadlock-free:

1. Between a request to a server and the corresponding reply, a client may not communicate with any other process.
2. Between accepting a request from a client and the corresponding reply, a server may not accept requests from other clients, but may act as a client to other servers.
3. The client-server relation graph must be acyclic.

An alternative deadlock-free design pattern described in [8] is I/O-PAR, guaranteed when all processes proceed in a sequential cycle of computation followed by all I/O in parallel. The whole system must progress in this fashion, similar to bulk synchronous parallelism (BSP) [10]. It is also possible to use a model checker such as SPIN [11] or FDR2 [12] to prove the absence of deadlocks under more general conditions.

For some real-time systems, proving the absence of deadlocks is not sufficient, and a formal verification of schedulability is also required. Toc is scheduled using earliest deadline first (EDF), and for EDF systems, a set of independent, periodic tasks with deadline equal to period is schedulable if and only if:
    ∑i Ci / Ti ≤ 1    (1)
where Ci is the worst-case execution time (WCET) of task i, and Ti is the period of that task [13]. Equation 1 ignores overheads associated with scheduling.

Scheduling analysis is more complicated for task sets that are not independent. One effect that complicates analysis of dependent tasks is priority inversion [14], where a low-priority task holding a resource blocks a high-priority task that requires that resource. Priority inversions are unavoidable in systems where tasks share resources. An unbounded priority inversion, in contrast, is a serious error: it occurs when the low-priority task holding the resource required by the high-priority task is further preempted by intermediate-priority tasks, in which case the high-priority task may be delayed indefinitely. Solutions to this problem include the priority inheritance and priority ceiling algorithms, both defined in [15]. These algorithms are restricted to priority-based systems. Other systems can be scheduled
and analyzed using the Stack Resource Policy (SRP) [16], for example. SRP supports a wide range of communication and synchronization primitives, but is otherwise similar to the priority ceiling algorithm. Unfortunately, schedulability analysis of task sets that communicate synchronously using rendezvous is not well developed [17]. If deadlines may be shorter than periods, necessary and sufficient schedulability conditions can be found using the Processor Demand Criterion [18], but the computation is only practical if the relative start-times of all the tasks are known (a “complete task set”). If the relative start-times are not known (an “incomplete task set”), then the problem of determining schedulability is NP-hard.

To use these analysis methods one needs to know the WCET C of each task. This is far from trivial, and may be impossible if the code contains variable-length loops or recursion. Fortunately, programmers of real-time systems usually write code that can easily be shown to halt, reducing the problem to finding out when. Execution times vary with input data and the state of the system, so measurements of the execution time are not sufficient. The use of modern CPU features such as superscalar pipelines and caches further increases this variation. Despite these difficulties, bounds on execution times can still be found in specific cases using computerized tools such as aiT [19]. The process is computationally expensive [20] and subject to certain extra requirements, such as manual annotation of the maximum number of iterations of each loop [21]. aiT is also limited to fixed task sets of fixed-priority tasks.

This paper discusses how analysis of schedulability and deadlocks can be enabled in Toc by using a small set of design patterns. The paper begins with a description of the Toc timing mechanisms in Section 1 and the consequences of these mechanisms.
Section 2 describes deadlocks in deadline-propagating systems, and how they differ from deadlocks in systems based on the principle of maximum progress. Section 3 describes the suggested design patterns and how they are used to program a system in an intuitive manner that helps avoid conflicts between timing requirements. Section 4 explains how schedulability analysis can be done on systems that are designed using these design patterns. The paper ends with a discussion and some concluding remarks.

1. Temporal Semantics of Toc

The basic timing operator in Toc is TIME, which sets a deadline for a given block of code, with the additional property that the TIME construct is not allowed to terminate before the deadline. The deadline is a scheduling guideline, and there is nothing in the language to stop deadlines from being missed in an overloaded system. In contrast, the minimum execution time constraint is absolute and strictly enforced.

A task in Toc is defined as the process corresponding to a TIME construct. A periodic task with deadline equal to period is simply a TIME construct wrapped in a loop, and a periodic task with a relative deadline shorter than its period is written using two nested TIME constructs: the outer restricting the period while the inner sets the deadline. Toc handles the timing of consecutive task instances in a way that removes drift [6,7]. Sporadic tasks can be created as tasks triggered by a channel, by putting the TIME construct in sequence after the channel communication.

A TIME construct will drive the execution of all statements in its block with the given deadline whenever this deadline is the earliest in the system. If channel communications are part of that block, then the task will also drive the execution of the other end of those channels up to the point where the communications can proceed.
A task has stalled if it cannot drive a channel because the target process is waiting for a minimum execution time to elapse, but execution of the task will continue when the target becomes ready.
Figure 1. Order of execution when a task communicates with a passive process.
Figure 2. Use of Extended Rendezvous. (a) x will not be passed on due to laziness. (b) x will be forwarded with the deadline inherited from input.
The definition of laziness in Toc is that primitive processes (assignment, input/output, SKIP and STOP) not driven by a deadline will not be executed, even if the system is otherwise idle. However, constructed processes will be evaluated even without a deadline in cases where a TIME construct can be reached without first executing primitive processes. Data from the compiler's parallel-usage-rules checker is used to determine which process to execute to make the other end of a given channel ready. An example of the order of execution is given in Figure 1. Here, the process P.lazy() will not be executed because it is never given a deadline.

1.1. Timed Alternation

In Toc, alternations are not resolved arbitrarily; the choice of alternative is resolved by analyzing the timing requirements of the system. Choice resolution is straightforward when there is no deadline driving the ALT itself, so that the only reason the process with the ALT is executing at all is that it is driven by one of its guard channels, which by the definition of EDF must have the earliest deadline of all ready tasks in the system at that time. By always selecting the driving alternative, it is ensured that the choice resolution always benefits the earliest-deadline task in the system. This usage corresponds to the ALT being a passive server with no timing requirements of its own, which handles requests from clients using the deadlines of the most urgent client. Other cases cannot be resolved as easily: if the ALT itself has a deadline, if the ALT is driven by a channel that is not part of the ALT itself, or if the driving channel is blocked by a Boolean guard, then the optimal choice resolution with respect to timing cannot easily be found. If the design patterns presented in this paper are used, then none of these situations will arise, and therefore these issues are not discussed here.

1.2. Use of Extended Rendezvous

A consequence of laziness is that driving a single channel communication with another process cannot force execution of the other process beyond what is required to complete the communication. One consequence of this is that it is difficult to write multiplexers or simple data-forwarding processes. This is illustrated in Figure 2a: if something arrives on input then it will not immediately be passed on to output, because there is no deadline driving the execution of the second communication. The output will only happen when driven by the deadline of the next input.
A workaround is to add a deadline to the second communication, but this is awkward, and the choice of deadline will either be arbitrary or a repeated specification of an existing deadline, neither of which is desirable. Another solution, though ineffective, is to avoid forwarding data altogether, but this would put severe restrictions on program organization.

Extended inputs [3] fix the problem. An extended input is a language extension from occam-π that allows the programmer to force the execution of a process “in the middle” of a channel communication, after the processes have rendezvoused but before they are released. Syntactically, this during-process is indented below the input, and a double question mark is used as the input operator. In Toc, the consequence is that the deadline used to drive the communication will also drive the during-process. A correct way to forward a signal or data is shown in Figure 2b.

2. Deadlock in Timed Systems

A deadlock in the classical sense is a state where a system can make no progress because all the processes in the system are waiting for one of the other processes. For a system to deadlock in this way there has to be at least one chain of circular waits.

In Toc, when the earliest-deadline task requires communication over a channel, it passes its deadline to the process that will communicate next on the other side of that channel. If this process in turn requires communication with another process, the deadline will propagate to this other process as well. Thus a task that requires communication over a channel is never blocked; it simply defers its execution to the task that it is waiting for. This can only fail if a process, through others, passes a deadline onto itself. In that case the process cannot communicate on one channel before it has communicated on another, resulting in a deadlock.
This leads to the following alternative definition of deadlock in Toc:

Definition 1 (Deadlock) A system has deadlocked if and only if there exists a process that has propagated its deadline onto itself.

One advantage of this definition is that it takes laziness into account. Deadlock-avoidance methods that ignore laziness are more conservative, because it is possible to construct programs that would otherwise deadlock but do not, because the unsafe code is lazy and never executes. In occam one can avoid deadlocks with the client-server paradigm if clients and servers follow certain rules and the client-server relation graph is acyclic. This ensures that there are no circular waits. In Toc, one can use the same paradigm to avoid deadlocks by ensuring that there is no circular propagation of deadlines.

In Toc there can also be another type of deadlock-like state, the slumber, where a system or subsystem makes no progress because it has no deadlines. Slumber occurs, for example, in systems based on interacting sporadic tasks, where sporadic tasks are triggered by other tasks. If a trigger is not sent, perhaps because of an incorrectly written conditional, the system may halt as all processes wait for some other process to trigger them. Even though the result is similar to a deadlock, a system doing nothing, the cause of a slumber is laziness and incorrect propagation of deadlines rather than a circular wait.

3. Design Patterns

By restricting the way a language is used, design patterns can make a program more readable and help avoid common errors [22]. The goals of the design patterns for Toc presented in this paper are:
Figure 3. Diagram symbols for design patterns. An arrow points in the direction of deadline propagation, not data flow.
Figure 4. A task driving a task distorts timing.
1. To be able to reason about the absence of deadlocks in a system.
2. To allow for schedulability analysis.
3. To provide a flexible way of implementing the desired real-time behaviour of a system.

The design patterns should be designed in such a way that knowledge of a program at the pattern level is sufficient for both deadlock and schedulability analysis, so that the analysis does not need to take further details of the source code into account. The design patterns will be compatible with the client-server paradigm, so that this can be used to avoid deadlocks. Schedulability analysis is done by analyzing the timing response of each pattern separately, and then using the deadline propagation graph of the system at the pattern level to combine these timing responses into sufficient conditions for schedulability.

3.1. The Task

A task is any process with a TIME construct, and is depicted with a double circle as shown in Figure 3. All functionality of the system must be within the responsibility of a task. Tasks are not allowed to communicate directly with one another, because this can lead to distorted timing. This is illustrated by the two tasks in Figure 4: T1 has a deadline and period of 10 ms and T2 of 20 ms, which means that T1 wants to communicate over the channel once in every 10 ms period, while T2 can only communicate half as often. A task's minimum period is absolute, so T1, which was intended to be a 10 ms period task, will often have a period of 20 ms, missing its deadline while stalled waiting for T2. Moreover, T2 will often execute because it is driven by T1 and not because of its own deadline, in which case it is given a higher scheduling priority than the specification indicates. The result of direct communication in this case is that two simple, independently specified timing requirements become both dependent and distorted.

3.2. The Passive Server

A process is passive if it has no deadline of its own. Thus a passive server is driven exclusively by deadlines inherited from clients through its access channels. A passive server is depicted as a single circle, and a communication between a task and a server is shown as an arrow
from the driving client to the server. This is usually, but not necessarily, the same as the data direction. The protocol for an access to a server may involve one or more communications. The first communication is called the request, and any later communications are called replies. To follow the client-server paradigm, if the protocol involves more than one communication, the client cannot communicate between the request and the last reply, and the server may not accept requests from other clients in this period. Again, the server is allowed to perform communications with other servers as a client, propagating the deadline further on.

The possibility of preemption means that the temporal behaviour of tasks using the same server is not entirely decoupled: if a task that uses a server preempts another task holding the same server, then the server may not be in a state where it can immediately handle the new request. In this case, the deadline of the new request must also drive the handling of the old request, adding to the execution requirements of the earlier-deadline task. This is equivalent to priority inversion in preemptive, priority-based systems. The preempted task executes its critical section with the earlier task's deadline, resolving the situation in a manner analogous to priority inheritance.

3.3. Sporadic Tasks and Events

A passive server is useful for sharing data between tasks, but because a server executes with the client's deadline, a different approach is needed to start a task that is to execute with its own deadline. A communication that starts a new task will be called a trigger; the process transmitting the trigger is called the source, and the sporadic task being triggered is called the target.

A simple way to program a sporadic task would be to have a TIME construct follow in sequence after a channel communication. When another task drives communication on the channel, the TIME construct will be discovered and the sporadic task will be started. However, this approach is problematic, for reasons similar to why tasks should not communicate directly. A target sporadic task that has been started will not accept another trigger before its deadline has elapsed; i.e., if the target task has a relative deadline D, then the source must wait at least D between each trigger to avoid being stalled. If a new trigger arrives too early, the source will drive the target and eventually stall.

This illustrates the need for events that can be used to start a sporadic task without the side effects of stalling the source or driving the target. Events must appear to be synchronous when the target is ready and asynchronous when the target is not: if the target is waiting for a trigger, it must be triggered synchronously by an incoming event, but if the target is busy then incoming events must not require synchronization. The event pattern is a way of implementing this special form of communication in Toc. An event pattern consists of an event process between the source and the target, where the communication between the event process and the target must follow certain rules. An event process is depicted in diagrams as an arrow through a square box.

An event process forwards a trigger from the source to the target when the target is ready, and stops it from being forwarded when the target is busy. The event process must therefore know the state of the target. After a trigger is sent to the target, the target can be assumed to be busy, but the target must explicitly communicate to the event process when it becomes ready. The event and target processes must together follow two rules:

1. After the target notifies the event process that it is ready, the target cannot communicate with other processes before being triggered by a new event.
2. A trigger from a source should never drive the target to become ready. This implies that sending a trigger will never stall the source.
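The two rules above can be made concrete with a small state machine. The following is a hypothetical Python sketch of the simplest, discarding event process (the names are illustrative, not Toc syntax; a buffering variant would record the pending trigger instead of dropping it):

```python
class Event:
    """Sketch of a discarding event process: a trigger is forwarded to the
    target only when the target has declared itself ready; otherwise the
    trigger is dropped, so the source is never stalled and never drives
    the target (rule 2)."""

    def __init__(self):
        self.target_ready = False

    def notify_ready(self):
        # Rule 1: after this call, the target communicates with no one
        # until it is triggered by a new event.
        self.target_ready = True

    def trigger(self, start_target):
        # Synchronous hand-over if the target is waiting; otherwise the
        # event is discarded and the source continues immediately.
        if self.target_ready:
            self.target_ready = False
            start_target()
            return True
        return False  # target busy: trigger discarded
```

Used in sequence: a trigger sent before `notify_ready()` is discarded, while a trigger sent after it starts the target exactly once.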
Because of these rules, a deadline from the source will never propagate beyond the target, so the event process is never involved in a circular propagation of a deadline. This in turn means that an event can never be the source of a deadlock in the super-system, even if communications with the event process make the super-system cyclic. It is therefore safe to disregard events when using the client-server paradigm to avoid deadlocks, or equivalently, to model an event as a passive server to both the source and the target. This sets an event apart from other processes, which is emphasized by the use of a square diagram symbol instead of a circle. There are many possible designs for an event process: it may discard the input when the target is not ready, or it may buffer the input. It may have multiple inputs or outputs, and it may also hold an internal state that determines which events to output.

4. Schedulability Analysis

Traditionally, threads execute concurrently and lock resources when required, so that other threads are blocked when trying to access those resources. In Toc, on the other hand, the earliest-deadline task is always driving execution, but the code being executed may belong to other tasks. A task is never blocked. This means that the schedulability analysis can assume that tasks are independent, but the WCET of each task will then have to include the execution time of all the parts of dependent processes it may also have to execute.

All sporadic tasks will here be assumed to execute continuously with their minimum periods. This is a conservative assumption that enables schedulability analysis at a pure pattern level, without having to analyze actual source code to find out which sporadic tasks may be active at the same time. To find the worst-case execution time, each task will also be assumed to be preempted by all other tasks as many times as possible. The analysis will also be simplified by assuming that D = T for all tasks, which in code terms means that there are no nested TIME constructs. Because no assumption can be made about the relative start-times of tasks in Toc, this restriction is necessary to make the schedulability analysis problem solvable in polynomial time. In this setting, the schedulability criterion is simply the standard EDF criterion given in Equation 1. The relative deadlines Di = Ti are read directly from the code, so the main problem of the schedulability analysis is to find Ci: the worst-case execution time of each task, including execution of other tasks due to deadline propagation. This variable is made up of three components:

1. The base WCET when not preempting and not being preempted.
2. The WCET penalty of being preempted by others.
3. The WCET penalty of preempting other processes.

Dependencies between processes will be analyzed using the deadline propagation graph of the system. An example graph is given in Figure 5.

4.1. WCET of a Server or Event

The execution time of a server is complex due to preemption. If a server protocol involves at least one reply, then a client waits for the server in one of two phases: waiting for the server to accept its request, and waiting for the server up to the last reply. In the first phase the server is ready to accept any client, and may switch clients if preempted, but in the second phase it will complete the communication with the active client even if preempted (executing with the deadline of the preempting task). If the server handles the request by an extended rendezvous, then this is equivalent to a second phase even if the server has no reply, as the server cannot switch clients during an extended rendezvous.
Figure 5. Example deadline propagation graph with tasks and servers.
Figure 6. Timing of passive server
The WCETs for the two phases of a protocol with one reply are illustrated in Figure 6. For a server $s$, the WCET of the server in the first phase is denoted $C_{s,request}$ and consists of lazy post-processing of the previous access or initialization of the server. The WCET of the server in the second phase is denoted $C_{s,reply}$, which is zero if there is no second phase. $C_{s,reply}$ consists in part of the WCET of code local to the process of the server, $\tilde{C}_{s,reply}$, but also of execution used by other servers accessed by $s$. If $acc_s$ is the set of servers accessed directly by $s$, then

$$C_{s,reply} = \tilde{C}_{s,reply} + \sum_{s' \in acc_s} \left( C_{s',request} + C_{s',reply} \right) \qquad (2)$$

This is a recursive formula over the set of servers. As long as the deadline-propagation network is acyclic, the recursion will always terminate, as there will always be a server at the bottom where $acc_s = \emptyset$. It is assumed that each server accesses another server directly only once for each of its own requests, but the calculations can easily be adjusted to accommodate multiple accesses by simply adding the WCET of those servers multiple times. For a server $s$, the maximum WCET of any client in the client process between the request and the last reply is important when calculating the effect of preemptions, and is denoted $C_{s,client}$. This variable is also zero if there are no replies.
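The recursion of Equation 2 can be sketched in a few lines. This is an illustrative sketch only: the server names, the `acc` graph and all WCET numbers below are invented, not taken from the paper.

```python
# Hedged sketch of the reply-phase WCET recursion of Equation 2:
# C_reply(s) = local reply-phase WCET of s, plus (C_request + C_reply)
# of every server that s accesses directly.

def c_reply(s, local_reply, c_request, acc):
    """local_reply[s]: WCET of reply-phase code local to s;
    c_request[s]: first-phase WCET of s; acc[s]: servers accessed by s.
    Terminates because the deadline-propagation graph is acyclic."""
    return local_reply[s] + sum(
        c_request[t] + c_reply(t, local_reply, c_request, acc)
        for t in acc[s])

# Invented two-server example: s1 accesses s2; s2 is a bottom-level server.
local_reply = {"s1": 2, "s2": 1}
c_request = {"s1": 1, "s2": 1}
acc = {"s1": ["s2"], "s2": []}
print(c_reply("s1", local_reply, c_request, acc))  # 2 + (1 + 1) = 4
```

The recursion bottoms out at servers with an empty access set, mirroring the termination argument in the text.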
From a temporal point of view an event process behaves like a server to both the source and the target. As with a server, a source that sends a trigger to an event may up to a certain point be preempted by another source of that event. The same applies to a target when it notifies the event that it is ready. When doing the schedulability analysis one can therefore treat an event as a server.

4.2. Schedulability Analysis of the System

A client can be preempted by multiple other clients, and by each of these clients multiple times. As seen in Figure 6, a server cannot change clients while in phase two. Therefore, if the client is preempted in the second phase, its work will be driven to completion with the deadline of the preempting task. In contrast, if a client is preempted in the first phase, its work is in effect given away to the preempting task, and it will have to execute $C_{s,request}$ over again to make the server ready after the preempting process has finished. The worst case is that the client has executed almost all of $C_{s,request}$ every time it is preempted, in which case each preemption leads to another $C_{s,request}$ of work for the preempted client. From the preempting client's point of view, preempting a task in its first phase in a server may save up to $C_{s,request}$ of execution, because of the work already done by the task being preempted. On the other hand, if preempting in the second phase, it will have to drive the current client to finish before the server becomes ready. This may also include execution of code in the process of the preempted client. The worst case is that the current client has just begun the second phase, in which case preempting it leads to $C_{s,client} + C_{s,reply}$ of extra work for the preempting client. For any two tasks $A$ and $B$, where $D_A < D_B$, the deadline propagation graph can be used to identify the servers in the network where $A$ will have to drive $B$ if $A$ preempts $B$.
Such a server is called a critical server for the pair of tasks $(A, B)$.

Definition 2 (Critical Server) A critical server for the pair of tasks $(A, B)$ is a server that can be accessed directly or indirectly by both $A$ and $B$, but not only through the same server.

If a server $s_2$ is accessible by two tasks only through server $s_1$, then preemption must necessarily happen in $s_1$, because both tasks cannot at the same time hold $s_1$, which is necessary to access $s_2$. In Figure 5, servers 1, 2 and 3 are critical servers for $(A, B)$, while server 3 is the only critical server for both $(A, C)$ and $(B, C)$. The set of critical servers for any two tasks $A$ and $B$ is denoted $crit(A, B)$.

If task $A$ is to preempt task $B$, then $A$ must start after $B$ and have an earlier absolute deadline. For tasks $A$ and $B$, where $D_A < D_B$, this means that $A$ can preempt $B$ at most

$$\left\lfloor \frac{D_B}{D_A} \right\rfloor \qquad (3)$$

times for each instance of $B$ under EDF. In other words, Equation 3 is the number of consecutive instances of $A$ that fit completely within a single instance of $B$. Note that a single instance of $A$ can preempt $B$ at most once. $\tilde{C}_X$ is the execution time required by code local to $X$; the max function returns the largest numerical value of a set. $T$ is the set of tasks. With these definitions, $C_A$ can be written as:
$$
\begin{aligned}
C_A = {} & \tilde{C}_A + \sum_{s \in acc_A} \left( C_{s,request} + C_{s,reply} \right) \\
& + \sum_{X \in T,\; D_X < D_A} \left\lfloor \frac{D_A}{D_X} \right\rfloor \max_{s \in crit(A,X)} C_{s,request} \\
& + \sum_{X \in T,\; D_X > D_A} \max_{s \in crit(A,X)} \left( C_{s,client} + C_{s,reply} \right)
\end{aligned}
\qquad (4)
$$

The first line of the equation is the amount of execution required by the task process itself, plus the execution required by mandatory accesses to other servers. The second line is the sum of the worst-case penalties for being preempted by a given task, multiplied by Equation 3, which is the maximum number of preemptions by that task. The third line is the sum of the worst-case costs of preempting each other task that may be preempted. Equation 4 can be extended to include the scheduling overhead associated with preemptions, as the maximum number of preemptions can be found using Expression 3. When the $C$ values of all the tasks have been found, the system can be tested for schedulability with Equation 1.
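The whole analysis can be prototyped in a few lines. This is a hedged sketch, not the paper's implementation: all task and server names, deadlines and WCET numbers are invented, and the final utilization check stands in for the standard EDF criterion the text refers to as Equation 1.

```python
# Sketch of Equation 4 (worst-case execution time C_A including
# preemption penalties), followed by the EDF utilization test.

def worst_case_C(A, D, local, acc, crit, c_request, c_reply, c_client):
    # Line 1: local execution plus mandatory accesses to other servers.
    base = local[A] + sum(c_request[s] + c_reply[s] for s in acc[A])
    # Line 2: being preempted by each shorter-deadline task X, at most
    # floor(D_A / D_X) times (Equation 3), each time re-executing the
    # worst C_request over the critical servers for (A, X).
    preempted_by = sum(
        (D[A] // D[X]) * max(c_request[s] for s in crit[(A, X)])
        for X in D if D[X] < D[A] and crit.get((A, X)))
    # Line 3: driving each longer-deadline task X to finish, once.
    preempting = sum(
        max(c_client[s] + c_reply[s] for s in crit[(A, X)])
        for X in D if D[X] > D[A] and crit.get((A, X)))
    return base + preempted_by + preempting

def edf_schedulable(D, C):
    return sum(C[t] / D[t] for t in D) <= 1.0  # Equation 1 with D = T

# Invented two-task, one-server example.
D = {"A": 10, "B": 25}                 # relative deadlines (= periods)
local = {"A": 2, "B": 3}               # local WCETs (C~_X)
acc = {"A": ["s1"], "B": ["s1"]}       # servers accessed directly
crit = {("A", "B"): ["s1"], ("B", "A"): ["s1"]}
c_request, c_reply, c_client = {"s1": 1}, {"s1": 2}, {"s1": 1}

C = {t: worst_case_C(t, D, local, acc, crit, c_request, c_reply, c_client)
     for t in D}
print(C)                      # {'A': 8, 'B': 8}
print(edf_schedulable(D, C))  # 8/10 + 8/25 = 1.12 -> False
```

Even this tiny system is rejected by the test, illustrating how the preemption penalties inflate the effective WCETs and make the analysis conservative.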
5. Discussion

Schedulability analysis in systems of communicating processes is not well developed. This paper presents an algorithm for finding sufficient schedulability conditions for Toc programs. The analysis presented here is conservative, so dimensioning the processing power of a system based on this analysis may lead to low utilization. The analysis has not been tested on an actual system, and it remains to be seen just how conservative it is in practice. Nevertheless, this paper shows that, subject to certain restrictions, schedulability analysis is possible in principle for systems of deadline-driven communicating processes. Also, in some real-time applications, proof of schedulability is more important than high utilization.

Sporadic tasks are common when programming in Toc, but rare in traditional real-time programming, where they are mostly used to model error handling. The schedulability analysis in this paper is developed on top of the traditional model, where sporadic tasks are assumed to execute continuously with their relative deadlines as periods. This is a conservative assumption that is particularly unfortunate for systems largely based on sporadic tasks.

Many legal constructs in Toc programs lead to conflicts between timing requirements. This is for example the case when using an ALT within a TIME, using a single channel to communicate between tasks, or triggering sporadic tasks without an event-like formalism. It can be argued that the language has far more expressive power than can be safely used. Some of this is due to the language being syntactically based on occam, while being semantically very different. Indeed, the deadline-driven semantics of Toc and the strict round-robin semantics of occam may be too different to be supported by the same basic statements.
However, the deadline-driven semantics of Toc is based on occam because occam is the existing language that best supports the concepts of parallelism and synchronous communication needed by Toc. Extended rendezvous is another occam concept that has proven immensely useful in Toc; the use of extended rendezvous to manage deadlines cannot easily be accomplished by other means. The clear separation of primitive and constructed processes that exists in occam is also the basis for Toc's definition of laziness, and it is less clear how laziness could be defined in a C-like language, for example.
6. Conclusions and Future Work

This paper has introduced design patterns for communicating between tasks and for triggering sporadic tasks. The patterns allow dependencies and conflicts between timing requirements to be avoided. However, whether they provide enough expressive power to design general systems, or whether they scale well to larger systems, is uncertain, as Toc is still an experimental language at an early stage. Composition of patterns with respect to timing requirements has also not been studied, and is a potentially interesting subject for future work.

When programming communicating systems with deadline propagation and laziness, all functionality of the system must be implemented with an explicit timing requirement. While the principle of timing requirements on all calculations is intuitively sound, it leads to the discovery of requirements that traditionally would have gone unstated. Some of these requirements occur in the form of new sporadic tasks. Sporadic tasks therefore occur more frequently when programming in Toc than in traditional real-time programming, but their use leads to conservative schedulability analysis. Developing a framework for better specification and handling of sporadic tasks in schedulability analysis represents a potential for significant improvement.

The schedulability analysis presented here is based on the traditional approach, where the analysis is NP-hard unless D = T for tasks whose relative start-times are not known. Because of the ability to trigger tasks in Toc, the relative start-times of tasks are generally not known. Being able to concisely specify tasks with deadlines shorter than their periods is a useful property of Toc, and the schedulability analysis approach would be better if it also supported this. Consistent use of the design patterns to achieve decoupling of timing requirements limits the ways the language can be used.
Toc has inherited occam's process statements almost unaltered, while having temporal semantics that are entirely different. A re-evaluation of Toc's statement model to embody these design patterns in the language may reduce the need for overly restrictive design rules. The lack of a formal framework for analyzing Toc programs is also unfortunate. CSP can be used directly to analyze occam code, because CSP's principle of maximum progress maps to occam's strict round-robin scheduler. Although Toc is based on occam, the fact that scheduling is lazy rather than strict means that there is no direct mapping between Toc code and CSP. Developing a way to enable formal analysis of deadline-driven systems is an important objective for future work.

References

[1] SGS-THOMSON Microelectronics Limited. occam® 2.1 Reference Manual, 1995.
[2] Da Qing Zhang, Carlo Cecati, and Enzo Chiricozzi. Some practical issues of the transputer based real-time systems. In Proceedings of the 1992 International Conference on Industrial Electronics, Control, Instrumentation and Automation, pages 1403–1407. IEEE, 1992.
[3] Peter H. Welch and Frederick R. M. Barnes. Communicating mobile processes: introducing occam-pi. In A. E. Abdallah, C. B. Jones, and J. W. Sanders, editors, 25 Years of CSP, volume 3525 of Lecture Notes in Computer Science, pages 175–210. Springer-Verlag, April 2005.
[4] C. A. R. Hoare. Communicating sequential processes. Communications of the ACM, 21:666–677, 1978.
[5] A. William Roscoe. The Theory and Practice of Concurrency. Prentice Hall Europe, Hertfordshire, England, 1998.
[6] Martin Korsgaard and Sverre Hendseth. Combining EDF scheduling with occam using the Toc programming language. In Alistair A. McEwan, Wilson Ifill, and Peter H. Welch, editors, Communicating Process Architectures 2008, pages 339–348. IOS Press, September 2008.
[7] Martin Korsgaard, Sverre Hendseth, and Amund Skavhaug.
Improving real-time software quality by direct specification of timing requirements. In Proceedings of the 35th Euromicro SEAA Conference. IEEE Computer Society, 2009.
[8] Peter H. Welch, George Justo, and Colin Willcock. High-level paradigms for deadlock-free high-performance systems. In Transputer Applications and Systems '93: Proceedings of the 1993 World Transputer Congress, 20–22 September 1993, Aachen, Germany, pages 981–1004. IOS Press, 1993.
[9] Jeremy M. R. Martin and Peter H. Welch. A design strategy for deadlock-free concurrent systems. Transputer Communications, 3(4), June 1997.
[10] D. B. Skillicorn, Jonathan M. D. Hill, and W. F. McColl. Questions and answers about BSP, November 1996. Oxford University Computing Laboratory.
[11] Gerard J. Holzmann. The SPIN Model Checker: Primer and Reference Manual. Addison-Wesley Professional, 2003.
[12] Formal Systems (Europe) Ltd. Failures-Divergence Refinement User Manual, June 2005. Web page: www.fsel.com.
[13] C. L. Liu and James W. Layland. Scheduling algorithms for multiprogramming in a hard-real-time environment. Journal of the ACM, 20(1):46–61, 1973.
[14] Butler W. Lampson and David L. Redell. Experience with processes and monitors in Mesa. Communications of the ACM, 23(2):105–117, February 1980.
[15] Lui Sha, Ragunathan Rajkumar, and John P. Lehoczky. Priority inheritance protocols: an approach to real-time synchronization. IEEE Transactions on Computers, 39(9):1175–1185, September 1990.
[16] Theodore P. Baker. A stack-based resource allocation policy for realtime processes. In Proceedings of the 11th Real-Time Systems Symposium, pages 191–200, December 1990.
[17] Alan Burns and Andy J. Wellings. Synchronous sessions and fixed priority scheduling. Journal of Systems Architecture, 44:107–118, 1996.
[18] Sanjoy K. Baruah, Rodney R. Howell, and Louis E. Rosier. Algorithms and complexity concerning the preemptive scheduling of periodic, real-time tasks on one processor. Real-Time Systems, 2(4):301–324, 1990.
[19] Reinhold Heckmann and Christian Ferdinand. Worst case execution time prediction by static program analysis. In Proceedings of the 18th International Parallel and Distributed Processing Symposium, page 125. IEEE, April 2004.
[20] Jean Souyris, Erwan Le Pavec, Guillaume Himbert, Victor Jégu, Guillaume Borios, and Reinhold Heckmann. Computing the worst-case execution time of an avionics program by abstract interpretation. In Proceedings of the 5th International Workshop on Worst-Case Execution Time (WCET) Analysis, pages 21–24. IEEE Computer Society, 2005.
[21] Reinhard Wilhelm et al. The worst-case execution-time problem—overview of methods and survey of tools. ACM Transactions on Embedded Computing Systems, 7(3):1–53, 2008.
[22] Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides. Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley, 1995.
Communicating Process Architectures 2009 P.H. Welch et al. (Eds.) IOS Press, 2009 © 2009 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-60750-065-0-363
363
JCSP Agents-Based Service Discovery for Pervasive Computing

Anna KOSEK a, Jon KERRIDGE a, Aly SYED b and Alistair ARMITAGE a
a School of Computing, Edinburgh Napier University, Edinburgh, EH10 5DT, UK
b NXP Semiconductors Research, Eindhoven, The Netherlands

Abstract. Device and service discovery is a very important topic when considering pervasive environments. The discovery mechanism is required to work in networks with dynamic topology and on limited software, and be able to accept different device descriptions. This paper presents a service and device discovery mechanism using JCSP agents and the JCSP network package jcsp.net2.

Keywords. device discovery, service discovery, CSP, JCSP, agents, jcsp.net2.
Introduction

Ubiquitous computing, also known as pervasive computing, was first described by Weiser in 1991. In Weiser's vision, physically available devices will eventually become invisible to the user [1]. The term ubiquitous implies that devices in the environment will finally become so pervasive that they are hardly noticed [2]. Devices in pervasive systems come in various sizes and have different functions and capabilities, offering different services to the environment. A smart space is a pervasive computing environment in which available computing devices collaborate with each other to assist humans. A smart space will have to provide functions to discover all the devices, learn about the different capabilities and services provided by them, couple appropriate devices to perform some tasks, and enable communication between them. As functionalities in a pervasive system are distributed in the environment, it is desirable to have device and service discovery also distributed, so as not to have a central device that does service discovery and is a single point of failure. Achieving distributed device and service discovery is a challenging task. When considering device and service discovery, all the devices in the smart space are treated as equal, so all of them will have to have mechanisms to:

• discover other devices,
• gather information about available capabilities and services,
• connect to appropriate devices and perform tasks.
The primary goal of a distributed pervasive system is to perform tasks assigned by its user by exploiting resources and services available in the environment [3]. In many cases, performing a task requires more than one device, or involves choosing the service or device to use according to some criteria. Service discovery is a very important topic in the pervasive computing domain. Service discovery should be automatic and flexible to users' needs, to make the system as pervasive as possible. Service discovery solutions like Jini [4, 5] and Salutation [4] operate on centralized or semi-centralized architectures. Jini, a Java-based service discovery technology, uses a central Jini Lookup Service, where all available services are registered [5]. The Salutation approach to
364
A. Kosek et al. / JCSP Agents-Based Service Discovery for Pervasive Computing
service discovery includes service brokers, called Salutation Managers (SLMs) [4]. All available services are registered in SLMs, and clients query SLMs when they require a service. These approaches are suitable for fixed or partially fixed networks, where the existence of at least one central device to keep a repository is guaranteed. If the central device fails, service discovery is impossible. These approaches assume that the network is stable and communication is reliable [3]. In pervasive systems, the network is dynamic; devices are mobile and can appear in or disappear from the space at any time. It is therefore undesirable for devices to rely on the existence of other devices to register and advertise services. Service discovery in pervasive computing requires a distributed approach [3]. One option is Konark, a delivery and service discovery protocol developed by the University of Florida [6]. Its service discovery is distributed: all devices run a small version of a Web Service, using SOAP messages to request services and exchange information [6]. This makes it more suitable for pervasive systems, but it is undesirable for resource-constrained devices, because all devices are required to run a Web Service and process XML-based SOAP messages. Moreover, to get information from a Web Service, the structure of the data stored in its database has to be known, so the exact format of the available data must be known before getting information from a device in a Konark system. A fixed format of service description presented by a Web Service can be a disadvantage when considering a pervasive system with extendible services. Universal Plug and Play (UPnP) is another well-known standard used for service discovery [4]. UPnP uses the Simple Service Discovery Protocol (SSDP), which is based on multicast. A UPnP network consists of Control Points and Devices; Devices can be controlled by one or many Control Points.
Control Points are responsible for discovering new devices on the network and sending a multicast (SSDP) message requesting a service. This approach is similar to a broadcast-based one and generates overhead on the network, as many devices are involved in processing the search request. The approach to service discovery presented in this paper is not based on broadcast or multicast and, unlike the UPnP model, allows every device to initiate the service discovery mechanism. In pervasive environments devices should act autonomously when discovering services. Moreover, the discovery mechanism should be adaptive, to meet the requirements of highly dynamic environments [3]. Scalability is a very important requirement for a pervasive infrastructure [7]. This paper presents two discovery mechanisms: device discovery and service discovery. The device discovery is very simple and based on broadcasting (explained in detail in Section 3). The main focus of this paper is an agent-based approach to discovering services in highly dynamic networks of devices. The service discovery described is distributed and scalable. The presented mechanism is based on passing a search message between devices, so unlike a broadcast-based approach it works without utilizing significant network bandwidth. The usability of JCSP [12, 15] for pervasive environments was described in [8]. The experiment described in that paper focuses on a dynamic connection capability between devices in a smart space. The system described was equipped with very simple service and device discovery. We build on this earlier work, using the same device discovery mechanism, while improving service discovery. The service discovery presented in [8] operated only on a fixed number of device types and was broadcast-based. The mechanism presented in this paper is not broadcast-based, is more flexible, allows customized searches, and can accept and store any type of service description offered by a device.
The key idea of the presented service discovery technique is to send a “messenger” to gather all information and return it to the requesting device. The system uses the same mechanism to gather information about devices and propagate service descriptions to other devices.
This paper presents a CSP-based system for device and service discovery in pervasive adaptive systems. Section 1 presents the jcsp.net2 package developed by Chalmers [9] and its usability for device discovery; Section 2 describes the software structure used for service discovery; Section 3 shows the proposed architecture and data structures used in the system; and Section 4 provides a summary of this work.

1. Device Discovery Using jcsp.net2

Devices described in this paper communicate using TCP/IP protocols and are provided with unique IP addresses. Every device is also called a node on the network, because it is represented by a single IP address. To discover a device and connect to it, the application needs to find its IP address. In this system, device discovery is therefore narrowed to IP address discovery. When the IP address is determined, a JCSP network connection can be established using a protocol from jcsp.net2 [9]. To make a JCSP network connection between two nodes, only the IP address and the port number are needed. As the port number can be fixed, the system needs only the IP address to connect to a device, and this information can be provided by the proposed device discovery mechanism. A description of the device discovery mechanism for pervasive environments, as in [8], is provided here for completeness.
Figure 1. The device architecture.
The device discovery mechanism on every device is connected to the Main Process (Figure 1). The Main Process represents a set of processes running on a device that are responsible for device functionality (for example a messaging system, as presented in Section 4). The device discovery mechanism consists of two parts: Client and Server. The DiscoveryServer provides a list of available devices. Every available device has a DiscoveryClient process that sends UDP (User Datagram Protocol) packets to other devices in the network. The DiscoveryServer receives these packets; in this way the DiscoveryServer is able to determine which devices are running. Because the net2 package provides mechanisms to establish network connections using just an IP address and port number, only a simple discovery mechanism is needed when constructing a pervasive system. The connection established between devices will be used for service discovery and for performing tasks that need device collaboration.

2. JCSP Mobile Processes and Agents

The CSP [10] model is based on processes communicating over channels. The idea of mobility of processes and channels in JCSP was adopted from the π-calculus; further description can be found in [11, 12]. A mobile process is a process that is created on one node and sent to another node. When a mobile process arrives on a new node it can be connected to it and run in parallel with other processes on this node. Once sent, a JCSP mobile process resides on the destination node and is now part of the network of processes running on that node. The mechanism of creating, delegating and running a mobile process is illustrated in Figure 2.
Figure 2. JCSP mobile process.
As shown in Figure 2, Process 1 creates the Mobile process on Node 1 (Stage 1). The Mobile process is sent over a channel to Node 2 and received by Process 2. Process 2 activates the Mobile process and runs it in parallel with all the processes running on the node. Channels between the Mobile process and Process 2 are established to enable communication. A JCSP mobile agent is a mobile process that can be made to visit several processing nodes and undertake some operations [13], e.g. gathering data. After arriving at a node, an agent is connected to the existing network of processes to enable access to host resources. When all operations are completed the agent is disconnected from the node and moved to the next one, and so on. The mobile agent's path can be defined within the agent or decided by a node. An agent can remember the path it has traveled and gather data from visited nodes. A mobile agent exists in one of two states: active or passive (Figure 3). When an agent is created it is in the passive state. In the active state an agent can carry out instructions and write to and read from the node it is connected to. In the passive state an agent can be moved, activated or terminated. On suspension a mobile agent saves its status, and when it is reactivated it starts again with the same status. When a mobile agent is terminated it cannot be reactivated. When a mobile agent is activated by another process it starts working in parallel with it; channels between those two processes can be created for communication. Suspension of the mobile agent does not have to be requested by the environment; it can be a decision made by the mobile agent itself, and the reasons for suspension can be complex. A mobile agent can visit several nodes, connect to them and exchange information. In this way a mobile agent can perform a smart search on the nodes it visits and gather particular information. This capability was used for smart service discovery and is presented in Section 3.
Figure 3. Agent's states.
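The agent life cycle above can be sketched as a small state machine. This is a toy illustration in Python rather than JCSP: the class and method names are invented, not the JCSP mobile-agent API.

```python
# Toy sketch of the agent life cycle of Figure 3: an agent is created
# passive, can be activated and suspended (keeping its status across
# suspensions), and once terminated cannot be reactivated.

class Agent:
    def __init__(self):
        self.state = "passive"    # agents are created passive
        self.status = {}          # saved status, survives suspension

    def activate(self):
        assert self.state == "passive", "only a passive agent can be activated"
        self.state = "active"

    def suspend(self):            # may be the agent's own decision
        assert self.state == "active"
        self.state = "passive"    # status is kept for reactivation

    def terminate(self):
        assert self.state == "passive"
        self.state = "terminated" # cannot be reactivated

a = Agent()
a.activate(); a.status["visited"] = 1
a.suspend()                       # status survives suspension
a.activate(); a.suspend()
a.terminate()
print(a.state, a.status)  # terminated {'visited': 1}
```

The assertions encode the legal transitions: move/activate/terminate only from the passive state, suspend only from the active state.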
3. Proposed Architecture

Consider a smart space with devices available in it (Figure 4). In a pervasive system functionalities are distributed, so to perform some task, devices have to collaborate. The network is dynamic and devices can appear and disappear. In this system it is essential to discover new devices, notice when a device leaves the space, and discover new services to enable device cooperation.
Figure 4. Smart space with devices available in it.
Devices can be distinguished only by their IP address and service information. All devices are able to recognize other participants in the smart space, discover their capabilities and send messages. The JCSP software running on a device consists of two parts, the Main Process and the device discovery mechanism (Figure 5). The Main Process is responsible for device functionality: messaging and service discovery. The device discovery mechanism uses the DiscoveryServer and DiscoveryClient to detect the existence of other devices.
Figure 5. Device architecture.
There are six static channels used in the architecture (Figure 5):

• C1 – for receiving messages in the system,
• C2 – used to receive broadcast signals from other devices for device discovery,
• C3 – for receiving updates about devices available in the smart space,
• C4 – used to send broadcast signals to other devices for device discovery,
• C5 – for sending messages and agents in the system,
• C6 – to receive agents responsible for the smart search for service discovery.

Channels C2 and C4 are Java socket based channels; all the remainder are JCSP channels. Channels C2 and C4 send simple UDP packets, as there is no need to use the jcsp.net2 protocol on top of TCP/IP. When an agent arrives at a node it is connected using two dynamic channels (Figure 6). Those channels enable communication between an Agent and a Main Process to perform the smart search for service discovery. When a device joins the network the DiscoveryClient, part of the device discovery mechanism, starts broadcasting UDP packets to other devices to inform them about its presence. The DiscoveryServer, also part of the device discovery mechanism, starts receiving signals from other devices. When a device appears in or disappears from the space, the local list of available devices held by the DiscoveryServer is changed. The list is then sent to the Main Process, where it is interpreted. The device compares the new list with its own local list, updating or deleting records as necessary.
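The UDP part of this discovery mechanism can be sketched in a few lines. This is a hedged illustration in Python rather than JCSP/Java; the port number, the payload and the listening window are assumptions, not values from the paper.

```python
import socket
import time

# Minimal sketch of UDP device discovery: a DiscoveryClient broadcasts
# a presence packet, and a DiscoveryServer collects the sender IPs of
# the packets it hears within a time window.

DISCOVERY_PORT = 9000  # hypothetical fixed port shared by all devices

def broadcast_presence():
    """DiscoveryClient side: announce this device to the network."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
    s.sendto(b"HELLO", ("255.255.255.255", DISCOVERY_PORT))
    s.close()

def collect_devices(window=1.0):
    """DiscoveryServer side: list the IPs heard within the window."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.bind(("", DISCOVERY_PORT))
    devices, deadline = set(), time.time() + window
    while True:
        remaining = deadline - time.time()
        if remaining <= 0:
            break
        s.settimeout(remaining)
        try:
            _, (ip, _port) = s.recvfrom(64)
            devices.add(ip)
        except socket.timeout:
            break
    s.close()
    return sorted(devices)
```

In terms of the architecture above, the sender corresponds to channel C4 and the receiver to channel C2; a real implementation would run the server continuously and report list changes to the Main Process over C3.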
Figure 6. Device architecture with the Agent.
When a device is new in a network, it must inform others about its services and understand what services the other devices in the network can offer. To do this the new device creates a list of devices to explore, called the UpdateList. The new device sends a JCSP agent on a trip using this list, to inform the other devices about its services and to gather information about them. The agent is created in the Main Process and instantiated with the UpdateList, so it knows the route for the trip it is going on. Next the agent is sent to the first device on the UpdateList, where it is connected to the Main Process and runs in parallel with the other processes on the device. When connected, the agent exchanges data and informs the device of its next destination. The final destination of the agent is the device that created it. Every time the agent visits a node and obtains information about a device, it removes this destination from its UpdateList. When the list is empty it returns to the device from which its journey started. When the agent returns it has all the information needed to update its primary sender, and has also informed all the devices on its UpdateList about the new device.

4. Example Scenario

Consider a scenario with three mobile devices in a space called a Message Room (Figure 7). The devices are personal computers offering a messaging application. The application discovers all devices that offer messaging and enables communication between people that have the same interests. The application can help find people of interest to the user and suggest topics that they can talk about. A conversation using the messenger can initiate further real-life discussions. Every device holds information about its user. Device 1 belongs to Alice, who prefers to talk about sport. Device 2 is used by Paul, who would like to talk about art. A new device appears in the smart space.
Device 3 provides a messaging service for a user called Adam, who would like to talk about art or music. Device 3 is new in the space and has recognised that there are two other devices available. The device discovery mechanism in Device 3 sends a list of IP addresses to the Main Process in Device 3: UpdateList = {"134.27.221.10", "134.27.221.11"}
[Figure 7 depicts the Smart Space containing Device 1 (IP: 134.27.221.10, Name: Alice, Preferences: Sport), Device 2 (IP: 134.27.221.11, Name: Paul, Preferences: Art) and Device 3 (IP: 134.27.221.12, Name: Adam, Preferences: Art, Music).]
Figure 7. Message Room example.
The order of IP addresses in UpdateList is the same as the order in which the devices were discovered by Device 3 in the space. It is possible to add an algorithm that modifies the order of devices to reduce routing distance and the overall time of service discovery.
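The agent's round trip can be sketched in plain Java. All class and method names here are hypothetical; the actual system implements the agent as a JCSP mobile process sent between devices over network channels.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch of the discovery agent's trip. The real system
// moves the agent between devices as a JCSP mobile process; here the
// trip is modelled as bookkeeping over the UpdateList.
public class ServiceAgentSketch {

    // Fixed-format service record: {Service, IP, Name, Pref1, Pref2, Pref3}
    public record DeviceRecord(String service, String ip, String name,
                               String pref1, String pref2, String pref3) {}

    private final String homeIp;            // device that created the agent
    private final DeviceRecord homeData;    // data the agent advertises
    private final Deque<String> updateList; // remaining destinations
    private final List<DeviceRecord> gathered = new ArrayList<>();

    public ServiceAgentSketch(String homeIp, DeviceRecord homeData,
                              List<String> updateList) {
        this.homeIp = homeIp;
        this.homeData = homeData;
        this.updateList = new ArrayDeque<>(updateList);
    }

    // Called when the agent connects to a visited device's Main Process:
    // it hands over its home device's record and records the host's data.
    public void visit(String hostIp, DeviceRecord hostData) {
        updateList.remove(hostIp);  // destination explored, drop it from the list
        gathered.add(hostData);
    }

    // Next destination: the head of the UpdateList, or home once it is empty.
    public String nextDestination() {
        return updateList.isEmpty() ? homeIp : updateList.peekFirst();
    }

    public DeviceRecord advertised() { return homeData; }
    public List<DeviceRecord> gathered() { return gathered; }
}
```

In the Message Room scenario, Device 3 would construct the agent with its own record and the two-entry UpdateList; after both visits the agent's next destination is its home IP address.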
Device 3 creates an agent to find information about the other devices. The agent is initialized with the UpdateList and sent to the first destination on the list, the device with IP address 134.27.221.10, Alice's device (Figure 8, Step 1). The Agent is connected to the Main Process of Device 1. The Agent then removes IP 134.27.221.10 from its UpdateList and requests data from Device 1 (Figure 8, Step 2). Next the Agent decides where it should be sent, consulting the Main Process to check that the next destination from the UpdateList still exists. Device 1 then uses its resources and capabilities to send the Agent to Device 2. In Step 3 the Agent is sent to Device 2, where it connects to gather information about the device's user and introduces itself to the device (Figure 8, Step 4). At this point the Agent's UpdateList is empty, so the Agent requests that Device 2 send it to its home node (Step 5). On reaching its final destination the Agent attaches to the Main Process of the device, namely Device 3, and transfers all the gathered information.
[Figure 8 shows the Agent's trip around the Smart Space: sent from Device 3 to Device 1 (Step 1), exchanging data with Device 1 (Step 2), forwarded to Device 2 (Step 3), exchanging data with Device 2 (Steps 4 and 5), and finally returning to Device 3 (Step 6).]
Figure 8. Agent's trip.
To simplify the implementation, the service discovery data gathered by the system has a fixed format: Data = {Service, IP, Name, Preference1, Preference2, Preference3} For example, the data received by the Agent from Device 2 is: Data = {"Messaging", "134.27.221.11", "Paul", "Art", "", ""} In this example scenario the format of the data gathered by an agent is fixed, but an agent can gather data of different types, depending on what is revealed by the device that it visits. After the Agent returns to its home node, all devices that appeared on the UpdateList are informed about the user Adam and that he prefers to talk about art or music. Adam is informed about the available users, their names and preferred topics. Now, depending on Adam's choice, a connection between him and particular users can be established to start
conversation. Every time a new Agent arrives at a node, the device's User Interface (UI) is updated with the new user's name and conversation preferences. All discovery agents are built using the same agent class. There can be many agents in the system, because many devices can send an agent to discover others. In this implementation the only difference between agents is the data that they hold; this data, together with the network state, determines their behaviour. In this way all agents can adapt to the network topology, offering flexible service discovery. Initial experiments were undertaken using the Java programming language and the JCSP libraries [15]. The devices are composed into a network and run Java code, communicating over a TCP/IP network provided by a wireless router. Every device is a Dell Axim X5 PDA (Personal Digital Assistant) with the Microsoft® Pocket PC operating system, an Intel® PXA255 400MHz processor, 64MB of RAM and the IBM J9 Java Virtual Machine.

Table 1. Size of the system
  Application classes:        23.22 KB
  Additional JCSP libraries: 628.00 KB
  Total:                     651.22 KB
The choice of PDAs for the experiment was made based on equipment availability. A PDA may seem a big device, but the system was kept to a small size (Table 1). The system requires a JVM to run Java code. In the JVM a process is represented by a thread. Limitations of the equipment can restrict the number of threads that can be run on one JVM, and this number can be further decreased by the size of a thread. A JCSP system controlling a LEGO NXT robot, described in [14], could only run 90 simple threads due to VM and device memory limitations. Experiments for the project described in this paper were first simulated on a PC and then performed on PDAs, so the minimal equipment requirements for running the system were not tested. The initial experiment used the jcsp.net package [15] for network connections. Device discovery was performed using a central repository, as the capability provided by a new protocol for jcsp.net2 was not available. The second version of the implementation uses the distributed device discovery presented in [8]. The service discovery mechanism using JCSP agents remained the same for both implementations. In the device discovery mechanism the frequency of sending packets was determined by experiment, with the aim of minimizing the reaction time to device failure. In the actual implementation a 9-byte UDP packet is sent over a Java socket every three seconds. A DiscoveryServer holds a constantly updated list of available devices. To discover a new device the DiscoveryServer needs just one packet with an IP address that does not exist in its local list; it therefore takes about three seconds plus network latency to discover a new device. To detect disappearing devices the DiscoveryServer has to wait longer, to allow all active devices to report. In the existing implementation the time to wait for devices to report depends on the number of devices that have already been discovered.
A margin of a few seconds is added to this calculation, because reversing an incorrect decision takes considerable time. Once the DiscoveryServer determines that a device has left the space, the Main Process is informed and all information about the leaving device and its connections is discarded. If the DiscoveryServer was wrong, all connections have to be re-established and service discovery undertaken again.
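The DiscoveryServer bookkeeping described above can be sketched as follows. Only the liveness logic is shown; the actual implementation exchanges 9-byte UDP datagrams over Java sockets, and the two-second margin used here is an illustrative assumption.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch of the DiscoveryServer liveness bookkeeping. Beacons arrive every
// three seconds; a device missing for longer than a beacon period plus a
// safety margin is assumed to have left the space. Times in milliseconds.
public class DiscoverySketch {
    private static final long BEACON_MS = 3000; // beacon period from the paper
    private static final long MARGIN_MS = 2000; // assumed safety margin

    private final Map<String, Long> lastSeen = new HashMap<>();

    // Called for every received beacon packet carrying a device's IP address.
    // Returns true if this IP was not in the local list, i.e. a new device.
    public boolean report(String ip, long nowMs) {
        return lastSeen.put(ip, nowMs) == null;
    }

    // Devices that have missed a beacon by more than the margin are assumed
    // to have left; their entries are discarded and their IPs returned.
    public Set<String> sweep(long nowMs) {
        Set<String> departed = new HashSet<>();
        lastSeen.entrySet().removeIf(e -> {
            boolean gone = nowMs - e.getValue() > BEACON_MS + MARGIN_MS;
            if (gone) departed.add(e.getKey());
            return gone;
        });
        return departed;
    }

    public Set<String> known() { return lastSeen.keySet(); }
}
```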
5. Issues and Further Work The system was tested on up to ten PDA devices and deals with devices entering and leaving the space. The service discovery can adapt to a changing topology, using mechanisms implemented in an agent. In this implementation an agent is sent only once to discover device capabilities in the network. The agent can be dropped if its home destination no longer exists or if a device fails while dealing with the agent. If the agent is lost or dropped before visiting all the nodes that it was sent to, some of the nodes may not have information about the new device. This problem can be solved by setting a timer for the agent to return to its home node. If this timer expires, the device can send another agent for service discovery. This agent will have to visit all the nodes again, and if it encounters the same problem as its predecessor it will also be dropped, which can lead to an infinite loop of sending and dropping agents. If the agent returns with partial information about the network, a new agent can be sent directly to the unexplored region to discover the available services. The service discovery is designed to send agents only from new devices, to avoid repetition of gathered data and flooding the network with agents that perform the same task. The agent sent from a new device not only gathers data but informs visited devices about the services offered by the sender. If an agent is lost or dropped, existing devices may not receive information about the new device, since the service discovery deployed in those devices has no mechanism for finding this information. A solution to this problem is to explore the data held in an arriving agent; this way the device can get information about the nodes that the agent has visited. However, if the agent is implemented so that it gathers only specific information, the information extracted from the agent might be useless for a particular device. It is possible that devices may lose or gain services.
In this case, the device acts like a new device in the network and sends an agent offering its new services. Services are revoked or expanded by overwriting their descriptions and informing all devices in the network. The service available in the presented implementation provides information about the person using a particular device. In the presented implementation the information that the agent obtains from every node has a fixed description. To extend the solution, the information about a service could be described using XML, or using the OWL-S (Semantic Markup for Web Services, based on the Web Ontology Language) description [16] that is popular in the Web Service discovery domain. A description provided in OWL-S can be quite flexible, so different types of services can be expressed, even complicated services like electronic payment. Despite all the problems associated with losing and dropping agents or obtaining only partial information, the use of agents provides a flexible solution for service discovery in networks with dynamic topology. The agent is initialized with a list of devices and decides where it is to be sent; therefore it will not be sent to a device that appears to be switched off. The journey plans stored in the agent can be changed and easily fitted to the dynamic environment. The agent can also perform a smart search; it might only gather specific types of information, so the issuing device will receive only the requested information. The agent can reveal different information about its home device depending on the type of device visited or a trust level. For example, an agent might carry detailed information about the issuing device, but reveal only some parts of it to visited devices, depending on rules that it has adopted.
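The timer-and-resend policy proposed above can be sketched as follows. The class name and the retry cap (added here to break the send-and-drop loop noted in the text) are assumptions.

```java
// Sketch of the timer-based recovery for lost or dropped agents. The paper
// proposes only the timer; the maxResends cap is an assumed safeguard
// against the infinite send-and-drop loop it warns about.
public class AgentRetrySketch {
    private final long timeoutMs;  // how long to wait for the agent's return
    private final int maxResends;  // assumed cap on resent discovery agents

    private int resends = 0;
    private long sentAt = -1;      // -1 means no agent is outstanding

    public AgentRetrySketch(long timeoutMs, int maxResends) {
        this.timeoutMs = timeoutMs;
        this.maxResends = maxResends;
    }

    public void agentSent(long nowMs) { sentAt = nowMs; }
    public void agentReturned() { sentAt = -1; }

    // True if the timer has expired and another discovery agent should be
    // sent; counts the resend and restarts the timer as a side effect.
    public boolean shouldResend(long nowMs) {
        if (sentAt < 0 || nowMs - sentAt < timeoutMs) return false;
        if (resends >= maxResends) return false; // give up rather than loop
        resends++;
        sentAt = nowMs;                          // treat resend as a new trip
        return true;
    }
}
```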
6. Summary This paper shows a new approach to distributed service discovery. The usability of JCSP as a software architecture for pervasive computing was described in [8]. Device mobility and service distribution are desirable in pervasive systems. We show that service discovery can be achieved with JCSP mobile agents. JCSP agents, which are an extension of mobile processes, can move around the system, connect to nodes and exchange data. In this work these agents were used to implement a smart search capability for service discovery. The JCSP agent, as a software structure, can be used to perform different functions, e.g. to represent a person or object physically moving in a space, or to model mobile social networks [13]. The system presented in this paper has been implemented as a proof of concept using the JCSP libraries on PDA devices. References [1] Mark Weiser, The Computer for the 21st Century. Scientific American, 1991: pp. 66-75. [2] George Coulouris, Jean Dollimore and Tim Kindberg, Distributed Systems: Concepts and Design. 2005, Pearson Education. [3] Dipanjan Chakraborty and Tim Finin, Toward Distributed Service Discovery in Pervasive Computing Environments. IEEE Transactions on Mobile Computing, 2006. [4] Christian Bettstetter and Christoph Renner, A Comparison of Service Discovery Protocols and Implementation of the Service Location Protocol. 6th EUNICE Open European Summer School: Innovative Internet Applications. 2000. [5] Rahul Gupta, Sumeet Talwar and Dharma P. Agrawal, Jini Home Networking: A Step toward Pervasive Computing. IEEE Computer Society, 2002. pp. 34-40. [6] Sumi Helal, Nitin Desai and Varun Verma, Konark - A Service Discovery and Delivery Protocol for Ad-Hoc Networks. Wireless Communications and Networking, 2003. [7] Karen Henricksen, Jadwiga Indulska and Andry Rakotonirainy, Infrastructure for Pervasive Computing: Challenges. Workshop on Pervasive Computing INFORMATIK 01. 2001. Vienna.
[8] Anna Kosek, Jon Kerridge, Aly Syed and Alistair Armitage, A Dynamic Connection Capability for Pervasive Adaptive Environments Using JCSP. Artificial Intelligence and Simulation of Behaviour (AISB) 2009. [9] Kevin Chalmers, Investigating Communicating Sequential Processes for Java to Support Ubiquitous Computing. PhD thesis, School of Computing, Edinburgh Napier University, 2008. [10] C.A.R. Hoare, Communicating Sequential Processes. 1985: Prentice Hall International Series in Computer Science. [11] Kevin Chalmers and Jon Kerridge, jcsp.mobile: A Package Enabling Mobile Processes and Channels. Communicating Process Architectures, 2005. [12] Kevin Chalmers, Jon Kerridge and Imed Romdhani, Mobility in JCSP: New Mobile Channel and Mobile Process Models. Communicating Process Architectures, 2005. [13] Jon Kerridge, Jens-Oliver Haschke and Kevin Chalmers, Mobile Agents and Processes using Communicating Process Architectures. Communicating Process Architectures, 2008. [14] Jon Kerridge, Alex Panayotopoulos and Patrick Lismore, JCSPre: the Robot Edition to Control LEGO NXT Robots. Communicating Process Architectures, 2008. York, UK. [15] Peter Welch and Paul Austin, The JCSP Home Page. http://www.cs.kent.ac.uk/projects/ofa/jcsp/. [16] David Martin, et al., OWL-S: Semantic Markup for Web Services. 2004; Available from: http://www.w3.org/Submission/2004/SUBM-OWL-S-20041122/.
Communicating Process Architectures 2009 P.H. Welch et al. (Eds.) IOS Press, 2009 © 2009 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-60750-065-0-375
Toward Process Architectures for Behavioural Robotics Jonathan SIMPSON and Carl G. RITSON School of Computing, University of Kent, Canterbury, Kent, CT2 7NZ, England {J.Simpson, C.G.Ritson}@kent.ac.uk Abstract. Building robot control programs which function as intended is a challenging task. Roboticists have developed architectures to provide principles, constraints and primitives which simplify the building of these correct, well structured systems. A number of established and prevalent behavioural architectures for robot control make use of explicit parallelism with message passing. Expressing these architectures in terms of a process-oriented programming language, such as occam-π, allows us to distil design rules, structures and primitives for use in the development of process architectures for robot control. Keywords. concurrency, robotics, behavioural control
Introduction Robotics is inherently a concurrent problem: watching many sensors, and driving many effectors to sense and interact with the world. The ability to write control programs using parallelism is desirable, as it allows us to maintain the concurrent aspects of robot behaviour without having to serialise the tasks required. It is natural to think about control tasks in the real world in parallel, and in unwinding these thoughts into a sequential control loop, mental overhead is added. Control tasks within robot systems fall into one of two categories: reactive and deliberative. Reactive control involves the tight coupling of sensing and action to provide fast responses where action is critical, and is somewhat analogous to the reflexes present in humans. Deliberative control involves tasks which require more extensive computation or sensing before action can take place, generally involving plans and state, e.g. generating and maintaining a path to a given way-point, or feature identification of the world using camera data. Most robot control programs are ‘hybrid’ systems, with reactive and deliberative behaviour layers along with a third layer performing arbitration and co-ordination [1]. This combination of reactive and deliberative components allows for both high-level planning and fast reaction to important stimuli from the environment. For example, a robot might be equipped with a deliberative path planner, which uses a pre-supplied map of its environment to calculate a safe path to the destination. This planner would work alongside a reactive component which works to avoid obstacles encountered while trying to follow way-points generated by the path planner. The third layer would act to ensure that both the path planner and the obstacle avoidance behaviour can influence the motion of the robot. Hybrid robot control systems employ parallel computation to allow both the deliberative and reactive components to be active at the same time.
In addition, the reactive component may be internally parallel, as the objective of reactive systems is keeping sensing and action close together, a model that relies on critical information in the world not being missed.
376
J. Simpson and C.G. Ritson / Toward Process Architectures for Behavioural Robotics
Behavioural robotic control applies behaviour-based AI to robot control; using modular decomposition of the system’s intelligence to structure the behaviours which compose its controls. Behaviour-based AI is heavily inspired by the agent-based decomposition of human intelligence presented in Minsky’s Society of Mind [2]. Architectures for behavioural control act to support the development of the third, interfacing layer of systems, often providing support for parallelism and communication between components. When writing a control program using a language with explicit parallelism and message passing, these evolved and widely used behavioural architectures provide a natural source of ideas for process architectures. We are specifically interested in the development of scalable process architectures for robot control using the occam-π programming language [3]. Using occam-π allows us to harness the Transterpreter, a portable virtual-machine developed expressly for running occam-π programs on small embedded platforms [4]. We have used occam-π for robot control successfully on a variety of mobile robot platforms in the past, including the LEGO MindStorms RCX [5], Surveyor SRV-1 [6] and Pioneer 3-DX [7], with additional support now added for the LynxMotion AH3-R ‘Hexapod’ walking robot [8]. In light of this growing availability of process-orientation on robotics platforms, we consider it valuable to explore traditional behavioural architectures in this context. The techniques developed for use in occam-π may also be applicable to robotic control in other process-oriented languages or language extensions such as JCSP [9], PyCSP [10] and gCSP [11]. In this paper we will elucidate, implement primitives for and compare a number of seminal behavioural control architectures in the context of their application in process architectures for robot control. 
Section 2 examines Brooks’ Subsumption Architecture, with the following sections 3, 4 and 5 dealing with Connell’s Colony Architecture, Maes’ Action-Selection and Arkin’s Motor Schemas respectively. Finally, in section 6 we examine Rosenblatt’s Distributed Architecture for Mobile Navigation (DAMN). After looking at each architecture in turn we conclude in section 7 and identify future extensions to this work in section 8. 1. Platforms The implementation examples in this paper make reference to two robot platforms: the Pioneer 3-DX from Mobile Robots, Inc [12] and the AH3-R ‘Hexapod’ walking robot from LynxMotion [8]. The Pioneer 3-DX is a two-wheeled mobile robot platform fitted with a ring of 16 sonar sensors covering its circumference. A SICK laser range-finder is also fitted, covering a forward 180 degree arc with a scan distance of eight meters. On-board computation is provided by a 700MHz PC104 board running Debian GNU/Linux. The LynxMotion AH3-R ‘Hexapod’ is, as its name would suggest, a six-legged walking robot. Each leg is roughly 60 degrees apart and driven by three servos, giving it three degrees of freedom: swinging forward or backward, raising or lowering, and extending or contracting. The Hexapod is fitted with ultrasound range-finders on a rotating turret covering 360 degrees around the robot, controlled by an Arduino-based micro-controller. There are also two tilt sensors, mounted at 90 degrees on the robot’s body, and pressure sensors on each of the robot’s feet. The robot is controlled via a serial link to a host PC. 2. Subsumption Architecture Brooks’ Subsumption Architecture [13] was one of the first behavioural control systems, allowing robot programs to be expressed as a hierarchy of levels of competence which interact
with each other to control the robot. Subsumption uses a network of finite state machines known as “behaviours”, along with asynchronous message passing over ‘wires’ between input and output “ports”. Behaviours output values to a port, and the most recent value output to that port is constantly available for input to the receiver, essentially providing a memory cell located in front of the port into which values are written. Behaviours are grouped into increasing levels of competence, with each level able to tap into wires in lower levels and suppress inputs and inhibit outputs as required.
[Figure 1 shows a behaviour module with a suppressor (S, activity period 10) on its input line and an inhibitor (I, activity period 3) on its output line.]
Figure 1. A behaviour module for a Subsumption Architecture, with a suppressor on an input line and an inhibitor on an output line.
Two key primitives are used in Subsumption to allow other behaviours making up different levels of competence to interact: suppression of inputs and inhibition of outputs. Higher levels of competence replace the input to lower level modules or inhibit their output to produce the desired result. If a suppressor is placed at the input to a module, values from a secondary “suppression” input may replace other input for a pre-defined period. If an inhibitor is placed at the output of a module, a secondary “inhibitory” input can send a signal to block all output from the module for a period of time. The diagrammatic representation of these two Subsumption primitives is shown in Figure 1, with an I or S indicating inhibition or suppression at the top of the circle and the activity period indicated by the number at the bottom. A more comprehensive explanation of these primitives, including code listings, is available in [7]. 2.1. Implementation In Brooks’ original implementation, these communicating state machines are compiled along with a scheduler which supports behaviours running concurrently on a single processor and in parallel across multiple processors. The ability to mix both simulated and real parallelism via a software scheduler was exploited in Brooks’ six-legged walking robot “Genghis,” which had a control system consisting of 57 finite state machines running on just four processors [14]. Process-oriented programming provides an advantage when implementing control programs using the Subsumption Architecture, as it gives us independent, parallel processes with message passing from which to build the Subsumption primitives. The authors have previously implemented these primitives in occam-π and used them successfully to construct a basic program to ‘bump and wander’, navigating safely within an enclosed space [7]. The process network for a simple robot control program written using the Subsumption Architecture in occam-π is shown in Figure 2.
This program is written to run on the Pioneer 3-DX robot platform, as previously described in section 1. Our example program has three levels of competence, the first of which is to move into space and execute an emergency stop if the robot gets too close to any object, providing a degree of safety to the robot’s motion. The second level of competence detects objects in the centre of the laser scanner’s field
[Figure 2 shows the process network: in the first level, laser.scanner feeds min.distance and prevent.collision, which drives motor.control (“Move into space, stop if a collision will occur”); in the second level, object.detect feeds pivot, which suppresses (S, 10) the first level’s input (“Detect objects, turn away from them”); in the third level, sonar.ring feeds space.behind, which inhibits (I, 1) the second level’s output (“Move forward if there is an obstruction behind”).]
Figure 2. A Subsumption Architecture-based bump and wander program for a robot with three levels of competence.
(i.e. directly in front of it) and backs the robot up while turning, to avoid them. This level upgrades the control system from simply driving into an area until it stops due to proximity, to actually being able to navigate around the space. However, as the robot’s emergency stop behaviour uses the laser range-finder at the front, the robot might back into walls while trying to avoid obstacles in tight spaces. Our third level uses the sonar sensors at the back of the robot to determine if there is space for the robot to back up. If there is no room, it inhibits the outputs from the module attempting to back the robot up and allows the lower-level forward driving behaviour to take over. This three-competence network allows the robot to perform multi-point turns simply by engaging its “back up and turn” and lower-level “go forward” behaviours alternately, with no explicit statement of the composite action. Scaling this early model of Subsumption to larger systems poses difficulties. Increasingly complex interdependencies form as higher levels of competence intercept values being passed between individual modules in lower levels, replacing their input or inhibiting output. In his specification of the Colony Architecture (see Section 3), Connell identifies that Subsumption requires a holistic view and correct behavioural decomposition from the very beginning of design, so as to offer the correct inputs and outputs for higher layers to interact with. A later revision to the Subsumption Architecture incorporates a number of changes which attempt to resolve these problems, changing the functioning of inhibition and suppression to require frequent communication and short delay periods [14]. These changes have been implemented in our versions of the Subsumptive primitives and further enhance the suitability of these primitives when using a paradigm that supports message passing.
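A logic-only sketch of the two Subsumption primitives is given below. The occam-π versions are processes communicating over channels [7]; here each primitive is reduced to a timestamp-driven state machine in Java, with names and millisecond periods chosen purely for illustration.

```java
// Sketch of the Subsumption suppressor and inhibitor primitives as small
// state machines. In the process-oriented implementation these sit as
// processes on the wires between behaviour modules; here the passage of
// values is modelled by explicit calls carrying a current time.
public class SubsumptionSketch {

    // A suppressor on an input line: a value on the secondary "suppression"
    // input replaces the ordinary input for a pre-defined period.
    public static class Suppressor<T> {
        private final long periodMs;
        private long activeUntil = Long.MIN_VALUE; // no suppression yet
        private T override;

        public Suppressor(long periodMs) { this.periodMs = periodMs; }

        public void suppress(T value, long nowMs) {
            override = value;
            activeUntil = nowMs + periodMs;
        }

        // The value passed on to the module: the override while the
        // suppression period is active, otherwise the ordinary input.
        public T pass(T input, long nowMs) {
            return nowMs < activeUntil ? override : input;
        }
    }

    // An inhibitor on an output line: a signal on the secondary input
    // blocks all output from the module for a period of time.
    public static class Inhibitor {
        private final long periodMs;
        private long activeUntil = Long.MIN_VALUE;

        public Inhibitor(long periodMs) { this.periodMs = periodMs; }

        public void inhibit(long nowMs) { activeUntil = nowMs + periodMs; }

        public boolean blocked(long nowMs) { return nowMs < activeUntil; }
    }
}
```

Brooks’ later revision (and the Colony Architecture) would drive `suppress` and `inhibit` with frequent messages and short periods, rather than a single message and a long delay.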
3. Colony Architecture Connell’s Colony Architecture is a refinement of the early Subsumption Architecture which removes explicit inhibition and replaces it with predicates inside the behaviour modules, allowing only suppression of outputs in lower-level behaviours [15]. The Colony Architecture uses a “soup” of modules which are related to one another by the actuator they control, rather than a layered ordering across the system. Unlike Subsumption, additional layers do not necessarily increase in competence; we might introduce modules in a higher layer which provide more general solutions to lower-level control problems, for cases where more specific lower-level modules cannot establish the correct action to take. The Colony Architecture removes Subsumption’s ability to “spy” on inputs and replace outputs of internal modules
in lower layers, enhancing the system’s modularity by only allowing the input and output of entire layers of behaviour to interact. The Colony Architecture was used on a multi-processor robot which could also run programs written for the Subsumption Architecture. This multi-processor robot used 24 loosely-coupled processors to run a system with over 41 behaviours, with software scheduling like that of Brooks’ to run multiple behaviours per processor. Both Brooks and Connell identified that the explicit parallelism of behaviours allowed the control system to scale through the addition of more processors, maintaining reactivity and performance even with the introduction of more behaviours. 3.1. Implementation The Colony Architecture uses very similar primitives to the Subsumption Architecture, by virtue of being based on it. A retriggerable monostable primitive is added, which is used when an event (defined as a single point of activity which does not persist in the environment) should trigger a behaviour. The detection of these events is performed using initiation and satisfaction predicates, which set the monostable true or false depending on the condition of the event. The monostable itself maintains a true value for a period of time, but eventually resets itself if not reset by the satisfaction predicate, acting much like a piece of memory with a watchdog timer. The monostable is used to persist point state from the environment, allowing it time to influence the system even if the state ceases to exist in the environment. Modifications to the Subsumption primitives involve both inhibition and suppression requiring the continued sending of messages over the control channels. Early Subsumption tended to use a single message and long delays, whereas the Colony Architecture and Brooks’ later Subsumption both use short delays with regular messaging along the control channel. 4.
Action-Selection Maes’ Action-Selection architecture relies on a network of independent competence modules and the use of activation levels to control which modules are executed [16,17]. Activation levels in the network are propagated such that an executable module primes modules which can run after it in a task, while a non-executable module primes those which run before it, causing activation to pool in the first behaviour in a task which is suitable for execution. Inhibition, or “conflictor”, lines are connected between modules that oppose each other’s behaviour: when a module with such a connection becomes active, it inhibits the activation of the other modules which would impede completion of its task.
[Figure 3 shows three connected modules: “Turn left” and “Drive forward” exchanging activation, “Drive forward” preconditioned on “Is there space in front?”, and activation fed in from the goal “Navigate space”.]
Figure 3. A set of Action-Selection competence modules to move within a space. Activation spread is accomplished via bi-directional connections between modules, as shown.
A simple example of Action-Selection is shown in Figure 3 on the previous page: a robot control program which has a goal to navigate into space. The competence modules for this program are “drive forward” and “turn left”; the “drive forward” module is preconditioned on there being space in front of the robot. The goal “navigate space” raises the activation of the “drive forward” module and it will become active, moving the robot forward. Once the robot runs out of space in front of it, the “drive forward” module will become inactive, as the “has space in front” precondition will become false. The “drive forward” behaviour will then pass its activation on to the “turn left” module, causing a turn until the precondition for “drive forward” becomes true once more (i.e. there is free space in front of the robot). 4.1. Implementation Use of the Action-Selection mechanism in occam-π is most easily achieved through the creation of a second decision-making network consisting of a number of ‘cells’ which propagate the activation levels for each behaviour, to calculate activation values for each cell. Cells which have an activation level over their threshold can be activated if they are ‘executable’. To be ‘executable’ all of a cell’s preconditions must be met; the decision-making network therefore captures all of the preconditions, allowing it to arbitrate between behaviours and activate those that should be activated. These activated cells in the decision layer trigger the execution of behaviours in the robot control system itself, freeing those control behaviours from having to incorporate the propagation of activation or the management of preconditions. Goals are connected into the decision layer after the last module in a behaviour that completes them, and feed activation back through the modules responsible for completing the task until it pools in the first task, raising it above the activation threshold.
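A decision-layer cell of the kind described above might be sketched as follows. The class name, the threshold value and the explicit drain step are assumptions; the occam-π implementation propagates activation over channels between cell processes.

```java
// Sketch of a decision-layer 'cell' for Action-Selection. A cell
// accumulates activation and fires only when it is over threshold and
// 'executable' (all preconditions met). A non-executable cell passes its
// pooled activation on to a neighbouring cell in the task chain.
public class ActivationCellSketch {
    private final double threshold;
    private double activation = 0.0;
    private boolean executable = false; // are all preconditions currently met?

    public ActivationCellSketch(double threshold) { this.threshold = threshold; }

    public void setExecutable(boolean e) { executable = e; }
    public void addActivation(double a) { activation += a; }

    // A non-executable cell gives up its activation so that it can pool in
    // a cell that is suitable for execution; an executable cell keeps it.
    public double drain() {
        if (executable) return 0.0;
        double out = activation;
        activation = 0.0;
        return out;
    }

    // The cell triggers its behaviour only when executable and over threshold.
    public boolean fires() { return executable && activation >= threshold; }
}
```

Modelling the Figure 3 example: the goal raises “drive forward” over threshold; when its precondition fails, its activation drains across to “turn left”, which then fires instead.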
Partly due to its neural-network inspiration, the process architectures that result from implementing Action-Selection are complex, requiring a second decision network to keep the implementation relatively neutral with respect to behaviours.

5. Motor Schema

Arkin’s Motor Schema approach to control uses multiple concurrent schemas (behaviours), active during the completion of a high-level task [18]. Two types of schema are employed in the architecture: motor and perceptual. Motor schemas are behaviours that control the motion or activity of the robot, such as ‘stay on path’ or ‘avoid obstacles’. Perceptual schemas identify features and conditions in the environment that provide the data necessary for a given motor schema to function; for example, “find terrain” might supply a clear path vector to “stay on path”. In Arkin’s system, multiple schemas may effect action at the same time, and these actions are merged through vector addition of potential fields. Groupings of perceptual and motor schemas which achieve a given task are known as assemblages. Some assemblages may be present throughout the entire runtime of the control program, such as those that provide emergency-stop or hazard-avoidance facilities. Additionally, a planner module may load and unload different assemblages based on the input from the perceptual schemas connected to it. This mechanism provides effective re-use of components, as parameterised perceptual and motor schemas can be re-used across multiple assemblages. To illustrate the Motor Schema approach, an example program is presented in Figure 4, written for the LynxMotion AH3-R Hexapod robot [8] as described in section 1. The sample program shown in Figure 4 has two assemblages which are loaded constantly: one to detect body tilt (detect.tilt) and keep the robot level, and another to detect obstacles in the path of the robot (detect.obstacles) and generate a vector away from them.
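The "vector addition of potential fields" mentioned above can be illustrated with a small sketch. This Python fragment is a hedged example, not taken from the paper's hexapod program: an attractive goal field and a repulsive obstacle field are each computed by a schema-like function, and the concurrent outputs are merged by plain vector addition. The gains and field shapes are invented for illustration.

```python
import math

def avoid_obstacle(robot, obstacle, gain=1.0):
    """Repulsive vector pointing away from an obstacle, weaker with distance."""
    dx, dy = robot[0] - obstacle[0], robot[1] - obstacle[1]
    d = math.hypot(dx, dy) or 1e-9
    return (gain * dx / d**2, gain * dy / d**2)

def move_to_goal(robot, goal, gain=1.0):
    """Attractive unit vector towards the goal position."""
    dx, dy = goal[0] - robot[0], goal[1] - robot[1]
    d = math.hypot(dx, dy) or 1e-9
    return (gain * dx / d, gain * dy / d)

def blend(*vectors):
    """Merge concurrent schema outputs by vector addition."""
    return (sum(v[0] for v in vectors), sum(v[1] for v in vectors))

robot = (0.0, 0.0)
v = blend(move_to_goal(robot, (10.0, 0.0)),
          avoid_obstacle(robot, (2.0, 0.0)))
# attraction (1.0, 0) plus repulsion (-0.5, 0): net drive towards the goal
```

Each schema contributes independently and knows nothing about the others; only the summation point combines their influences, which is what makes the schemas re-usable across assemblages.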
The planner is able to load additional assemblages based on perceptual schemas which are connected to it.

[Figure 4 depicts the process network: perceptual schemas (detect.tilt, detect.obstacles, motion.track, detect.lateral.motion, detect.approach) feed motor schemas (balance, avoid.obstacle, move.toward, back.off), whose output vectors are combined by motor.sum; in the planner state shown, the task “move toward moving object on lateral motion” is active and “back away from something approaching” is inactive.]

Figure 4. A motor-schema based control program to navigate a robot to investigate motion and run away if approached.

[Figure 5 depicts the planner state machine: from Start the planner waits for motion; lateral motion triggers “Move toward”, which returns to waiting when lateral motion = 0; being approached triggers “Back off”, which returns to waiting when approach = 0.]

Figure 5. State machine of the planner for an example control program using Motor Schemas which investigates motion and runs away if approached.

In this case we have two perceptual schemas connected to the planner: one which detects lateral motion (detect.lateral.motion) and loads an assemblage to move towards (investigate) the source of the motion, and another which detects an approaching object (detect.approach) and loads an assemblage that backs away from it. The state machine for the planner is shown in Figure 5. Motor Schemas provide the reactive component of Arkin’s Autonomous Robot Architecture (AuRA), which combines the aforementioned planner with a spatial reasoner and other deliberative levels of function [19].

5.1. Implementation

The key primitive for the implementation of Motor Schemas is the vector sum, an implementation of which is shown in Listing 1. The example provided is suitable for controlling motion in a 2D plane; it is straightforward to add more components to allow control in 3D space (x, y, z), allowing the system’s behaviours to influence height or tilt. The planner in a Motor Schema based system is a custom-built state machine which uses perceptual schemas to determine when to change between states.

PROTOCOL VECTOR IS REAL32; REAL32:

PROC motors (VAL []REAL32 gain, CHAN []VECTOR in?)
  MOBILE []REAL32 x, y:
  SEQ
    x := MOBILE [SIZE in]REAL32
    y := MOBILE [SIZE in]REAL32
    SEQ i = 0 FOR SIZE in
      x[i], y[i] := 0.0, 0.0
    WHILE TRUE
      INITIAL REAL32 x.v IS 0.0:
      INITIAL REAL32 y.v IS 0.0:
      SEQ
        ALT i = 0 FOR SIZE in
          in[i] ? x[i]; y[i]
            SEQ
              x[i] := x[i] * gain[i]
              y[i] := y[i] * gain[i]
        SEQ i = 0 FOR SIZE in
          SEQ
            x.v := x.v + x[i]
            y.v := y.v + y[i]
        -- drive motors
        ...
:

Listing 1. An occam-π implementation of the vector sum primitive which allows for the control of motion in a 2D plane.
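The custom-built planner state machine that accompanies the vector sum can also be sketched. The following Python fragment is a hedged illustration of a planner like the one in Figure 5; the state and input names follow the figure, but the exact transition logic is inferred, not taken from an implementation.

```python
def planner_step(state, lateral_motion, approached):
    """Return the next planner state given the perceptual schema readings.

    States: 'wait.for.motion', 'move.toward', 'back.off'.
    Inputs: lateral_motion and approached are booleans derived from the
    detect.lateral.motion and detect.approach perceptual schemas.
    """
    if approached:                      # detect.approach overrides any state
        return "back.off"
    if state == "back.off":
        return "wait.for.motion"        # approach gone: resume waiting
    if state == "wait.for.motion":
        return "move.toward" if lateral_motion else state
    if state == "move.toward":
        return state if lateral_motion else "wait.for.motion"
    return "wait.for.motion"            # 'start' and unknown states

state = "wait.for.motion"
state = planner_step(state, lateral_motion=True, approached=False)   # move.toward
state = planner_step(state, lateral_motion=False, approached=False)  # wait again
state = planner_step(state, lateral_motion=False, approached=True)   # back.off
```

In the actual architecture, each state change would load or unload the corresponding assemblage of perceptual and motor schemas.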
6. Distributed Architecture for Mobile Navigation (DAMN)

Rosenblatt’s Distributed Architecture for Mobile Navigation (DAMN) combines independent, asynchronous modules with arbiters performing command fusion via a voting mechanism [20]. The overall goals of the system are prioritised via the weighting of the votes placed by each module. Arbiters make a decision on the set of votes received within their time step. This provides asynchronous operation of system components and allows the behaviours to be a mix of deliberative and reactive modules, emitting decisions at their natural rates. Arbiters in DAMN offer a set of commands to behaviours; a steering arbiter might offer varying degrees of turn, and the behaviour modules would then be able to vote on each possibility. Votes made by behaviours are normalised, and the command with the highest vote total is selected.
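The voting scheme just described can be sketched as follows. This Python fragment is a hedged illustration of a DAMN-style steering arbiter, not Rosenblatt's implementation: each behaviour votes over a fixed command set, its ballot is normalised and weighted, and the arbiter selects the command with the highest combined vote. The behaviour names, weights and command set are invented.

```python
COMMANDS = ["hard.left", "left", "straight", "right", "hard.right"]

def arbitrate(votes, weights):
    """Fuse one time step of votes into a single command.

    votes:   {behaviour: {command: vote}} with votes in [-1, 1]
    weights: {behaviour: weight} reflecting behaviour priority
    """
    totals = {c: 0.0 for c in COMMANDS}
    for behaviour, ballot in votes.items():
        # Normalise each behaviour's ballot so no module can shout.
        norm = sum(abs(v) for v in ballot.values()) or 1.0
        for command, vote in ballot.items():
            totals[command] += weights.get(behaviour, 1.0) * vote / norm
    return max(COMMANDS, key=lambda c: totals[c])

votes = {
    "avoid.obstacles": {"left": 0.8, "straight": -0.8},  # obstacle ahead
    "seek.goal":       {"straight": 0.6, "right": 0.4},
}
weights = {"avoid.obstacles": 2.0, "seek.goal": 1.0}
command = arbitrate(votes, weights)   # obstacle avoidance wins: "left"
```

A mode manager would implement sequential behaviour simply by rewriting the `weights` table between stages of operation, leaving the behaviours themselves untouched.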
[Figure 6 depicts the DAMN arbiter receiving votes from the behaviour modules (Seek lateral motion, Back off when approached, Move toward, Move away, Detect lateral motion, Detect approach, Balance, Avoid obstacles) and sending commands to the robot controller.]
Figure 6. A DAMN based control program to navigate a robot to investigate motion and run away if approached. The arbiter sends commands to the robot itself based on the votes made by behaviours.
Arbiters may be connected to an adaptive mode manager, allowing the weighting of different behaviours to be changed while the system is running. A mode manager such as SAUSAGES altering the vote weightings allows for sequential action [21]. For example, a robot might have a primary stage of operation where it locates all target objects (soda cans, red balls, etc.) and a secondary stage where it retrieves all of those target objects. A mode manager would first weight highly the behaviours responsible for target finding, then reduce those weights and increase those of the behaviours responsible for retrieving the objects. A sample program implemented using DAMN is shown in Figure 6. It performs the same task as the earlier example implemented using a Motor Schema approach, using the six-legged walker to approach moving objects and back away from objects that approach it.

6.1. Implementation

The development of DAMN-based robot control systems in occam-π requires the implementation of ‘arbiter’ processes for each actuator to be controlled by the system. Each of these arbiters offers the behaviour modules of the system an appropriate set of choices of action to take with the actuator it controls.

7. Conclusions

Whilst examining a number of behavioural architectures in the light of process-orientation, we have found a number of design strategies and primitives which “fit” well with our programming model. Brooks’ Subsumption Architecture has previously shown promise when used for robotics in occam-π [7], and the adaptations to rules introduced by Brooks’ later work, involving the use of communication patterns to control inhibition and suppression, further enhance its suitability. A lack of modularity, identified both by ourselves and by the authors of other related control systems, means that systems structured using the Subsumption Architecture tend to run into scaling problems.
This is due to the tight bindings established between behaviours inside the layers themselves. Connell’s Colony Architecture deals with some of the scalability issues from Subsumption, structuring behaviours around the effectors they control instead of into tight layers. This
structuring means that the suppressor becomes the main primitive, with priority hierarchies of behaviour built up around the effectors. A major advantage of the Colony Architecture is that behaviours do not end up tightly bound to each other, allowing it to offer better scalability than Subsumption. Maes’ Action-Selection, being a neural-network-influenced approach, is significantly different from the Subsumption and Colony architectures because it does not rely on message passing or finite state machines in its definition. This architecture takes no advantage of the communicating process model in the actual robot control code, which is not ideal for meshing with other architectural components and process-oriented hardware interfaces. Additionally, the need to implement a neural-network-based decision layer separately from the actual robot control processes, together with the highly connected nature of that decision layer, makes implementation complex. In support of this view, Arkin notes in [22] that “no evidence exists of how easily the current competence module formats would perform in real-world robotic tasks”, due to the lack of implementation on actual robot platforms. Arkin’s Motor Schemas offer a method of command fusion which is simple and effective, matching our typical use of process-orientation in robotic control and offering flexibility for the kinds of motion possible with different platforms. In the context of the wider AuRA architecture, the finite-state-machine-based planner and the re-use of perceptual schemas make this approach to control modular and flexible, allowing both deliberative and reactive behaviours to be expressed in the same way. Where Motor Schemas offer a blending approach to resolving the actions of many behaviours into one coherent choice, Rosenblatt’s DAMN allows the use of an arbitration-based approach.
Behaviours vote on potential actions, and the correct action to take is decided via a centralised arbiter, customised for the potential actions that can be taken with each effector. DAMN provides a framework which can exploit message passing for decision making without the implementation overhead and connection complexity of a neural network, whilst allowing complete freedom in the internal structure of the voting behaviours. The asynchronous nature of DAMN also allows a seamless combination of deliberative and reactive components, each working at its own frequency, with their relative influence adjustable through weighting. An AuRA-like planner or other mode manager could also easily be used to provide sequential or adaptive behaviour, providing a coherent framework.

8. Future Work

This paper has reviewed and introduced a number of behavioural robotics architectures in the context of process-oriented programming, specifically using occam-π. It would be highly beneficial to implement a single control program using each architecture in turn, to provide metrics for detailed comparison and evaluation of their process-oriented implementations. The creation of a library and documentation providing primitives and development methodologies for the Subsumption, Colony, Motor Schema and DAMN architectures would provide useful assistance in writing robotics programs using occam-π. Further exploration of hybrid approaches, using multiple architectures in a homogeneous process-oriented environment, may yield useful combinations for achieving particular goals. The authors envisage that a unified library of components for building programs using different architectures would be greatly beneficial in the context of a visual robotics programming environment, such as that described in [23]. Architectures could also be considered in the design of hardware interfaces to platforms for use in process-oriented robotics.
For example, effective use of DAMN could be achieved via the provision of arbiters inside the hardware interface itself, building scalability and the design paradigm in from the hardware up.
Having examined existing architectures for their suitability to process-oriented control, it would be interesting to explore the development of systems that take full advantage of the language features present in occam-π, such as barriers and dynamic network creation. Additionally, the development of architectures and design patterns which lend themselves to modelling in CSP [24] and formal verification via tools such as Formal Systems’ FDR [25] may allow the creation of provably reliable control applications.

Acknowledgements

Jon Simpson is funded by the Computing Laboratory at the University of Kent and the EPSRC DTA. Carl Ritson is funded by EPSRC grant EP/D061822/1. We thank all of those who continue to work on the Transterpreter project [26], especially Matt Jadud and Christian Jacobsen, whose ideas on robot architectures have been formative to this work. We also thank the anonymous reviewers for their comments, which helped us improve and shape this paper.

References

[1] Erann Gat. On Three-Layer Architectures. In David Kortenkamp, R. Peter Bonnasso, and Robin Murphy, editors, Artificial Intelligence and Mobile Robots, pages 195–210. AAAI Press, 1997.
[2] Marvin Minsky. The Society of Mind. Simon & Schuster, Inc., New York, NY, USA, 1986.
[3] P.H. Welch and F.R.M. Barnes. Communicating Mobile Processes: Introducing occam-π. In A.E. Abdallah, C.B. Jones, and J.W. Sanders, editors, 25 Years of CSP, volume 3525 of Lecture Notes in Computer Science, pages 175–210. Springer Verlag, April 2005.
[4] Christian L. Jacobsen and Matthew C. Jadud. The Transterpreter: A Transputer Interpreter. In Communicating Process Architectures 2004, pages 99–107, 2004.
[5] Jonathan Simpson, Christian L. Jacobsen, and Matthew C. Jadud. A Native Transterpreter for the LEGO Mindstorms RCX. In Alistair A. McEwan, Steve Schneider, Wilson Ifill, and Peter H. Welch, editors, Communicating Process Architectures 2007, volume 65 of Concurrent Systems Engineering, Amsterdam, The Netherlands, July 2007. IOS Press.
[6] Matthew C. Jadud, Christian L. Jacobsen, Carl G. Ritson, and Jonathan Simpson. Safe parallelism for behavioral control. In IEEE International Conference on Technologies for Practical Robot Applications (TePRA), November 2008.
[7] Jonathan Simpson, Christian L. Jacobsen, and Matthew C. Jadud. Mobile Robot Control: The Subsumption Architecture and occam-π. In Frederick R. M. Barnes, Jon M. Kerridge, and Peter H. Welch, editors, Communicating Process Architectures 2006, pages 225–236, Amsterdam, The Netherlands, September 2006. IOS Press.
[8] Lynxmotion, Inc. AH3-R Walking Robot. http://www.lynxmotion.com/Category.aspx?CategoryID=92.
[9] Peter H. Welch and Neil Brown. The JCSP Home Page: Communicating Sequential Processes for Java. http://www.cs.kent.ac.uk/projects/ofa/jcsp/, March 2008.
[10] Otto J. Anshus, John Markus Bjørndalen, and Brian Vinter. PyCSP - Communicating Sequential Processes for Python. In Alistair A. McEwan, Wilson Ifill, and Peter H. Welch, editors, Communicating Process Architectures 2007, volume 65 of Concurrent Systems Engineering, pages 229–248, Amsterdam, The Netherlands, July 2007. IOS Press.
[11] Jan F. Broenink and Dusko S. Jovanovic. Graphical Tool for Designing CSP Systems. In Ian R. East, David Duce, Mark Green, Jeremy M. R. Martin, and Peter H. Welch, editors, Communicating Process Architectures 2004, pages 233–252, September 2004.
[12] Mobile Robots, Inc. Pioneer 3-DX Mobile Robot. http://www.activrobots.com/ROBOTS/p2dx.html.
[13] Rodney A. Brooks. A Robust Layered Control System for a Mobile Robot. IEEE Journal of Robotics and Automation, 2(1):14–23, March 1986.
[14] Rodney A. Brooks. A robot that walks; emergent behaviors from a carefully evolved network. Technical report, MIT, Cambridge, MA, USA, 1989.
[15] Jonathan H. Connell. A colony architecture for an artificial creature. Technical report, Cambridge, MA, USA, 1989.
[16] Pattie Maes. The dynamics of action selection. In IJCAI, pages 991–997, 1989.
[17] Pattie Maes. How to do the right thing. Technical report, Massachusetts Institute of Technology, Cambridge, MA, USA, 1989.
[18] Ronald C. Arkin. Motor schema based navigation for a mobile robot: An approach to programming by behavior. In IEEE International Conference on Robotics and Automation, volume 4, pages 264–271, March 1987.
[19] Ronald C. Arkin and Tucker Balch. AuRA: Principles and Practice in Review. Journal of Experimental and Theoretical Artificial Intelligence, 9:175–189, 1997.
[20] Julio K. Rosenblatt. DAMN: A Distributed Architecture for Mobile Navigation. In H. Hexmoor and D. Kortenkamp, editors, Proceedings of the 1995 AAAI Spring Symposium on Lessons Learned from Implemented Software Architectures for Physical Agents, Menlo Park, CA, March 1995. AAAI Press.
[21] Jay Gowdy. SAUSAGES: Between planning and action. Technical Report CMU-RI-TR-94-32, Robotics Institute, Pittsburgh, PA, September 1994.
[22] Ronald C. Arkin. Behavior-based Robotics. MIT Press, Cambridge, MA, USA, 1998.
[23] Jonathan Simpson and Christian L. Jacobsen. Visual Process-oriented Programming for Robotics. In Peter H. Welch, Susan Stepney, Fiona A.C. Polack, Frederick R. M. Barnes, Alistair A. McEwan, Gardiner S. Stiles, Jan F. Broenink, and Adam T. Sampson, editors, Communicating Process Architectures 2008, volume 66 of Concurrent Systems Engineering, pages 365–380, Amsterdam, The Netherlands, September 2008. IOS Press.
[24] C. A. R. Hoare. Communicating Sequential Processes. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1985.
[25] Formal Systems (Europe) Ltd., 3 Alfred Street, Oxford OX1 4EH, UK. FDR2 User Manual, May 2000.
[26] The Transterpreter Project. The Transterpreter Project - Concurrency, Everywhere. http://www.transterpreter.org/, July 2009.
Communicating Process Architectures 2009 P.H. Welch et al. (Eds.) IOS Press, 2009 © 2009 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-60750-065-0-387
HW/SW Design Space Exploration on the Production Cell Setup

Marcel A. GROOTHUIS and Jan F. BROENINK
Control Engineering, Faculty EE-M-CS, University of Twente, P.O. Box 217, 7500 AE Enschede, The Netherlands
{M.A.Groothuis, J.F.Broenink}@utwente.nl

Abstract. This paper describes and compares five CSP based and two CSP related process-oriented motion control system implementations that were made for our Production Cell demonstration setup. Five implementations are software-based and two are FPGA hardware-based. All implementations were originally made with different purposes, investigating different areas of the design space for embedded control software, resulting in an interesting comparison between approaches, tools, and software and hardware implementations. Common to all implementations are the use of a model-driven design method, a communicating process structure, the combination of discrete event and continuous time, and the fact that real-time behaviour is essential. This paper shows that many small decisions made during the design of all these embedded control software implementations influence our route through the design space for the same setup, resulting in seven different solutions with different key properties. None of the implementations is perfect, but they give us valuable information for future improvements of our design methods and tools.

Keywords. CSP, embedded systems, mechatronics, motion control, real-time, FPGA, Handel-C, gCSP, 20-sim, POOSL, Ptolemy II, QNX
Introduction

A typical mechatronic system design consists of a combination of a mechanical system; AD/DA, mixed-signal and power electronics; and an embedded control system (examples: cars, printers, robots, airplanes). The goal of the embedded control system is to control the behaviour of the mechanical system in a predefined way. The design of the embedded control software for these mechatronic systems has a large design space, with many multidisciplinary factors that influence the route from idea to working realization and the amount of time required to design these systems. To find an optimal design for the embedded control software, we need to investigate (or explore) the various possible solutions in our design space. This paper describes our Design Space Exploration work on the Production Cell setup in our laboratory. This setup consists of a mechanical system with multiple axes that operate in parallel (see section 1.1 for more information). In the past few years we have made seven different designs for the (embedded control) software for this setup, exploring various possible solutions for implementing the software. Common to all these implementations, besides the setup, is the use of a model-driven design flow and a process-oriented framework for the software, consisting of a combination of an event-driven software part and a time-triggered software part. Five CSP based implementations of the software were made, along with two other process-oriented but non-CSP based implementations, as shown in Table 1. All versions focus on different choices in the design space. This paper compares and evaluates all implementations from
Table 1 by looking at various aspects, like computational resource usage, accuracy, amount of design time, the trade-off between a CPU-based and an FPGA-based hardware implementation, and the choice of real-time operating system.

Table 1. Production Cell Embedded Control System Software implementations.

Nr.  Name                 Data type       Target  Explanation  Realization
A    gCSP RTAI            floating point  CPU     [1]          yes
B    POOSL                floating point  CPU     [2]          yes
C    Ptolemy II           floating point  CPU     [3]          yes
D    gCSP QNX             floating point  CPU     [4]          partial
E    gCSP Handel-C int    integer         FPGA    [5, 6]       yes
F    gCSP Handel-C float  floating point  FPGA    [7]          yes
G    SystemCSP            -               -       [8]          no
Section 1 contains background information on the Production Cell setup, embedded control system software, our design method, and the languages and tooling used. Section 2 describes the important aspects of an embedded control system software design and how design choices influence our route through the design space. Sections 3 and 4 describe the CPU and FPGA designs for the Production Cell setup, followed by an evaluation and discussion in section 5 and conclusions and ongoing work in section 6.

1. Background

This section contains a brief description of the Production Cell demonstration setup, followed by a description of the structure of embedded control system software and the design method we use to design the software. The last subsection introduces the languages and tools used for the implementations from Table 1.

1.1. Production Cell Setup

The setup that is used for all process-oriented controller implementations described in this paper is a mock-up of an industrial production line system (in our case a plastics moulding machine). The production cell setup [5, 9] is a circular system that consists of 6 production cells that operate simultaneously and semi-independently. Each of these cells, called Production Cell Units (PCUs), executes a single action in the production process. Figure 1 shows an overview of the setup used. Its main goal is to pass along metal blocks and to execute (pseudo-)actions on these blocks, like moulding, extraction from the machine, transportation (belts) and storage. The storage part is simulated by a rotation unit that picks up a block at the end of the production process and transfers it again to the beginning of the setup, resulting in a loop. Sensors (located before and after the PCUs) detect the blocks and generate external events for the PCUs, so the PCUs can perform their jobs. The loop, in combination with the sensor-event-triggered PCUs, can result in a deadlock when the setup contains 8 or more blocks.
The setup needs at least one free buffer position (located next to the sensors) in order to be able to move blocks to the next PCU. When all sensor-guarded buffer positions are occupied, the system cannot move anymore, resulting in a deadlock. The mechanical setup is connected via power and interface electronics to an embedded PC with an FPGA I/O card, which runs the embedded control software.

Figure 1. The Production Cell setup.

1.2. Embedded Control System Software

The combination of a mechanical setup and embedded (control) software for motion control systems and robotics requires a multi-disciplinary and synergistic approach to its design. The dynamic behaviour of the mechanics influences the behaviour of the software and vice versa. Therefore they should be designed together to find an optimal and dependable realization of the entire mechatronic setup. The purpose of an embedded control system is to control physical processes (like mechatronic setups). For this paper, the purpose is to control and co-ordinate mechanical movements (position, velocity and acceleration) to obtain smooth and precise movement. A typical embedded control system software design often contains a layered structure [10], with layers for: user interfacing, supervisory control, sequence control (order of actions), loop control (control law and motion profiles), safety purposes, and measurement and actuation. The embedded control system software is a combination of an event-driven part and a time-triggered part, with different and often challenging (real-)time requirements for the different layers. Hard real-time behaviour is, for example, required for the last two layers. The control laws in the loop control layer require a periodic time schedule with hard deadlines, in which jitter and latency are undesirable. For the Production Cell setup, the sequence control, loop control and safety layers are essential to the implementation.

1.3. Design Method

The design method used for designing the embedded control system (ECS) software for mechatronic systems is based on model-driven design with close cooperation between the involved disciplines. Concurrent design techniques are used to shorten the total time from idea to realization. The loop controllers are, for example, designed concurrently with the other ECS software layers [11]. In the design flow for the loop controllers, the starting point is a physical system model (a model of the mechanical setup).
From this model, the control engineer derives the required control algorithm, based on the assumption of continuous time and floating-point calculations, and verifies it by simulation in, for example, 20-sim [12]. The next step is to incorporate target behaviour (discrete time, AD/DA effects, signal delays and scaling) via stepwise refinement into the design, before the loop controllers can be integrated into the ECS software design. Concurrently, the other ECS software layers are designed, starting from an abstract top-level model that is extended via stepwise refinement into a complete ECS software design. In order to prevent integration problems between the software, the control laws and the setup, co-simulation tests between the software and the physical system model can be performed as early integration tests [11].
1.4. Used Tools and Languages

The production cell implementations from Table 1 were made using various modelling and implementation languages and tools; this section introduces them briefly. The 20-sim modelling and simulation tool (commercial) is used in implementations A, B, D and F to model the dynamic system (the mechanical part) and to design the control laws and motion profiles (the trajectory to follow) for the axis movements. The embedded control system design models for implementations A, D, E and F were made using our graphical CSP tool (academic), gCSP [13], based on the GML language (a graphical notation for CSP) [14]. gCSP diagrams contain information about compositional relationships (SEQ, PAR, PRI-PAR, ALT and PRI-ALT) and communication relationships (rendezvous channels). The tool supports animation/simulation [15] of these diagrams and code generation of CSPm code (for deadlock and livelock checking with FDR2 or ProBE), occam code, C++ code (using the CTC++ library (implementation A) [16]) and Handel-C code. Implementations E and F use the Handel-C [17] (commercial) hardware description language to implement the CSP based embedded control system in an FPGA. The ECS software for implementation B is made using the Parallel Object Oriented Specification Language (POOSL) [18] in combination with the Shesim simulator and the Rotalumis execution engine for POOSL (academic). The POOSL language has a formal background based on timed CCS [19]. Implementation C is entirely modelled in Ptolemy II [20]. Ptolemy II is a heterogeneous modelling and simulation tool (academic) that allows for creating multi-domain models using different Models of Computation (MoC) in a hierarchical model structure, consisting of actors (comparable to submodels or processes) and directors. The director determines the domain and the model of computation that is used by the simulator when executing an actor. Communication between actors takes place via channels connected to ports.
Each port uses a receiver that determines the exact behaviour (FIFO, mailbox or CSP rendezvous) of a channel, according to its domain. Examples of Ptolemy II domains that can be used to model the Production Cell and its ECS software layers are the continuous-time (CT), discrete-event (DE), synchronous dataflow (SDF), rendezvous/CSP and finite state machine (FSM) domains. Implementation G is modelled using the SystemCSP language. SystemCSP is based on the principles of both component-based design and the CSP process algebra, offering a more structured approach and more expressiveness than the occam-like GML approach used by gCSP [8].

2. Design Space Exploration

The optimal design of embedded control software that is flexible, dependable and cost-efficient, and that takes into account all kinds of functional and non-functional constraints, like real-time requirements and time-to-market, is complex and has a large design space. The design pyramid in Figure 2 shows that an idea can be realized in many ways. During the route from idea to final realization, many design decisions need to be made, which all have their own influence on the final result. By realization we mean the software implementation that runs on the setup, so the total system including the embedded control system software. Every decision restricts the design space and starts a new, smaller design pyramid. The reachable solutions (the feasible design space), whether optimal or not, depend on all these decisions. For example, the architecture choice between a CPU (a) or an FPGA (b) results in different sub-design spaces with different feasible solutions. Typical decisions for the Production Cell setup that influence the final realization are:
[Figure 2 depicts a design pyramid: from the idea at the top, through the requirements, specification, architecture and implementation project phases, the level of detail increases while the abstraction level decreases; exploring alternatives, each design choice (e.g. (a) versus (b)) narrows the feasible design space towards different solutions and final realizations.]
Figure 2. Design Pyramid with different abstraction levels (adapted from [21]).
• Choice of the modelling formalisms and languages;
• Operating system choice: general purpose or dedicated (real-time) operating system;
• Architecture trade-offs: CPU ⇔ FPGA, distributed ⇔ centralized design, parallel ⇔ sequential design;
• FPGA solution: use natural parallelism (high resource usage) ⇔ sequential solution (lower resource usage), and resource usage ⇔ design time.

The next two sections describe the CPU and FPGA based embedded control system implementations, followed by an evaluation of the design choices and their effect on the realization.

3. CPU Implementations

This section describes all embedded control system software implementations from Table 1 that run on an embedded PC/104 platform with a 600 MHz x86 CPU, equipped with an FPGA based digital I/O board that connects to the Production Cell setup.

3.1. gCSP RTAI (Implementation A)

The gCSP RTAI implementation is the first completely working embedded control system (ECS) software implementation for the Production Cell setup. The ECS software structure is modelled in and generated from gCSP, and manually combined with the loop controllers and motion profiles that are modelled in and generated from 20-sim. The compiled code runs under RTAI real-time Linux [22]. The focus of this implementation was a proof of concept for gCSP in combination with its CTC++ library in an environment that requires real-time guarantees. LinkDrivers [13] are created and used to provide channel communication with the hardware. In order to provide the required periodic timing behaviour (for the loop controllers) to the (untimed) CSP program, TimerChannels are used to synchronize the controller processes with the OS timer (an environmental process that provides periodic tocks [23]). The layered software structure (section 1.2) is implemented using prioritized PAR constructs, to be able to prioritize the loop controller above the other layers.
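The TimerChannel idea can be sketched outside occam-π and CTC++. In the following hedged Python illustration, threads and a size-1 queue stand in for processes and a CSP channel: a timer process writes periodic 'tock' events to the channel, and the otherwise untimed controller process blocks on it, so one control step runs per period. The period and event counts are invented for illustration.

```python
import queue
import threading
import time

def timer_process(channel, period_s, n_tocks):
    """Environmental process: emit periodic 'tock' events on the channel."""
    for tock in range(n_tocks):
        time.sleep(period_s)
        channel.put(tock)

def controller_process(channel, n_tocks, log):
    """Loop controller: block on the timer channel, then run one step."""
    for _ in range(n_tocks):
        tock = channel.get()   # rendezvous-like synchronization point
        log.append(tock)       # the control-law calculation would go here

channel = queue.Queue(maxsize=1)   # size-1 queue approximates a CSP channel
log = []
t = threading.Thread(target=timer_process, args=(channel, 0.01, 5))
c = threading.Thread(target=controller_process, args=(channel, 5, log))
t.start(); c.start(); t.join(); c.join()
# log now holds the five tocks, one per 10 ms period
```

Unlike a real-time kernel such as RTAI, this sketch gives no hard deadline guarantees; it only shows how a timer channel imposes periodic behaviour on an untimed process network.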
The gCSP RTAI version is based on a bottom-up design approach, starting with the hardware drivers and loop controllers and extended via a single PCU implementation towards a complete CSP-based embedded control software implementation. Figure 3 shows an example of the top-level gCSP model. This work proved that gCSP and CSP processes and channels are usable and suitable for creating ECS software that is formally verified. Integration of external 20-sim code was straightforward due to the usage of a special "20-sim-to-gCSP-process" code generation template. The final realization worked reasonably well, but showed serious timing problems (missed deadlines) when many (>15) blocks were in the system.

392 M.A. Groothuis and J.F. Broenink / HW/SW Design Space Exploration on the Production Cell Setup

Figure 3. gCSP Production Cell software top-level model.

The generated code in combination with the CTC++ library shows considerable process-switching overhead (many small processes), resulting in a high CPU load. This is now partially solved via optimizations in the CTC++ library. This implementation revealed some serious shortcomings in the gCSP graphical modelling language (GML) and the gCSP tool, namely the lack of support for drawing state machine constructions for implementing a sequence controller, and a (currently) incomplete CSPm code generator. As a consequence, the formal verification is limited to the drawn process network. Contents of code blocks (non-graphical processes) cannot be checked directly without writing a corresponding CSPm implementation by hand. Another missing feature was the ability to simulate (animate or graphically debug) the created gCSP process network to see its (time-dependent) scheduling behaviour. This was solved last year with the implementation of a gCSP animation facility [15], which allows for visual debugging of the process network, using colours for the channels and processes (as depicted in Figure 3) to indicate the status of all processes and channels (e.g. for processes: blocked, running, finished; for channels: free, reader waiting, writer waiting, rendezvous). Furthermore, the animation framework allows for setting breakpoints on processes and it shows the contents of the CSP scheduler queues.

3.2. POOSL (Implementation B)

The non-deterministic timing behaviour of the system under load for the gCSP RTAI implementation was the starting point of the POOSL implementation [2]. One of the strong features of the POOSL language is its predictable timing behaviour (with a formal background, see [24]). The POOSL version is made using a top-down design approach, starting with a top-level discrete-event concurrency model for the process interactions in the system.
This model is extended via stepwise and local (within-process) refinement towards a multi-model-of-computation model (discrete-event (DE) and continuous-time (CT) equations) that is still untimed. The last refinement step is the addition of (real-)time information to the discrete-event part of the model and the integration of the continuous-time parts. The latter run in parallel with the event-driven part and are generated from 20-sim using a POOSL code generation template. This results in a real-time model. The channel interaction of the PCUs is explicitly specified by using two-way, three-way and four-way handshaking patterns for the PCU synchronization.
Figure 4. Handshake synchronization in POOSL (left) and schematic (right).
See Figure 4 for a schematic and a POOSL code example for the Rotation unit synchronization (the Input()() and Output()() sides run in parallel). The behaviour of the (untimed) discrete-event ECS part was completely verified via simulation in Shesim. This also revealed the possible >8-blocks deadlock in the Production Cell system. Although formal verification of POOSL models is possible, using for example UPPAAL [25], it is not used here, because the translation is not yet automated. Compared to the gCSP implementation in section 3.1, the Shesim simulator allowed us to better predict and verify the behaviour of the software (discrete-event + continuous-time parts). The top-down refinement approach allowed us to design the discrete-event (DE) and the continuous-time (CT) parts separately and to integrate them later, by only specifying the required DE-CT interactions and the DE-CT channel interfaces beforehand. The Rotalumis POOSL execution engine, used for the final implementation on the setup, did not allow us to run it in a real-time environment (RTAI real-time Linux in our case). This is a major shortcoming for this implementation. Running as the highest-priority Linux process was the best we could achieve, so we could not give any real-time guarantees. The timing behaviour was, however, surprisingly stable and predictable (under the given non-real-time conditions), which allowed us to run our loop controllers without too much trouble (no stability problems), even under heavy load (about 15 blocks in the system). A minor issue with the POOSL language is that it did not allow us to specify process priorities. In case of a high system load, the loop controllers are, for example, more important than the graphical user interface.

3.3. Ptolemy II (Implementation C)

The focus of the Ptolemy II implementation was on the feasibility of a single-tool solution: modelling the dynamic system behaviour, the motion profiles and control laws, and the embedded software in one single tool and model. Essential for our model-driven design flow is the requirement for code generation, preferably without manual adaptations, in order to run it on the target. As all other implementations require multiple tools and models, and often manual integration of generated code from these models, the Ptolemy approach can significantly reduce the integration effort and the required design time. The Ptolemy II implementation is based on a top-down design approach, similar to the POOSL approach, but now with the entire mechatronic system as top level and the embedded control system as one of its components. The relevant behaviour of the setup (mechanics, electronics, embedded control system software and even a 3D model with the kinematic behaviour) is modelled via local stepwise refinement in one single Ptolemy II
Figure 5. ECS top-level implementation in Ptolemy II and a state machine example.
Figure 6. ECS hierarchy and used models of computation.
model as multiple (composite) actors (processes). These actors are implemented using several different domains (and the corresponding models of computation). Figure 5 shows the structure of the embedded control system actor and an example of the Rotation unit state machine, which we could not yet implement graphically in gCSP. Figure 6 shows the different models of computation that are used for the embedded control system software layers, zoomed in on the Rotation unit. Formal verification via automated exports to model checkers is not yet possible for the various Ptolemy II domains. The entire setup, including the ECS behaviour, was verified by simulations in Ptolemy II. The embedded control system software implementation was generated from Ptolemy as ANSI-C code. Not all available domains in Ptolemy II are mature enough for practical use, as the tool is still under development. The Ptolemy II CSP/Rendezvous domain is, compared to gCSP, of limited use: it has no support for SEQ, no priority support and no code generation support yet. In order to model the dynamic behaviour of the setup in the continuous-time (CT) domain, we had to use transfer function equations instead of graphical diagrams like the "ideal physical models" or bond graphs in 20-sim. We have extended the Ptolemy II library with self-developed building blocks in order to implement motion profiles and PID controllers. For the final implementation, an extension to the ANSI-C code generation facility was necessary in order to get the required real-time behaviour under a real-time operating system (in our case RTAI Linux). On the graphical modelling level, the channel communication through ports between different models of computation was not straightforward. The ports only transfer data; they do not specify sampling behaviour at the boundary of continuous time and discrete time, which can result in unexpected behaviour unless the modeller
adds the required "conversion" actors to the model by hand. The all-in-one model approach proved to be time-saving and convenient for early integration testing; however, Ptolemy II needs quite some extensions before it can be used for a real setup like the Production Cell.

3.4. gCSP QNX (Implementation D)

The QNX real-time operating system [26] is a POSIX-compliant microkernel-based operating system with advanced scheduling capabilities and extensive run-time tracing and profiling capabilities, dedicated to the development of deterministic systems with hard real-time demands. Because we had some serious timing problems with the existing non-pre-emptive CTC++ library (see also section 3.1), the focus of this ECS design was mainly on the creation of a QNX version of our CTC++ library that is API-compatible with our existing Windows/(RTAI) Linux/DOS CTC++ library, but now with a sound timing foundation. The mapping of CSP constructs, processes and channels onto QNX proved to be straightforward. The PAR and PRIPAR constructs are, for example, implemented using QNX POSIX threads in combination with prioritized pre-emptive scheduling, and QNX channels (message passing) are used to provide CSP channels. To allow transparent distributed processing, QNX provides its own distributed networking protocol, QNET. The QNX CTC++ library uses QNET to provide network channels. A TimerChannel and a time-out guard are implemented to provide timing (periodic and time-outs) to CTC++ processes by letting them synchronize with a (discrete) tock event from the environment (the operating system timer). The QNX adaptive partition scheduler is used to guarantee a set of processes (a partition) an amount (e.g. 80%) of the CPU cycles. For more information about this library, see [4]. The current gCSP QNX ECS design is not complete: it only contains the Rotation unit and its interactions (by coincidence the part shown in Figure 4).
This partial design was generated from gCSP to test the backward compatibility of the new library. Oscilloscope measurements and instrumented QNX kernel traces provided exact scheduling and timing information on our embedded control system software design. The required timing was reached with almost no jitter (2 μs for a 1 ms period). The extensive traces also revealed that the operating system overhead for channel communication and thread switching is relatively high (140 μs) compared to our process calculation times (70 μs). So, although the timing is reliable, the mapping of (small) processes onto operating system threads needs some optimization to be useful for the entire setup.

3.5. SystemCSP (Implementation G)

The SystemCSP design of the ECS software for the Production Cell setup can be found in chapter 6 of [8]. This design was made to demonstrate the SystemCSP language. It is not implemented on the Production Cell setup, because a tool and an execution engine for the SystemCSP language do not yet exist. However, the SystemCSP design in [8] does show how the ECS software can be modelled using the SystemCSP notation. The gCSP GML language has no support for drawing state machines (related to CSP primitives), which is possible in SystemCSP. An example of a graphical CSP-based state machine construction with interactions (CSP events) in SystemCSP is shown in Figure 7.

Figure 7. SystemCSP interaction contract between feederbelt, feeder and moulder (adapted from [8]).

4. FPGA Implementations

Our Production Cell setup consists of several Production Cell units running in parallel. Because deterministic real-time timing behaviour is important for an embedded control system, this requires a careful choice of the scheduling algorithm used to emulate this parallelism in software on a sequential processor, while keeping the observable timing behaviour deterministic. Because the Production Cell setup is also equipped with an FPGA-based digital I/O board, and FPGAs can provide deterministic timing and native parallelism, we also made two completely FPGA-based embedded control system implementations: E and F from Table 1. The purpose of these implementations was to investigate the feasibility of an FPGA-based solution and to look at the trade-off between a CPU-based and an FPGA-based solution. This section describes the embedded control system hardware implementations.

4.1. gCSP Handel-C integer (Implementation E)

The gCSP Handel-C integer version of the Production Cell motion control software was the result of a feasibility study on FPGA-based motion control (see our CPA 2008 paper [5] for more information). The main characteristics of this implementation are:

• FPGA choice: exploit inherent parallelism and accurate timing; no usage of a soft-core CPU;
• Usage of Handel-C as hardware description language;
• Design of a decentralized process-oriented layered structure and communication framework for motion control (see Figure 9, Figure 8 and [5]); FDR2 was used to check that the framework is free of deadlocks;
• ECS framework designed in gCSP (see Figure 8); implementation generated using Handel-C code generation;
• Control laws and motion profiles designed in 20-sim (floating point); integer used as native data type for the motion profile and loop controller implementation;
• Implementation on a low-cost Xilinx Spartan III 3s1500 FPGA;
• Integer PID loop controllers run at 1 ms with an idle time of 99.95%;
• Combination of top-down design (ECS framework) and bottom-up design (PID loop controllers).
Figure 8. gCSP Handel-C top level (from [5]).

Figure 9. Production Cell unit structure (from [5]).
The result is a completely working and successful FPGA-based embedded control system realization for the Production Cell, with much better performance with respect to timing accuracy and system load than all CPU-based realizations. This implementation also fits in a relatively small Spartan III FPGA. The FPGA resource usage is measured in the number of internal logic blocks (lookup tables (LUTs), flip-flops (FFs), memory (MEM) and arithmetic logic units (ALUs)) that are needed for the implementation of the design. Table 2 shows the resource usage on a Xilinx Spartan III 3s1500 FPGA for this design.

Table 2. Estimated FPGA resource usage for the integer version (adapted from [5]).

Element          LUTs            Flip-flops       Memory   Used ALUs
PID controllers  13.5%  (4038)    0.4%   (126)     0.0%     0
Motion profiles   0.9%   (278)    0.2%    (72)     0.0%     0
I/O + PCI         3.6%  (1090)    1.6%   (471)     2.3%     0
S&C Framework    10.3%  (3089)    8.7%  (2616)     0.6%     0
Free             71.7% (21457)   89.1% (26667)    97.1%    32

The designed software framework structure turned out to be useful for embedded control system software for this class of mechatronic systems. The same structure is also used for the Ptolemy II and the gCSP QNX versions (sections 3.3 and 3.4). The main disadvantage of the integer implementation was the large design time required for the manual translation of the motion profiles and loop controllers from a floating-point implementation (20-sim) towards an integer implementation. This route does not fit directly into our existing (floating-point based) model-driven design flow.

4.2. gCSP Handel-C floating point (Implementation F)

Figure 10. Routes for floating point on an FPGA (adapted from [7]). The figure maps each route to a method (the Handel-C floating-point library, sequential or pipelined; a Xilinx Coregen core with a Handel-C wrapper; or a soft-core/hard-core CPU with FPU), an accuracy (16-bit or 32-bit), a language (Handel-C or ANSI-C), the FPGA as implementation platform, and a PCU execution order (sequential or parallel); entries marked * are not yet implemented.

The design gap between a floating-point loop controller design and an integer-based loop controller implementation in the previous implementation (section 4.1) is rather large and
requires more design iterations to ensure correct behaviour and a stable loop controller. The choice of the integer datatype for FPGAs is a logical one, because it is the native datatype. However, from a model-driven design flow point of view (including the usage of code generation), a floating-point FPGA implementation of the loop controllers is preferable. A quick and naive attempt to use floating point, in combination with the Handel-C floating-point library, during the design of the integer version resulted in an FPGA resource usage explosion and an implementation that did not fit in the used Spartan III FPGA. This was mainly due to our choice to fully exploit the FPGA's parallelism. At the same time, there was plenty of idle time left in each loop controller calculation loop (idle time 99.95%, period 1 ms) to do other things. Trading in some of the parallelism frees FPGA resources that can be used for a floating-point implementation. There are trade-offs between FPGA resource usage, amount of parallelism, calculation time, design time and loop controller calculation accuracy. The focus of this Production Cell design was to investigate these trade-offs, to see whether we could fit a floating-point implementation in this FPGA. Figure 10 shows the possible implementation routes for a floating-point FPGA implementation of the Production Cell motion profiles and loop controllers:

1. Sequential Handel-C floating-point library with sequential PCU execution;
2. Sequential Handel-C floating-point library with parallel PCU execution;
3. Pipelined Handel-C floating-point library with sequential PCU execution;
4. Pipelined Handel-C floating-point library with parallel PCU execution;
5. A 16-bit floating-point core from Xilinx Coregen with parallel PCU execution;
6. A 16-bit floating-point core from Xilinx Coregen with sequential PCU execution;
7. A soft-core or hard-core CPU with a floating-point unit.

Routes 1, 3, 6 and 7 use sequential PCU execution, which means that the controllers for each PCU run one after another and share the same PID loop controller and motion profile generator FPGA processes, only using different parameters. The necessary motion profiles
Table 3. Test results for floating-point loop controller implementation routes 1-4 (MP = motion profile).

Route                  LUT    LUT %    FF     Mem     Mem %   ALUs  Fits
(1) MP in blockram     5669   21.29    3825   671744  113.89  4     -
(1) MP during runtime  6385   23.98    4531   0       0.00    4     •
(2) MP in blockram     32530  122.18   21492  868352  147.22  24    -
(2) MP during runtime  33967  127.58   22648  0       0.00    24    -
(3) MP in blockram     6508   24.44    4127   671744  113.89  4     -
(3) MP during runtime  6816   25.60    4818   0       0.00    4     •
(4) MP in blockram     31181  117.12   21937  868352  147.22  24    -
(4) MP during runtime  32407  121.72   23792  0       0.00    24    -
Maximum                26624           26624  589824          32

Calculation times (tcalc, with percentage of the 1000 μs loop period): route (1) 41.54 μs (4.15%), route (2) 6.92 μs (0.69%), route (3) 38.41 μs (3.84%), route (4) 6.4 μs (0.64%).
Table 4. Estimated FPGA usage for the floating point version.

Element                           LUTs            Flip-flops      Memory   Used ALUs
Floating-point library + wrapper  27.4%  (8191)   19.7% (5909)     0.0%     4
PID controllers                    4.2%  (1251)    0.3%   (91)     0.0%     0
Motion profiles                    1.1%   (314)    0.5%  (163)     0.0%     0
I/O + PCI                          4.1%  (1250)    1.8%  (534)     2.3%     0
S&C Framework                      5.6%  (1666)    4.2% (1250)     0.3%     0
Free                              57.6% (17280)   73.5% (22005)   97.4%    28
provide the loop controller with a predefined trajectory (position, speed and acceleration setpoints) that the PCU axes should follow. They can be calculated at runtime or stored (hardcoded) in FPGA blockram, resulting in a trade-off between FPGA resources (RAM or LUTs). The Handel-C floating-point library supports pipelined calculation and sequential calculation, which yields another resource usage optimization possibility. Another route (5, 6) for optimization is to lower the floating-point precision from 32 bit to 16 bit, at the cost of calculation accuracy. This is not possible with the Handel-C floating-point library, but it is possible to generate a similar core with the Xilinx Coregen tool. The resource usage results of the different floating-point possibilities are given in Table 3, which is based on the results from [7]. This table does not show routes 5 and 6, because the generated 16-bit Coregen floating-point core turned out to be already bigger than the 32-bit Handel-C library; these routes were not investigated further. Route 7 is also not shown, because it is still future work: the usage of a soft-core requires external RAM, which is not available on our Production Cell FPGA board. Table 3 shows only two feasible routes for the given FPGA: 1 and 3, with motion profile calculation during runtime. The results in Table 3 cover only the 6 controllers and 6 motion profiles, not a complete implementation. The complete FPGA embedded control system with floating-point controllers is implemented based on route 1. The resource usage results of this version are given in Table 4. This version is around 50% larger than the integer version (implementation E), but it has a better calculation accuracy, resulting in a slightly smoother movement. A disadvantage of sharing the PID loop controller and motion profiles among all PCUs is that it requires more effort to use it together with the PCU structure from Figure 9.
A minor disadvantage of the Handel-C floating-point library in combination with C code generation is that it is not ANSI-C compatible: the library uses functions instead of the standard operators, and it uses a floating-point structure containing the integer representations of the sign, exponent and mantissa (separated by commas) instead of a float datatype; see also Table 5. These differences require manual changes in the code generated from 20-sim, or changes in the 20-sim code generator.
Table 5. ANSI-C float versus Handel-C float declaration.

ANSI-C:                      Handel-C:
float a = 0.007;             Float a = {0, 119, 6643777};  // sign, exponent, mantissa
float b = -0.31;             Float b = {1, 125, 2013265};
5. Evaluation

All presented Production Cell embedded control system (ECS) software implementations were made one after another, by different people with different amounts of experience. Some results from one version were used for the others (e.g. the 20-sim controllers). Furthermore, the tools used (mainly academic) have different levels of maturity for our purpose (ECS software). This makes it difficult to give a precise and fair comparison of all these approaches when looking at the required design time, which operating system to choose and which tool or method is the best. However, it is possible to give global observations and guidelines for future embedded control system implementations based on the presented methods, tools and the setup used. Common to all CPU and FPGA implementations is a hierarchical process-oriented implementation. The (CSP) process abstraction, together with a layered structure and standardized building blocks (like Figure 9), is perfectly suitable for ECS designs with a combination of discrete-event and continuous/discrete-time parts. Accurate timing is essential for real-time ECS software; however, the combination of (untimed) CSP and timing poses some questions about how to implement this in practice (see also [8]) to get an efficient and deterministic timing realization. The POOSL language (implementation B) can provide accurate timing, but without real-time guarantees the implementation is of little value at the moment. Implementation A uses our (existing) CTC++ library without pre-emption support, using user-level threading. The timing accuracy here is limited by the channel communication frequency (scheduling is only possible on channel communication). The new QNX CTC++ library, made for implementation D, does use pre-emptive scheduling and provides more accurate timing, but, on the other hand, the usage of operating system threads results in a much larger context switch overhead, especially with many channel communications.
When we compare all implementations, we see that they all contain many small processes with multiple channels to the same neighbour to communicate small amounts of data (simple variables). To become more resource efficient, it is necessary to turn these channel communications into bus transfers or to send multiple variables with a single write action. Furthermore, the small processes should be combined into larger ones with the same behaviour while translating the model into an implementation. The usage of formal checking of the created (graphical) models in combination with automated model-to-formal-language translation reveals that none of the used approaches can currently provide an intuitive and user-friendly way of using formal methods to ensure the correctness of the designed ECS structure. It is possible for implementations A, B, D, E and F, but not yet without manual translation or adaptation/extension of the (gCSP) generated formal descriptions (CSPm). The Ptolemy II all-in-one tool approach is promising with respect to shortening the design time and doing early integration, but this academic tool is not yet mature enough for daily usage in the mechatronics field. The FPGA implementations provide an interesting alternative to the commonly used CPU-based embedded control system implementations in industry, especially when accurate timing and more parallelism than CPU-based solutions can offer are required. The FPGA implementations allow for a single-chip solution, containing both the ECS "software" and the required (digital) I/O hardware for actuation and sensing. They also allow us to reach faster reaction times than possible on x86 PCs. The main disadvantage of an FPGA-based ECS implementation is the required design effort. The design gap between model-driven ECS design and an FPGA implementation is rather large with the current tooling, especially for implementation E (integer), where our floating-point based controllers needed to be translated into an integer implementation. Implementation F (floating point) makes this design gap smaller, at the cost of additional FPGA resources, which may in turn require a sequential implementation, a larger FPGA, or an FPGA with DSP blocks (e.g. the Xilinx Virtex series) that can be used for floating-point calculations.

6. Conclusions and Future Work

The comparison of different design methods and tools for embedded control system (ECS) software for the same setup gives us growing insight into the maturity of the design tools used, which have a mainly academic background, for ECS software design and realization. The different ways of designing the process-oriented ECS software lead to a standardized layered structure, which we can add as building blocks to a (g)CSP library. Having both software and hardware realizations of the ECS "software" for the same setup provides us with useful information about the design trade-off between a CPU-based and an FPGA-based solution. The FPGA solution requires more design time, but it can provide accurate timing without the usage of a real-time operating system. The comparison of all ECS realizations shows that many small decisions made during the design of these realizations influence our route through the design space, resulting in seven different solutions with different key properties. None of the realizations is perfect, but they give us valuable information for future improvements of our design methods and tools. We are currently working on version 2 of gCSP, with suggested improvements like state machines and language elements from SystemCSP, and with a better CSPm translation.
We are also working on an extended version of the ECS software framework from Figure 9, to incorporate vision processing and other Human-Machine Interface (HMI) features for use in our humanoid soccer robot and our robotic head setup.

Acknowledgements

We would like to thank our former MSc students Bert van den Berg, Pieter Maljaars, Kees Verhaar, Jasper van Zuijlen, Bart Veldhuijzen and Thijs Sassen for their final MSc project contributions on the Production Cell setup and its software and hardware motion control implementations. Furthermore, we would like to thank our ViewCorrect project partner and colleague Jinfeng Huang from Eindhoven University for the joint work on the POOSL implementation of the Production Cell control software.

References

[1] P. Maljaars. Controllers for the production cell set up. MSc thesis 039CE2006, Control Engineering, University of Twente, The Netherlands, December 2006. URL http://www.ce.utwente.nl/rtweb/publications/MSc2006/pdf-files/039CE2006_Maljaars.pdf.
[2] Jinfeng Huang, Jeroen P.M. Voeten, M.A. Groothuis, J.F. Broenink, and Henk Corporaal. A model-driven approach for mechatronic systems. In Seventh International Conference on Application of Concurrency to System Design, 2007, Bratislava, Slovakia, pages 127-136, Los Alamitos, July 2007. IEEE Computer Society Press. ISBN 978-0-7695-2902-8. doi: 10.1109/acsd.2007.40.
[3] C.A. Verhaar. An integrated embedded control software design case study using Ptolemy II. MSc thesis 011CE2008, Control Engineering, University of Twente, The Netherlands, May 2008. URL http://purl.org/utwente/e58154.
[4] Bart Veldhuijzen. Redesign of the CSP execution engine. MSc thesis 036CE2008, Control Engineering, University of Twente, February 2009. URL http://purl.org/utwente/e58514.
[5] M.A. Groothuis, J.J.P. van Zuijlen, and J.F. Broenink. FPGA based control of a production cell system. In Communicating Process Architectures 2008, York, United Kingdom, volume 66 of Concurrent Systems Engineering Series, pages 135-148, Amsterdam, September 2008. IOS Press. ISBN 978-1-58603-907-3. doi: 10.3233/978-1-58603-907-3-135.
[6] J.J.P. van Zuijlen. FPGA-based control of the production cell using Handel-C. MSc thesis 008CE2008, Control Engineering, University of Twente, April 2008. URL http://purl.org/utwente/e58152.
[7] Thijs Sassen. Floating-point based control of the Production Cell using an FPGA with Handel-C. MSc thesis 009CE2009, Control Engineering, University of Twente, June 2009. URL http://www.ce.utwente.nl/rtweb/publications/MSc2009/pdf-files/009CE2009_Sassen.pdf.
[8] Bojan Orlic. SystemCSP, a graphical language for designing concurrent component-based embedded control systems. PhD thesis, Control Engineering, University of Twente, The Netherlands, September 2007.
[9] Bert van den Berg. Design of a production cell setup. MSc thesis 016CE2006, Control Engineering, University of Twente, 2006. URL http://www.ce.utwente.nl/rtweb/publications/MSc2006/pdf-files/016CE2006_vdBerg.pdf.
[10] S. Bennett. Real-Time Computer Control: An Introduction. Prentice-Hall, New York, NY, 1988. ISBN 0137641761.
[11] M.A. Groothuis, A.S. Damstra, and J.F. Broenink. Virtual prototyping through co-simulation of a cartesian plotter. In IEEE International Conference on Emerging Technologies and Factory Automation (ETFA), 2008, number 08HT8968C in ETFA, pages 697-700. IEEE Industrial Electronics Society, September 2008. ISBN 978-1-4244-1505-2. doi: 10.1109/etfa.2008.4638472.
[12] Controllab Products. 20-sim website, 2009. URL http://www.20sim.com.
[13] D.S. Jovanovic. Designing dependable process-oriented software: a CSP approach. PhD thesis, University of Twente, Enschede, The Netherlands, 2006.
[14] G.H. Hilderink. Managing complexity of control software through concurrency. PhD thesis, University of Twente, Enschede, The Netherlands, May 2005. URL http://doc.utwente.nl/50746/1/thesis_Hilderink.pdf.
[15] T.T.J. van der Steen. Design of animation and debug facilities for gCSP. MSc thesis, Control Engineering, University of Twente, June 2008. URL http://purl.org/utwente/e58120.
[16] Bojan Orlic and Jan F. Broenink. Redesign of the C++ Communicating Threads library for embedded control systems. In Frank Karelse, editor, 5th PROGRESS Symposium on Embedded Systems, pages 141–156. STW, Nieuwegein, NL, 2004.
[17] Agility Design Systems. Handel-C, 2008. URL http://www.agilityds.com.
[18] B.D. Theelen, O. Florescu, M.C.W. Geilen, J. Huang, P.H.A. van der Putten, and J.P.M. Voeten. Software/hardware engineering with the parallel object-oriented specification language. In 5th IEEE/ACM International Conference on Formal Methods and Models for Codesign, pages 139–148. IEEE, 2007.
[19] Robin Milner. Communication and Concurrency. Prentice-Hall, Englewood Cliffs, 1989. ISBN 9780131149847.
[20] Ptolemy. Ptolemy II website, 2009. URL http://ptolemy.berkeley.edu/ptolemeyII.
[21] Henk Corporaal. Embedded system design. In Frank Karelse, editor, Progress White Papers 2006, pages 7–27. STW, Utrecht, 2006.
[22] RTAI. RTAI website, 2009. URL http://www.rtai.org.
[23] G.H. Hilderink and J.F. Broenink. Sampling and timing a task for the environmental process. In Communicating Process Architectures 2003, pages 111–124. IOS Press, 2003. ISBN 1-58603-381-6.
[24] Jinfeng Huang. Predictability in Real-Time System Design. PhD thesis, Technische Universiteit Eindhoven, The Netherlands, September 2005.
[25] Uppaal. UPPAAL model checker. Website, July 2009. URL http://www.uppaal.com/.
[26] QNX Software Systems. QNX real-time operating system (RTOS) software. Website, June 2009. URL http://www.qnx.com/.
Communicating Process Architectures 2009 P.H. Welch et al. (Eds.) IOS Press, 2009 © 2009 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-60750-065-0-403
Engineering Emergence: an occam-π Adventure

Peter H. WELCH a, Kurt WALLNAU b and Mark KLEIN b
a School of Computing, University of Kent, Canterbury, UK
b Software Engineering Institute, Carnegie Mellon University, Pittsburgh, PA, USA
[email protected], [email protected], [email protected]

Abstract. Future systems will be too complex to design and implement explicitly. Instead, we will have to learn to engineer complex behaviours indirectly: through the discovery and application of local rules of behaviour, applied to simple process components, from which desired behaviours predictably emerge through dynamic interactions between massive numbers of instances. This talk considers such indirect engineering of emergence using a process-oriented architecture. Different varieties of behaviour may emerge within a single application, with interactions between them provoking ever-richer patterns – almost social systems. We will illustrate with a study based on Reynolds' boids: emergent behaviours include flocking (of course), directional migration (with waves), fear and panic (of hawks), orbiting (points of interest), feeding frenzy (when in a large enough flock), turbulent flow and maze solving. With this kind of engineering, a new problem shows up: the suppression of the emergence of undesired behaviours. The panic reaction within a flock to the sudden appearance of a hawk is a case in point. With our present rules, the flock loses cohesion and scatters too quickly, making individuals more vulnerable. What are the rules that will make the flock turn almost-as-one and maintain most of its cohesion? There are only the boids to which these rules may apply (there being, of course, no design or programming entity corresponding to a flock). More importantly, how do we set about finding such rules in the first place? Our architecture and models are written in occam-π, whose processes are sufficiently lightweight to enable a sufficiently large mass to run and be interacted with for real-time experiments on emergent behaviour.
This work is in collaboration with the Software Engineering Institute (at CMU) and is part of the CoSMoS project (at the Universities of Kent and York in the UK). Keywords. complex systems, emergent behaviour, process orientation, mobile processes, occam-π, boids, flocking, migration, maze solving
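The abstract above appeals to Reynolds' boids: three local rules (cohesion, alignment, separation) applied per bird, from which flocking emerges with no "flock" entity anywhere in the program. As a rough illustration only — the talk's occam-π process implementation is not shown here, and every numeric parameter below is an invented placeholder, not a value from the talk — the rules can be sketched as a sequential Python update:

```python
import math

def limit(vx, vy, max_speed):
    """Clamp a velocity vector to max_speed."""
    speed = math.hypot(vx, vy)
    if speed > max_speed:
        scale = max_speed / speed
        return vx * scale, vy * scale
    return vx, vy

def step(boids, radius=5.0, cohesion=0.01, alignment=0.05,
         separation=0.1, min_dist=1.0, max_speed=1.0):
    """One synchronous update of Reynolds' three local rules.

    boids: list of [x, y, vx, vy]. All rule weights are
    illustrative assumptions.
    """
    new = []
    for i, (x, y, vx, vy) in enumerate(boids):
        cx = cy = avx = avy = sx = sy = 0.0
        n = 0
        for j, (ox, oy, ovx, ovy) in enumerate(boids):
            if i == j:
                continue
            d = math.hypot(ox - x, oy - y)
            if d < radius:                   # only neighbours are visible
                n += 1
                cx += ox; cy += oy           # cohesion: steer to neighbours' centre
                avx += ovx; avy += ovy       # alignment: match average heading
                if 0 < d < min_dist:         # separation: push away when too close
                    sx += (x - ox) / d
                    sy += (y - oy) / d
        if n:
            vx += (cx / n - x) * cohesion + (avx / n - vx) * alignment
            vy += (cy / n - y) * cohesion + (avy / n - vy) * alignment
        vx += sx * separation
        vy += sy * separation
        vx, vy = limit(vx, vy, max_speed)
        new.append([x + vx, y + vy, vx, vy])
    return new
```

Note that each boid reads only its neighbours within `radius`; in the occam-π architecture described in the talk, each boid is instead a lightweight process and the neighbourhood is mediated by channel communication, which is what makes real-time experiments with massive numbers of instances feasible.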
Subject Index

20-sim 387
agents 363
alternation 263
alternative 135
application timers 135
ASD 105
behavioural control 375
bigraphs 1
bisimilarity 49
boids 403
cloud computing 185
complex systems 403
concurrency 7, 117, 145, 263, 277, 311, 375
co-routines 277
CoSMoS 197
CSP 67, 89, 105, 117, 263, 277, 293, 311, 325, 363, 387
deadlock analysis 349
denotational semantics 239
design patterns 349
device discovery 363
device driver 105
distributed computing 205
domain-specific languages 293
dynamic languages 293
embedded systems 67, 173, 387
emergent behaviour 403
escape analysis 117
ETC 145
flocking 403
formal methods 105
formal modelling 1
gCSP 67, 387
graph rewriting 49
grid computing 185
Handel-C 387
higher-order communication 49
I2C 105
Java 7
JCSP 197, 363
jcsp.net2 363
language translation 311
Linux kernel 105
LLVM 145
maze solving 403
mechatronics 387
message-passing 225
migration 403
mobile channels 205
mobile processes 403
mobility 225, 239
motion control 387
multi-agent systems 197
object oriented 7
occ21 145
occam-π 117, 145, 403
OpenComRTOS 173
optimisation 225
pedestrian simulation 197
π-calculus 49
POOSL 387
process orientation 403
process scheduling 67
process transformation 67
processes 277
protocol support 205
Ptolemy II 387
PyCSP 263, 277
Python 263, 277, 293, 311
QNX 387
R 185
real-time FPGA 387
real-time programming 349
refinement 1, 239
robotics 375
RTOS 173
scheduling analysis 349
SCOOP 7
semantic models 325
service discovery 363
statistical genetics 185
structural traces 89
synch-point scheduling 135
system engineering 173
tests for availability 325
theory of concurrency 49
threads 277
Toc programming language 349
traces 67, 89
tranx86 145
VCR 89
verification 1
visualisation 89
Author Index

Armitage, A. 363
Bacanu, S.-A. 185
Barnes, F.R.M. v, 117
Barrett, S.J. 185
Bezemer, M.M. 67
Bialkiewicz, J.-A. 239
Bjørndalen, J.M. 263, 277
Bouwmeester, L. 105
Bradshaw, K.L. 311
Broenink, J.F. v, 67, 387
Brown, N.C.C. 89, 225
Chalmers, K. 205
Chechik, M. 7
Clayton, S. 197
Derwig, R. 105
Dunlap, D. 185
Faust, O. 173
Friborg, R.M. 263, 277
Goldsmith, M. 1
Groothuis, M.A. 67, 387
Hammoudeh, M. 293
Hendseth, S. 349
Kauke, B. 159
Kerridge, J. 197, 205, 363
Klein, M. 403
Klomp, A. 105
Korsgaard, M. 349
Kosek, A. 363
Lowe, G. 325
Martin, J.M.R. 185
Mezhuyev, V. 173
Mount, S. 293
Murakami, M. 49
Newman, R. 293
Ostroff, J.S. 7
Paige, R.F. 7
Pecheur, C. 29
Pedersen, J.B. 159
Peschanski, F. 239
Ritson, C.G. v, 145, 375
Roebbers, H. v, 105
Sampson, A.T. v
Simpson, J. 375
Smith, M.L. 89
Sputh, B.H.C. 173
Stiles, G. (Dyke) v
Syed, A. 363
Teig, Ø. 135
Thornber, S.J. 185
Torshizi, F. 7
Tristram, W.B. 311
Urquhart, N. 197
Vander Meulen, J. 29
Vannebo, P.J. 135
Verhulst, E. 173
Vinter, B. v, 263, 277
Wallnau, K. 403
Welch, P.H. v, 403
Weston, S. 185
Wilson, S. 293