Diss. ETH No. 14215
Modular Language Specification and Composition
A dissertation submitted to the SWISS FEDERAL INSTITUTE OF TECHNOLOGY (ETH) ZÜRICH
for the degree of Doctor of Technical Sciences
presented by CHRISTOPH DENZLER, Dipl. Informatik-Ing. ETH, born July 20, 1968, citizen of Muttenz, BL and Schwerzenbach, ZH
accepted on the recommendation of Prof. Dr. Albert Kündig, examiner Dr. Matthias Anlauff, co-examiner
2001
TIK-Schriftenreihe Nr. 43
Dissertation ETH No. 14215
Examination date: June 1, 2001
For my mother, in memory of my grandfather
Acknowledgements
I would like to thank my adviser, Prof. Dr. Albert Kündig, for giving me the opportunity to develop my own ideas and for his confidence in my work. He introduced me to the field of embedded systems, where I learned many lessons on efficiency, simplicity and reusability. I am grateful to Dr. Matthias Anlauff for agreeing to be co-examiner of this thesis. MCS is modelled on his Gem/Mex system; in particular, his advice to represent undef as null saved me many redesigns. I want to thank my colleagues from TIK, Daniel Schweizer, Philipp Kutter, Samarjit Chakraborty, and Jörn Janneck, for insightful discussions. Daniel introduced me to the field of language and program specification, so I was well prepared to join Philipp's Montages approach. Discussing Montages with him always left me with a head full of new ideas. Jörn and in particular Samarjit then helped me to bring some structure into these creative boosts. Special thanks go to Hugo Fierz, whose CIP system inspired my own implementation. Discussing modelling techniques with him gave me valuable insights into good design practice. I also owe my thanks to Hans Otto Trutmann for his FrameMaker templates and his support with this word processor. Discussions with Stephan Missura and Niklaus Mannhart could be truly mind-challenging, as their clear train of thought forced my arguments to be equally precise. Many times, a glance at Stephan's thesis gave me the inspiration needed to continue with mine. Having lunch with Niklaus Mannhart was always a welcome interruption to my work. He also deserves my thanks for proof-reading this thesis. Last but not least, I thank Regula Hunziker, my fiancée, for her love and her support.
Abstract
This dissertation deals with the modularisation of specifications of programming languages. A specification is not partitioned into compiler phases, as is usual, but into modules – called Montages – each of which describes one language construct completely. This partitioning makes it possible to plug specifications of language constructs into the specification of a language and – as Montages contain executable code – thus to build an interpreter for the specified language. The problems that follow from this are discussed on different levels of abstraction. The different character of language specifications on a construct-by-construct basis also demands a different conception of the whole system. Knowledge about processes, such as parsing, has to be distributed over many Montages. But this is made up for by the increased flexibility of Montage deployment. A language construct that has once been successfully implemented for one language can be reused, with only minor adaptations, in many different languages. Well-defined access via interfaces separates Montages clearly, so that changes in one construct cannot have unintended side-effects on other constructs. This dissertation describes the concept and implementation of a system based on Java as a specification language. Reuse of specifications is not restricted to reuse of source code; it is also possible to reuse precompiled components. This makes it possible to distribute and sell language specifications without giving away valuable know-how about their internals. Some approaches to the development, distribution and support of language components will be discussed. A detailed description of the Montage Component System will go into the particulars of decentralised parsing (each Montage can parse only the construct it specifies), explain how static semantics can be processed efficiently, and show how a program is executed, i.e. how its dynamic semantics is processed.
Zusammenfassung
This thesis deals with the modularisation of specifications for programming languages. A specification is not partitioned into compilation phases, as is usual, but into modules – called Montages – each of which describes one language construct completely. This partitioning makes it possible to plug specifications of individual language constructs together into the specification of a whole language. Since Montages contain executable code, an interpreter for the specified language can be assembled in this way. Problems arise on several levels of abstraction, whose solutions are discussed in this thesis. The special way of specifying a language (construct by construct) also calls for a different conception of the whole system. Process knowledge that, in a conventional model, resides in a single phase (e.g. knowledge about parsing) has to be distributed over many individual Montages. The gain, however, is enormous flexibility in the deployment of Montages. A language construct that has been used successfully in one language can be employed in a new language specification with only minimal adaptations. The Montage interfaces separate the individual partial specifications cleanly from one another, so that a change in one construct cannot cause unintended side-effects in other constructs. This thesis describes the conception and implementation of a system based on Java as the specification language. Partial specifications can be reused not only as source code but also as compiled components. This also opens up the possibility of marketing language specifications without giving away valuable know-how. Therefore, some approaches to the development, distribution and maintenance of language components are described and discussed.
The description of the Montage Component System addresses the problems of decentralised parsing (each Montage can parse only the construct it describes), explains how static semantics can be processed efficiently, and shows how a program is brought to execution.
Contents
Abstract
Zusammenfassung
Contents
1 Introduction
  1.1 Motivation and Goals
  1.2 Contributions
  1.3 Overview
2 Electronic Commerce with Software Components
  2.1 E-Commerce for Software Components
    2.1.1 What is a Software Component?
    2.1.2 End-User Composition
    2.1.3 What Market will Language Components have?
  2.2 Electronic Marketing, Sale and Support
    2.2.1 Virtual Software House (VSH)
    2.2.2 On-line Consulting
    2.2.3 Application Web
  2.3 Formal Methods and Electronic Commerce
    2.3.1 Properties of Formal Methods
3 Composing Languages
  3.1 Partitioning of Language Specifications
    3.1.1 Horizontal Partitioning
    3.1.2 Vertical Partitioning
    3.1.3 Static and Dynamic Semantics of Specifications
  3.2 Language Composition
    3.2.1 The Basic Idea
    3.2.2 On Benefits and Costs of Language Composition
  3.3 The Montages Approach
    3.3.1 What is a Montage?
    3.3.2 Composition of Montages
4 From Composition to Interpretation
  4.1 What is a Montage in MCS?
    4.1.1 Language and Tokens
    4.1.2 Montages
  4.2 Overview
  4.3 Registration / Adaptation
  4.4 Integration
    4.4.1 Parser Generation
    4.4.2 Scanner Generation
    4.4.3 Internal Consistency
    4.4.4 External Consistency
  4.5 Parsing
    4.5.1 Predefined Parser
    4.5.2 Bottom-Up Parsing
    4.5.3 Top-Down Parsing
    4.5.4 Parsing in MCS
  4.6 Static Semantics Analysis
    4.6.1 Topological Sort of Property Dependencies
    4.6.2 Predefined Properties
    4.6.3 Symbol Table
  4.7 Control Flow Composition
    4.7.1 Connecting Nodes
    4.7.2 Execution
5 Implementation
  5.1 Language
  5.2 Syntax
    5.2.1 Token Manager and Scanner
    5.2.2 Tokens
    5.2.3 Modular Parsing
  5.3 Data Structures for Dynamic Semantics of Specification
    5.3.1 Main Class Hierarchy
    5.3.2 Action
    5.3.3 I and T Nodes
    5.3.4 Terminal
    5.3.5 Repetition
    5.3.6 Nonterminal
    5.3.7 Synonym
    5.3.8 Montage
    5.3.9 Properties and their Initialisation
    5.3.10 Symbol Table Implementation
6 Related Work
  6.1 Gem-Mex and XASM
  6.2 Vanilla
  6.3 Intentional Programming
  6.4 Compiler-Construction Tools
    6.4.1 Lex & Yacc
    6.4.2 Java CC
    6.4.3 Cocktail
    6.4.4 Eli
    6.4.5 Depot 4
    6.4.6 Sprint
  6.5 Component Systems
    6.5.1 CORBA
    6.5.2 COM
    6.5.3 JavaBeans
  6.6 On the history of the Montage Component System
7 Concluding Remarks
  7.1 What was achieved
  7.2 Rough Edges
    7.2.1 Neglected Parsing
    7.2.2 Correspondence between Concrete and Abstract Syntax
    7.2.3 BNF or EBNF?
  7.3 Conclusions and Outlook
    7.3.1 Separation of Concrete and Abstract Syntax
    7.3.2 Optimization and Monitoring
    7.3.3 A Possible Application
Bibliography
Curriculum Vitae
Chapter 1
Introduction
This thesis aims to bring specifications of programming languages closer to their implementations. Understanding and mastering the semantics of languages will be important to a growing number of programmers and – to a certain extent – also to software users. A typical compiler processes the source program in distinct phases which normally run in sequential order (Fig. 1). In compiler construction suites such as GCC, Eli or Cocktail [GCC, GHL+92, GE90], each of these phases corresponds to one module or tool. This architecture supports highly optimized compilers and is well suited for complex general-purpose languages. Its limitations become apparent when reuse and extensibility of existing implementations or specifications are of interest. Consider extending an existing programming language with a new construct — for instance, a switch statement not yet available in a subset of C. In a traditional compiler architecture, such an extension implies changes in all modules: the scanner must recognize the new token (switch), the parser has to be able to parse the new statement correctly, and obviously semantic analysis and code generation have to be adapted.
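To make this coupling concrete, consider a minimal Java sketch of such a sequential pipeline (the names Phase and Compiler are invented for illustration and are not taken from any of the cited suites):

import java.util.List;

// Each phase transforms one intermediate representation into the next,
// e.g. characters -> tokens -> syntax tree -> annotated tree -> code.
interface Phase {
    Object run(Object input);
}

class Compiler {
    private final List<Phase> phases;

    Compiler(List<Phase> phases) { this.phases = phases; }

    Object compile(String source) {
        Object ir = source;
        for (Phase p : phases) {  // phases run strictly one after the other
            ir = p.run(ir);
        }
        return ir;
    }
}

// A new construct such as 'switch' has no module of its own here: it exists
// only as coordinated changes scattered over the scanner, parser, semantic
// analysis and code generation phases.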
1.1 Motivation and Goals
The specification and implementation of programming languages are often seen as entirely different disciplines. This lamentable circumstance leads to a separation of programming languages into two groups:
1. Languages designed by theorists. Most of them have very precisely formulated semantics; many were first specified before an implementation was available. And often they are based on a sound concept or model of computation.
Figure 1: Module structure (phases) of a traditional compiler system: source code → lexical analysis → syntax analysis → semantic analysis → intermediate code generation → optimization → code generation → execution.
Examples are: ML [HMT90], Lisp [Ste90], Haskell [Tho99] (functional programming), Prolog [DEC96] (logic programming), ASM [Gur94, Gur97] (state machines). Although their semantics is often given in unambiguous mathematical notation, they lack a large programmer community — either because the mathematical background needed to understand the specification is considerable, or because their usually high level of abstraction hinders an efficient implementation.
2. Languages designed by programmers. Their development was driven by operating system and hardware architecture needs, by marketing considerations, by competition with other languages, by practical problems — for many such languages, several of these reasons apply. Examples are: Fortran, C, C++, Basic, Java. Many of these languages feature surprisingly poor semantics descriptions. Normally, their specification is given in plain English, leaving room for many ambiguities. The most precise specification in these cases is the source code of the compiler, if available, and even this source code has to be adapted to the hosting platform's specific needs. For these reasons, formal specifications are hard to provide. Checking a program against a given specification is a very tedious and (in general) intractable task [SD97].
This thesis will focus on specifications of imperative programming languages, most of which will be found in the second group.
Ideally, language specifications should be easy to understand and easy to use. Denotational descriptions assume a thorough mathematical background of the reader. They allow many problems to be formulated in a very elegant fashion (e.g. an order of elements), but they fail to give hints for an efficient implementation (e.g. a quick-sort algorithm). Specifications should be understandable by a large community of different readers: language designers, compiler implementors, and programmers using the language to implement software solutions. Denotational specifications will be laid aside by the compiler constructor as soon as efficiency is in demand, and they are not suited to stir the interest of the average C programmer. Whether this is an indication of the insufficient education of programmers or of the unnecessary complexity of mathematical notions is not the subject of this thesis. Both arguments find their supporters, and in both there is a certain truth and a certain ignorance towards the other.
Programming languages are becoming more and more widespread in many disciplines apart from computer science. The more powerful applications get, the more sophisticated their scripting languages become. It is, for example, a common task to enter formulas in spreadsheet cells. Some of these formulas should only be evaluated under certain circumstances, and some might have to be evaluated repeatedly. Another example of the increasing importance of basic programming skills is the ability to enter queries in search engines on the Internet. Such a programming language designed for a specific purpose is called a domain specific language (DSL) [CM98] or a “little language” [Sal98]. Many of those languages are used by only a few programmers; some may even be designed for one single task in a crash project (e.g. a data migration project). Although most DSLs will never have the attention of thousands of programmers, they still should feature a correct implementation and fulfil their purpose reliably.
Many DSLs will be implemented by programmers who are inexperienced in language design and compiler construction. A tool for the construction of DSLs should support these programmers. This basically means that it should be possible to encapsulate the experience of professional language designers, as well as their reliable and well understood implementations, in «language libraries». This enables a DSL designer to merely compose his new language instead of programming it from scratch. The language specification suite we describe in this thesis was designed to allow for such a systematic engineering approach. Languages are decomposed into their basic constructs, e.g. while and for loops, if-then-else branches, expressions and so on. Each such construct is represented by a software component. Such a component provides the modularity and composability needed for our programming language construction kit. It enables the designer of a new DSL to simply compose a new language by picking the desired constructs from a library and plugging them together. Existing components may be copied and adapted, or new constructs may be created from scratch and added to the library.
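To illustrate the intended workflow, here is a hypothetical Java sketch of such a construction kit (all names are invented for illustration and do not denote the actual MCS interfaces, which are introduced in Chapter 4):

import java.util.ArrayList;
import java.util.List;

// A language construct ('while', 'if-then-else', ...) as a pluggable unit.
interface Construct {
    String name();
}

// Composing a new DSL means picking constructs from a library and
// plugging them together instead of writing a compiler from scratch.
class LanguageBuilder {
    private final List<Construct> constructs = new ArrayList<>();

    LanguageBuilder pick(Construct c) {
        constructs.add(c);
        return this;
    }

    List<Construct> build() {
        return constructs;  // stands in for generating a parser/interpreter
    }
}

// Usage: new LanguageBuilder().pick(whileLoop).pick(ifThenElse).build();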
1.2 Contributions
We describe a system based on the Montages approach [KP97b] which we call the Montage Component System (MCS). The basic idea is to compose a programming language on a construct-by-construct basis. A possible module system for an IF statement is shown in Fig. 2. Extending a language basically consists of adding new modules (Montages) to a language specification. Each such module is implemented as a software component, which simplifies the composition of languages. The original Montages approach has been extended to support:
• Abstract syntax
• Mapping of concrete to abstract syntax
• Coexistence of precompiled and newly specified components
• Configurability (to some extent) of precompiled components
• Four configurable phases: parsing, static semantics, generation of an execution network, and execution (a sketch follows the requirements list below)
We provide a survey of compiler construction technology and an in-depth discussion of how language specifications can be composed into executable interpreters. An MCS specification decomposes language constructs into four different levels, which roughly correspond to the phases of traditional approaches: parsing, static semantics, code generation and execution (dynamic semantics). In contrast to conventional compiler architectures, these specifications are not given per level for the whole language, but only for a specific language construct. Combining Montage specifications on each of these four levels is the main topic of this thesis. The system's architecture supports multi-language and multi-paradigm specifications.
The Montage Component System originated in the context of a long-standing tradition of research in software development techniques at our laboratory [FMN93, Mar94, Schw97, Mur97, Sche98]. Although technically oriented, our work always considered economical aspects as well. This thesis shall continue this tradition and starts with some reflections on marketing, sale and support of language components. It is important to emphasise that these considerations greatly influenced the design decisions and the requirements MCS had to fulfil. The latter were:
Figure 2: An MCS Web for an IF statement and its various control flows, for the program fragment IF a<>b THEN a := c END. The web connects Montage nodes such as IfStatement, ConditionalExpr, ConditionalOp, Assignment and Identifier; the parse flow is drawn solid, static analysis (simultaneous firing) dotted, and execution dashed.
• Specifications should be easy to understand and to use.
• Specifications should be reusable not only in source form but also in compiled form (component reuse).
• Specifications should be formal.
• Specifications should be modular and composable from a programmer's point of view. Programmers think in entities of language constructs, not of compiler phases.
• A standard component system and a standard programming language should be employed for specification.
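As an illustration of the four configurable phases mentioned above, here is a hypothetical Java sketch of the interface a single Montage component would expose (the type names are invented; the actual MCS classes are described in Chapters 4 and 5):

// Empty marker types, standing in for the real data structures.
interface TokenStream {}
interface AstNode {}
interface SymbolTable {}
interface ExecutionNetwork {}
interface Environment {}

// One language construct, specified completely in a single unit:
// each method corresponds to one of the four configurable phases.
interface MontageComponent {
    AstNode parse(TokenStream tokens);                 // parsing
    void analyse(AstNode node, SymbolTable symbols);   // static semantics
    void connect(AstNode node, ExecutionNetwork net);  // execution network
    void execute(AstNode node, Environment env);       // execution
}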
1.3 Overview
This thesis is organized as follows. Commercialization of software components imposes some prerequisites on the architecture of MCS. These prerequisites are introduced and explained in Chapter 2. Chapter 3 then gives an overview of how programming languages can be composed. Design and concepts of MCS are explained in detail in Chapter 4, which is the core of this thesis. Chapter 5 discusses some interesting implementation details. A survey of related work is given in Chapter 6; references to this chapter can be found throughout this thesis, in order to put a particular problem into a wider context. Finally, Chapter 7 concludes this dissertation and gives some prospects for the future.
Chapter 2
Electronic Commerce with Software Components
Electronic commerce emerged in the late 90s and has grown into a multi-billion dollar business. E-Commerce businesses can be characterised along two dimensions: the degree of virtuality of the traded goods and the degree of virtuality of the document flow. Fig. 3 shows these two dimensions and illustrates their meaning with some examples. Whether 'real' goods have to be moved depends on the business sector involved. On the document flow axis, however, E-Commerce tries to reduce the amount of physically moved documents. An important distinction has to be made between business-to-business (B2B) commerce and business-to-consumer (B2C) commerce. Note that a company may well distinguish between logistic (B2B-like) and customer (B2C-like) commerce – Wal-Mart supermarkets, for example. B2C E-Commerce will emerge much more slowly, as consumers cannot be expected to be on-line 24 hours a day. Security considerations second this argument (a signature on a piece of paper is much easier to understand than abstract cryptographic systems are to trust).
Figure 3: Characterisation of E-Commerce businesses. The horizontal axis represents the 'virtuality' of the traded goods, ranging from 'virtual' to 'real'; the vertical axis indicates how 'paperless' the document flow is, from electronic to paper-based. Examples: banking systems (virtual goods, electronic documents), travel agencies, the automobile industry, and supermarkets (real goods, paper-based documents).
E-Commerce as we refer to it in this chapter is concerned with the left part of the diagram in Fig. 3. Components are virtual goods which might be sold to end-users or deployed by other business organisations. Section 2.2 will give some examples of both B2B and B2C scenarios involving language components. Szyperski explains that there has to be a market for component technology in order to keep the technology from vanishing [Szy97]. We further believe that E-Commerce will play a major role in this market, as we will elaborate in section 2.2. This chapter will focus on the relation between software components and their marketing. The Montage Component System presented later in this thesis (chapter 4) relies heavily on the success of component technology. Although the basic idea of composing language specifications can be applied to a single-user environment, it makes only limited sense there. Its full potential is revealed only if language components can be distributed over the Internet [Sche98]. In the following we present our vision of a (language) component market. After defining the term 'Software Component', we point out some premises E-Commerce imposes on software components. We then explain our vision of electronic marketing, consulting and support. These ideas were developed and implemented under the umbrella of the Swiss National Science Foundation's project “Virtual Software House” (project no. 5003-52210). The chapter closes with some considerations on the acceptance of formal methods in software markets.
2.1 E-Commerce for Software Components
2.1.1 What is a Software Component?
The term 'Software Component' is used in many different ways. For marketing and sales persons it is simply a 'software box'. Programmers have different ideas of components, too: to the C++ programmer, it might be a dynamic link library (DLL); a Java programmer refers to a JavaBean; and a CORBA specialist has in mind any program offering its services through an ORB. Often the terms 'Object' and 'Component' are not separated clearly enough. Throughout this monograph we keep to the following definition, taken from [SP97]:
“A component is a unit of composition with contractually specified interfaces and explicit context dependencies only. Components can be deployed independently and are subject to composition by third parties.”
This definition is product-independent and does not only cover technical aspects. It covers five important aspects of a component:
1. Extent: 'unit of composition' means a piece of software that is sufficiently self-contained to be composed by third parties (people who do not have complete insight into the component's software).
2. Appearance: 'contractually specified interfaces' implies that the interface of a component adheres to a standard of interfaces that is also followed by other components. For example, JavaBeans employs a special naming scheme which allows others to set and get attribute values, to fire events and to gain insight into the component's structure (see the sketch below).
3. Requirements: 'explicit context dependencies' specifies that the component does not only reveal its interfaces to its clients (as classes and modules would); it furthermore tells what the deployment environment has to provide in order for it to be operative.
4. Occurrence: 'Components can be deployed independently', i.e. a component is well separated from its environment and from other components. No implicit knowledge of the underlying operating system, hardware or other software (components) may be used at compile-time or at run-time.
5. Usage: Components 'are subject to composition by third parties' and thus will be deployed in systems unknown to the programmer of the component. This aspect justifies the four former items of the definition. A component should encapsulate its implementation and interact with its environment only through well defined interfaces.
A detailed discussion of the above definition, along with the difference between 'Objects' and 'Components', is given in Szyperski's book on component software [Szy97].
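The JavaBeans naming scheme mentioned under 'Appearance' can be illustrated with a minimal bean (the Counter class is invented for illustration; the get/set and add/remove-listener patterns are the standard JavaBeans conventions):

import java.beans.PropertyChangeListener;
import java.beans.PropertyChangeSupport;
import java.io.Serializable;

// The getCount/setCount naming pattern lets a composition tool discover
// the 'count' property by introspection alone, without extra metadata.
public class Counter implements Serializable {
    private int count;
    private final PropertyChangeSupport changes = new PropertyChangeSupport(this);

    public int getCount() { return count; }

    public void setCount(int count) {
        int old = this.count;
        this.count = count;
        changes.firePropertyChange("count", old, count);  // fire an event
    }

    // The add/remove listener pattern marks the bean as an event source.
    public void addPropertyChangeListener(PropertyChangeListener l) {
        changes.addPropertyChangeListener(l);
    }

    public void removePropertyChangeListener(PropertyChangeListener l) {
        changes.removePropertyChangeListener(l);
    }
}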
2.1.2 End-User Composition
Besides the pure definition of what a software component is, there is also a list of requirements that a component has to fulfil to be a 'useful' component. Components should provide solutions to a certain problem on a high level of abstraction. As we have seen above, a component is subject to third-party composition and will eventually be employed in combination with components of other component worlds (JavaBeans interoperating with ActiveX components, for example). This implies that its interfaces and its interfacing process have to be kept as simple and as abstract as possible. There is no room for proprietary solutions and 'hacks' — adherence to standards is vital for the success of a component. Of course, there is a price to be paid for seamless interoperability: performance. In many environments based on modern high-performance hardware, however, one is readily willing to pay this price for the flexibility gained. As we will point out in chapter 4, there are many reasons to pay this price in the case of language composition as well. Nevertheless, we must stress the fact that (hand-written) compilers, as a rule, significantly outperform MCS-generated systems.
Aspect 4 in our component definition forbids making any assumptions about the environment of a component. Aspect 5 implies another unknown: the capabilities and skills of third parties. The success of a component is tightly coupled with its ease of use and the time it takes to understand its functionality. Especially if components are subject to end-user composition, this argument becomes important. The gain in flexibility of deployment (down to the end-user) outweighs in many cases the above-mentioned drawback of slower execution.
We defined the extent of a component as the 'unit of composition', and considering the entire definition, we can deduce that a component is also a unit of deployment. This technical view does not necessarily apply to a component's business aspect: although a component may be useful for deployment, it may not be a unit of sale. For example, there is no argument against a component that specifies only the multiplication construct of a language, but in sales it will normally be bundled with (at least) the division construct. MCS allows complete sublanguages to be bundled and used as components in their own right. Such a cluster of useful components can unfold its full potential only if it is still possible to (re-)configure its constituents. This is important because, as components grow (components may be composed of other clustered components), the resulting component would otherwise become monolithic and inert. On the other hand, bundling components for sale can be accomplished according to any criteria, thus decoupling technical issues from marketing aspects. For example, technically seen, the term and factor language constructs (= components in MCS) only make sense if there are also numerical constructs such as addition, subtraction, multiplication and division. From a sales point of view, it may be perfectly right to sell term and factor without their constituents, because the buyer wants to specify these on his own.
2.1.3 What Market will Language Components have?
When speaking about end-users, we mean 'programming literates', that is, people skilled in using any imperative programming language. This is regarded as a prerequisite for being able to specify new languages. Thus the answer to this section's question is: the hard middle. According to Jeff Bezos, founder and CEO of Amazon.com, this is [Eco97]:
“In today's world, if you want to reach 12 people, that's easy: you use the phone. If you want to reach 12 million people, it's easy: you take out an ad during the Superbowl. But if you want to pitch something to 10,000 people – the hard middle – that's really hard.”
The next section addresses exactly this problem: how to do marketing, sale and support for the hard middle.
2.2 Electronic Marketing, Sale and Support
The most important advantage of component technology is – as its name suggests – its composition concept. Components will not make sense at all if they do not lend themselves easily to composition. Composable software will create more flexible and complex applications than conventionally developed software. This is not because a single software house would not be able to write complex software, but because different styles of software development, different cultures of approaching problems, are also put together when composing software. This adds to the complexity, but may also lead to more efficient, more effective and more elegant solutions.
Component software will not be successful if there is no market for software components. Such markets, however, have to be established first. Conventional software distribution schemes will not match the requirements of the component industry. They deal with stand-alone applications, which can be sold in shrink-wrapped boxes containing media on which the software is delivered (DVD, CD, diskettes, tapes) and several kilograms of manuals as hard copies. Obviously this is not applicable to software components. Depending on their functionality, components may be only a few kilobytes in size and extremely simple to use. No one would want to go to a shopping mall to acquire such a piece of software, not to speak of the overhead of packaging, storage and sales room space.
The typical marketplace for components is the Internet. Advanced techniques for distribution, support and invoicing have to be applied to protect both the customer's and the vendor's interests. In the following subsections, we describe what such a marketplace on the Internet ideally looks like.
2.2.1 Virtual Software House (VSH)
In the Virtual Software House project (funded by the Swiss National Science Foundation, project no. 5003-52210) [SS99], a virtual marketplace for software of any kind was studied and prototype solutions were developed. The VSH is a virtual analogy to a real shopping mall. An organisation running a VSH is responsible for the maintenance of the virtual mall. It offers its services to suppliers (shop owners in the mall) that want to sell their products and services over the Internet. It runs the server with the VSH software, it may provide disk space and network bandwidth to its contractors, and it gives them certain guarantees about quality of service (e.g. availability, security, bandwidth). Invoicing and electronic contracting could also be part of its services. The quality of the mediating services of a VSH is important for all participants:
1. Operators of a VSH can establish their name as a brand and thus attract new customers as well as new contractors.
2. Suppliers can rely on the provided services and do not need to care about establishing the infrastructure for logistics, invoicing and marketing themselves. A VSH is an out-sourcing centre for a supplier. Small companies especially will profit, e.g., from a professional web appearance and facilitated sales procedures. They can concentrate on their main business without neglecting the sales aspect.
3. For the customer, an easy-to-use and easy-to-understand web interface is very important. This includes simple and yet powerful search and query facilities. Clearly formulated and understandable payment conditions and strict security guarantees will help to win customers' confidence.
Of central importance is the product catalogue, which can be queried by customers. Although several different companies sell their products under the umbrella of a VSH, customers usually prefer to have one single central search engine. In contrast to shopping in real shopping malls, Internet customers usually do not want to spend an afternoon surfing web pages just to find a product. They will expect search and query facilities that are more elaborate than plain text searching (as is done in search engines like AltaVista or Google). Instead, Mediating Electronic Product Catalogues (MEPC) [HSSS97] will be employed. These catalogues summarize the (possibly proprietary) product catalogues of the different companies participating in the VSH. Mediating means that they offer a standardized view of the available products. MEPCs enable a customer to query for specific products and/or combinations of products. Optionally he can do this in his own language, using his own units of measurement and currencies. It is the mediating aspect of such EPCs to convert and translate figures and languages.
A VSH has a very flexible, federated organisation. It will not only allow its contractors to sell software but will also support them in offering other (virtual) services such as consulting (see section 2.2.2), software installation and maintenance, or federated software development (see section 2.2.3). A VSH is not only designed for business-to-consumer (B2C) commerce; it can also serve as a platform for business-to-business (B2B) commerce. One example would be federated software development, where two or more contractors use the services of a VSH for communication, as a clearinghouse during the development stage and as a merchandising platform afterwards. Another example of B2B commerce would be a financial institute joining the VSH to offer secure payment services (by electronic money transfer or credit card) to the other contractors.
Such Virtual Software Houses are already available on the web. A spin-off of the VSH project started its business in autumn 1999 and can be reached at www.informationobjects.com. Another similar example (although not featuring such sophisticated product catalogues) is www.palmgear.com. The latter has specialised in software for palmtop or hand-held computers. Usually such programs are only a few kilobytes in size, and often they are distributed as shareware by some ambitious student or hobby programmer (which does not necessarily reduce their quality). Obviously such programmers do not have enough time and money to professionally advertise and sell their software. As some of their programs are just simple add-ons to the standard operating system, this market already comes close to the component market proposed above.
2.2.2 On-line Consulting
Consulting on evaluation, buying, installation and use of software plays a prominent role in today's software business. There are many attempts to replace human consultants by sophisticated help utilities, electronic wizards and elaborate web sites. But these all have two major drawbacks: First, they cannot (yet) answer specific questions; they can only offer a database of knowledge, which has to be queried by the user. Second, they are not human. For support, many users prefer a human counterpart to a machine. But human manpower is expensive, and individual (on-site) consulting even more so, as time-consuming travelling adds to the costs.
The aim of the Network and On-line Consulting (NOLC) project (funded by the Swiss National Science Foundation, project no. 5003-045329, a sub-project of VSH) was to reduce costs in consulting. This aim was achieved by providing a platform that supports on-line connections between one (or several) client(s) and one (or several) consultant(s). Clients and consultants may communicate via different services like chat, audio, video, file transfer, whiteboards and application sharing (Fig. 4). It is important that these services are available at very low cost. In particular, this means that no costly installations should be necessary. Fortunately, for the platform used in the NOLC project (Wintel), there is free software called NetMeeting [NM] which meets this requirement:
• It is available for free, as it is shipped as an optional package of Windows. Today's hardware suffices to run NetMeeting; the only additional cost may arise from the purchase of an optional camera.
• It supports audio, video and communication via standard drivers, so any sound card and video camera can be used, even over modem lines.
• It organises the communication channels in a decentralised manner.
The last item was very important: the separation of the control over the communication from the communication itself. It allows communication channels between customers and consultants to be controlled from a server without having to deal with the data stream they produce. I.e., establishment and conclusion of a connection (and the quality-of-service parameters) can be controlled by a server running the NOLC system. The actual data stream of the consulting session (video, audio, etc.), however, is sent peer-to-peer, and thus does not influence the throughput of the NOLC server.
With consulting, three parties are involved:
1. A client who has a problem concerning e.g. a piece of software.
2. A consultant who offers help in the field of the client's problem.
3. An intermediary providing a platform that brings the above two parties together.
The NOLC project investigated the characteristics of the third party, and a prototype of such an intermediary platform was implemented. It consists of a server that provides the connecting services between the first two parties. The server controls the establishment and conclusion of a consulting session. To fulfil this task, it has access to a database that stores information about consultants and clients (for an unrestricted consulting session, both parties need to be registered).
Figure 4: NOLC architecture and participants in a consulting session. Customer and consultant each run a browser with an applet and a conferencing tool (NetMeeting). The consulting data (chat, audio and video channels, ...) flows peer-to-peer between the conferencing tools, while control data flows between the applets and a Java server backed by a database.
This data covers the communication infrastructure available at the parties' computers (audio, video), but also the skills of the consultants, their availability and their fees.
When a potential client looks for consulting services, he will eventually visit the web site of a consulting provider. On this web site, he may register himself and apply for a consulting session. Before such a session may start, he is presented with a few choices and questions about the requested consulting session: what kind of media shall be used (video, whiteboard, application sharing etc.) and what topic the session will be about. It is possible to let the system search for an available consultant, or the client may himself enter the name of a consultant (if known). Just before a new session is started, the client gets an overview of the costs that this session is going to generate. Usually, cost depends on the communication infrastructure, the chosen consultant and the duration of the session. Once the client agrees, the chosen consultant (who was previously marked as available) receives a message indicating that a client wants to enter a new session.
During a session both parties have the possibility to suspend the session and to resume it later on. This feature is necessary e.g. if there are questions which cannot be answered immediately; after resuming, the same configuration as before suspension is re-established. Of course it is also possible to renegotiate this configuration at any point in a session. After completing the session, the client is presented with the total cost and asked to fill in an evaluation form which serves as feedback to the consultant. The intermediary service controls re-negotiation of the communication configuration, suspension and resumption of sessions and, finally, the calculation of the cost. Once a session terminates, the feedback report is stored for future evaluation and the generated costs are automatically billed to the client. Recently, some security features were added: the customer has to digitally sign each session request, and the server stores these signed requests. Thus, it is possible to prove that a customer has requested and accepted to pay for a session.
Billing and money transfer are not part of the NOLC platform, but are delegated to a Virtual Software House [SS99]. So NOLC is a business service provided either by the VSH itself or by an additional party offering their services through the VSH. In addition, n-to-m group communication and load-balancing in a fault-tolerant environment were also investigated [Fel98, FGS98]. The Object Group Service (OGS) employed is a CORBA service which can be used to replicate a server in order to make it fault-tolerant.
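The session lifecycle described above can be summed up as a small state machine; the following Java sketch is purely illustrative and not the actual NOLC implementation:

// Illustrative state machine for the consulting-session lifecycle.
enum SessionState { REQUESTED, ACTIVE, SUSPENDED, TERMINATED }

class ConsultingSession {
    private SessionState state = SessionState.REQUESTED;

    void accept()    { require(SessionState.REQUESTED); state = SessionState.ACTIVE; }
    // Suspension keeps the negotiated configuration for later resumption.
    void suspend()   { require(SessionState.ACTIVE);    state = SessionState.SUSPENDED; }
    void resume()    { require(SessionState.SUSPENDED); state = SessionState.ACTIVE; }
    // Termination triggers cost calculation, billing and the feedback form.
    void terminate() { state = SessionState.TERMINATED; }

    private void require(SessionState expected) {
        if (state != expected) {
            throw new IllegalStateException("cannot leave state " + state);
        }
    }
}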
2.2.3 Application Web
The last aspect of E-Commerce that was investigated in the context of the VSH project was the maintenance of component systems. The central question was: how to remotely maintain and control a component system, given that the system is heterogeneously composed, i.e., components from different software developers with arbitrarily complex versioning should be manageable [MV99, VM99].
Information management tools and techniques do not scale well in the face of great organisational complexity. An informal approach to information sharing, based largely on manual copying of information, cannot meet the demands of the task as size and complexity increase. Formal approaches to sharing information are based on groupware tools, but cooperating organisations do not always enjoy the trust or the availability of sophisticated infrastructure, methods, and skills that this approach requires. Bridging the gap requires a simple, loosely coupled, highly flexible strategy for information sharing. Extensive information relevant to different parts of the software life cycle should be interconnected in a simple, easily described way; such connections should permit selective information sharing by a variety of tools and in a variety of collaboration modes that vary in the amount of organisational coupling they require.
During the development of a component, the programmers have a lot of information about the software, e.g. knowledge about versioning, compatibility with other components, operating systems and hardware, known bugs, omitted and planned features, unofficial (undocumented) features etc. All this information is lost when the software is released in a conventional manner; the customer of such a component may only rely on official documentation. The core idea of the application web is to maintain links back to the developer's data, so that it would be possible at any time in the life cycle of the software to track a problem back to its roots [Mur97, Sche98]. Of course, these links will in general not be accessible to everybody. As an illustration, some scenarios will now be described:
Remote on-line consulting. A software developer out-sources the support and maintenance services to an intermediary (a consulting company). Such a consultant would have to acquire a license to provide support and the right to access the developer's internal database. In turn, customers facing problems using a certain product will automatically be rerouted to the consultant when following the help link available in their software.
Customer-specific versions. A customer unhappy with a certain version of a component may follow the maintenance link back to the developer's site, where he may request additional functions. The developer receives all the important data about the component, such as version number and configuration. He may then build a new variant of the component according to the client's needs. On completion, the client will be notified and may download the new version from the web immediately.
Federated software development. Several software companies developing a component system may use the services of the application web to provide data for their partners. As these partners may be competitors in other fields, only relevant links will be accessible to them. The application web allows fine-grained access control, supporting several service levels.
A prototype of the application web was developed using Java technology. What are the services of the application web?
• Naming and versioning: It is important to maintain a simple, globally scalable naming scheme for versioned configurations distributed across organisations [VM99]. The naming scheme employed is based on the reversed internet addresses of the accessible components (similar to the naming scheme of Java's classes, e.g. org.omg.CORBA.portable).
• Persistence, immutability and caching: It is important that links are not moved or deleted. Participating organisations have to ensure the stability and accessibility of the linked information. Repositories (another B2B service in a VSH) could provide reliable, persistent bindings between versioned package names and their contents. For example, they may support WWW browsing and provide querying facilities.
• Reliable building: Versioned components contain links to other components and packages (binaries) they import during the building process. As these links will be available through the application web's services, building (even with older versions of libraries) will always be possible.
• Application inspection: Java's introspection techniques allow one to (remotely) gain insight into a component's configuration at run-time. This feature is very important in the consulting and customer-specific-version scenarios above: it allows e.g. a consultant to collect information about the actual component's environment. This information may be used to query the knowledge base for possible known bugs or incompatibilities.
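A minimal sketch of such run-time inspection using only standard Java facilities (the ComponentInspector class is invented for illustration; the inspected component is arbitrary):

import java.beans.BeanInfo;
import java.beans.IntrospectionException;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;

// Reports a component's bean properties and its package version, the kind
// of configuration data a remote consultant would collect.
public class ComponentInspector {
    public static void inspect(Object component) throws IntrospectionException {
        Class<?> cls = component.getClass();
        Package pkg = cls.getPackage();
        if (pkg != null) {
            // Version information as recorded in the jar manifest, if any.
            System.out.println(pkg.getName() + " version "
                    + pkg.getImplementationVersion());
        }
        BeanInfo info = Introspector.getBeanInfo(cls);
        for (PropertyDescriptor pd : info.getPropertyDescriptors()) {
            System.out.println("property " + pd.getName()
                    + " : " + pd.getPropertyType());
        }
    }
}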
2.3 Formal Methods and Electronic Commerce
Consulting will become an emerging market with the advent of electronic commerce, the main reason being the decoupling of sales and consulting. As there will be little or no personal contact between customer and vendor in E-Commerce, customers will not be willing to pay a price premium that, in conventional business, was justified by the personal advice of the sales representative. The job of a sales representative will shift to a consultant's job. Consulting and sale will be two different profit centres. It should be noted that this is to the advantage of the customer, too: he will get fair and transparent prices for the product as well as for additional advisory services. The consultant will be more neutral, as he does not have to sell the most profitable product to earn his salary, but the best one (from the client's point of view) if he wants to keep his clients.
Could rationalisation also cut consulting jobs? Not in the short and medium term. There are several reasons for this answer, which will be discussed in the following sections. All these reasons are related to the (limited) applicability of formal methods in computer science.
2.3.1 Properties of Formal Methods
Formal notations are applied in specifications of software. They allow one – on the basis of a sound mathematical model – to describe precisely the semantics of a piece of software, e.g. the observable behaviour of a component. Many formal notations have a rather simple semantics, thereby lending themselves to mathematical reasoning and automated proof techniques. But capturing the semantics of a problem has remained a hard task, although there has been extensive research on this topic during the last decades. The following discussion will focus on some major aspects of formal methods and their effect on commercially distributed software components.
Scalability. Unfortunately, formal methods do not scale, i.e. they cannot keep up with growing systems (growing in terms of complexity, often even in terms of lines of code). For example, proving that an implementation matches its specification is intractable for programs longer than a few hundred lines of code [SD97]. The main reason is the simple semantics formal notations feature: typically they lack type systems, namespacing, information hiding and modularity. Introducing such concepts complicates these formalisms to such a degree that the complexity of their semantics catches up with that of conventional programming languages. On the other hand, programming language designers learned many lessons from the formal semantics community. This led to simpler programming languages with clearer semantics. Examples are Standard ML [Sml97] (functional programming with one of the most advanced type and module systems), Oberon [Wir88] and Java [GJS96] (imperative programming, simplicity by omitting unnecessary language features). These languages have proven their scalability in many projects.
Comprehensibility. To successfully employ formal methods, sound mathematical knowledge is presumed. This can be a major hindrance to the introduction of formal methods, as many programmers in the IT community do not have a university degree or a similar mathematical background. Of course, improving education for programmers is important, but this does not completely solve the problem. For describing syntax, the Backus-Naur Form (BNF) [Nau60] or the Extended Backus-Naur Form (EBNF) [Wir77b] in combination with regular expressions [Les75] (micro syntax) has become a standard. For the specification of semantics there is no such dominant formalism available. Semantics has many facets, which are addressed by different specification formalisms. In widespread use and easy to understand are semi-formal specification languages, the most prominent among them being the Unified Modeling Language (UML). Nota bene, UML is a good example of how one single formalism does not suffice to capture all facets of semantics: 'Unified' does not denote one single formalism, but rather expresses the fact that the four most well-known architects of modeling languages agreed on a common basis. In fact, UML features over half a dozen different diagram classes. UML is semi-formal because the different models represented by the different diagrams have no formal correspondence.
UML is semi-formal because the different models represented by the different diagrams have no formal correspondence. This correspondence is given either by name equivalence (in simple cases) or by textual statements in English (complex cases). Specifications can be separated into two classes: declarative and operational semantics. Declarative semantics describes what has to be done, but not how. Operational semantics, on the other hand, describes in more detail how something is done. To many programmers, the latter is easier to understand, as it is closer to programming languages than declarative descriptions.

Efficiency. Why distinguish between a specification and an implementation at all? If a specification is available, why go through the error-prone process of implementing it? Basically, a specification and its corresponding implementation can be viewed as two different – but equivalent – descriptions of a problem and its solution. However, in practice, it is very hard to run formal specifications efficiently (compared to C/C++ code) on a computer.

• In declarative specifications, the 'how' is missing. This means that a code generator would have to find out on its own which algorithm should be applied to a certain problem. In general, declarative languages were not designed for execution, but for reasoning about them. It is for example possible to decide on the equivalence of two different declarative specifications. In the PESCA project [Schw97, SD97] this property was used to show the equivalence of an implementation with respect to its formal specification. Using algebraic specifications, the proof of equivalence was performed employing semi-automated theorem proving tools. In order to compare the implementation with its specification, the implementation had to be transformed into algebraic form as well. This could be done automatically in O(l), where l is the length of the implementation. Because of its operational origin, this transformed specification had to be executed symbolically in order to be compared. During this execution, terms tend to grow quickly, and term rewriting therefore becomes slow and memory-consuming. Apart from very simple examples, comparing specifications does not (yet) seem to be a tractable task. This is very unfortunate, as in the context of a VSH, a specification search facility would be an interesting feature: given a specification of a certain problem, is there a set of components solving it?

• Operational semantics does not do much better. Although the semantics is already given in terms of algorithms, most operational specification languages feature existential and universal quantifiers. In general, these quantifiers cannot be implemented efficiently: they amount to a linear search over the underlying universe, the size of which may be unknown.
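To illustrate the last point, consider a direct – purely hypothetical – rendering of the two quantifiers in Java; the class and method names do not stem from any of the cited systems. Both quantifiers degenerate into a linear scan over the universe:

    import java.util.function.Predicate;

    // Hypothetical sketch: quantifiers over a finite universe can only be
    // implemented as linear searches; the universe may be large or unknown.
    final class Quantifiers {
        // exists x in universe : p(x)
        static <T> boolean exists(Iterable<T> universe, Predicate<T> p) {
            for (T x : universe)
                if (p.test(x)) return true;   // witness found
            return false;                     // universe exhausted
        }

        // for all x in universe : p(x)
        static <T> boolean forAll(Iterable<T> universe, Predicate<T> p) {
            for (T x : universe)
                if (!p.test(x)) return false; // counterexample found
            return true;
        }
    }

If the universe is infinite or not explicitly represented, even this linear search is unavailable.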
Completeness. Bridging the gap between specification and implementation is one of the trickiest parts of employing formal methods. It is important to guarantee that the implementation obeys its specification. As discussed above, this is only possible with a considerable overhead of time and manpower (the problem has to be specified, implemented, and the correspondence has to be proven). But it does not suffice to do this for new components only; it also has to be done for the compiler, the operating system and the underlying machine! On all these levels of specification one will face the problems mentioned above.

Why can no gaps be tolerated? Specifying (existing) libraries of components is an expensive task, as it binds considerable resources of well educated (and well paid) specialists and a not to be underestimated amount of computing power. This would only pay off if consulting on components could be eliminated and replaced by e.g. specification matching. Bridging the gaps between specifications and implementations completely is necessary to prevent this automated process from interrupting and asking for user assistance. In their work, Moorman-Zaremski and Wing investigated signature matching [MZW95a] and specification matching [MZW95b]. Matching only the signatures of functions yielded surprising results: queries normally returned only a few results, which had to be examined by hand. An experienced user, on the other hand, could decide with very little effort which function to use. Full specification matching cannot be guaranteed to complete without user interaction (the underlying theorem prover might ask for directions or proof strategies). Considering that these user interactions are at least as complex as the decision between a handful of pre-selected functions, the question arises whether it makes sense to use formal specifications at all in this scenario.

Openness. Should a component reveal its specification or implementation at all? In many business scenarios, giving away a formal specification or the source code of a component is out of the question, as the company's know-how is a primary asset. Publishing this know-how would have considerable consequences for the business model. The global market for operating systems may serve as an example – the free Linux versus the black-box Windows: free software products cannot be sold, whereas black-box software cannot be trusted. Of course, there are many different shades of gray, from "Free Software" (as propagated by Richard Stallman, founder of the GNU project; see www.gnu.org/philosophy) and "Open Source" (as propagated by Eric S. Raymond and Bruce Perens, founders of the Open Source Initiative; see www.opensource.org) via "Freeware", "Public Domain", "Shareware" and "Licensed Software" up to buying all rights to a specific piece of software. Both sides, software developer and user, have to decide on a specific distribution model.

Conclusions and implementation decisions. The original Montages (see section 3.3) used ASMs as their specification formalism. The intended closeness to
Turing Machines resulted in ASMs being simple and easy to understand. However, there are some drawbacks. ASMs lack modularity and have no type system; on the other hand, they have a semantics of parallel execution with fixed-point termination. All these features distinguish them enough from conventional programming languages to scare off C++ or Java programmers. As a typical representative of an operational specification language, ASMs focus on algorithmic specifications and support only restricted reasoning. When deciding on an implementation platform for our Montage Component System, ease of use, clarity of specifications and compatibility were the major criteria. The Montage model itself proved to be very useful for language specification; therefore, its core model was chosen as the base of our implementation. However, ASMs were replaced by Java to reflect the considerations in this section. This also allows many programmers to understand MCS without first learning a new formalism, and thus to see MCS as a tool rather than an abstract formal specification mechanism. The designers of Java learned many lessons from the past, avoiding pitfalls of C/C++ while still attracting thousands of programmers. Other considerable advantages of Java over ASMs are the availability of (standard) libraries and an advanced component model (Java Beans).
Chapter 3
Composing Languages
Programming languages are not fixed sets of rules and instructions; they evolve over the years (or decades) as they are adapted to changing needs and markets. This evolution may lead to a new, similar language (a new branch in the language family) or just to a new version of the same language. An example is the Pascal language family, which evolved over the past three decades into many new languages, among them Pascal [JW74], Modula [Wir77a], Modula-2 [Wir82], Modula-3 [Har92], Oberon [Wir88], Oberon-2 [MW91], Component Pascal [OMS97] and Delphi [Lis00]. Another important dynasty of programming languages are the C-like languages, starting with C [KR88] (where K&R-C (Kernighan and Ritchie) and ANSI-C are distinguished) and evolving to Objective-C [LW93] and C++ [Str97]. The latter underwent several enhancements during its two decades of existence, e.g. the introduction of templates. Java [GJS96], a relatively new but nevertheless successful language, has already undergone an interesting evolution: it started as a Pascal-like language (then called Oak) and got its C++-like face in order to make it popular. Compared to C++ or Delphi, it lost many features, for example its hybrid character (all methods have to be class-bound), address pointers, operator overloading, multiple inheritance, and templates. Java's success soon led to new enhancements of the language (inner classes) and to new dialects (Java 1.0 [GJS96], Java 1.1 [GJSB00], JavaCard [Che00]).

Some basic concepts are the same in all the above mentioned programming languages, e.g. while loops, if-then-else statements or variable declarations. These constructs may have different syntax, but their semantics remains the same. All imperative programming languages basically differ in the number of language constructs they offer to the programmer. For example, C++ can be described as C plus additional language constructs. In general, programming languages are simply composed of different language constructs.
3.1 Partitioning of Language Specifications

A programming language specification is usually given in the form of text and consists of one or several files containing specifications in a specialised notation or language (such specification languages are, in fact, good examples of "little languages" or DSLs). Well known examples are (E)BNF for syntax specification [Wir77b] and regular expressions as they are used in Lex [Les75]; even the make utility [SM00] controlling the compilation (of the compiler or interpreter) can be mentioned here. Usually a language specification is structured into different parts, each corresponding to one particular part of the compiler, and often each of these parts is given in a separate notation/language. From a software engineering point of view this makes sense, as such a partitioning of the specification allows for separate and parallel development – and for reuse – of the different parts of the compiler. In principle, a language specification can be partitioned/modularised in two different ways:

1. It is split into transformation phases like scanning, parsing, code generation etc. This corresponds to the well known compiler architecture; we will call it the horizontal partitioning scheme.

2. It is described language construct by language construct. Each of these construct descriptions – we call them Montages – contains a complete specification from syntax to static and dynamic semantics. We call this approach the vertical partitioning scheme.

We will consider the pros and cons of both approaches in the following sections.
3.1.1 Horizontal Partitioning

The conventional approach of partitioning a compiler into its compilation phases is very successful. It was established in the 1960s along with the development of regular parsers. The idea was to split the compilation process into two independent parts: the front-end and the back-end (Fig. 5). The front-end is concerned with scanning, parsing and type checking, while the back-end is responsible for code generation.

Advantages: This partitioning is well suited for large languages with highly optimized compilers, and it meets exactly the needs of compiler constructors like Borland and Microsoft, which offer numerous languages. The separation of front- and back-end allows them to build several front-ends for different languages, all producing the same intermediate representation.
Figure 5: The typical phases of a compiler partition its design horizontally. [The figure shows the front-end (lexical analyzer, syntax analyzer, semantic analyzer, intermediate code generator) and the back-end (code optimizer, code generator) transforming a source program into a target program, with the symbol-table manager and the error handler serving all phases.]
From this intermediate representation, different back-ends can generate code for different operating systems and microprocessors. The intermediate representation serves as a pivot in this design and reduces the complexity of managing all the languages and target architectures from O(l·t) to O(l+t), where l is the number of languages and t the number of target platforms; e.g. five languages on four platforms require 5+4 = 9 components instead of 5·4 = 20 complete compilers. This kind of modularity allows for fast development of compilers for new languages or of existing languages on new platforms: only the specific parts of the compiler have to be implemented anew. Even within the front- and back-end phases, modules may be exchanged. An example is the code optimizer, which could be replaced by an improved version without altering the rest of the compiler. The availability of tools supporting the horizontal partitioning approach simplifies the task of building a compiler (see Related Work in section 6.4 for an overview). Most of these tools support only front-end construction, but some also offer support for code generation. Code optimization in particular is hard to automate due to the variety of hardware architectures. Horizontal partitioning allows optimization to be performed on the entire program. Many optimization techniques – like register colouring, peep-hole optimization or loop unrolling – cannot be applied locally or per construct, as their mechanisms are
based on a more global scale: register colouring, for example, examines a block or a whole procedure at once. And since everything can be done within the same phase, no accessibility problems (due to encapsulation and information hiding) have to be solved.

Disadvantages: Each module of a traditional compiler contains specifications for all language constructs corresponding to its phase. That is, a module specifies only a single aspect of a construct (e.g. static semantics), but it does so for all constructs of the language. The complete specification of a single language construct is thus spread over all phases. Therefore, horizontal partitioning is not well suited for experimenting with a language, i.e. for generating various versions to gain experience with its look and feel. In general, applying even minor changes to a language can be very costly, especially if the changes affect all phases of the compiler. This is usually the case if new language features and/or new language constructs are added. There are roughly three levels of impact a change can have on a horizontally partitioned compiler:

1. Only one single phase is affected: Examples would be minor changes in the syntax, like replacing lower case keywords with upper case keywords, or an improved code optimizer.

2. Some, but not all phases are affected: This is the case if language constructs are introduced that can be mapped to the existing intermediate format. As an example, consider the introduction of a REPEAT UNTIL loop into a language that already knows a WHILE loop. In this case, the change will have an impact on the front-end phases, while the back-end remains unchanged.

3. All phases have to be changed: This is the case if the changed language constructs cannot be mapped to the existing intermediate representation. An example is the introduction of virtual functions into C in order to enhance the language with dynamic binding features. The code generator now has to emit code that determines the function entry point at run-time instead of statically generating code for procedure calls.

All these changes have in common that they potentially have an impact on the whole language. Even if only a single phase is affected, the change has to be carefully tested for undesired side-effects, which cannot be precluded due to the lack of encapsulation of language construct specifications within a phase.
3.1.2 Vertical Partitioning

Instead of modularizing a compiler along its compilation phases, each language construct is considered a module (Fig. 6).
Figure 6: Vertical partitioning along language constructs plugged into the Montage Component System. [The figure shows modules such as If-Statement, While-Statement, Assignment, Procedure Call and Return-Statement plugged into the Montage Component System.]
Such a module – we call it a Montage – contains a complete specification of this construct, including syntax as well as static and dynamic semantics. Vertical partitioning of a compiler is very similar to the way beginners do their first translations (this applies both to the translation of natural languages and to the act of understanding a program text when reading it). They try to identify the main phrases of a sentence. Then each phrase is parsed into more elementary ones until, finally, they end up with single words. Now each word can be translated and the process is reversed, combining single words into phrases and sentences. Our approach supports this idea in a similar way. A program is subdivided into a set of language constructs, each of them specified in a single module. Once the program is broken up into these units, translation or execution is simple, as only a small part of the whole language has to be considered at once. Then these modules are re-combined using predefined interfaces. Section 3.3 will elaborate in more detail how this is done.

Advantages: A vertical partitioning scheme is very flexible with respect to changes in a language. Modifications will be local, as they usually affect only a single module. As each module contains specifications for all aspects of the compilation process for a certain construct, unintended side-effects on other modules are very unlikely. This is in contrast to the horizontal partitioning scheme, where side-effects within a phase potentially affect the whole language. As an example, we already mentioned the introduction of virtual functions into
C, affecting all phases of the compiler. As the phases do not feature any form of further modularization, unintended side-effects can easily be introduced.

Component frameworks like Java Beans or COM allow new systems to be built by composing pre-compiled components. Vertical partitioning supports this approach, as Montages are compiled components which can be deployed in binary form. This in turn opens many possibilities for marketing language components. As they do not have to be available in source form – which is the case with conventional, horizontally partitioned compiler specifications – the developer does not have to give away any know-how about the internals of the language components. Not only the developer profits from pre-compiled language components; the user does so too. Combining them and testing the newly designed languages is much less complex than testing all phases of a conventional compiler. New groups of users now have access to language design, as the learning curve is flattened considerably. In the best case, a new language can be composed completely from existing language components. Normally, however, language construction is a combination of reusing components and implementing new ones. Even in this case the effort is reduced, as with the availability of a language component market, the need for implementing new components decreases over time.

Disadvantages: Constructing an efficient and highly optimizing compiler will be very difficult with our approach. Optimization relies on the capability to overview a certain amount of code, which is exactly what vertical partitioning is not about. Ease of use and flexibility of deployment are of primary interest in our system. For mainstream programming languages with a huge community, efficiency can be achieved using conventional approaches, preferably in combination with our system: language composition supports fast prototyping and is used to evaluate different dialects of a new language. When this process converges to a stable version, an optimizing compiler can be implemented using the phase model.
3.1.3 Static and Dynamic Semantics of Specifications

A language specification contains static and dynamic semantics just as a program does (in fact, language specifications often are programs: either compilers or interpreters). The partitioning schemes described above are part of the static semantics of language specifications. They define how specifications can be composed, not how the specified components interact in order to process a program.
It is important to distinguish between the structure of the specification and the structure of the transformation process that turns a program text into an executable piece of code. In a horizontal partitioning scheme, modularization is done along the transformation phases, i.e. each phase completes before the next can start. In other words, the partitioning scheme and the control flow during compilation coincide. In a vertical partitioning scheme, a program text is transformed in a similar manner as in conventional compilers: phase after phase. It is simply a causal necessity to parse a program text before its static semantics can be checked, which in turn has to be done before code generation. But control flow switches between the specification modules during all phases of compilation.
3.2 Language Composition

A note to those familiar with compiler-compilers: usually these tools rest on a specification of the translations. A Montage also specifies the behaviour of a language construct, but in a purely operational manner. Therefore, translations are programmed rather than specified.
3.2.1 The Basic Idea

The modularity available in the specification of a programming language is destroyed by most compiler construction tools due to their horizontal partitioning scheme (for a detailed discussion see section 3.1.1). Modularity and compositionality of a programming language are therefore only available to compiler implementors, but not to the users of a compiler, i.e. the programmers. Of course, a programmer is free to use only a subset of a given language, but he will never be able to extend the language with his own constructs or to recombine different languages to form a more powerful and problem-specific language.

The Montage Component System provides these features to the programmer. It allows language constructs to be specified separately. From such a specification, a language component can be generated that can be plugged into other language components, and is therefore suited to building new languages on a plug-and-play basis. In contrast to existing compiler-compilers, reuse of specifications can take place on a binary level rather than on the source text level. MCS generates compiled software components from the specifications. These can be distributed independently of any language. Each component is provided with a set of requirements and services that can be queried and matched against other
components. Nevertheless, pre-compiled language components need a certain flexibility: e.g. the syntax has to be adaptable, or identifiers need to be renamed in order to avoid naming collisions.

The creation of a new programming language is a tool-assisted process. It is possible to build a language from any combination of newly specified and existing components. The latter could e.g. be downloaded from various sources on the Internet. The system will check that all requirements of the components are met and prompt the user for missing adaptations. Once there are no more reported errors and warnings, the new language is ready to use, i.e. an interpreter is available. How does such a system work in detail? An in-depth description of the Montage Component System is given in chapter 4, and details about its design and implementation in chapter 5. To understand our approach in principle, section 3.3 below should be read first, as it provides an introduction to the Montage approach.
3.2.2 On Benefits and Costs of Language Composition

The established methods have been deployed successfully for over two decades. They produce stable, efficient and well understood compilers and interpreters. Unfortunately, they are rather rigid when it comes to changes. Especially during the design phase of a new language, greater flexibility is desirable. For example, it should be possible to produce and test dozens of variants of a new language before it is released. The most important aspect of language composition is that pre-compiled components can be reused. The advantages can be summarised as follows:

1. Language composition becomes accessible to non-experts in the field of programming language construction. Pre-compiled components will be designed, developed and distributed by experts. Languages composed of such components profit from the built-in expert knowledge.

2. The development cycle for a new language can be reduced drastically, on the one hand because the pre-compiled components need no further testing, and on the other hand due to the abbreviated specification-generation-compilation cycle: the pre-compiled components need no further compilation.

3. Reuse is done on a binary level, which reduces the possibility of text copying errors. In combination with the limited impact that a component has on the whole language, this again results in a more reliable and flexible language design method.

This list may give rise to some questions and objections, which need to be discussed:
Who composes languages? Programming language design and implementation will be simplified to the point where they become applicable by non-experts in the field of programming language construction. Is this desirable? Should this domain not be reserved for experts? Similar debates were held about operator overloading (e.g. in C++): should the programmer be given the opportunity to alter the meaning of parts of the programming language? Although language construction is much more powerful than operator overloading (which does not add any fundamental power to the language – it is just a notational convenience), the basic question remains the same: should the programmer have the same rights and power over the language as the language implementor? The effort of creating a new language is still great enough that one will think about it thoroughly before indulging in language composition. Of course there will always be designs of lower quality. But these will eventually vanish, as they will not be convincing enough to be reused in further languages. Only the best designs will survive in the long term, because their components will be reused in many different languages. Simonyi compares this process with the survival of the fittest in biological evolution [Sim96].

We foresee two main areas where vertically partitioned systems – like MCS – are particularly suitable: education on the one hand, and the design and implementation of domain-specific languages (DSLs) on the other hand:

1. In education, an introduction to programming can be taught using a very simple language in the beginning, which is then refined and extended stepwise. This solves a typical dilemma in programming courses: teaching either a mainstream programming language from the beginning or starting with a didactically more suitable language. The first approach faces the problem that one has to cope with the whole complexity of e.g. C++ from the very beginning. The second approach wastes a lot of time introducing a nice language which will not be used later on; a further effort then has to follow to teach the subtleties of a mainstream language. Using MCS, a teacher can start teaching C++ with a subset of the language that is simple and safe. Then, step by step, he can refine and extend the language, using new Montages or refined versions of existing ones. During the whole course, basically the same language can be taught; the transitions from one complexity level to the next are smooth. The system can be used to its full flexibility: for introductory courses, ready-made Montages are available which only need to be plugged together. A teacher does not need any knowledge of compiler construction; the students will use only the end-product, namely the compiler or interpreter, and do not need to understand our system. In more advanced courses,
teachers may explain details of a language construct by showing the corresponding Montage. And in compiler construction courses, the system can be used by the students to build new languages themselves.

2. Domain-specific languages are typically small languages that have a limited application area and a small community of programmers. Well known examples are Unix tools such as sed, awk, make, etc. [Sal98]. In some cases, they are employed only a single time (e.g. to control data migration from a legacy system to a newer one). In such cases it is not worth constructing optimizing compilers or inventing a highly sophisticated syntax. Often, such languages have to be implemented within very tight time bounds. In all these scenarios, language composition offers an interesting solution: creating a new language might be done within a few hours. Reuse of existing language components simplifies development and debugging and reduces the fault rate in this process [Kut01].

Isn't the flexibility of phase model approaches sufficient for language development? It is possible to produce dozens of compilers with lex and yacc, too. Of course, it is possible to generate compilers for numerous variations of a language. But normally this is a fairly sequential and time consuming process, because careful testing has to guarantee the absence of unwanted side-effects. Especially in education and DSL design, the flexibility available in traditional phase model approaches might not be sufficient. Student exercises in language design or compiler construction are a good example to elaborate on this statement. In such a course, students normally have to implement a running compiler for a simple language in the exercises. The lack of sufficient modularization within a compiler phase forces the student into an "all or nothing" approach: he has to specify and generate the phase in its full complexity (i.e. all language constructs at once) in order to have a running system that can be tested. There are a myriad of possibilities for making mistakes. This is discouraging not only for the students but also for their tutors, who have to assist them. Montages could be used in such courses to improve modularity in the student projects. Once a Montage has been compiled successfully, it can be reused without alteration in future stages of the student compiler. Changes in one Montage have limited effects on others, and thus debugging becomes easier for both student and tutor. Success in learning is one of the most motivating factors in education [F+95]. The time gained can be used to deepen the student's insight into the subject, to broaden his knowledge by covering more subjects, or simply to reduce stress in education.

Experimenting with a language will normally improve its design and its expressiveness, but experimenting takes time. The reduced development cycle
time is another reason to prefer vertically partitioned systems over phase model approaches. As the time it takes to generate a single version of a language is reduced, it is possible either to develop languages faster or to generate more variants of a language before deciding on one. Faster development is interesting to industrial DSL designers, whereas experimentation is of advantage to both students and professionals: students will get a better understanding if they are able to easily alter (parts of) a language, and professionals will profit from the experience they gain during experimentation.

How about pre-compiled compiler phases? Wouldn't they improve the performance of the established approaches? Pre-compiled compiler phases would only make sense in a few areas; in general they would even complicate language specifications, as the following examples illustrate. Changing the syntax of a while statement from lower to upper case keywords would be easy, as only the scanner phase is involved. But suppose a simple goto language is to be extended with procedures. Changes in the scanner and parser phases are obvious, but the code generator needs to be redesigned as well, as the simple goto semantics probably would not suffice to model parameter passing and subroutines efficiently. Pre-compiled compiler phases would be a hindrance in this case: the language designer would need access to the sources of the pre-compiled parts in order to re-generate them. With MCS, the same problem would be solved by introducing some Montages that specify the syntax and behaviour of procedures. Changes to existing Montages can be necessary as well, but they can be implemented elegantly by type-extending existing Montages.

No testing of pre-compiled components? Of course, the interaction of components in a newly composed language has to be tested. But these tests happen on a higher level of abstraction, closer to the language design problem. Testing of component internals does not have to be considered any more. Components interact only through well defined interfaces, which restricts the occurrence of errors, simplifies debugging, and accelerates testing in general.
3.3 The Montages Approach

Our work is based on Montages [KP97b, AKP97], an approach that combines graphical and textual specification elements. Its underlying model is that of abstract state machines [Gur94, Gur97]. The following overview introduces the basic concepts and ideas of Montages in order to provide a better understanding of the following chapters. Readers familiar with Montages may skip this section
and continue with chapter 4. Detailed information on Montages can be found in [KP97b, AKP97] and in Kutter’s thesis [Kut01].
3.3.1 What is a Montage?

A complete language specification is structured into specification modules, called Montages. Each Montage describes one construct of a programming language by «extending the grammar to semantics». A Montage consists of up to five components partitioned into four parts (Fig. 7 shows Java's conditional operator ?:).

1. Syntax: Extended Backus-Naur Form (EBNF) is used to provide a context-free grammar of the specified language L. A parser for L can be generated from the set of EBNF rules of all Montages. Furthermore, the rules define in a canonical way the signature of abstract syntax trees (ASTs) and how parsed programs are mapped onto an AST. The syntax component is mandatory; the following components are all optional.

2. Control Flow and Data Flow Graph: The Montage Visual Language (MVL) representation has been devised explicitly to extend EBNF rules to finite state machines (FSMs). Such a graph associated with an EBNF rule basically defines a local finite state machine. Each node of the AST is decorated with a copy of the FSM fragment given by its Montage. The references to descendants in the AST define an inductive construction of a globally structured FSM. Control flow is represented by dashed arrows. Data may be stored in attributes of Montage instances (in our example, the attributes staticType and value are defined for every Montage). Control flow always enters a Montage at the initial edge (I) and exits at the terminal edge (T). Control flows may be attributed with predicates; e.g. one control flow leaving the branching node in Fig. 7 carries the predicate cond.value = true. Branching of control flow may only occur in terminal nodes. This is due to the condition that there is only one control flow leaving each Montage (T). The default control flow is indicated by the absence of predicates.

3. Static semantics: A transition rule that performs static analysis can be provided. Such rules may fire after successful construction of an AST for a given program. In Fig. 7, the static type of the conditional operator is determined during static analysis, which is – in this case – not a trivial task. To enhance readability, macros can be used (e.g. CondExprType).
ConditionalExpression ::= ConditionalOrOption "?" Expression ":" ConditionalOption

Figure 7: Montage for the Java conditional expression. [The control flow graph leads from I through S-ConditionalOrOption and branches on the predicate S-ConditionalOrOption.value to S-Expression or S-ConditionalOption, joining in the node result before T. Static semantics: staticType := CondExprType(S-Expression, S-ConditionalOption). Condition: S-ConditionalOrOption.staticType = "boolean". Dynamic semantics: @result: if (S-ConditionalOrOption.value) then value := S-Expression.value else value := S-ConditionalOption.value endif.]
4. Conditions: The third part contains post-conditions that must be established after the execution of static analysis. In our example, static type checking occurs.

5. Dynamic Semantics: Any node in the FSM may be associated with an Abstract State Machine (ASM) [Gur94, Gur97] rule. This rule is fired when the node becomes the current state of the FSM. ASM rules define the dynamic semantics of the programming language. In the fourth part of Fig. 7, the ASM rule specifies what happens at runtime when a conditional operator is encountered. Rules in this part are always bound to a certain node in the FSM; the header of each rule (here: @result) defines this association. Note that there may also be predicates defined in the graphical part which are evaluated at runtime.
3.3.2 Composition of Montages

The syntax of a specified language is given by the collection of all EBNF rules. Without loss of generality, we assume that the rules are given in one of the two following forms:

A ::= B C D    (1)
E = F | G | H    (2)
The first form defines that A has the components B, C and D, whereas the second form defines that E is one of the alternatives F, G or H. Rules of the first form are called characteristic productions and rules of the second form are called synonym productions. Analogously, non-terminals appearing on the left-hand side of characteristic rules are called characteristic symbols and those appearing in synonym rules are called synonym symbols. One characteristic symbol is marked as the start symbol. It must be guaranteed (by tool support) that each non-terminal symbol appears as the left-hand side of exactly one rule. The two forms of EBNF rules also determine how a language specification can be constructed from a set of Montages by putting them together.

1. A Montage is considered to be a class (in the sense of e.g. a Java class) whose instances are nodes in the abstract syntax tree. Terminal symbols on the right-hand side of the EBNF, e.g. identifiers or numbers, are leaf nodes of the AST (represented by ovals in MVL); they do not correspond to Montages. Non-terminals, on the other hand, are (references to) instances of other Montage classes. Such attributes are called selectors and are represented by a rectangle and the prefix «S-». Each non-terminal in a Montage may have at most one incoming and one outgoing control flow arrow. This rule allows Montages to be composed in a simple way: the referenced Montage's I and T arrows are connected with the incoming and outgoing control flow arrow, respectively.

2. When using sub-typing and inheritance, synonym symbols can be considered abstract classes. They cannot be instantiated but provide a common base for their right-hand-side alternatives.

After an AST has been built for a program input, the static semantics rules may be fired. The idea is to «charge» all rules simultaneously and to trigger their firing by the availability of all occurring attributes (an attribute is available once it is no longer undef). In our example, the static semantics rule can only be fired when all referenced attributes become available, i.e. when the attributes staticType, isConst and value (used in the macro CondExprType, which is not shown here) of S-Expression and S-ConditionalOption are defined. As soon as all attributes for some Montage become available, the firing begins. In this process further attributes may be computed, and so the execution order is determined automatically and corresponds to a topological ordering of the rules according to their causal relations. This approach was adopted from [Hed99]. Eventually, all static semantics rules are fired. If not, an error has occurred: either the execution of some rules was faulty, or one or more attributes never got defined during the firing process. Usually the latter indicates design flaws in the language specification.
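Returning to the class view of item 1 above, the following Java sketch shows how rules (1) and (2) might map onto class declarations; the names are illustrative, the actual MCS declarations are given in chapter 5.

    // Synonym production E = F | G | H:
    // the synonym symbol becomes an abstract class that cannot be
    // instantiated; its alternatives are concrete subclasses.
    abstract class E { }
    class F extends E { }
    class G extends E { }
    class H extends E { }

    // Characteristic production A ::= B C D:
    // the characteristic symbol becomes a concrete Montage class whose
    // selector fields reference the component Montage instances.
    class B { }
    class C { }
    class D { }
    class A {
        B sB; // selector S-B
        C sC; // selector S-C
        D sD; // selector S-D
    }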
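The availability-driven firing scheme might be realised along the following lines – a minimal sketch with hypothetical names, assuming that undef is represented as null, so that an attribute is available once its value is non-null:

    import java.util.Deque;
    import java.util.List;

    class Attribute { Object value; } // null represents undef

    class Rule {
        List<Attribute> reads;  // attributes referenced by the rule
        Runnable body;          // fires the rule, defining further attributes

        boolean enabled() {     // all referenced attributes are available
            return reads.stream().allMatch(a -> a.value != null);
        }
    }

    class StaticAnalysis {
        static void fireAll(Deque<Rule> pending) {
            int stalled = 0;    // rules tried in a row without firing
            while (!pending.isEmpty() && stalled < pending.size()) {
                Rule r = pending.removeFirst();
                if (r.enabled()) { r.body.run(); stalled = 0; }
                else { pending.addLast(r); stalled++; } // retry later
            }
            if (!pending.isEmpty()) // some attribute never became defined
                throw new IllegalStateException("design flaw in language specification");
        }
    }

The order in which the rules actually fire is then exactly a topological ordering of their causal dependencies.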
Another approach would be to predetermine the execution order, e.g. as a preorder traversal of the AST. Experience has shown that one predetermined traversal often does not suffice. This problem can be solved, but it leads to clumsy static semantics rules because they have to keep track of «passes». Once static analysis has terminated successfully, the program is ready for execution. Control flow begins with the start symbol's I arrow. When a state is encountered, its dynamic semantics rule is executed. Control is passed to the next state along the control flow arrow whose predicate evaluates to true. Such predicates are evaluated after executing the rule associated with the source node. The absence of a predicate means either true (if there is only one control flow arrow) or the conjunction of the negations of all other predicates leaving the same source node. This scheme of local control flow has its limits, e.g. when describing (virtual) method calls (Fig. 8). Neither is the call target local to the calling Montage (with respect to the parse tree), nor can it be determined statically. In such cases, non-local jumps can be used. They are distinguished from normal control flows by the presence of a term which is evaluated at runtime to compute the jump target. Moreover, the box representing the jump target is not a selector and is therefore not marked as such. In Fig. 8 the MethodDeclaration box represents the class MethodDeclaration and the term Dispatch computes the appropriate instance at run-time.
Figure 8: Montage for method invocations (screen shot of Gem/Mex tool)
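The local control flow scheme described above might be sketched in Java as follows (hypothetical names; non-local jumps are omitted): the current state fires its rule and then follows the outgoing arrow whose predicate holds, falling back to the default arrow.

    import java.util.List;
    import java.util.function.BooleanSupplier;

    class Arrow {
        FsmState target;
        BooleanSupplier predicate; // null marks the default flow

        Arrow(FsmState target, BooleanSupplier predicate) {
            this.target = target; this.predicate = predicate;
        }
    }

    abstract class FsmState {
        List<Arrow> out;
        abstract void fireRule(); // the dynamic semantics rule of this state

        FsmState step() {
            fireRule(); // predicates are evaluated after the rule has fired
            Arrow dflt = null;
            for (Arrow a : out) {
                if (a.predicate == null) dflt = a;
                else if (a.predicate.getAsBoolean()) return a.target;
            }
            return dflt == null ? null : dflt.target; // null: terminal edge T
        }
    }

An interpreter then simply iterates state = state.step() from the start symbol's initial state until the terminal edge is reached.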
The work most closely related to Montages is the MAX system [PH97a]. Like Montages, MAX builds upon the ASM case studies for the dynamic semantics of imperative programming languages. In order to describe the static aspects of a language, the MAX system uses occurrence algebras, a functional system closely related to ROAG [Hed99]. A very elegant specification of Java using ASMs can be found in [BS98]. This specification abstracts from syntax and static semantics and focuses on dynamic semantics; its rules are presented in fewer than ten pages.
Chapter 4
From Composition to Interpretation
This chapter describes the concepts behind the Montage Component System. Algorithms and data structures are discussed in an abstract form which is neutral with respect to a concrete implementation in any specific language or component framework. Implementation details are discussed in the next chapter. Readers not familiar with the Montage approach should first read the preceding section 3.3 as an introduction. More in-depth information can be found in [AKP97, Kut01]; these publications give a well-founded description of Montages, whose mathematical background is based on abstract state machines (ASMs). For the reasons discussed in section 2.3, we focus on an implementation using a mainstream programming language. Simplicity, composability and ease of use are our main goals, and, in combination with our different formalism (Java instead of ASMs), this explains why our notion of a Montage differs in some details from the original definition. Therefore, we first present some definitions that render the notion of a Montage in MCS. These definitions are implementation independent, although they are given with an object-oriented implementation and a component framework (such as those discussed in section 6.5) in mind. After an overview of the process of transforming Montage specifications into an interpreter, its individual phases are described in detail throughout the rest of this chapter. Deviations of our approach from the original Montage approach are indicated at the appropriate places.
4.1 What is a Montage in MCS?

The following definitions are provided with regard to an implementation and reflect the necessary data structures that are used to implement the system. We
will refer to these definitions and give the corresponding class declarations in Java when discussing the implementation in the next chapter. Montages – although entities of composition – can never be executed on their own. Only as members of a language can they be deployed conveniently. Therefore, we start by defining our notion of a language.
4.1.1 Language and Tokens

A language L = (M, T) consists of a set of Montages M and a set of tokens T. A token tok = (regexp, type) is defined as a pair of a regular expression regexp defining the micro syntax and a type indicating into which type the scanned string is to be converted. Tokens are either relevant (t ∈ T_rlv), i.e. they will be passed to the parser, or they are skipped (t ∈ T_skip); whitespace and comments, for example, are of no interest to the parser and are skipped. T_rlv and T_skip are disjoint sets: T_rlv ∩ T_skip = ∅. T denotes the set of all tokens of a language: T = T_rlv ∪ T_skip.
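Expressed in Java, these definitions might look as follows; the field and class names are assumptions – the actual declarations are given in chapter 5.

    import java.util.Set;
    import java.util.regex.Pattern;

    class Token {
        Pattern regexp;   // micro syntax of the token
        Class<?> type;    // type into which the scanned string is converted
        boolean relevant; // true: passed to the parser; false: skipped
    }

    class Language {
        Set<Montage> montages; // M
        Set<Token> tokens;     // T; the relevant flag partitions T into T_rlv and T_skip
    }

    class Montage { /* defined in section 4.1.2 */ }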
4.1.2 Montages

A Montage m = (sr, P_m, cfg) can be defined as a triple consisting of a syntax rule sr, a set of properties P_m and a control flow graph cfg. Fig. 9 shows a graphical representation of a Montage.

A syntax rule sr = (name, ebnf) consists of a name (the left-hand side or production target) and a production rule ebnf (the right-hand side). The elements of a production rule which are of interest in our context are terminal symbols, nonterminal symbols, repetition delimiters (braces, brackets and parentheses) and synonym separators (vertical lines). The complete definition of the Extended Backus-Naur Form can be found in [Wir77b].

A property p = (name, value, Ref) is basically a named variable containing some value. Associated with each property is a rule specifying how its initial value is computed. In MCS, Java block statements (see [GJS96] for a definition) are used to express such rules. They may contain several (or no) references r ∈ Ref to other properties, possibly properties of other Montages. A reference represents a read access; writing to a property within an initialisation rule is prohibited. (Section 4.6 describes the use of properties in detail.)

A control flow graph is a united data structure, as it contains both an abstract syntax tree fragment and a control flow graph. Thus it can be described as a triple cfg = (N, E_ast, E_cf) containing a set of nodes N, a set of tree edges E_ast and a set of control flow edges E_cf. A node can be a nonterminal, a repetition or an action.
Figure 9: Schematic representation of a Montage.
  Syntax Rule (an EBNF production): Example ::= A {B} "text".
  Control Flow Graph (united representation of an AST fragment and control flow information): from I to T over the nonterminal nodes A and B (the latter inside a LIST) and an action node n.
  Properties (variables initialized during static semantics evaluation):
    X : int, rule: OtherName.X + DifferentName.Z
    Y : boolean, rule: true
  Actions (dynamic semantics):
    @n: {
      int i = 0; // local variable decl
      X = i;     // access property
      // additional Java statements
    }
In the tree structure, repetitions may occur only as inner nodes, actions only as leaf nodes; nonterminals may be both. If a nonterminal node is an inner node, then all its subtrees are part of the Montage that the nonterminal represents. The graphical representation of control flow graphs uses nested boxes to display the tree structure. This allows the control flow dependencies to be laid out as a plain graph. Action nodes in the control flow graph are the counterparts of properties: while properties and their associated initialisation rules define static semantics, the rules associated with action nodes define the dynamic semantics of a language. An action is thus defined similarly to a property: act = (an, Ref). It is associated with an action node an and it may also contain a block of Java statements. The same
rules as for initialisation rules apply here, i.e. from within this block, access to properties (read and write) is possible.
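A Java rendering of these structures could look as follows – again a sketch with hypothetical names, assuming that undef is represented as null, so that a property is available once its value is non-null:

    import java.util.List;
    import java.util.Set;

    class Property {
        String name;
        Object value;              // computed by the initialisation rule
        List<Property> references; // Ref: properties read by the rule
        Runnable rule;             // Java block computing the initial value
    }

    class CfgNode { }              // a nonterminal, repetition or action

    class CfgEdge { CfgNode from, to; }

    class ControlFlowGraph {
        Set<CfgNode> nodes;        // N
        Set<CfgEdge> treeEdges;    // E_ast: the AST fragment
        Set<CfgEdge> flowEdges;    // E_cf: the control flow
    }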
4.2 Overview

Montages have to be aware of each other in order to communicate and interact during interpreter generation and program text processing. Five transformation phases are necessary on the way from language specifications to an interpreted program (Fig. 10).

1. In a first step, the Registration, all Montages and tokens that are part of the new language specification have to be introduced.

2. During the Integration phase, a parser is generated that is capable of reading programs of the specified language. Simultaneously, consistency checks are applied to the Montages, i.e. the completeness of the language specification and the accessibility of all involved subcomponents are asserted.

3. The parser is then used to read a program (Parsing) and to transform it into an abstract syntax tree (AST).

4. In the next stage (Static Semantics), dependencies between the nodes of this AST are resolved by assigning initial values to all properties of all Montages.

5. Finally, the control flow graphs are connected to each other (Control Flow Composition), building a network of nodes that can be executed.

The first two phases constitute the static semantics of the language specification. This means that all necessary preparations that can be done statically are completed after integration. Further processing is done by executing the specification, namely phases three to five – the dynamic semantics of the language specification. The five steps of the transformation process also imply a shift of focus from Montages towards their subcomponents. This is reflected in Fig. 10 by the three major (intermediate) data structures that are generated during specification transformation (displayed in ellipses). As the focus shifts from Montages to their subcomponents (properties or control flow graphs), the interaction between the components gets more and more fine-grained, and the data structures become increasingly complex. The following sections provide a detailed description of the five transformation phases and the resulting data structures.
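The five phases can be traced with a small hypothetical driver; the interfaces and method names below are illustrative only, not the actual MCS API described in chapter 5.

    interface Parser { SyntaxTree parse(String programText); } // phase 3

    interface SyntaxTree {
        void evaluateStaticSemantics(); // phase 4: initialise all properties
        FlowNode composeControlFlow();  // phase 5: connect the graphs
    }

    interface FlowNode { FlowNode step(); } // executes one node, returns the next

    interface LanguageSpec {
        void register(Object montageOrToken); // phase 1: registration/adaptation
        Parser integrate();                   // phase 2: checks, scanner and parser
    }

    class Pipeline {
        static void interpret(LanguageSpec lang, String program) {
            Parser parser = lang.integrate();
            SyntaxTree ast = parser.parse(program);
            ast.evaluateStaticSemantics();
            for (FlowNode n = ast.composeControlFlow(); n != null; n = n.step()) {
                // each step executes the dynamic semantics of one node
            }
        }
    }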
Figure 10: The transformation process from specifications to an executable interpreter. [The figure shows Montages and tokens passing through the five phases – 1. Registration/Adaptation, 2. Integration, 3. Parsing (consuming the source code of the program to execute), 4. Static Semantics, 5. Control Flow Composition – and the three intermediate data structures ProgL, AST and Control Flow Network; phases 1–2 form the static semantics of the specification, phases 3–5 its dynamic semantics, ending in program interpretation.]
4.3 Registration / Adaptation

Registering simply marks a Montage (or a token) as being part of a language (Fig. 11). Most of the work performed in this phase is done manually by the user. He has to adapt imported Montages to the new environment, i.e. to the new language. This covers renaming of nonterminals, properties and actions where necessary. Token definitions have to be given in this phase as well for all variable tokens, e.g. identifiers, numbers, strings, whitespace, etc. Tokens for keywords can be generated automatically by the system (see the integration phase below).
Figure 11: Language L consisting of a set of Montages M and a set of tokens T. [The figure shows Montages M1–M6 in M and tokens T1–T3 in T; some tokens are added by the user, others are generated during integration.]
If a language is to be composed of existing Montages, then in almost every case minor adaptations have to be performed, e.g. adjusting the syntax rule to the general guidelines (such as capitalized keywords). Too stringent consistency checking of Montages in this phase would hinder flexible Montage composition, as only compatible Montages would be allowed to join the language. We consider editing a Montage in a language context (rather than out of context) less error-prone and thus more productive. Apart from enforcing set semantics (i.e. no duplicate Montages or tokens in a language), no consistency checks are necessary. This loose grouping allows for comfortable editing of Montages. A language manages a set of Montages and tokens. It returns, upon request, references to the Montages and tokens that are members of the language. It plays a central role in the integration phase, as it is the only place in a language specification where all member Montages and tokens are known. One of the Montages of a language has to be designated as the starting Montage. It is equivalent to the start symbol (a designated nonterminal symbol) in a set of EBNF rules specifying a language. The starting Montage will
begin the parsing process in phase 3 (Fig. 10). Registration has to ensure that exactly one starting Montage is selected before transformation progresses to the integration phase.
4.4 Integration

During the integration phase, tokens and Montages are integrated into a language specification. This requires parser and scanner generation as well as internal and external consistency checks.
4.4.1 Parser Generation

For each Montage m ∈ M, a concrete syntax tree cst is generated by parsing its syntax rule sr (Fig. 13 shows an example). A syntax tree reflects the structure of the EBNF rule: repetitions are represented by inner nodes, nonterminal and terminal symbols by leaf nodes. Note that the original syntax rule can always be reconstructed from cst by performing an inorder traversal (in the tree representation used in our figures, this corresponds to a traversal from top to bottom). The syntax trees of all Montages can be merged by replacing the nonterminals with references to the roots of the syntax trees of their designated Montages. This results in a parse graph, as parse trees may refer to each other mutually (Fig. 12). The parser is then ready for use (see the section on the parsing phase below).
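The reconstruction property mentioned above is easy to realise; the following sketch (CstNode and its fields are illustrative, not the classes of chapter 5) rebuilds the right-hand side of a rule by an inorder traversal:

    import java.util.List;

    class CstNode {
        String symbol = "";           // set for terminal/nonterminal leaves
        String open = "", close = ""; // "{"/"}", "["/"]" or "("/")" for repetitions
        List<CstNode> children = List.of();

        String reconstruct() {        // inorder traversal over the cst
            if (children.isEmpty()) return symbol;
            StringBuilder sb = new StringBuilder(open);
            for (CstNode c : children) sb.append(' ').append(c.reconstruct());
            if (!close.isEmpty()) sb.append(' ').append(close);
            return sb.toString().trim();
        }
    }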
4.4.2 Scanner Generation

Syntax rule parsing also generates tokens for terminal symbols. Such terminal symbols, or keywords, are easily detected as strings enclosed in quotation marks. Each keyword encountered is added to the token set of the language. Keywords are fixed tokens, i.e. they have to appear in the program text exactly as they are given within the quotation marks of the EBNF rule. In contrast, the syntax of identifiers or numbers varies and can only be specified by a rule (a regular expression), not by a fixed string. After all parse trees have been generated, the complete set of tokens is known. It is now possible to generate a scanner that is capable of reading an input stream and returning tokens to a parser. We use scanner generation algorithms as they are employed in the Lex [Les75] and JLex [Ber97] tools.
While ::= "WHILE" Condition "DO" Statement "END".
Condition = Odd | Relation
Statement = Block | If | While | Repeat | ...

Figure 12: Merging of syntax trees by resolving references from nonterminal symbols to Montages and from terminal symbols to the token table. [The figure shows the syntax trees st: While, st: Condition and st: Statement connected through their nonterminal references.]
4.4.3 Internal Consistency

Internal consistency is concerned with the equivalence between the concrete and the abstract syntax tree of a Montage. The syntax tree generated from the EBNF production reflects the concrete syntax cst, whereas the tree structure of the control flow graph defines the abstract syntax ast of the same language component (Fig. 13). If the structure of cst is not equivalent to the structure of ast, the parser will not be able to map the parsed tokens onto the given ast unambiguously and will therefore stop the transformation process. Every nonterminal symbol and repetition in cst must have an equivalent node in ast. This equivalence can be defined either manually or semi-automatically. It is not possible to identify equivalent nodes in both trees fully automatically; Fig. 14 shows why: nonterminal symbols may occur under the same name more than once, so equivalent nodes in the control flow graph cannot be found automatically. E.g. in Fig. 14: is the first occurrence of "Term" in the EBNF rule equivalent to the left or to the right "Term" node in the control flow graph? This example may seem obvious, but we will show that the answer to this question is part of the language specification itself and cannot be generated automatically.
Case ::= { "CASE" Expression "DO" [ StmtBlock ] } [ "DEFAULT" StmtBlock ] "ESAC".

Figure 13: EBNF production and control flow graph with their respective tree representations shown below (repetition nodes LIST~1, OPT~2, OPT~3; nonterminal nodes Expression, StmtBlock~1, StmtBlock~2). Structure of these trees and position of nonterminals have to match.
Manual definition of equivalent nodes (e.g. by selecting both nodes and marking them as equivalent) is the most flexible solution to the problem of multiple occurrences of the same name. It allows arbitrary nodes to be defined as equivalent. Although it would not be wise to assign e.g. an EBNF nonterminal symbol “Term” to a control flow node “Factor”, manual assignment would not prevent it. In addition to the production rule and the control flow graph, a table showing the relation between the two trees would be necessary.
Add ::= Term AddOp Term.
AddOp = "+" | "-".

Figure 14: Multiple occurrence of the same name for a nonterminal symbol.
As users will normally identify equivalent nodes by name, it is natural to define equivalence as equality of names. This equivalence could be found automatically, but as we indicated above, this is not possible for multiple occurrences of the same name. The nonterminal symbols “Term” in the syntax rule can be distinguished unambiguously by their occurrence (first and second appearance in the text) because an EBNF rule is given in a sequential manner. The same does not apply to a control flow graph, although one could argue that the given control flow would sequentialise the nodes. While this is true, such a definition may be too stringent. For our example in Fig. 14 it means that the evaluation order of the two terms is restricted to left-to-right; a right-to-left evaluation could not be specified! In some cases, control flow graphs represent a partial order and thus no unambiguous order of nonterminal nodes can be given. Fig. 15 shows such a case. Inferring from the annotation of the left edge that the left “Statement” corresponds to the THEN-clause is dangerous, as it presumes knowledge about the dynamic semantics that is not available in the syntax rule. We propose a semi-automated approach to solve the problem of unambiguously identifying equivalent nodes of the concrete and abstract syntax trees.
If ::= "IF" Expression "THEN" Statement "ELSE" Statement "END".

Figure 15: Unspecified evaluation order
As mentioned above, the occurrences of nonterminals in the EBNF rule are sequentialised by their appearance in the rule. For each nonterminal node in the control flow graph we need to provide a number that indicates the appearance in the syntax rule. This number is 1 by default, which simplifies the obvious cases, as e.g. in Fig. 13: the first and only appearance of “Condition” in the syntax rule is equivalent to the only “Condition” node in the control flow graph. If there is more than one nonterminal node in the control flow graph with the same name, then these nodes have to be enumerated in an unambiguous way, e.g. by appending ~i where i indicates the ith appearance of this nonterminal in the syntax rule. Fig. 16 illustrates an enumeration of the “Term” nonterminal nodes such that the resulting evaluation order is right-to-left.
Add ::= Term AddOp Term.
AddOp = "+" | "-".

Figure 16: Specification of a right-to-left evaluation order using node enumeration
Repetitions are enumerated regardless of their kind (option, list or group). In the EBNF rule, only opening brackets are counted, in the order of their occurrence. Fig. 13 provides an overview of all these naming conventions. We are now ready to specify what internal consistency is: a notion of equivalence between concrete and abstract syntax trees which can be summarized as follows. Let Sc = (c1, c2, ..., cn) be a sequence of nodes of cst and Sa = (a1, a2, ..., am) a sequence of nodes of ast with m, n > 0, i.e. the sequences are not empty. Sc was generated by an inorder traversal of cst, where all terminal symbols were ignored (i.e. skipped). Similarly, Sa was generated by an inorder traversal of ast where all action nodes were ignored³. Furthermore we have a function eqv: Sc → Sa that returns the equivalent control flow node for a given syntax tree node. Thus, a concrete syntax tree cst is equivalent to an abstract syntax tree ast if:
1. |Sc| = |Sa|, i.e. the numbers of nodes produced by the traversals are the same.
2. ∀i, j: i, j > 0: eqv(ci) = aj ⇒ i = j, i.e. equivalent nodes appear in the same order in both sequences.
3. Additionally, all subtrees of nonterminal nodes were skipped as well. Such subtrees reflect the tree structure of the Montage designated by the nonterminal node and thus are of external nature.
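This definition transcribes directly into code. The following is a hedged sketch (hypothetical names, not the MCS implementation) that checks both conditions given the two traversal sequences and the – manually or semi-automatically defined – equivalence mapping eqv:

    import java.util.List;
    import java.util.Map;

    public class ConsistencyCheck {
        static <C, A> boolean internallyConsistent(List<C> sc, List<A> sa, Map<C, A> eqv) {
            if (sc.size() != sa.size()) return false;            // condition 1: |Sc| = |Sa|
            for (int i = 0; i < sc.size(); i++) {                // condition 2: same order
                A a = eqv.get(sc.get(i));
                if (a == null || !a.equals(sa.get(i))) return false;
            }
            return true;
        }

        public static void main(String[] args) {
            // Two occurrences of "Term", mapped by the ~i enumeration:
            List<String> sc = List.of("Term (1st occurrence)", "Term (2nd occurrence)");
            List<String> sa = List.of("Term~1", "Term~2");
            Map<String, String> eqv =
                Map.of("Term (1st occurrence)", "Term~1", "Term (2nd occurrence)", "Term~2");
            System.out.println(internallyConsistent(sc, sa, eqv)); // true
        }
    }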
4.4.4 External Consistency

External consistency is concerned with the accessibility of (parts of) other Montages. We have seen in Fig. 12 that Montages are connected to each other when building parse graphs. Furthermore, properties of Montages may contain references to properties (of possibly other Montages), and the same references may also occur in the rules associated with action nodes. In order to function properly, access to all referenced Montages or parts of them (e.g. root of parse tree, properties) has to be guaranteed. In other words, the external consistency check has to assert that all referenced entities are available, i.e. accessible for read or write (or both) operations. There exist two different kinds of references to external entities in a Montage:

A. Textual references: If a nonterminal symbol is parsed, then its name has to designate a Montage registered with L. If no such Montage can be found, the specification is not complete and it will be impossible to continue the transformation process towards an interpreter for L. Similar rules apply for Montage properties. References to other properties may appear in their initialisation rules. A dependency relation exists between Montages M1 and M2 if an initialisation rule in M1 contains a reference to a property of M2. Montages and their dependency relations span a graph, as illustrated in Fig. 20a on p. 61. It is also possible to check whether the referred properties are within the scope of the referring initialisation rule. The scope of an initialisation rule is the Montage it is declared in, and all Montages that are accessible via nonterminals from there. Let us illustrate this with a simple grammar for an assignment (each line corresponds to a Montage):

Asg    ::= Ident "=" Expr.
Expr   ::= Term { AddOp Term }.
Term   ::= Factor { MultOp Factor }.
Factor =   Ident | Number | "(" Expr ")".
Properties shall be defined as shown in Fig. 17; the initialisation rules implement type checking. An error will be issued on checking the third property, Expr.Error, as it tries to access the property Asg.TypeOK, which is out of scope. There is no nonterminal Asg in the Montage Expr, nor is it possible to construct a path from Expr to Asg by transitively accessing nonterminals. E.g. the following would be legal:

Expr.Error := return Term~1.Factor~1.Type;
Note that Factor is a synonym rule and therefore does not have any properties. Factor~1.Type actually accesses the Type property of the underlying Montage (after the parse tree was built, see next section).
Asg.TypeOK := return Expr.Type == Ident.Type;

Expr.Type  := if (exists(Term~2)) {
                if (Term~1.Type == Term~2.Type) {
                  return Term~1.Type;
                } else {
                  return undef;
                }
              } else {
                return Term~1.Type;
              }

Expr.Error := return Asg.TypeOK;

Term.Type  := if (exists(Factor~2)) {
                if (Factor~1.Type == Factor~2.Type) {
                  return Factor~1.Type;
                } else {
                  return undef;
                }
              } else {
                return Factor~1.Type;
              }

Figure 17: Property declaration with initialisation rules
Hence, a further test should check whether a property accessed in a synonym Montage is available in all alternatives of the production. Further processing of properties has to be done during static semantics analysis and is described in section 4.6.

B. Graphical references: Nonterminal nodes may contain further (nested) repetitions and nonterminals. These refer to repetitions and nonterminals in the Montage designated by the topmost nonterminal. The nested nodes serve only as source and target nodes for control flow edges. It is not allowed to add actions to (nested) nonterminal nodes: each Montage encapsulates its internal structures such as properties and the control flow graph, and access is granted via well-defined interfaces. If actions could be added from outside, this would violate encapsulation and destroy modularity between Montages. The external consistency check completes successfully if the nesting structure of the subtree of a nonterminal node matches the designated Montage’s ast. Equivalence between nested nodes and the ast of the designated Montage is defined analogously to the internal consistency described above.
4.5 Parsing

We are now ready for the dynamic semantics part of the language specification, i.e. to execute the specification in order to read and interpret a program. The next step in the transformation process is parsing (step 3 in Fig. 10). The parsing phase is responsible for reading a program and converting it into a parse tree according to the syntax rules given by the Montages. Fig. 18 illustrates this process with an example of a simple language. Before going into details about the conversion of a program into a parse tree, we have to select a suitable parsing strategy.
Grammar of L:
Asg    ::= Ident "=" Expr.
Expr   ::= Term { AddOp Term }.
Term   ::= Factor { MultOp Factor }.
Factor =   Ident | Number | "(" Expr ")".

Program P in L:
d = (a + b) * c

Figure 18: Parsing transforms a program P of a language L into a parse tree
4.5.1 Predefined Parser

Parsing is a well understood process and is easy to automate. This might explain why the Montage approach lacks the possibility to specify parser actions or to take control during parsing in general. In the publications defining the Montage approach [e.g. AKP97, KP97b, Kut01], parse tree building is explained only as a mapping of concrete syntax (the program P) onto a parse tree. No concrete definition of the parsing method can be found. Furthermore, no mechanism for intervention during parsing is foreseen in these
publications. From a user’s point of view this omission can be seen as both a flaw and a quality of Montages. On the one hand, the experienced language designer will of course miss the tricks and techniques that allowed him to specify “irregular” language constructs elegantly and compactly. Normally, these are context-sensitive parts of a grammar where additional context information (such as type information) is necessary to parse them unambiguously. As there is no way to specify in Montages how to resolve ambiguities, the language designer is forced either to rewrite the syntax rules or to rely on the standard resolution algorithms offered by the underlying parser (if they are known at all). We will give some examples below. On the other hand, not being able to specify irregularities can be seen as a construction aid for the language designer. E.g. Appel advises that conflicts “should not be resolved by fiddling with the parser” [App97]. The occurrence of an ambiguity in a grammar is normally a symptom of an ill-specified grammar. Having to resolve it by rewriting the grammar rules is definitely an advantage for the inexperienced language designer, as it forces him to stick to a properly defined context-free grammar. The question is: should there be a possibility to control the parser in Montages? We decided against it for two reasons:

1. MCS shall stay as close as possible to the original Montages. Even without sophisticated parsing techniques, full-fledged languages such as Oberon [KP97a] or Java [Wal98] could be specified using Montages.

2. With regard to modularity and reuse of specifications, the Montage approach is in a dilemma: both the possibility to specify parse actions and the rewriting of syntax rules have disadvantages. If parse actions were allowed, they would only apply to a specific parser model (see below). One would have to stick to a certain parser (e.g. a LALR parser) to enable reuse. In particular this would be the case if the parser were specified completely by the Montages (as is done for the static and dynamic semantics). If no parse actions are allowed, the language designer is forced to rewrite the syntax rules in order to express them in a context-free manner. In extreme cases this might make it impossible to reuse a Montage as it is, because it leads to an ambiguity in the grammar.

In general, we think that the advantages of a predefined parser outweigh the complexity one would have to deal with if self-defined parse actions were allowed.
The following discussion analyses the two basic parsing strategies – bottom-up or shift-reduce parsing, such as LALR parsers, and top-down or predictive parsing, such as recursive descent parsers – with regard to Montages. For in-depth introductions to these parsing techniques, we refer to [ASU86, App97]. Both parsing approaches are applicable to Montages, as shown by Anlauff’s GEM/MEX (LALR parsing using yacc [Joh75] as a parser generator) and our MCS (predictive parsing). The choice of the parsing technique determines which classes of grammars can be processed. Both parsing techniques have their pros and cons with regard to ease of use, efficiency and parser generation.
4.5.2 Bottom-Up Parsing

The bottom-up approach reads tokens from the scanner (the so-called shift operation) until it finds a production whose right-hand side (rhs) matches the tokens read. Then these tokens are replaced by the left-hand side (lhs) of the production (which is called a reduce operation). To be precise, the matching tokens get a common parent node in the parse tree. The tree therefore grows from its leaves towards its root, which corresponds to a growth from bottom to top when considering the usual layout of trees in computer science (root at top). During the construction of a parse tree, two kinds of conflicts may occur:

Reduce-reduce conflict: The parser cannot decide which production to choose in a reduce operation. This will be the case if several Montages have the same rhs in their syntax rules. One reason for this is that during registration the equality of the two rhs was not noticed, a common mistake if complete sublanguages are registered. An example of such a sublanguage was shown in Fig. 18. If Asg is imported as a self-contained sublanguage, then the Montages Expr, Term and Factor will be imported as well. If there is already a Montage Expression registered that contains the same rhs as Expr, there will be reduce-reduce conflicts during parsing. In this case, we are grateful for the conflict, as the related warning will draw our attention to this overspecification of the language. Reduce-reduce conflicts do not only indicate overspecifications, but also pinpoint context-sensitive parts of the grammar. The following portion of a FORTRAN-like grammar offers a typical example. Note that each line corresponds to a Montage:
Stmt      =   ProcCall | Asgn.
ProcCall  ::= Ident "(" ParamList ")".
Expr      ::= Ident [ "(" ExprList ")" ].
ParamList ::= Ident {"," Ident}.
ExprList  ::= Ident {"," Ident}.
Unfortunately, the grammar is ambiguous, as the following line

A(I, J)

can be interpreted as a call to A with the parameters I and J or as an access to the array A at location (I, J). This grammar is of course not context-free, i.e. only by looking at the type declaration of A can it be decided which production to apply. In this case, the reduce-reduce conflict indicates a clumsy language design. The deployment of a standard parser generator such as yacc [Joh75] or CUP [Hud96] might be dangerous, as they implement a (too) simple resolution strategy for reduce-reduce conflicts: the first rule in the syntax specification is chosen. Montages cannot be enumerated, and thus no input order can be guaranteed that will be obeyed during parser generation. Furthermore, the second rule (Montage) will fall into oblivion as it will never be chosen. This is an unsolved problem in GEM/MEX, which delegates parsing to a yacc-generated parser.

Shift-reduce conflict: The second kind of conflict in shift-reduce parsers occurs when it is undecidable whether to perform a shift operation (read more tokens from the scanner) or a reduce operation (build a new node in the parse tree). The well-known dangling else, as in the programming languages Pascal or C, is a good example to demonstrate a shift-reduce conflict:

If ::= "if" Expression "then" Stmt [ "else" Stmt ].
The following program fragment is ambiguous:

if a then if b then s1 else s2

It can be interpreted in two different ways:

(1) if a then {if b then s1 else s2}
(2) if a then {if b then s1} else s2
Shift-reduce parsers will detect the conflict. Suppose the program has been read up to s1. Now, without further information, the parser cannot decide whether to reduce (interpretation 2) or to continue reading until s2 (interpretation 1). In Pascal and C, an else has to match the most recent possible then, so interpretation (1) is correct. By default, yacc or CUP resolve shift-reduce conflicts by shifting, which produces the desired result for the dangling else of C or Pascal.
4.5.3 Top-Down Parsing

The second method to parse a program text and build a parse tree has its pros and cons with respect to Montages too. Top-down parsers try to build the parse tree from the root towards the leaves. The parser is structured into several procedures, each of which is capable of recognizing exactly one production rule. Each of these procedures reads tokens from the scanner and decides upon their type how to continue parsing. A terminal symbol is simply compared to the expected input; lists and options are recognized in the bodies of while loops or conditional statements. Most interesting is the recognition of nonterminal symbols: it is delegated by calling the corresponding procedure in the parser. As the recognizing procedures can be called recursively (compare with the parse graph constructed in the integration phase, Fig. 12, p. 46) and because the syntax rules will be called from top to bottom⁴, such a parser is called a recursive-descent parser. As with the bottom-up parsers, we have to mention two problems that top-down parsers impose on the Montage approach:

Left-Recursiveness: A grammar which is to be recognized by a top-down parser must not be left-recursive. We illustrate this with the following grammar:

ProcCall  ::= Ident "(" ParamList ")".
ParamList ::= { Ident "," } Ident.
If the parser encounters a procedure call such as p(i) or r(i, j, k), then it will not be able to recognize its parameter list. The parser calls a recognizing procedure ParamList that tries to read all Idents and the succeeding “,” within a while loop. The problem is that the parser cannot predict whether it should enter this loop at all, and if so, when it has to exit the loop, because the first token in the repetition is the same as the one following it. Lists and options have to be used carefully if they occur at the beginning of a production rule. Fortunately, every left-recursive grammar can be rewritten to be right-recursive [ASU86]. For our example above this looks as follows:

ProcCall  ::= Ident "(" ParamList ")".
ParamList ::= Ident { "," Ident }.
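For illustration, a recognizing procedure for the rewritten ParamList rule might look as follows – a minimal Java sketch with a hypothetical token list standing in for the scanner, not MCS code. Reading one Ident first means the loop is entered only when a “,” actually follows:

    import java.util.List;

    public class ParamListParser {
        private final List<String> tokens;
        private int pos = 0;

        ParamListParser(List<String> tokens) { this.tokens = tokens; }

        private String peek() { return pos < tokens.size() ? tokens.get(pos) : null; }
        private String next() { return tokens.get(pos++); }

        // ParamList ::= Ident { "," Ident }.
        void parseParamList() {
            expectIdent();
            while (",".equals(peek())) { next(); expectIdent(); }
        }

        private void expectIdent() {
            String t = next();
            if (!t.matches("[A-Za-z]\\w*"))
                throw new RuntimeException("Ident expected, got " + t);
        }

        public static void main(String[] args) {
            new ParamListParser(List.of("i", ",", "j", ",", "k")).parseParamList();
            System.out.println("parameter list of r(i, j, k) recognized");
        }
    }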
As demonstrated here, rewriting (or left factoring, as this method is called) can often be done within the rule itself – no Montage other than ParamList is affected. The ban on left-recursive productions can be a nuisance if Montages
4. The starting production is considered to be the topmost production. All nonterminals appearing within this production are then listed with their respective syntax rules below, and so on. Hereby an order is generated that sorts productions from the most general one (starting production) down to the most specialised ones (tokens).
are imported that were developed in a system with a bottom-up parser, where this restriction does not apply.

Lookahead: A second problem with top-down parsers is that they cannot always decide which production to choose next in order to parse the input. The following fragment from the Modula-2 syntax [Wir82] shall serve as an example:

statement    ::= [ assignment | ProcCall ].
ProcCall     ::= designator [ ActualParams ].
assignment   ::= designator ":=" expression.
ActualParams ::= "(" [ExpList] ")".
Consider this program fragment as input:

a := a + 1
When the parser starts reading this line, it is expecting a statement. The next token is a designator a, which could be the beginning of the productions ProcCall and assignment. Which production should the parser choose now? There are two ways to answer this question: either it tries to call all possible productions in turn⁵, or it pre-reads the following token and gets “:=”, which identifies assignment as the next production. A parser that tries all possibilities is called a backtracking parser; pre-reading tokens is called lookahead, and it avoids time-consuming backtracking.
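The decision can be sketched in a few lines. The following hypothetical fragment (not the MCS parser) shows how a single token of lookahead separates assignment from ProcCall:

    public class StatementParser {
        private final String[] tokens;
        private int pos = 0;

        StatementParser(String[] tokens) { this.tokens = tokens; }

        // The designator is tokens[pos]; one token of lookahead decides.
        String parseStatement() {
            String lookahead = pos + 1 < tokens.length ? tokens[pos + 1] : "";
            if (lookahead.equals(":=")) return "assignment";   // e.g. a := a + 1
            else return "ProcCall";                            // e.g. p(i)
        }

        public static void main(String[] args) {
            System.out.println(new StatementParser(
                new String[]{"a", ":=", "a", "+", "1"}).parseStatement()); // assignment
            System.out.println(new StatementParser(
                new String[]{"p", "(", "i", ")"}).parseStatement());       // ProcCall
        }
    }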
4.5.4 Parsing in MCS

Considering our self-imposed preconditions – such as comprehensibility of the system and its processes, composability and compactness of a language – and the open specification of Montages with regard to parsing, a top-down parser seems more suitable than a bottom-up parser.

Top-Down Parsing: The algorithms for top-down parsing are easier to understand than those for bottom-up parsing. Shift-reduce parsers are monolithic finite state machines, usually implemented with a big parse table that steers the recognition of token patterns. As the construction of such parse tables is too much work to do by hand, the user has to rely on algorithms that are difficult to comprehend. Error detection and error recovery are also more complex to implement in bottom-up parsers. Top-down parsers, however, are subdivided into procedures, each of which can recognize exactly one syntax rule. Note that these procedures form a verti
5. E.g. first the production ProcCall: the next token must be an opening parenthesis “(”, which would fit the ActualParams production. As there is no “(”, the parser has to step back and try the next candidate production, assignment, where it is successful.
cal partitioning of the parser. Hence, the structure of top-down parsers is very similar to that of MCS. Each Montage can implement a service that is able to recognize exactly its own syntax rule. If efficiency is important, then lookaheads have to be determined. This can be done automatically by analysing so-called FIRST sets [ASU86, Wir86]. From the point of view of an MCS user, a top-down parser has its pros and cons, as indicated in the discussion above. The most important rule – no left-recursive grammars – is not as limiting as it may seem at first glance. Each left-recursive grammar can be transformed into a right-recursive one, and in many cases this is possible by just rewriting a single syntax rule. The parse algorithm is simple and corresponds to the way a human reads a grammar. If efficiency of parsing is not a major goal, then a backtracking parser even allows for parsing of ambiguous grammars. The parser could be implemented to ask for user assistance in the case of several legal interpretations of the input program. User-assisted parsing could be very useful in education, e.g. to demonstrate ambiguous grammars and their consequences for parsers and programmers (all variants of different parse trees can be tested).

Parse Graph: In order to parse a program, MCS uses the parse graph constructed in the integration phase (see Fig. 12). Each node (read: Montage) in this graph has a method that can be called to parse its own syntax rule. These methods either return a parse tree (the subtree corresponding to the parsed input) or an error (in the case of a syntax error in the program). The scanner provides a token stream from which Montages can read tokens one by one. Parsing begins at the parse method of the starting Montage. Note that in the parse graph, each Montage occurs exactly once. As in every recursive descent parser, control is transferred to the next parse method as soon as a nonterminal token is read. Then the Montage corresponding to this nonterminal takes over. When the construct has been parsed successfully, the recognized subtree is returned to the caller. Parse graph and external consistency guarantee that all necessary Montages will be found during parsing. The parse tree returned to the caller is basically an unrolled copy of (parts of) the parse graph. Its nodes are instances of Montages that represent their textual counterparts in the tree. We refer to these Montages as Instance Montages.

Instance Montages: An Instance Montage (IM) is a representative of a language construct in a program. Template Montages (Montages as we described them until now) serve as templates for IMs. They define the attributes of an instance at runtime (i.e. the dynamic semantics of a language specification), and can be implemented in two ways:
Grammar of L:
Asg    ::= Ident "=" Expr.
Expr   ::= Term { AddOp Term }.
Term   ::= Factor { MultOp Factor }.
Factor =   Ident | Number | "(" Expr ")".

Program P in L:
c = a + b

Template Montage: Parsing, Static Semantics, Dynamic Semantics
Instance Montage: Static Semantics, Dynamic Semantics

Figure 19: Parse graph to control parsing and resulting parse tree.
1. As copies of the template Montages. They are created by cloning the template. In this case, they feature all characteristics of the template Montages, only some of them will never be used, e.g. generating a parser, checking internal and external consistency, or the ability to parse a program.

2. As instances of new classes. The characteristics of such new classes are defined by the template Montages. They have the advantage that only the dynamic semantics of the specifications has to be present.

Fig. 19 illustrates the relations between Template Montages and Instance Montages. Static semantics and dynamic semantics are processed on IMs only. Additional characteristics of IMs concerning their implementation and deployment are explained in section 5.3, and section 5.2.3 provides a more detailed insight into the implementation of the parser in MCS.
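The division of labour between template and instance Montages can be sketched as follows; all class names are hypothetical and the bodies are reduced to the bare structure described above:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Hypothetical scanner interface providing the token stream.
    interface TokenStream { String peek(); String next(); }

    // A template Montage recognizes exactly its characteristic production
    // and returns the corresponding Instance Montage subtree.
    abstract class TemplateMontage {
        final String name;
        TemplateMontage(String name) { this.name = name; }
        abstract InstanceMontage parse(TokenStream in, LanguageRegistry lang);
    }

    // Instance Montages carry only static and dynamic semantics at runtime.
    class InstanceMontage {
        final TemplateMontage template;
        final List<InstanceMontage> children = new ArrayList<>();
        InstanceMontage(TemplateMontage template) { this.template = template; }
    }

    // The registry stands in for the parse graph: nonterminals are resolved
    // to the Montage registered under their name, and parsing is delegated.
    class LanguageRegistry {
        private final Map<String, TemplateMontage> montages = new HashMap<>();
        void register(TemplateMontage m) { montages.put(m.name, m); }
        TemplateMontage lookup(String nonterminal) { return montages.get(nonterminal); }
    }

    class ParseGraphDemo {
        public static void main(String[] args) {
            LanguageRegistry lang = new LanguageRegistry();
            lang.register(new TemplateMontage("While") {
                InstanceMontage parse(TokenStream in, LanguageRegistry l) {
                    // "WHILE" Condition "DO" Statement "END" would be read here,
                    // delegating Condition and Statement via l.lookup(...).parse(...)
                    return new InstanceMontage(this);
                }
            });
            System.out.println(lang.lookup("While").parse(null, lang).template.name);
        }
    }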
4.6 Static Semantics Analysis

4.6.1 Topological Sort of Property Dependencies

In order to initialize all properties, we could simply fire all their initialisation rules simultaneously⁶. This will result in some rules being blocked until the properties they depend on become available, while other properties can be computed immediately. Fig. 20 illustrates initialization by means of three Montages and their properties. Some rules depend on the results of others (e.g. M1.A) whereas some rules can fire immediately (in our case M2.C). Before initialization starts, all properties are undefined, marked by the distinct value undef. Static semantics analysis is completed when all properties are defined, i.e. ∀p ∈ P: p ≠ undef. A simultaneous firing of rules could end in a deadlock situation if initialisation rules mutually refer to each other. To avoid this situation, i.e. a system looping infinitely, it is advisable to check for circular dependencies before executing initialisation rules. This can be done by interpreting the properties and their references as a directed graph (digraph) G = (P, R) that is defined by a set of vertices P (all properties of all Montages of a language L) and a set of directed edges R (all references contained in these properties). Let P be the set of all properties of a language L and let R be the set of all references between the properties of P. We define a reference r = (s, t) as an ordered pair of properties s ∈ Psource and t ∈ Ptarget, with Psource and Ptarget being the sets of reference sources and targets respectively. We have to assert that G is a directed acyclic graph (dag)⁷. Fig. 20b shows such a graph, where we inverted the direction of all references in order to get a data flow graph. In our example, M2.C is the only rule that can fire initially. Its result triggers the computation of M1.A and M3.A, etc. Fortunately, there exists an algorithm that suits our needs very well, namely topological sorting:
6. In fact, we are describing our system based on a sequential execution model; “firing all rules in a random order” would be more precise here.

7. Formally: Let path(a, b) be a sequence of properties p1, p2, ..., pn such that (p1, p2), (p2, p3), ..., (pn-1, pn) ∈ R. The length of a path is the number of references on the path. A path is simple if all vertices on the path, except possibly the first and last, are distinct. A simple cycle is a simple path of length at least one that begins and ends at the same vertex. In a directed acyclic graph, the following holds true:
1. ∀r: (r, r) ∉ R
2. ∀r, s, t: (r, s) ∈ R ∧ (s, t) ∈ R ⇒ (r, t) ∈ R
M1: A = B + 2*M2.C      M2: C = 42
    B = M3.A                D = A < M1.A
M3: A = M2.C + 3
    E = M3.D ? 2 : undef

Figure 20: Relations between Montages and Properties
a. dependencies between Montages imposed by initialisation rules
b. data flow during initialisation of Properties
1. It checks for cycles in a graph, and
2. if no cycles are detected, it returns an order in which the initialisation rules can be fired without a single rule being blocked because of missing results.

If cycles are found, then static semantics cannot be executed. The initialisation rules of the properties participating in the cycle have to be rewritten; therefore it would be helpful if a failed topological sort returned the offending reference. Successful execution of all initialisation rules does not imply successful completion of static semantics analysis: the initialisation rules may explicitly set a property to undef (see Fig. 17 and Fig. 21). The original Montage definition features a condition part (see the example in Fig. 7, p. 35) which contains a boolean expression that has to evaluate to true. If this condition cannot be established, then program transformation is stopped. MCS does not contain such a condition part because the same result can be obtained with a property. The condition shown in Fig. 7 can be expressed with an initialisation rule in MCS as given in Fig. 21. It is possible to assign to a property the distinct value undef. According to our definition of completion of static semantics, ∀p ∈ P: p ≠ undef, a single undefined property will suffice to stop the transformation process.
if (ConditionalOrOption.staticType instanceof java.lang.Boolean) {
    return new Boolean(true);
} else {
    return undef;
}

Figure 21: Initialising a property to undef
Hence, after the topological sorting and execution of all initialisation rules, it is important to test whether all properties have been set. In section 5.3.9 we will present an algorithm that can perform static semantics analysis in O(|P| + |R|), where again |P| denotes the number of all properties of all Montages and |R| the number of all references between them. In other words, if cleverly programmed, static semantics analysis can be embedded in a topological sort.

Related Work. Topological sorting of attribute grammar actions has been described by Marti and Hedin [Mar94, Hed99]. In the GIPSY project presented by Marti, a DSL, GIPSY/L, is used to describe relations between different documents, processes, and resources in software development processes. GIPSY/L can be extended by users in order to adapt the system to expanding needs. An extensible attribute grammar [MM92] allows the specification of actions that control these processes. The order in which such actions are executed is determined by a topological sort along their dependencies. Hedin describes reference attribute grammars that do not have to transport information along the abstract syntax tree, but use references between the attributes in order to access remote data more efficiently. A topological sort has the same function as in our system: checking for cycles and determining an order for execution.
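As an illustration of how such an embedding might look, the following sketch implements Kahn’s topological sort over the property dependency graph; it detects cycles and yields a firing order in O(|P| + |R|), as every property and every reference is handled exactly once. The names are hypothetical, and this is not the algorithm of section 5.3.9 itself:

    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.Deque;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class PropertyOrder {
        // deps maps each property to the properties its rule reads.
        static List<String> firingOrder(Map<String, List<String>> deps) {
            Map<String, Integer> missing = new HashMap<>();        // unresolved inputs
            Map<String, List<String>> dependents = new HashMap<>();
            for (Map.Entry<String, List<String>> e : deps.entrySet()) {
                missing.put(e.getKey(), e.getValue().size());
                for (String src : e.getValue())
                    dependents.computeIfAbsent(src, k -> new ArrayList<>()).add(e.getKey());
            }
            Deque<String> ready = new ArrayDeque<>();
            for (Map.Entry<String, Integer> e : missing.entrySet())
                if (e.getValue() == 0) ready.add(e.getKey());
            List<String> order = new ArrayList<>();
            while (!ready.isEmpty()) {
                String p = ready.remove();
                order.add(p);                                      // p's rule may fire here
                for (String d : dependents.getOrDefault(p, List.of()))
                    if (missing.merge(d, -1, Integer::sum) == 0) ready.add(d);
            }
            if (order.size() < deps.size())                        // some rules stayed blocked
                throw new IllegalStateException("circular property dependencies");
            return order;
        }

        public static void main(String[] args) {
            Map<String, List<String>> deps = Map.of(
                "M2.C", List.of(),
                "M3.A", List.of("M2.C"),
                "M1.B", List.of("M3.A"),
                "M1.A", List.of("M1.B", "M2.C"));
            System.out.println(firingOrder(deps)); // e.g. [M2.C, M3.A, M1.B, M1.A]
        }
    }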
4.6.2 Predefined Properties

Some properties are predefined, i.e. they are available in every Montage. Their number is small in order to keep the system simple. Conceptually, predefined properties are already initialized in the parsing phase.

Terminal Synonym Properties: A terminal synonym production such as AddOp in Fig. 14, p. 48 generates a property of the same name (AddOp in this case). This property is of type String and contains the actual string that was recognized by the parser during the parsing phase. In the example of the Add Montage in Fig. 14, this would be either “+” or “-”. Terminal synonym properties are read-only properties initialised by the parser during AST building.
Parent Property: Each Montage implicitly contains a property parent, which is a reference to the parent Montage in the AST. This reference is also set by the parser during AST building and is read-only too. The parent property allows navigation towards the root of the AST, whereas nonterminals allow navigation towards its leaves.

Symbol Table Property: The last of the predefined properties is a reference to a symbol table, SymTab (see below). Again, this property is read-only, but its initialisation can be user-defined. There is a default behaviour which copies the reference to the symbol table from the parent Montage in the AST. Nevertheless, for specific cases (e.g. when a new variable is defined) a declaration can be cached in the symbol table in the initialisation rule of the property.
4.6.3 Symbol Table

The symbol table plays an important role during the static semantics phase. Basically it is a cache memory for retrieving declarations. Although it would be possible to use property initialisation to remember declarations in a subtree, it would be a tremendous overhead (and an error-prone approach) to hand these references up and down the tree during static semantics evaluation [Hed99]. The advantages and the use of symbol tables are best explained with an example:

Variable Declaration and Use: Let us have a closer look at variable declarations and variable access in a program. We will give a (partial) specification of a simple language that allows variables to be declared in nested scopes. To simplify the example, variables have an implicit type (Integer) and there is only one statement, which allows the contents of a variable to be printed. Given the following specifications:

Prog  ::= Block.
Block ::= "{" {Decl} {Stmt} "}".
Decl  ::= Ident ["=" Expr].
Stmt  =   Print | Block.
Print ::= "print" Var.
Var   ::= Ident.
The Montages which are of interest here, Block, Decl, Print and Var are given in Figures 23 through 26 respectively. Consider the following program: { int i = 2; { int i = 5; print i; } }
Figure 22: Symbol table and abstract syntax tree (panels a–c show the views of the single symbol table at different AST nodes, as discussed in the text)
In this example we have two variable declarations which occur in nested scopes. Both variables have the same name, i, but they have different values. When the print statement prints the value of i to the console, it will only see the inner variable declaration, as the scoping rules shadow the outer declaration. Thus, the output of this program will be:

5

Fig. 22 shows the AST of the program above. First we want to focus on node 7, a use of variable i. In order to provide access to the memory where the value of i is stored, the Montage Var has to get the reference from the declaration (node 5). This non-local dependency between node 7 and node 5 can conveniently be bridged by an entry in the symbol table. Whenever a variable is declared, it is added to the symbol table with its name as the key for retrieval.
Later in static semantics processing, this variable will be used and its declaration (containing the reference to its place in memory) can be retrieved by querying the symbol table. The symbol table is a component that exists independently of Montages and the AST. Its life cycle is restricted to the static semantics analysis, as it will not be used any more after all references are resolved. As mentioned above, every Montage has a predefined property SymTab that refers to the symbol table. But initialisation of this property cannot be done statically by the parser (as e.g. for the parent property). The reason for this is the ambiguous meaning of undef as the result of a query to the symbol table. Suppose Montage Var (node 7) queries for the name i in the symbol table and gets undef as a result. This could mean two things:

1. There was no declaration of a variable i.
2. The initialisation rules of node 5 have not yet fired. They might do so in the future, but then it is too late for node 7.

At least this scenario would stop and report an error. But suppose the outer declaration (node 3) fired before node 7. Then querying the symbol table would retrieve the outer declaration instead of the inner one. The program transformation would continue and generate faulty code. Therefore we have to impose an order on the initialisation. We can do this by generating dependencies among the nodes. As the symbol table has to be initialised as well, we can use the initialisation of the SymTab reference to generate a correct initialisation order. The symbol table will not change its contents at every node in the AST. So it makes sense to define as a default behaviour copying the reference from the parent node:

SymTab := return parent.SymTab;
But this behaviour can be overridden by providing a different initialisation rule. For example:

SymTab := SymbolTable st = parent.SymTab;
          st.add(Ident.Name, this);
          return st;
A new entry is added to the symbol table. It is a reference to the current Montage⁸ and it can be retrieved with the given key (Ident.Name). Note that the symbol table has to be implemented such that it can cope with multiple entries of the same name in different scopes. In our example, this means that the symbol table has to distinguish between the different entries for i and, furthermore, it has to offer a different view to different nodes. In Fig. 22
8. Denoted by this, the Java reference to the current object.
Decl ::= Ident ["=" Expr].

Prop   Type     Initialisation
name   String   return Ident.value;
value  Integer  new Integer(); // dummy value

Action
@init: if (OptInit.exists) value = Expr.value;

Figure 23: Decl Montage, variable declaration
there is only one single symbol table. To node 1 it is empty (a), nodes 2 and 3 see the declaration of the outer i (b), and the rest of the nodes see the symbol table as it is displayed at the bottom (c). There are different possible implementations that will meet all the requirements (see section 5.3.10). Initialisation of the Instance Montages in the AST of Fig. 22 happens according to the initialisation rules of the following four Montages.

The Decl Montage specifies the actual declaration of a variable. For convenience, the property name is introduced. It is initialised by retrieving the value of the token representing the identifier. The property value is the most important property in a declaration, as it holds the value of the variable at runtime. References to this property have to be established wherever the variable is accessed. Initially, this property is set to some dummy value, as there is no static evaluation of the initializing expression in this example. At runtime, the variable’s content has to be initialised to the value of the expression, if present. Nothing has to be done in the absence of the initializer, because the dummy value was already set during static semantics evaluation.

A Block is the syntactic entity of a scope. Variables declared within a scope must not have the same names; this condition is asserted⁹ by the unique property. The symbol table valid for this scope is built during initialisation of the predefined SymTab property. First, the reference to the symbol table is
9. A Java set data structure is filled with all the names of the declared variables. The add operation returns true if a name is new to the set.
Block ::= "{" {Decl} {Stmt} "}".

Prop    Type     Initialisation
unique  Boolean  Set set = new HashSet();
                 foreach decl in DeclList
                   if (!set.add(decl.name)) return undef;
                 return new Boolean(true);
SymTab  Object   SymbolTable st = parent.SymTab;
                 foreach decl in DeclList
                   st.add(decl.name, decl);
                 return st;

Figure 24: Block Montage, container of a scope
retrieved from the parent node, then all declarations are added with their names as keys.

The Var Montage shows the use of a variable. Note that this Montage only specifies static semantics, as there are no actions to perform at runtime. Read and write accesses to the value property of Var will be specified by the appropriate Montages, e.g. the Print Montage below. It is important that all variables are declared prior to their use, which is checked with the isDeclared property. SymTab denotes the predefined reference to the symbol table. As its initialisation is not overridden, it will be the same as in its parent Montage.

Var ::= Ident.

Prop        Type     Initialisation
isDeclared  Boolean  return SymTab(Ident.value) != undef;
value       Integer  if (isDeclared)
                       return SymTab(Ident.value).value;
                     else return undef;

Figure 25: Var Montage, use of a variable
The Print Montage finally shows how to access a variable’s value at runtime. Print is in a way the opposite of the Var Montage.
Print ::= Var.

Action
@print: System.out.println(Var.value);

Figure 26: Print Montage, prints the contents of a variable to the standard output stream.
It does not specify any static semantics but only runtime behaviour. The action rule accesses the value of the variable directly via its reference (the value property).
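One possible implementation satisfying these requirements – a sketch with hypothetical names, using a chain of nested scope objects rather than the single shared table of Fig. 22; section 5.3.10 discusses the actual design – lets lookups fall back to the enclosing scope, so inner declarations shadow outer ones as in the print-i example:

    import java.util.HashMap;
    import java.util.Map;

    public class SymbolTable {
        private final SymbolTable parent;              // enclosing scope, null at the root
        private final Map<String, Object> entries = new HashMap<>();

        SymbolTable(SymbolTable parent) { this.parent = parent; }

        void add(String name, Object declaration) { entries.put(name, declaration); }

        Object lookup(String name) {                   // null plays the role of undef
            Object d = entries.get(name);
            if (d != null) return d;
            return parent == null ? null : parent.lookup(name);
        }

        public static void main(String[] args) {
            SymbolTable outer = new SymbolTable(null);
            outer.add("i", "declaration of outer i (value 2)");
            SymbolTable inner = new SymbolTable(outer);
            inner.add("i", "declaration of inner i (value 5)");
            System.out.println(inner.lookup("i"));     // inner declaration shadows outer
            System.out.println(outer.lookup("i"));     // outer scope still sees its own i
        }
    }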
4.7 Control Flow Composition

4.7.1 Connecting Nodes

In the last phase of the transformation process, the control flow of a program is assembled from the control flow graphs of the Instance Montages of the AST. We will explain control flow composition by means of an example. Given the CASE statement of Fig. 13, p. 47 and the following code fragment:

CASE a < 10 DO Stmt1
CASE a >= 10 && a <= 20 DO // nothing
CASE a > 20 DO Stmt2
ESAC

Then the parser will build the AST given in Fig. 27. For convenience, the nodes in the lower levels of the AST in Fig. 27 display the program text they represent. The parser can already do a considerable amount of work concerning the “wiring”: it simply copies the control flow graph in a Montage with all its control flow edges whenever an appropriate construct is encountered in the program text. Fig. 27a shows all the connections between the nodes of the subtree of the Instance Montage Case after parsing but before control flow composition. Nonterminal nodes are placeholders for the entire control flow graph of their designated Montage. At control flow composition, the nonterminal node is replaced by the entire control flow graph of the designated Montage.
Figure 27: AST built by the parser; a) structure generated by the parser, b) after control flow composition.
All incoming edges of the nonterminal are deviated to the initial node I of the replaced graph and all outgoing edges leave from the terminal node of the Montage. Kutter illustrates this replacement excellently in [Kut01, chapter 3]. Repetition nodes indicate that their contents (their subtrees) may occur several times in the program. The number of occurrences must be in a certain range which is part of a repetition node’s definition. E.g. the definition of an option allows a minimum of zero and a maximum of one occurrence of its contents. The parser checks whether the number of actual instances is within the given range and, in addition, builds a subtree for each of these instances. This is illustrated in Fig. 27, where all three occurrences of the CASE-part are attached to the LIST~1 node.
The contents of a repetition node specify what these subtrees look like. These subtrees or subgraphs (as they are also a partial control flow) are put together by connecting the terminal node of the nth instance with the initial node of the (n+1)th instance. All incoming edges of the repetition node are deviated to the initial node of the first instance and, analogously, all outgoing edges of the list node leave from the terminal node of the last instance (illustrated in Fig. 27b). Note that the repetition nodes are present regardless of whether there is an actual occurrence in the program or not. The second CASE-part and the optional DEFAULT-part are missing in our sample code. The parser creates the nodes while copying the control flow graph; they serve the parser as stubs where it can plug in any actual instances appearing in the program. The above-mentioned stubs remain empty if there is no corresponding code available. Empty stubs cannot be removed, as they still serve a purpose: they can be used to query whether their contents were present in the program. We did this e.g. in the Decl Montage (Fig. 23, p. 66) in the action of node init. After all nonterminals have been replaced by their control flow graphs, we get a network of action nodes. We use the term network here instead of graph because the nodes and edges resemble an active communication network, with action nodes as routers (routing the control flow) with computing abilities and edges as communication lines.
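The splicing of a control flow graph into the surrounding network can be sketched as a simple edge-redirection routine. The following hypothetical fragment (not MCS code) deviates incoming edges to the initial node and lets outgoing edges leave from the terminal node, as described above:

    import java.util.ArrayList;
    import java.util.List;

    public class FlowNode {
        final String label;
        final List<FlowNode> successors = new ArrayList<>();
        final List<FlowNode> predecessors = new ArrayList<>();
        FlowNode(String label) { this.label = label; }

        static void connect(FlowNode from, FlowNode to) {
            from.successors.add(to);
            to.predecessors.add(from);
        }

        // Replace a nonterminal placeholder by a subgraph with entry I / exit T.
        static void splice(FlowNode placeholder, FlowNode initial, FlowNode terminal) {
            for (FlowNode p : placeholder.predecessors) { // deviate incoming edges to I
                p.successors.remove(placeholder);
                connect(p, initial);
            }
            for (FlowNode s : placeholder.successors) {   // let outgoing edges leave from T
                s.predecessors.remove(placeholder);
                connect(terminal, s);
            }
        }

        public static void main(String[] args) {
            FlowNode i = new FlowNode("I"), nt = new FlowNode("Statement"), t = new FlowNode("T");
            connect(i, nt); connect(nt, t);
            FlowNode subI = new FlowNode("I'"), subT = new FlowNode("T'");
            connect(subI, subT);
            splice(nt, subI, subT);
            System.out.println(i.successors.get(0).label);    // I' – incoming edge deviated
            System.out.println(subT.successors.get(0).label); // T  – outgoing edge deviated
        }
    }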
4.7.2 Execution

After the transformation process is completed, execution of the program is almost trivial. The network of action nodes can be executed by starting at the initial edge of the starting Montage. It refers to some node, which gets the execution focus, i.e. its rules are executed. Then the conditions of all outgoing edges are evaluated. If none of them evaluates to true, then the system stops; if more than one is ready to fire, then the system stops too, because an ambiguous control flow was detected¹⁰. In the “normal” case of exactly one edge being ready to fire, control is transferred to its target node. The system runs in this manner as long as there are control flow edges ready to fire.
10. A non-deterministic behaviour of the action network could also be implemented, though parallel execution semantics was not the focus of our research.
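The execution loop of section 4.7.2 can be sketched in a few lines of Java; the names are hypothetical and, as in the text, the sequential semantics stops on zero or on several firable edges:

    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.function.BooleanSupplier;

    public class ActionNetwork {
        static class Node {
            final Runnable action;
            final Map<Node, BooleanSupplier> outgoing = new LinkedHashMap<>();
            Node(Runnable action) { this.action = action; }
        }

        static void run(Node start) {
            Node focus = start;
            while (focus != null) {
                focus.action.run();                          // execute the node's rules
                Node next = null;
                for (Map.Entry<Node, BooleanSupplier> e : focus.outgoing.entrySet()) {
                    if (e.getValue().getAsBoolean()) {
                        if (next != null)                    // several edges ready to fire
                            throw new IllegalStateException("ambiguous control flow");
                        next = e.getKey();
                    }
                }
                focus = next;                                // null: no edge fires, stop
            }
        }

        public static void main(String[] args) {
            Node terminal = new Node(() -> System.out.println("T reached"));
            Node print = new Node(() -> System.out.println("print fires"));
            print.outgoing.put(terminal, () -> true);        // unconditional edge
            run(print);
        }
    }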
Chapter 5

Implementation

In this chapter, the implementation of the Montage Component System (MCS) is discussed. The system allows several Montage specifications to be composed into a language specification that can be executed, i.e. an interpreter for the specified language is generated. We begin this chapter with a discussion of what a language is in MCS (section 5.1). Section 5.2 explains syntax processing and parsing, and section 5.3 covers the static semantics analysis and control flow composition. Notice that the given code samples are not always exact copies of the actual implementation. The actual code has to deal with visibility rules and type checking and thus is usually strewn with casts and additional method calls to access private data. As we try to focus on the basic aspects, we do not want to confuse the reader with too many details. The code is trimmed and simplified for legibility: e.g. getter and setter methods became attributes, or class casts were omitted. The architecture of MCS follows the Model/View/Controller (MVC) paradigm [KP88]. The discussion presented in this chapter concentrates on the aspects concerning the Model in this MVC triad. User interfaces such as Montage editors are only mentioned occasionally.
5.1 Language

A language specification consists of Montages and tokens; they will be discussed in detail in the following sections. In order to use a Montage or a token as a partial specification of a language, it has to be registered with the language first. In MCS this is done either by creating a new Montage in the context of a language or by importing existing Montages into a language.
A Montage can be stored separately in order to be imported into a language, but it cannot be deployed outside the context of a language. Does this conform to the definition of a software component given in section 2.1? This definition identified five aspects of a component (here: a Montage). On the language level, three of them are of interest: extent, occurrence and usage. Appearance and requirements of Montages will be discussed in section 5.3. A Montage definitely is a unit of composition and is subject to composition by third parties. Montages can be stored and distributed separately, which qualifies them for the extent and usage aspects of our definition. The occurrence aspect – components can be deployed independently – has to be discussed in more detail. Independent deployment does not necessarily mean that a component is a stand-alone application. Consider a button component, for instance; it can be deployed independently of any other graphical components (such as sliders, text fields or menus), but it cannot be deployed outside the context of a surrounding panel or window. Similar rules apply to Montages. They may be viewed, edited and stored independently of other Montages (as buttons can be manipulated separately in a programming environment), but their runtime context has to be a language. Within such a context, Montages can be imported or exported separately or in groups (see section 5.3 for further details). The main graphical interface of MCS reflects the leading role that the language plays. It contains a list of all registered Montages (see Fig. 28). Here an overview of the language is given by a list of the EBNF rules of all Montages. Tokens are listed on the second panel of this user interface (see Fig. 29; this interface is explained in more detail in section 5.2.1). Plugging a Montage into a language basically means adding it to this list. The Montage is then marked as being part of this language definition; no consistency checks are performed at this time. This is necessary to allow for convenient editing of the Montages. If Montages are imported, they have to be adjusted to the new language environment, i.e. the syntax has to be adapted to match the general syntax rules (e.g. capitalized keywords) or properties of the static semantics have to be renamed in order to be used by other Montages (see section 5.3 for details). Only after all these adaptations have been performed may the interpreter for a language be generated. This happens in several steps which will be listed next and described in detail in the following sections.
Figure 28: MCS main user interface for manipulating a language.
5.2 Syntax

Syntax definitions are given in terms of EBNF productions [Wir77b]. They do not only specify the syntax of the programming language, they also declare how Montages can be combined to form a language. We distinguish between characteristic productions and synonym productions (see also section 3.3.2).

Characteristic Productions: In MCS, characteristic productions are associated with a Montage, i.e. each Montage has exactly one characteristic production. This production defines the concrete syntax of the Montage and therefore it reflects the control flow graph given in the graphical part of the Montage. The control flow graph basically defines the abstract syntax of the Montage. How this correspondence between concrete and abstract syntax is defined was explained in section 4.4.3. This strict correspondence between the control flow graph and the concrete syntax does not allow alternatives (separated by “|”) in a characteristic production. Examples of characteristic productions:
While ::= "WHILE" Condition "DO" StmtSeq "END".
Block ::= [ConstDecl] [VarDecl] {ProcDecl} "BEGIN" StmtSeq "END" ".".
Synonym Productions: Synonym production rules assign one of the alternatives on their right side to the symbol (the placeholder) on their left side. In MCS there are two different categories of synonym productions: nonterminal synonym productions and terminal synonym productions. As their names imply, the right side of a nonterminal synonym production may contain only nonterminal symbols as alternatives, whereas in terminal synonym productions only terminal symbols are allowed. Nonterminal symbols and nonterminal synonym productions are the pivot in language construction. They operate as placeholders and thus introduce flexibility in syntax rules. One possibility to enhance or extend a language is to provide further alternatives to a synonym production. Nonterminal synonym productions contain nonterminal symbols on their right side. Only one nonterminal symbol is allowed per alternative, but there may be several terminal symbols. Terminal symbols in alternatives are discarded by the parser, as they may not serve any semantic purpose. Examples of nonterminal synonym productions are:

Statement = Assign | Call | StmtSeq | If | While.
Factor    = Ident | Number | "(" Expression ")".
Terminal Synonym Productions: A Montage may feature terminal synonym productions, provided that the placeholder appears in the characteristic production of the Montage. An example:

Comparison ::= Expression CompOp Expression.
CompOp     =   "=" | "#" | "<" | "<=" | ">" | ">=".

Comparison is the characteristic production that describes the concrete syntax of the Montage. CompOp is a terminal synonym production that conveniently enumerates all comparison operators applicable in this Montage. Normally, terminal symbols are discarded when parsed. However, terminal symbols declared in a terminal synonym production are stored in a predefined property of the same name. To be precise: the property will contain the string that was found in the program text. In the CompOp example, the parser would generate a CompOp property of type java.lang.String and its value would be the actual comparison operator found. Storing these strings is necessary because, after parsing a program text, only this property contains information about the actual comparison.
5.2.1 Token Manager and Scanner

The processing of a program text begins with lexical analysis, where the input stream of characters is grouped into tokens. Each token represents a value, and in most cases this value corresponds directly to the character string scanned from the program text. Such tokens are typically keywords (e.g. if, while, etc.), separators (e.g. “;”, “(”, “)”, etc.) or operators (e.g. “+”, “*”, etc.) of the programming language and serve readability purposes or separate syntactic entities. In certain cases, however, the original string of the program has to be converted into a more suitable form. When scanning the textual representation of a number, for example, the actual character string is of minor interest as long as it can be converted into its corresponding integer value. These tokens are called literals; integers, floating point numbers and booleans are typical literals. Beyond that, strings and characters need to be converted as well, i.e. it might be necessary to replace escaped character sequences by their corresponding counterparts (e.g. the Unicode escape sequence ‘\u2021’ will be replaced by a double dagger ‘‡’).
• a name that can be used in EBNF to refer to the token,
• a regular expression that describes the token's micro syntax,
• a method that returns an object containing the converted value of this token,
• a flag signalling whether this token specifies a white space and thus will be skipped, i.e. it will not be passed to the parser.

Figure 29: Screen shot of the Token Manager. Each token is specified by a name, a regular expression, a conversion method (represented by a type name) and a skip flag indicating whether this token will be passed to the parser.
The Token Manager will generate a scanner capable of scanning the program text and returning tokens as specified. The method we chose for scanner generation is the same as in Lex [Les75] and is explained in [ASU86]: applying Thompson's construction to generate a nondeterministic finite automaton (NFA) from each regular expression, and subsequently using a method called subset construction to transform these NFAs into one big deterministic finite automaton (DFA). The application of these algorithms is unproblematic as long as all regular expressions specifying keywords are processed before those specifying literals. Normally, the character sequence representing a keyword could also be interpreted as an identifier, as the same lexical rules apply (specified by two different regular expressions).
If subset construction is fed with the regular expressions of literals first, then it will return a literal token instead of a keyword token (refer to [ASU86], chapter 3.8, for further details on this property of lexical recognizers). MCS solves this problem by numbering token specifications. Literals and white spaces are automatically assigned higher numbers¹, thus guaranteeing correct recognition of keyword tokens. The Token Manager will generate a lexical scanner on demand. The scanner is then available to the parsers of the Montages. In order to process a program text, they access the scanner's interface to obtain a tokenized form of the input stream. The Token Manager and the scanner are central to all Montages. This contradicts, in a way, the decentralised architecture of MCS. Why does not every Montage define its own scanner? Although decentralised scanners could be implemented, they do not make sense in practice. The main reason is inconsistent white space and identifier specification. In a decentralised setup, each Montage could define different rules for white spaces. Such programs would be unreadable and very difficult to write, as lexical rules might change at every token. It makes no sense for white spaces in expressions to be defined differently from white spaces in statements. An example of such a program can be found in Fig. 30. This example shows the core of Euclid's algorithm with the following white space rules: IF, REPEAT and assignments have the same rules as in Modula-2, Comparisons use underscores '_', Expressions use tabs, and StatementSequences have again the same rules as in Modula-2 except that newline or carriage return is not allowed.

REPEAT
  IF u_>_v THEN
    t := u; u := v; v := t;
  END;
  u := u - v;
UNTIL u_=_0;

Figure 30: An example of a program with different white space rules.
In a language specification, white spaces and literals would have to be specified redundantly in many different Montages, thus making them error-prone. Unintended differences could easily be introduced by reusing Montages from different sources.
1
Literals and white spaces are numbered starting from 10000, assuming that there will never be more than 10000 keywords, separators and operators in a language. This simple method avoids renumbering of tokens. Remember that keyword tokens will not be registered before parser generation, i.e. user-specified literals and white spaces are entered first, which rules out a first-come-first-served strategy.
By using a single scanner for all Montages, the user of MCS gains simplicity, consistency and speed at the cost of some (usually unwanted) expressive power.
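The effect of the numbering scheme can be sketched as follows; the method name and representation are our assumptions, but the priority rule is the one described above:

// Sketch: when a DFA state accepts for several token specifications,
// the specification with the lowest number wins. Keywords, separators
// and operators receive numbers below 10000, literals and white spaces
// start at 10000, so a keyword always beats the identifier literal.
static int resolveAccepting(int[] acceptingIds) {
    int best = Integer.MAX_VALUE;
    for (int i = 0; i < acceptingIds.length; i++) {
        if (acceptingIds[i] < best) {
            best = acceptingIds[i];
        }
    }
    return best;
}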
5.2.2 Tokens

When tokens are registered with the Token Manager, they get a unique id that helps to identify a token later in the AST. The id is an int value and can be used for quick comparisons; e.g. it is faster to compare two integers than to compare two strings each containing "IMPLEMENTATION" (the longest Modula-2 keyword). Furthermore, each token features a value and a text. The value is an object of the expected data type; e.g. if the token represents a numerical value, then value refers to a java.lang.Integer or a java.lang.Float object. The token's text, however, contains the character string as it was found in the program code and is always of type java.lang.String. The data type of value is known either through the token's id, or it can be queried, e.g. using Java's instanceof type comparison. Fig. 31 shows the token classes that are available. The classes Token and VToken are abstract. VTokens return values that have to be transformed from the text in the program. The types of these values are standard wrapper classes from the package java.lang: Boolean, Character, Integer, Float, and String respectively.
Token
├─ KeywordToken
└─ VToken
   ├─ BooleanToken
   ├─ CharacterToken
   ├─ IntegerToken
   ├─ RealToken
   └─ StringToken

Figure 31: Token class hierarchy
5.2.3 Modular Parsing

Each Montage is capable of parsing the construct it specifies. In order to do so, a parser has to be generated first. This is done by parsing the EBNF rule and building a syntax tree (the EBNF tree, i.e. the concrete syntax tree) for each Montage. The nodes of the EBNF tree are of the same types (classes) as the nodes of the control flow graph. These classes will be described in section 5.3. Concrete and abstract syntax tree of a Montage are very similar, which has been commented on in section 4.4.3 on “Internal Consistency”. Generating the parser is done in the integration phase of our transformation process; thus the EBNF tree is part of a Template Montage (see p. 58) and will not be used in Instance Montages. Parsing a program text is now as simple as traversing these EBNF trees and invoking the parse method on each visited node, beginning with the EBNF tree of the start Montage of the language. Each node in an EBNF tree belongs to one of the following categories and has parsing capabilities as described:

Terminal symbol: In an EBNF rule, terminal symbols are enclosed in quotation marks, e.g. "if". When the parsing method of a terminal symbol node is invoked, it gets the next token from the scanner and compares it to its own string. An exception is thrown if the comparison fails. Upon success, a token representing the terminal symbol is returned. Terminal symbols are normally not kept in the abstract syntax tree (see the discussion in the previous section). An exception of sorts are terminals stemming from terminal synonym productions, whose text is stored as a predefined property. Note that, upon parsing, the terminal symbol cannot decide on its own whether it should be discarded or not; therefore a token is returned upon every successful scan.

Nonterminal symbol: In EBNF, these are identifiers designating other Montages. When the parser encounters a nonterminal node in an EBNF tree, parsing is simply delegated to the Montage that the nonterminal node represents. This Montage in turn traverses its own EBNF tree in order to continue the parse process. Of course, the nonterminal nodes in the EBNF have to be aware of the Montages they represent. This awareness is achieved during the integration phase as described in section 4.4.

Repetition rule: Repetitions are marked by “{...}” or “[...]”. The contents of a repetition is contained in the child nodes of a repetition rule. During parsing, the parsers of these children are called in turn until an error occurs. If this error occurs at the first child, the repetition has been parsed completely; otherwise an error has to be reported to the calling node.
Note that it is possible to get an error on the very first attempt to parse a child. This means that the optional part specified by the repetition was not present in the code. The parser is also responsible for checking whether the actual number of occurrences of the repetition contents is within the specified range, i.e. min ≤ actual occurrences ≤ max, where min and max denote the minimal and maximal allowed occurrences respectively.

Alternative rule: An alternative rule separates different nonterminals by a vertical line "|". The parser tries to parse each alternative (nonterminal) in turn. If all alternatives report errors, then an error has to be reported to the calling node. For a successful attempt to parse an alternative, different strategies can be implemented. Either the first alternative that reports no errors is chosen and its parse tree is hooked into the AST, or the remaining alternatives are tested as well. If additional alternatives report valid parse trees, there are again two choices: either stop parsing because of an ambiguous grammar, or allow user assistance. We chose the latter in both cases: testing all alternatives and allowing user intervention. The user is presented with the set of valid alternatives and may choose the one that will be inserted in the AST. This approach substitutes parser control in a certain way, as it allows specifying ambiguous, non-context-free grammars (see also the discussion in sections 4.5 and 7.2.1).

Terminal synonym rules are treated differently. The string that was read by the scanner is stored in a predefined property which has the same name as the terminal synonym rule. As the result of a terminal synonym rule can only be one single string, this is the most efficient way to handle these tokens; no additional indirection (tree node access) is necessary to access its value.
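The following sketch illustrates the alternative-rule strategy of testing all alternatives; the parser interface and exception class are stand-ins for the corresponding MCS classes, and resetting the scanner position between attempts is omitted:

// Stand-ins for the real MCS types (assumed, for illustration only)
interface AltParser { Object parse() throws ParseException; }
class ParseException extends Exception {}

// Try every alternative; more than one valid parse tree means the
// grammar is ambiguous at this point and the user may choose.
static java.util.List parseAlternatives(java.util.List alternatives)
        throws ParseException {
    java.util.List valid = new java.util.ArrayList();
    for (java.util.Iterator it = alternatives.iterator(); it.hasNext();) {
        AltParser p = (AltParser) it.next();
        try {
            valid.add(p.parse());     // remember every successful alternative
        } catch (ParseException e) {
            // this alternative does not match; try the next one
        }
    }
    if (valid.isEmpty()) {
        throw new ParseException();   // report failure to the calling node
    }
    return valid;                     // size > 1: ask the user to decide
}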
5.3 Data Structures for Dynamic Semantics of Specification

The description of the tasks of the static semantics analysis and control flow composition in sections 4.6 and 4.7 was given in chronological order. This provided an overview of the various relations between Montage properties, control flow graphs, abstract syntax trees etc. We would now like to concentrate on the data structures used for the implementation. The following sections do not reflect any chronological order of the transformation process but rather follow the hierarchy of our main classes: the nodes in the AST.
5.3.1 Main Class Hierarchy

The main data structure we have to deal with is the Montage. We decided to model a Montage as a Java class. And although it is one of the most important classes in our framework, it plays a rather marginal role in the class hierarchy we defined. When modelling the classes and their relations, we proceeded from the assumption that the most important data structure is not the Montage but rather the abstract syntax tree (AST) built by the parser. This tree manages all the important information about the program and its static and dynamic semantics. The whole dynamic semantics of the specification (see Fig. 10, p. 43) is centred around the AST. Taking this point of view, the first question is: what objects will be inserted in the AST? We have already seen the different kinds of nodes that populate the AST: nonterminals, actions, repetitions, initial and terminal nodes, and Montages. When analysing the relations between these nodes, certain common properties can be factored out (candidates for abstract classes). After some experimenting, we found the class hierarchy given in Fig. 32 to be the best-fitting type model for our system.
CFlowNode
├─ Action
├─ I
├─ Terminal
├─ T
└─ CFlowContainer
   ├─ Nonterminal
   │  └─ Montage
   ├─ Synonym
   └─ Repetition

Figure 32: The MCS class hierarchy
At the root of this hierarchy is an abstract base class CFlowNode that represents all common properties of a control flow node. E.g. it implements a tree node interface (javax.swing.tree.MutableTreeNode) that enables it to be managed by (visual) tree data structures.
It also manages incoming and outgoing edges in order to be usable as a node in a graph. In other words, a CFlowNode object is capable of being a node in a tree and a node in a graph simultaneously. Furthermore, an abstract method for parsing is defined. CFlowContainer is an abstract class, too, with the additional capability to manage a subtree of CFlowNodes. The concrete classes then implement all the specific features that the different nodes need to perform their duties. We will present them in more detail in the following sections.
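A rough skeleton of these two abstract classes, as we understand the description (the real classes additionally implement all MutableTreeNode methods, which are omitted here; ParseException is the MCS exception from Fig. 38):

// Skeleton only; the actual MCS classes differ in detail.
abstract class CFlowNode /* implements javax.swing.tree.MutableTreeNode */ {
    // edges that make this object usable as a graph node
    protected java.util.List incoming = new java.util.ArrayList();
    protected java.util.List outgoing = new java.util.ArrayList();

    // every node knows how to parse its part of the EBNF rule
    public abstract void parse() throws ParseException;
}

abstract class CFlowContainer extends CFlowNode {
    // additionally manages a subtree of CFlowNodes
    protected java.util.List children = new java.util.ArrayList();
}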
5.3.2 Action

An Action object represents an action node at runtime. Each Action object is a JavaBean listening to an Execute event and featuring a NextAction property. Fig. 34 sketches the declaration of the Action class. The action provided by the user is encapsulated in an object implementing the Rule interface.

interface Rule {
    // fire the dynamic semantics rule
    public void fire();
}

Figure 33: Declaration of interface Rule
MCS encapsulates the code that the user provided for a specific action in a class that is suited to handle all the references and that has all the access rights needed. When executing a program, action nodes are wrapper objects hiding the different implementations of their rules and thus simplifying the execution model. Action nodes do not need to implement any parsing action², as they have no representation in the syntax rule.
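For example, a user action could be wrapped as follows; the action body is invented, while the Rule interface and the Action constructor are those of Figs. 33 and 34:

// A hypothetical user action wrapped into a Rule object
class RuleExample {
    static Action makeActionNode() {
        Rule rule = new Rule() {
            public void fire() {
                // user-specified dynamic semantics would go here,
                // e.g. computing and storing a property value
                System.out.println("action fired");
            }
        };
        return new Action(rule);
    }
}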
5.3.3 I and T Nodes

In [AKP97, Kut01], I and T in the control flow graphs denote the initial and terminal edge respectively. From an implementation point of view it turned out to be easier to implement an initial and a terminal node³.
2
In order to be a concrete class, Action implements the parsing method, but it is empty.
3
Terminal node denotes a T node here, in contrast to a node representing a terminal symbol, which is of class Terminal. Class names are printed in italics.
class Action extends CFlowNode implements ExecuteListener {
    // an object containing the rule to execute
    private Rule rule;

    // Constructor
    public Action(Rule rule) {
        this.rule = rule;
    }

    // handle the Execute event
    public void executeAction(ExecuteEvent evt) throws DynamicSemanticsException {
        rule.fire();
    }

    // NextAction property
    public Action getNextAction() {
        Action next = null;
        for (Iterator it = outgoing.iterator(); it.hasNext();) {
            Transition t = (Transition) it.next();
            if (t.evaluate()) {
                next = t.target;
                break;
            }
        }
        return next;
    }
}

Figure 34: Declaration of class Action
Simply think of the I and T letters as nodes instead of annotations of the initial and terminal edge. This model has several advantages:

• Because there is only one edge class, there is no need to distinguish between ordinary edges and edges with 'loose ends' (which the initial and terminal edges in fact are, because they are attached to only one node). Having two more node classes does not hurt, as we have to distinguish between the different kinds of nodes anyway.

• Connecting Montages is easier, as we can merge nodes instead of tying loose ends of edges. A T node can be merged with an I node (of the following Montage) by diverting all incoming and outgoing edges of the I node to the T node. When all I and T nodes are merged this way, only T nodes remain.
Figure 35: ‘Wiring’ a Montage
• The only difference between T nodes and Action nodes is that the former do not fire any actions. Evaluating the conditions on the outgoing edges works exactly the same way as in Action nodes.

• I nodes neither fire actions nor evaluate outgoing edges.

Fig. 35 illustrates the 'wiring' of the control flow of a While Montage surrounded by other Montages. Its Expr nonterminal node is expanded in Fig. 35b and collapsed again in Fig. 35c. Finally, in Fig. 35d, all I nodes were removed and their attached edges diverted to T nodes. In this process, only one I node remains untouched: the initial node of the start symbol's control flow. This is where execution of the program will begin.
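The merge of an I node into a T node could look roughly like this; the target field follows Fig. 34, everything else (the source field, the method itself) is our assumption:

// Sketch: divert all edges attached to the I node to the T node,
// after which the I node can be dropped.
static void merge(T tNode, I iNode) {
    for (java.util.Iterator it = iNode.outgoing.iterator(); it.hasNext();) {
        Transition t = (Transition) it.next();
        t.source = tNode;          // edge now starts at the T node
        tNode.outgoing.add(t);
    }
    for (java.util.Iterator it = iNode.incoming.iterator(); it.hasNext();) {
        Transition t = (Transition) it.next();
        t.target = tNode;          // edge now ends at the T node
        tNode.incoming.add(t);
    }
    iNode.outgoing.clear();
    iNode.incoming.clear();
}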
5.3.4 Terminal

A node of class Terminal is capable of parsing a terminal symbol. Terminal nodes are not inserted in the AST, but they are members of the EBNF trees which reflect the EBNF rule. During the integration phase, when the parse tree is generated, a Terminal node registers its terminal symbol with the Token Manager and receives a unique id in return. In the parsing phase, when a token is parsed, the Terminal node compares the id of the encountered token to the previously stored one. If they correspond, parsing may continue and the token is returned to the calling node in the parse tree. Usually this node is content with a positive result and does not store the terminal symbol in the AST. However, future versions of MCS could insert Terminal nodes as well, for example to improve debugging services.
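Reduced to its essence, Terminal's parse step might read as follows; the scanner access is assumed:

// Sketch of Terminal.parse(): compare the scanned token's id with the
// id this node received when registering its symbol.
public Token parse() throws ParseException {
    Token token = scanner.nextToken();   // scanner interface is assumed
    if (token.getId() != this.id) {
        throw new ParseException();      // wrong terminal symbol
    }
    return token;                        // caller usually discards it
}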
5.3.5 Repetition

Repetition objects represent lists, options or groups. They are CFlowContainers because they maintain their contents as a subtree (see e.g. Fig. 13). There is no need for specialised subclasses for these three kinds of repetitions, because they would not differ sufficiently. We decided to introduce two attributes, minOccurrence and maxOccurrence, which can be accessed as Java properties. They determine the minimum and maximum number of occurrences of the repetition contents in a program text. The default values are given in the following table:
Repetition    min    max
List          0      java.lang.Integer.MAX_VALUE (a)
Option        0      1
Group         1      1

(a) the maximum integer value, which comes as close to ∞ as possible.
Of course, minimal and maximal occurrences are not bound to these numbers and can be set freely by the user, provided that min ≤ max. When internal consistency is checked, the values of min and max can be unambiguously assigned to the concrete syntax tree instance of a repetition. This is necessary, as EBNF cannot express any other values for min and max than the ones shown in the table above. Repetition nodes in the AST serve as containers for the nodes of the actual occurrences. Edges leaving from or going to a repetition are managed by these nodes. The actual instances of the repetition body are stored below a Repetition node in an array.
Figure 36: Difference between repeated occurrence and list occurrence of statements
Fig. 36b shows what the AST of a list of two occurrences of a statement looks like, in contrast to Fig. 36a, which illustrates a concatenated occurrence of statements in the specification. Notice that in the graphical representation the initial and terminal edges can be left out if there is only one node in the repetition; in this case it is obvious where these edges have to be attached. A Repetition object has to guarantee that there exist one I node and one T node, either explicitly set by the user or implicitly assumed as described above. The numbered nodes visualize the array buckets in which the actual instances are stored; they do not exist as nodes in the AST.
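The occurrence check described above might be implemented along these lines; parseBody() is a hypothetical stand-in for parsing one instance of the repetition contents:

// Sketch of the repetition parsing loop with the bounds check
void parseRepetition() throws ParseException {
    int count = 0;
    boolean more = true;
    while (more) {
        try {
            parseBody();       // one instance of the repetition contents
            count++;
        } catch (ParseException e) {
            more = false;      // first failing attempt ends the repetition
        }
    }
    if (count < minOccurrence || count > maxOccurrence) {
        throw new ParseException();   // outside the allowed range
    }
}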
5.3.6 Nonterminal

Nonterminal is a concrete subclass of CFlowContainer, as it is possible to nest nonterminals (see Fig. 9, p. 41). Nonterminals serve as placeholders for their designated Montages; therefore the most important extension in the Nonterminal class is a reference to the designated Montage. This reference is set during the integration phase (section 4.4). Parsing is delegated to the Montage the Nonterminal object refers to.
5.3.7 Synonym

A Synonym object has two responsibilities: trying to parse the specified alternatives and representing the actually parsed instance in the AST.
Grammar of L:
    Asg  ::= Ident "=" Expr.
    Expr ::= Term { AddOp Term }.
    Term ::= Factor { MultOp Factor }.
    Factor = Ident | Number | "(" Expr ")".

Program P:
    d = 5*c

Figure 37: AST and compact AST of a program
Parsing has already been described in section 5.2.3. It is important not to lose information about the origin of the nodes in the AST representing the parsed program. It is common that a Montage refers to another one not by its proper name but by the name of a synonym production. Fig. 37 shows such a situation. A Factor node will always have exactly one child; therefore the Factor node and its child could be merged. But it is important not to 'forget' that there was a synonym node in between. If this information was lost, it would not be possible to refer to Factor in the Term Montage. E.g. in a Term Montage there would probably be some action rule similar to:

value = Factor~1.value * Factor~2.value
which computes the result of the multiplication. Referring to Factor is the only possible way, as we do not know at specification time what kind of Factor we will encounter in the program text. As we may not remove the synonym node information, we do not remove the entire node from the AST. This is in contrast to the recommendations of the original Montage specification [AKP97]. From an implementation point of view, we cannot simplify matters by compacting the trees: the gain in memory is not worth the extra coding, and the additional indirection on accesses to Num, Val or Expr is certainly less time-consuming than querying the node for information about its previous ancestors.
5.3.8 Montage

The most complex data structure in our framework is the class Montage. The complexity can be explained by the versatile use of a Montage object. For simplicity of implementation we do not distinguish between Template Montages and Instance Montages but rather use the same data structure for both. The overhead we incur in terms of memory is not as bad as it may seem at first glance: there are only few class attributes that are exclusively used by the registration and integration phases. Most items are needed in static semantics analysis and control flow composition as well. The methods needed to implement Instance and Template Montages are basically the same, which would result in practically the same implementation, i.e. the Instance Montage implementation would be a subset of the Template Montage implementation. Not distinguishing between the two therefore has the advantage that changes in the code have to be made in only one class. Fig. 38 shows the interface of the class Montage. Note that Montage is a CFlowNode, which enables it to be a node in an AST. Furthermore, it inherits the behaviour of a CFlowContainer; thus it is capable of managing subtrees of CFlowNodes. All methods and attributes related to these abilities were already implemented in Montage's superclasses. The methods in Fig. 38 are grouped according to the underlying data structures and their appearance in the graphical representation. The first three methods are concerned with terminal synonym rule handling. Synonym rules (objects of class Synonym) can be added to, retrieved from and deleted from a Montage object; they are referenced by name. Adding and deleting is usually done manually by the user when editing a Montage during the registration/adaptation phase. The same holds for the next group of methods, which is concerned with the construction of the control flow. These methods allow to create Actions, Nonterminals, Repetitions and Transitions (control flow edges). They are convenience methods, as it would also be possible to call the respective constructors directly and insert the nodes 'manually' by calling the corresponding tree handling methods. But using these convenience methods has the advantage that nodes inserted this way are guaranteed to be correctly set up: e.g. parents are first checked whether they can hold a new node, and it is guaranteed that a transition always connects two valid nodes. Removal does not need such consistency checks and therefore can be done by calling the tree handling methods inherited from class CFlowNode.
public class Montage extends NonTerminal {
    // data structure managing transitions
    protected I initial;
    protected T terminal;

    // Terminal Synonym Rules
    public void addSynonymRule(Synonym sr);
    public Synonym getSynonymRule(String name);
    public void removeSynonymRule(String name);

    // Editing the Control Flow Graph
    public Action newAction(String name, CFlowNode p);
    public NonTerminal newNonTerminal(String name, CFlowNode p, int cor);
    public Repetition newList(String name, CFlowNode p);
    public Repetition newOption(String name, CFlowNode p);
    public Repetition newRepetition(String name, CFlowNode p, int min, int max);
    public Transition newTransition(String label, CFlowNode from, CFlowNode to);
    public I setInitialTransition(CFlowNode node);
    public T setTerminalTransition(CFlowNode node);

    // Properties
    public void addProperty(Property p);
    public Property getProperty(String name);
    public void removeProperty(String name);

    // Action
    public void addActionRule(Action node, Rule r);
    public void removeActionRule(Action node);

    // Registration phase
    public void setLanguage(Language newLanguage);
    public Language getLanguage();

    // Integration phase
    public void generateParser() throws StaticSemanticsException;

    // Parsing phase
    public void parse() throws ParseException;
}

Figure 38: Interface of class Montage
Setting an initial and a terminal transition is done by marking a node in the subtree as the target of the initial transition or as the source of the terminal transition of the Montage, respectively.
Internally, the Montage will allocate an I or T object that represents the corresponding edge, as described in section 5.3.3. Properties are modelled in a class of their own, which is described in the next section. They can be added to and removed from a Montage during editing, and (important for static semantics) they can be retrieved by name. The two methods concerned with actions allow to add an action to and remove it from an Action node respectively. Note that the interface Rule has already been introduced in Fig. 33. In the registration phase, when a Montage is associated with a language, the setLanguage() method is used to set a reference from the Montage back to the language. This is necessary so that e.g. the parser generator has access to the Token Manager that is stored with the language. getLanguage() is also used during the static semantics phase to find other Montages of the same language in order to access their properties. Consistency checks and the generation of a parser are done in the method generateParser(). If it throws an exception, an error occurred. StaticSemanticsException is the base class of various more detailed exception classes that can be thrown upon the many possible errors. Any errors are also reported to System.err, the standard error stream in Java. The method parse() can, of course, only be invoked after successful parser generation. It too can throw various exceptions (among them NoParserAvailableException), which are subclasses of ParseException. Static semantics analysis and control flow composition can be done without any access methods in class Montage: both phases operate directly on properties and AST/control flow graphs respectively.
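A hypothetical use of the convenience methods could look like this; node names and the transition label are invented, and the cor argument of newNonTerminal is passed a placeholder value:

// Sketch: wiring a small control flow with the methods of Fig. 38
static void buildControlFlow(Montage m) throws StaticSemanticsException {
    NonTerminal expr = m.newNonTerminal("Expr", m, 0);
    Action act = m.newAction("evaluate", m);
    m.newTransition("value.true", expr, act);
    m.setInitialTransition(expr);    // the I edge points to the Expr node
    m.setTerminalTransition(act);    // the T edge leaves the action node
    m.generateParser();              // consistency checks, then the parser
}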
5.3.9 Properties and their Initialisation

Property Declaration. An MCS property is represented by an object that implements the Property interface given in Fig. 39. Declaring Property as an interface has advantages over declaring it as a class: we are not forced to implement it as a subclass, and thus virtually any object can be made a property by simply implementing this interface. Being able to alter the name of a property is crucial, as only a change in name allows us to adapt an imported Montage to the needs of its new language environment. When importing a Montage into a new language, there will probably be naming conflicts. E.g. an initialisation rule may refer to a property named value. The imported Montage may feature a matching property, but its name may be val. Renaming val to value resolves this property reference.
interface Property {
    // user view of a property
    public void setName(String name);
    public String getName();
    public Class getType();
    public void setValue(Object value);
    public Object getValue();

    // building the dependency graph
    public void checkReferences(Language language) throws ReferenceException;
    public void resolveReferences(Montage montage) throws ReferenceException;

    // methods for topological sorting
    public boolean isReadyToFire();
    public void markReady(Property p);
    public Iterator dependent();

    // initialisation of the property
    public void initialize() throws StaticSemanticsException;
}

Figure 39: Declaration of interface Property
The type of a property cannot be set; this guarantees that properties remain compatible despite renaming. The type of a property is defined by the implementation of the getType() method. Setting a new value has to be done in accordance with the type of the property, i.e. it is the responsibility of the setValue() method to ensure that the stored value is of the same type as returned by getType(). When reading a value, an object of the Java base class java.lang.Object is returned; the receiver of this value may assume that it is of the expected type. During the integration phase, the method checkReferences() is called for all properties of the Template Montages. It is responsible for finding all properties that are referred to from the initialisation rule. The argument language provides access to the other Montages of the specified language. If a property cannot be found, an exception is thrown. The counterpart of checkReferences() in the static semantics phase is the method resolveReferences(). It is called for the properties of Instance Montages. It resolves the property references in the initialisation rules by finding the target properties in the AST. In order to do so, it needs access to the current Instance Montage, which is given by the argument montage. The property is then registered with the target property.
Figure 40: Dependencies among properties
In this way, each target property builds up a list of its dependent properties. These reversed references can also be seen as dataflow arrows, as the computed values will flow along them. Fig. 40 illustrates these two kinds of dependencies among properties: the solid arrows indicate references between properties ('A refers to B'), whereas the dashed arrows indicate the dataflow, i.e. initialisation of properties proceeds along the dashed arrows. For determining the firing order of the initialisation rules, some helper methods are needed. isReadyToFire() indicates whether all referred properties are available, i.e. whether their values have been computed (are no longer undef). When, during the firing of the initialisation rules, a referred property becomes available, this is signalled through the markReady() method; its argument tells which property has become available. In order to traverse the dependency graph of the properties, it is important to have access to the dependent properties. This access is granted by the java.util.Iterator object returned by the dependent() method. Note that, internally, each property will probably keep a list of references in order to process the resolveReferences() method efficiently. Finally, initialize() invokes the initialisation rule. It is completely up to the implementation how this rule is processed; the only requirement is that value is initialised with an object of the expected type (i.e. Java class).

Firing the Initialisation Rules. The concept of firing the initialisation rules of all Montages has been explained in section 4.6.1; now we describe the announced algorithm. Topological sorting can be done in O(|P| + |R|) [Wei99], with P being the set of all properties and R the set of all references among them. Resolving references and initialising properties can be done with the same algorithm. Note that O(|P| + |R|) is the runtime complexity of the algorithm given in Fig. 41.
void topsort(Montage montage) throws StaticSemanticsException {
    Stack s = new Stack();
    Set toProcess = new TreeSet();
    Property p, q;
    for each property p {
        toProcess.add(p);
        p.resolveReferences(montage);
        if (p.isReadyToFire()) {
            s.push(p);
        }
    }
    while (!s.empty()) {
        p = s.pop();
        toProcess.remove(p);
        p.initialize();
        for each q adjacent to p {
            q.markReady(p);
            if (q.isReadyToFire()) {
                s.push(q);
            }
        }
    }
    if (toProcess.size() > 0) {
        throw new CycleFound(toProcess);
    }
}

Figure 41: Pseudocode to perform initialisation of properties
It does not consider any additional runtime effort of the initialisation rules. Since we cannot influence the user-defined initialisation rules, an efficient determination of the firing order is all the more important. The algorithm uses two temporary stores, a stack and a set. The stack will contain all properties which are ready to fire, and in the set we store all properties that still have to be processed. First, the algorithm iterates over all properties and adds them to the set of unprocessed properties. Each of them has to resolve its references. The properties that are ready to fire from the start (e.g. because they contain constants) are pushed onto the stack. The following while loop pops a property from the stack (and removes it from the set of unprocessed properties as well). Then its initialisation rule is called.
As the value of the property is available thereafter, we can notify all dependent properties of this fact. If, during this notification, such a property reports that all its data is now available, it is pushed onto the stack. In the end, we check whether all properties were processed. If not, an exception is thrown, containing all the unprocessed properties; this information helps to locate the circular references.
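One possible implementation of the readiness bookkeeping uses an internal wait set and a list of dependents; the representation is our choice, the interface only fixes the method names:

// Sketch of a Property implementation's sorting-related methods
class SimpleProperty /* implements Property */ {
    // referred properties whose values are not yet computed
    private java.util.Set waitingFor = new java.util.HashSet();
    // reversed references: properties that depend on this one
    private java.util.List dependents = new java.util.ArrayList();

    public boolean isReadyToFire() {
        return waitingFor.isEmpty();
    }
    public void markReady(Object referred) {
        waitingFor.remove(referred);   // one referred value became available
    }
    public java.util.Iterator dependent() {
        return dependents.iterator();  // used to traverse the dataflow arrows
    }
}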
5.3.10 Symbol Table Implementation

Insertions into the symbol table must not destroy the contained information. As an example, we have seen two identifiers with the same name but in different scopes (Fig. 22, p. 64). The insertion of the inner symbol i may only shadow the outer declaration for the subtree rooted at node 4, but not for the rest of the nodes in the AST. This can be achieved by several implementations. The most unimaginative one would be to copy the data such that we have a symbol table for each node; this would be a considerable (if not unrealizable) memory overhead. A complex but more memory-efficient approach is the organisation of the symbol table as a binary tree structure. Consider, for example, the search tree in Fig. 42a, which represents the symbol table entries (Truck, 1), (Bike, 2) and (Car, 3) (e.g. type names and associated declaration nodes). We can add the entry (Bus, 5), creating a new symbol table rooted at node Bike in Fig. 42b, without destroying the old one. If we add a new node at depth d of the tree, we must create d new nodes, which is better than copying the whole tree. Using the Java class java.util.TreeMap, which is based on a red-black tree implementation, we can guarantee that d = O(log n), where n is the number of entries in the symbol table. This implementation is described in more detail in [App97].
Figure 42: Binary search trees implementing a symbol table.
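The non-destructive insertion can be sketched as follows (without the red-black rebalancing that a TreeMap-style implementation adds): only the nodes on the search path are copied, at most d of them, so the old root still represents the old symbol table. Class and field names are our own:

// Persistent binary search tree: insertion copies at most d nodes
class STNode {
    final String key; final Object decl;
    final STNode left, right;
    STNode(String key, Object decl, STNode left, STNode right) {
        this.key = key; this.decl = decl;
        this.left = left; this.right = right;
    }
}

static STNode insert(STNode root, String key, Object decl) {
    if (root == null) return new STNode(key, decl, null, null);
    int c = key.compareTo(root.key);
    if (c < 0) {
        return new STNode(root.key, root.decl, insert(root.left, key, decl), root.right);
    } else if (c > 0) {
        return new STNode(root.key, root.decl, root.left, insert(root.right, key, decl));
    } else {
        return new STNode(key, decl, root.left, root.right); // shadow the old binding
    }
}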
Chapter 6
Related Work
Programming language processing is one of the oldest disciplines in computer science, and therefore a wide variety of approaches and systems has been proposed or implemented. In this chapter we present a selection of them and compare them to our system. The first three sections cover closely related projects, which can be seen as competitors to our Montage Component System. Section 6.4 covers the traditional compiler construction approaches towards language specification. Section 6.5 briefly compares three different component models that are in widespread use and explains their pros and cons for our project. Finally, section 6.6 concludes this chapter with some remarks on two projects at our institute that influenced design decisions and the way MCS evolved.
6.1 Gem-Mex and XASM

The system most closely related to MCS is Gem-Mex, the Montage tool companion. Montages are specified graphically using the Gem editor [Anl] (a pleonasm, as Gem stands for Graphical Editor for Montages). Mex [Anl] (Montage Executable Generator) then transforms the Montage specifications into XASM [Anl00] code. XASM in turn is transformed into C code, which can be compiled to an executable format. All specifications in the Montages are given in terms of XASM rules. As already mentioned above, ASMs are the underlying model of computation in Montages. Gem is a simple editor with hardly any knowledge about the edited entities. All textual input remains unchecked, whereas very limited integrity checks are performed when editing in the graphical section occurs; e.g. there is only one static semantics frame allowed. A Montage specification is stored in an intermediate textual format for further processing. Mex simply transforms the different entities of the intermediate format into an ASM representation. Separating the editor from processing has the advantage that the tools are largely independent of each other: changes in one tool will affect the others only if the intermediate representation has to be adapted to reflect the change. On the other hand, the compiler has to reconstruct much of the information that was originally available in the editor, e.g. the nesting and connecting of non-terminals, lists and actions. The centrepiece of this system is the XASM (Extensible ASM) environment. XASM offers an improved module concept over the ASM standard [Gur94, Gur97]. The macro concept known from pure ASMs turns out not to be very useful when building larger systems: this simple text replacement mechanism does not provide encapsulation, information hiding or namespaces, properties essential to reuse and federated program development. The component concept of XASM addresses this issue. Component-basedness is achieved through «calling conventions» between ASMs (either as a sub-ASM or as a function, refer to [Anl00] for further details) and through extended declarations in the header of each component-ASM. These declarations do not only announce the import relations, but they also contain a list of the functions that are expected to be accessible from an importing component. XASM components may be stored in a library in compiled form. How does MCS differ from Gem-Mex/XASM?

• MCS is based on Java as a specification language instead of ASMs. Expressiveness and connectivity are therefore bound to Java's and JavaBeans' possibilities. The advantage of this approach is that the full power of the existing Java libraries is available from the beginning. In an ASM environment, there are only very few third-party libraries available. To circumvent this, one has to interface with C libraries; by doing so, the advantages of ASMs over Java (simplicity of the model of computation, implicit parallelism, formal background) are given up.

• Components are partially reconfigurable at run time. Reconfiguration of XASM components can only be achieved through recompilation.

• The Gem-Mex system uses Lex & Yacc to generate a parser for a specified language. It implements the horizontal partitioning model (see subsection 3.1.1 on p. 24). XASM components can therefore not be deployed on the level of Montages. They can, however, be called in the static or dynamic semantics in terms of library modules.
• Gem-Mex has a fixed, built-in traversal mechanism for the abstract syntax tree. This may force the user to implement artificial passes. MCS, on the other hand, uses a topological sort on the dependency graph of the attributes in a language specification; the user does not have to care about the order of execution.
6.2 Vanilla

The Vanilla language framework [DNW+00] supports programming language construction on the basis of components, as our MCS does. Vanilla's aim is very similar to ours, namely to support re-use of existing components in language design. The Vanilla team identified the same shortcomings in traditional compiler design and language frameworks as we did in chapters 1 and 3; not surprisingly, their motivation is almost identical to ours. Their interests also focus on domain specific languages, but no detailed papers on this topic could be found. The Vanilla framework is implemented in Java and thus uses Java's component model. In Vanilla, the entity of development is called a pod. A pod corresponds roughly to a Montage: it specifies e.g. type checking features, run-time behaviour and I/O. Component interaction in the Vanilla framework occurs on the level of pods. Pods in turn are built of a group of objects interacting on the method call level. The Vanilla team implemented some simple languages (Pascal, O-2) and gained some interesting experience from these test cases. They report that the degree of orthogonality between individual programming constructs was surprisingly high: they expected considerable overlap between components but discovered that in reality there are remarkably few constraints on how language features may be combined. This finding was very encouraging to us, as it marked a first (independent) confirmation of our own impression of language composition. If there are so many similarities, the following question arises: how does MCS differ from Vanilla? Vanilla is basically a collection of Java libraries that facilitate the generation of interpreter components. There is no graphical user interface, nor any model behind the language specification. Specifying a new pod is merely done by programming the desired behaviour in some Java classes. These classes may inherit from some Vanilla base classes, or they must follow a certain syntax in order to be processed by some Vanilla tools (e.g. the parser generator). The type-checking library contains some sophisticated classes that support free variables, substitution, sub-typing etc.
We therefore think that Vanilla is only suited for compiler construction professionals. Although re-use is encouraged to a high extent, one still must have a wide knowledge of type-checking techniques (as an example) to successfully make use of the library pods. The fact that there is no preferred model behind the Vanilla pods adds to the amount of knowledge and experience a user should have. This freedom will probably ask too much of a programmer untrained in the field of language design and implementation. In Montages, users have to follow a certain scheme (e.g. use control flow charts), but the intention is to simplify the model of Montages, and thus to simplify its use.
6.3 Intentional Programming

The pivot of modern compiler architecture is the intermediate representation. It is generated by the front-end, it is language independent, and it serves as input for the target-code-generating back-end. If well designed, the intermediate representation is one of the slowest evolving parts of a compiler suite. So why not use the intermediate representation for programming, instead of struggling with the pitfalls of concrete syntax? This is the core idea of Intentional Programming (IP¹) [Sim96, Sim99], a project at Microsoft Research. IP tries to unify programming languages on a level that is common to all of them: the intention behind language constructs. A loop can be written in many different ways using many different programming languages, e.g. in terms of a for or while statement in C++, a LOOP statement in Oberon or a combination of labels and GOTOs in Basic. They all share the same intention, namely to repeatedly execute a certain part of a program. In IP, a loop could still be represented in several different ways, but with no concrete syntax attached. Such an abstraction can be manipulated by the programmer directly by changing the abstract syntax tree. Charles Simonyi, project leader of IP, summarizes his vision of programming in the future as an “Ecology of Abstractions” [Sim96]. Abstractions are the information carriers of evolving ideas, comparable to the genes in biology. While the individuals of the biological ecology serve as “survival machines” for genes, programs and programming components are the carriers of abstractions. Programming will be the force behind this ecology. Market success, reusability, efficiency, simplicity, etc. will be some of the criteria for the selection process of the “survival of the fittest”.

1
Unfortunately, IP is becoming more and more of an overloaded term. We are aware that in computer science IP will usually be associated with “Internet Protocol” or sometimes with “Intellectual Property”. However, we use IP here in the same way as it is used in the referenced papers.
Concrete syntax in this environment can be compared to the personal settings of the desktop environment of a workstation: each programmer working with IP defines his own preferences for the transformation of abstractions into a human-readable form. IP claims to solve many problems with legacy code as well. There exist parsers for the most important (legacy) programming languages, which transform the source code into IP abstractions (basically an AST). Once this transformation is applied, the program can be manipulated, extended, debugged etc. completely within the IP system. There is no need to manipulate such a program with text editors any more. If required, a textual representation can be generated for any programming language. How does MCS differ from Intentional Programming? Code generation in IP is done by a conventional compiler back-end. This means that abstractions have to be transformed into a suitable representation called “R-code”. This reduction (as the transformation is referred to) is given in terms of IP itself. The MCS system uses Java in the place of R-code; the reduction (the static semantics rules in MCS) is given in terms of Java. Boot-strapping the MCS system using IP would be possible, whereas the R-code-centred architecture of IP would prevent boot-strapping IP using MCS. The intermediate language plays a less important role in MCS. The system would also be operational if some Montages were specified using Java and some using COBOL. This is because interoperability is specified on a higher level of abstraction, namely on the component level instead of the machine code level. The architecture of MCS does not preclude a multi-language, multi-paradigm approach; the only requirement is that Montage components can communicate with each other using some component system (see also section 6.5). Programmers of legacy languages can still use their knowledge to specify Montages. MCS is a tool allowing a smooth shift from language-based coding to IP: IP is an option in MCS (although an important one) and not a handicap. In contrast, the IP approach means that, when existing legacy programs have to be embedded into the IP framework, the legacy language has to be specified first, and then a legacy parser has to be adapted to produce an appropriate IP AST instead of its conventional AST. In other words, there has to be a converter for each legacy language whose code we want to reuse. This approach invalidates much of the legacy programmer's knowledge: after the conversion to IP, his former language skills are no longer needed. Even worse, he may not be well enough educated to cope with the paradigm shift from concrete language-based programming to the more abstract tree-based programming. The acceptance of the IP approach will thus be limited to organisations that can afford the radical paradigm shift.
6.4 Compiler-Construction Tools

Although compiler construction is well understood and widely practised, there are surprisingly few tools and systems in common use. Two of the most popular ones are Lex [Les75] and Yacc [Joh75]. They were constantly improved, and their names stand as a synonym for front-end generators. Many other tools never attracted such a large community. This is mainly due to the steep learning curve associated with the large number of different syntaxes, conventions and options to control them. The following subsections therefore do not present isolated tools, but compiler construction systems, which (more or less) smoothly integrate the tools to obtain better interoperability.
6.4.1 Lex & Yacc

Hardly any text on programming language processing can ignore Lex and Yacc. Their success is based on the coupling of the two tools: Lex was designed to produce lexical analysers that could be used with Yacc. Although they were not the first tools of their kind², bundling them with Unix System V certainly helped to make them known to many programmers. The derivative JLex is based upon the Lex analyser generator model. It takes a specification similar to that accepted by Lex and creates a Java source file for the corresponding lexical analyser. MCS' lexical analyser is in fact a subset of JLex [Ber97]. Instead of generating Java code, it builds a finite automaton in memory and uses it for scanning the input stream. The regular expressions accepted for specifying tokens are identical to those of Lex. Yacc is an LALR [ASU86] parser generator. It is fed with a parser specification file and creates a C program (y.tab.c). This file represents an LALR parser plus some additional C functions that the user may have added to the specification. These functions may support e.g. tree building, or execute simple actions when a certain production has been recognized. The input file format of both Lex and Yacc has basically the following form:

declarations
%%
translation rules
%%
auxiliary C functions
2
As the name of Yacc (Yet another compiler compiler) implies.
Declarations are either plain C declarations (e.g. temporary variables) or specifications of tokens (Yacc) and regular definitions (Lex), respectively. The declared items may be used in the subsequent parts of the specification. The translation rules basically associate a regular expression or production rule with some action to be executed as soon as the expression is scanned or the production has been parsed, respectively. Actions are also given in terms of C code. The third part contains additional C functions to be called from the actions. In a Yacc specification, the third part will also contain an #include directive which links Lex's tokens to Yacc's parser. Lex and Yacc generate monolithic parsers, i.e. they support horizontal modularization. Their support for compiler construction is limited to the first two phases (lexical and syntax analysis, see Fig. 1, p. 2); semantic analysis and code generation have to be coded by hand. Providing semantics in the actions of the rules, however, is only feasible for simple languages without complex type systems. The generation of source code binds the compiler implementor to C, and even more to Unix, where these tools are available on almost every machine. On the other hand, the problem of a steep learning curve is somewhat mitigated by this approach. There are only two new notations to be learned: regular expressions and BNF-style productions. Both are so fundamental to compiler construction and computer science education in general that they can hardly be blamed for steepening the learning curve.
6.4.2 Java CC

The Java community soon started to develop its own compiler construction tools. JLex [Ber97], JFlex [Kle99] and CUP [Hud96] are Java counterparts to Lex, Flex and Yacc, respectively. At Sun Labs, a group of developers followed a different approach called Java CC³. The Java CC utility takes a contrary route: instead of decorating the specification with C or Java code, Java itself is extended with some constructs to declare tokens. The syntax is not specified in terms of BNF or EBNF rules; instead, Java CC uses the new method modifier production to mark a Java method as a production. Java CC is merely a preprocessor that transforms this extended Java syntax into pure Java. Information about the lexical analyser and the syntax is extracted from the Java CC input file and distilled into an additional method that starts and controls the parsing process.

3
This group of developers founded Metamata, a company specializing in debugging and productivity tools around Java. Their freeware tool Metamata Parse is the successor of Java CC.
Java CC generates top-down LL(k) parsers; LALR grammars have to be transformed first. The top-down approach simplifies the parsing algorithm in such a way that the non-terminals on the right-hand side of EBNF rules represent calls to corresponding methods. Deciding which rule to follow in the case of a synonym rule does not have to be implemented by the user, since Java CC adds the necessary code. Java CC's support for compiler construction is also limited to the scanning and parsing phases. The philosophy behind this tool differs from Lex/Yacc in the way specifications are given: Java CC directly exploits the skills and experience of Java programmers. The learning curve is minimal, as the user is confronted with even fewer new constructs than in Lex/Yacc. The notation for regular expressions stays very close to Java's syntax, and the top-down approach to parsing is easier to understand for a compiler novice than the table-driven LALR systems. The obviously strong ties to Java restrict the deployment of Java CC to Java environments. As virtual machines exist for any combination of major hardware platforms and operating systems, this is only a restriction in terms of the usability of the generated code.
6.4.3 Cocktail

In [GE90], Grosch and Emmelmann present «A Tool Box for Compiler Construction», which is also known as «Cocktail» [Gro94]. It contains tools for most of the important compiler phases, including support for attribute grammars and back-end generation. Cocktail is a collection of largely independent tools, resulting in a large degree of freedom in compiler design. Each of these tools features a specification language which may contain additional target language code (Modula-2 or C). The implementors are aware of the fact that such target code makes it impossible for the tools to perform certain consistency checks (e.g. Ag, the attribute evaluator generator, cannot guarantee that attribute evaluations are side-effect free). Nevertheless, they argue that the advantages outweigh this disadvantage: practical usability, e.g. interfacing other tools, and a flatter learning curve, as e.g. conditions and actions can be provided in a standard language.
6.4.4 Eli

The Eli system [GHL+92] is, like Cocktail, basically a suite of compiler construction tools. There is an important difference, however: Eli provides a smooth integration of these tools. Integration is achieved by an expert system, called Odin [CO90], that helps the user to cope with all these tools.
One does not have to care about matching the output of one tool to the input of another. Thus, tools developed by different people with different conventions can be combined into an integrated system. This integration works even if the tools are only available in an executable format. To add a new tool to the Eli system, only the knowledge base of the expert system has to be changed. The knowledge base manages tool dependencies, data transformation between tools, and complex user requests. Dependencies are represented by a derivation graph; a node in this graph represents a manufacturing step: the process of applying a particular tool to a particular set of inputs and creating a set of outputs. A normal user of Eli does not have to deal with the knowledge base; it is updated by the programmers of the tools when they add or remove them. The major goal of the Eli project is to reduce the cost of producing compilers. The preferred way to construct a compiler is to decompose this task into a series of subproblems; to each of these subproblems a specialized tool is then applied. Each of these tools operates on its own specialized language. Eli uses declarative specifications instead of algorithmic code: the user describes the nature of a problem instead of giving a solution method, and the application of solutions to the problem is performed by the tool. The aim is to relieve the user as much as possible from the burden of dealing with the tools; he should be able to concentrate on the specification of the compiler. Eli's answer to the steep learning curve is somewhat ambivalent. On the one hand, the expert system relieves the user from fiddling around with formats, options and conventions. On the other hand, mastering Eli also means mastering many specialized descriptive languages. And what seems very convincing at first glance may be a major hurdle to the use of this tool suite: many programmers are educated in operational languages only and therefore have difficulties mastering the new paradigms associated with declarative specifications.
6.4.5 Depot4
Depot4 [Lam98] is a system for language translation, i.e. an input language is translated into an output language. There are no restrictions on these languages other than that they must be representable as object streams. This idea was influenced by the Oberon system, the original implementation platform for Depot4 (Depot4 was later ported to Java due to the better availability of Java Virtual Machine platforms). Texts in Oberon can easily be extended to object streams, thus allowing Depot4 to act as a fairly general stream translation/conversion system.
However, Depot4 is designed as an application generator rather than a traditional compiler. This means that programming in the large (i.e. assembling different modules and specifying operations between them by providing a DSL) is the preferred use. Although not impossible, machine instruction generation would be a hard task, as supporting mechanisms and tools are missing. EBNF rules can be annotated with operations given in a metalanguage (Ml4). Nonterminals on the right-hand side of each EBNF rule are treated as procedure calls to their corresponding rules. This approach implies a predictive parser algorithm; of course, grammars have to be free of left recursion. Ml4 provides the programmer with some predefined variables which keep track of the repetition count, whether an option exists, or which alternative was chosen in a synonym rule. Like MCS, Depot4 addresses the (occasional) implementor of DSLs, who does not have the extensive experience of a compiler constructor. It aims to support fast and easy creation of language translators, without trying to compete with full-blown compiler construction suites. Depot4's similarities to MCS:
• EBNF is used for syntax specification.
• The system's parser is vertically partitioned; the concept of modularizing the specification is the same.
• Language specifications are precompiled and loaded dynamically on demand.
Depot4 does not support:
• symbol table management
• semantic analysis
• intermediate code or machine instruction generation
6.4.6 Sprint
Sprint is a methodology [CM98, TC97] for designing and implementing DSLs, developed in the Compose project at IRISA, Rennes. Sprint's formal framework is based on denotational semantics. In contrast to the approaches presented above, it does not so much feature specialized tool support, but rather sketches how to approach the development of a DSL. Following the Sprint methodology, DSL development undergoes several phases:
Language analysis: Given a problem family, the first step is an analysis of the commonalities and the variations of the intended language. This analysis should identify and describe the objects and operations needed to express solutions to the problem family.
Interface definitions: In the next phase, the design elements of the language are refined, the syntax is defined, and an informal semantics is developed. This semantics relates syntax elements to the objects and operations identified in the previous step. Another aspect of this phase is to form the signatures of the semantic algebras by formalising the domain of objects and the types of operations.
Staged semantics: During this phase, static semantics is separated from dynamic semantics.
Formal definition: The semantics of the syntactic constructs is defined in terms of valuation functions. They describe how the operations of the semantic algebras are combined.
Abstract machine: The dynamic semantic algebras are then grouped to form an abstract machine which models the behaviour of the DSL. Denotational semantics provides an interpretation of the DSL in terms of this abstract machine.
Implementation: The abstract machine is then implemented, typically by using libraries. The valuation functions can either be implemented as an interpreter running on the abstract machine or as a compiler generating abstract machine instructions.
Partial evaluation: To automatically transform a DSL program into a compiled version (given an interpreter), partial evaluators can be applied.
The Sprint framework does not need specific tool support, as it relies on proven techniques. As the above phases would also apply to a general purpose language, tools supporting these can be employed. Techniques to derive implementations from definitions in denotational semantics can be adapted to the Sprint approach. For the implementation, standard software libraries and available partial evaluators are used. This form of reuse shortens the development time of a new DSL considerably. In contrast to our approach, reuse is undertaken from a global view, i.e. reuse is employed only after all the parts of the DSL are specified (in the Implementation step). At this late point, reuse involves an implicit analysis phase, as the implementor first has to find appropriate libraries or implementations of similar abstract machines. For in-house development, this approach will meet its expectations, whereas it would be difficult to share implementations between different organisations. Thus, it would hardly be suitable for an open development environment as discussed in chapter 2. This is due to the fact that the formal definition does not describe the behaviour of what is reused later on (the software library). In our approach, the entity of reuse is also the entity of specification. This increases the confidence of a client of a Montage
in its semantics, whereas there is no such direct link between the semantics of a valuation function and a library function, for example.
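To make the notion of a valuation function more tangible, the following Java sketch – entirely our own illustration with invented names, since Sprint itself works with denotational definitions rather than Java – shows how the valuation of a toy expression language composes the operations of an abstract machine:

    // Invented illustration of the Sprint scheme: syntax is mapped by a
    // valuation function onto operations of an abstract machine (a semantic
    // algebra). None of these names come from Sprint itself.
    interface Machine {                      // the abstract machine
        void push(int n);
        void add();
    }

    class StackMachine implements Machine {  // one possible implementation
        private final java.util.Deque<Integer> stack = new java.util.ArrayDeque<>();
        public void push(int n) { stack.push(n); }
        public void add() { stack.push(stack.pop() + stack.pop()); }
        public int result() { return stack.peek(); }
    }

    abstract class Expr {                    // abstract syntax
        abstract void valuate(Machine m);    // the valuation function V[[.]]
    }

    class Num extends Expr {
        private final int n;
        Num(int n) { this.n = n; }
        void valuate(Machine m) { m.push(n); }   // V[[n]] = push n
    }

    class Plus extends Expr {
        private final Expr left, right;
        Plus(Expr l, Expr r) { left = l; right = r; }
        // V[[e1 + e2]] = V[[e1]] ; V[[e2]] ; add -- composing machine operations
        void valuate(Machine m) { left.valuate(m); right.valuate(m); m.add(); }
    }

    class Demo {
        public static void main(String[] args) {
            StackMachine m = new StackMachine();
            new Plus(new Num(1), new Plus(new Num(2), new Num(3))).valuate(m);
            System.out.println(m.result());  // prints 6
        }
    }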
6.5 Component Systems
After the success of object-oriented software development in the 1980s, the advent of component systems began in the 1990s. Among the many component systems proposed, only a few succeeded in software markets. As we have pointed out in chapter 2, components vitally rely on a market to be successful. Thus we focus on the three most wide-spread component systems. We will give a very short overview of each in order to be able to discuss it with respect to MCS. Detailed comparisons of the three systems can be found in [JGJ97, Szy97]. All three systems provide a kind of 'wiring standard' in that they standardize how components interact with each other. Each has its own background and market.
6.5.1 CORBA
Overview. CORBA (Common Object Request Broker Architecture) was developed by the Object Management Group (OMG), a large consortium with more than 700 member companies. It mainly focused on enterprise computing. This background is also reflected in the architecture of the system: it is the most inefficient and complex approach in our comparison. But being independent of a large software house (in contrast to Sun's JavaBeans or Microsoft's COM) also has advantages. To name two: (i) a wide variety of hardware and software platforms is supported, and (ii) interface definitions are more stable, as many partners have to agree on a change of standards. CORBA was developed to provide interaction between software written in different languages and running on different systems. Its architecture is shown in figure 43. The basic idea was to provide a common interface description language (IDL) for specifying the interface that a piece of software provides. Compilers for this IDL generate object adapters (called stubs) that convert data (in this case the identifiers, parameters and return values of procedure calls) to be understood by an object request broker (ORB). It is this broker's responsibility to redirect invocation requests to corresponding objects (which provide methods that can process the requests). So basically an ORB can be seen as a centralised method invocation service. As such it can provide additional services like an object trader service, an object transaction service, an event notification service, licensing services and many more (standardised by the Object Management Architecture, OMA).
Figure 43: CORBA architecture and services model (CORBAfacilities not shown)
CORBA as a platform for MCS. The central role of an ORB gives the user the most explicit view of the communication network of a system. Specific Montage services could be implemented as OMA services, thus simplifying the architecture of MCS. For example, the object trader service, which offers search facilities for services, could be extended to find suitable Montages. Additional services could be implemented, such as parsing and scanning services, which could provide specialised parsing techniques (LL, LALR); a table management service could implement centralized table management for a language. CORBA was not chosen for a prototype implementation because of its complexity. Setting up and running ORBs is non-trivial, and additional services often undergo tedious standardisation processes. We think that the MCS architecture is too fine-grained to be used efficiently in a CORBA environment. On the one hand, MCS builds a dense web of many simple components with only limited functionality (e.g. the parse tree nodes or the Montage properties and their firing rules). On the other hand, CORBA was designed to provide distributed services over networks, and thus network delays may slow down the system considerably. However, the independence of implementation and platform is a major advantage of CORBA.
If MCS is commercially deployed, it would be worth watching the progress of CORBA. The standard Java libraries provide CORBA-supporting classes. With improved hardware and software implementations of CORBA, it might become interesting to provide interfaces to this component system.
6.5.2 COM
Overview. COM (Component Object Model) is Microsoft's standard for interfacing software. It is a binary standard, which means it does not support portability (although a Macintosh port exists, which emulates the Intel conventions on Motorola's PowerPC). With the success of the Windows OS and the wide variety of software available, the need for inter-application communication increased. The main goal was to provide a system that allows applications written in different languages to communicate efficiently with each other. A binary standard for interfaces was established. COM does not specify what an object or a component is; it merely defines an interface that a piece of software might support. COM allows a component to have several interfaces, and an interface may be shared by different components. Interfaces are represented by COM objects. Fig. 44 shows a COM object Pets, featuring the interfaces IUnknown, IDog and ICat (this is just a simple example). The IUnknown interface is mandatory and serves to identify the object. Every interface has a first method called QueryInterface; its purpose is to enable introspection. The next two methods shown, AddRef and Release, support reference counting. The last method (Bark or Meow) stands for the additional methods that a component might export. COM was designed to work with different programming languages, most of which support neither introspection nor garbage collection. These omissions in the languages have to be made up for by forcing the programmer into rigid coding schemes (e.g. exact rules for when the reference counting methods have to be called), which is also reflected in every interface specification of a COM component (see Fig. 44).
COM as a platform for MCS. COM's binary interface philosophy makes it very unattractive for heterogeneous environments like the internet. As Microsoft does not have to consider a wide variety of interests, hardware and software platforms, changes and updates occur more often than in the other systems. ActiveX controls (much simplified: a collection of predefined COM interfaces) have undergone many updates in recent years. This might lead to (forced) updates in our prototype implementation, which we wanted to avoid. However, Microsoft's COM offers the fastest component wiring by far. If MCS were commercialised in response to client demand, this should be considered. The loss of compatibility could be made up by bridging ActiveX to JavaBeans.
Figure 44: A COM object and its internal structure. The implementation of the interfaces is very similar to the virtual method tables of C++.
This would allow Java versions to still run with an ActiveX implementation, although at a lower speed due to conversion and Java's slower execution. As this would penalize non-Windows clients (and probably scare them off), it is questionable whether a double implementation is worth the effort.
6.5.3 JavaBeans
Overview. JavaBeans is in many ways the most modern of the three component systems. Its main market is internet applications. Based on Sun's Java language, JavaBeans provides component interaction by following coding standards. Any Java object is potentially a component (a JavaBean) if it follows some naming conventions for its methods. These conventions and the packaging of such components into a compressed file format (the so-called .jar format) are basically all there is behind JavaBeans. JavaBeans profits from the youthfulness of its base language Java. Many features of Java support the safety and ease of use of JavaBeans, as the following examples illustrate: (i) the automatic garbage collection prevents memory leaks in JavaBeans, whereas in COM an error-prone reference counting scheme has to be followed; (ii) interfaces are part of the Java language, while in CORBA and COM they have to be implemented following coding guidelines; (iii) the virtual machine representation of Java objects and classes allows introspection of their exported features at runtime, without additional implementation overhead. What makes JavaBeans attractive might also be its major drawback: the tight coupling of the component system with the programming language. JavaBeans is not suited to integrating legacy code into a component framework
(CORBA and COM both do better here). Portability is only supported through the Java Virtual Machine (JVM), which means that the implementation language has to be Java. CORBA and COM do not restrict the programming language; CORBA is also independent of the executing hardware platform. And last but not least, when it comes to execution speed, JavaBeans systems are much slower than COM implementations.
JavaBeans as a platform for MCS. We chose JavaBeans as our implementation platform. In our case, the advantages outweighed the disadvantages. The safety of Java's garbage-collected memory model, type system and language syntax were far more valuable to the development of MCS than speed, which was not a major concern. We believe that also for a commercial version of MCS, Java would be a good implementation platform, for the following reasons:
• Java neatly integrates with the internet (running as an applet).
• Java offers the best language support (safety of implementation).
• Legacy code is usually not a concern when developing new programming languages.
• Portability is cheap: (i) it is "built-in" (on the basis of the Java Virtual Machine), in contrast to COM, which relies on Intel hardware, and (ii) it does not require generating stubs for various platforms (as in CORBA).
• Speed will be a decreasingly urgent problem, thanks to better JIT compilers.
However, one drawback remains: Java is the only implementation language available for the JavaBeans component framework. If this should become a hindrance, other component systems should be investigated carefully (see the corresponding sections above).
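The naming conventions mentioned in the overview are easily illustrated. The following minimal bean, with invented class and property names, shows the usual ingredients: a public no-argument constructor, get/set accessor pairs that define a property, serializability, and the standard java.beans helper class for notifying listeners of a bound property.

    import java.beans.PropertyChangeListener;
    import java.beans.PropertyChangeSupport;
    import java.io.Serializable;

    public class CounterBean implements Serializable {
        private int value;
        private final PropertyChangeSupport changes = new PropertyChangeSupport(this);

        public CounterBean() { }                 // required no-argument constructor

        public int getValue() { return value; }  // read accessor: get<Property>

        public void setValue(int newValue) {     // write accessor: set<Property>
            int old = value;
            value = newValue;
            // a "bound" property: registered listeners are notified of changes
            changes.firePropertyChange("value", old, newValue);
        }

        public void addPropertyChangeListener(PropertyChangeListener l) {
            changes.addPropertyChangeListener(l);
        }

        public void removePropertyChangeListener(PropertyChangeListener l) {
            changes.removePropertyChangeListener(l);
        }
    }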
6.6 On the history of the Montage Component System
The Montage Component System continues a long tradition of research in software engineering at our institute, the Computer Engineering and Networks Laboratory at ETH Zürich. Although MCS differs in content from the other projects – namely GIPSY, PESCA and CIP – its concept, design, and implementation were greatly influenced by them.
GIPSY. The GIPSY approach to software development [Mar94, Mur97, Sche98] widens the narrow focus on supporting tools. So-called Integrated Programming Support Environments (IPSE) go beyond supporting the programming task (editing, compiling, linking). Software development is embedded in
the other business processes such as knowledge management, customer support or even human resources management [MS98]. Marti [Mar94] describes GIPSY in his thesis as a system that manages document flow in software development systems. Integrated into the system is GIPSY/L, a domain specific language for describing document flows. Formal language definitions written in GIPSY/L are used to specify the documents' properties. Such definitions combine the documents' syntactic structure and semantic properties. From these specifications GIPSY generates attributed syntax trees. Using extensible attribute grammars [MM92], language specifications are extensible in an object-oriented way, i.e. specifications may inherit properties and behaviour from their base specifications. The algorithm used to evaluate attributed syntax trees is the same as in our system: the partial order of the dependency graph spanned by the attributes in the tree is topologically sorted before evaluation. The influence of GIPSY on the Montage Component System was the understanding that our system cannot be seen as an independent piece of code. Using MCS will have consequences not only for programmers, but will also have an impact on how language components are distributed, deployed and maintained. For detailed reflections on this topic see chapter 2.
PESCA. The project on a Programming Environment for Safety Critical Applications [Schw97, SD97] investigated automated proving of code-to-specification correspondence. Safety-critical applications were first specified formally and then programmed, using a very restricted (primitive recursive) subset of the programming language Oberon [Schw96]. An automated transformation of this program into a formal specification could then be validated against the original specification. However, experience showed that this approach would be difficult to scale up to handle real-world applications: programs were restricted to primitive recursiveness, and the proof of correspondence between the two specifications was only tractable for small programs. MCS has a more prototypical character and does not focus on languages for safety-critical applications, but still some lessons were learned from PESCA. The algebraic specifications used in PESCA have a steep learning curve; they appear very abstract to a programmer used to imperative programming. Using Abstract State Machines or even Java seemed to be much closer to the programmer's understanding. As ease of use and simplicity were our goals, a pragmatic approach was chosen for MCS: using Java to provide the operational semantics of language constructs.
CIP. Using the CIP method [FMN93, Fie94, Fie99], the functional behaviour of a system is specified by an operational model of cooperating extended state machines.
Formally, a CIP system is defined as the product of all state machines it comprises, corresponding to a single state machine in a multi-dimensional state space. CIP had a great influence on the implementation design of MCS. It taught us to concentrate on the basic functionality. CIP features a robust model of computation which forces its users into a rigid but well-understood development scheme. The rationale behind this is that it is rewarding to trade some of the developer's freedom for more productivity, reliability and clarity. MCS is based on this conviction too.
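As an aside, the attribute evaluation scheme shared by GIPSY and MCS can be sketched in a few lines of Java. The Attribute and AttributeScheduler types below are invented for this illustration and are not the actual GIPSY or MCS code; the scheduler assumes the given collection contains every attribute of the tree and applies Kahn's algorithm to obtain an evaluation order respecting the dependency graph.

    import java.util.*;

    class Attribute {
        final String name;
        final List<Attribute> dependsOn = new ArrayList<>();
        Attribute(String name) { this.name = name; }
    }

    class AttributeScheduler {
        // Topologically sort the dependency graph: an attribute may be
        // evaluated once all attributes it depends on have been evaluated.
        static List<Attribute> evaluationOrder(Collection<Attribute> attrs) {
            Map<Attribute, Integer> indegree = new HashMap<>();
            Map<Attribute, List<Attribute>> dependents = new HashMap<>();
            for (Attribute a : attrs) {
                indegree.putIfAbsent(a, 0);
                for (Attribute d : a.dependsOn) {
                    dependents.computeIfAbsent(d, k -> new ArrayList<>()).add(a);
                    indegree.merge(a, 1, Integer::sum);
                }
            }
            Deque<Attribute> ready = new ArrayDeque<>();
            for (Map.Entry<Attribute, Integer> e : indegree.entrySet())
                if (e.getValue() == 0) ready.add(e.getKey());
            List<Attribute> order = new ArrayList<>();
            while (!ready.isEmpty()) {
                Attribute a = ready.remove();
                order.add(a);                 // all dependencies of a are evaluated
                for (Attribute b : dependents.getOrDefault(a, Collections.emptyList()))
                    if (indegree.merge(b, -1, Integer::sum) == 0) ready.add(b);
            }
            if (order.size() != indegree.size())
                throw new IllegalStateException("cyclic attribute dependencies");
            return order;
        }
    }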
Chapter 7
Concluding Remarks
The work on the Montage Component System revealed interesting insights into language processing from an unusual point of view: by emphasizing the compositionality of a language, we approached language specification from a different angle than usual. We gained experience in the way such systems can be specified and built, and based on this experience we will give some ideas and proposals for improving Montages and the MCS. We hope that our reflections on Montages and their market context, as well as our ideas for improvements, will be helpful in the planned project of commercializing Montages.
7.1 What was achieved
We described a system for composing language specifications on a construct-by-construct basis. The overall structure of the system differs considerably from conventional approaches to language specification or compiler construction, as it is modularized along the language constructs and not along the usual compiler phases. We explained how such a partitioning can be realized and used in language design. Deployment of (partial) language specifications on a component market has been investigated, and the system was put into the wider context of development, support, distribution, and marketing of language components. The main fields of deployment we foresee for a system like ours are domain specific language development and education, since both can profit from the modularity of the system. Additions to a language can be specified locally in one Montage, and encapsulation prevents unwanted side effects in unaltered parts of the specification.
In contrast to the original Montages [AKP97, Kut01], we use Java instead of ASM as our specification language. The modularisation of our system is more rigid than in the ASM approach: global variables and functions are not allowed in our system. Therefore, precompiled Montage specifications are easier to reuse, because fewer dependencies have to be considered. Java as a specification language allows us to fall back on a vast variety of libraries. Using Java also implies that our specifications are typed, which is not the case with ASM specifications. Whether this is an advantage or not is probably a question of personal preference. On the one hand, strong typing can prevent errors; on the other hand, many fast-prototyping tools renounce typing because this gives the developer more freedom in how to use the system.
7.2 Rough Edges
We found what we consider to be some weak points in the Montages approach and try to sketch ideas for improvement. It is important to understand that the presented ideas are just proposals. We do not think that these problems can be solved easily, since they mostly concern the Montage notation. Changes to Montages should be undertaken carefully and based on feedback from as many users as possible. (This is a lesson learned from programming language design: take the introduction of inner classes in Java as an example of a hasty enhancement of a language, undertaken by a language designer and much criticised by programmers and experts in academia.) In each of the following sections we describe an open problem in Montages and try to sketch ideas for its solution, or at least give some arguments to start a discussion.
7.2.1 Neglected Parsing
Montages provide no means of control over the parsing of a program. We think that this omission is an obstacle to the deployment of the Montages approach. Most grammars that need additional control over the parsing process are not context-free grammars; in other words, the parsing of their programs relies on context information. In many textbooks on compiler construction, context-sensitive grammars are disparaged, and rewriting techniques are described to make them context-free [e.g. in App97]. However, this point of view reflects only the language designer's arguments and ignores the needs of the users of the language.
Often, users of a DSL are specialists in their own field, which has nothing to do with programming language theory. They use notations that are – in some cases – hundreds of years old and well accepted. Examples are mathematics, physics, chemistry and music. Mathematicians have their own rich set of notations at their disposal, which can (more or less) easily be typed on a standard keyboard. Consider an input language for a symbolic computation system as a typical DSL used by mathematicians. To enter a multiplication, it would be much more comfortable to use the common juxtaposition of variable names, ab, instead of the unusual asterisk, a*b, which is never used in printed or hand-written formulas. Such a language can only be implemented if the parser has access to all additional knowledge, i.e. access to context information [Mis97]. If Montages are to be successful in the long term, some means of controlling the parser have to be offered.
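The following Java sketch, with an invented symbol table and a deliberately naive longest-match strategy, illustrates why juxtaposition needs context: the same character sequence is split into factors differently depending on which identifiers are declared.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Set;

    public class Juxtaposition {
        // Split a token like "ab" into declared variable names, preferring
        // the longest declared prefix at each position.
        static List<String> factors(String token, Set<String> declared) {
            List<String> result = new ArrayList<>();
            int pos = 0;
            while (pos < token.length()) {
                int end = token.length();
                while (end > pos && !declared.contains(token.substring(pos, end)))
                    end--;
                if (end == pos)
                    throw new IllegalArgumentException("unknown identifier in " + token);
                result.add(token.substring(pos, end));
                pos = end;
            }
            return result;
        }

        public static void main(String[] args) {
            Set<String> declared = Set.of("a", "b", "ab");
            System.out.println(factors("ab", declared)); // [ab]   -- one variable
            System.out.println(factors("ba", declared)); // [b, a] -- i.e. b*a
        }
    }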
7.2.2 Correspondence between Concrete and Abstract Syntax
One of the trickiest parts of MCS is the correspondence between the given EBNF rule and the control flow graph or, generally speaking, the mapping of the concrete syntax onto the abstract syntax. The implemented solution is straightforward: each nonterminal in the control flow graph has to be assigned to a nonterminal in the EBNF. This approach is not satisfactory where list processing is concerned; more precisely, it fails to model special conditions on the occurrence of a nonterminal in a list. Fig. 45 shows an example of an expression list as it is defined for argument passing in many languages. Note that this kind of syntax specification is not available yet.
ExprList ::= Expression { "," Expression }.
Figure 45: List of expressions showing a clash in structures between control flow (a LIST+ of Expression) and EBNF (an Expression followed by a LIST* of Expression).
In a concrete syntax, entities often have to be textually separated. In the given example, at least one expression has to occur; subsequent expressions (if present) have to be separated by a comma. Abstract syntax does not need such separators, i.e. they are purely syntactic. In addition, the LIST object contains properties for both the minimum and the maximum number of expressions. Thus, the abstract syntax definition of the ExprList Montage can be much more compact and concise than the EBNF rule. The comparison of the two structures at the bottom of Fig. 45 illustrates this: the structure of the abstract syntax does not reflect the structure of the concrete syntax. In order to simplify reuse of Montages, such clashes should be allowed. If the abstract syntax depends on a concrete syntax, then a Montage is not very attractive for reuse. In most reuse cases, one wants to keep the specification of the behaviour but change the concrete appearance of a language construct. For the above example, solutions may seem straightforward, but in more complex situations it is difficult to give a general rule for mapping EBNF occurrences of nonterminals to control flow objects. As a more complex example, consider an If-Statement (Fig. 46) as it appears in almost every language. It exemplifies some of the open questions:
Figure 46: Abstract syntax of an if construct with conditional and default blocks
Can this control flow graph be the mapping of the following EBNF rule?
If ::= “IF” Expression “THEN” { Statement {“;” Statement} } { “ELSIF” Expression “THEN” { Statement {“;” Statement} } } [ “ELSE” { Statement {“;” Statement} } ] “END”.
This rule could be formulated more elegantly by introducing a new Montage for the statement sequences:
If ::= “IF” Expression “THEN” StatSeq { “ELSIF” Expression “THEN” StatSeq } [ “ELSE” StatSeq ] “END”.
StatSeq ::= Statement { “;” Statement }.
Unfortunately, this adds to the complexity of the language, as there is now an additional Montage which does not contain any semantics but is purely syntactical. Ongoing work should investigate whether the introduction of macros would provide a solution that is sufficiently flexible. Lists as in statement sequences or in an expression list are easy to detect and to replace, but does this also apply to the if statement itself? It contains a list with more than one nonterminal (an expression and a corresponding list of statements). Why should the default case (the “ELSE” clause) be modelled with an extra list of statements? If lists were made more powerful, this statement sequence could be incorporated as a special case (no expression, at the end) into the other list. General mapping rules from concrete to abstract syntax should be investigated. The application of such rules can be found in the compiler construction literature, but not the reverse problem: which rule to apply for a translation between a given concrete and a given abstract syntax. Sophisticated pattern matching, possibly combined with a rule database, should be investigated here.
7.2.3 BNF or EBNF?
One way to circumvent the mapping problems described above would be to ban list processing from Montages. In order to do this properly, we propose to use plain Backus-Naur Form (BNF) instead of its extended version (EBNF) for specifying syntax. As BNF grammars are better suited for bottom-up parsers, we also suggest introducing some means of controlling the parser from Montages (refer to section 7.2.1 for the motives). BNF was extended to EBNF by introducing repetitions, i.e. groups, options and lists, represented by parentheses “( )”, brackets “[ ]”, and braces “{ }” respectively. In addition, alternatives can be expressed by enumerating them separated by a vertical line. EBNF allows a much denser representation of a grammar than BNF, as repetitions turn out to be a powerful notational aid. This means that the number of Montages specifying a language can be reduced, and thus the language specification is more compact. Yet using EBNF instead of BNF introduces some problems to Montages. While synonym rules display alternative productions extremely well, repetitions can be a nuisance for the language designer.
Figure 47: Add Montage with EBNF.
Syntax: Add ::= Term {AddOp Term}. AddOp = "+" | "-".
Properties: value : Integer, initialised with return new Integer(); op : Integer, initialised with return new Integer(AddOp == "+" ? 1 : -1);
Actions: @setValue: value = Term~1.value; @add: value = value + op*Term~2.value
Apart from the mapping problems discussed in the previous section, we have to distinguish between the presence and absence of repetitions. Consider the simple specification of a variable declaration in Fig. 23, p. 66: the optional initialisation of the variable requires a tedious distinction between the presence and absence of an initialisation. In some cases, the presence of a repetition complicates a Montage. Often, it is not obvious to a novice user how the initialisation of properties or the execution of actions works; some background knowledge of how repetitions are managed is necessary. Let us exemplify this with the Montage in Fig. 47. Although the syntax rule is very compact, the control flow graph is far from it. By only looking at the graph it would be hard to deduce the meaning of this Montage. Moreover, in contrast to the declaration Montage in Fig. 23, it is not even necessary to distinguish between the presence and absence of the list here! An alternative representation using BNF rules can be found in Fig. 48. Its advantages are that the control flow graphs are much easier to understand and repetitions are banned. BNF is (in contrast to EBNF) known to a wider community of programmers and language designers.
Figure 48: Add Montage with BNF using a left-recursive grammar.
Syntax: Add: Term | Add AddOp Term. AddOp: one of "+" "-".
Properties: value : Integer, initialised with return new Integer(); op : Integer, initialised with return new Integer(AddOp == "+" ? 1 : -1);
Actions: @setValue: value = Term.value; @add: value = Add.value + op*Term.value
Of course, it is very easy to learn EBNF; its definition was given by Wirth in [Wir77b] on just half a page. Yet BNF has the advantage (which should not be underestimated) that many publications about languages use BNF. This means that the specifications given in many books and articles will be easier to enter in Montages, as the original language rules do not have to be rewritten first.
7.3 Conclusions and Outlook
The problems described in the previous section are inherent in Montages and have their origin in the overspecification of the syntax tree. We consider the improvement of the parsing process the most important issue in the continuing development of the Montage tools. The rest of this section sketches a possible solution to the parsing problem and hints at possible future directions for Montages and their applications.
7.3.1 Separation of Concrete and Abstract Syntax
We have already discussed parsing techniques in section 4.5, p. 52ff. An immediate solution would be to use a more powerful parser, e.g. the general context-free parsing algorithm described by Earley [Ear70]. Unfortunately, this does not solve all our parsing problems – we still cannot handle context-sensitive grammars as motivated in section 7.2.1. On the one hand, a simple LL parser seems too rigid for many given grammars; on the other hand, why bother the developer of a simple language with all the expressive power of context-sensitive parsers? We therefore propose to separate the problem of parsing completely from the rest of the language specification. The MCS would then read abstract syntax trees instead of programs given as character streams.
XML as an Intermediate Representation. A very simple and pragmatic approach would be to use XML (eXtensible Markup Language [W3C98]) as an intermediate representation, generated by a parser and read by the Montage Component System. As the syntax of XML is very easy to parse, an existing XML parser can be applied to replace the existing backtracking parser (virtually any SAX or DOM parser will do; SAX, the Simple API for XML, and DOM, the Document Object Model, define interfaces to an XML parser and to the parsed document respectively). For the user of the system this has several advantages:
• He can use any existing parser generator, e.g. CUP [Hud96] for LALR grammars or Metamata Parse [Met97] for LL grammars.
• The parser can be chosen to fit any existing grammar rules; e.g. in many books BNF is used to explain a language (e.g. the Java Language Specification [GJS96]).
• Developers can use a parser of their choice, i.e. one they are familiar with.
• Using XML also allows the use of parsers on non-Java platforms.
The generated XML document can then be sent to an MCS running on a Java virtual machine. In fact, this intermediate representation already exists, and we call it the Montage Exchange Format (MXF): when saving a Montage, an XML file is generated containing all information necessary to reconstruct it again (the coordinates of the elements of the control flow graph are stored in optional XML tags). Presently the defining DTD (Document Type Definition) specifies only the format for one single Montage (so for each Montage there is a separate file). But it should be easy to extend this DTD to allow one file containing a whole
abstract syntax tree. MXF is also intended to be a tool-independent format, which will simplify the exchange of Montages between the different tools.
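As a sketch of this approach, the following Java class reads a hypothetical XML encoding of an abstract syntax tree with a standard SAX parser; the element and attribute names as well as the Node class are invented here and do not reflect the actual MXF format.

    import java.io.StringReader;
    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.Deque;
    import java.util.List;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.Attributes;
    import org.xml.sax.InputSource;
    import org.xml.sax.helpers.DefaultHandler;

    public class AstReader extends DefaultHandler {
        static class Node {
            final String label;
            final List<Node> children = new ArrayList<>();
            Node(String label) { this.label = label; }
        }

        private final Deque<Node> stack = new ArrayDeque<>();
        private Node root;

        // Each opening element becomes a tree node attached to its parent.
        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            Node n = new Node(atts.getValue("label"));
            if (stack.isEmpty()) root = n; else stack.peek().children.add(n);
            stack.push(n);
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            stack.pop();
        }

        public static Node parse(String xml) throws Exception {
            AstReader handler = new AstReader();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
            return handler.root;
        }

        public static void main(String[] args) throws Exception {
            Node ast = parse("<node label='Add'><node label='Term'/><node label='Term'/></node>");
            System.out.println(ast.label + " with " + ast.children.size() + " children");
        }
    }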
7.3.2 Optimization and Monitoring
After successfully specifying and implementing a language, it would be desirable to compile programs as fast as possible. Fast execution would allow MCS to be deployed as a production tool and not only as a fast-prototyping tool. Unfortunately, optimizing executable programs for speed or memory requirements is not well supported in MCS, as it relies on the Java compiler that compiles the generated code. Optimizations often operate on the global scale of a program, but the partitioning scheme of MCS builds on the locality of the given specifications. To extend the system, we propose plug-ins.
Plug-Ins. Optimization could be offered through plug-ins to the Montage Component System. Such plug-ins are system components that can operate on the internal data structure (basically the annotated abstract syntax tree). They would operate between two phases of the transformation process, for example after control flow composition (see Fig. 10, p. 43) but before the Java compiler compiles the generated code. As such plug-ins can only operate between two phases of the transformation process, they would be limited in their optimization capabilities. But it is still possible to write plug-ins that monitor the AST data structure: they could visualize or even animate the transformation process and/or allow the AST to be edited interactively.
Restructuring MCS. In order to be able to replace the implementation of the different transformation phases, it would be necessary to implement them as plug-ins as well. In the present implementation, the different phases cannot be replaced by the user at runtime of MCS. The transformation phases are accessed through Java interfaces; therefore, it is necessary to replace some implementing class files and restart the system in order to replace the behaviour/implementation of the transformation process (or parts of it). To support plug-ins that can be exchanged by the user, it is necessary to extend the interfaces of the transformation phases with plug-in capabilities (e.g. install, remove). This can probably be done by introducing a new interface PlugIn and extending the existing interfaces from it.
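A minimal sketch of this restructuring could look as follows; all names are invented proposals, not part of the current MCS implementation.

    // Invented sketch: a common PlugIn interface from which the
    // transformation-phase interfaces are extended, so that phases (and
    // optimizers or AST monitors) can be installed and removed at runtime.
    interface PlugIn {
        String name();
        void install(McsContext context);    // called when the plug-in is added
        void remove();                       // called before the plug-in is dropped
    }

    // An existing phase interface would extend PlugIn instead of standing alone.
    interface TransformationPhase extends PlugIn {
        // operate on the shared internal data structure between two phases
        void transform(AnnotatedSyntaxTree ast);
    }

    // Placeholder types standing in for the real MCS data structures.
    interface McsContext { }
    interface AnnotatedSyntaxTree { }

    // A monitoring plug-in as suggested above: it only inspects the AST.
    class AstLogger implements TransformationPhase {
        public String name() { return "AST logger"; }
        public void install(McsContext context) { }
        public void remove() { }
        public void transform(AnnotatedSyntaxTree ast) {
            System.out.println("AST after phase: " + ast);
        }
    }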
7.3.3 A Possible Application
A system like MCS does not have to be a standalone application; it could also be part of a web browser, where it would serve as a kind of generalized Java virtual machine. A web page would not only contain tags referring to Java code, but tags referring to a language specification and tags referring to program code. An MCS web browser could first download the language specification, generate an interpreter for it, and then download and interpret the program(s). Of course, downloading a whole language before using it might be a waste of bandwidth for small programs. But there are certainly scenarios where the cost of downloading the language specification is lower than the cost of downloading equivalent Java applets. Long-term caching of downloaded Montage specifications will further improve the performance of such a web browser. The DSL used in such web pages could be chosen according to their contents and thus simplify the creation and support of a web site. But in order to justify the overhead of downloading a language specification first, an application should be used over a longer period of time and it should be highly interactive. We might think of the following scenarios:
Tutorials: The DSL could be the subject of the tutorial, or it could be used to program the many different user interactions that take place throughout the tutorial.
Forms: In an application that uses many forms, a DSL for form description could be used to configure customer-tailored forms. Forms that are capable of displaying different aspects according to the user's input are code- and data-intensive in Java. A form DSL could be much more concise (e.g. compare TCL/TK [Ous97] with the Java API [CLK99]), and thus form-intensive applications would consume less bandwidth.
Symbolic computation systems: The downloaded language is a specification of the Maple, Mathematica or Matlab language. Instead of buying and installing these applications, they could be rented on a "per use" basis. Such applications should be developed in accordance with the marketing and support strategies presented in chapter 2 (Electronic Commerce with Software Components).
Bibliography
[ADK+98]
W. Aitken, B. Dickens, P. Kwiatkowski, O. de Moor, D. Richter, C. Simonyi. Transformation in Intentional Programming. In Proceedings of the 5th International Conference on Software Reuse (ICSR’98), IEEE Computer Society Press, 1998.
[AKP97]
M. Anlauff, P. W. Kutter and A. Pierantonio. Formal Aspects of and Development Environment for Montages. In M.P.A. Sellink, editor, Workshop on the Theory and Practice of Algebraic Specifications, volume ASFSDF-97 of electronic Workshops in Computing. British Computer Society, 1997.
[Anl]
M. Anlauff. Montages Tool Companion: Gem-Mex. Download at ftp://ftp.first.gmd.de/pub/gemmex/.
[Anl00]
M. Anlauff. XASM – An Extensible, Component-Based Abstract State Machines Language. In Y. Gurevich, P. W. Kutter, M. Odersky and L. Thiele, editors, Proceedings of the Abstract State Machine Workshop ASM2000, volume 1912 of Lecture Notes in Computer Science, pages 69-90. Springer, 2000.
[App97]
A.W. Appel. Modern Compiler Implementation in Java. Cambridge University Press, 1997.
[ASU86]
A.V. Aho, R. Sethi and J. D. Ullman. Compilers – Principles, Techniques and Tools. Addison-Wesley, 1986.
[Ber97]
E. Berk. JLex: A lexical analyzer generator for Java. http://www.cs.princeton.edu/~appel/modern/java/JLex, 1997
[BS98]
E. Börger and W. Schulte. A Programmer Friendly Modular Definition of the Semantics of Java. In J. Alves-Foss, editor, Formal Syntax and Semantics of Java. Springer LNCS, 1998.
[Che00]
Z. Chen. JavaCard™ Technology for Smart Cards. The Java Series. Addison-Wesley, 2000.
[CLK99]
P. Chan, R. Lee and D. Kramer. The Java Class Libraries. Volume 1 & 2. The Java Series, Addison-Wesley, 1999.
[CM98]
C. Consel and R. Marlet. Architecturing software using a methodology for language development. In Proceedings of the 10th International Symposium on Programming Languages, Implementations, Logics and Programs (PLILP/ALP ‘98), pp. 170–194, Pisa, Italy, September 1998.
[CO90]
G. M. Clemm and L. J. Osterweil. A mechanism for environment integration. ACM Transactions on Programming Languages and Systems, 12(1):1–25, January 1990.
[Col99]
M. Colan. InfoBus 1.2 Specification. Sun microsystems, February 1999.
[DAB99]
Ch. Denzler, Ph. Altherr and R. Boichat. NOLC – Network and on-line Consulting. Informatik/Informatique, 5:38–40, October 1999.
[DEC96]
P. Deransart, A. Ed-Dbali and L. Cervoni. Prolog: The Standard, Reference Manual. Springer-Verlag, 1996.
[DNW+00] S. Dobson, P. Nixon, V. Wade, S. Terzis and J. Fuller. Vanilla: an open language framework. In K. Czarnecki and U.W. Eisenecker, editors, Generative and Component-Based Software Engineering, LNCS 1799, pages 91–104. Springer-Verlag, 2000.
[Ear70]
J. Earley. An Efficient Context-Free Parsing Algorithm. Communications of the ACM, 13(2):94–102, February 1970.
[Eco97]
Survey of Electronic Commerce. The Economist, May 10, p. 17, 1997.
[Fel98]
P. Felber. The CORBA Object Group Service: A Service Approach to Object Groups in CORBA. PhD thesis 1867, École Polytechnique Fédérale de Lausanne, 1998.
[FGS98]
P. Felber, R. Guerraoui and A. Schiper. The Implementation of a CORBA Object Group Service. Theory and Practice of Object Systems, 4(2):93–105, 1998.
[Fie94]
H. Fierz. SCSM – Synchronous Composition of Sequential Machines, TIK Report 14, Computer Engineering and Networks Laboratory, ETH Zürich, 1998.
[Fie99]
H. Fierz. The CIP Method: Component and Model-based Construction of Embedded Systems. European Software Engineering Conference ESEC’99, Toulouse, 1999.
[FMN93]
H. Fierz, H. Müller and S. Netos. CIP – Communicating Interacting Processes. A Formal Method for the Development of Reactive Systems. In J. Gorski, editor, Proceedings SAFECOMP’93. Springer-Verlag, 1993.
[F+95]
Frey et al. Allgemeine Didaktik. Karl Frey, Verhaltenswissenschaften, ETH Zürich, 8th edition, 1995.
[GCC]
GCC Team. GNU Compiler Collection. http://gcc.gnu.org.
[GE90]
J. Grosch and H. Emmelmann. A Tool Box for Compiler Construction. Report 20, GMD, Forschungsstelle an der Universität Karlsruhe, Vincenz-Prießnitz-Str. 1, D-7500 Karlsruhe, January 1990
[GH93]
Y. Gurevich and J. K. Huggins. The Semantics of the C Programming Language. In Selected papers from CSL’92 (Computer Science Logic), LNCS 702, pages 274–308. Springer-Verlag 1993.
[GHL+92] R. W. Gray, V. P. Heuring, S. P. Levi, A. M. Sloane and W. M. Waite. Eli: A Complete, Flexible Compiler Construction System. Communications of the ACM, 35(2):121-131, February 1992.
[GJS96]
J. Gosling, B. Joy and G. Steele. The Java Language Specification. The Java Series. Addison-Wesley, 1996.
[GJSB00]
J. Gosling, B. Joy, G. Steele and G. Bracha. The Java Language Specification, Second Edition. The Java Series, Addison-Wesley, 2000.
[Gro94]
J. Grosch. CoCoLab. http://www.cocolab.de, 1994. Ingenieurbüro für Datenverarbeitung, Turenneweg 11, D-77880 Sasbach.
[Gur94]
Y. Gurevich. Evolving Algebras 1993: Lipari Guide. In E. Börger, editor, Specification and Validation Methods. Oxford University Press, 1994.
[Gur97]
Y. Gurevich. May 1997 Draft of the ASM Guide. Technical Report CSE-TR-336-97, University of Michigan EECS Department, 1997.
[Har92]
S. P. Harbison. Modula-3, Prentice Hall, 1992.
[Hed99]
G. Hedin. Reference Attributed Grammars. In D. Parigot and M. Mernik, editors, Second Workshop on Attribute Grammars and their
Applications, WAGA’99, pages 153–172, Amsterdam, The Netherlands, March 1999.
[HMT90] R. Harper, R. Milner and M. Tofte. The Definition of Standard ML. The MIT Press, 1990.
[HSSS97]
S. Handschuh, K. Stanoevska-Slabeva and B. Schmid. The Concept of Mediating Electronic Product Catalogues. EM – Electronic Markets, 7(3), September 1997.
[Hud96]
S. E. Hudson. CUP Parser Generator for Java. http://www.cs.princeton.edu/appel/modern/java/CUP, 1996.
[JGJ97]
I. Jacobson, M. Griss and P. Jonsson. Software Reuse. acm press, New York, 1997.
[Joh75]
S. C. Johnson. Yacc — Yet another compiler-compiler. Computing Science Tech. Rep. 32, Bell Laboratories, Murray Hill, N.J. 1975.
[JW74]
K. Jensen and N. Wirth. PASCAL – User Manual and Report. Springer-Verlag, 1974
[Kle99]
G. Klein. JFlex: The Fast Analyser Generator for Java. http://jflex.de, 1999.
[Knu68]
D. E. Knuth. Semantics of context-free languages. Mathematical Systems Theory, 2(2):127-145, June 1968.
[KP88]
G. E. Krasner and S. T. Pope. A cookbook for using the modelview controller user interface paradigm in Smalltalk-80. Journal of Object-Oriented Programming, 1(3):26-49, August 1988.
[KP97a]
P. W. Kutter and A. Pierantonio. The Formal Specification of Oberon. Journal of Universal Computer Science, 3(5):443–501, May 1997.
[KP97b]
P. W. Kutter and A. Pierantonio. Montages: Specifications of Realistic Programming Languages. Journal of Universal Computer Science, 3(5):416–442, May 1997.
[KR88]
B. Kernighan and D. Ritchie. C Programming Language. PrenticeHall, 2nd edition, May 1988.
[Kut01]
P. W. Kutter. Montages – Engineering of Computer Languages. PhD thesis, Institut TIK, ETH Zürich, 2001.
[Lam98]
J. Lampe. Depot4 – A generator for dynamically extensible translators. Software – Concepts & Tools, 19:97–108, 1998.
[Les75]
M. E. Lesk. Lex — A Lexical analyzer generator. Computing Science Tech. Rep. 39, Bell Telephone Laboratories, Murray Hill, N.J. 1975.
[Lis00]
R. Lischner. Delphi in a Nutshell. O’Reilly & Associates, 2000.
[LW93]
D. Larkin and G. Wilson. Object-Oriented Programming and the Objective-C Language. Available at: www.gnustep.org, NeXT Computer Inc, 1993.
[LY97]
T. Lindholm and F. Yellin. The Java Virtual Machine Specification. The Java Series. Addison-Wesley, 1997.
[Mar94]
R. Marti. GIPSY: Ein Ansatz zum Entwurf integrierter Softwareentwicklungssysteme. PhD thesis 10463, Institut TIK, ETH Zürich, 1994.
[MB91]
T. Mason and D. Brown. Lex & Yacc. Nutshell Handbooks. O’Reilly & Associates, 1991.
[Met97]
Metamata. Java CC and Metamata Parse. http://www.metamata.com, 1997.
[Mis97]
S. A. Missura. Higher-Order Mixfix Syntax for Representing Mathematical Notation and its Parsing. PhD thesis 12108, ETH Zürich, 1997.
[MM92]
R. Marti and T. Murer. Extensible Attribute Grammars. TIK Report Nr. 6, Computer Engineering and Networks Laboratory, ETH Zürich, December 1992.
[MS98]
T. Murer and D. Scherer. Organizational Integrity: Facing the Challenge of the Global Software Process. TIK-Report 51, Computer Engineering and Networks Laboratory, ETH Zürich, 1998.
[Mur97]
T. Murer. Project GIPSY: Facing the Challenge of Future Integrated Software Engineering Environments. PhD thesis 12350, Institut TIK, ETH Zürich, 1997.
[MV99]
T. Murer and M. L. van de Vanter. Replacing Copies With Connections: Managing Software across the Virtual Organization. In IEEE 8th International Workshop on Enabling Technologies: Infrastructure for Collaborative Enterprises, Stanford University, California, USA, 16-18 June 1999.
[MW91]
H. Mössenböck and N. Wirth. The Programming Language Oberon-2. Structured Programming, 12:179-195, 1991.
[MZW95a] A. Moorman Zaremski and J. M. Wing. Signature Matching: a Tool for Using Software Libraries. ACM Transactions on Software Engineering and Methodology, 4(2):146–170, April 1995.
[MZW95b] A. Moorman Zaremski and J. M. Wing. Specification Matching of Software Components. Proceedings of the third ACM SIGSOFT symposium on the foundations of software engineering, pages 6–17, October 1995.
[Nau60]
P. Naur. Revised Report on the Algorithmic Language ALGOL 60. Communications of the ACM, 3(5):299–314, May 1960.
[Ous97]
J. K. Ousterhout. Tcl and the Tk Toolkit. Addison-Wesley Professional Computing, 1994.
[OMS97]
Oberon Microsystems Inc. Component Pascal Language Report. Available at www.oberon.ch, 1997.
[PH97a]
A. Poetsch-Heffter. Specification and Verification of Object-Oriented Programs. Habilitationsschrift, 1997.
[PH97b]
A. Poetsch-Heffter. Prototyping Realistic Programming Languages Based on Formal Specifications. Acta Informatica, 1997.
[Sal98]
P. H. Salus, editor. Little Languages and Tools, volume 3 of Handbook of Programming Languages. Macmillan Technical Publishing, 1st edition, 1998.
[Sche98]
D. Scherer. Internet-wide Software Component Development Process and Deployment Integration. PhD thesis 12943, Institut TIK, ETH Zürich, 1998
[Schm97]
D. A. Schmidt. On the Need for a Popular Formal Semantics. In ACM Conference on Strategic Directions in Computing Research, volume 32 of ACM SIGPLAN Notices, pages 115–116, June 1997.
[Schw96]
D. Schweizer. OberonT – eine Programmiersprache für sicherheitskritische Systeme. TIK-Report 21, Computer Engineering and Networks Laboratory, ETH Zürich, 1996.
[Schw97]
D. Schweizer. Ein neuer Ansatz zur Verifikation von Programmen für sicherheitskritische Systeme. PhD thesis 12056, Institut TIK, ETH Zürich, 1997.
[SD97]
D. Schweizer and Ch. Denzler. Verifying the Specification-to-Code Correspondence for Abstract Data Types. In M. Dal Cin, C. Meadows, and W.H. Sanders, editors, Dependable Computing for Critical Applications 6, volume 11 of Dependable Computing and
Fault-Tolerant Systems, pages 177–202. IEEE Computer Society, 1997.
[Sim96]
C. Simonyi, Intentional Programming - Innovation in the Legacy Age. Presented at IFIP WG 2.1 meeting, June 4, 1996, http://www.research.microsoft.com/research/ip/ifipwg/ifipwg.htm
[Sim99]
C. Simonyi, The Future is Intentional. In IEEE Computer, pp. 56–57. IEEE Computer Society, May,1999.
[SK95]
K. Slonneger and B. L. Kurtz. Formal Syntax and Semantics of Programming Languages. Addison-Wesley, Reading, 1995.
[SM00]
R. M. Stallman and R. McGrath. GNU Make. Manual, Free Software Foundation, April 2000.
[Sml97]
Standard ML of New Jersey. Bell Laboratories, URL: ftp://ftp.research.bell-labs.com/dist/smlnj, 1997.
[SP97]
C. Szyperski and C. Pfister. Workshop on Component-Oriented Programming, Summary. In M. Mühlhäuser, editor, Special Issues in Object-Oriented Programming – ECOOP96 Workshop Reader, dpunkt Verlag, Heidelberg, 1997.
[SS99]
K. Stanoevska-Slabeva. The Virtual Software House. Informatik/ Informatique, 5:37–38, October 1999.
[Ste90]
G. L. Steele. Common Lisp: The Language. Digital Press, 2nd edition, May 1990.
[Ste99]
G. L. Steele. Growing a Language. In Journal of Higher-Order and Symbolic Computation 12, 3:221–236, October 1999
[Str97]
B. Stroustrup. The C++ Programming Language. Addison-Wesley, 3rd edition, July 1997.
[Szy97]
C. Szyperski. Component Software. ACM Press, Addison-Wesley, 1997.
[TC97]
S. Thibault and C. Consel. A Framework of Application Generator Design. In M. Harandi, editor, Proceedings of the ACM SIGSOFT Symposium on Software Reusability (SSR ‘97), Software Engineering Notes, 22(3):131–135, Boston, USA, May 1997.
[Tho99]
S. Thompson. Haskell: The Craft of Functional Programming. Addison-Wesley, 2nd edition, 1999.
[Van94]
M. T. Vandevoorde. Exploiting specifications to improve program performance. PhD thesis, Department of Electrical Engineering and Computer Science, MIT, February 1994.
[VM99]
M. L. van de Vanter and T. Murer. Global Names: Support for Managing Software in a World of Virtual Organizations. In Ninth International Symposium on System Configuration Management (SCM-9), Toulouse, France, 5-7 September 1999.
[W3C98]
W3C. Extensible Markup Language (XML) 1.0, REC-xml-19980210 edition, February 1998. W3C Recommendation.
[Wal95]
C. R. Wallace. The Semantics of the C++ Programming Language. In E. Börger, editor, Specification and Validation Methods, pages 131–164. Oxford University Press, 1995.
[Wal98]
C. R. Wallace. The Semantics of the Java Programming Language: Preliminary Version. Technical Report CSE-TR-335-97, EECS Department, University of Michigan, 1997.
[WC99]
J. C. Westland and T. H. K. Clark. Global Electronic Commerce. MIT Press, 1999.
[Wei99]
M. A. Weiss. Data structures and algorithm analysis in Java. Addison Wesley Longman, 1999.
[Wir77a]
N. Wirth. Modula – A Language for Modular Multiprogramming. Software Practice and Experience, 7(1):3-35, January 1977.
[Wir77b]
N. Wirth. What Can We Do about the Unnecessary Diversity of Notations for Syntactic Definitions? Communications of the ACM, 20(11):882–883, 1977.
[Wir82]
N. Wirth. Programming in Modula-2. Springer-Verlag, 1982.
[Wir86]
N. Wirth. Compilerbau, volume 36 of Leitfäden der angewandten Mathematik und Mechanik LAMM, B.G.Teubner, 4th edition, 1986.
[Wir88]
N. Wirth. The Programming Language Oberon. Software – Practice and Experience, 18:671–690, 1988.
Curriculum Vitae
I was born on July 20, 1968 in Liestal (BL). From 1975 to 1984 I attended primary school and Progymnasium in Muttenz. In 1984 I entered High School (Gymnasium) in Muttenz, from which I graduated in 1987 with Matura Typus C. In 1988 I began studying computer science at ETH Zürich. During this time I did two internships at Integra (now Siemens) and Ubilab (UBS). I received the degree Dipl. Informatik-Ing. ETH in 1993. My master thesis entitled A Message Mechanism for Oberon was supervised by Prof. Niklaus Wirth. Afterwards I started working as a research and teaching assistant at the Computer Engineering and Networks Lab (TIK) of ETH in the System Engineering group led by Prof. Albert Kündig.