August 31–September 2, 2009 Edinburgh, Scotland
ICFP’09 Proceedings of the 2009 ACM SIGPLAN
International Conference on Functional Programming Sponsored by:
ACM SIGPLAN Supported by:
Credit Suisse, Galois, Jane Street Capital, Microsoft Research, & Standard Chartered
The Association for Computing Machinery 2 Penn Plaza, Suite 701 New York, New York 10121-0701 Copyright © 2009 by the Association for Computing Machinery, Inc. (ACM). Permission to make digital or hard copies of portions of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyright for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permission to republish from: Publications Dept., ACM, Inc. Fax +1 (212) 869-0481 or . For other copying of articles that carry a code at the bottom of the first or last page, copying is permitted provided that the per-copy fee indicated in the code is paid through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923. Notice to Past Authors of ACM-Published Articles ACM intends to create a complete electronic archive of all articles and/or other material previously published by ACM. If you have written a work that has been previously published by ACM in any journal or conference proceedings prior to 1978, or any SIG Newsletter at any time, and you do NOT want this work to appear in the ACM Digital Library, please inform
[email protected], stating the title of the work, the author(s), and where and when published.
ISBN: 978-1-60558-332-7 Additional copies may be ordered prepaid from:
ACM Order Department PO Box 30777 New York, NY 10087-0777, USA Phone: 1-800-342-6626 (US and Canada) +1-212-626-0500 (Global) Fax: +1-212-944-1318 E-mail:
[email protected] Hours of Operation: 8:30 am – 4:30 pm ET
ACM Order Number 565090 Printed in the USA
Foreword This volume contains the papers presented at ICFP 2009, the 14th ACM SIGPLAN International Conference on Functional Programming, held August 31–September 2, 2009 in Edinburgh, Scotland. ICFP provides a forum for researchers and practitioners to hear about the latest developments in the art and science of functional programming. The 2009 call for papers solicited submissions on topics including language design, implementation, software-development techniques, foundations, transformation and analysis, applications, and domain-specific languages, as well as functional pearls and short experience reports. ICFP welcomes work on all languages that encourage functional programming, including both purely applicative and imperative languages, as well as languages with objects or concurrency. In response to the call for papers, the program committee received 101 submissions: 85 regular papers (including 8 pearls) and 16 experience reports. From these, the program committee selected 26 regular papers and 6 experience reports for presentation at the conference. Regular papers were evaluated according to their relevance, correctness, significance, originality, and clarity. Pearls were evaluated similarly, except that they were not required to report original research, but were expected to be concise, instructive, and entertaining. Regular papers and pearl submissions were strictly limited to twelve pages. Experience reports were evaluated separately from regular papers. An experience report is not expected to add to the body of knowledge of the functional-programming community. Rather, it should extend the body of published, refereed, citable evidence that functional programming really works—or describe what obstacles prevent it from working. Each experience report is labeled as such in its title, and an experience report is shorter than a full paper: submissions were restricted to four pages, and final versions to six pages. Each of the submissions was assigned to at least three members of the program committee. Committee members were allowed to find external reviewers for their assigned submissions, but they were required in all cases to read the papers themselves and to form their own opinions. Initial reviews were made available to authors, who had a two-day period in which to respond. Further online discussion by the program committee followed. Final decisions were made at a two-day committee meeting held in Portland, Oregon and attended in person by all but one committee member, who participated by telephone. Reviewers disqualified themselves from reviewing any papers for which they had a conflict of interest, according to a set of strict criteria. Program committee members with a conflict of interest for a paper were prevented from seeing any reviews or online discussion for that paper and were required to leave the room during its live discussion. There were three submissions by members of the program committee, one of which was accepted. These submissions were discussed after all other papers and were held to a higher standard than the other papers. I have felt very honored to be entrusted with the task of organizing this year’s ICFP program, and I am extremely grateful to the many people who have helped make the process run smoothly. I received invaluable guidance and support from ICFP General Chair Graham Hutton; past Program Chairs Norman Ramsey, Kathleen Fisher, and Peter Thiemann; and the other members of the ICFP Steering Committee, especially Jim Hook. 
Rich Gerber and Paolo Gai of Softconf provided prompt and effective support for their START submission system. Andrew McCreight gave essential assistance during the PC meeting. It has been a privilege to work with the members of the program committee, who dedicated a tremendous amount of time and trouble to the tasks of reviewing and deliberating (and who traveled to the PC meeting at their own expense). Many thanks are also due to the external reviewers. Finally, on behalf of everyone involved in organizing the conference, I thank all the authors who submitted papers. Your continued participation is what makes ICFP possible and worthwhile.
Andrew Tolmach ICFP 2009 Program Chair Portland State University
Table of Contents ICFP 2009 Conference Organization ..................................................................................................ix Invited Talk •
Organizing Functional Code for Parallel Execution or, foldl and foldr Considered Slightly Harmful.........................................................................................1 Guy L. Steele Jr. (Sun Microsystems Laboratories)
Session 1 •
Functional Pearl: La Tour D’Hanoï.............................................................................................................3 Ralf Hinze (University of Oxford)
•
Purely Functional Lazy Non-deterministic Programming ......................................................................11 Sebastian Fischer (Christian-Albrechts University), Oleg Kiselyov (FNMOC), Chung-chieh Shan (Rutgers University)
Session 2 •
Safe Functional Reactive Programming through Dependent Types......................................................23 Neil Sculthorpe, Henrik Nilsson (University of Nottingham)
•
Causal Commutative Arrows and Their Optimization............................................................................35 Hai Liu, Eric Cheng, Paul Hudak (Yale University)
Session 3 •
A Functional I/O System or, Fun for Freshman Kids..............................................................................47 Matthias Felleisen (Northeastern University), Robert Bruce Findler (Northwestern University), Matthew Flatt (University of Utah), Shriram Krishnamurthi (Brown University)
•
Experience Report: Embedded, Parallel Computer-Vision with a Functional DSL............................59 Ryan R. Newton (Massachusetts Institute of Technology), Teresa Ko (University of California at Los Angeles)
•
Runtime Support for Multicore Haskell....................................................................................................65 Simon Marlow, Simon Peyton Jones, Satnam Singh (Microsoft Research)
Session 4 •
Effective Interactive Proofs for Higher-Order Imperative Programs ...................................................79 Adam Chlipala, Gregory Malecha, Greg Morrisett, Avraham Shinnar, Ryan Wisnesky (Harvard University)
•
Experience Report: seL4: Formally Verifying a High-Performance Microkernel............................................................................91 Gerwin Klein (NICTA & University of New South Wales), Philip Derrin (NICTA), Kevin Elphinstone (NICTA & University of New South Wales)
•
Biorthogonality, Step-Indexing and Compiler Correctness ....................................................................97 Nick Benton (Microsoft Research), Chung-Kil Hur (University of Cambridge)
Session 5 •
Scribble: Closing the Book on Ad Hoc Documentation Tools...............................................................109 Matthew Flatt (University of Utah), Eli Barzilay (Northeastern University), Robert Bruce Findler (Northwestern University)
Invited Talk •
Lambda, the Ultimate TA: Using a Proof Assistant to Teach Programming Language Foundations .....................................................................................121 Benjamin C. Pierce (University of Pennsylvania)
Session 6 •
A Universe of Binding and Computation.................................................................................................123 Daniel R. Licata, Robert Harper (Carnegie Mellon University)
•
Non-Parametric Parametricity .................................................................................................................135 Georg Neis, Derek Dreyer, Andreas Rossberg (MPI-SWS)
Session 7 •
Finding Race Conditions in Erlang with QuickCheck and PULSE .....................................................149 Koen Claessen, Michał Pałka, Nicholas Smallbone (Chalmers University of Technology), John Hughes, Hans Svensson, Thomas Arts (Chalmers University of Technology and Quviq AB), Ulf Wiger (Erlang Training and Consulting)
•
Partial Memoization of Concurrency and Communication ..................................................................161 Lukasz Ziarek, KC Sivaramakrishnan, Suresh Jagannathan (Purdue University)
Session 8 •
Free Theorems Involving Type Constructor Classes: Functional Pearl..............................................173 Janis Voigtländer (Technische Universität Dresden)
•
Experience Report: Haskell in the “Real World”: Writing a Commercial Application in a Lazy Functional Language ...................................................185 Curt J. Sampson (Starling Software)
•
Beautiful Differentiation ............................................................................................................................191 Conal M. Elliott (LambdaPix)
Session 9 •
OXenstored: An Efficient Hierarchical and Transactional Database using Functional Programming with Reference Cell Comparisons .....................................................203 Thomas Gazagnaire, Vincent Hanquez (Citrix Systems)
•
Experience Report: Using Objective Caml to Develop Safety-Critical Embedded Tools in a Certification Framework .....................................................................................215 Bruno Pagano, Olivier Andrieu, Thomas Moniot (Esterel Technologies), Benjamin Canou, Emmanuel Chailloux, Philippe Wang, Pascal Manoury (Université Pierre et Marie Curie), Jean-Louis Colaço (Prover Technologies S.A.S)
•
Identifying Query Incompatibilities with Evolving XML Schemas .....................................................221 Pierre Genevès (CNRS), Nabil Layaïda, Vincent Quint (INRIA)
Invited Talk •
Commutative Monads, Diagrams and Knots ..........................................................................................231 Dan Piponi (Industrial Light & Magic)
Session 11 •
Generic Programming with Fixed Points for Mutually Recursive Datatypes ....................................233 Alexey Rodriguez Yakushev (Vector Fabrics B.V.), Stefan Holdermans, Andres Löh (Utrecht University), Johan Jeuring (Utrecht University and Open University of the Netherlands)
•
Attribute Grammars Fly First-Class: How to do Aspect Oriented Programming in Haskell............................................................................245 Marcos Viera (Universidad de la República), S. Doaitse Swierstra (Utrecht University), Wouter Swierstra (Chalmers University of Technology)
Session 12 •
Parallel Concurrent ML ............................................................................................................................257 John Reppy (University of Chicago), Claudio V. Russo (Microsoft Research), Yingqi Xiao (University of Chicago)
•
A Concurrent ML Library in Concurrent Haskell ................................................................................269 Avik Chaudhuri (University of Maryland, College Park)
Session 13 •
Experience Report: OCaml for an Industrial-Strength Static Analysis Framework ........................281 Pascal Cuoq, Julien Signoles, Patrick Baudin, Richard Bonichon, Géraud Canet, Loïc Correnson, Benjamin Monate, Virgile Prevosto, Armand Puccetti (Commissariat à l’Energie Atomique)
•
Control-Flow Analysis of Function Calls and Returns by Abstract Interpretation...........................287 Jan Midtgaard (Roskilde University), Thomas P. Jensen (CNRS)
Session 14 •
Automatically RESTful Web Applications: Marking Modular Serializable Continuations ............299 Jay A. McCarthy (Brigham Young University)
•
Experience Report: Ocsigen, a Web Programming Framework..........................................................311 Vincent Balat, Jérôme Vouillon, Boris Yakobowski (Université Paris Diderot and CNRS)
•
Implementing First-Class Polymorphic Delimited Continuations by a Type-Directed Selective CPS-Transform ........................................................................................317 Tiark Rompf, Ingo Maier, Martin Odersky (École Polytechnique Fédérale de Lausanne)
Session 15 •
A Theory of Typed Coercions and its Applications................................................................................329 Nikhil Swamy (Microsoft Research), Michael Hicks (University of Maryland), Gavin M. Bierman (Microsoft Research)
•
Complete and Decidable Type Inference for GADTs ............................................................................341 Tom Schrijvers (Katholieke Universiteit Leuven), Simon Peyton Jones (Microsoft Research), Martin Sulzmann (Intaris Software GmbH), Dimitrios Vytiniotis (Microsoft Research)
Author Index ................................................................................................................................................353
ICFP 2009 Conference Organization

General Chair: Graham Hutton (University of Nottingham)
Program Chair: Andrew Tolmach (Portland State University)
Local Arrangements: Philip Wadler (University of Edinburgh), Kevin Hammond (University of St. Andrews), Gregory Michaelson (Heriot–Watt University)
Workshops: Christopher Stone (Harvey Mudd College), Michael Sperber (DeinProgramm)
Developer Track: Bryan O'Sullivan (Linden Lab), Michael Sperber (DeinProgramm)
Programming Contest: Andy Gill (University of Kansas)
Publicity: Matthew Fluet (Rochester Institute of Technology)
Program Committee:
Amal Ahmed (Indiana University)
Maria Alpuente (Technical University of Valencia (UPV))
Lennart Augustsson (Standard Chartered Bank)
Lars Birkedal (IT University of Copenhagen)
Manuel Chakravarty (University of New South Wales)
Koen Claessen (Chalmers University of Technology)
Marc Feeley (Université de Montréal)
Andrzej Filinski (University of Copenhagen)
Daan Leijen (Microsoft Research)
Xavier Leroy (INRIA Paris-Rocquencourt)
Conor McBride (University of Strathclyde)
Matthew Might (University of Utah)
Shin-Cheng Mu (Academia Sinica)
Atsushi Ohori (Tohoku University)
Kristoffer Rose (IBM Thomas J. Watson Research Center)
Steering Committee:
Amal Ahmed (Indiana University, USA)
Manuel Chakravarty (University of New South Wales, Australia)
Robby Findler (Northwestern University, USA)
Matthew Fluet (Rochester Institute of Technology, USA)
Ralf Hinze (Oxford University, UK), Chair
James Hook (Portland State University, USA)
Zhenjiang Hu (National Institute of Informatics, Japan)
Paul Hudak (Yale University, USA)
Graham Hutton (University of Nottingham, UK)
François Pottier (INRIA Paris-Rocquencourt, France)
Norman Ramsey (Tufts University, USA)
Peter Thiemann (Universität Freiburg, Germany)
Andrew Tolmach (Portland State University, USA)
Philip Wadler (University of Edinburgh, UK)
Stephanie Weirich (University of Pennsylvania, USA)
Sponsor: ACM SIGPLAN
External Reviewers

Mauricio F. Alba-Castro, Thorsten Altenkirch, Sergio Antoy, Jim Apple, Michele Baggi, Lennart Beringer, Jean-Philippe Bernardy, Richard Bird, Matthias Blume, Mathieu Boespflug, Rajesh Bordawekar, Edwin Brady, Alexandre Buisse, Rafael Caballero, Marco Carbone, Jacques Carette, James Chapman, Arthur Charguéraud, Juan Chen, Kung Chen, Liang-Ting Chen, James Cheney, Adam Chlipala, Tyng-Ruey Chuang, Will Clinger, Alcino Cunha, Sharon Curtis, Nils Anders Danielsson, Olivier Danvy, Gabriel Ditu, Lucas Dixon, Kevin Donnelly, Derek Dreyer, Conal Elliott, Santiago Escobar, Moreno Falaschi, Marco Antonio Feliu, Matthias Felleisen, Amy Felty, Xinyu Feng
Jean-Christophe Filliâtre, Matthew Fluet, Ronald Garcia, Jacques Garrigue, Neil Ghani, Healfdene Goguen, Peter Hancock, Ralf Hinze, Martin Hofmann, Liyang Hu, John Hughes, José Iborra, Patrik Jansson, Jacob Johannsen, Christophe Joubert, Yukiyoshi Kameyama, Matt Kaufmann, Gabriele Keller, Andrew Kennedy, Oleg Kiselyov, Ralf Lämmel, Roman Leshchinskiy, Ann Lillieström, Sam Lindley, Andres Löh, Salvador Lucas, Bradley Lucier, Stephen Magill, Geoffrey Mainland, Luc Maranget, Simon Marlow, Andrew McCreight, Trevor McDonell, James McKinna, Yasuhiko Minamide, Neil Mitchell, Stefan Monnier, Peter Morris, Aleksandar Nanevski, Ulf Norell
Supporters: Credit Suisse, Galois, Jane Street Capital, Microsoft Research, and Standard Chartered
Michael Norrish, Russell O'Connor, Nicolas Oury, Gordon Pace, Michał Pałka, François Pottier, María José Ramírez-Quintana, Yann Régis-Gianas, Benoît Robillard, Daniel Romero, Jaime Sánchez-Hernández, Jeffrey Sarnat, Peter Sestoft, Zhong Shao, Josep Silva, Jérôme Siméon, Satnam Singh, Christian Skalka, Nicholas Smallbone, Dan Spoonhower, Don Stewart, Christopher Stone, Kasper Svendsen, Joel Svensson, Wouter Swierstra, Salvador Tamarit, Peter Thiemann, Phil Trinder, Alicia Villanueva, Lionel Villard, Janis Voigtländer, Dimitrios Vytiniotis, Geoffrey Washburn, Shu-Chun Weng, Simon Winwood, Limsoon Wong, Yingqi Xiao, Ian Zerny
Organizing Functional Code for Parallel Execution or, foldl and foldr Considered Slightly Harmful
Guy L. Steele Jr.
Sun Microsystems Laboratories, Burlington, Massachusetts
[email protected]
Abstract In this talk I will discuss three ideas (none original with me) that I have found to be especially powerful in organizing Fortress programs so that they may be executed equally effectively either sequentially or in parallel: user-defined associative operators (and, more generally, user-defined monoids); conjugate transforms of data; and monoid-caching trees (as described, for example, by Hinze and Paterson). I will exhibit pleasant little code examples (some original with me) that make use of these ideas.
Alan Perlis, inverting Oscar Wilde’s famous quip about cynics, once suggested, decades ago, that a Lisp programmer is one who knows the value of everything and the cost of nothing. Now that the conference on Lisp and Functional Programming has become ICFP, some may think that OCaml and Haskell programmers have inherited this (now undeserved) epigram. I do believe that as multicore processors are becoming prominent, and soon ubiquitous, it behooves all programmers to rethink their programming style, strategies, and tactics, so that their code may have excellent performance. For the last six years I have been part of a team working on a programming language, Fortress, that has borrowed ideas not only from Fortran, not only from Java, not only from Algol and Alphard and CLU, not only from MADCAP and MODCAP and MIRFAC and the Klerer-May system—but also from Haskell, and I would like to repay the favor.
Categories and Subject Descriptors D.3.3 [Language Constructs and Features]: Concurrent programming structures General Terms Algorithms, Languages, Performance Keywords reduction, associative operator, monoid, tree, conjugate transform
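The abstract above centres on user-defined associative operators and monoids as the key to code that runs well both sequentially and in parallel. The following Haskell sketch is not from the talk; the names Tree, reduceTree and example are illustrative only. It shows the basic idea: a reduction expressed over a tree with an associative combining operation leaves the bracketing (and hence a sequential or parallel schedule) free, whereas foldl or foldr forces one left-to-right or right-to-left chain.

import Data.Monoid (Sum (..))

-- A sequence represented as a tree, so a reduction can be split anywhere.
data Tree a = Leaf a | Node (Tree a) (Tree a)

-- Reduce with any monoid; associativity means the result does not depend
-- on how the tree is bracketed, so the two halves of a Node could be
-- reduced in parallel.
reduceTree :: Monoid m => (a -> m) -> Tree a -> m
reduceTree f (Leaf a)   = f a
reduceTree f (Node l r) = reduceTree f l <> reduceTree f r

example :: Int
example = getSum (reduceTree Sum (Node (Leaf 1) (Node (Leaf 2) (Leaf 3))))
-- 6, independent of the reduction order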
Copyright is held by Sun Microsystems, Inc. ICFP’09, August 31–September 2, 2009, Edinburgh, Scotland, UK. ACM 978-1-60558-332-7/09/08.
Functional Pearl: La Tour D'Hanoï
Ralf Hinze
Computing Laboratory, University of Oxford, Wolfson Building, Parks Road, Oxford, OX1 3QD, England
[email protected]
Abstract

This pearl aims to demonstrate the ideas of wholemeal and projective programming using the Towers of Hanoi puzzle as a running example. The puzzle has its own beauty, which we hope to expose along the way.

Categories and Subject Descriptors D.1.1 [Programming Techniques]: Applicative (Functional) Programming; D.3.2 [Programming Languages]: Language Classifications—Applicative (functional) languages, Haskell; D.3.3 [Programming Languages]: Language Constructs and Features—Frameworks, patterns, recursion

General Terms Algorithms, Design, Languages

Keywords Towers of Hanoi, wholemeal programming, projective programming, Hanoi graph, Sierpiński graph, Sierpiński gasket graph, Gray code

1. Introduction

Functional languages excel at wholemeal programming, a term coined by Geraint Jones. Wholemeal programming means to think big: work with an entire list, rather than a sequence of elements; develop a solution space, rather than an individual solution; imagine a graph, rather than a single path. The wholemeal approach often offers new insights or provides new perspectives on a given problem. It is nicely complemented by the idea of projective programming: first solve a more general problem, then extract the interesting bits and pieces by transforming the general program into more specialised ones. This pearl aims to demonstrate the techniques using the popular Towers of Hanoi puzzle as a running example. This puzzle has its own beauty, which we hope to expose along the way.

2. The Hanoi graph

The Towers of Hanoi puzzle was invented by the French number theorist Édouard Lucas more than a hundred years ago. It consists of three vertical pegs, on which discs of mutually different diameters can be stacked. For reference, we call the pegs A, B and C and let a, b and c range over pegs.

data Peg = A | B | C

I own a version of the puzzle where the pegs are arranged in a row. However, the mathematical structure of the puzzle becomes clearer, if we arrange them in a circle. Initially, the discs are placed on one peg in decreasing order of diameter. The task is then to move the disks, one at a time, to another peg subject to the rule that a larger disk must not be placed on a smaller one. This restriction implies that a configuration can be represented by a list of pegs: the first element determines the position of the largest disc, the second element the position of the second largest disc, and so forth. Consequently, there are 3^n possible configurations where n is the total number of discs. Lucas' original puzzle contained 8 discs. The instructions of the puzzle refer to an old Indian legend, attributed to the French mathematician De Parville, according to which monks were given the task of moving a total of 64 discs. Since the day of the world's creation, they transfer the discs, one per day. According to the legend, once they complete their sacred task, the world will come to an end.

Now, taking a wholemeal approach, let us first develop the big picture. The set of all configurations together with the set of legal moves defines a graph, which turns out to enjoy a nice inductive definition. If there are no discs to move around, the graph consists of a singleton node: ◦. For the inductive step we reason as follows: the largest disc can only be moved, if all the smaller discs reside on one other peg. The smaller discs, however, can be moved independent of the largest one. As the largest disk may rest on one of three pegs, the graph for n + 1 discs consequently incorporates three copies of the graph for n discs linked by three edges. The diagram in Fig. 1 illustrates the construction. The graph has the shape of a triangle; the dashed lines indicate the sub-graphs (for n = 0 the sub-graphs collapse to singleton nodes); the three solid lines connect the sub-graphs. The inductive construction shows that

Figure 1. Inductive construction of the Hanoi graph.

the graph is planar: it can be drawn so that no edges intersect. Fig. 2 displays the graph for 4 discs. To reduce clutter, the peg of the largest disc is always written in the centre of the respective subgraph, with the size of the font indicating the size of the disc. Can you find the configuration [B, A, B, C]? The corners of the triangle correspond to perfect configurations: all the discs reside on one peg. The example graph shows that every configuration permits three different moves, except for the three perfect configurations, where only two moves are possible.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ICFP’09, August 31–September 2, 2009, Edinburgh, Scotland, UK. c 2009 ACM 978-1-60558-332-7/09/08. . . $10.00 Copyright
Figure 2. The Hanoi graph for 4 discs.

Let us turn our attention to the layout of the sub-graphs. The following notation proves to be useful: the arrangement x y z denotes a permutation of x, y and z, which by assumption are pairwise different. Using a b c to indicate the position of the largest disc—the largest disc in the left triangle resides on a and so forth—we observe that if the corners of the graph are a b c, then the corners of the sub-graphs are a c b, c b a and b a c, respectively. Using this observation, we can capture the informal description of the construction as a pseudo-Haskell program.
graph 0 (a b c)     = ◦
graph (n+1) (a b c) =
                   b / graph n (c b a)
                  /                   \
      a / graph n (a c b) ------- c / graph n (b a c)
Figure 3. Change of direction.
The function graph n maps an arrangement a b c to an undirected graph, whose vertices are lists of pegs of length n. The notation a / g means prepend a to all the vertices of g; ◦ denotes a singleton node labelled with the empty list. We leave the type of graphs unspecified. If the type were a functor, then a / g would be fmap (a :) g. The call graph 4 (A B C) yields the graph in Fig. 2. A few observations are in order. The definition of graph n implies that the graph has 3^n nodes (a_0 = 1, a_{n+1} = 3 · a_n) and (3^{n+1} − 3)/2 edges (a_0 = 0, a_{n+1} = 3 · a_n + 3). Furthermore, the length of a side of the triangle is 2^n − 1 (a_0 = 0, a_{n+1} = 2 · a_n + 1). Since there are only 3! = 6 permutations of three different items, the graph contains at most six different mini-triangles: A B C, C A B, B C A, C B A, B A C and A C B. Note that the first three arrange the pegs clockwise and the last three counterclockwise. Inspecting the definition of graph n, we see that the direction changes with every recursive step. The diagram in Fig. 3 illustrates the change of direction. This observation implies that the recursion pattern is quite regular: at any depth there are only three different recursive calls. Fig. 4 visualises the call structure using three different colours.
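As a quick sanity check of these counts (a small script of my own, not part of the pearl), the recurrences can be evaluated directly and compared with the closed forms 3^n, (3^(n+1) − 3)/2 and 2^n − 1:

nodes, edges, side :: Int -> Integer
nodes 0 = 1
nodes n = 3 * nodes (n - 1)
edges 0 = 0
edges n = 3 * edges (n - 1) + 3
side 0 = 0
side n = 2 * side (n - 1) + 1

-- True: the recurrences agree with the closed forms for the first few n
checks :: Bool
checks = and [ nodes n == 3 ^ n
            && edges n == (3 ^ (n + 1) - 3) `div` 2
            && side n == 2 ^ n - 1
             | n <- [0 .. 10] ]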
3. Towers of Hanoi, recursively
Solving the Towers of Hanoi puzzle amounts to finding a shortest path in the corresponding graph. Clearly, the shortest path between two corners is along the side of the triangle. Projecting onto the lower side, we transform graph n to

hanoi' 0 (a b c)     = [[ ]]
hanoi' (n+1) (a b c) = a / hanoi' n (a c b) ++ c / hanoi' n (b a c)

Now, a / x is shorthand for fmap (a :) x. The call hanoi' n (a b c) returns a list of configurations solving the problem of 'moving n discs from a to c using b'. Instead of configurations (vertices of the graph) we can alternatively return a list of moves (corresponding to edges).

hanoi 0 (a b c)     = [ ]
hanoi (n+1) (a b c) = hanoi n (a c b) ++ [(a, c)] ++ hanoi n (b a c)
Figure 4. Call structure of hanoi at recursion depth 4.

The pair (a, c) represents the instruction 'move the topmost disc from a to c'. Here is the sequence of moves for n = 4:
>> hanoi 4 (A B C)
[(A, B), (A, C), (B, C), (A, B), (C, A), (C, B), (A, B), (A, C), (B, C), (B, A), (C, A), (B, C), (A, B), (A, C), (B, C)]

Note that the lower part of the list can be obtained from the upper part via a clockwise rotation of the pegs: a c b becomes b a c. There is at least one further variation: instead of an arrangement one can pass a source and a target peg.

hanoi 0 a c     = [ ]
hanoi (n+1) a c = hanoi n a (a ⊥ c) ++ [(a, c)] ++ hanoi n (a ⊥ c) c

The function ⊥, which determines 'the other peg', is given by

A ⊥ A = A;  A ⊥ B = C;  A ⊥ C = B
B ⊥ A = C;  B ⊥ B = B;  B ⊥ C = A
C ⊥ A = B;  C ⊥ B = A;  C ⊥ C = C

We will find some use for ⊥ later on. For the moment, we just note that the operation is commutative and idempotent, but not associative.

4. Towers of Hanoi, parallelly

Imagine that the monastery always accommodates as many monks as there are discs. The tallest monk is responsible for moving the largest disc, the second tallest monk for moving the second largest disc, and so forth. Can we set up a work schedule for the monastery? Inspecting Fig. 2, we notice that, somewhat unfairly, the smallest monk is the busiest. Since the smallest triangles correspond to moves of the smallest disc, he is active every other day. We can extract his work plan from hanoi n by omitting all the moves, except when n equals 1.

cycle 0 (a b c)     = [ ]
cycle 1 (a b c)     = [(a, c)]
cycle (n+1) (a b c) = cycle n (a c b) ++ cycle n (b a c)

The function is called cycle, because the smallest disc cycles around the pegs: in the recursive call it is moved from a to b and then from b to c. Whether it cycles clockwise or counterclockwise depends on the parity of n—the direction changes with every recursive call of cycle. Of course, the smallest disc is by no means special: all the discs cycle around the pegs, albeit at a slower pace and in alternating directions. In fact, hanoi satisfies the following 'fractal' property:

hanoi (n+1) (a b c) = cycle (n+1) (a b c) g hanoi n (a b c) ,    (1)

where g denotes the interleaving of two lists.

[ ] g bs      = bs
(a : as) g bs = a : (bs g as)

The fractal property suggests the following alternative definition of hanoi, which has a strong parallel flavour.

phanoi 0 (a b c)     = [ ]
phanoi (n+1) (a b c) = cycle (n+1) (a b c) g phanoi n (a b c)

In words, the smallest monk starts to work on the first day; he is active every second day and moves his disc, say, clockwise around the pegs. The second smallest monk starts on the second day; he is active every fourth day and moves his disc counterclockwise. And so forth. Actually, only the smallest monk must remember the direction; for the larger discs there is no choice, as one of the other two pegs is blocked by the smallest disc.
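Because the two-column extraction scatters the definitions above, here is a self-contained Haskell transcription of my own (not the paper's notation): the interleaving operator is spelled interleave, ⊥ is spelled other, and cycle is renamed smallest to avoid clashing with the Prelude. Running hanoi 4 A C reproduces the fifteen-move session shown earlier.

data Peg = A | B | C deriving (Eq, Show)

-- 'the other peg': commutative and idempotent, but not associative
other :: Peg -> Peg -> Peg
other x y
  | x == y    = x
  | otherwise = head [p | p <- [A, B, C], p /= x, p /= y]

-- moves for n discs from the first peg to the second
hanoi :: Int -> Peg -> Peg -> [(Peg, Peg)]
hanoi 0 _ _ = []
hanoi n a c = hanoi (n - 1) a b ++ [(a, c)] ++ hanoi (n - 1) b c
  where b = other a c

-- the moves of the smallest disc only (the paper's cycle)
smallest :: Int -> Peg -> Peg -> [(Peg, Peg)]
smallest 0 _ _ = []
smallest 1 a c = [(a, c)]
smallest n a c = smallest (n - 1) a b ++ smallest (n - 1) b c
  where b = other a c

interleave :: [a] -> [a] -> [a]
interleave []       bs = bs
interleave (a : as) bs = a : interleave bs as

-- fractal property (1): hanoi n a c == smallest n a c `interleave` hanoi (n-1) a c
phanoi :: Int -> Peg -> Peg -> [(Peg, Peg)]
phanoi 0 _ _ = []
phanoi n a c = smallest n a c `interleave` phanoi (n - 1) a c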
There is an intriguing cross-connection to binary numbers: the activity diagram of the monks (Which monk has to work on a given day?)

discs 0     = [ ]
discs (n+1) = discs n ++ [n ] ++ discs n

yields the binary carry sequence or ruler function as n goes to infinity (Hinze, 2008). This sequence gives the number of trailing zeros in the binary representations of the positive numbers, most significant bit first. Or put differently, it specifies the running time of the binary increment. We will make use of this observation later on.

The fractal property (1) enjoys a simple inductive proof, which makes essential use of the following abide law. If x1 and x2 are of the same length, then

(x1 ++ y1) g (x2 ++ y2) = (x1 g x2) ++ (y1 g y2) .

The basis of the induction is straightforward. Here is the inductive step.

  cycle (n+2) (a b c) g hanoi (n+1) (a b c)
= { definition of cycle and definition of hanoi }
  (cycle (n+1) (a c b) ++ cycle (n+1) (b a c)) g (hanoi n (a c b) ++ [(a, c)] ++ hanoi n (b a c))
= { abide law }
  (cycle (n+1) (a c b) g hanoi n (a c b)) ++ [(a, c)] ++ (cycle (n+1) (b a c) g hanoi n (b a c))
= { ex hypothesi }
  hanoi (n+1) (a c b) ++ [(a, c)] ++ hanoi (n+1) (b a c)
= { definition of hanoi }
  hanoi (n+2) (a b c)

The abide law is applicable in the second step, because the two lists, cycle (n+1) (a c b) and hanoi n (a c b) ++ [(a, c)], have the same length, namely, 2^n. Speaking of the length of lists, note that 2^(n+1) − 1 = Σ_{i=0}^{n} 2^i is a simple consequence of the fractal property.

5. When will the world come to an end?

Many visitors come to the monastery. Looking at the configuration of discs, they often wonder how many days have passed since the creation of the world. Or, when will the world come to an end? We can answer these questions by locating the current configuration in the Hanoi graph. If we use as positions the three digits 2 0 1—that is, 0 for the left triangle, 2 for the upper triangle and 1 for the right triangle—then we obtain the answer to the first question in binary. Fig. 5 displays the Hanoi graph for 4 discs suitably re-labelled—this graph is also known as the Sierpiński graph. The 2^4 nodes on the lower side of the triangle, and only those, are labelled with binary numbers. Consequently, if the current position contains a 2, we know that the monks have lost track. (To answer the second question, we use as positions 1 2 0 instead of 0 2 1. If we are only interested in the distance to the final configuration, we simply replace the digit 2 by a 1.)

pos (a b c) [ ] = [ ]
pos (a b c) (p : ps)
  | p ≡ a = 0 : pos (a c b) ps
  | p ≡ b = 2 : pos (c b a) ps
  | p ≡ c = 1 : pos (b a c) ps

For instance, pos (A B C) [A, B, C, A] yields [0, 1, 0, 1], the binary representation of 5, most significant bit first. The function pos (a b c) defines a bijection between {A, B, C}^n and {0, 2, 1}^n for any given initial arrangement a b c. This arrangement can be seen as representing the bijective mapping {a ↦ 0, b ↦ 2, c ↦ 1}. Alternatively, we can use an arrangement, say, l t r as a representation of the 'inverse' mapping {A ↦ l, B ↦ t, C ↦ r} obtaining the following slightly more succinct variant of pos.

pos' (l t r) [ ]      = [ ]
pos' (l t r) (A : ps) = l : pos' (l r t) ps
pos' (l t r) (B : ps) = t : pos' (r t l) ps
pos' (l t r) (C : ps) = r : pos' (t l r) ps

The two variants are related by pos (A B C) = pos' (0 2 1). The latter definition is particularly easy to invert.

conf (a b c) [ ]      = [ ]
conf (a b c) (0 : ps) = a : conf (a c b) ps
conf (a b c) (2 : ps) = b : conf (c b a) ps
conf (a b c) (1 : ps) = c : conf (b a c) ps

The call conf (A B C) [0, 1, 0, 1] yields [A, B, C, A], the configuration we obtain after 5 days. The functions pos' and conf are actually identical, if we identify A with 0, B with 2 and C with 1. Then pos' is an involutive graph isomorphism between Hanoi graphs and Sierpiński graphs of the same order.

If the monks have lost track—a 2 appears in the answer list—then we can use the idea of locating a configuration in the Hanoi graph to determine the shortest path to the final configuration c^n. This leads to the following generalised version of hanoi.

ghanoi 0 [ ] (a b c) = [ ]
ghanoi (n+1) (p : ps) (a b c)
  | p ≡ a = ghanoi n ps (a c b) ++ [(a, c)] ++ hanoi n (b a c)
  | p ≡ b = ghanoi n ps (c b a) ++ [(b, c)] ++ hanoi n (a b c)
  | p ≡ c = ghanoi n ps (a b c)

(The index n is actually redundant: it always equals the length of the peg list.) Depending on the location of p, we either walk within the left triangle and then along the lower side, or within the upper triangle and then along the right side, or within the right triangle. The reader should convince herself that ghanoi n ps (a b c) indeed yields the shortest path between ps and c^n. Note that the first two equations are perfectly symmetric, in fact, ghanoi n ps (a b c) = ghanoi n ps (b a c). As an aside, if we fuse ghanoi with length, then we obtain a function that yields the distance to the final configuration.

6. Towers of Hanoi, iteratively

The generalised version of the puzzle—moving from an arbitrary configuration to a perfect configuration—serves as an excellent starting point for the derivation of an iterative solution, that is, a function that maps a configuration to the next move or to the next configuration. The next move is easy to determine: we fuse ghanoi with the natural transformation

first :: [α] → Maybe α
first [ ]      = Nothing
first (a : as) = Just a

that maps a list to its first element. We obtain

move 0 [ ] (a b c) = Nothing
move (n+1) (p : ps) (a b c)
  | p ≡ a = move n ps (a c b) ++ Just (a, c) ++ first (hanoi n (b a c))
  | p ≡ b = move n ps (c b a) ++ Just (b, c) ++ first (hanoi n (a b c))
  | p ≡ c = move n ps (a b c)
Figure 5. The Sierpiński graph of order 4.

The operator ++ is overloaded to also denote concatenation of Maybe values (++ is really the mplus method of MonadPlus).

(++) :: Maybe α → Maybe α → Maybe α
Nothing ++ m = m
Just a  ++ m = Just a

We can simplify the definition of move drastically: since Just a ++ m = Just a, the calls to hanoi can be eliminated; because of that the index n is no longer needed. Furthermore, the first two equations can be unified through the use of ⊥, the operator that determines 'the other peg'. Finally, the second argument, a b c, can be simplified to c, the target peg.

move [ ] c = Nothing
move (p : ps) c
  | p ≢ c = move ps (p ⊥ c) ++ Just (p, c)
  | p ≡ c = move ps c

Here is a short interactive session that illustrates the use of move.

>> move [A, B, B, A, C, C, C, B] C
Just (B, C)
>> move [A, B, B, A, C, C, C, C] C
Just (A, B)
>> move [C, C, C, C, C, C, C, C] C
Nothing

Since the pair (B, C) means 'move the topmost disc from peg B to peg C', the configuration following [A, B, B, A, C, C, C, B] is [A, B, B, A, C, C, C, C]. Incidentally, if we start with the initial peg list [A, A, A, A, A, A, A, A] and target C, we obtain these configurations after 110 and 111 steps. The function move implements a two-way algorithm: on the way down the recursion it calculates the target peg for each disc (using ⊥); on the way up the recursion it determines the smallest disc that is not yet in place (using ++). The following table makes the target pegs explicit. The rows labelled c_i list the target peg for each disc: the first, user-supplied target is c_0 = C, the next target is c_1 = p_0 ⊥ c_0, and so forth.

      i    0  1  2  3  4  5  6  7
110   p_i  A  B  B  A  C  C  C  B
      c_i  C  B  B  B  C  C  C  C    A
111   p_i  A  B  B  A  C  C  C  C
      c_i  C  B  B  B  C  C  C  C    C

The smallest disc is not in place, so it is moved from B to C. In the next round, disc 3 has to be moved from A to B. The smaller discs are moved more frequently, so it is actually prudent to reverse the list of pegs so that the peg on which the smallest disc is located comes first. The situation is similar to the binary increment: visiting the digits from least to most significant bit is more efficient than the other way round. Let us assume for the moment that we know the 'last target', the value of c that is discarded in the first equation of move (the pegs that stick out in the example above). Since ⊥ is reversible, p ⊥ c = c' iff c = p ⊥ c', we can reconstruct the previous target c from the next target c'. The following variant of move makes use of this fact—the suffix 'i' indicates that the input list is now arranged in increasing order of diameter.

movei [ ] c' = Nothing
movei (p : ps) c'
  | p ≢ (p ⊥ c') = Just (p, p ⊥ c') ++ movei ps (p ⊥ c')
  | p ≡ (p ⊥ c') = movei ps c'

We consistently changed the second argument of move to reflect the fact that we compute the previous from the next target and additionally replaced the remaining occurrences of c by p ⊥ c'. Again, we can simplify the code: p ≢ (p ⊥ c') is the same as p ≢ c'. Applying Just a ++ m = Just a once more, we can eliminate the first recursive call to movei so that the function
stops as soon as the smallest displaced disc is found—this was the purpose of the whole exercise. We obtain
Summing up, we obtain the following iterative implementation for solving the generalised Hanoi puzzle.

ihanoi ps c = map fst (iterate step (ps, foldr (⊥) c ps))

iterate :: (a → Maybe a) → (a → [a])
iterate f x = x : case f x of {Nothing → [ ]; Just x' → iterate f x'}
movei [ ] c' = Nothing
movei (p : ps) c'
  | p ≢ c' = Just (p, p ⊥ c')
  | p ≡ c' = movei ps c'
7. Longest paths and Sierpiński's triangle
In a nutshell, movei determines the smallest disc that does not reside on the last target. So, for an iterative version of hanoi we have to maintain two pieces of information: the current configuration and the current ‘last target’. It remains to determine the initial last target and how the last target changes after each move. If the list of pegs is given in the original decreasing order, then we can transform move to
So far we have considered shortest paths in the Hanoi graph. Since the destruction of the world hangs in the balance, as a gift to future generations, we might want to look for the longest path. In the following variant of hanoi the largest disc makes a detour. (According to the Indian legend, the temple is actually in Bramah rather than in Hanoi, hence the name of the function.)
last [ ] c       = c
last (p : ps) c  = last ps (p ⊥ c) ,
bramah 0 (a b c)     = [ ]
bramah (n+1) (a b c) = bramah n (a b c) ++ [(a, b)] ++ bramah n (c b a) ++ [(b, c)] ++ bramah n (a b c)
which yields the last target. It is not hard to see that last is an instance of the famous foldl : last ps c = foldl (⊥) c ps. If we reverse the list, we simply have to replace foldl by foldr additionally using the fact that ⊥ is commutative: foldr (⊥) c ps = foldl (⊥) c (reverse ps). Next, we augment movei so that it returns the next configuration instead of the next move and additionally the next ‘last target’.
The largest disc is first moved from a to b, and then from b to c. Since bramah n returns 3^n − 1 moves (a_0 = 0, a_{n+1} = 3 · a_n + 2), we have actually found a Hamiltonian path. The path has another interesting property: if the pegs are arranged in a row, a b c, then discs are only moved between adjacent pegs. The Hamiltonian path for four discs is displayed in Fig. 6. The picture is quite appealing. Actually, if we move the sub-triangles closer to each other so that the corners touch, we obtain a nice fractal image. Fig. 7 shows the result for 7 discs. The corresponding graph is known as the discrete Sierpiński gasket graph. [1] The picture has been drawn using Functional Metapost's turtle graphics (Korittky, 1998).
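As a small check of bramah (my worked example, not from the paper): for two discs the path uses 3^2 − 1 = 8 moves, every one of them between adjacent pegs, and it visits all 3^2 = 9 configurations starting from [A, A].

>> bramah 2 (A B C)
[(A, B), (B, C), (A, B), (C, B), (B, A), (B, C), (A, B), (B, C)]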
step ([ ], c')     = Nothing
step (p : ps, c')
  | p ≢ c' = Just ((p ⊥ c') : ps, p ⊥ c')
  | p ≡ c' = do {(ps', c) ← step (ps, c'); return (p : ps', p ⊥ c)}
curve 0 d     = forward N turn (2 ∗ d) N forward
curve (n+1) d = turn d N curve n (−d) N turn d N curve n d N turn d N curve n (−d) N turn d
Consider the second equation: p is moved to p ⊥ c 0 . After the move, the disc resides on the target peg. Consequently, the next target is also p ⊥ c 0 —recall that ⊥ is idempotent. This target is then updated in the third equation ‘on the way back’ mimicking move’s mode of operation. The function step runs in constant amortised time, since it performs the same number of steps as the binary increment— recall that the activity diagram of the monks coincides with the binary carry sequence. The next last target can, in fact, be easily calculated by hand. Consider the following two subsequent moves (as usual, a, b and c are pairwise different).
i     ···   n   n−1  n−2  ···   1    0
p_i   ···   a    c    c   ···   c    c
c_i   ···   b    c    c   ···   c    c     c
p_i   ···   b    c    c   ···   c    c
c_i   ···   b    b    a   ···   b    a     b

(The first pair of rows shows a configuration and its target pegs before the move, the second pair after it; the peg sticking out on the right is the 'last target'.)

The command forward moves the turtle one step forward, turn d turns the turtle clockwise by d degrees, and N sequences two turtle actions. Since turtle graphics is state-based—the turtle has a position and faces a direction—recursive definitions typically maintain an invariant. To draw the 'triangle' a b c, we start at a looking at b and stop at c looking away from b. The overall change of direction is twice the second argument of curve, which for equilateral triangles is either 60° or −60°. The curve is closely related to Sierpiński's arrowhead curve. In fact, both curves yield Sierpiński's triangle as n goes to infinity. As an aside, Sierpiński's triangle is a so-called fractal curve: it has the Hausdorff dimension log 3 / log 2 ≈ 1.58496, as it consists of three self-similar pieces with magnification factor 2.
8. Gray code
The function bramah enumerates the configurations {A, B, C}n changing only one peg at a time. In other words, the succession of configurations corresponds to a ternary Gray code! To investigate the correspondence a bit further, here is a version of bramah that returns a list of configurations, rather than a list of moves.
Assume that the first configuration ends with an even number of cs. The topmost disc of a is then moved to b. The new succession of target pegs consequently alternates between b and a: since the number of cs is even, the new last target is b; for an odd number, it is a. So, the monks can be instructed as follows: determine the smallest disc that is not on c. Transfer it from a to b. If the disc’s number is even, the new last target is b; otherwise, it is a. If we solve the original puzzle, that is, if the configurations lie on one of the sides of the triangle, then the next last target is even easier to determine: like the discs, it cycles around the pegs. If the pegs are arranged A B C and we move the discs from A to C, then the last target moves counterclockwise for an even number of discs and clockwise for an odd number.
bramah' 0 (a b c)     = [[ ]]
bramah' (n+1) (a b c) = a / bramah' n (a b c) ++ b / bramah' n (c b a) ++ c / bramah' n (a b c)
term gasket graph is actually not used consistently. Some authors call the counterpart of the Hanoi graph the Sierpi´nski gasket graph, and its contracted variant the Sierpi´nski graph.
8
Figure 6. A Hamiltonian path in the Hanoi graph for 4 discs.
Figure 7. A Hamiltonian path in the Hanoi graph for 7 discs.
There are two standard ternary Gray codes: the so-called modular code, where the digits vary 012|201|120 · · ·, and the reflected code, where the digits vary 012|210|012 · · ·. The definition above yields the latter, as the discs are only moved between adjacent pegs. In fact, we have bramah' n (c b a) = reverse (bramah' n (a b c)). Using this property on the second recursive call, we can simplify bramah' n (0 1 2) somewhat.

gray3 0     = [[ ]]
gray3 (n+1) = 0 / gray3 n ++ 1 / reverse (gray3 n) ++ 2 / gray3 n
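For instance (my worked example, not from the paper), the reflected ternary code on two digits comes out as

>> gray3 2
[[0,0], [0,1], [0,2], [1,2], [1,1], [1,0], [2,0], [2,1], [2,2]]

where, as expected of a Gray code, any two adjacent code words differ in exactly one digit.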
Hanoi graph. From that we derived a series of programs evolving around the Tower of Hanoi theme. Knowing the big picture was jolly useful: for instance, calculating the number of moves could be reduced to the problem of locating a configuration in the graph. Projective program transformations are abundant: hanoi is derived from graph by projecting onto the lower side of the graph, cycle is derived from hanoi by mashing out the moves of the larger discs, and so forth. The transformations could be made rigorous within the Algebra of Programming framework (Bird and de Moor, 1997). Occasionally, this comes at an additional cost. For instance, to derive cycle from hanoi we would additionally need the disc number, which is not present in hanoi ’s output. A lot is left to explore. There are literally hundreds of papers on the subject: Paul Stockmeyer’s comprehensive bibliography has a total of 369 entries (2005). From that bibliography I learned that my definition of the Hanoi graph is not original: Er introduced it to analyse the complexity of the generalised Tower of Hanoi problem (1983). The original instructions of the game already alluded to the recursive procedure for solving the puzzle. It has been used since to illustrate the concept of recursion. The parallel version—usually classified as iterative— is due to Buneman and Levy (1980). Backhouse and Fokkinga (2001) show that each disc cycles around the pegs exploiting the associativity of equivalence. To the best of the author’s knowledge the iterative, or stepwise variant is original. The connection to Gray codes has first been noticed by Gardner (1972).
For comparison, here is the definition of the Gray binary sequence.

gray2 0     = [[ ]]
gray2 (n+1) = 0 / gray2 n ++ 1 / reverse (gray2 n)

Actually, the binary Gray code is also hidden in the Tower of Hanoi puzzle. We only have to work with a different configuration space: instead of {A, B, C}^n we use {0, 1}^n keeping track whether a disc has been moved an even or an odd number of times. The initial configuration AA^n becomes 00^n, the final configuration CC^n becomes 10^n. Since the ith disc is moved 2^i times, all the discs make an even number of moves, except for the largest, which makes a single move. We can easily adapt ihanoi to generate binary Gray code. We take as a starting point the first definition of movei, slightly modified so that the next configuration is returned instead of the next move (we also applied the Just a ++ m = Just a simplification).

configi [ ] c' = Nothing
configi (p : ps) c'
  | p ≢ (p ⊥ c') = Just ((p ⊥ c') : ps)
  | p ≡ (p ⊥ c') = p / configi ps (p ⊥ c')
Acknowledgments Special thanks are due to Daniel James for enjoyable discussions and for suggesting a number of stylistic and presentational improvements. Thanks are also due to the anonymous referees for an effort to make the paper less dense.
Now, the Gray code equivalent of ⊥ is the Boolean operation exclusive or, that is, inequality of Booleans. This implies that the last target c 0 corresponds to a parity bit. Thus, configi becomes (the code uses Booleans rather than bits)
References

Backhouse, Roland, and Maarten Fokkinga. 2001. The associativity of equivalence and the Towers of Hanoi problem. Information Processing Letters 77:71–76.
codei [ ] p = Nothing
codei (b : bs) p
  | b ≢ (b ≢ p) = Just ((b ≢ p) : bs)
  | b ≡ (b ≢ p) = b / codei bs (b ≢ p)
Bird, Richard, and Oege de Moor. 1997. Algebra of Programming. London: Prentice Hall Europe.
Again, we can simplify the code a bit: inequality and equality of Boolean values are associative (Backhouse and Fokkinga, 2001), so b ≢ (b ≢ p) is simply p. Using False ≢ b = b and True ≢ b = ¬ b, we obtain
Buneman, Peter, and Leon Levy. 1980. The Towers of Hanoi problem. Information Processing Letters 10(4–5):243–244. Er, M.C. 1983. An analysis of the generalized Towers of Hanoi problem. BIT 23:429–435.
codei [ ] p = Nothing
codei (b : bs) p
  | p   = Just (¬ b : bs)
  | ¬ p = b / codei bs b
Gardner, Martin. 1972. Mathematical games: The curious properties of the Gray code and how it can be used to solve puzzles. Scientific American 227(2):106–109. Reprinted, with Answer, Addendum, and Bibliography, as Chapter 2 of Knotted Doughnuts and Other Mathematical Entertainments, W. H. Freeman and Co., New York, 1986.
In words: we traverse the list p : bs up to the first 1; the following bit, if any, is flipped. As for ihanoi, we have to maintain two pieces of information: the current Gray code and the current parity bit. The latter is easy to update: it is flipped in each step. Summing up, we obtain the following Gray code generator.
Hinze, Ralf. 2008. Functional Pearl: Streams and Unique Fixed Points. In Proceedings of the 2008 International Conference on Functional Programming, ed. Peter Thiemann, 189–200. ACM Press.
igray bs = map fst (iterate step (bs, foldr (≢) True bs))
  where step (bs, p) = do {bs' ← codei bs p; return (bs', ¬ p)}
Knuth, Donald E. 2005. The Art of Computer Programming, Volume 4, Fascicle 2: Generating All Tuples and Permutations. Addison-Wesley Publishing Company.
This is, in fact, the functional version of Knuth’s Algorithm G (2005).
Korittky, Joachim. 1998. Functional METAPOST. Diplomarbeit, Universität Bonn.

9. Conclusion and further reading
Stockmeyer, Paul K. 2005. The Tower of Hanoi: A bibliography. Available from http://www.cs.wm.edu/~pkstoc/biblio2.pdf.
We have come to the end of the Tour D'Hanoï. In the spirit of the wholemeal approach we started with an inductive definition of the
Purely Functional Lazy Non-deterministic Programming Sebastian Fischer
Oleg Kiselyov
Chung-chieh Shan
Christian-Albrechts University, Germany
[email protected]
FNMOC, CA, USA
[email protected]
Rutgers University, NJ, USA
[email protected]
Abstract

Functional logic programming and probabilistic programming have demonstrated the broad benefits of combining laziness (non-strict evaluation with sharing of the results) with non-determinism. Yet these benefits are seldom enjoyed in functional programming, because the existing features for non-strictness, sharing, and non-determinism in functional languages are tricky to combine. We present a practical way to write purely functional lazy non-deterministic programs that are efficient and perspicuous. We achieve this goal by embedding the programs into existing languages (such as Haskell, SML, and OCaml) with high-quality implementations, by making choices lazily and representing data with non-deterministic components, by working with custom monadic data types and search strategies, and by providing equational laws for the programmer to reason about their code.

Categories and Subject Descriptors D.1.1 [Programming Techniques]: Applicative (Functional) Programming; D.1.6 [Programming Techniques]: Logic Programming; F.3.3 [Logics and Meanings of Programs]: Studies of Program Constructs—Type structure

General Terms Design, Languages

Keywords Monads, side effects, continuations, call-time choice

1. Introduction

Non-strict evaluation, sharing, and non-determinism are all valuable features in functional programming. Non-strict evaluation lets us express infinite data structures and their operations in a modular way (Hughes 1989). Sharing lets us represent graphs with cycles, such as circuits (surveyed by Acosta-Gómez 2007), and express memoization (Michie 1968), which underlies dynamic programming. Since Rabin and Scott’s Turing-award paper (1959), non-determinism has been applied to model checking, testing (Claessen and Hughes 2000), probabilistic inference, and search. These features are each available in mainstream functional languages. A call-by-value language can typically model non-strict evaluation with thunks and observe sharing using reference cells, physical identity comparison, or a generative feature such as Scheme’s gensym or SML’s exceptions. Non-determinism can be achieved using amb (McCarthy 1963), threads, or first-class continuations (Felleisen 1985; Haynes 1987). In a non-strict language like Haskell, non-determinism can be expressed using a list monad (Wadler 1985) or another MonadPlus instance, and sharing can be represented using a state monad (Acosta-Gómez 2007; §2.4.1). These features are particularly useful together. For instance, sharing the results of non-strict evaluation—known as call-by-need or lazy evaluation—ensures that each expression is evaluated at most once. This combination is so useful that it is often built-in: as delay in Scheme, lazy in OCaml, and memoization in Haskell. In fact, many programs need all three features. As we illustrate in §2, lazy functional logic programming (FLP) can be used to express search problems in the more intuitive generate-and-test style yet solve them using the more efficient test-and-generate strategy, which is to generate candidate solutions only to the extent demanded by the test predicate. This pattern applies to property-based test-case generation (Christiansen and Fischer 2008; Fischer and Kuchen 2007; Runciman et al. 2008) as well as probabilistic inference (Goodman et al. 2008; Koller et al. 1997). Given the appeal of these applications, it is unfortunate that combining the three features naively leads to unexpected and undesired results, even crashes. For example, lazy in OCaml is not thread-safe (Nicollet et al. 2009), and its behavior is unspecified if the delayed computation raises an exception, let alone backtracks. Although sharing and non-determinism can be combined in Haskell by building a state monad that is a MonadPlus instance (Hinze 2000; Kiselyov et al. 2005), the usual monadic encoding of non-determinism in Haskell loses non-strictness (see §2.2). The triple combination has also been challenging for theoreticians and practitioners of FLP (López-Fraguas et al. 2007, 2008). After all, Algol has made us wary of combining non-strictness with any effect. The FLP community has developed a sound combination of laziness and non-determinism, call-time choice, embodied in the Curry language. Roughly, call-time choice makes lazy non-deterministic programs predictable and comprehensible because their declarative meanings can be described in terms of (and are often the same as) the meanings of eager non-deterministic programs.

1.1 Contributions

We embed lazy non-determinism with call-time choice into mainstream functional languages in a shallow way (Hudak 1996), rather than, say, building a Curry interpreter in Haskell (Tolmach and Antoy 2003). This new approach is especially practical because these languages already have mature implementations, because functional programmers are already knowledgeable about laziness, and because different search strategies can be specified as MonadPlus instances and plugged into our monad transformer. Furthermore, we provide equational laws that programmers can use to reason about their code, in contrast to previous accounts of call-time choice based on directed, non-deterministic rewriting. The key novelty of our work is that non-strictness, sharing, and non-determinism have not been combined in such a general way before in purely functional programming. Non-strictness and non-determinism can be combined using data types with non-deterministic components, such that a top-level constructor can be
computed without fixing its arguments. However, such an encoding defeats Haskell’s built-in sharing mechanism, because a piece of non-deterministic data that is bound to a variable that occurs multiple times may evaluate to a different (deterministic) value at each occurrence. We retain sharing by annotating programs explicitly with a monadic combinator for sharing. We provide a generic library to define non-deterministic data structures that can be used in non-strict, non-deterministic computations with explicit sharing. Our library is implemented as a monad transformer and can, hence, be combined with arbitrary monads for non-determinism. We are, thus, not restricted to the list monad (which implements depth-first search) but can also use monads that backtrack more efficiently or provide a complete search strategy. The library does not directly support logic variables—perhaps the most conspicuous feature of FLP—and the associated solution techniques of narrowing and residuation, but logic variables can be emulated using non-deterministic generators (Antoy and Hanus 2006) or managed using an underlying monad of equality and other constraints. We present our concrete code in Haskell, but we have also implemented our approach in OCaml. Our monadic computations perform competitively against corresponding computations in Curry that use non-determinism, narrowing, and unification.

the duplication of a computation bound to x does not cause it to be evaluated twice. In a lazy language, the value of the expression factorial 100 would only be computed once when evaluating iterate (`div` 2) (factorial 100). This property—called sharing—makes lazy evaluation strictly more efficient than eager evaluation, at least on some problems (Bird et al. 1997).

1.2
2.2
return :: a -> m a (>>=) :: m a -> (a -> m b) -> m b
Structure of the paper
The operation return builds a deterministic computation that yields a value of type a, and the operation >>= (“bind”) chains computations together. Haskell’s do-notation is syntactic sugar for long chains of >>=. For example, the expression
In §2 we describe non-strictness, sharing, and non-determinism and why they are useful together. We also show that their naive combination is problematic, to motivate the explicit sharing of non-deterministic computations. In §3 we clarify the intuitions of sharing and introduce equational laws to reason about lazy non-determinism. Section 4 develops an easy-to-understand implementation in several steps. Section 5 generalizes and speeds up the simple implementation. We review the related work in §6 and then conclude.
2.
do x <- e1; e2 abbreviates e1 >>= \x -> e2. If a monad m is an instance of MonadPlus, then two additional operations are available:

mzero :: m a
mplus :: m a -> m a -> m a
Non-strictness, sharing, and non-determinism
In this section, we describe non-strictness, sharing, and non-determinism and explain why combining them is useful and non-trivial. 2.1
Non-determinism
Programming non-deterministically can simplify the declarative formulation of an algorithm. For example, many languages are easier to describe using non-deterministic rather than deterministic automata. As logic programming languages such as Prolog and Curry have shown, the expressive power of non-determinism simplifies programs because different non-deterministic results can be viewed individually rather than as members of a (multi-)set of possible results (Antoy and Hanus 2002). In Haskell, we can express non-deterministic computations using lists (Wadler 1985) or, more generally, monads that are instances of the type class MonadPlus. A monad m is a type constructor that provides two polymorphic operations:
Here, mzero is the primitive failing computation, and mplus chooses non-deterministically between two computations. For the list monad, return builds a singleton list, mzero is an empty list, and mplus is list concatenation. As an example, the following monadic operation computes all permutations of a given list non-deterministically:
Lazy evaluation
Lazy evaluation is illustrated by the following Haskell predicate, which checks whether a given list of numbers is sorted:

isSorted :: [Int] -> Bool
isSorted (x:y:zs) = (x <= y) && isSorted (y:zs)
isSorted _        = True

perm :: MonadPlus m => [a] -> m [a]
perm []     = return []
perm (x:xs) = do ys <- perm xs
                 insert x ys

insert :: MonadPlus m => a -> [a] -> m [a]
insert x xs = return (x:xs)
      `mplus` case xs of
                []     -> mzero
                (y:ys) -> do zs <- insert x ys
                             return (y:zs)

iterate :: (a -> a) -> a -> [a]
iterate next x = x : iterate next (next x)
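For instance, instantiating m with the list monad (a small sanity check, not taken from the paper), perm enumerates every permutation of its input:

allPerms :: [[Int]]
allPerms = perm [1,2,3]   -- all six permutations, in the order induced by insert
-- length allPerms == 6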
The operation perm permutes a list by recursively inserting elements at arbitrary positions. To insert an element x at an arbitrary position in xs, the operation insert either puts x in front of xs or recursively inserts x somewhere in the tail of xs if xs is not empty. Non-determinism is especially useful when formulating search algorithms. Following the generate-and-test pattern, we can find solutions to a search problem by non-deterministically describing candidate solutions and using a separate predicate to filter results. It is much easier for a programmer to express generation and testing separately than to write an efficient search procedure by hand, because the generator can follow the structure of the data
The test isSorted (iterate (`div` 2) n) yields the result False if n > 0. It does not terminate if n <= 0.

insert :: MonadPlus m => m a -> m (List m a) -> m (List m a)
insert mx mxs = cons mx mxs
        `mplus` do Cons my mys <- mxs
                   cons my (insert mx mys)

sort :: MonadPlus m => [Int] -> m [Int]
sort xs = do ys <- perm xs
             ...

sort :: MonadPlus m => m (List m Int) -> m (List m Int)
sort xs = let ys = perm xs
          in do True <- isSorted ys
                ys

share :: m a -> m (m a)
nil :: Monad m => m (List m a) nil = return Nil
where m is an instance of MonadPlus that supports explicit sharing. (We describe the implementation of explicit sharing in §§4–5.) The function sort can then be redefined to actually sort:
cons :: Monad m => m a -> m (List m a) -> m (List m a) cons x y = return (Cons x y)
sort xs = do ys <- share (perm xs)
             True <- isSorted ys
             ys

isSorted :: Monad m => m (List m Int) -> m Bool
isSorted ml = ml >>= \l -> case l of
  Cons mx mxs -> mxs >>= \xs -> case xs of
    Cons my mys -> mx >>= \x -> my >>= \y ->
                     if x <= y then isSorted (cons my mys) else return False
    _ -> return True
  _ -> return True
In this version of sort, the variable ys denotes the same permutation wherever it occurs but is nevertheless only computed as much as demanded by the predicate isSorted.
3.
Programming with lazy non-determinism
In this section we formalize the share combinator and specify equational laws with which a programmer can reason about nondeterministic programs with share and predict their observations. Before the laws, we first present a series of small examples to clarify how to use share and what share does.
By generating lists with non-deterministic arguments, we can define a lazier version of the permutation algorithm.
3.1
perm :: MonadPlus m => m (List m a) -> m (List m a) perm ml = ml >>= \l -> case l of Nil -> nil Cons mx mxs -> insert mx (perm mxs)
The intuition of sharing
We define two simple programs. The computation coin flips a coin and non-deterministically returns either 0 or 1.

coin :: MonadPlus m => m Int
coin = return 0 `mplus` return 1
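For example, taking the list monad as the MonadPlus instance (a sketch, not from the paper), binding coin twice lets each occurrence choose independently, which is exactly what sharing later rules out:

pairs :: [(Int, Int)]
pairs = do x <- coin   -- first, independent flip
           y <- coin   -- second, independent flip
           return (x, y)
-- pairs == [(0,0),(0,1),(1,0),(1,1)]; with one shared coin, only (0,0) and (1,1) remain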
Note that we no longer evaluate (bind) the recursive call of perm in order to pass the result to the operation insert, because insert now takes a non-deterministic list as its second argument.
1 In fact, the signature has additional class constraints; see §5.
The function duplicate evaluates a given computation a twice.
first :: MonadPlus m => m (List m a) -> m a first l = l >>= \(Cons x xs) -> x
duplicate :: Monad m => m a -> m (a, a)
duplicate a = do u <- a
                 v <- a
                 return (u, v)

dupl :: Monad m => m a -> m (List m a)
dupl x = cons x (cons x nil)

The function dupl is subtly different from duplicate: whereas duplicate runs a computation twice and returns a data structure with the results, dupl returns a data structure containing the same computation twice without running it. The following two examples illustrate the benefit of data structures with non-deterministic components.
Sharing enforces call-time choice
We contrast three ways to bind x:

dup_coin_let   = let x = coin in duplicate x
dup_coin_bind  = do x <- coin
                    duplicate (return x)
dup_coin_share = do x <- share coin
                    duplicate x

share a = a >>= \x -> return (return x)

This implementation satisfies the (Choice) law, but it does not satisfy the (Fail) and (Bot) laws. The (Lzero) law of MonadPlus shows that this implementation renders share mzero equal to mzero, which is observationally different from the return mzero required by (Fail). This attempt ensures early choice using early demand, so we get eager sharing, rather than lazy sharing as desired.

4.2
Memoization
We can combine late demand and early choice using memoization. The idea is to delay the choice until it is demanded, and to remember the choice when it is made for the first time so as to not make it again if it is demanded again. To demonstrate the idea, we define a very specific version of share that fixes the monad and the type of shared values. We use a state monad to remember shared monadic values. A state monad is an instance of the following type class, which defines operations to query and update a threaded state component. class MonadState s m where get :: m s put :: s -> m ()
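To illustrate this interface (a sketch using the standard mtl MonadState class rather than the paper's own two-parameter class), a supply of fresh keys needs nothing beyond get and put:

import Control.Monad.State (MonadState (get, put), evalState)

-- Hand out increasing keys by threading an Int through the computation.
freshKey :: MonadState Int m => m Int
freshKey = do k <- get
              put (k + 1)
              return k

-- evalState (do a <- freshKey; b <- freshKey; return (a, b)) 0  ==  (0, 1)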
x ? y -> y
Using the theorem above, we conclude that ⟦C[a ? b]⟧ = ⟦C[a] ? C[b]⟧, which inspired our (Choice) law.
data Thunk a = Uneval (Memo a) | Eval a Here, Memo is the name of our monad. It threads a list of Thunks through non-deterministic computations represented as lists.
3 without selective strictness via seq
4 The set monad can be implemented in Haskell just like the list monad, with the usual Monad and MonadPlus instances that do not depend on Eq or Ord, as long as computations can only be observed using the null predicate.
newtype Memo a = Memo { unMemo :: [Thunk Int] -> [(a, [Thunk Int])] }
The instance declarations for the type classes Monad, MonadState, and MonadPlus are as follows:
We also reuse the memo function, which now has a different type. We could try to define share simply as a renaming for memo again:
instance Monad Memo where return x = Memo (\ts -> [(x,ts)]) m >>= f = Memo (concatMap (\(x,ts) -> unMemo (f x) ts) . unMemo m)
share :: Memo (List Memo Int) -> Memo (Memo (List Memo Int)) share a = memo a However, with this definition lists are not shared deeply. This behavior corresponds to the expression heads_bind where the head and the tail of the demanded list are still executed whenever they are demanded and may hence yield different results when duplicated. This implementation does not satisfy the (HNF) law. We can remedy this situation by recursively memoizing the head and the tail of a shared list:
instance MonadState [Thunk Int] Memo where get = Memo (\ts -> [(ts,ts)]) put ts = Memo (\_ -> [((),ts)]) instance MonadPlus Memo where mzero = Memo (const []) a ‘mplus‘ b = Memo (\ts -> unMemo a ts ++ unMemo b ts)
share :: Memo (List Memo Int) -> Memo (Memo (List Memo Int))
share a = memo (do l <- a
                   case l of
                     Nil       -> nil
                     Cons x xs -> do y  <- share x
                                     ys <- share xs
                                     cons y ys)

eval :: Memo (List Memo Int) -> Memo (List Memo Int)
eval a = do l <- a
            case l of
              Nil       -> nil
              Cons x xs -> do y  <- x
                              ys <- eval xs
                              return (Cons (return y) (return ys))
This implementation of share adds an unevaluated thunk to the current store and returns a monadic action that, when executed, queries the store and either returns the already evaluated result or evaluates the unevaluated thunk before updating the threaded state. The argument a given to share is not demanded until the inner action is performed. Hence, this implementation of share satisfies the (Fail) and (Bot) laws. Furthermore, the argument is only evaluated once, followed by an update of the state to remember the computed value. Hence, this implementation of share satisfies the (Choice) law (up to observation, as defined in §4.4). If the inner action is duplicated and evaluated more than once, then subsequent calls will yield the same result as the first call due to memoization. 4.3
Observing non-deterministic results
In order to observe the results of a computation that contains nondeterministic components, we need a function (such as run in Figure 2) that evaluates all the components and combines the resulting alternatives to compute a non-deterministic choice of deterministic results. For example, we can define a function eval that computes all results from a non-deterministic list of numbers.
The lists returned by eval are fully determined. Using eval, we can define an operation run that computes the results of a non-deterministic computation:

run :: Memo (List Memo Int) -> [List Memo Int]
run m = map fst (unMemo (m >>= eval) [])

In order to guarantee that the observed results correspond to predicted results according to the laws in §3.2, we place two requirements on the monad used to observe the computation ([] above). (In contrast, the laws in §3.2 constrain the monad used to express the computation (Memo above).)

Idempotence of mplus. The (Choice) law predicts that the computation run (share coin >>= λ_. ret Nil) gives ret′ Nil ⊕′ ret′ Nil. However, our implementation gives a single solution ret′ Nil (following the (Ignore) law, as it turns out). Hence, we require ⊕′ to be idempotent; that is, m ⊕′ m = m. This requirement is satisfied if we abstract from the multiplicity of results (considering [] as the set monad rather than the list
Non-deterministic components
The version of share just developed memoizes only integers. However, we want to memoize data with non-deterministic components, such as permuted lists that are computed on demand. So instead of thunks that evaluate to numbers, we redefine the Memo monad to store thunks that evaluate to lists of numbers now. newtype Memo a = Memo { unMemo :: [Thunk (List Memo Int)] -> [(a, [Thunk (List Memo Int)])] }
5 This implementation of share does not actually type-check because share x in the body needs to invoke the previous version of share, for the type Int, rather than this version, for the type List Memo Int. The two versions can be made to coexist, each maintaining its own state, but we develop a polymorphic share combinator in §5 below, so the issue is moot.
The instance declarations for Monad and MonadPlus stay the same. In the MonadState instance only the state type needs to be adapted.
monad), as is common practice in FLP, or if we treat ⊕′ as averaging the weights of results, as is useful for probabilistic inference.

Distributivity of bind over mplus.
For example, we have applied our approach to express and sample from probability distributions as OCaml programs in direct style (Filinski 1999). With less development effort than state-of-the-art systems, we achieved comparable concision and performance (Kiselyov and Shan 2009). The implementation of our monad transformer is available as a Hackage package at: http://hackage.haskell.org/cgi-bin/hackage-scripts/package/explicit-sharing-0.1
According to the (Choice) law, the result of the computation
run (share coin >>= λc. coin >>= λy. c >>= λx. ret (Cons (ret x) (ret (Cons (ret y) (ret Nil))))) is the following non-deterministic choice of lists (we write ⟨x, y⟩ to denote ret′ (Cons (ret′ x) (ret′ (Cons (ret′ y) (ret′ Nil))))).
5.1
We have seen in the previous section that in order to share nested, non-deterministic data deeply, we need to traverse it and apply the combinator share recursively to every non-deterministic component. We have implemented deep sharing for the type of nondeterministic lists, but want to generalize this implementation to support arbitrary user-defined types with non-deterministic components. It turns out that the following interface to non-deterministic data is sufficient:
(⟨0, 0⟩ ⊕ ⟨0, 1⟩) ⊕ (⟨1, 0⟩ ⊕ ⟨1, 1⟩). However, our implementation yields (⟨0, 0⟩ ⊕′ ⟨1, 0⟩) ⊕′ (⟨0, 1⟩ ⊕′ ⟨1, 1⟩). In order to equate these two trees, we require the following distributive law between >>=′ and ⊕′: a >>=′ λx. (f x ⊕′ g x) = (a >>=′ f) ⊕′ (a >>=′ g)
class MonadPlus m => Nondet m a where mapNondet :: (forall b . Nondet m b => m b -> m (m b)) -> a -> m a
If the observation monad satisfies this law, then the two expressions above are equal (we write coin′ to denote ret′ 0 ⊕′ ret′ 1): (⟨0, 0⟩ ⊕′ ⟨0, 1⟩) ⊕′ (⟨1, 0⟩ ⊕′ ⟨1, 1⟩) = (coin′ >>=′ λy. ⟨0, y⟩) ⊕′ (coin′ >>=′ λy. ⟨1, y⟩)
A non-deterministic type a with non-deterministic components wrapped in the monad m can be made an instance of Nondet m by implementing the function mapNondet, which applies a monadic transformation to each non-deterministic component. The type of mapNondet is a rank-2 type: the first argument is a polymorphic function that can be applied to non-deterministic data of any type. We can make the type List m Int, of non-deterministic number lists, an instance of Nondet as follows.
= coin′ >>=′ λy. (⟨0, y⟩ ⊕′ ⟨1, y⟩) = (⟨0, 0⟩ ⊕′ ⟨1, 0⟩) ⊕′ (⟨0, 1⟩ ⊕′ ⟨1, 1⟩). Hence, the intuition behind distributivity is that the observation monad does not care about the order in which choices are made. This intuition captures the essence of implementing call-time choice: we can perform choices on demand and the results are as if we performed them eagerly. In general, it is fine to use our approach with an observation monad that does not match our requirements, as long as we are willing to abstract from the mismatch. For example, the list monad satisfies neither idempotence nor distributivity, yet our equational laws are useful in combination with the list monad if we abstract from the order and multiplicities of results. We also do not require that ⊕′ be associative or that ∅′ be a left or right unit of ⊕′.
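As a concrete illustration (a sketch not taken from the paper, using the ordinary list monad), both requirements fail literally for lists, but only up to the order and multiplicity of results:

import Control.Monad (mplus)

-- Idempotence fails literally: the result is [0,1,0,1], not [0,1].
idemCheck :: Bool
idemCheck = ([0,1] `mplus` [0,1]) == ([0,1] :: [Int])

-- Distributivity of >>= over mplus also fails literally
-- ([0,10,1,11] versus [0,1,10,11]), but the two sides agree up to order.
distribCheck :: Bool
distribCheck = lhs == rhs
  where xs  = [0,1] :: [Int]
        f x = [x]
        g x = [x + 10]
        lhs = xs >>= \x -> f x `mplus` g x
        rhs = (xs >>= f) `mplus` (xs >>= g)

Both checks evaluate to False, which is why observations with the list monad are only meaningful after abstracting from order and multiplicity.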
5.
Non-deterministic data
instance MonadPlus m => Nondet m Int where
  mapNondet _ c = return c

instance Nondet m a => Nondet m (List m a) where
  mapNondet _ Nil         = return Nil
  mapNondet f (Cons x xs) = do y  <- f x
                               ys <- f xs
                               return (Cons y ys)

eval :: Nondet m a => a -> m a
eval = mapNondet (\a -> a >>= eval >>= return . return)
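A user-defined type with non-deterministic components only has to say where those components live. The following is a hypothetical sketch (the type Pair is not part of the library):

-- A hypothetical user-defined type whose two components are non-deterministic.
data Pair m a b = Pair (m a) (m b)

instance (Nondet m a, Nondet m b) => Nondet m (Pair m a b) where
  mapNondet f (Pair x y) = do x' <- f x
                              y' <- f y
                              return (Pair x' y')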
We achieve the first goal by introducing a type class with the interface to process non-deterministic data. We achieve the second goal by defining a monad transformer Lazy that adds sharing to any instance of MonadPlus. After describing a straightforward implementation of this monad transformer, we show how to implement it differently in order to improve performance significantly. Both of these generalizations are motivated by practical applications in non-deterministic programming.
This operation generalizes the specific version for lists given in §4.4. In order to determine a value, we determine values for the arguments and combine the results. The bind operation of the monad nicely takes care of the combination. Our original motivation for abstracting over the interface of non-deterministic data was to define the operation share with a more general type. In order to generalize the type of share to allow not only different types of shared values but also different monad type constructors, we define another type class.
1. The ability to work with user-defined types makes it easier to compose deterministic and non-deterministic code and to draw on the sophisticated type and module systems of existing functional languages. 2. The ability to plug in different underlying monads makes it possible to express techniques such as breadth-first search (Spivey 2000), heuristics, constraint solving (Nordin and Tolmach 2001), and weighted results.
class MonadPlus m => Sharing m where share :: Nondet m a => m a -> m (m a) Non-determinism monads that support the operation share are instances of this class. We next define an instance of Sharing with the implementation of share for arbitrary non-deterministic types.
5.2
State monad transformer
This completes an implementation of our monad transformer for lazy non-determinism, with all of the functionality motivated in §§2–3.
The implementation of memoization in §4 uses a state monad to thread a list of thunks through non-deterministic computations. The straightforward generalization is to use a state monad transformer to thread thunks through computations in arbitrary monads. A state monad transformer adds the operations defined by the type class MonadState to an arbitrary base monad. The type for Thunks generalizes easily to an arbitrary monad:
5.3
data Thunk m a = Uneval (m a) | Eval a

Instead of using a list of thunks, we use a ThunkStore with the following interface. Note that the operations lookupThunk and insertThunk deal with thunks of arbitrary type.

emptyThunks :: ThunkStore
getFreshKey :: MonadState ThunkStore m => m Int
lookupThunk :: MonadState ThunkStore m => Int -> m (Thunk m a)
insertThunk :: MonadState ThunkStore m => Int -> Thunk m a -> m ()
Optimizing performance
We have applied some optimizations that improve the performance of our implementation significantly. We use the permutation sort in §2 for a rough measure of performance. The implementation just presented exhausts the search space for sorting a list of length 20 in about 5 minutes.6 The optimizations described below reduce the run time to 7.5 seconds. All implementations run permutation sort in constant space (5 MB or less) and the final implementation executes permutation sort on a list of length 20 roughly three times faster than the fastest available compiler for Curry, the Münster Curry Compiler (MCC). As detailed below, we achieve this competitive performance by
1. reducing the amount of pattern matching in invocations of the monadic bind operation, 2. reducing the number of store operations when storing shared results, and 3. manually inlining and optimizing library code.
There are different options to implement this interface. We have implemented thunk stores using the generic programming features provided by the Data.Typeable and Data.Dynamic modules but omit corresponding class contexts for the sake of clarity. Lazy monadic computations can now be performed in a monad that threads a ThunkStore. We obtain such a monad by applying the StateT monad transformer to an arbitrary instance of MonadPlus.
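One plausible realisation of this interface (a sketch under our own assumptions, not necessarily the authors' code) keeps dynamically typed thunks in an IntMap together with a counter for fresh keys; the Typeable constraints that the paper omits for clarity appear here explicitly:

import qualified Data.IntMap as IntMap
import Data.Dynamic (Dynamic, toDyn, fromDynamic)
import Data.Typeable (Typeable)
import Data.Maybe (fromJust)
import Control.Monad.State (MonadState, gets, modify)

-- A counter for fresh keys plus a map from keys to dynamically typed entries.
data ThunkStore = ThunkStore { nextKey :: Int, entries :: IntMap.IntMap Dynamic }

emptyThunks :: ThunkStore
emptyThunks = ThunkStore 0 IntMap.empty

getFreshKey :: MonadState ThunkStore m => m Int
getFreshKey = do key <- gets nextKey
                 modify (\s -> s { nextKey = key + 1 })
                 return key

insertThunk :: (MonadState ThunkStore m, Typeable a) => Int -> a -> m ()
insertThunk key t = modify (\s -> s { entries = IntMap.insert key (toDyn t) (entries s) })

lookupThunk :: (MonadState ThunkStore m, Typeable a) => Int -> m a
lookupThunk key = gets (fromJust . fromDynamic . (IntMap.! key) . entries)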
5.3.1
Less pattern matching
The Monad instance for the StateT monad transformer performs pattern matching in every call to >>= in order to thread the store through the computation. This is wasteful especially during computations that do not access the store because they do not perform explicit sharing. We can avoid this pattern matching by using a different instance of MonadState. We define the continuation monad transformer ContT:7
type Lazy m = StateT ThunkStore m For any instance m of MonadPlus, the type constructor Lazy m is an instance of Monad, MonadPlus, and MonadState ThunkStore. We only need to define the instance of Sharing ourselves, which implements the operation share.
newtype ContT m a = C { unC :: forall w . (a -> m w) -> m w } runContT :: Monad m => ContT m a -> m a runContT m = unC m return
instance MonadPlus m => Sharing (Lazy m) where share a = memo (a >>= mapNondet share)
We can make ContT m an instance of the type class Monad without using operations from the underlying monad m:
The implementation of share uses the operation memo to memoize the argument and the operation mapNondet to apply share recursively to the non-deterministic components of the given value. The function memo resembles the specific version given in §4.2 but has a more general type.
instance Monad (ContT m) where return x = C (\c -> c x) m >>= k = C (\c -> unC m (\x -> unC (k x) c)) An instance for MonadPlus can be easily defined using the corresponding operations of the underlying monad. The interesting exercise is to define an instance of MonadState using ContT. When using continuations, a reader monad—a monad where actions are functions that take an environment as input but do not yield one as output—can be used to pass state. More specifically, we need the following operations of reader monads:
memo :: MonadState ThunkStore m => m a -> m (m a)
memo a = do key <- getFreshKey
            insertThunk key (Uneval a)
            return (do thunk <- lookupThunk key
                       case thunk of
                         Eval x   -> return x
                         Uneval b -> do x <- b
                                        insertThunk key (Eval x)
                                        return x)

ask   :: MonadReader s m => m s
local :: MonadReader s m => (s -> s) -> m a -> m a

The function ask queries the current environment, and the function local executes a monadic action in a modified environment. In combination with ContT, the function local is enough to implement state updates:
The only difference in this implementation of memo from before is that it uses more efficient thunk stores instead of lists of thunks. In order to observe a lazy non-deterministic computation, we use the functions eval to compute fully determined values and evalStateT to execute actions in the transformed state monad.
instance Monad m => MonadState s (ContT (ReaderT s m)) where get = C (\c -> ask >>= c) put s = C (\c -> local (const s) (c ()))
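As a small check that this really behaves like state (a sketch assuming the ContT definitions and the instance above, with ReaderT and Identity from the standard libraries), a put followed by a get observes the updated value:

import Control.Monad.Reader (ReaderT, runReaderT)
import Data.Functor.Identity (Identity, runIdentity)

-- put installs a new environment for the rest of the continuation; get reads it.
demo :: Int
demo = runIdentity (runReaderT (runContT prog) (0 :: Int))
  where prog = do put 41
                  x <- get
                  return (x + 1)   -- demo == 42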
run :: Nondet (Lazy m) a => Lazy m a -> m a run a = evalStateT (a >>= eval) emptyThunks
6 We performed our experiments on an Apple MacBook with a 2.2 GHz Intel Core 2 Duo processor using GHC with optimizations (-O2).
7 This implementation differs from the definition shipped with GHC in that the result type w for continuations is higher-rank polymorphic.
This function is the generalization of the run function to arbitrary data types with non-deterministic components that are expressed in an arbitrary instance of MonadPlus.
With these definitions, we can define our monad transformer Lazy:
reverse function on long lists, which involves a lot of deterministic pattern-matching. In this benchmark, the monadic code is roughly 20% faster than the corresponding Curry code in MCC. The overhead compared to a non-monadic Haskell program is about the same order of magnitude. Our library does not directly support narrowing and unification of logic variables but can emulate them by means of lazy nondeterminism. We have measured the overhead of such emulation using a functional logic implementation of the last function:
type Lazy m = ContT (ReaderT ThunkStore m) We can reuse from §5.2 the definition of the Sharing instance and of the memo function used to define share. After this optimization, searching all sorted permutations of a list of length 20 takes about 2 minutes rather than 5. 5.3.2
Fewer state manipulations
The function memo just defined performs two state updates for each shared value that is demanded: one to insert the unevaluated shared computation and one to insert the evaluated result. We can save half of these manipulations by inserting only evaluated head-normal forms and using lexical scope to access unevaluated computations. We use a different interface to stores now, again abstracting away the details of how to implement this interface in a type-safe manner.
last l | l =:= xs ++ [x] = x where x,xs free This Curry function uses narrowing to bind xs to the spine of the init of l and unification to bind x and the elements of xs to the elements of l. We can translate it to Haskell by replacing x and xs with non-deterministic generators and implementing the unification operator =:= as equality check. When applying last to a list of determined values, the monadic Haskell code is about six times faster than the Curry version in MCC. The advantage of unification shows up when last is applied to a list of logic variables: in Curry, =:= can unify two logic variables deterministically, while an equality check on non-deterministic generators is non-deterministic and leads to search-space explosion. More efficient unification could be implemented using an underlying monad of equality constraints. All programs used for benchmarking are available under: http://github.com/sebfisch/explicit-sharing/tree/0.1.1
emptyStore  :: Store
getFreshKey :: MonadState Store m => m Int
lookupHNF   :: MonadState Store m => Int -> m (Maybe a)
insertHNF   :: MonadState Store m => Int -> a -> m ()

Based on this interface, we can define a variant of memo that only stores evaluated head normal forms.

memo :: MonadState Store m => m a -> m (m a)
memo a = do key <- getFreshKey
            return (do hnf <- lookupHNF key
                       case hnf of
                         Just x  -> return x
                         Nothing -> do x <- a
                                       insertHNF key x
                                       return x)
Figure 1. The Sequential ( ≫ ) and Parallel ( ∗∗∗ ) Composition Combinators
Figure 2. The Feedback Combinator (loop) 3.3.2 Switches
Note that Set is the “type of types” in Agda (similar to kind ∗ in Haskell)2 . Next we introduce signal vector descriptors. A signal vector descriptor is simply a (type level) list of signal descriptors:
Signal function networks are made dynamic through the use of switches. Basic switches have the following type: switch : ∀ {as bs } → {e : Set } → SF as (E e :: bs) → (e → SF as bs) → SF as bs
SVDesc : Set SVDesc = List SigDesc
dswitch : ∀ {as bs } → {e : Set } → SF as (E e :: bs) → (e → SF as bs) → SF as bs
For the purpose of stating the new conceptual definition of signal functions, and for use in semantic definitions later, we postulate a function (SVRep) that maps a signal vector descriptor to some suitable type for representing a sample of signal vectors of that description, and use this to define signal vectors:
(Agda allows the type of an implicit argument to be omitted when it is clear from the context. In the definitions above, both as and bs are clearly of type SVDesc as they are used as arguments to the type constructor SF .) The behaviour of a switch is to run the subordinate signal function (the first explicit argument), emitting all but the head (the event) of the output vector as the overall output. When there is an event occurrence in the event signal, the value of that signal is fed into the function (the second explicit argument) to generate a residual signal function. The entire switch is then removed from the network and replaced with this residual signal function. The difference between a switch and a dswitch (decoupled switch) is whether, at the moment of switching, the overall output is the output from the residual signal function (switch), or the output from the subordinate signal function (dswitch).3 A key point regarding switches is that the residual signal function does not start “running” until it is applied to the input signal at the moment of switching. Consequently, rather than having a single global Time, each signal function has its own local time.
SVRep : SVDesc → Set SigVec : SVDesc → Set SigVec as ≈ Time → SVRep as
However, we do not require the existence of such a function: an implementation may opt to not represent signal vectors explicitly at all. Finally, we refine the conceptual definition of signal functions: SF : SVDesc → SVDesc → Set SF as bs ≈ SigVec as → SigVec bs
3.3 Example Combinators and Primitives To demonstrate the new conceptual model, we define some common primitive signal functions and combinators from Yampa. These primitives either operate at the reactive level, or mediate between the functional and reactive levels.
Local Time. The time since this signal function was applied to its input signal. This will have been either when the entire system started, or when the sub-network containing the signal function in question was switched in.
3.3.1 Sequential and Parallel Composition
3.3.3 Loops
Signal functions can be composed sequentially (≫) or in parallel (∗∗∗) (see Figure 1):
The loop primitive provides the means for introducing feedback loops into signal function networks. A loop consists of two signal functions: a subordinate signal function (the first explicit argument) and a feedback signal function (the second explicit argument). The input of the feedback signal function is a suffix of the output of the subordinate signal function, and the output of the feedback signal function is a suffix of the input to the subordinate signal function:
_≫_  : {as bs cs : SVDesc } → SF as bs → SF bs cs → SF as cs
_∗∗∗_ : {as bs cs ds : SVDesc } → SF as cs → SF bs ds → SF (as ++ bs) (cs ++ ds)
(In Agda, _ is used to indicate the argument positions for infix and mixfix operators, while the curly braces are used to enclose implicit arguments: arguments that only have to be provided at an application site if they cannot be inferred from the context.) Note that ∗∗∗ composes two signal functions that take different inputs. For parallel composition where both signal functions take the same input, there is the &&& combinator:
loop : ∀ {as bs cs ds } → SF (as ++ cs) (bs ++ ds) → SF ds cs → SF as bs
Intuitively, we use the feedback signal function to connect some of the output signals of the subordinate signal function to some of its input signals, forming a feedback loop (see Figure 2).
_&&&_ : {as bs cs : SVDesc } → SF as bs → SF as cs → SF as (bs ++ cs)
3.3.4 Primitive Signal Functions We can lift pure functions to the reactive level using the primitives pure and pureE 4 . Such lifted signal functions are always stateless:
2 Strictly speaking, SigDesc should have type Set1 (the type of Set). However, for clarity, we use the Agda option that accepts Set as the type of Set. We have successfully implemented the type system without this option, but, because Agda does not support universe polymorphism, the result is very repetitive code and loss of conceptual clarity.
3 In Yampa, dswitch also decouples part of its input from part of its output, but we do not assume any such behaviour here. 4 It is possible to have one pure primitive that is overloaded to operate on either time domain, but we do not do so here for clarity.
pure
: {a b : Set } → (a → b) → SF [C a ] [C b ]
Its purpose is to monitor a real-valued continuous-time input signal and output the same signal until the input dips below 0. At this point, the output should be clamped to 0, and then remain at 0 from then on.
pureE : {a b : Set } → (a → b) → SF [E a ] [E b ]
Note that we are using [s ] as a synonym for (s :: [ ]). We can lift values to the reactive level using the primitive constant. This creates a signal function with a constant, continuous-time, output:
clamp : SF [C R] [C R]
clamp = switch ((pure (λ x → x < 0) ≫ edge) &&& pure id) (λ _ → constant 0)
constant : ∀ {as } → {b : Set } → b → SF as [C b ]
Events can only be generated and accessed by event processing primitives. Examples include
4. Decoupled Signal Functions As previously discussed, the loop combinator allows feedback to be introduced into a network. This is an essential capability, as feedback is widely used in reactive programming. However, feedback must not cause deadlock due to a signal function depending on its own output in an unproductive manner. To guarantee this, we conservatively prohibit instantaneous cycles in the network. This is a common design choice in reactive languages, but our way of enforcing it is different. We identify decoupled signal functions, essentially a class of signal functions that can be used safely in feedback loops, and index the type of a signal function by whether or not it is decoupled.
• edge, which produces an event whenever the boolean input signal changes from false to true;
• hold, which emits as a continuous-time signal the value carried by its most recent input event;
• never, which outputs an event signal containing no event occurrences;
• now, which immediately outputs one event, but never does so again.

edge : SF [C Bool ] [E Unit ]
Decoupled Signal Function. A signal function is decoupled if, at any given time, its output can depend upon its past inputs, but not its present and future inputs:
hold : {a : Set } → a → SF [E a ] [C a ] never : ∀ {as } → {b : Set } → SF as [E b ] now : ∀ {as } → SF as [E Unit ]
SFdec as bs = { sf : SF as bs | ∀ (t : Time) (sv₁ sv₂ : SigVec as) . (∀ t′ < t . sv₁ t′ ≡ sv₂ t′) ⇒ (sf sv₁ t ≡ sf sv₂ t) }
The primitive pre conceptually introduces an infinitesimal delay: pre : ∀ {a } → SF [C a ] [C a ]
To make this precise, the ideal semantics of pre is that it outputs whatever its input was immediately prior to the current time; that is, the left limit of the input signal at all points:
Decoupled Cycle. A cycle is decoupled if it passes through a decoupled signal function.
∀ (t : Time⁺) (s : Signal a) . pre s t = lim_{t′ → t⁻} s t′
Instantaneous Cycle (Algebraic Loop). A cycle is instantaneous if it does not pass through a decoupled signal function.
Here, Time denotes positive time. Consequently, at any given point, the output of pre does not depend upon its present input, which is the crucial property of pre: see Section 4. The primitive pre is usually implemented as a delay of one time step. Of course, this only approximates the ideal semantics. However, if the length of the time steps tends to zero, the semantics of such an implementation of pre converges to the ideal semantics. Note that pre is only defined for continuous-time signals. This is because the left limit at any point of a discrete-time signal (a signal defined only at countably many points in time) is undefined. In our setting, this amounts to an event signal without any occurrences; which is a signal equivalent to the output from never . Applying pre to an event signal would thus be pointless (use never instead), and any attempt to do so would likely be a mistake stemming from a misunderstanding of the semantics of pre. Disallowing pre on events thus eliminates a potential source of programming bugs. In contrast, Yampa, because discrete-time signals are realised as continuous-time signals carrying an option type (see Section 3.1), cannot rule out pre being applied to event signals, nor can it guarantee the proper semantics of such an application. Note also that pre is only defined for positive time. When the local time is zero (henceforth referred to as time 0 ), the output of pre is necessarily undefined as there are no prior points in time. Thus we need an initialise combinator that defines a signal function’s output at time 0 :
In Yampa, the onus is on the programmer to ensure that all cycles are correctly decoupled. An instantaneous cycle will not be detected statically, and the program could well loop at run-time. Many reactive languages deal with this problem by requiring a specific decoupling construct (a language primitive) to appear syntactically within the definition of any feedback loops. This works in a first order setting, but becomes very restrictive in a higher order setting as decoupled signal functions cannot be taken as parameters and used to decouple loops. Our solution is to encode decoupledness information in the types of signal functions. This allows us to statically ensure that a well-typed program does not contain any instantaneous cycles. Furthermore, the decoupledness of a signal function will be visible in its type signature, providing guidance to an FRP programmer. 4.1 Decoupledness Descriptors We introduce a data type of decoupledness descriptors: data Dec : Set where dec : Dec -- decoupled signal functions cau : Dec -- causal signal functions
We then index SF with a decoupledness descriptor: SF : SVDesc → SVDesc → Dec → Set
We can now enforce that the feedback signal function within a loop is decoupled:
initialise : ∀ {as b } → b → SF as [C b ] → SF as [C b ]
loop : ∀ {as bs cs ds } → {d : Dec } → SF (as ++ cs) (bs ++ ds) d → SF ds cs dec → SF as bs d
Initialisation is discussed further in Section 5. 3.4 Example
The primitive signal functions now need to be retyped to include appropriate decoupledness descriptors:
Let us illustrate the concepts and definitions that have been introduced thus far by constructing a simple signal function network.
pure : ∀ {a b } → (a → b) → SF [C a ] [C b ] cau
4.2.1 Recurring Switches
pureE : ∀ {a b } → (a → b) → SF [E a ] [E b ] cau
For this example we need to introduce an additional class of switching combinators: recurring switches (similar to every in Lucid Synchrone). The behaviour of a recurring switch is to apply its subordinate signal function to the tail of its input, producing the overall output. Whenever an event (the head of the input) occurs, the signal function carried by that event replaces the subordinate signal function. Recurring switches come in two varieties: like basic switches, they differ in whether the output at the instant of switching is from the new (rswitch) or old (drswitch) subordinate signal function.
constant   : ∀ {as b } → b → SF as [C b ] dec
edge       : SF [C Bool ] [E Unit ] cau
hold       : ∀ {a } → a → SF [E a ] [C a ] cau
never      : ∀ {as b } → SF as [E b ] dec
now        : ∀ {as } → SF as [E Unit ] dec
pre        : ∀ {a } → SF [C a ] [C a ] dec
initialise : ∀ {as b } → {d : Dec } → b → SF as [C b ] d → SF as [C b ] d
rswitch : ∀ {as bs d1 d2 } → SF as bs d1 → SF (E (SF as bs d2 ) :: as) bs cau
Notice that, from the definition of decoupled signal functions, it is evident that they are a subtype of causal signal functions.

second f = arr swap >>> first f >>> arr swap where swap (a, b) = (b, a)
2. An Introduction to Arrows Arrows [23] are a generalization of monads that relax the stringent linearity imposed by monads, while retaining a disciplined style of composition. Arrows have enjoyed a wide range of applications, often as a domain-specific embedded language (DSEL [19, 20]), including the many Yampa applications cited earlier, as well as parsers and printers [25], parallel computing [18], and so on. Arrows also have a theoretical foundation in category theory, where they are strongly related to (but not precisely the same as) Freyd categories [2, 37].
Parallel composition can be defined as a sequence of first and second: (***) :: (Arrow a) ⇒ a b c → a b’ c’ → a (b, b’) (c, c’) f *** g = first f >>> second g
A mere implementation of the arrow combinators, of course, does not make it an arrow – the implementation must additionally satisfy a set of arrow laws, which are shown in Figure 2.
2.1 Conventional Arrows
2.2 Looping Arrows
Like monads, arrows capture a certain class of abstract computations, and offer a way to structure programs. In Haskell this is achieved through the Arrow type class:
To model recursion, we can introduce a loop combinator [32]. The exponential example given in the introduction requires recursion, as do many applications in signal processing, for example. In Haskell this combinator is captured in the ArrowLoop class:
class Arrow a where arr :: (b → c) → a b c (>>>) :: a b c → a c d → a b d first :: a b c → a (b,d) (c,d)
class Arrow a ⇒ ArrowLoop a where loop :: a (b,d) (c,d) → a b c
A valid instance of this class should satisfy the additional laws shown in Figure 3. This class and its associated laws are related to the trace operator in [40, 17], which was generalized to arrows in [32]. We find that arrows are best viewed pictorially, especially for applications such as signal processing, where domain experts commonly draw signal flow diagrams. Figure 1 shows some of the basic combinators in this manner, including loop.
The combinator arr lifts a function from b to c to a “pure” arrow computation from b to c, namely a b c where a is the arrow type. The output of a pure arrow entirely depends on the input (it is analogous to return in the Monad class). >>> composes two arrow computations by connecting the output of the first to the input of the second (and is analogous to bind ((>>=)) in the Monad class). But in addition to composing arrows linearly, it is desirable to compose them in parallel – i.e. to allow “branching” and “merging” of inputs and outputs. There are several ways to do this, but by simply defining the first combinator in the Arrow class, all other combinators can be defined. first converts an arrow computation taking one input and one result, into an arrow computation taking two inputs and two results. The original arrow is applied to the first part of the input, and the result becomes the first part of the output. The second part of the input is fed directly to the second part of the output. Other combinators can be defined using these three primitives. For example, the dual of first can be defined as:
2.3 Arrow Syntax Recall the Yampa definition of the exponential function given earlier: exp = proc () → do rec let e = 1 + i i ← integral −≺ e returnA −≺ e
This program is written using arrow syntax, introduced by Paterson [32] and adopted by GHC (the predominant Haskell implementation) because it ameliorates the cumbersome nature of writing in
Figure 2. Conventional Arrow Laws

left identity    arr id ≫ f = f
right identity   f ≫ arr id = f
associativity    (f ≫ g) ≫ h = f ≫ (g ≫ h)
composition      arr (g · f) = arr f ≫ arr g
extension        first (arr f) = arr (f × id)
functor          first (f ≫ g) = first f ≫ first g
exchange         first f ≫ arr (id × g) = arr (id × g) ≫ first f
unit             first f ≫ arr fst = arr fst ≫ f
association      first (first f) ≫ arr assoc = arr assoc ≫ first f
                 where assoc ((a, b), c) = (a, (b, c))

Figure 3. Arrow Loop Laws

left tightening  loop (first h ≫ f) = h ≫ loop f
right tightening loop (f ≫ first h) = loop f ≫ h
sliding          loop (f ≫ arr (id × k)) = loop (arr (id × k) ≫ f)
vanishing        loop (loop f) = loop (arr assoc⁻¹ ≫ f ≫ arr assoc)
superposing      second (loop f) = loop (arr assoc ≫ second f ≫ arr assoc⁻¹)
extension        loop (arr f) = arr (trace f)
                 where trace f b = let (c, d) = f (b, d) in c

Figure 4. Causal Commutative Arrow Laws

commutativity    first f ≫ second g = second g ≫ first f
product          init i ∗∗∗ init j = init (i, j)
newtype SF a b = SF { unSF :: a → (b, SF a b) } instance Arrow SF where arr f = SF h where h x = (f x, SF h) first f = SF (h f) where h f (x, z) = let (y, f’) = unSF f x in ((y, z), SF (h f’)) f >>> g = SF (h f g) where h f g x = let (y, f’) = unSF f x (z, g’) = unSF g y in (z, SF (h f’ g’))
the point-free style demanded by arrows. The above program is equivalent to the following sugar-free program:

exp = fixA (integral >>> arr (+1))
  where fixA f = loop (second f >>> arr (λ (_, y) → (y, y)))
Although more cumbersome, we will use this program style in the remainder of the paper, in order to avoid having to explain the meaning of arrow syntax in more detail.
instance ArrowLoop SF where loop f = SF (h f) where h f x = let ((y, z), f’) = unSF f (x, z) in (y, SF (h f’))
3. Causal Commutative Arrows In this section we introduce two key extensions to conventional arrows, and demonstrate their use by implementing a stream transformer in Haskell. First, as mentioned in the introduction, the set of arrow and arrow loop laws is not strong enough to capture stream computations. In particular, the commutativity law shown in Figure 4 establishes a non-interference property for concurrent computations – effects are still allowed, but this law guarantees that concurrent effects cannot interfere with each other. We say that an arrow is commutative if it satisfies the conventional laws as well as this critical additional law. Yampa is in fact based on commutative arrows. Second, we note that Yampa has a primitive operator called iPre that is used to inject a delay into a computation; indeed it is the primary effect imposed by the Yampa arrow [35, 21]. Similar operators, often called delay, also appear in dataflow programming [43], stream processing [39, 41], and synchronous languages [4, 8]. In all cases, the operator introduces stateful computation into an otherwise stateless setting. In an effort to make this operation more abstract, we rename it init and capture it in the following type class:
instance ArrowInit SF where init i = SF (h i) where h i x = (i, SF (h x)) runSF :: SF a b → [a] → [b] runSF f = g f where g f (x:xs) = let (y, f’) = unSF f x in y : g f’ xs
Figure 5. Causal Stream Transformer
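As a quick check of these instances (a hypothetical example, not from the paper; init here is the ArrowInit method of Figure 5), an init arrow acts as a unit delay when run over a finite stream:

-- init behaves as a one-step delay: the initial value comes out first.
delay0 :: SF Int Int
delay0 = init 0

-- runSF delay0 [1,2,3,4]  ==  [0,1,2,3]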
of causal computations, namely that the current output depends only on the current as well as previous inputs. Besides causality, we make no other assumptions about the nature of these values: they may or may not vary with time, and the increment of change may be finite or infinitesimally small. More importantly, a valid instance of the ArrowInit class must satisfy the product law shown in Figure 4. This law states that two inits paired together are equivalent to one init of a pair. Here we use the *** operator instead of its expanded definition first . . . >>> second . . . to imply that the product law assumes commutativity. We will see in a later section that init and the product law are critical to our normalization and optimization strategies. But init
class ArrowLoop a ⇒ ArrowInit a where init :: b → a b b
Intuitively, the argument to init is the initial output; subsequent output is a copy of the input to the arrow. It captures the essence
Syntax

Variables         V        ::=  x | y | z | ...
Primitive Types   t        ::=  1 | Int | Bool | ...
Types             α, β, θ  ::=  t | α × β | α → β | α ; β
Expressions       E        ::=  V | (E1, E2) | fst E | snd E | λx : α.E | E1 E2 | () | ...
Environment       Γ        ::=  x1 : α1, . . . , xn : αn
Typing Rules

(UNIT)  Γ ⊢ () : 1
(VAR)   (x : α) ∈ Γ   ⟹   Γ ⊢ x : α
(APP)   Γ ⊢ E1 : α → β,   Γ ⊢ E2 : α   ⟹   Γ ⊢ E1 E2 : β
(PAIR)  Γ ⊢ E1 : α,   Γ ⊢ E2 : β   ⟹   Γ ⊢ (E1, E2) : α × β
(ABS)   Γ, x : α ⊢ E : β   ⟹   Γ ⊢ λx : α.E : α → β
(FST)   Γ ⊢ E : α × β   ⟹   Γ ⊢ fst E : α
(SND)   Γ ⊢ E : α × β   ⟹   Γ ⊢ snd E : β
Constants

arrα,β       : (α → β) → (α ; β)
≫α,β,θ       : (α ; β) → (β ; θ) → (α ; θ)
firstα,β,θ   : (α ; β) → (α × θ ; β × θ)
loopα,β,θ    : (α × θ ; β × θ) → (α ; β)
initα        : α → (α ; α)

Definitions

assoc      : (α × β) × θ → α × (β × θ)
assoc      = λz.(fst (fst z), (snd (fst z), snd z))
assoc⁻¹    : α × (β × θ) → (α × β) × θ
assoc⁻¹    = λz.((fst z, fst (snd z)), snd (snd z))
juggle     : (α × β) × θ → (α × θ) × β
juggle     = assoc⁻¹ · (id × swap) · assoc
transpose  : (α × β) × (θ × η) → (α × θ) × (β × η)
transpose  = assoc · (juggle × id) · assoc⁻¹
shuffle    : α × ((β × δ) × (θ × η)) → (α × (β × θ)) × (δ × η)
shuffle    = assoc⁻¹ · (id × transpose)
shuffle⁻¹  : (α × (β × θ)) × (δ × η) → α × ((β × δ) × (θ × η))
shuffle⁻¹  = (id × transpose) · assoc
id         : α → α
id         = λx.x
(·)        : (β → θ) → (α → β) → (α → θ)
(·)        = λf.λg.λx.f (g x)
(×)        : (α → β) → (θ → γ) → (α × θ → β × γ)
(×)        = λf.λg.λz.(f (fst z), g (snd z))
dup        : α → α × α
dup        = λx.(x, x)
swap       : α × β → β × α
swap       = λz.(snd z, fst z)
second     : (α ; β) → (θ × α ; θ × β)
second     = λf.arr swap ≫ first f ≫ arr swap
Figure 6. CCA: a language of Causal Commutative Arrows

is also important in allowing us to define operators that were previously taken as domain-specific primitives. In particular, consider the integral operator used in the exponentiation examples. With init, we can define integral using the Euler integration method and a fixed global step dt as follows:
As a demonstration, we can sample the exponential function at a fixed time interval by running the exp arrow over a uniform input stream inp:

dt :: Double
dt = 0.01

inp :: [()]
inp = () : inp
∗Main> runSF exp inp [1.0,1.01,1.0201,1.030301,1.04060401,1.0510100501, ...
integral :: ArrowInit a ⇒ a Double Double integral = loop (arr (λ (v, i) → i + dt ∗ v) >>> init 0 >>> arr (λi → (i, i)))
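As a quick sanity check (a sketch assuming the SF instance of Figure 5 and dt = 0.01 as in the sample run above), integrating a constant stream of ones produces the running Euler sum, delayed by one step:

ones :: [Double]
ones = replicate 5 1.0

-- runSF integral ones  ==  [0.0, 0.01, 0.02, 0.03, 0.04]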
We must stress that the SF type is but one instance of a causal commutative arrow, and alternative implementations such as the synchronous circuit type SeqMap in [32] and the stream function type (incidentally also called) SF in [24] also qualify as valid instances. The abstract properties such as normal forms that we develop in the next section are applicable to any of these instances, and thus are more broadly applicable than optimization techniques based on a specific semantic model, such as the one considered in [5].
To complete the picture, we give an instance (i.e. an implementation) of CCA that captures a causal stream transformer, as shown in Figure 5, where: • SF a b is an arrow representing functions (transformers) from
streams of type a to streams of type b. It is essentially a recursively defined data type consisting of a function with its continuation, a concept closely related to a form of finite state automaton called a Mealy Machine [14]. Yampa enjoys a similar implementation, and the same data type was called Auto in [32].
4. A Language of Causal Commutative Arrows
• SF is declared an instance of type classes Arrow, ArrowLoop
and ArrowInit. For example, exp can be instantiated as type exp :: SF () Double. These instances obey all of the arrow laws, including the two additional laws that we introduced.
To study the properties of CCA more rigorously, we first introduce a language of CCA terms in Figure 6, which is an extension of the simply-typed lambda calculus with a few primitive types, tuples, and arrows. Note that:
• runSF :: SF a b -> [a] -> [b] converts an SF arrow
• Although the syntax requires that we write type annotations for
into a stream transformer that maps an input stream of type [a] to an output stream of type [b].
variables in lambda abstraction, we often omit them and instead give the type of an entire expression.
Figure 8. Diagrams for exp: (a) original; (b) reorganized.

Figure 9. Diagram for loopB
5. Normalization of CCA In most implementations, programs written using arrows carry a runtime overhead, primarily due to the extra tupling forced onto functions’ arguments and return values. There have been several attempts [30, 24] to optimize arrow-based programs using arrow laws, but the results have not been entirely satisfactory. Although conventional arrow and arrow loop laws offer ways to combine pure arrows or collapse nested loops, they are not powerful enough to deal with effectful arrows, such as the init combinator.
Figure 7. Arrow Transformations: (a) reorder parallel pure and stateful; (b) reorder sequential pure and stateful; (c) change sequential composition to parallel; (d) move sequential composition into loop; (e) move parallel composition into loop; (f) fuse nested loops.
5.1 Intuition
• In previous examples we used the Haskell type Arrow a => a b c to represent an arrow type a mapping from type b to type c. However, CCA does not have type classes, and thus we write α ; β instead.
• Each arrow constant represents a family of constant arrow functions indexed by types. We'll omit the type subscripts when they are obvious from context.

Our new optimization strategy is based on the following rather striking observation: any CCA program can be transformed into a single loop containing one pure arrow and one initial state value. More precisely, any CCA program can be normalized into either the form arr f or:
loop(arr f ≫ second (second (init i))) where f is a pure function and i is an initial state. Note that all other arrow combinators, and therefore all of the overheads associated with them (tupling, etc.) are completely eliminated. Not surprisingly, the resulting improvement in performance is rather dramatic, as we will see later. We treat the loop combinator not just as a way to provide feedback from output to input, but also as a way to reorganize a complex composition of arrows. To see how this works, it is helpful to visualize a few examples, as shown in Figure 7, and explained below. This should help explain the intuition behind our normalization process, which is treated formally in the next section. The diagrams in Figure 7 can be explained as follows:
The figure also defines a set of commonly used auxiliary functions. Besides satisfying the usual beta law for lambda expressions, arrows in CCA also satisfy the nine conventional arrow laws (Figure 2), the six arrow loop laws (Figure 3), and the two causal commutative arrow laws (Figure 4). Due to the existence of immediate feedback in loops, CCA is able to make use of general recursion that is not allowed in the simply typed lambda calculus. To see why immediate feedback is necessary, we can look back at the fixA function used to define the combinator version of exp. We rewrite it using CCA syntax below: fixA : (α ; α) → (β ; α) fixA = λf.loop(second f ≫ arr (λx.(snd x, snd x)))
(a) Re-order parallel pure and stateful arrows. Figure 7(a) shows the exchange law for arrows, which is a special case of the commutativity law, and useful for re-ordering pure and stateful arrows.
It computes a fixed point of an arrow at the value level, and contains no init in its definition. We consider the ability to model general recursion a strength of our work that is often lacking in other stream or dataflow programming languages.
(b) Re-order sequential pure and stateful arrows. Figure 7(b) shows how the immediate feedback of the loop combinator
loop              loop f                 →  loopB () (arr assoc⁻¹ ≫ first f ≫ arr assoc)
init              init i                 →  loopB i (arr (swap · juggle · swap))
composition       arr f ≫ arr g          →  arr (g · f)
extension         first (arr f)          →  arr (f × id)
left tightening   h ≫ loopB i f          →  loopB i (first h ≫ f)
right tightening  loopB i f ≫ arr g      →  loopB i (f ≫ first (arr g))
vanishing         loopB i (loopB j f)    →  loopB (i, j) (arr shuffle ≫ f ≫ arr shuffle⁻¹)
superposing       first (loopB i f)      →  loopB i (arr juggle ≫ first f ≫ arr juggle)

Figure 10. One Step Reduction for CCA
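As a small illustration (our own, not an example from the paper), the rules of Figure 10 rewrite the composition init i ≫ arr g as follows:

  init i ≫ arr g
    →  loopB i (arr (swap · juggle · swap)) ≫ arr g                  (init)
    →  loopB i (arr (swap · juggle · swap) ≫ first (arr g))          (right tightening)
    →  loopB i (arr (swap · juggle · swap) ≫ arr (g × id))           (extension)
    →  loopB i (arr ((g × id) · swap · juggle · swap))               (composition)

which is already in the loopB form of the normal form described below.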
(NORM)   If e = arr f or e = loopB i (arr f) for some i and f, then e ⇓ e.
(SEQ)    If e1 ⇓ e1′, e2 ⇓ e2′, e1′ ≫ e2′ → e, and e ⇓ e′, then e1 ≫ e2 ⇓ e′.
(FIRST)  If f ⇓ f′, first f′ → e, and e ⇓ e′, then first f ⇓ e′.
(LOOP)   If f ⇓ f′, loop f′ → e, and e ⇓ e′, then loop f ⇓ e′.
(INIT)   If init i → e and e ⇓ e′, then init i ⇓ e′.
(LOOPB)  If f ⇓ f′, loopB i f′ → e, and e ⇓ e′, then loopB i f ⇓ e′.

Figure 11. Normalization Procedure for CCA
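For instance (again our own illustration), the procedure normalizes first (init i) by first normalizing the argument and then rewriting the result:

  init i           ⇓  loopB i (arr s)                                where s = swap · juggle · swap
  first (loopB i (arr s))
    →  loopB i (arr juggle ≫ first (arr s) ≫ arr juggle)             (superposing)
    →* loopB i (arr (juggle · (s × id) · juggle))                    (extension, composition)

so first (init i) ⇓ loopB i (arr (juggle · ((swap · juggle · swap) × id) · juggle)).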
(c) Change sequential composition to parallel. Figure 7(c) shows that, in addition to the sequential re-ordering, we can use the product law to fuse two stateful computations into one.
(d) Move sequential composition into loop. Figure 7(d) shows the left-tightening law for loops. Because the first arrow can also be a loop, we are able to combine sequential compositions of two loops into a nested one.
(e) Move parallel composition into loop. Figure 7(e) shows a variant of the superposing law for loops using first instead of second. Since parallel composition can be decomposed into first and second, and if each of them can be transformed into a loop, they will eventually be combined into a nested loop as shown.
(f) Fuse nested loops. Figure 7(f) shows an extension of the vanishing law for loops to handle stateful computations. Its proof requires the commutativity law and the product law to switch the position of two stateful arrows and join them together.
As a concrete example, Figure 8(a) is a diagram of the original exp example given earlier. In Figure 8(b) we have unfolded the definition of integral and applied the optimization strategy. The result is a single loop, where all pure functions can be combined together to minimize arrow implementation overheads.

5.2 Algorithm
In this section we give a formal definition of the normalization procedure. First we define a combinator called loopB that can be viewed as syntactic sugar for handling both immediate and delayed feedback:
  loopB : θ → (α × (γ × θ) ⇝ β × (γ × θ)) → (α ⇝ β)
  loopB = λi.λf.loop (f ≫ second (second (init i)))
A pictorial view of loopB is given in Figure 9. The second argument to loopB is an arrow mapping from an input of type α to an output of type β, while looping over a pair of type γ × θ. The value of type θ is initialized before looping back, and is often regarded as an internal state. The value of type γ is immediately fed back and is often used for general recursion at the value level.
We define a single-step reduction → as a set of rules in Figure 10, and a normalization procedure in Figure 11. The normalization relation ⇓ can be seen as a big-step reduction following an innermost strategy, and is indeed a function. Note that some of the reduction rules resemble the arrow laws of the same name. However, there are some subtle but important differences: first, unlike the laws, reduction is directed; second, the rules are extended to handle loopB instead of loop; finally, they are adjusted to avoid overlaps.

Theorem 5.1 (CCNF) For all e : α ⇝ β, there exists a normal form e_norm, called the Causal Commutative Normal Form, which is either of the form arr f or loopB i (arr f) for some i and f, such that e_norm : α ⇝ β and e ⇓ e_norm. In unsugared form, the second form is equivalent to:
  loop (arr f ≫ second (second (init i)))
Proof: Follows directly from Lemmas 5.1 and 5.2.  □
Note that we only consider closed terms with empty type environments in Theorem 5.1; otherwise we would have to include lambda normal forms as part of CCNF. For example, x : α ⇝ β ⊢ x : α ⇝ β would qualify as a valid CCNF, since x is of an arrow type and no further reduction is possible. Although this addition may be needed in real implementations, it would unnecessarily complicate the discussion, so we disallow open terms for simplicity.

Lemma 5.1 (Soundness) The reduction rules given in Figure 10 are both type and semantics preserving, i.e., if e → e′ then e = e′ is syntactically derivable from the set of CCA laws.
Proof: By equational reasoning using arrow laws. The loop and init rules follow from the definition of loopB; composition and
extension are directly based on the arrow laws with the same name; the left and right tightening and superposing rules follow from the definition of loopB, the commutativity law, and the arrow loop laws with the same name. The proof of the vanishing rule is more involved, and is given in Appendix B.  □
Note that the set of reduction rules is sound but not complete, because the loop combinator can introduce general recursion at the value level.

Lemma 5.2 (Termination) The normalization procedure for CCA given in Figure 11 terminates for all well-typed arrow expressions e : α ⇝ β.
Proof: By structural induction over all possible combinations of well-typed arrow terms. See Appendix A for details.  □

6. Further Optimization
We have implemented the normalization procedure of CCA in Haskell. In fact the normalization of an arrow term does not have to stop at CCNF, because pure functions in the language are terms of the simply typed lambda calculus, which is strongly normalizing. Extra care was taken to preserve sharing of lambda terms, to eliminate redundant variables, and so on. In the remainder of this section we describe a simple sequence of other optimizations that ultimately leads to a single imperative loop that can be implemented extremely efficiently.
Optimized Loop
In addition to loopB, for optimization purposes we introduce another looping combinator, loopD, for loops with only delayed feedback. For comparison, the Haskell definitions of both are given below:
  loopB :: ArrowInit a ⇒ e → a (b, (d, e)) (c, (d, e)) → a b c
  loopD :: ArrowInit a ⇒ e → a (b, e) (c, e) → a b c
  loopB i f = loop (f >>> second (second (init i)))
  loopD i f = loop (f >>> second (init i))
The reason to introduce loopD is that many applications of CCA result in an arrow in which all loops only have delayed feedback. For example, after removing redundant variables, normalizing lambdas, and eliminating common sub-expressions, the CCNF for exp is:
  exp' = loopB 0 (arr (λ(x, (z, y)) → let i = y + 1 in (i, (z, y + dt ∗ i))))
Clearly the variable z here is redundant, and it can be removed by changing loopB to loopD:
  exp'' = loopD 0 (arr (λ(x, y) → let i = y + 1 in (i, y + dt ∗ i)))
The above function corresponds nicely with the diagram shown in Figure 8(b). We call this result optimized CCNF.
Inlining Implementation
In fact loopD can be made even more efficient if we expose the underlying arrow implementation. For example, using the SF data type shown in Figure 5, loopD can be defined as:
  loopD i f = SF (g i f)
    where g i f x = let ((y, i'), f') = unSF f (x, i)
                    in (y, SF (g i' f'))
Also, if we examine the use of loopD in optimized CCNF, we notice that the arrow it takes is always a pure arrow, and hence we can drop the arrow and use the pure function instead. Furthermore, if our interest is just in computing from an input stream to an output, we can drop the intermediate SF data structure altogether, thus yielding:
  runCCNF :: e → ((b, e) → (c, e)) → [b] → [c]
  runCCNF i f = g i
    where g i (x:xs) = let (y, i') = f (x, i) in y : g i' xs
runCCNF essentially converts an optimized CCNF term directly into a stream transformer. In doing so, we have successfully transformed away all arrow instances, including the data structure used to implement them! The result is of course no longer abstract, and is closely tied to the low-level representation of streams.
Combining CCA With Stream Fusion
We can perform even more aggressive optimizations on CCNF by borrowing the stream representation and optimization techniques introduced by Coutts et al. [12]. First, we define a datatype to encapsulate a stream as a product of a stepper function and an initial state:
  data Stream a = forall s. Stream (s → Step a s) s
  data Step a s = Yield a s
Here a is the element type and s is an existentially quantified state type. For our purposes, we have simplified the return type of the original stepper function in [12]. Our stepper function essentially consumes a state and yields an element in the stream paired with a new state. The key to effective fusion is that all stream producers must be non-recursive. In other words, a recursively defined stream such as exp should be written in terms of non-recursive stepper functions, with recursion deferred until the stream is unfolded. Programs written in this style can then be fused by the compiler into a tail-recursive loop, at which point tail-call eliminations and various unboxing optimizations can be easily applied.
This is where CCA and our normalization procedure fit together so nicely. We can take advantage of the arrow syntax to write recursive code, and rely on the arrow translator to express it non-recursively using the loop combinator. We then normalize it into CCNF, and rewrite it in terms of streams. The last step is surprisingly straightforward. We introduce yet another loop combinator loopS that closely resembles loopD:
  loopS :: t → ((a, t) → (b, t)) → Stream a → Stream b
  loopS z f (Stream next0 s0) = Stream next (z, s0)
    where next (i, s) = case next0 s of
            Yield x s' → Yield y (z', s')
              where (y, z') = f (x, i)
Intuitively, loopS is the unlifted version of loopD. The initial state of the output stream consists of the initial feedback value z and the state of the input stream. As the resulting stream gets unfolded, it supplies f with an input tuple and carries the output along with the next state of the input stream. In general, we can rewrite terms of the form loopD i (arr f) into loopS i f for some i and f. To illustrate this, let us revisit the exp example. We take the optimized CCNF exp'' and rewrite it in terms of loopS as expOpt:
  expOpt :: Stream Double
  expOpt = loopS 0 (λ(x, y) → let i = y + 1 in (i, y + dt ∗ i)) (constS ())
  constS :: a → Stream a
  constS c = Stream next () where next _ = Yield c ()
Since the resulting stream producer ignores any input, we define constS to supply a stream of unit values. This does not negatively impact performance, as the compiler is able to remove the dummy values eventually.
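For concreteness, here is a small usage sketch of our own (not from the paper) that runs the optimized CCNF of exp through runCCNF. The value of dt is an assumption on our part; 0.01 is chosen so that the third sample matches the value of e2 reported below.

  dt :: Double
  dt = 0.01

  -- the pure step function of exp'' above, with the arr wrapper dropped
  expStep :: ((), Double) -> (Double, Double)
  expStep (_, y) = let i = y + 1 in (i, y + dt * i)

  expSamples :: [Double]
  expSamples = runCCNF 0 expStep (repeat ())
  -- take 3 expSamples == [1.0, 1.01, 1.0201]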
To extract elements from a stream, we can write a tail-recursive function to unfold it. For example, the function nth extracts the nth element from a stream:
  nth :: Int → Stream a → a
  nth n (Stream next0 s0) = go n s0
    where go n s = case next0 s of
            Yield x s' → if n == 0 then x else go (n-1) s'

  e2 :: Double
  e2 = nth 2 expOpt    -- 1.0201
We can define unfolding functions other than nth in a similar manner. With the necessary optimization options turned on, GHC fuses nth and expOpt into a tail-recursive loop. The code below shows the equivalent intermediate representation extracted from GHC after optimization. It uses only strict and unboxed types (Int# and Double#).
  go :: Int# → Double# → Double#
  go n y = case n of
    __DEFAULT → go (n - 1) (y + dt ∗ (y + 1.0))
    0         → y + 1.0

  e2 :: Double
  e2 = D# (go 2 0.0)
In summary, employing stream fusion, the GHC compiler can turn any CCNF into a tight imperative loop that is free of all cons cell and closure allocations. This results in a dramatic speedup for CCA programs and eliminates the need for heap allocation and garbage collection. In the next section we quantify this claim via benchmarks.

7. Benchmarks
We ran a set of benchmarks to measure the performance of several programs written in arrow syntax, but compiled and optimized in different ways. For each program, we:
1. Compiled with GHC, which has a built-in translator for arrow syntax.
2. Translated using Paterson's arrowp pre-processor to arrow combinators, and then compiled with GHC.
3. Normalized into CCNF combinators, and compiled with GHC.
4. Normalized into CCNF combinators, rewritten in terms of streams, and compiled with GHC using stream fusion.
The five benchmarks we used are: the exponential function given earlier, a sine wave with fixed frequency using Goertzel's method, a sine wave with variable frequency, a "50's sci-fi" sound synthesis program taken from [15], and a robot simulator taken from [21]. The programs were compiled and run on an Intel Core 2 Duo machine with GHC version 6.10.1, using the C back-end code generator and -O2 optimization. We measured the CPU time used to run a program through 10^6 samples. The results are shown in Figure 12, where the numbers represent normalized speedup ratios; we also include the lines of code (LOC) for each source program.

  Name (LOC)        1. GHC   2. arrowp   3. CCNF   4. Stream
  exp (4)            1.0      2.4         13.9      190.9
  sine (6)           1.0      2.66        12.0      284.0
  oscSine (4)        1.0      1.75         4.1       13.0
  50's sci-fi (5)    1.0      1.28        10.2       19.2
  robotSim (8)       1.0      1.48         8.9       36.8

  Figure 12. Performance Ratio (greater is better)

The results show dramatic performance improvements using normalized arrows. We note that:
1. Based on the same arrow implementation, the performance gain of CCNF over the first two approaches is entirely due to program transformations at the source level. This means that the runtime overhead of arrows is significant, and cannot be neglected for real applications.
2. The stream representation of CCNF produces high-performance code that is completely free of dynamic memory allocation and intermediate data structures, and can be orders of magnitude faster than its arrow-based predecessors.
3. GHC's arrow syntax translator does not do as well as Paterson's original translator for the sample programs we chose, though both are significantly outperformed by our normalization techniques.

8. Discussion
Our key contribution is the discovery of a normal form for core Yampa, or CCA, programs: any CCA program can be transformed into a single loop with just one pure (and strongly normalizing) function and a set of initial states. This discovery is new and original, and has practical implications for implementing not just Yampa but a broader class of synchronous dataflow languages and stream computations, because this property is based entirely on axiomatic laws rather than on any particular semantic model. We discuss this relevance and related topics below.

8.1 Alternative Formalisms
Apart from arrows, other formalisms such as monads, comonads and applicative functors have been used to model computations over data streams [3, 42, 28]. Central to many of these approaches is the representation of streams and of computations over them. Notably missing, however, are the connections between stream computation and the related laws. For example, Uustalu and Vene's work [42] concluded that comonads are a suitable model for dataflow computation, but it lacks any discussion of whether the comonadic laws are of any relevance. In contrast, it is the very idea of making sense of the arrow and arrow loop laws that motivated our work. We argue that arrows are a suitable abstract model for stream computation not only because we can implement stream functions as arrows, but also because abstract properties like the arrow laws help to bring more insight to our target application domain.
Besides having to satisfy the respective laws for these formalisms, each abstraction has to introduce domain-specific operators, otherwise it would be too general to be useful. With respect to causal streams, many have introduced init (also known as delay) as a primitive to enable stateful computation, but few seem to have made the connection between its properties and program optimizations. Notably, the product law we introduced for CCA relates to a bisimilarity property of co-algebraic streams, i.e., the product of two initialized streams is bisimilar to one initialized stream of products.

8.2 Co-algebraic streams
The co-algebraic property of streams is well known, and most relevant to our work is Caspi and Pouzet's representation of streams and stream functions in a functional language setting [5], which also uses a primitive similar to the trace operator (and hence the arrow loop combinator) to model recursion. Their compilation technique, however, lacks a systematic approach to optimizing nested recursions. We consider our technique more effective and more abstract.
Most synchronous languages, including the one introduced in [5], are able to compile stream programs into a form called single loop code by performing a causality analysis to break the feedback loop of recursively defined values. Many efforts have been made to generate efficient single loop code [16, 1], but to the best of our knowledge there has not been a strong result such as a normal form. Our discovery of CCNF is original, and the optimization-by-normalization approach is both systematic and deterministic. Together with stream fusion, we produce a result that is not just a single loop, but a highly optimized one. Also relevant is Rutten's work on higher-order functional stream derivatives [38]. We believe that arrows are a more general abstraction than functional stream derivatives, because the latter still exposes the structure of a stream. Moreover, arrows give rise to a high-level language with richer algebraic properties than the 2-adic calculus considered in [38].
8.3 Expressiveness
It is known that operationally a Mealy machine is able to represent all causal stream functions [38], while the CCA language defined in Figure 6 represents only a subset. For example, the switch combinator introduced in Yampa [21] is able to dynamically replace a running arrow with a new one depending on an input event, and hence to switch the system behavior completely. With CCA, there is no way to change the compositional structure of the arrow program itself at run time. For another example, many dataflow and stream programming languages also provide conditionals, such as if-then-else, as part of the language [43, 4]. To enable conditionals at the arrow level, we would need to further extend CCA to be an instance of the ArrowChoice class. Both are worthy extensions to consider for future work. It should also be noted that the local state introduced by init is one of the most modest side effects one can introduce to arrow programs. The commutativity law for CCA ensures that the effect of one arrow cannot interfere with another when composed together, and it is no longer satisfiable when such ordering becomes important, e.g., when arrows are used to model parsers and printers [25]. On the other hand, because the language for CCA remains highly abstract, it could be applicable to domains other than FRP or dataflow. We leave such explorations to future work.

8.4 Stream fusion
Stream fusion can help fuse zips, left folds, and nested lists into efficient loops. But on its own, it does not optimize recursively and lazily defined streams effectively. Consider a stream generating the Fibonacci sequence. It is one of the simplest classic examples that characterize stateful stream computation. One way of writing it in Haskell is to exploit laziness and zip the stream with itself:
  fibs :: [Int]
  fibs = 0:1:zipWith (+) fibs (tail fibs)
While the code is concise and elegant, such a programming style relies heavily on an inductively defined structure. The explicit sharing of the stream fibs in the definition is both a blessing and a curse. On one hand, it runs in linear time and constant space. On the other hand, the presence of the stream structure gets in the way of optimization. None of the current fusion or deforestation techniques is able to effectively eliminate the cons cell allocations in this example. Real-world stream programs are usually much more complex and involve more feedback, and the time spent allocating intermediate structure and in the garbage collector can degrade performance significantly. We can certainly write a stream in stepper style that generates the Fibonacci sequence:
  fib_stream :: Stream Int
  fib_stream = Stream next (0, 1)
    where next (a, b) = Yield r (b, r)
            where r = a + b

  f1 :: Int
  f1 = nth 5 fib_stream    -- 13
Stream fusion will fuse nth and fib_stream to produce an efficient loop. For comparison, with our technique the arrow version of the Fibonacci sequence shown below compiles to the same efficient loop as f1 above, and yet retains the benefit of being abstract and concise.
  fibA = proc _ → do
    rec let r = d2 + d1
        d1 ← init 0 −≺ d2
        d2 ← init 1 −≺ r
    returnA −≺ r
We must stress that writing stepper functions is not always as easy as in trivial examples like fib and exp. Most non-trivial stream programs that we are concerned with contain many recursive parts, and expressing them in terms of combinators in a non-recursive way can get unwieldy. Moreover, this kind of coding style exposes a lot of operational detail that is arguably unnecessary for representing the underlying algorithm. In contrast, arrow syntax relieves the burden of coding in combinator form and allows recursion via the rec keyword. It also completely hides the actual implementation of the underlying stream structure and is therefore more abstract. The strength of CCA is the ability to normalize any causal and recursive stream function. Combining both fusion and our normalization algorithm, any CCA program can be reliably and predictably optimized into an efficient, machine-friendly loop. The process can be fully automated, allowing programmers to program at an abstract level while getting performance competitive with programs written in low-level imperative languages.

Acknowledgements
This research was supported in part by NSF grants CCF-0811665 and CNS-0720682, and a grant from Microsoft Research.

References
[1] Pascalin Amagbegnon, Loïc Besnard, and Paul Le Guernic. Implementation of the data-flow synchronous language Signal. In Conference on Programming Language Design and Implementation, pages 163–173. ACM Press, 1995.
[2] Robert Atkey. What is a categorical model of arrows? In Mathematically Structured Functional Programming, 2008.
[3] Per Bjesse, Koen Claessen, Mary Sheeran, and Satnam Singh. Lava: Hardware design in Haskell. In ICFP, pages 174–184, 1998.
[4] P. Caspi, N. Halbwachs, D. Pilaud, and J. A. Plaice. Lustre: A declarative language for programming synchronous systems. In 14th ACM Symp. on Principles of Programming Languages, January 1987.
[5] Paul Caspi and Marc Pouzet. A Co-iterative Characterization of Synchronous Stream Functions. In Coalgebraic Methods in Computer Science (CMCS’98), Electronic Notes in Theoretical Computer Science, March 1998. Extended version available as a VERIMAG tech. report no. 97–07 at www.lri.fr/∼pouzet. [6] Eric Cheng and Paul Hudak. Look ma, no arrows – a functional reactive real-time sound synthesis framework. Technical Report YALEU/DCS/RR-1405, Yale University, May 2008. [7] Mun Hon Cheong. Functional programming and 3d games, November 2005. also see www.haskell.org/haskellwiki/Frag. [8] Jean-Louis Colac¸o, Alain Girault, Gr´egoire Hamon, and Marc Pouzet. Towards a higher-order synchronous data-flow language. In
EMSOFT '04: Proceedings of the 4th ACM International Conference on Embedded Software, pages 230–239, New York, NY, USA, 2004. ACM.
[28] Conor McBride and Ross Paterson. Applicative programming with effects. J. Funct. Program., 18(1):1–13, 2008.
[29] Eugenio Moggi. Notions of computation and monads. Inf. Comput., 93(1):55–92, 1991.
[9] Antony Courtney. Modelling User Interfaces in a Functional Language. PhD thesis, Department of Computer Science, Yale University, May 2004.
[30] Henrik Nilsson. Dynamic optimization for functional reactive programming using generalized algebraic data types. In ICFP, pages 54–65, 2005.
[10] Antony Courtney and Conal Elliott. Genuinely functional user interfaces. In Proc. of the 2001 Haskell Workshop, September 2001.
[31] Clemens Oertel. RatTracker: A Functional-Reactive Approach to Flexible Control of Behavioural Conditioning Experiments. PhD thesis, Wilhelm-Schickard-Institute for Computer Science at the University of T¨ubingen, May 2006.
[11] Antony Courtney, Henrik Nilsson, and John Peterson. The Yampa arcade. In Proceedings of the 2003 ACM SIGPLAN Haskell Workshop (Haskell’03), pages 7–18, Uppsala, Sweden, August 2003. ACM Press.
[32] Ross Paterson. A new notation for arrows. In ICFP’01: International Conference on Functional Programming, pages 229–240, Firenze, Italy, 2001.
[12] Duncan Coutts, Roman Leshchinskiy, and Don Stewart. Stream fusion: From lists to streams to nothing at all. In Proceedings of the ACM SIGPLAN International Conference on Functional Programming, ICFP 2007, April 2007.
[33] John Peterson, Gregory Hager, and Paul Hudak. A language for declarative robotic programming. In International Conference on Robotics and Automation, 1999.
[13] Conal Elliott and Paul Hudak. Functional reactive animation. In International Conference on Functional Programming, pages 263– 273, June 1997.
[34] John Peterson, Paul Hudak, and Conal Elliott. Lambda in motion: Controlling robots with Haskell. In First International Workshop on Practical Aspects of Declarative Languages. SIGPLAN, Jan 1999.
[14] G. H. Mealy. A method for synthesizing sequential circuits. Bell System Technical Journal, 34(5):1045–1079, 1955.
[35] John Peterson, Zhanyong Wan, Paul Hudak, and Henrik Nilsson. Yale FRP User’s Manual. Department of Computer Science, Yale University, January 2001. Available at http://www.haskell.org/ frp/manual.html.
[15] George Giorgidze and Henrik Nilsson. Switched-on yampa. In Paul Hudak and David Scott Warren, editors, Practical Aspects of Declarative Languages, 10th International Symposium, PADL 2008, San Francisco, CA, USA, January 7-8, 2008, volume 4902 of Lecture Notes in Computer Science, pages 282–298. Springer, 2008.
[36] Simon Peyton Jones et al. The Haskell 98 language and libraries: The revised report. Journal of Functional Programming, 13(1):0–255, Jan 2003. http://www.haskell.org/definition/.
[16] N. Halbwachs, P. Raymond, and C. Ratel. Generating efficient code from data-flow programs. In J. Maluszyński and M. Wirsing, editors, Proceedings of the Third International Symposium on Programming Language Implementation and Logic Programming, number 528, pages 207–218. Springer Verlag, 1991.
[37] John Power and Hayo Thielecke. Closed freyd- and kappa-categories. In ICALP, pages 625–634, 1999. [38] Jan J. M. M. Rutten. Algebraic specification and coalgebraic synthesis of mealy automata. Electr. Notes Theor. Comput. Sci, 160:305–319, 2006.
[17] Masahito Hasegawa. Recursion from cyclic sharing: traced monoidal categories and models of cyclic lambda calculi. pages 196–213. Springer Verlag, 1997.
[39] Robert Stephens. A survey of stream processing. Acta Informatica, 34(7):491–541, 1997.
[18] L. Huang, P. Hudak, and J. Peterson. Hporter: Using arrows to compose parallel processes. In Proc. Practical Aspects of Declarative Languages, pages 275–289. Springer Verlag LNCS 4354, January 2007.
[40] Ross Howard Street, A. Joyal, and D. Verity. Traced monoidal categories. Mathematical Proceedings of the Cambridge Philosophical Society, 119(3):425–446, 1996. [41] William Thies, Michal Karczmarek, and Saman P. Amarasinghe. Streamit: A language for streaming applications. In CC ’02: Proceedings of the 11th International Conference on Compiler Construction, pages 179–196, London, UK, 2002. Springer-Verlag.
[19] P. Hudak. Building domain specific embedded languages. ACM Computing Surveys, 28A:(electronic), December 1996. [20] Paul Hudak. Modular domain specific languages and tools. In Proceedings of Fifth International Conference on Software Reuse, pages 134–142. IEEE Computer Society, June 1998.
[42] Tarmo Uustalu and Varmo Vene. The essence of dataflow programming. In Zoltán Horváth, editor, CEFP, volume 4164 of Lecture Notes in Computer Science, pages 135–167. Springer, 2005.
[21] Paul Hudak, Antony Courtney, Henrik Nilsson, and John Peterson. Arrows, robots, and functional reactive programming. In Summer School on Advanced Functional Programming 2002, Oxford University, volume 2638 of Lecture Notes in Computer Science, pages 159–187. Springer-Verlag, 2003.
[43] William W. Wadge and Edward A. Ashcroft. LUCID, the dataflow programming language. Academic Press Professional, Inc., San Diego, CA, USA, 1985.
[22] Paul Hudak, Paul Liu, Michael Stern, and Ashish Agarwal. Yampa meets the worm. Technical Report YALEU/DCS/RR-1408, Yale University, July 2008.
A. Proof for the termination lemma
Proof: We will show that there always exists an e_norm for every well-formed arrow expression e : α ⇝ β, and that the normalization procedure always terminates. This is done by structural induction over all possible arrow terms; any closed expression e that is not already an arrow term is first beta-reduced.
[23] John Hughes. Generalising monads to arrows. Science of Computer Programming, 37:67–111, May 2000. [24] John Hughes. Programming with arrows. In Advanced Functional Programming, pages 73–129, 2004. [25] Patrik Jansson and Johan Jeuring. Polytypic compact printing and parsing. In ESOP, pages 273–287, 1999.
1. e = arr f It already satisfies the termination condition.
[26] Sam Lindley, Philip Wadler, and Jeremy Yallop. The arrow calculus (functional pearl). Draft, 2008.
2. e = first f By induction hypothesis, f ⇓ arr f , or f ⇓ loopB i (arr f ), where f and f are pure functions. In the first case by extension rule first f → arr (f × id) and terminates; In the second case
[27] Hai Liu and Paul Hudak. Plugging a space leak with an arrow. Electronic Notes in Theoretical Computer Science, 193:29–45, nov 2007.
juggle · f · juggle · shuffle))
first f ∗ first (loopB i(arr f )) → superposing →loopB i (arr juggle ≫ arr f ≫ arr juggle) composition →loopB i (arr (juggle · f juggle))
4. e = loop f By induction hypothesis, f ⇓ arr f or f ⇓ loopB i (arr f ). In the first case loop f ∗ → loop (arr f ) loop →loopB ()(arr assoc −1 ≫ arr f ≫ arr assoc) composition →loopB ()(arr (assoc · f · assoc −1 ))
and terminates. 3. e = f ≫ g By induction hypothesis, f ⇓ arr f or f ⇓ loopB i (arr f ), and g ⇓ arr g or g ⇓ loopB i (arr g ). So there are 4 combinations, and in all cases they terminate.
and terminates. In the second case loop f ∗ → loop (loopB i (arr f )) loop →loopB ()(arr assoc −1 ≫ loopB i (arr f ) ≫ arr assoc) left and right tightening ∗ loopB ()(loopB i (first (arr assoc −1 ) ≫ arr f ≫ → first (arr assoc))) extension and composition ∗ loopB ()(loopB i (arr ((assoc × id)· → f · (assoc −1 × id )))) vanishing →loopB ((), i)(arr shuffle ≫ arr ((assoc × id )· f · (assoc −1 × id )) ≫ arr shuffle −1 ) composition →loopB ((), i)(arr (shuffle −1 · (assoc × id ) · f · (assoc −1 × id) · shuffle))
1)
f≫g ∗ → arr f ≫ arr g composition →arr (g · f ) 2)
f≫g ∗ arr f ≫ loopB i (arr g ) → left tightening →loopB i (first (arr f ) ≫ arr g ) extension →loopB i (arr (f × id) ≫ arr g ) composition →loopB i (arr (g · (f × id ))) 3)
f≫g ∗ loopB i (arr f ) ≫ arr g → right tightening →loopB i (arr f ≫ first (arr g )) extension →loopB i (arr f ≫ arr (g × id)) composition →loopB i (arr ((g × id) · f ))
and terminates. 5. e = init i By init rule, init i → loopB i (arr (swap · juggle · swap)) and terminates. 2
B. Proof for the vanishing rule of loopB
4)
Proof: We will show that
f≫g →loopB i (arr f ) ≫ loopB i (arr g ) left tightening →loopB j (first (loopB i (arr f )) ≫ arr g ) superposing →loopB j (loopB i (arr juggle ≫ arr f ≫ arr juggle) ≫ arr g ) composition ∗ → loopB j (loopB i (arr (juggle · f · juggle)) ≫ arr g ) right tightening →loopB j (loopB i (arr (juggle · f · juggle) ≫ first (arr g ))) extension →loopB j (loopB i (arr (juggle · f · juggle) ≫ arr (g × id))) composition →loopB j (loopB i (arr ((g × id) · juggle ·f · juggle))) vanishing →loopB (j, i) (arr shuffle ≫ arr ((g × id ) · juggle · f · juggle) ≫ arr shuffle −1 ) composition ∗ loopB (j, i) (arr (shuffle −1 · (g × id)· → ∗
loopB i (loopB j f ) = loopB (i, j) (arr shuffle ≫ f ≫ arr shuffle −1 ) by equational reasoning.
= = = = =
=
loopB i (loopB j f ) definition of loopB loop (loopB j f ≫ second (second (init i))) definition of loopB loop (loop (f ≫ second (second (init j))) ≫ second (second (init i))) right tightening of loop loop (loop (f ≫ second (second (init j)) ≫ f irst(second (second (init i))))) commutativity loop (loop (f ≫ f irst(second (second (init i))) ≫ second (second (init j)))) vanishing of loop loop (arr assoc −1 ≫ f ≫ first (second (second (init i))) ≫ second (second (init j)) ≫ arr assoc) Lemma B.1 loop (arr assoc −1 ≫ f ≫ arr shuffle −1 ≫ second (second (init (i, j))) ≫ arr shuffle ≫ arr assoc)
shuffle−1 · shuffle = id = loop (arr (shuffle −1 · assoc −1 ) ≫ arr shuffle ≫ f ≫ arr shuffle −1 ≫ second (second (init (i, j))) ≫ arr shuffle ≫ arr assoc) shuffle−1 · assoc−1 = id × transpose = loop (arr (id × transpose ) ≫ arr shuffle ≫ f ≫ arr shuffle −1 ≫ second (second (init (i, j))) ≫ arr shuffle ≫ arr assoc) sliding = loop (arr shuffle ≫ f ≫ arr shuffle −1 ≫ second (second (init (i, j))) ≫ arr shuffle ≫ arr assoc ≫ arr (id × transpose )) shuffle−1 = (id × transpose) · assoc = loop (arr shuffle ≫ f ≫ arr shuffle −1 ≫ second (second (init (i, j))) ≫ arr shuffle ≫ arr shuffle −1 ) shuffle · shuffle−1 = id = loop (arr shuffle ≫ f ≫ arr shuffle −1 ≫ second (second (init (i, j)))) definition of loopB = loopB (i, j)(arr shuffle ≫ f ≫ arr shuffle −1 )
arr ((swap × id ) · assoc −1 · (swap × id ) · assoc −1 ) composition and normalization = arr (λ((a, (c, d)), (b, e)).(d, (e, ((c, b), a)))) ≫ first (init i) ≫ arr (λ(d, (e, ((c, b), a))).((a, (c, d)), (b, e))) and from lhs:
=
=
=
=
Lemma B.1 first (second (second (init i))) ≫ second (second (init j)) = arr shuffle −1 ≫ second (second (init(i, j))) ≫ arr shuffle
=
Proof: We first show =
Hence lhs = rhs. Using similar technique, we can also prove (details omitted to save space)
first (second (second (init i))) arr shuffle −1 ≫ second (second (first (init i))) ≫ arr shuffle
second (second (init j)) = arr shuffle −1 ≫ second (second (second (init j))) ≫ arr shuffle
This can be done by equational reasoning from both sides. From lhs:
= =
=
=
=
=
=
arr shuffle −1 ≫ second (second (first (init i))) ≫ arr shuffle definition of second arr shuffle −1 ≫ arr swap ≫ first (arr swap ≫ first (first (init i)) ≫ arr swap) ≫ arr swap ≫ arr shuffle functor and extension arr (swap · shuffle −1 ) ≫ arr (swap × id ) ≫ first (first (first (init i))) ≫ arr (swap × id) ≫ arr (shuffle · swap) association arr ((swap × id ) · swap · shuffle −1 ) ≫ arr assoc ≫ arr assoc ≫ first (init i) ≫ arr assoc −1 ≫ arr assoc −1 ≫ arr (shuffle · swap · (swap × id)) composition arr (assoc · assoc · (swap × id) · swap · shuffle −1 ) ≫ first (init i) ≫ arr (shuffle · swap · (swap × id) · assoc −1 · assoc −1 ) normalization arr (λ((a, (c, d)), (b, e)).(d, (e, ((c, b), a)))) ≫ first (init i) ≫ arr (λ(d, (e, ((c, b), a))).((a, (c, d)), (b, e)))
Therefore we have
first (second (second (init i))) definition of second first (arr swap ≫ first (arr swap ≫ first (init i) ≫ arr swap) ≫ arr swap) functor and extension arr (swap × id) ≫ first (first (arr swap ≫ first (init i) ≫ arr swap)) ≫ arr (swap × id) association arr (swap × id) ≫ arr assoc ≫ first (arr swap ≫ first (init i) ≫ arr swap) ≫ arr assoc −1 ≫ arr (swap × id) functor and extension arr (assoc · (swap × id)) ≫ arr (swap × id) ≫ first (first (init i)) ≫ arr (swap × id) ≫ arr ((swap × id) · assoc −1 ) association arr ((swap × id) · assoc · (swap × id)) ≫ arr assoc ≫ first (init i) ≫ arr assoc −1 ≫ arr ((swap × id) · assoc −1 · (swap × id )) composition arr (assoc · (swap × id) · assoc · (swap × id)) ≫ first (init i) ≫ arr ((swap × id) · assoc −1 · (swap × id ) · assoc −1 ) Lemma B.2 arr (assoc · (swap × id) · assoc · (swap × id)) ≫ arr (id × (swap · assoc −1 · transpose · assoc −1 )) ≫ first (init i) ≫ arr (id × (assoc · transpose · assoc · swap)) ≫
first (second (second (init i))) ≫ second (second (init j)) substitution = arr shuffle −1 ≫ second (second (first (init i))) ≫ arr shuffle ≫ arr shuffle −1 ≫ second (second (second (init i))) ≫ arr shuffle shuffle · shuffle−1 = id = arr shuffle −1 ≫ second (second (first (init i))) ≫ second (second (second (init i))) ≫ arr shuffle functor and product = arr shuffle −1 ≫ second (second (init(i, j))) ≫ arr shuffle 2 Lemma B.2 ∀g, g −1 , g · g −1 = id , we have first f = arr (id × g) ≫ first f ≫ arr (id × g −1 ) Proof:
    arr (id × g) ≫ first f ≫ arr (id × g⁻¹)
  =     (exchange)
    first f ≫ arr (id × g) ≫ arr (id × g⁻¹)
  =     (composition)
    first f ≫ arr ((id × g⁻¹) · (id × g))
  =     (normalization)
    first f ≫ arr id
  =     (right identity)
    first f                                            □
A Functional I/O System ∗
or, Fun for Freshman Kids

Matthias Felleisen, Northeastern University
Robert Bruce Findler, Northwestern University
Matthew Flatt, University of Utah
Shriram Krishnamurthi, Brown University

Abstract
Functional programming languages ought to play a central role in mathematics education for middle schools (age range: 10–14). After all, functional programming is a form of algebra, and programming is a creative activity about problem solving. Introducing it into mathematics courses would make pre-algebra courses come alive. If input and output were invisible, students could implement fun simulations, animations, and even interactive and distributed games, all while using nothing more than plain mathematics. We have implemented this vision with a simple framework for purely functional I/O. Using this framework, students design, implement, and test plain mathematical functions over numbers, booleans, strings, and images. Then the framework wires them up to devices and performs all the translation from external information to internal data (and vice versa)—just like every other operating system. Once middle school students are hooked on this form of programming, our curriculum provides a smooth path for them from pre-algebra to freshman courses in college on object-oriented design and theorem proving.

Categories and Subject Descriptors D.2.10 [Software Engineering]: Design—methodologies; D.4.7 [Operating Systems]: Organization and Design—interactive systems
General Terms Design, Languages
Keywords Introductory Programming

1. Functions for Freshmen
Based on our decade-long experience (Felleisen et al. 2004a), novices to programming tend to accept languages that they haven't heard of—as long as they can quickly construct a program that is like the applications they use on their computers. To this end, the chosen language must come with a rich framework for input and output (I/O), ideally via graphical interfaces. Chakravarty and Keller (2004) present corroborating evidence based on a thorough analysis of their Haskell-based introductory courses. They also report, however, that most Haskell texts deemphasize I/O. Our own review shows that three (Thompson 1997; Bird and Wadler 1998; Hutton 2007) of four major Haskell-based textbooks introduce I/O in the last third of the book or in the final chapter; only Hudak (2000) tackles it head-on, though in a quasi-imperative manner. Surprisingly, even O'Sullivan et al. (2008)'s Real World Haskell has difficulties explaining I/O, according to some on-line reviews.
Here we present our approach to reconciling I/O with purely functional programming, especially for a pedagogical setting. The I/O framework extends the DrScheme (Findler et al. 2002) teaching languages for our text How to Design Programs (HtDP) (Felleisen et al. 2001), but it is also accessible from other dialects. Our framework does not require the use of any monads or other threading devices, meaning middle school students can write animations and interactive games as a bunch of mathematical functions. Indeed, because everything is just a function on numbers, strings, and images, students can also test every step as they design their programs. For the past three years, the curriculum has been deployed at eight middle schools (ages 10–14). The students tend to embrace programming enthusiastically after a nine-week course. Moreover, this kind of programming experience seems to improve the performance of these students in standard mathematics courses.
The purpose of our paper is to share our technical development so that others can duplicate our pedagogical experiences, whether the functional I/O library is layered on top of an imperative library (as in our implementation) or on top of an I/O monad (as an implementation for Haskell would be). The next section provides an overview of our experiences to explain some of the sociological context where we apply functional programming with I/O. Section 3 illustrates how we start middle school students on functional programming. The rest of the paper focuses on the technical foundations. Sections 4 and 5 explain the I/O library and how to use it. Section 6 is about the college-level curricular context. In section 7, we compare our approach to the Clean Event I/O system, which is closest to our framework, and to some other pieces of work.
2. Experience

  Age group              Since   Framework                                                   Schools
  middle school          2006    animation: two- or three-object games, simplistic sound     eight
  high school            2004    animation: N-object games, simulations                      ≥ 30
  college                2003    animation: N-object games, simulations, visualization       ≥ 20
  college & MS program   2008    dist. prog.: distributed two-player games                   Chicago & NEU

∗ This research was partially supported by several grants from the National Science Foundation.

Variants of our I/O library have been in use since 2003 at a variety of places. The nearby table provides a concise overview of the kinds of students, sites, and projects that it enables. Clearly, the library is primarily useful in college-level freshman courses. Starting in 2004, we also introduced the library into the TeachScheme! workshops; some 30 schools have run courses with it. Three years ago, Emmanuel Schanzer derived the Bootstrap curriculum (www.bootstrapworld.org) from HtDP for Citizen Schools (www.citizenschools.org), which runs after-school programs
on a wide range of topics in poor neighborhoods across the US. At this point, eight sites (in Austin, the Bay Area, Boston and surroundings, and New York City) have used Bootstrap with a specialized variant of the I/O library. Finally, the first author uses the library in an immersion course for Northeastern’s MS program. Middle school students in the Bootstrap after-school courses typically have little background in mathematics. They are barely comfortable with calculations on numbers; they struggle with variables, if they have encountered them at all; and many are discouraged about the entire topic. Occasionally some of the students are concurrently enrolled in a first pre-algebra course. One goal of Bootstrap is to bring across the notion of a function as something that relates quantities, though not necessarily numbers. Naturally, the citizen teachers (volunteer instructors working for Citizen Schools) do not tell students that they are about to study a parenthesized form of algebra. Instead, they demonstrate interactive computer games and encourage students to think about creating such games on their own. A normal Citizen Schools course lasts nine weeks, with one two-hour session per week and no homework assignments. Within this time, most students design and systematically implement an interactive game of their choice that involves a fixed number of moving objects, object collisions, and score keeping. They code these games in the “Beginning Student” language of HtDP, extended with our new library. Citizen teachers report that the majority of students actively and enthusiastically participate in these courses, many asking for more when the course ends.1,2 By the end of the course, some of the citizen teachers share with the students that their game programs are mathematics. Anecdotal evidence suggests that making mathematics come alive in this manner has a direct impact on students’ performance in subsequent mathematics courses. Numerous instructors report conversations with mathematics teachers on how the students have changed their attitude about mathematics and how their grades have improved. The I/O library plays a similarly important role in high school courses and first-year college curricula as it does for Bootstrap. Many high school teachers and college instructors have noticed that students fail to understand mathematical functions. An introductory curriculum that makes functions critical and fun for animations, simulations, and interactive and multi-player/multi-computer games can thus play a central role in enhancing students’ preparation. Once students understand the basic idea of a function, it is easy to motivate them to study the systematic design of functions as advertised in HtDP (Felleisen et al. 2004b). After all, designing properly enables them to write more interesting simulations, animations, games, etc. As we explain in section 6, this kind of first course also prepares students well for the rest of the first year, including applicative and imperative object-oriented programming. Some typical examples are the freshman courses at the University of Chicago and Northeastern University. Although Chicago uses a quarter system, its course reaches the same milestones as Northeastern’s, due to the small class size at Chicago and students’ strong academic preparation. In both courses, students work out at least two complete game project in a purely functional setting,
often the same ones at both places (2008: Snake, Chat Noir). The first project (between 500 and 1,000 lines) is repeated twice, once to ensure students can implement suggestions and a second time to introduce abstraction via higher-order functions. The second project (some 1,000–2,000 loc) is typically an end-of-semester/quarter project where students have some creative freedom, too. The authors have repeatedly observed that students continue to work on their game programs after the semester/quarter ends.
3. Arithmetic, Algebra, and Movies
In middle school, mathematics teachers often ask students to determine the next number in a sequence such as 1, 4, 9. Or they show students a series of shapes and ask what the next shape should look like. Eventually these series are arranged in the form of tables, e.g.,

  x =   0   1   2   3   4   ...   i
  y =   0   1   4   9   ?   ...   y(i)
and students are asked to fill in the result for 4 and to determine a general formula for computing any element in the sequence. Next comes an algebra course where students experience the joys of such exciting problems as two trains leaving from Chicago and Philadelphia and colliding in Pittsburgh.
Functional programmers know that these students are encountering their first core concept in their mathematical education: functions and “variable expressions.” They also know that functions and expressions don’t need to be restricted to numbers and operations on numbers. It is perfectly acceptable to speak of the arithmetic and algebra of booleans, chars, strings, etc. DrScheme programmers also formulate functions and expressions that compute with images as first-class values, e.g., (empty-scene 100 100) returns a blank square of 100 by 100 pixels, and

  (place-image [rocket image] 10 20 (empty-scene 100 100))

combines the image of the rocket with the blank square by placing the former 10 pixels to the right of the left margin and 20 pixels down from the top margin of the latter.
Now imagine teaching in this context. Teachers can ask students what the next image in the following series is:

  x =   0   1   2   3   4
  z =   [a series of rocket scenes]   ?

and what the general formula is for the image. As before, students would struggle and eventually come up with an answer. At this point, teachers could explain that displaying 25 to 30 of these scenes per second would create the effect of a “movie” that simulates a rocket landing. Students know movies, and students find movies more interesting than trains colliding in Pittsburgh. So the teacher could show them how to play this movie in DrScheme:

  (define (rocket-scene i)
    (place-image [rocket image] 50 i (empty-scene 100 100)))

¹ Due to the demand, Schanzer [personal communication, Dec. 2008] is working on a second-level course for next summer.
² Over the past few years, Alice [alice.org] and Scratch [scratch.mit.edu] have been touted as frameworks for teaching middle school students how to program. Both efforts are incomparable with Bootstrap for two reasons. First, both Alice and Scratch are mostly imperative and thus fail to have direct benefits for the students’ mathematics education. Second, as others also note (Powers et al. 2007), these GUI-oriented systems come without a natural transition to full-fledged programming whereas our curriculum spells out a natural path from middle school mathematics to the second semester in college.

The function definition captures the general answer to the teacher’s question, though in parenthetical syntax. Although this isn’t close to the historically grown infix notation of mathematics (or to that of fashionable languages), our experience shows that students don’t seem to mind after some initial reluctance. If the student now
applies the library function run-simulation to rocket-scene, the library generates a series of scenes. More precisely, it applies its argument (here: rocket-scene) to 0, 1, 2, 3, . . . and displays the resulting series of images in a separate window, like the one embedded in this paragraph, at the rate of 28 numbers per second. The rest of the paper shows how to generalize this idea so that the language of 9th grade school mathematics can be used to design interactive, and even distributed, games.
4. Designing a World
The HtDP curriculum heavily emphasizes functional programming. DrScheme (Findler et al. 2002), HtDP's accompanying IDE, supports a series of five teaching languages, each expanding the expressive power of its predecessor. The first four of these teaching languages are purely functional, and they are usually the only ones used in courses for novice programmers.
Our I/O framework comes as a library, dubbed UNIVERSE. The library implements and exports two expression forms for launching world and universe programs. This section explains worlds; the next one is about universes.

4.1 The World is a Virtual Machine
To a student, the UNIVERSE library represents the computer's operating system and hardware. As such the library is the keeper of a representation of the state of the world. When the hardware or operating system notices certain events, the library hands over the state of the world to a function in the student's program and expects another state back. We call this state a world, and the phrase world program denotes the collection of functions that interact with the library. Combining the library and a world program creates an executable program.
The library is parameterized over the kinds of states—called WorldSt—that the world program wishes to deal with as well as the event handlers that process these states. A world program matches these two parameters with a data definition for the collection of states and with a collection of functions. Figure 1 expresses this dependency between the library and a student's program via a unit diagram (Flatt and Felleisen 1998). The universe library is parameterized over the type WorldSt; it exports two types and a function that consumes WorldSt-processing functions. Conversely, a world program is a unit that imports all of this, exporting in return a WorldSt type. Linking the two creates the executable.

[Figure 1. A unit perspective of world programs and UNIVERSE: the Universe unit imports type WorldSt and exports type KeyEvt, type MouseEvt, and val big-bang : WorldSt x (WorldSt -> WorldSt) x (WorldSt KeyEvt -> WorldSt) x (WorldSt Nat Nat MouseEvt -> WorldSt) -> WorldSt; the WorldProgram unit imports those types and big-bang, and exports type WorldSt.]

In reality, though, programs specify types as comments, and UNIVERSE does not export a function for specifying event handlers but a syntactic extension, dubbed big-bang:

  (big-bang WorldState-expr
    (on-tick tock-expr rate-expr†)†
    (on-key react-expr)†
    (on-mouse click-expr)†
    (stop-when done-expr)†
    (on-draw render-expr width-expr† height-expr†)†)

A big-bang expression has one required sub-expression—the initial state of the world—and five optional clauses (indicated via † superscripts). These clauses are introduced via one of five keywords (on-tick, on-key, on-mouse, stop-when, and on-draw), mimicking keyword-based parameter passing. Each clause specifies at least one sub-expression; two have additional optional sub-expressions (see † superscripts). When PLT Scheme encounters a big-bang expression, it first evaluates all sub-expressions and checks some basic properties. The result of WorldState-expr becomes the initial state of the world. The remaining values give access to a subset of the underlying platform's events:
1. If an on-tick clause exists, big-bang starts a clock that ticks at a rate of 28 times per second or as often as the result of rate-expr—a natural number—specifies. The expression tock-expr must evaluate to a function of one argument:
   ;; WorldSt → WorldSt
   Specifically, the function consumes a state of the world and produces one. The universe library invokes it on the current state every time the clock ticks; its result becomes the next state.
2. An on-key clause specifies how a world program reacts to a keyboard event. Its sub-expression must evaluate to a function of two arguments:
   ;; WorldSt KeyEvt → WorldSt
   The first is again the current state of the world; the second is a data representation of the keyboard event. In UNIVERSE, a keyboard event is represented as either a one-character string (e.g., "a") or one of a number of special strings (e.g., "left", "release"). The former denote regular keys on the keyboard; the latter are used to represent arrow keys, other special keys, and the event of releasing a key. The library invokes this function for every keyboard event and uses the result of the invocation as the new state of the world.
3. Similarly, an on-mouse clause determines how a world program reacts to a mouse event with a function of four arguments:
   ;; WorldSt Nat Nat MouseEvt → WorldSt
   As always, the first argument is the current state of the world. The next two arguments capture the x and y coordinates of the event, measured in the number of pixels from the left and top of the screen. Finally, the MouseEvt argument determines what kind of mouse action has taken place. It is one of the following six strings: "button-up", "button-down", "drag", "move", "enter", "leave". Like the underlying operating system, the UNIVERSE library does not notify a world program of every mouse event, but it samples the mouse events at a reasonably high rate. The result of applying the mouse-event handler function becomes the next world.
What the figure does not show is the orthogonally specified rendering of each state as a scene or image. Although these images are values in PLT Scheme, they are usually not a component of world states. One way to imagine this rendering process is to add a different kind of arrow to each state and connecting this arrow to the scene that the on-draw function produces for this state. Given this explanation, we can explain the workings of the run-simulation function. Its world is the world of natural numbers, i.e., the state of the world represents the number of times the clock has ticked so far:
W′
W
tickH
tickH
keyH mouseH
keyH
W′′
mouseH
mouseH
tickH
...
...
keyH
W′′′′
;; WorldSt = Nat ;; interp. the number of clock ticks
W′′′
As for run-simulation, it consumes a function from natural numbers to Scenes. Its purpose is to start the world, to count the number of clock ticks, and to invoke the given function on each clock tick to render a series of Scenes:
Figure 2. A state transition diagram for world programs event, measured in the number of pixels from the left and top of the screen. Finally, the MouseEvt argument determines what kind of mouse action has taken place. It is one of the following six strings: "button-up", "button-down", "drag", "move", "enter", "leave". Like the underlying operating system, the UNIVERSE library does not notify a world program of every mouse event, but it samples the mouse events at a reasonably high rate. The result of applying the mouse-event handler function becomes the next world.
;; (Nat → Scene) → Nat
(define (run-simulation render)
  (big-bang 0
    (on-tick add1)
    (on-draw render)))

The result of run-simulation is a natural number: specifically, the number of clock ticks that have passed (once the simulation halts).

4.2 Designing a World Program

Designing a world program is surprisingly easy. The first step is to design a data representation for the information that varies and that is to be tracked through the duration of the program execution. We recommend expressing the data representation as a data (type) definition (or several) and equipping it with comments that interpret this data in terms of the visible canvas (world). Naturally, this data definition fills in for the WorldSt type from the preceding section. The second step is to tease out constants that describe properties of the world. This includes both quasi-physical constants, e.g., the width and height of the screen, as well as image constants, e.g., the background or a fixed shape that moves across the scenery.

The third step is to design the event-handling functions. Here "design" refers to the design recipe from HtDP. Given that we already have data definitions (from the first step and the library), we also have contracts for all the top-level functions. Hence the next step is to think through examples and to turn them into tests. The creation of templates usually (but not always) uses the WorldSt type for orientation. After coding, it is important to run the tests.

Also following HtDP, iterative development is the most appropriate approach for world programs. Specifically, we recommend that students provide a minimally useful data definition for WorldSt and then design one state-processing event handler and the rendering function. This enables them to test the core of the program and interact with it. From here, they can pursue two different directions: enriching the data and adding event handlers.
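In that spirit, a first iteration can be as small as one rendering function handed to run-simulation. The following sketch is ours, not the paper's; the growing disk and the 100-by-100 canvas are arbitrary illustrative choices:

;; Nat → Scene
;; render the number of elapsed clock ticks as a disk that grows by one pixel per tick
(define (growing-dot t)
  (place-image (circle (+ t 1) "solid" "red")
               50 50
               (empty-scene 100 100)))

;; launch the simulation; the result is the number of ticks elapsed when it is stopped
(run-simulation growing-dot)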
4. The stop-when clause determines when the world ends. Its sub-expression must evaluate to a predicate:

;; WorldSt → Boolean

After handling an event with one of the above event-handling functions, UNIVERSE uses this predicate to find out whether the resulting state of the world is a final state. If the result state satisfies the predicate, no further events are processed.

5. Last but not least, a big-bang expression may come with an on-draw clause, which has either one or three sub-expressions. The first sub-expression must evaluate to a function of one argument:

;; WorldSt → Image

If the big-bang expression specifies such a function, the UNIVERSE library opens a separate window whose size is determined by the size of the first image that the function produces. Alternatively, a program may specify the size of the canvas explicitly via the two additional sub-expressions, which must evaluate to natural numbers. The function specified in on-draw is used every time an event-handling function produces a state. The resulting image is rendered in the separate window.
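Putting these clauses together, a complete (if tiny) world program might count down ten clock ticks and then stop. This sketch is ours, not an example from the paper; the shrinking disk and the canvas size are arbitrary choices:

;; WorldSt = Nat
;; interp. the number of ticks still to go

;; WorldSt → WorldSt
(define (countdown w) (- w 1))

;; WorldSt → Boolean
(define (done? w) (= w 0))

;; WorldSt → Image
(define (show w)
  (place-image (circle (+ 1 (* 5 w)) "solid" "blue") 50 50 (empty-scene 100 100)))

;; evaluates to 0, the final state, after ten clock ticks
(big-bang 10
  (on-tick countdown)
  (on-draw show)
  (stop-when done?))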
4.3 Controlling a UFO

Let us illustrate how to design world programs with an example from the second or third week in a college freshman course. The goal of the exercise is to move a UFO ("flying saucer") across the canvas in a continuous manner. Later we add functions that allow "players" to control the UFO's movements via the arrow keys on the keyboard and via mouse clicks. A moving object on a flat canvas has (at least) four properties, meaning we need to use a structure4 to represent the essential data:
Once the world ends, big-bang returns the final state.3 As figure 2 suggests, the core of an executable world program denotes a state machine. Each element W, W′, W′′, . . . of WorldSt is a state of this machine. For each state and for each kind of event, the event handlers (plus event-specific inputs) specify the successor state; that is, each state—except for final ones—is the source of three (families of) arrows (with distinct targets). The final states are those for which the predicate specified in the stop-when clause produces true.
3 It is instructive to contrast this to the type of reactimate in Fran (Elliot and Hudak 1997).
4 In teaching languages, a structure definition like this one introduces three kinds of functions: a constructor (make-ufo), a predicate (ufo?), and one selector per field to extract the values (ufo-x, ufo-y, ufo-dx, ufo-dy). PLT Scheme also adds imperative mutators on demand.
;; WorldSt KeyEvt → WorldSt
;; control the ufo's direction via the arrow keys
(check-expect (control (make-ufo 5 8 -1 -1) "down")
              (make-ufo 5 8 -1 +1))
;; ... more test cases ...
(define (control w ke)
  (cond
    [(key=? ke "up")    (set-ufo-dy w -1)]
    [(key=? ke "down")  (set-ufo-dy w +1)]
    [(key=? ke "left")  (set-ufo-dx w -1)]
    [(key=? ke "right") (set-ufo-dx w +1)]
    [else w]))

;; WorldSt Int → WorldSt
(define (set-ufo-dy u dy)
  (make-ufo (ufo-x u) (ufo-y u) (ufo-dx u) dy))
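The control function uses a helper set-ufo-dx that the figure leaves out; its definition, our completion and strictly analogous to the set-ufo-dy shown above, would be:

;; WorldSt Int → WorldSt
(define (set-ufo-dx u dx)
  (make-ufo (ufo-x u) (ufo-y u) dx (ufo-dy u)))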
;; WorldSt Nat Nat MouseEvt → WorldSt
;; move the ufo to a new position on the canvas
(check-expect (hyper (make-ufo 10 20 -1 +1) 40 30 "button-up")
              (make-ufo 10 20 -1 +1))
;; ... more test cases ...
(define (hyper w x y a)
  (cond
    [(mouse=? "button-down" a) (make-ufo x y (ufo-dx w) (ufo-dy w))]
    [else w]))

;; WorldSt → Boolean
;; has the ufo landed?
(check-expect (landed? (make-ufo 5 (- SIZE 5) -1 +1)) false)
;; ... more test cases ...
(define (landed? w)
  (>= (ufo-y w) SIZE))
Figure 3. Using keyboard and mouse events to control a ufo

(define-struct ufo (x y dx dy))
;; WorldSt = (make-ufo Nat Nat Int Int)
;; interp. the location (pixels)
;;   and velocity (pixels/tick)
Before we can interact with the program, we must design one more function, namely, a function for rendering the current state of the world as a scene:

;; WorldSt → Scene
;; place the ufo into MT at its current position
Because nothing else in this "game" changes over time, we identify the state of the world with the state of the UFO. Next we fix the size of the canvas, the background (an empty scene), and the shape of the UFO:

(define SIZE 400)
(define MT (empty-scene SIZE SIZE))
(define UFO
  (overlay (circle 10 "solid" "green")
           (rectangle 40 2 "solid" "green")))
(define UFO.version2 ...)
(check-expect (render (make-ufo 10 20 -1 +1))
              (place-image UFO 10 20 MT))
(define (render w)
  (place-image UFO (ufo-x w) (ufo-y w) MT))

Designing such a function proceeds according to the same recipe as designing the move function. Also notice that we can test the outcome of this function as if it were a function on the reals. Because images are first-class values, it makes sense to construct the expected output and to compare it to the actual result of the function. PLT Scheme's standard equal? function works for images, too. While we recommend that students develop such "expected results" expressions (interactively in the REPL) to gain some understanding of how the function should proceed, it is indeed possible to insert an actual image instead of such an expression:
This time we use basic image creation and manipulation primitives to create the right kind of shape; using the definition of UFO.version2 instead of UFO would of course work equally well. With the above data definition, we have determined the complete type signature of the event-handling functions for clock tick events. Of course we should add a purpose statement:

;; WorldSt → WorldSt
;; move the ufo for one tick of the clock
(check-expect (render (make-ufo 10 20 -1 +1)) )

Equipped with move and render, it is possible to define a main function and to watch these first two definitions in action:

;; WorldSt → WorldSt
;; run a complete world program, starting in state w0
(define (main w0)
  (big-bang w0
    (on-tick move)
    (on-draw render)))
The next step in our design recipe calls for examples that describe the behavior of the function. We formulate these examples immediately in the unit testing framework that comes with DrScheme's teaching languages:5

(check-expect (move (make-ufo 10 20 -1 +1))
              (make-ufo 9 21 -1 +1))

The example illustrates that the function's purpose is to add the velocity to the current position and to use it as the new position:

(define (move w)
  (make-ufo (+ (ufo-x w) (ufo-dx w))
            (+ (ufo-y w) (ufo-dy w))
            (ufo-dx w)
            (ufo-dy w)))
In short, we have finished the first stage of our iterative design cycle, creating a first useful part of the overall program. From here, it is easy to design the rest of the function. See the left-hand side of figure 3 for the definition of a function that controls the movements of the UFO via arrow keys. The function key=? compares two keyboard events. The right-hand side of the same figure displays functions for making the UFO jump to the position of a mouse click; mouse=? of course compares mouse events. The last function checks whether the UFO has landed.
5 DrScheme collects all check-expect expressions and evaluates them after all definitions and expressions are evaluated. It then outputs the results and tabulates failed test cases with hyper-links to the source text of the test.
5. Universe: A World is Not Enough
Designing interactive graphical programs via purely functional programming is only half the game. The other half is about designing distributed programs, especially distributed games. The principles remain the same, but the differences deserve a close look.
5.1 Universes
A universe consists of a distributed collection of world programs that communicate with each other via a programmable server:
[Diagram: four world programs (World 1–4) connected to a Server; the worlds and the server exchange S-expression messages.]
We make no assumption about where the programs run, in particular, UNIVERSE cannot find servers automatically. The communication links rely on TCP/IP connections, meaning messages sent from a world to a server (or vice versa) are guaranteed to arrive in the order in which they are dispatched. Of course, when two distinct world programs send messages to the server, there is no guarantee that the messages arrive in the order they were sent; similarly, if the server broadcasts messages to (some of) the participating worlds, the messages may again arrive at distinct worlds in an unrelated order. In order to design a universe based on the UNIVERSE teachpack, students design a communication protocol, which they implement via a “server” program. Some protocols simply pass messages from one world program to another and back, with the server playing the role of a conduit. Other protocols assume that the server is an arbiter, enforcing the rules of a game or directing traffic among the participants, as in a chat room. Finally, the server could be configured in such a way that the world programs simulate peers in a peer-to-peer neighborhood.
Figure 4. State transition view of communicating world programs

Figure 4 is a revision of figure 2 for communicating worlds. Again, all elements of WorldSt are states, but now all states come with four kinds of transition arrows. The fourth one is the event handler that deals with message receipts. In addition, each arrow now comes with an optional output label in the form of an S-expression. Just as UNIVERSE displays the rendering of a state as an image for a world program, it also implements the sending of these messages from state transitions to the universe's server.

5.3 The Universe and its Server

The UNIVERSE library supports the design of servers in a manner that is analogous to the design of world programs. A programmer describes a server via a pair of specifications: a data definition of universe states, dubbed UniSt, and a universe description, which is analogous to a big-bang description. For a server, three kinds of events matter most: the entry of an additional world into the universe, a world's disappearance, and the arrival of a message from a participating world. Accordingly, server programs must deal with representations of participating worlds, and UNIVERSE supports this:
5.2 A World in the Universe

For a world program to participate in a universe, it registers with the server using a (register ip-expr) clause in its big-bang expression. The sub-expression designates an IP address (as a string). A registered world program sends messages via its event handlers. To this end, the UNIVERSE library defines a package structure and exports its constructor and predicate:

(define-struct package (world msg))
;; Package = (make-package World S-exp)
(define-struct iworld (name in out))
;; IWld = (make-iworld String Port Port)
;; interp. internal representation of a participating world

The iworld structure keeps track of a world program's name, its input TCP port, and its output port, though a server program may only access the name field of iworld structures. Other than that, server programs must compare worlds and do so with iworld=?. Here is the core grammar of a universe description:
Moreover, the library actually deals with event handlers that return one of two kinds of results, meaning the signature of, say, key event handlers is really

;; WorldSt KeyEvt → (∪ Package WorldSt)
(universe UniSt-expr
  (on-new new-expr)
  (on-msg msg-expr)
  (on-disconnect disc-expr disc-expr)†
  ...)
instead of the one specified in the preceding section. If an event handler produces a package, the library uses the value in the first field as the next state of the world, and the value in the second field is sent off to the server. Otherwise, the library assumes the result is just the state of the world. To receive messages, a world program installs an event-handling function via an on-receive clause in big-bang. Its sub-expression must evaluate to a function with the following signature:
The first, required sub-expression determines the initial state of the server. Furthermore, every universe description comes with an on-new clause and an on-msg clause. Optionally, it may also contain an on-disconnect clause. Every server's event handler consumes the current state of the universe—as perceived and maintained by the server's event handlers—and the representation of a participating world; it may also consume a message received from such a world. An event
;; WorldSt S-exp → (∪ Package WorldSt)

When a message in the form of an S-expression arrives, this event handler is applied to the current state of the world and the message. Like all other event handlers, this handler may return a Package.
handler produces a bundle, i.e., a UNIVERSE-specified structure that contains three distinct pieces of information: the new server state (UniSt); a list of messages to designated worlds; and the list of worlds to be discarded:

(define-struct bundle (state mails to-discard))
(define-struct mail (to msg))
;; Bundle = (make-bundle UniSt Mail∗ IWld∗)
;; Mail∗ = [Listof (make-mail IWld S-exp)]
;; IWld∗ = [Listof IWld]
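To illustrate the shape of such bundles, here is a hypothetical helper, ours rather than the library's or the paper's, that keeps the server state, mails one message to every world in a given list, and discards nobody:

;; UniSt [Listof IWld] S-exp → Bundle
(define (broadcast ust ws msg)
  (make-bundle ust (mail-all ws msg) '()))

;; [Listof IWld] S-exp → Mail∗
;; one piece of mail per world in ws, each carrying msg
(define (mail-all ws msg)
  (cond
    [(empty? ws) '()]
    [else (cons (make-mail (first ws) msg)
                (mail-all (rest ws) msg))]))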
Our experimentation with the UNIVERSE library suggests that interaction diagrams—like those used for object-oriented designs based on UML—are a good medium for discussing ideas. Instead of spelling out this recommendation in detail, however, we illustrate it with a simple example.

5.5 Serving a Turn

As mentioned, the coordination among the worlds of a universe depends on the server and the message protocol it employs. We and our students have implemented a number of servers. Here we illustrate the power of the UNIVERSE library with the design of a server and some UFO controller clients where each client gets a turn to control a (local) UFO. We start with the protocol design, followed by the design of the server, and then the adaptation of the UFO program from section 4 to support distribution.
Event handlers may only construct bundles and mails; they may not destructure them. The event handlers function as follows:

1. An on-new handler has the signature

;; UniSt IWld → Bundle
Protocol Design The prose suggests the following, informal and schematic interaction diagram:
i.e., it consumes the server state and a representation of the world that wishes to join. The resulting bundle may contain this new world as one that should be discarded, which effectively represents a rejection of the request. Optionally, the handler may send out messages about the event.
[Interaction diagram: the server on the left; world1 ("sam") and world2 ("carl") register; the server sends "your-turn" to the active world, and each "done" message makes it pass "your-turn" to the next world.]
2. An optional on-disconnect event handler has the same signature as an on-new handler, but it deals with the disappearance of a world from the universe:
"your-turn"
;; UniSt IWld → Bundle

This kind of event is usually due to a severed connection or because the corresponding world program shut down.
3. The signature for on-msg handlers also includes the message that arrived in the form of an S-expression:
;; UniSt IWld S-exp → Bundle

When the on-msg event handler is invoked, it is applied to the state of the server, the world that sent in a message, and the message itself. The result bundle determines how this event is shared with other worlds in the universe.
"done" "your-turn"
Optional handlers may drive the server via clock ticks, render the current state of the server in a console, or deal with other events. A complete universe program—as specified in a universe expression—is best thought of as a state-transition machine, just like the one for world programs depicted in figure 4. Each element of UniSt is a state of the machine; each event handler (and its auxiliary parameters) represents one possible transition from one UniSt element to another. In contrast to world programs, the state transitions in a universe program come with two labels: one for sending mail to a list of participating worlds, and another one for deleting worlds from the list of participants.
The three vertical lines are “world life-lines,” while the horizontal lines are registration or message sending steps. This particular diagram shows the key properties of our proposed universe. The server is on the left; the participating worlds are to its right. After creation, a world registers with the server, which we assume sends along a name for the world. Our diagram shows that as soon as a first world has registered, the server gives this world a turn without waiting for any other world to show up. If another world shows up—possibly during some turn—the server becomes aware of it but continues to wait for a "done" signal from the world whose turn it is. Once the active world ends its turn, the server gives a turn to the next world on the list. Finally, the diagram also shows what happens when a world disappears, say due to the closure of a connection. The server notes the disappearance and gives a turn to (one of) the remaining worlds.
5.4 Designing a Universe

Designing a universe requires two different perspectives: a global one concerning coordination and local ones for the server and the world programs. Once the global view has been developed, the local design of the servers and world programs proceeds just like that of stand-alone world programs. The global perspective demands the design of a coordination and communication protocol. This protocol design has the goal of creating and maintaining an invariant for the universe. In order to achieve this goal, we teach students to consider the start-up phase, the steady-state phase, and the shut-down phase of a universe. For all cases, it is important to understand (1) the order in which events occur and (2) which S-expressions encode which messages.
Server Design From here, the design of the server proceeds just like the design of a world program, though we must observe the constraints imposed by the protocol. We start with the required data definition:

;; UniSt = IWld∗
;; interp. the list of worlds in the order they take turns,
;;   starting with the active one;
;;   the active world (if any) is first
;; UniSt IWld → Bundle
;; nw is joining the universe
(check-expect (add-world (list iworld2) iworld1)
              (make-bundle (list iworld2 iworld1) '() '()))
;; ... more test cases ...
(define (add-world ust nw)
  (if (empty? ust)
      (make-bundle (list nw) (m2 nw) '())
      (make-bundle (append ust (list nw)) '() '())))

;; UniSt IWld "done" → Bundle
;; mw sent message m; assume mw = (first ust), m = "done"
(check-expect (switch (list iworld1 iworld2) iworld1 "done")
              (make-bundle (list iworld2 iworld1) (m2 iworld2) '()))
;; ... more test cases ...
(define (switch ust mw m)
  (local ((define l (append (rest ust) (list mw)))
          (define nxt (first l)))
    (make-bundle l (m2 nxt) '())))
;; UniSt IWld → Bundle
;; dw disconnected from the universe
(check-expect (del-world (list iworld1 iworld3) iworld3)
              (make-bundle (list iworld1) '() '()))
;; ... more test cases ...
(define (del-world ust dw)
  (if (not (iworld=? (first ust) dw))
      (make-bundle (remq dw ust) '() '())
      (local ((define l (rest ust)))
        (if (empty? l)
            (make-bundle '() '() '())
            (local ((define nxt (first l))
                    (define mll (m2 nxt)))
              (make-bundle l mll '()))))))

;; IWld → Mail∗
;; create single-item list of mail to w
;; no test cases
(define (m2 w)
  (list (make-mail w "your-turn")))
Figure 5. A primitive functional server program

(universe '()
  (on-new add-world)
  (on-msg switch)
  (on-disconnect del-world))
Note again the interpretation that comes with the data definition. It has several implications for the design of the event handlers. Since this server deals with three kinds of events—registration of a world, message receipt, and disconnection of a world from the universe—we need three event handlers. The UNIVERSE specifications and the agreement to send certain messages dictate the contract statements:
Adding this expression to the bottom launches a process that waits for TCP/IP events and deals with them by invoking one of the three event handlers.
;; add-world : UniSt IWld → Bundle
;; switch : UniSt IWld "done" → Bundle
;; del-world : UniSt IWld → Bundle
Client Design To illustrate how the client side works, let us consider a small change to our UFO controller from the preceding section. Suppose we give each "player" a turn to land a UFO and that when the UFO touches the ground, it is the next world's turn. One obvious implication is that there is now a distinct new kind of state of the world:

;; WorldSt is one of:
;; --- "rest"
;; --- (make-ufo Nat Nat Int Int)
The names of the three functions are suggestive of their purpose. Just as in the case of the UFO controller, we can design these functions in a systematic manner. In support of unit tests for event handlers in a server, UNIVERSE exports three sample worlds iworld1, iworld2, and iworld3; of course, it does not export the capability of creating representations of participating worlds. Otherwise, the design of these three server functions proceeds in a straightforward fashion. The three definitions and fragments of their test suites are displayed in figure 5:6
When it isn't this world's turn, the world is in a "rest" state. Next we replace the event handler for ticks with a function that sends out messages when the UFO lands:

;; WorldSt → (∪ WorldSt Package)
(define (move.global w)
  (cond
    [(string? w) w]
    [else
     (local ((define v (move w)))
       (if (not (landed? v))
           v
           (make-package "rest" "done")))]))
1. the top-left box contains the code for adding a world;
2. the box in the bottom-left defines the function for dealing with a message from the active world, which is the only kind of message that the server expects;
3. the top-right box concerns the event of a world disconnecting from the universe; and
4. the final box in the bottom right contains the definition of an auxiliary function for creating a list of mail to a single world.
The function distinguishes the two cases from the data definition. For a string, it returns the world as is. Otherwise, it moves the world using the old move function and then checks whether the UFO has landed; if so, the new event handler produces a package. In addition, we need a handler for "your-turn" messages:
As far as the server is concerned, the only task left to do is to formulate the universe expression, shown with figure 5, and to evaluate it at DrScheme's REPL to start the server.
;; WorldSt "your-turn" → WorldSt ;; assume: messages arrive only ;; if the state is "rest" (define (receive w msg) (make-ufo 20 10 -1 +1))
6 The definitions use the local construct from the HtDP teaching languages. Roughly speaking, (local defs body) introduces the mutually recursive definitions defs for the evaluation of body. Unlike Scheme's internal definitions, local definitions have the exact same semantics as global definitions but come with a restricted lexical scope.
;; WorldSt (WorldSt → WorldSt) (WorldSt KeyEvt → WorldSt)
;;   (WorldSt Nat Nat MouseEvt → WorldSt) [Listof Event]
;;   → WorldSt
;; process a list of events given the initial world and event handlers
(define (big-bangF w0 tickH keyH mouseH loe0)
  (local (... dispatch: see below, on the right ...
          ;; accumulator design: w is the result of dealing with
          ;; all events between loe0 and loe (inclusive)
          (define (big-bangF w loe)
            (cond
              [(empty? loe) w]
              [else (big-bangF (dispatch w (first loe)) (rest loe))])))
    (big-bangF w0 loe0)))

(define-struct tick ())
(define-struct key (kind))
(define-struct mouse (x y kind))
;; An Event is one of:
;; --- (make-tick)
;; --- (make-key KeyEvt)
;; --- (make-mouse Nat Nat MouseEvt)
;; WorldSt Event → WorldSt
;; deal with a single event, given the state of the world
(define (dispatch w e)
  (cond
    [(tick? e)  (tickH w)]
    [(key? e)   (keyH w (key-kind e))]
    [(mouse? e) (mouseH w (mouse-x e) (mouse-y e) (mouse-kind e))]))
Figure 6. The semantics of functional event handling

Unlike move.global, receive does not distinguish two kinds of worlds. Whether the world is in a resting state or not, the function returns some UFO. The revised main function registers the world with the server and specifies a name for the world that is used for registration:
the third and fourth parts of the book (and its teaching languages), lambda and local definitions are added. Programming is developed as the systematic design of computational solutions to "word" problems. The design of individual functions follows a general six-step procedure paired with a systematic development of data definitions. The design of programs is presented as an iterative refinement process, comparable to the scientific process of developing models of the world. Specifically, the program is the model, and the world is the set of our (or our client's) objectives. As we refine the program, our model satisfies more and more of the objectives.

Obviously, this design recipe also applies to the design of I/O functions for world and universe programs. The key is that UNIVERSE translates external information into internal data and invokes the event handlers on the latter. Furthermore, the event handlers produce only internal data, which UNIVERSE then displays as external information. The translations themselves are hidden from the students' transformations. Hence, the process of formulating contracts, functional examples, etc. remains the same. Because images are just another form of atomic data, the design recipe even applies to the rendering functions that produce complex graphical scenes.

The separation of the actual act of performing I/O from the processing or production of I/O data is critical for effective testing. It empowers a programmer to unit-test every single function, covering the complete chain from where input data appears to the point where output data is delivered. As a matter of fact, this covers the testing of image-producing functions, for which we recommend two different testing strategies. The first is to develop an expression in the read-eval-print loop of DrScheme that creates an image for simple inputs. This kind of experimentation suggests both an "expected value" expression as well as the body for the desired function. The second strategy is to create the expected image separately:
;; String → WorldSt
(define (main-for-client n)
  (big-bang "rest"
    (on-tick move.global)
    (on-draw renderR)
    (on-receive receive)
    (name n)
    (register LOCALHOST)))

Here we assume the server is running on the same computer as the client and that renderR renders the new kind of worlds.

Note: The design assumes that all participating worlds and the server implement the protocol correctly. The assumptions above suggest how functions may protect themselves against errors in the implementations or attacks. The reader may wish to explore the small changes needed to check those assumptions.
6. Design and Curriculum

Designing reactive programs in a purely functional manner comes with several advantages. For one, it is straightforward to explain big-bang as if it were a function. As figure 6 shows, this function traverses a list of events,7 accumulating the changes to the initial world. Also, it uniquely fits in with our design curriculum, which covers functional design followed by courses on logical reasoning and object-oriented design.

6.1 Design Recipe

HtDP introduces its teaching programming languages as a generalization of school mathematics. Instead of functions over just numbers, these languages can express functions and expressions that deal with atomic data (numbers, symbols, chars, strings, images, boolean data) and compound data (structures, vectors, and lists). In
(check-expect (create-ufo) )

(check-expect (render-world (make-ufo ...))
              (place-image ... ... (empty-scene SIZE SIZE)))

As the second check-expect specification shows, it is of course possible to mix and match those two strategies. Once tests are developed, DrScheme's built-in test coverage tool pinpoints those expressions that haven't been evaluated during a test run.
7 Our implementation replaces the list with an imperative stream of events, plus a thread for receiving messages from the server. The stream dispatcher and the thread are coordinated via the CML-inspired synchronization primitives of PLT Scheme.
(define world%
  (class fun-world%
    (super-new)
    (init-field ufo)
    (field [MT (empty-scene 500 500)])
    ;; → world%
    ;; deal with a tick event in this world
    (define/augment (tick)
      (new world% [ufo (send ufo move/tick)]))
    ;; → scene
    ;; render this world as a scene
    (define/augment (render)
      (send ufo render MT))))
(define world%
  (class imp-world%
    (super-new)
    (init-field ufo)
    (field [MT (empty-scene 500 500)])
    ;; → void
    ;; deal with a tick event in this world
    (define/augment (tick)
      (send ufo move/tick))
    ;; → scene
    ;; render this world as a scene
    (define/augment (render)
      (send ufo render MT))))
(define ufo%
  (class object%
    (super-new)
    (init-field x y dx dy)
    (field [UFO (overlay (rectangle ...) (circle ...))])
    ;; → ufo%
    ;; move this ufo for one tick
    (define/public (move/tick)
      (new ufo% [x (+ x dx)] [y (+ y dy)] [dx dx] [dy dy]))
    ;; → scene
    ;; add this ufo to the given scene s
    (define/public (render s)
      (place-image UFO x y s))))
(define ufo%
  (class object%
    (super-new)
    (init-field x y dx dy)
    (field [UFO (overlay (rectangle ...) (circle ...))])
    ;; → void
    ;; effect: change this ufo's coordinates, for a move
    (define/public (move/tick)
      (begin (set! x (+ x dx))
             (set! y (+ y dy))))
    ;; → scene
    ;; add this ufo to the given scene s
    (define/public (render s)
      (place-image UFO x y s))))
Figure 7. Applicative and imperative world classes

We want novice programmers to attempt to cover all expressions, except for those that connect the event handlers to the underlying operating system (big-bang, universe). While complete coverage is a good first goal, the design of reactive programs tends to demonstrate that unit testing does not suffice. Even when an individual reactive function passes all unit tests, the composition of all the reactive functions to deal with a large stream of events often concocts scenarios that the unit tests don't cover. Put differently, reactive programming demands some amount of integration testing, too. Given our "list of events" semantics, programmers can usually mimic these scenarios with the composition of event handlers. Last but not least, because the event handlers are just functions, we can also subject them to the functional random testing (Claessen and Hughes 2000) tools now built into DrScheme or its theorem proving environment (Eastlund 2009). Indeed, programmers who learn to formulate conjectures and to validate them via random testing are ideally prepared to study the automated verification of interactive/reactive programs.
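With the list-of-events semantics of figure 6, such a scenario can be replayed as an ordinary unit test. The following check-expect is our sketch; it reuses big-bangF and the Event structures from figure 6 together with the UFO handlers move, control, and hyper from section 4:

;; two clock ticks followed by a mouse click that teleports the ufo to (100, 50)
(check-expect
 (big-bangF (make-ufo 10 20 -1 +1)
            move control hyper
            (list (make-tick)
                  (make-tick)
                  (make-mouse 100 50 "button-down")))
 (make-ufo 100 50 -1 +1))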
The mechanized proofs are based on the semantics of the big-bang function in figure 6 and a more general version for universes of world programs. Specifically, a macro unfolds claims about a specific instance of big-bang expressions into an application of a function like big-bangF to all possible lists of events.

6.3 On to Classes

At the same time as freshmen learn to formulate claims about their functional animation programs and to prove them correct, they are enrolled in a parallel course on design in the context of class-based object-oriented languages. We prepare the transition at the end of the first semester with some simple conventions and arrangements. Specifically, instead of arranging functions by feature (e.g., all rendering functions in one place, all key-event related functions somewhere else), we organize functions around data definitions. For example, we start with all event handlers for WorldSt:

;; WorldSt is one of ...
;; WorldSt → WorldSt
(define (world-tickh w) ...)
6.2 Reasoning about Worlds and Universes

During their second semester at Northeastern University, computer science majors study the logic of computation. The course combines a standard theoretical introduction to logic with practical hands-on exercises based on the ACL2 system (Boyer and Moore 1996); see our experience report on the test run of this course (Eastlund et al. 2007). Roughly speaking, the ACL2 system consists of an applicative Common Lisp and an automatic theorem prover based on first-order classical logic. Two years ago we extended the ACL2 system with the UNIVERSE library, enabling students to write reactive games, formulate conjectures about the safety of their game programs, and prove them correct via the ACL2 theorem prover (Eastlund and Felleisen 2009). Here is a typical theorem from such experiments:
;; WorldSt → Scene
(define (world-render w) ...)

;; WorldSt KeyEvt → WorldSt
(define (world-keyh w ke) ...)

and follow it up with an arrangement around UFO:

;; UFO is one of ...
;; UFO → UFO
(define (ufo-move u) ...)
(defthm preserve-safety
  (implies (safe-state game-state)
           (safe-state (tick game-state))))
;; UFO Scene → Scene
(define (ufo-add-to-scene u s) ...)

;; UFO Symbol → UFO
(define (ufo-chg u dir) ...)
When the theorem prover fails, students are encouraged to subject their conjectures to our ACL2 random tester (Eastlund 2009).
We always make the current state the first parameter of a function, analogous to the implicit this parameter in methods. An experienced programmer can immediately see that programming functional I/O methods is notationally even more convenient in a class-based context than in a functional language. In contrast to functions, methods are defined in a context where all the pieces of a world are accessible as fields.

Consider the left-hand side of figure 7. It displays a version of the UFO program in PLT Scheme's class system (Flatt et al. 2006).8 The functions from section 4 have been turned into methods of the classes world% and ufo%. Each event-handling method returns a new instance of the class. Instead of selectors, the methods use field names to access the current values of the world state. Furthermore, the world% class is derived from an abstract class that provides default functionality for all event handlers and the imperative functionality for connecting event-handling methods to the machine's devices. It naturally motivates inheritance and overriding.

Finally, while an applicative world design with classes is notationally superior to a structure-based design, it still suffers from the notational overhead of creating new objects for every transformation. The move/tick ("move per tick") method in ufo%, for instance, copies both the dx and the dy field into the new instance. Compare this method with move/tick in the imperative variant of ufo% on the right-hand side of figure 7. In general, the transition from a state-transforming functional program to an imperative object-oriented program is straightforward and easy to explain, and it thus clarifies to students how the design principles of their first, functional experience carry over to the languages they expect to encounter in college.
concrete types (images). This concreteness enables UNIVERSE programmers to test all functions of an interactive graphical program, including those that produce output. Contrast this situation with the use of an abstract device type in Clean and of the I/O monad in Haskell. The testing of I/O functions in such a framework is similar to the testing of imperative procedures, requiring elaborate set-up and tear-down steps. We consider this activity out of reach for middle school students and distracting for courses that focus on design.

Functional reactive programming (FRP) (Elliot and Hudak 1997) overcomes this problem by enabling programmers to write in a functional style over imperative values (event streams, behaviors). The programmer effectively describes a dataflow graph via expression dependencies; the run-time system updates values using this graph. While programming with event streams and behaviors is truly elegant, our pedagogic experience has been that the necessity of operators like switch puts it out of the reach of novices. Technically, FRP also has the disadvantage of requiring devices to be adapted to behave as reactive elements, a research problem that has been solved only partially (Ignatoff et al. 2006).

Erlang (Armstrong et al. 1996) factors its I/O framework in a different but related manner. A distributed program in Erlang also consists of world-transforming event handlers, though such a program also needs a process-local loop to keep track of the state. Our UNIVERSE library naturally separates these two concerns by factoring out the common loop from the server and the participants.

From a pedagogical perspective, van Dam and his colleagues (1987, 1995) pioneered the event-oriented approach for teaching novices in the 1980s, but via imperative object-oriented programming. Bruce et al. (2001, 2004) resumed this direction in the early 2000s. We consider the functional alternative presented here even more useful than an imperative, object-oriented approach. On one hand, a functional approach is close to the mathematics that students encounter, meaning our approach promises a straightforward skill transfer. While we have only anecdotal evidence so far, we are convinced that a formal evaluation would confirm this conjecture. On the other hand, we consider object-oriented programming for novices overkill because beginners don't have programs of enough complexity to benefit from the structuring that object-orientation provides and demands.

Chakravarty and Keller (2004) share our analysis concerning the teaching of functional programming languages in the first course as well as the problems of Haskell I/O. Their reaction is to turn this weakness of Haskell into an advantage. Specifically, the course switches perspective, emphasizing the imperative character of I/O actions and the need for ordering actions. While we acknowledge the pedagogical need for a transition to imperative programming, we consider this strategy a kludge and prefer the systematic approach via objects explained in section 6.3. After all, postponing I/O suggests that functional programming can't cope with the full spectrum of programming tasks and fails to exploit it for the motivational aspects of assignments.

An alternative and appealing solution is due to Achten (2008), who packaged up one special-purpose case study (playing soccer) along the lines of our framework. Sadly, focusing on soccer limits the appeal of the framework to certain cultures and countries.
Finally, Hudak and Peterson each briefly taught Haskell-based functional programming to small groups of selective middle school and high school students. Both arranged lectures around Haskore and Pan but did not use any texts [Hudak and Peterson, independent personal communication, Feb. 2009.]
7. Related Work

From a technical perspective, the Clean Event I/O system (Achten and Plasmeijer 1995) comes closest to our approach.9 The Clean programming language supports so-called abstract I/O devices to which programs attach event handlers. In contrast to our event handlers, a Clean event handler has the following signature:

;; WorldSt × ∗DeviceSt → WorldSt × ∗DeviceSt

where the DeviceSt type represents the state of an abstract I/O device. The ∗ notation on a type adds a linearity constraint on the type; the type system enforces this linearity constraint for the matching function parameter. For event handlers, the linearity constraint means that reading and writing to the I/O devices is enabled and translated into efficient imperative actions. Naturally, the linearity constraint also has implications for the design and organization of event handlers, making them look like imperative functions.

Our I/O framework supports only devices (windows, keyboards, mouse clicks, clocks) whose state can be supplied all at once when an event handler is invoked. Conversely, if a state needs to change, the event handlers don't write to the device. Instead, the library uses an orthogonal rendering function to translate the state into an image that it displays, or it allows event handlers to return an additional value that it writes to a TCP port. In short, because our framework completely decouples event processing from writing to a device, there is no need in our framework to use linearity types and to thread the state of a device through an event handler.

An additional difference between Clean and UNIVERSE concerns the nature of the devices. In Clean, I/O devices are abstract types; in UNIVERSE the rendering functions translate states into

8 In our courses and workshops, we use Java.
9 Achten (with Wierich, 2000) turned the Clean Event I/O system into the Clean Object I/O system and later ported it to Haskell (Achten and Jones 2001). Daan Leijen provided a binding to the wx media kit, now known as the wxHaskell toolkit [Achten, personal communication, Feb. 2009].
8. Conclusion
Our work demonstrates that with a suitable I/O framework, purely functional programming is an engaging medium for students of all ages. The Bootstrap effort routinely guides middle school students
without apparent mathematical talent to write interactive games in a language that is basically equivalent to high school algebra. For freshman students, we exploit the same framework to simultaneously strengthen their mathematical skills and to introduce them to the basics of program design. In one second-semester course, students even use an automatic theorem prover to establish interesting properties of such interactive games. At the same time, event-driven programming can also be used to prepare freshmen for a course on object-oriented programming.

Our work relies on two key insights and one technicality. First, it is important to leave the translation of external information into internal data (structures), and vice versa, to the framework. As far as students are concerned, these are tasks that the computer and/or the operating system takes on for the program. Second, the framework must separate event handling (as state transitions) from rendering (from states to images, sounds, or message transmission). This separation of concerns empowers novice programmers to design one function per task, without worrying about ordering any computational actions. One DrScheme-specific technicality facilitates the second step: turning images into first-class values. Although inserting images into programs and dealing with them directly at an interactive read-eval-print loop can be especially helpful, we don't expect this technicality to be critical for an adaptation of our approach to other functional languages.

In short, we conjecture that every functional language can easily supplement its I/O system with a library such as ours and could thus become an appealing medium for a range of educational applications.
Kim B. Bruce, Andrea Danyluk, and Thomas P. Murtagh. Event-driven programming is simple enough for cs1. SIGCSE Bull., 33(3):1–4, 2001. Kim B. Bruce, Andrea Danyluk, and Thomas P. Murtagh. Event-driven programming facilitates learning standard programming concepts. In Object-oriented programming systems, languages, and applications: Educators Symposium, pages 96–100, 2004. Manuel Chakravarty and Gabriele Keller. The risks and benefits of teaching purely functional programming in first year. J. Funct. Program., 14(1): 113–123, 2004. Koen Claessen and John Hughes. QuickCheck: a lightweight tool for random testing of Haskell programs. In ACM SIGPLAN International Conference on Functional Programming, pages 268–279, 2000. Carl Eastlund. DoubleCheck your theorems. In Proc. 8th Intern. Works. ACL2 and its Applications, pages 41–46. Lulu Press, 2009. Carl Eastlund and Matthias Felleisen. Automatic verification for interactive graphical programs. In Proc. 8th Intern. Works. ACL2 and its Applications, pages 33–41. Lulu Press, 2009. Carl Eastlund, Dale Vaillancourt, and Matthias Felleisen. ACL2 for freshmen: First experiences. In Proc. 7th Intern. ACL2 Symposium, pages 200–211. ACM Press, 2007. Conal Elliot and Paul Hudak. Functional reactive animation. In ACM SIGPLAN International Conference on Functional Programming, pages 196–203, 1997. Matthias Felleisen, Robert Bruce Findler, Matthew Flatt, and Shriram Krishnamurthi. How to Design Programs. MIT Press, 2001. URL http://www.htdp.org/. Matthias Felleisen, Robert Bruce Findler, Matthew Flatt, and Shriram Krishnamurthi. The TeachScheme! project: Computing and programming for every student. Computer Science Education, 14:55–77, 2004a.
Acknowledgments We gratefully acknowledge the help of many people: Carl Eastlund for feedback on the design and for discussions concerning its logical content; Kathi Fisler for using experimental releases of the library in her courses; Emmanuel Schanzer for creating and coordinating the Bootstrap outreach program; and Danny Yoo for extending the library with hierarchical GUI features.
Matthias Felleisen, Robert Bruce Findler, Matthew Flatt, and Shriram Krishnamurthi. The structure and interpretation of the computer science curriculum. J. Funct. Program., 14(4):365–378, 2004b. Robert Findler, John Clements, Cormac Flanagan, Matthew Flatt, Shriram Krishnamurthi, Paul Steckler, and Matthias Felleisen. DrScheme: A programming environment for Scheme. J. Funct. Program., 12(2):159– 182, March 2002.
References

Peter Achten. Teaching functional programming with soccer-fun. In Proc. 2008 International Workshop on Functional and Declarative Programming in Education, pages 61–72, 2008. Peter Achten and Simon L. Peyton Jones. Porting the Clean object I/O library to Haskell. In IFL '00: Selected Papers from the 12th International Workshop on Implementation of Functional Languages, pages 194–213, London, UK, 2001. Springer-Verlag. Peter Achten and Marinus J. Plasmeijer. The ins and outs of Clean I/O. J. Funct. Program., 5(1):81–110, 1995. Peter Achten and Martin Wierich. A tutorial to the Clean Object I/O library (version 1.2). Technical report, University of Nijmegen, February 2000. Joe Armstrong, Robert Virding, Claes Wikström, and Mike Williams. Concurrent Programming in Erlang (2nd Edition). Prentice-Hall, 1996. Bird and Wadler. Introduction to Functional Programming (2nd Edition). Prentice Hall PTR, 1998. Robert S. Boyer and J Strother Moore. Mechanized reasoning about programs and computing machines. In R. Veroff, editor, Automated Reasoning and Its Applications: Essays in Honor of Larry Wos, pages 146–176. The MIT Press, Cambridge, Massachusetts, 1996. URL citeseer.ist.psu.edu/boyer96mechanized.html.
Matthew Flatt and Matthias Felleisen. Units: Cool modules for HOT languages. In ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 236–248, June 1998. Matthew Flatt, Robert Bruce Findler, and Matthias Felleisen. Scheme with classes, mixins, and traits. In Asian Symposium on Programming Languages and Systems (APLAS) 2006, pages 270–289, November 2006. Paul Hudak. The Haskell School of Expression: Learning Functional Programming through Multimedia. Cambridge Univ. Press, 2000. Graham Hutton. Programming in Haskell. Cambridge Univ. Press, 2007. Daniel Ignatoff, Gregory H. Cooper, and Shriram Krishnamurthi. Crossing state lines: Adapting object-oriented frameworks to functional reactive languages. In International Symposium on Functional and Logic Programming, pages 259–276, 2006. Bryan O’Sullivan, Donald Stewart, and John Goerzen. Real World Haskell. O’Reilly Media, Inc., 2008. Kris Powers, Stacey Ecott, and Leanne Hirshfield. Through the looking glass: teaching CS0 with Alice. SIGCSE Bulletin, 39(1):213–217, 2007. Simon Thompson. Haskell: the Craft of Functional Programming. Addison Wesley Longman Publishing Co., Inc., 1997.
Experience Report: Embedded, Parallel Computer-Vision with a Functional DSL Ryan R. Newton
Teresa Ko
MIT CSAIL, Cambridge, MA, USA
[email protected]
UCLA Vision Lab, Los Angeles, CA, USA
[email protected]
Abstract

This paper presents our experience using a domain-specific functional language, WaveScript, to build embedded sensing applications used in scientific research. We focus on a recent computer-vision application for detecting birds in their natural environment. The application was ported from a prototype in C++. In reimplementing the application, we gained a much cleaner factoring of its functionality (through higher-order functions and better interfaces to libraries) and a near-linear parallel speed-up with no additional effort. These benefits are offset by one substantial downside: the original vision researchers' lack of familiarity with the language; they understandably tried to use it in the familiar way they use C++ and thus ran into various problems.
Figure 1. Example background subtraction results.
the WaveScript implementation, but rather the software engineering impact of specific language features on the implementation of our application, namely: (1) multi-stage programming; (2) higherorder functions; (3) parametric and ad-hoc polymorphism; and (4) shared-nothing message-passing parallelism. But first we need to describe the application itself.
Categories and Subject Descriptors: D.3.2 Concurrent, distributed, and parallel languages; Applicative (functional) languages; Data-flow languages
General Terms: Design, Languages, Performance
Keywords: stream processing languages, computer vision
2. James Reserve Vision Application
A number of pertinent questions about the impact of climate change on our ecosystem are most readily answered by visually monitoring fine-scale interactions between animals, plants, and their environment. For example, species distribution, feeding habits, and the timing of plant blooming events are often best observed through visual sensing. Some quantities, such as the CO2 intake of plants, have no "in the wild" sensor and can only be captured through visual sensing. In this paper, we focus on detecting birds at a feeder station in the wild with a network of cameras. Bird populations are particularly informative about changes in the ecosystem, as species distributions can quickly change due to their mobility.

The camera infrastructure used is part of the James Reserve Wildlife Observatory. A feeder station was constructed and equipped with a webcam and server. It captures a frame a second at 704x480 pixels. There is inherent pressure to increase a camera's coverage at the cost of reducing the size of the objects of interest in the image, thereby creating a more challenging detection and recognition task. Similarly, increasing temporal coverage (battery lifetime) pushes for lower sampling rates, limiting the applicable methods. The resulting image sequence will inevitably have small birds with few features to distinguish them from one another or from the background, and instances of the same bird being in a completely different location in consecutive frames. Our vision system is able to identify instances of a single bird in spite of these challenges.

The system consists of two major components: background subtraction and bird classification. The case study in this paper will focus on the background subtraction component, because it is both computationally intensive and a substantial improvement over the state of the art in this domain.
1. Introduction
A sensor network deployment typically involves a collaboration between domain experts and computer scientists (though the latter would ideally be optional). The domain experts are often programmers themselves, typically building prototypes in Matlab or C++. The expertise in short supply is in embedded software development. Therefore, tools that make it easier to transition from prototypes to embedded code are of great value.

Functional programming is not known for its use in embedded programming, to say the least. But following the recent trend in two-stage domain-specific languages (DSLs) (3; 7; 5), we find that a stream-processing DSL can retain the software engineering benefits of functional programming (in the metaprogram), while generating good embedded code, exploiting parallelism, and partitioning programs transparently across embedded devices and more powerful "servers".

In this paper we describe our experience using the WaveScript language to implement the latest version of our computer vision application for sensor networks. This paper is not about
2.1 Background on Background
Natural environments such as the forest canopy present an extreme challenge to background subtraction because the foreground objects, by necessity, blend with the background, and the background itself changes due to the motion of the foliage and the rapid transition between light and shadow. For instance, images of birds at a feeder station exhibit a larger per-pixel variance due to changes in the background than due to the presence of a bird. Rapid background adaptation fails because birds, when present, are often moving less than the background and often end up being incorporated into it.

Our background subtraction approach is based on building a model of the colors seen in the neighborhood around each pixel and then computing the difference between each new pixel value and the historical model for that pixel's neighborhood. Therefore, the algorithm must maintain state for each pixel in the image (its model) and traverse each input image, comparing each new pixel against the model, and updating the model based on the values of surrounding pixels. An example result can be seen in Figure 1.

The background model for the pixel located at the ith row and jth column is in general a non-parametric density estimate, denoted by p_{ij}(x). The feature vector, x ∈ R^3, is a colorspace representation of the pixel value. For computational reasons, we consider the simplest estimate, given by the histogram

p_{ij}(x) = \frac{1}{|S|} \sum_{s \in S} \delta(s - x),    (1)
Figure 2. Background Subtraction: Lines of code. [Bar chart comparing the Original, Refactored, and PixelLevel variants, each bar broken into Data Acquisition, Core Algorithm, and Setup.]

Each work function is an imperative routine that processes a single stream element, updates the private state for that dataflow operator, and produces elements on output streams. The job of the WaveScript front-end is to partially evaluate the source (meta) program to create the dataflow graph, whereas the WaveScript backend performs graph optimizations, profiles and partitions graphs across devices (6), and reduces work functions to an intermediate language that can be fed to a number of backend code generators. The final intermediate language for work functions is a monomorphic first-order language that is easily retargetable to any platform that has a C-compiler (and many that don't). WaveScript currently supports many embedded platforms including TinyOS "motes", smartphones running JavaME, iPhones, and embedded Linux devices such as routers.

WaveScript is used for embedded sensing applications that involve digital signal processing together with more irregular event processing. For example it has been used for acoustic localization of wild animals (2) and detection of potholes with sensor-equipped taxicabs (4).

WaveScript itself is essentially an ML-dialect with a C-like syntax, a special form for accessing first-class streams, and miscellaneous extensions (e.g., extensible records). A top-level source program returns a stream value. Timers and drivers for hardware sensors provide stream sources, and a pair of primitives are the sole means of processing streams: merge combines streams in real-time order of arrival, and iterate is a "for-each" style construct whose evaluation creates a new dataflow operator and provides its work function, and whose return value is a new stream. The user manual contains details (1).
where S_τ, the set of pixel values contributing to the estimate, is defined as
The Bhattacharyya distance between q_{ij,\tau}(x) and the corresponding background model distribution for that location, p_{ij,\tau-1}(x), calculated from the previous frames, is computed to determine the foreground/background labeling. The Bhattacharyya distance between two distributions is given by

d = \int_X \sqrt{ p_{ij,\tau-1}(x)\, q_{ij,\tau}(x) }\; dx,   (5)
4. Implementation in WaveScript
The application was ported to WaveScript from a prototype in C++. Figure 2 shows a breakdown of how lines of code were spent in both the C++ and WaveScript versions of the application. The porting process consisted of four steps:
1. Port code verbatim to WaveScript.
where X is the range of valid x’s and d ranges from 0 to 1. Larger values imply greater similarity in the distribution. A threshold on the computed distance, d, is used to distinguish between foreground and background. While subtle, this combination of background model and classifier allows for large articulated movements in the background to be ignored and small foreground objects to be detected.
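As a sanity check on the two ends of this range (my illustration, not spelled out in the text), treat the histograms as discrete distributions that each sum to one; then identical distributions give the maximal distance and disjoint supports give zero:

% Illustrative limiting cases, assuming normalized histograms.
d = \sum_{x} \sqrt{ p_{ij,\tau-1}(x)\, q_{ij,\tau}(x) }
  = \begin{cases}
      \sum_x p_{ij,\tau-1}(x) = 1 & \text{if } q_{ij,\tau} = p_{ij,\tau-1} \text{ (pure background)}, \\
      0 & \text{if the supports are disjoint (pure foreground)}.
    \end{cases}

So a patch whose current histogram closely matches its stored model scores near 1 and is labeled background, while a small d flags foreground.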
where x_t(a, b) is the colorspace representation of the pixel at the ath row and bth column of the image taken at time t. The feature vector, x, is quantized to better approximate the true density. To detect foreground at time τ, a distribution, q_{ij,\tau}(x), is similarly computed for the pixel located in the ith row and jth column using only the image at time τ according to

q_{ij,\tau}(x) = \frac{1}{|S_\tau|} \sum_{s \in S_\tau} \delta(s - x),   (3)
S_\tau = \{ x_\tau(a, b) : |a - i| < C,\; |b - j| < C \},   (4)
where S, the set of pixel values contributing to the estimate, is defined as

S = \{ x_t(a, b) : |a - i| < C,\; |b - j| < C,\; 0 \le t < T \},   (2)
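For orientation, the sizes of these sample sets follow directly from the window bounds (my arithmetic, ignoring image-boundary effects): |a - i| < C admits 2C - 1 rows, |b - j| < C admits 2C - 1 columns, and S additionally ranges over T frames, so

|S| = (2C - 1)^2 \, T, \qquad |S_\tau| = (2C - 1)^2 .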
2. Factor duplicated code using higher-order functions.
3. Remove unnecessary floating point.
4. Parameterize design; expose parallelism.

The most interesting step is exposing parallelism. The algorithm is clearly data parallel. In fact, a separate process can compute each pixel's Bhattacharyya distance (and update each pixel's model) independently. But the data-access pattern is non-trivial. To update each pixel's model, each process must read an entire patch of pixels from the image around it. Thus, tiling the matrix and assigning tiles to worker threads is complicated by the fact that such tiles must
3. WaveScript, Briefly
A WaveScript program constructs a dataflow graph of stream operators that executes in a non-synchronous (event-driven) manner. Each operator consists of a work function and optional private state.
Figure 4. Single-threaded performance of ported versions vs. original C++ version. Shows average time for processing each frame on a 3.2 GHz Xeon machine. (Variants shown: original, verbatim port, factored, flattened, fixed-point; the y-axis is execution time in milliseconds per frame.)

populateBg is called repeatedly with a series of Images to ready
the background model before processing (classifying) new frames. In this version of the code, no significant type abstraction has yet been applied. The “model” for each pixel is a three-dimensional histogram in color space (the argument to populateBg is four dimensional to include a 3D histogram for each pixel in the image). The background subtraction algorithm as a whole consists of 1300 lines of code containing three functions very much like populateBg. A second function updates the model for each new frame, and a third compares each new image with the existing model to compute Bhattacharyya distances for each pixel. These functions traverse both the input image and stored histograms. The performance of the original version, initial port, and subsequent refactorings is illustrated in Figure 4. Generally speaking, excluding automatic memory management, the object-language generated by WaveScript shares most of the characteristics of C code.
Figure 3. An excerpt from the verbatim port.
overlap so each pixel may reach its neighbors. For these reasons, it is not straightforward to implement this algorithm in most stream processing languages. For example, stream processing languages tend to require that the input to each stream operator be a linear sequence of data. Exposing parallelism and exploiting locality in the background subtraction stage then requires an appropriate serialization of the matrix (for example, using Morton-order matrices), but this in turn creates complicated indexing. It is reasonable to say that the stream-processing paradigm is not a natural fit for parallel matrix computations. Yet it can be made to work using a high-level streaming language with full datatypes (algebraic datatypes, dynamic allocation) and the additional power of meta-programming.
for r = 0 to rows-1 {
  // create the leftmost pixel's histogram from scratch
  c :: Int = 0;
  roEnd = r - offset + SizePatch;  // end of patch
  coEnd = c - offset + SizePatch;  // end of patch
  for ro = r-offset to roEnd-1 {   // cover the row
    roi = if ro < 0 then -ro-1 else
          if ro >= rows then 2*rows-1-ro else ro;
    for co = c-offset to coEnd-1 { // cover the col
      coi = if co < 0 then -co-1 else
            if co >= cols then 2*cols-1-co else co;
      // get the pixel location
      i = (roi * cols + coi) * 3;
      // figure out which histogram bin:
      binB = (Int)((Float)image[i]   * invsizeBins1);
      binG = (Int)((Float)image[i+1] * invsizeBins2);
      binR = (Int)((Float)image[i+2] * invsizeBins3);
      // add to temporary histogram
      tempHist[binB][binG][binR] += sampleWeight;
    }
  };
  // copy temp histogram to leftmost patch
  for cb = 0 to NumBins1-1 {
    for cg = 0 to NumBins2-1 {
      for cr = 0 to NumBins3-1 {
        bgHist[k][cb][cg][cr] += tempHist[cb][cg][cr];
  }}};
  // increment pixel index
  k += 1;
  // compute the top row of histograms
  for c = 1 to cols-1 {
    ...
    // Here: two more ro/co loops like above.
    // These add and subtract new data from the
    // histogram to update it incrementally.
    ...
  }
}
4.2 Refactoring Code
The next step was to simply clean up the code. Some of this consisted of factoring out simple first-order functions to capture repeated index calculations (a refactoring applied just as easily to the original C++). Other refactorings involved using higher order functions, for example, to encapsulate traversals over the image matrices and thereby remove for-loops and indexing expressions. After refactoring, the code was reduced to 400 lines. The clearer structure of the populateBg function can be seen in Figure 5. Both Figures 3 and 5 contain an optimization: the difference in histograms for neighboring pixels is small, and one can be computed incrementally from the other: adding some samples, removing others, and thereby "sliding" the patch. But this optimization has become much clearer in the structure of Figure 5. In this version we have begun to abstract the types used. Rather than a 4D nested array to represent a 5D space (a matrix of histograms), we use the preexisting WaveScript 2D and 3D matrix libraries. These provide ADTs with multiple implementations, including a WaveScript native one and a Gnu Scientific Library wrapper (which uses BLAS). Not needing linear algebra, we use the former in this paper. Swapping in a flattened row-major representation results in fewer small objects and fewer pointer dereferences, creating the performance improvement shown in Figure 4. (Note that the initial "Factoring" damaged performance, possibly by obscuring backend compiler optimizations.) Also, it's a banal point, but parametric polymorphism for data types is critical. Most likely, the inconvenience of emulating this in C++ (templates) and the lack of
4.1 Porting Verbatim
Because WaveScript has imperative constructs and a C-like concrete syntax, it is straightforward to do a verbatim translation of C or C++ code. This does not in any way extract parallelism (it results in a dataflow graph with one operator). But it is the best way to establish correspondence with the output of the original program and then proceed by correctness preserving refactorings. The original source code possessed substantial code duplication (Figure 3), having mainly to do with repeated processing of nested arrays, including index calculations. The code in Figure 3 is part of the populateBg function, which builds the initial background model for each pixel and takes as input storage space for the models and an input image. It has the following signature:

populateBg :: (Array4D Float, Image) -> ();
type Image    = (RawImage * Int * Int);  // With wid,height
type RawImage = Array Color;
type Array4D t = Array (Array (Array (Array t)));
type PixelHist = Matrix3D Float;

populateBg :: (Matrix PixelHist, Image) -> ();
fun populateBg(bgHist, (image, cols, rows)) {
  // bgHist : background histograms
  // image  : frame of video stream
  asserteq("Image size: ", length(image), rows * cols * 3);
  tempHist = PixelHist:make(rows, cols);
  // Strong assumption about order of matrix traversal:
  Matrix:foreachi(bgHist, rows, cols,
    fun (r, c, bgHistrc) {
      if c == 0
      then initPatch (r, c, rows, cols, tempHist, image)
      else shiftPatch(r, c, rows, cols, tempHist, image);
      // copy temp histogram to leftmost patch:
      Matrix3D:map_inplace2(bgHistrc, tempHist, (+));
    })
}
Figure 5. The populateBg function builds background models (histograms) for the “Patch” centered around each pixel. First it creates a histogram for the leftmost pixel in a row. Then the next pixel’s histogram is calculated incrementally by shiftPatch: (1) removing pixels in the left most col of the previous patch from the histogram and (2) adding pixels in the right most col of the current pixel’s patch to the histogram. The foreachi function (as opposed to foreach) also passes indices for the data being accessed.
Figure 6. An example dataflow graph resulting from using a tile/pixel transform with rows = cols = 2 (generated by the compiler using AT&T GraphViz). Even though the pixels are partitioned into four disjoint tiles (dark purple), an extra margin must be included around each tile (orange) to ensure that each pixel within the tile may access its local neighborhood.
built-in matrix libraries resulted in the use of monomorphic, nested arrays in the original source.
x × y tiles, with an overlap of w pixels in both dimensions. First, the init function is called (at metaprogram evaluation) to initialize the mutable state for each worker. Then, at runtime, each tile is processed by the transform function, producing a new tile. The type signature for tagged_tile_kernel is listed in Figure 7. The "tagged" part is an additional complication introduced by the control-structure of the application. Because there is no shared state between kernels, all data must be passed through streams. Typical of stream-processing applications, there is a tension between dividing an application into finer-grained stream kernels and avoiding complicated data movement between kernels (for example, packing many configuration parameters into the stream).1 In the case of the background subtraction algorithm it is desirable to pass an extra piece of information (tag) from the original stream of matrices down to each individual tile- or pixel-level worker, which is exactly what the interface tagged_tile_kernel allows. For the background subtraction application, an earlier phase of processing determines what mode the computation is in (populating initial background model, or estimating foreground) and attaches a mode flag (boolean) on the stream of images.
4.3 Reducing Floating Point
One of our goals in porting this application was to run it on a wide range of embedded hardware as well as on multicore desktops (and partitioned between the two). In particular, we used Nokia smartphones (N95) with ARM processors lacking floating-point units. Thus, the penultimate step was to reduce the use of floating point calculations (for example, in calculating indices into the color histograms), replacing them with integer or fixed-point calculations. This results in a significant speedup even on desktop machines (Figure 4). WaveScript offered no special support for this refactoring. While it has what amounts to a built-in Num type class, which helps write reusable code, ideally there would be some tool support for this common problem: porting to fixed point, monitoring overflows and quantifying the loss of accuracy.
4.4 Exposing Parallelism, Design Parameterization
Finally, the most interesting part of this case study was using WaveScript to parallelize the application. Fortunately, refactoring for clarity (abstracting data types, matrix transforms) had gotten us most of the way. The essential change was to move away from code handling monolithic matrices (arrays) to expressing the transform locally on image tiles—we use tile to refer to a fixed submatrix of the image—and then finally at the level of the individual pixel (with the stipulation that a pixel transform must also access its local neighborhood). The end result was a reusable library for parallel matrix computations (see parmatrix.ws). The first step is to implement transforms on tiles. From the perspective of the client code, this is the same as doing the transform on the original matrix (only smaller). The library code handles splitting matrices into (overlapping) tiles and disseminating those tiles to workers. The resulting dataflow graph structure is seen in Figure 6. The results of each independent worker are joined, combined into a single matrix, and passed downstream to the remaining computation. In Figure 7 we see the interface for building tile kernels via the function tagged_tile_kernel. Calling tagged_tile_kernel(x, y, w, transform, init) will construct a stream transformer that splits matrices on its input stream into
From tiles to pixels: The next step in building the parmatrix.ws library was to wrap the tile-level operator to expose only a pixel-level transform. The modified interface is shown in Figure 8. Note that the result—a stream transformer on matrices—has the same type as the tile-level version. The core of the background subtraction algorithm, using the pixel-level interface, is shown in Figure 9. There is, however, one problem. When processing image data at the tile level, it is still possible to incrementally update histograms within a tile, albeit with decreased benefit as tiles get smaller. The pixel-level version based on the interface in Figure 8, however, cannot leverage this optimization. We will return to this issue in a moment. In Figure 8, the Nbrhood type is used to represent the method for accessing the pixels in a local vicinity. It is a function mapping
1 This is analogous to the sometimes awkward growth in the number of function arguments in purely functional programs, where function arguments are the sole means of communication between disparate program fragments. Of course this problem can be addressed by structuring techniques, for example, using a Reader monad.
tagged_tile_kernel ::
  // First, provide # X/Y workers and depth of
  // neighborhood access ("overlap") required:
  (Int, Int, Int,
   // Work function at each tile:
   ((tag, st, Tile px) -> Tile px2),
   // Per-tile state initializer:
   (Tile px) -> st)
  -> Stream (tag * Matrix px) -> Stream (Matrix px2);
tagged_pixel_kernel_sliding_nbrhood ::
  (Int, Int, Int,
   // Work function takes pixels in and pixels out.
   // 'carry' will represent the last computed result:
   (tag, st, carry, PixSet px, PixSet px) -> (px2, carry),
   // Per-pixel state initializer, takes indices:
   (Int, Int) -> st,
   // Function to compute the first carry:
   Nbrhood px -> carry)
  -> Stream (tag * Matrix px) -> Stream (Matrix px2);

type Tile t   = (Matrix t * (Int * Int) * (Int * Int));
type PixSet t = ((Int, Int, t) -> ()) -> ()
Figure 7. Signature for tile-level transform. This function creates X×Y workers, each of which handles a region of the input image. The transform is applied to each tile, and may maintain state between invocations, so long as that state is encapsulated in a mutable value passed as an argument to the transform. This assumes the overlap between tiles is the same in both dimensions. A tile is a piece of the original matrix together with metadata to tell where it came from; its fields are: (1) a matrix, (2) tile origin on original matrix, and (3) original image dimensions.
Figure 10. Signature for pixel-level transform supporting incremental computation of result. With PixSets we avoid constructing new sets in memory; rather we let the client "visit" the relevant pixels (and indices). Whole-program function inlining removes any performance penalty with this pattern.

on tagged_tile_kernel, and can only leverage incremental results within a tile.
tagged_pixel_kernel_with_nbrhood ::
  (Int, Int, Int,  // X, Y, overlap
   (tag, st, Nbrhood px) -> px2,
   // Per-pixel state initializer, takes indices:
   (Int, Int) -> st)
  -> Stream (tag * Matrix px) -> Stream (Matrix px2);
5. WaveScript Learning Curve
The core background subtraction algorithm was ported by the first author, who is also the primary implementor of WaveScript, and naturally finds it easy to use. However, as other members of the UCLA group got involved and used WaveScript for other parts of the application, the retraining challenges became clear. There were two main lessons learned, having to do with C-like syntax and quotation-free metaprogramming. First, WaveScript’s syntactic similarity to C-family languages does reduce initial trepidation (even if perhaps it shouldn’t). Certainly, several domain specific languages (e.g., Bluespec (7)) have ended up mimicking C or Verilog syntax for this reason. But we found that it also encouraged attempts to directly reuse inappropriate programming idioms. Setting aside basic misunderstandings of WaveScript constructs (for example, one programmer, unfamiliar with type inference, thought type declarations were necessary for assigning types to variables rather than defining new types), the major problem we found was with the use of mutable state. Of course, many C programmers use mutation by habit. WaveScript’s support for mutable variables and arrays can encourage this. For example, programmers would often declare state globally and attempt to modify it within the work functions for stream operators:
type Nbrhood a = (Int, Int) -> a;
Figure 8. Signature for pixel-level transform.
tagged_pixel_kernel_with_nbrhood(
  workersX, workersY, overlap,
  // This is the work function at each pixel.
  fun (bgEstimateMode, bghist, nbrhood) {
    if bgEstimateMode
    then populatePixHist(bghist, nbrhood);
    else estimateFgPix(bghist, nbrhood);
  },
  // Initialize per-pixel state; create a histogram:
  fun (i, j) Matrix3D:make(NumBins1, NumBins2, NumBins3, 0))
Figure 9. Background subtraction using a pixel-level transform.
myvar = 0;
S2 = iterate x in S1 {
  myvar++;
  ...
}
(x, y) locations onto pixel values. At (0, 0) the function gives the value of the center pixel (the one being transformed and updated). With this we are able to express the background subtraction application as a simple pixel-level transformation. The core of the implementation is shown in Figure 9. Given a boolean tag on the input stream (bgEstimateMode), and a state argument that contains the histogram for just that pixel (bghist), the kernel decides whether to populate the background (populatePixHist) or to estimate the foreground (estimateFgPix).
Dangerously, this example actually works; the WaveScript evaluation model involves reifying a stream value back into code, and whatever state is found in the environment of the closure attached to a dataflow operator (work function) becomes the private state for that operator. However, sharing state between operators (the same mutable object reachable by two closures) is disallowed and will result in a compile-time error. The WaveScript design is predicated on the idea that understanding these meta-program evaluation failures is easier than understanding the error messages generated by a sufficiently sophisticated type system to rule out the errors. Nevertheless, it still helps to teach a “beginner” mode, where a special state keyword forces each operator’s state to be declared in a restricted lexical scope:
Restoring incremental histograms: The final step in porting our background subtraction algorithm was to reenable incremental computation of histograms in spite of the pixel-level interface to images. This is not difficult, but it does make the interface more complex (seen in Figure 10). Rather than direct access to neighboring pixel values, now the pixel kernel sees a sliding patch across the image (currently square, but could be generalized to a rectangle). The carried over result from the last position is used, together with the pixels sliding into view and the pixels sliding out, to compute the new result. The underlying implementation is still based
S2 = iterate x in S1 {
  state { myvar = 0 }
  myvar++;
  ...
}
The second major hurdle was understanding meta-programming in WaveScript. Again, WaveScript assumes that a simple model with some exceptions and corner cases is easier to learn than a more sophisticated one. Specifically, unlike more general meta-
expect to see an outlier at 16 cores where the mapping is one-to-one (and perhaps 8, 4, and 2). Allowing the metaprogram to read the number of CPUs and determine a tile size is an example of the utility of the metaprogram as a separate phase that precedes the WaveScript compiler’s own dataflow-graph scheduling and optimization. Still, it would be ideal to expose more of this tuning problem to the compiler itself. In the future, perhaps a method will be discovered to expose the compiler’s own profile-driven optimization and auto-tuning process so that the programmer may delegate the tile-size decision. This application also turned out to be a good candidate for distributed (inter-device) partitioning. After performing background subtraction most of the image is simply blacked out; with simple run-length encoding, these frames require much less network bandwidth when sent back to the server over a wireless network. Moreover, once the parmatrix.ws library is used, the space of choices becomes more continuous. Each tile-level worker can be placed on the embedded or server-side. Hosting half the workers results in half the data-reduction at half the cost, and, importantly, half the memory usage for expensive 3D histograms. For a detailed discussion of WaveScript’s partitioning methodology, see (6). Ultimately, while this application was greatly improved during its reimplementation, some of the interfaces we used represent a more imperative formulation than we would like—for example, kernels accepting a mutable state argument rather than producing a fresh state. A pure formulation would be ideal, but we are not currently able to achieve it with the near zero performance penalty that we require. Impure or not, abstracting control-flow was nevertheless valuable in this application.
Figure 11. Parallel speedup: 16 data-parallel workers with variable number of enabled processor cores. Test platform was an AMD Barcelona with four quad-core processors. Cores were disabled using Linux’s /sys/devices/system/cpu interface. The data point at 0 shows the single threaded speed—i.e. the program configured with only a single worker. The drop in performance at data-point 1 is due to the overhead of splitting and reassembling the matrix.
programming languages like MetaML (8), WaveScript does not use explicit quotation and anti-quotation constructs for creating “code values”. Rather, the programmer is told that everything outside of an iterate will evaluate at compile-time. Nonetheless we saw frequent confusion about whether a data structure should be initialized at meta-program evaluation time, or at the beginning of runtime. Also, there is the issue of distinguishing the capabilities of the meta- and object-languages. Like many other two-stage DSLs, WaveScript is an asymmetric meta-programming system, where the meta-language differs from the object-language. (For example, the object-language lacks closures.) Attempting to use meta-language features in the object-language results in a compile-time error (with code location). For example, one programmer tried to read configuration files at runtime, which is not possible if the code is running on an embedded platform such as a phone. One related problem had to do with the foreign-function interface (FFI), which can only be accessed at runtime. Programmers ran into difficulties trying to use the FFI before they had a firm grasp of the language. Some of these uses of the FFI were spurious, while others were an unfortunate consequence of the need to interface with hardware to get off the ground in sensing applications. Possible fixes would include banning the FFI in the aforementioned beginner mode, or restricting its use to strict idioms for data acquisition.
6. Results and Discussion

The end result of this project was a cleaned up implementation that also exposes parallelism. (And a reusable parallel matrix library!) Parallel speedups for the final version of the background subtraction algorithm are shown in Figure 11. These results are generated given a single, fixed 4 × 4 tiling of the matrix (16 tiles) that results in 16 stream operators ("workers") running on 16 threads. Another approach is to have the metaprogram set the tiling parameters based on the number of CPUs on the target machine. But this is complicated by the need to factor the target number of threads into separate x and y components such that xy = numthreads, which results in tiles of varying aspect ratios. Somewhat surprisingly, the operating system does a great job of juggling these 16 threads among a variable number of processor cores, with the exception of the unfortunate case at 11 cores. If the OS were doing poorly, we would

References

[1] WaveScript users manual. http://regiment.us/wsman/
[2] Michael Allen, Lewis Girod, Ryan Newton, Samuel Madden, Daniel T. Blumstein, and Deborah Estrin. VoxNet: An interactive, rapidly-deployable acoustic monitoring platform. In IPSN '08: Information Processing in Sensor Networks, 2008.
[3] Lennart Augustsson, Howard Mansell, and Ganesh Sittampalam. Paradise: A two-stage DSL embedded in Haskell. ICFP Experience Report, pages 225-228, 2008.
[4] Jakob Eriksson, Lewis Girod, Bret Hull, Ryan Newton, Samuel Madden, and Hari Balakrishnan. The Pothole Patrol: Using a mobile sensor network for road surface monitoring. In MobiSys '08: Proceedings of the 6th International Conference on Mobile Systems, Applications, and Services, pages 29-39, New York, NY, USA, 2008. ACM.
[5] Geoffrey Mainland, Greg Morrisett, and Matt Welsh. Flask: Staged functional programming for sensor networks. SIGPLAN Notices, 43(9):335-346, 2008.
[6] Ryan Newton, Sivan Toledo, Lewis Girod, Hari Balakrishnan, and Samuel Madden. Wishbone: Profile-based partitioning for sensornet applications. In NSDI '09: Networked Systems Design and Implementation, 2009.
[7] R. Nikhil. Bluespec System Verilog: Efficient, correct RTL from high level specifications. In Formal Methods and Models for Co-Design (MEMOCODE '04), pages 69-70, June 2004.
[8] W. Taha and T. Sheard. Multi-stage programming with explicit annotations. In Partial Evaluation and Semantics-Based Program Manipulation, Amsterdam, The Netherlands, June 1997, pages 203-217. ACM, 1997.
Runtime Support for Multicore Haskell

Simon Marlow
Simon Peyton Jones
Satnam Singh
Microsoft Research Cambridge, U.K.
[email protected]
Microsoft Research Cambridge, U.K.
[email protected]
Microsoft Research Cambridge, U.K.
[email protected]
Abstract
communication, or synchronisation. They merely annotate subcomputations that might be evaluated in parallel, leaving the choice of whether to actually do so to the runtime system. These so-called sparks are created and scheduled dynamically, and their grain size varies widely. Our goal is that programmers should be able to take existing Haskell programs, and with a little high-level knowledge of how the program should parallelise, make some small modifications to the program using existing well-known techniques, and thereby achieve decent speedup on today’s parallel hardware. However, when we started benchmarking some existing Parallel Haskell programs, we found that many programs which at first glance appeared to be completely reasonable-looking parallel programs, in fact failed to achieve significant speedup when run with our implementation on parallel hardware. This led us to a critical study of our (reasonably mature) baseline implementation of semi-explicit parallelism in GHC 6.10. In this paper we report the results of that study, with the following contributions:
Purely functional programs should run well on parallel hardware because of the absence of side effects, but it has proved hard to realise this potential in practice. Plenty of papers describe promising ideas, but vastly fewer describe real implementations with good wall-clock performance. We describe just such an implementation, and quantitatively explore some of the complex design tradeoffs that make such implementations hard to build. Our measurements are necessarily detailed and specific, but they are reproducible, and we believe that they offer some general insights.

Categories and Subject Descriptors D.3.2 [Programming Languages]: Language Classifications—Applicative (functional) languages; D.3.2 [Programming Languages]: Language Classifications—Concurrent, distributed and parallel languages; D.3.3 [Programming Languages]: Language Constructs and Features—Concurrent programming structures; D.3.4 [Programming Languages]: Processors—Runtime-environments
General Terms: Languages, Performance
• We give a complete description of GHC’s parallel runtime,
starting with an overview in Section 4, and amplified in the rest of the paper. A major constraint is that we barely compromise the (excellent) execution speed of sequential code.
1. Introduction
At least in theory, Haskell has a head start in the race to find an effective way to program parallel hardware. Purity-by-default means that there should be a wealth of inherent parallelism in Haskell code, and the ubiquitous lazy evaluation model means that, in a sense, futures are built-in. How can we turn these benefits into real speedups on commodity hardware? This paper documents our experiences with building and optimising a parallel runtime for Haskell. Our runtime supports three models of parallelism: explicit thread-based concurrency (Peyton Jones et al. 1996), semi-explicit deterministic parallelism (Trinder et al. 1998), and data-parallelism (Peyton Jones et al. 2009). In this paper, however, we focus entirely on semi-explicit parallelism. Completely implicit parallelism is still a distant goal; one recent attempt at this in the context of Haskell can be found in Harris and Singh (2007). The semi-explicit GpH programming model, in contrast, has been shown to be remarkably effective (Loidl et al. 1999, 2003). The semantics of the program remains completely deterministic, and the programmer is not required to identify threads,
• We discuss several major sets of design choices, relating to
spark distribution, scheduling, and memory management (Sections 5 and 7); parallel garbage collection (Section 6); the implementation of mutual exclusion on thunks (Section 8); and load balancing and thread migration (Section 9). In each case we give quantitative measurements for a number of optimisations that we made over our baseline GHC 6.10 implementation. • While we focus mainly on the implementation, our work has
had some impact on the programming model: we identify the need for pseq as well as seq (Section 2.1), and we isolated a significant difficulty in the "strategies" approach to writing parallel programs (Section 7).
• On the way we developed a useful new profiling and tracing
tool (Section 10.1). Overall, our results are encouraging (Figure 1). The optimisations we describe improve the absolute runtime and scalability of all benchmarks, sometimes dramatically so. Before our work, some programs would speed up on a parallel machine, but others slowed down. Afterwards, using 7 cores of an 8-core machine yielded speedups in the range 3x to 6.6x, which is not bad for a modest investment of programmer effort. Some of the improvements were described in earlier work (Berthold et al. 2008), along with preliminary measurements, as part of a comparison between shared-heap and distributed-heap parallel execution models. In this paper we extend both the range
Speedup on 4 cores
Program     Before   After
gray         2.19     2.50
mandel       2.94     3.51
matmult      2.56     3.37
parfib       3.73     3.89
partree      0.74     1.99
prsa         3.28     3.56
ray          0.81     2.11
sumeuler     3.74     3.85
of measurements and the range of improvements, while focussing exclusively on shared-heap execution. All our results are, or will be, repeatable using a released version of the widely-used GHC compiler. Our results do not require special builds of the compiler or libraries: identical results will be obtainable using a standard binary distribution of GHC. At the time of writing, most of our developments have been made public in the GHC source code repository, and we expect to include the remaining changes in the forthcoming 6.12.1 release of GHC, scheduled for the autumn of 2009. The sources to our benchmark programs are available in the public nofib source repository.
Speedup on 7 cores
Program     Before   After
gray         2.61     2.77
mandel       4.50     4.96
matmult      4.07     5.04
parfib       5.94     6.67
partree      0.68     3.18
prsa         5.22     5.23
ray          0.82     3.48
sumeuler     6.32     6.42
Figure 1. Speedup results
2. Background: programming model

The basic programming model is known as Glasgow Parallel Haskell, or GpH (Trinder et al. 1998), and consists of two combinators:

par  :: a -> b -> b
pseq :: a -> b -> b

A strategy is a function that may evaluate (parts of) its argument and create sparks, but has no interesting results:

type Done = ()
done = ()
type Strategy a = a -> Done
The semantics of par a b is simply the value of b, whereas the semantics of pseq is given by

pseq a b = ⊥,  if a = ⊥
         = b,  otherwise

Strategies compose nicely; that is, we can build complex strategies out of simpler ones:

rwhnf :: Strategy a
rwhnf x = x `pseq` done
Informally, par stores its first argument as a spark in the spark pool, and then continues by evaluating its second argument. The intention is that idle processors can find (probably) useful work in the spark pool. Typically the first argument to par will be an expression that is shared by another part of the program, or will be an expression that refers to other such shared expressions.
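As a minimal, self-contained illustration (my example, not from the paper, assuming GHC's -threaded runtime and the parallel package's Control.Parallel module), the spark for a below may be picked up by an idle core while the main thread evaluates b:

import Control.Parallel (par, pseq)

-- Spark the evaluation of 'a', evaluate 'b' locally, then combine the two.
-- Build with -threaded and run with +RTS -N to enable multiple cores.
main :: IO ()
main = print (a `par` (b `pseq` (a + b)))
  where
    a = sum [1 .. 10000000] :: Int
    b = sum [1 .. 20000000] :: Int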
parList :: Strategy a -> Strategy [a]
parList strat []     = done
parList strat (x:xs) = strat x `par` parList strat xs
Finally, we can combine a data structure with a strategy for evaluating it in parallel:

using :: a -> Strategy a -> a
using x s = s x `pseq` x
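A standard GpH idiom, sketched here for illustration (the name parMapStrat is mine), separates the algorithm from its parallelism by combining an ordinary map with a list strategy:

-- 'map f' says what to compute; 'parList rwhnf' says to spark the
-- evaluation of each list element to weak-head normal form.
parMapStrat :: (a -> b) -> [a] -> [b]
parMapStrat f xs = map f xs `using` parList rwhnf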
2.1 The need for pseq
The pseq combinator is used for sequencing; informally, it evaluates its first argument to weak-head normal form, and then evaluates its second argument, returning the value of its second argument. Consider this definition of parMap:
Here is how we might use the combinators to evaluate all the elements of a (lazy) input list in parallel, and then add them up:

psum :: [Int] -> Int
psum xs = sum (xs `using` parList rwhnf)
parMap f []     = []
parMap f (x:xs) = y `par` (ys `pseq` y:ys)
  where y  = f x
        ys = parMap f xs
We now turn our attention from the programming model to the implementation. Our baseline is GHC 6.10.1, a mature Haskell compiler. Its performance on sequential code is very good, so the overheads of parallelism are not concealed by sloppy sequential execution. It has supported parallel execution for several years, but while parallel performance is sometimes good, it is sometimes surprisingly bad. The trouble is that it is hard to know why it is bad, because performance is determined by the interaction of four systems — the compiler itself, the GHC runtime system, the operating system, and the physical hardware — each of which is individually extremely complex. The rest of this paper reports on our experience of improving both the absolute performance and its consistency. To whet your appetite, Figure 1 summarises the cumulative improvement of the work we present, for 4 and 7 cores1 . Each table has 2 columns:
The intention here is to spark the evaluation of f x, and then evaluate parMap f xs, before returning the new list y:ys. The programmer is hoping to express an ordering of the evaluation: first spark y, then evaluate ys. The obvious question is this: why not use Haskell's built-in seq operator instead of pseq? The only guarantee made by seq is that it is strict in both arguments; that is, seq a ⊥ = ⊥ and seq ⊥ a = ⊥. But this semantic property makes no operational guarantee about order of evaluation. An implementation could impose this operational guarantee on seq, but that turns out to limit the optimisations that can be applied when the programmer only cares about the semantic behaviour. Instead, we provide both pseq and seq (with and without an order-of-evaluation guarantee), to allow the programmer to say what she wants while leaving the compiler with as much scope for optimisation as possible. To our knowledge this is the first time that this small but important point has been mentioned in print. The pseq operator first appeared in GHC 6.8.1.
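For contrast, here is a hypothetical variant of parMap written with seq in place of pseq. It is semantically equivalent, but because seq gives no ordering guarantee, the intended sequencing (spark y, then evaluate ys, then return the cons cell) may be lost under optimisation:

import Control.Parallel (par)

-- Same shape as parMap above, but with seq: strict in both arguments,
-- yet the order in which 'ys' and 'y:ys' are evaluated is unspecified.
seqParMap :: (a -> b) -> [a] -> [b]
seqParMap f []     = []
seqParMap f (x:xs) = y `par` (ys `seq` y:ys)
  where y  = f x
        ys = seqParMap f xs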
3. Making parallel programs run faster
1 Why did we use only 7 of the 8 cores on our test system? In fact we did perform the measurements for all 8 cores, but found that the results were far less consistent than the 7 core results, and in some cases performance degraded significantly. On closer inspection the OS appeared to be descheduling one or more of our threads, leading to long pauses when the threads needed to synchronise. This effect is discussed in more detail in Section 10.1.
2.2 Strategies
In Algorithms + Strategies = Parallelism (Trinder et al. 1998), Trinder et al explain how to use strategies to modularise the construction of parallel programs. In brief, the idea is as follows.
Figure 2. Speedup results

• Before: speedup achieved by parallel execution using GHC
roughly one for each physical CPU. This overall scheme is well-established, but it is easier to sketch than to implement! Each Haskell thread runs on a finite-sized stack, which is allocated in the heap. The state of a thread, together with its stack, is kept in a heap-allocated thread state object (TSO). The size of a TSO is around 15 words plus the stack, and constitutes the whole state of a Haskell thread. A stack may grow by copying the TSO into a larger area, and may subsequently shrink again. Haskell threads are executed by a set of operating system threads, which we call worker threads. We maintain roughly one worker thread per physical CPU, but exactly which worker thread may vary from moment to moment, as we explain in Section 4.2. Since the worker thread may change, we maintain exactly one Haskell Execution Context (HEC) for each CPU.2 The HEC is a data structure that contains all the data that an OS worker thread requires in order to execute Haskell threads. In particular, a HEC contains
6.10.1, compared to the same program compiled sequentially with 6.10.1, with the parallel GC turned off. In GHC 6.10.1 the parallel GC tended to make things worse rather than better, so this column reflects the best settings for GHC 6.10.1. • After: our best speedup results, using the PARGC3 configura-
tion (Section 6). The improvements are substantial, especially for the most disappointing programs which actually ran slower when parallelism was enabled in 6.10.1. Section 10 gives more details about the experimental setup and the benchmark programs. Figure 2 shows the scaling results for each benchmark program after our cumulative improvements, relative to the performance of the sequential version. By "sequential" we mean that the single-threaded version of the runtime system was used, in which par is a no-op, and there are no synchronisation overheads.
• An Ownership Field, protected by a lock, that records which
4. Background: the GHC runtime
worker thread is currently animating the capability (zero if none is). We explain in Section 4.2 why we do not use the simpler device of a lock to protect the entire HEC data structure.
By way of background, we describe in this section how GHC runs Haskell programs in parallel. In the sections that follow we present various measurements to show the effectiveness of certain aspects of our implementation design. Each of our measurements compares two configurations of GHC. Many of our improvements are cumulative and it proved difficult to untangle the source-code dependencies from each other in order to be able to make each measurement against a fixed baseline, so in each case we will clearly state what the baseline is.
• A Message Queue, containing requests from other HECs. For
example, messages saying "Please wake up thread T" arrive here.
• A Run Queue of threads ready to run.
2 In the source code of the runtime system, a HEC is called a "Capability". The HEC terminology comes from the lightweight concurrency primitives work (Li et al. 2007b). Others call the same abstraction a "virtual processor" (Fluet et al. 2008a).
4.1 The basic setup
The GHC runtime system supports millions of lightweight threads by multiplexing them onto a handful of operating system threads,
• An Allocation Area (Section 6). There is a single heap, shared
4 cores
Program      ∆ Time (%)
gray            +6.5
mandel          -3.4
matmult         -2.1
parfib          +3.5
partree         -1.2
prsa            -4.7
ray            -35.4
sumeuler        +0.0
Geom. Mean      -5.5
among all the HECs, but each HEC allocates into its own local allocation area.
• GC Remembered Sets (Section 6.2).
• A Spark Pool. Each invocation of par a b adds the thunk a to
the (current HEC's) Spark Pool; this thunk is called a "spark".
• A Worker Pool of spare worker threads, and a Foreign Outcall
Pool of TSOs that are engaged in foreign calls (see Section 4.2). In addition there is a global Black Hole Pool, a set of threads that are blocked on black holes (see Section 8). An active HEC services work using the following priority scheme. Items lower down the list are only performed if there are no higher-priority items available.
Figure 3. The effect of adding work-stealing queues vs. GHC 6.10.1

This also explains why we don't simply have a mutex protecting the HEC, which all the spare worker threads are blocked on. That approach would afford us less control in the sense that we often want to hand the HEC to a particular worker thread, and a simple mutex would not allow us to do that. Foreign calls are not the focus of this paper, but more details can be found in Marlow et al. (2004).
1. Service a message on the Message Queue.
2. Run a thread on the Run Queue; we use a simple round-robin scheduling order.
3. If any spark pool is non-empty, create a spark thread and start running it (see Section 5.3).
4. Poll the Black Hole Pool to see if any thread has become runnable; if so, run it.
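The priority logic can be summed up in a toy model (mine, written in Haskell for illustration; the real scheduler is C code in the RTS):

-- Each flag says whether the corresponding source of work is available;
-- the first available source in priority order wins.
data Work = ServiceMessage | RunThread | RunSparkThread | WokenBlackHole | Idle
  deriving Show

chooseWork :: Bool -> Bool -> Bool -> Bool -> Work
chooseWork haveMsg haveThread haveSpark blackHoleWoke
  | haveMsg       = ServiceMessage   -- 1. messages first
  | haveThread    = RunThread        -- 2. then the run queue (round-robin)
  | haveSpark     = RunSparkThread   -- 3. then create and run a spark thread
  | blackHoleWoke = WokenBlackHole   -- 4. finally poll the black hole pool
  | otherwise     = Idle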
5. Faster sparks
We now discuss the first set of improvements, which relate to the handling of sparks.
All the state that a HEC needs for ordinary execution of Haskell threads is local to the HEC, so under normal execution a HEC proceeds without requiring any synchronisation, locks, or atomic instructions. Synchronisation is only needed when:
5.1 Sharing sparks
GHC 6.10.1 has a private Spark Pool for each HEC, but it uses a “push” model for sharing sparks, as follows. In between running Haskell threads, each HEC checks whether its spark pool has more than one spark. If so, it checks whether any other HECs are idle (a cheap operation that requires no atomic instructions); if it finds an idle HEC it gives one or more sparks to it, by temporarily acquiring ownership of the remote HEC and inserting the sparks in its pool. To make spark distribution cheaper and more asynchronous we re-implemented each HEC’s Spark Pool as a bounded workstealing queue (Arora et al. 1998; Chase and Lev 2005). A workstealing queue is a lock-free data structure with some attractive properties: the owner of the queue can push and pop from one end without synchronisation, meanwhile other threads can “steal” from the other end of the queue incurring only a single atomic instruction. When the queue is almost empty, popping also incurs an atomic instruction to avoid a race between the popping thread and a stealing thread. When a spark is pushed onto an already full queue, we have a choice between discarding the new spark or discarding one of the older sparks. Our current implementation discards the newer spark; we do not investigate this choice further in this paper. Figure 3 shows the effect of adding work-stealing queues to our baseline GHC 6.10.1. As we can see from the results, work-stealing for sparks is almost always beneficial, and increasingly so as we add more cores. It is of particular benefit to ray, where the task granularity is very small.
• Load balancing is needed (Section 9).
• Garbage collection is required (Section 6).
• Blocking on black holes (Section 8.1).
• Performing an operation on an MVar, or an STM transaction.
• Unblocking a thread on another HEC.
• Throwing an exception to a thread on another HEC, or a blocked thread.
• Allocating large or immovable memory objects; since these operations are relatively rare, we allocate such objects from a single global pool.
• Making a (safe) foreign call (Section 4.2).
7 cores
Program      ∆ Time (%)
gray            -2.1
mandel          -4.9
matmult         +1.6
parfib          -1.2
partree         -3.7
prsa            -6.7
ray            -66.1
sumeuler        +1.4
Geom. Mean     -14.4
4.2 Foreign calls
Suppose that a Haskell thread makes a foreign call to a C procedure that blocks, such as getChar. We do not want the entire HEC to seize up so, before making the call, the worker thread relinquishes ownership of the HEC, leaving the Haskell thread in a tidy state. The thread is then placed in the Foreign Outcall Pool so that the garbage collector can find it. We maintain a Worker Pool of worker threads for each HEC, each eager to become the worker that animates the HEC. When one worker relinquishes ownership, it triggers a condition variable that wakes up another worker from the Worker Pool. If the latter is empty, a new worker is spawned. What happens when the original worker thread W completes its call to getChar and wants to return? To return, it must re-acquire ownership of the HEC, so it must somehow displace any worker thread X that currently owns the HEC. To do so, it adds a message to the HEC’s Message Queue. When X sees this message, it signals W, and returns itself to the worker pool. Worker thread W wakes up, and takes ownership of the HEC. This approach is slightly better than directly giving W ownership of the HEC, because W might be slow to respond, and the HEC does not remain locked for the duration of the handover.
5.2 Choosing a spark to run
Because we use a work-stealing queue for our spark pools, stealing threads must always take the oldest spark in the pool. However, the HEC owning the spark pool has a choice between two policies: it can take the youngest spark from the pool (LIFO), or it can take the oldest spark (FIFO). Taking the oldest spark requires an atomic instruction, but taking the youngest spark does not. Figure 4 shows the effect of changing the default (FIFO) to LIFO. In most of our benchmarks this results in worse performance, because the older sparks tend to be “larger”.
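A divide-and-conquer program makes this concrete. In the hypothetical sketch below (modelled loosely on the parfib benchmark named in Figure 1; the cutoff value is illustrative), the spark created at each call covers the larger of the two remaining subproblems, so the oldest sparks in the pool correspond to the biggest units of work:

import Control.Parallel (par, pseq)

-- The spark for parfib (n-1) is created before the recursion descends into
-- parfib (n-2), so earlier (older) sparks cover larger subtrees.
parfib :: Int -> Integer
parfib n
  | n < 25    = nfib n
  | otherwise = x `par` (y `pseq` x + y)
  where
    x = parfib (n - 1)
    y = parfib (n - 2)
    nfib m
      | m < 2     = fromIntegral m
      | otherwise = nfib (m - 1) + nfib (m - 2)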
4 cores
Program      ∆ Time (%)
gray           +10.1
mandel          +4.5
matmult         -0.4
parfib          +0.2
partree         -7.3
prsa            -0.3
ray            +10.0
sumeuler        +0.3
Geom. Mean      +2.0
7 cores
Program      ∆ Time (%)
gray            +5.3
mandel          +6.2
matmult         -1.2
parfib          +1.0
partree         -1.6
prsa            -2.3
ray            +18.7
sumeuler       +17.9
Geom. Mean      +5.2
which was to create a new thread for each activated spark. Our baseline for these measurements is GHC 6.10.1 plus work-stealing queues (Section 5.1). Batching sparks is particularly beneficial to two of our benchmarks, matmult and ray, while it is a slight pessimisation for parfib on 7 cores. For ray the rationale is clear: there are lots of tiny sparks, so reducing the overhead for spark execution has a significant effect. For parfib we believe that the reduction in performance shown here is because the program is actually being more effective at exploiting parallelism, which leads to reduced performance due to lack of locality (Section 6); as we shall see later, this performance loss is recovered by proper use of parallel GC.
Figure 4. Using LIFO rather than FIFO for local sparks

4 cores
Program      ∆ Time (%)
gray            -0.7
mandel          -1.5
matmult         -8.2
parfib          +0.6
partree         -0.6
prsa            -1.0
ray            -31.2
sumeuler        +0.2
Geom. Mean      -5.9
7 cores
Program      ∆ Time (%)
gray            +2.3
mandel          +1.3
matmult        -11.3
parfib          +6.4
partree         +0.3
prsa            -1.0
ray            -24.3
sumeuler        -0.3
Geom. Mean      -3.8
Figure 5. The effect of batching sparks
6. Garbage collection
The shared heap is divided into fixed-size (4kbyte) blocks, each with a block descriptor that specifies which generation it belongs to, along with other per-block information. A HEC’s Allocation Area simply consists of a list of such blocks. When any HEC’s allocation area is exhausted, a garbage collection must be performed. GHC 6.10.1 offers a parallel garbage collector (see Marlow et al. (2008)), but GC only takes place when all HECs stop together, and agree to garbage collect. We aim to keep this synchronisation overhead to a minimum by ensuring that we can stop a HEC quickly (Section 6.3). In future work we plan to relax the stop-the-world requirement and adopt some form of CPU-independent GC (Section 12.1). When a GC is required, we have the option of either • Performing a single-threaded GC. In this case, the HEC that
5.3 Batching sparks
initiated the GC waits for all the other HECs to cease execution, performs the GC, and then releases the other HECs.
To run a spark a, a HEC simply evaluates the thunk a to head normal form. To do so, it needs a Thread State Object. It makes no sense to create a fresh TSO for every spark, and discard it when the evaluation is complete for the garbage collector to recover. Instead, when a HEC has no work to do, it checks whether there are any sparks, either in the HEC's local spark pool or in any other HEC's spark pool (checking the non-empty status of a spark pool does not require a lock). If there are sparks available, then the HEC creates a spark thread, which is a perfectly ordinary thread except that it runs the following steps in a loop:
• Performing a parallel GC. In this case, the initiating HEC sends
a signal to the other HECs, which causes them to become GC threads and await the start of the GC. Once they have all responded, the initiating HEC performs the GC initialisation and releases the other GC threads to perform GC. When the GC termination condition is reached, each GC thread waits at the GC exit barrier. The initiating HEC performs any post-GC tasks (such as starting finalizers), and then releases the GC threads from the barrier to continue running Haskell code.
1. If the local Run Queue or Message Queue is non-empty, exit.
2. Remove a spark from the local spark pool, or if that is empty, steal a spark from another HEC's pool.
3. If there were no sparks to steal, exit.
4. Evaluate the spark to weak-head-normal form.

In a single-threaded program, it is often better to use single-threaded GC for the quick young-generation collections, because the cost of starting up and shutting down the GC threads can outweigh the benefits of doing GC in parallel.

6.1 Avoiding synchronisation in parallel copying GC
Parallel copying GC normally requires each GC thread to use an atomic instruction to synchronise when copying an object, so that objects are not accidentally duplicated. The cost of these atomic instructions is high: roughly 30% of GC time (Marlow et al. 2008). However, as we suggested in that paper, it is possible to relax the synchronisation requirement where immutable objects are concerned. The only adverse effect from making multiple copies of an immutable object is a little wasted space, and we know from measurements that the rate of actual collisions is very low–typically less than 100 collisions per second of GC time–so the amount of wasted space is likely to be minuscule. Our parallel GC therefore adopts this policy, and avoids synchronising access to immutable objects. Figure 6 compares the two policies: the baseline is our current system in which we only lock mutable objects, compared to a modified version in which we lock every object during GC. As the results show, our optimisation of only locking mutable objects has a significant benefit on overall performance: without it, performance drops by over 7%. The effect
where “exit” means that the spark thread exits and performs no further work; its TSO will be recovered by a subsequent GC. A spark thread will therefore evaluate sparks to WHNF one after another until it can find no more sparks, or until there is other work to do, at which point it exits. This is a particularly simple strategy and works well: the cost of creating the spark thread is amortized over multiple sparks, and the spark thread gets out of the way quickly if any other work arrives. If a spark blocks on a black hole, since the spark thread is just an ordinary thread it will block in the usual way, and the scheduler will create another spark thread to continue running the available sparks. We don’t have to worry unduly about having too many spark threads, because a spark thread will always exit when there are other threads around. This reasoning does rely on sparks not being too large, however: many large sparks becoming blocked could lead to a large number of running spark threads. Figure 5 compares the effect of using the spark-batching approach described above to the approach taken in GHC 6.10.1,
7 cores
Program      ∆ Time (%)
gray           +18.7
mandel          +9.4
matmult         +4.5
parfib          -0.5
partree        +17.3
prsa            +2.5
ray             +4.8
sumeuler        +5.8
Geom. Mean      +7.6
4 cores
Program      ∆ Time (%)
gray            -3.9
mandel          -1.4
matmult         -0.5
parfib          -0.5
partree         +5.0
prsa            +2.3
ray             +3.4
sumeuler        -0.3
Geom. Mean      +0.5
Figure 7. Using the heap-limit for context switching
Figure 6. Effect of locking all closures in the parallel GC
is the most marked in benchmarks that do the most GC: gray, but is negligible in those that do very little GC: parfib.
7 cores
Program      ∆ Time (%)
gray            -9.7
mandel          -2.1
matmult         +0.0
parfib          +0.4
partree         +0.3
prsa            -0.1
ray             -2.8
sumeuler        -1.7
Geom. Mean      -2.0
6.3 Pre-emption and garbage collection
Since garbage collection is relatively frequent, and requires all HECs to halt, it is important that they all do so promptly. One way to do this would be to use time-based pre-emption; however that would essentially mandate the use of conservative GC, which we consider an unacceptable compromise. Hence in order to GC, we require that all HECs voluntarily yield at a safe point, leaving the system in a state where all the heap roots can be identified. The standard way to indicate to a running thread that it should yield immediately is to set its heap-limit register to zero, thus causing the thread to return to the HEC scheduler when it next tries to allocate. On a register-poor machine, we keep the heap-limit “register” in memory, in a block of “registers” pointed to by a single real machine register. In this case, it is easy for one HEC to set another HEC’s heap limit to zero, simply by overwriting the appropriate memory location. On a register-rich machine we can keep the heap limit in a real machine register, but it is then a good deal more difficult for one HEC to zap another HEC’s heap limit, since it is part of the register set of a running operating-system thread. We therefore explored two alternatives for register-rich architectures:
Remembered Sets
Remembered sets are used in generational GC to track pointers from older generations into younger ones, so that when collecting the younger generation we can quickly find all the pointers into that generation. Whenever a pointer into a younger generation is created, an entry must be added to the remembered set. There are many choices for the representation of the remembered set and the form of its associated write barrier (Blackburn and Hosking 2004). In GHC, because mutation is rare, we opted for a relatively expensive write barrier in exchange for an accurate remembered set. Our remembered set representation is a sequential store buffer, which contains the addresses of objects in the old generation that contain pointers into the new generation. Each mutable object is marked to indicate whether it is dirty or not; dirty objects are in the remembered set. The write barrier adds a new entry to the remembered set if and only if the object being mutated is in the old generation and is not already marked dirty. In our parallel runtime, each HEC has its own remembered set. The reasons for this are twofold:
• Keep the heap limit in memory. This slows the heap-exhaustion
check, but releases an extra register for argument passing.
• Even though appending to the remembered set is not a common
operation, it is common enough that the effect of including any synchronisation would be noticeable. Hence, we must be able to add new entries to the remembered set without atomic instructions.
• Keep the heap limit in a register, and implement pre-emption
by setting a separate memory-resident flag. The flag is checked whenever the thread’s current allocation block runs out, since it would be too expensive to insert another check at every heapcheck point. This approach is cheap and easy, but pre-emption is much less prompt: a thread can allocate up to 4k of data before noticing that a context-switch is required.
• Objects that have been mutated by the current CPU are likely
to be in its cache, so it is desirable to visit these objects by the garbage collector on the same CPU. This is particularly important in the case of threads: the stack of a thread, and hence its TSO, is itself a mutable object. When a thread executes, the stack will accumulate pointers to new objects, and so if the TSO resides in an old generation it must be added to the remembered set. Having HEC-local remembered sets helps to ensure that the garbage collector traverses a thread on the same CPU that was running the thread.
Figure 7 measures the benefit of using the heap-limit register to signal a context-switch, versus checking a flag after each 4k of allocation. We see a slight drop in performance at 4 cores, changing to an increase in performance at 7 cores. This technique clearly becomes more important as the number of cores and the amount of garbage collection increases: benchmarks like gray that do a lot of GC benefit the most.
One alternative choice for the remembered set is the card table. Card tables have the advantage that they can be updated by multiple threads without synchronisation, but they compromise on accuracy. More importantly for us, however, would be the loss of locality from using a single card table instead of per-HEC sequential store buffers. We do not have individual measurements for the benefit of using HEC-local remembered sets, but believe that it is essential for good performance of parallel programs. In GHC 6.10.1 remembered sets were partially localised: they were local during execution, but shared during GC. We subsequentially modified these partially HEC-local remembered sets to be fully localised.
6.4 Parallel GC and locality
When we initially developed the parallel GC, our goal was to improve GC performance, so we focused most of our effort on using parallelism to accelerate garbage collection for single-threaded programs (Marlow et al. 2008). In this case the key goal is achieving good load-balancing, that is, making sure that all of the GC threads have work to do. However, there is another factor working in the opposite direction: locality. For parallel programs, when GC begins each CPU already has a lot of data in its cache; in a sequential program only one CPU does. It would obviously make sense for each CPU to garbage-collect its own data, so far as possible, rather than to allow GC to redistribute it. Each HEC starts by tracing its own root set, starting from the HEC's private data (Section 4.1). However, our original parallel GC design used global work queues for load-balancing (Marlow et al. 2008). This is a poor choice for locality, because the link between the CPU that copies the data and the CPU that scans it for roots is lost. To tackle this, we modified our parallel GC design to use work-stealing queues. The benefits of this are threefold:

1. Contention is reduced.

2. Locality is improved: a CPU will take work from its own queue in preference to stealing. Local work is likely to be in the CPU's cache, because it consists of objects that this CPU recently copied.

3. We can easily disable load-balancing entirely, by opting not to steal any work from other CPUs. This trades parallelism in the GC for locality.
The use of work-stealing queues for load-balancing in parallel GC is a well-known technique (Flood et al. 2001); however, what has not been studied before is the trade-off between whether or not to do load-balancing at all for parallel programs. We will measure our benchmarks in three configurations:

• The baseline is our best system, with parallel GC turned off.

• PARGC1: using parallel GC in the old generation only, with load-balancing.

• PARGC2: using parallel GC in both young and old generations, with load-balancing.

• PARGC3: using parallel GC in both young and old generations, without load-balancing.

PARGC1 and PARGC2 use work-stealing for load-balancing, PARGC3 uses no load-balancing. In terms of locality, PARGC2 will improve locality significantly by traversing most of the data reachable by parallel threads on the same CPU as the thread is executing. PARGC3 will improve locality further by not moving data from one CPU's cache to another in an attempt to balance the work of GC. Figures 8 and 9 present the results, for 4 cores and 7 cores respectively. There are several aspects to these figures that are striking:

• partree delivers an 80% improvement with PARGC3 on 7 cores, with most of the benefit coming with PARGC2. Clearly locality is vitally important in this benchmark.

• gray and mandel degrade significantly with PARGC2, recovering with PARGC3. Load-balancing appears to be having a significant negative effect on performance here. These are benchmarks that don't achieve full speedup, so it is likely that when a GC happens, idle CPUs are stealing data from the busy CPUs, harming locality more than would be the case if all the CPUs were busy.

• PARGC3 is almost always better than the other configurations.

There are of course other possible configurations. For instance, parallel GC in the old generation only without load-balancing, or parallel GC in both generations but with load-balancing only in the old generation. We have performed informal measurements on these and other configurations and found that on average they performed less well than PARGC3, although for individual benchmarks it is occasionally the case that a different configuration is a better choice. Future versions of GHC will use PARGC3 by default for parallel execution, although it will be possible to override the default and select any combination of parallel/sequential GC for each generation, with and without load-balancing.

Figure 8. The effectiveness of parallel GC (4 cores), ∆ Time (%)

  Program         PARGC1   PARGC2   PARGC3
  gray              -3.9    +52.5    -11.9
  mandel            +3.6    +19.0     -9.9
  matmult          -14.6     -3.9     -8.1
  parfib            -0.5     -1.9     -6.4
  partree           +5.0    -60.7    -65.8
  prsa              -0.4     +1.2     -4.4
  ray               +2.0     -2.6    -18.0
  sumeuler          +0.3     -0.5     -1.4
  Min              -14.6    -60.7    -65.8
  Max               +5.0    +52.5     -1.4
  Geometric Mean    -1.2     -5.1    -19.3

Figure 9. The effectiveness of parallel GC (7 cores), ∆ Time (%)

  Program         PARGC1   PARGC2   PARGC3
  gray              +2.8   +106.2     -3.0
  mandel            +8.9    +57.5     -5.2
  matmult          -19.2    +14.1    -14.1
  parfib            +0.9     -1.5     -7.6
  partree           -2.1    -70.3    -79.8
  prsa              +3.2    +16.7     +8.2
  ray               -0.8     +6.2     -5.2
  sumeuler          +5.1     +0.0     -2.8
  Min              -19.2    -70.3    -79.8
  Max               +8.9   +106.2     +8.2
  Geometric Mean    -0.5     +3.7    -21.3
7. The space behaviour of par
The spark pool should ideally contain only useful work, and we might hope that the garbage collector would assist the scheduler by removing useless sparks from the spark pool. One sure-fire way to do so is to remove any fizzled sparks. A spark has fizzled if the thunk to which it refers has already been evaluated, so that running the spark would terminate immediately. Indeed, we expect most sparks to fizzle. The par operation creates opportunities for parallel evaluation but, if the machine is busy, few of these opportunities are taken up. For example, consider
  x `par` (y `pseq` (x+y))

This sparks the thunk x (adding it to the spark pool), evaluates y, and then adds x and y. The addition operation forces both its arguments, so if the sparked thunk x has not been taken up by some other processor, the addition will evaluate it. In that case, the spark has fizzled. Clearly fizzled sparks are useless, and the garbage collector can (and does in GHC 6.10) discard them, but which other sparks should the garbage collector retain? Two policies immediately spring to mind, which we shall call ROOT and WEAK:
• ROOT: Treat (non-fizzled) sparks as roots for the garbage collector. That is, retain all such sparks and the graph they point to.
                    Strat.    No Strat.
  Total time (s)      10.7        6.4
  MUT time (s)         5.1        5.2
  GC time (s)          5.7        1.2
  Total sparks     1000000    1000000
  Fizzled sparks         0     999895
Figure 10. Comparison of ray using strategies vs. no strategies

• WEAK: Only retain (non-fizzled) sparks that are reachable from the roots of the program.

The problem is, neither of these policies is satisfactory. WEAK seems attractive, because it lets us discard sparks that are no longer required by the program. However, the WEAK policy is completely incompatible with strategies. Consider the parList strategy:

  parList :: Strategy a -> Strategy [a]
  parList strat []     = done
  parList strat (x:xs) = strat x `par` parList strat xs
Each spark generated by parList is a thunk for the expression "strat x"; this thunk is not shared, since it is created uniquely for the purposes of creating the spark, and hence can never fizzle. Hence, the WEAK policy will discard all sparks created by parList, which is obviously undesirable. So, what about the ROOT policy? This is the policy that is used in existing implementations of GpH, including GUM (Trinder et al. 1996) and GHC. However, it leads to the converse problem: too many sparks are retained, leading to space leaks. Consider the expression

  sum (parList rnf (map expensive [1..100000]))

With the ROOT policy we will retain all of the sparks created by parList, and hence lose no parallelism. But if there are not enough processors to evaluate all of the sparks, they will never be garbage collected, even after the sum is complete! They remain in the spark pool, retaining the list elements that they point to. This can lead to serious space leaks3. To quantify the effect, we compared two versions of the ray benchmark. The first version uses parBuffer from the standard strategies library, applied to the rwhnf strategy, while the second uses a modified version of parBuffer which avoids the space leak (we will explain how the modified version works in Section 7.2). We ran both versions of the program on a single CPU, to illustrate the degenerate case of having too few CPUs to use the available parallelism. Figure 10 gives the results; MUT is the amount of "mutator time" (execution time excluding garbage collection), GC is the time spent garbage collecting. We can see that with the strategies version, no sparks fizzle, and the GC time suffers considerably as a result4. Implementations using the ROOT policy have been around for quite a long time, and yet the problem has only recently come to light. Why is this? We are not sure, but postulate that the applications that have been used to benchmark these systems do not suffer unduly from the space leaks, perhaps because the amount of extra space retained is small, and there is little or no speculation involved. If there are enough CPUs to use all the parallelism, then no space leaks are observed; the problem comes when we want to write a single program that works well when run both sequentially and in parallel.

Are there any other policies that we should consider? Perhaps we might try to develop a policy along the lines of "discard sparks that share no graph with the main program". This is clearly an improvement on the ROOT policy because it lets us discard sparks that share nothing with the main program. However, it is quite difficult to establish whether there is any sharing between the spark and the main program, since this entails establishing a "reaches" property, where each closure in the graph is marked if it can reach certain other closures (namely the main program). This is exactly the opposite of the property that a garbage collector normally establishes, namely "is reachable from", and is therefore at odds with the way the garbage collector normally works. It requires a completely new traversal, perhaps by reversing all the pointers in the graph. Even if we could implement this strategy, it does not completely solve the problem. A spark may share data with the main program, but that is not enough: it has to share unevaluated data, and that unevaluated data must be part of what the spark will evaluate. Moreover, perhaps we still want to discard sparks that are retaining a lot of unshared data, but still refer to a small amount of shared data, on the grounds that the cost of the space leak outweighs the benefits of any possible parallelism.
7.1 Improving space behaviour of sparks
One way to improve the space behaviour of sparks is to use the WEAK policy for garbage collection. This guarantees, by construction, that the spark pool does not leak any space whatsoever. However, this choice would affect the programming model. In particular we can no longer use the strategies abstraction as it stands, because every strategy combinator involves sparking unique, unshared thunks, which WEAK will discard immediately. It is for this reason that GHC still uses the ROOT policy: if we were to switch to WEAK, then existing code using Strategies would lose parallelism. We can continue to write parallel programs without space leaks under the ROOT policy, as long as we observe the rule that all sparks must be eventually evaluated. Then we can be sure that any unused sparks will fizzle, and in this case there is no difference between ROOT and WEAK. The following section describes how to write programs in this way. Nonetheless, we believe that in the long term the implementation should use the WEAK policy. The WEAK policy has one distinct advantage over ROOT, namely that it is possible to write programs that use speculative parallelism without incurring space leaks. A speculative spark is by its nature one that may or may not be eventually evaluated, and in order to ensure that such sparks are eventually garbage collected if they turn out not to be required, we need to use WEAK.
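As an illustration (ours, not from the paper) of a speculative spark of the kind discussed above: the sparked value r may or may not be needed, so the spark need never fizzle; under ROOT its thunk would be retained indefinitely, whereas WEAK allows it to be collected once r becomes unreachable. The helper and names here are our own assumptions.

  import Control.Parallel (par)

  -- Hypothetical speculative spark: r is sparked, but only demanded when p holds.
  speculate :: Bool -> Int -> Int
  speculate p n = r `par` (if p then r else 0)
    where r = sum [1 .. n]   -- stand-in for some expensive computation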
7.2 Avoiding space leaks with ROOT
We can still use strategy-like combinators with ROOT, but they are no longer compositional. In the case of parList, if we simply want to evaluate each element to weak-head-normal-form, we use a specialised version of parList:

  parListWHNF :: Strategy [a]
  parListWHNF []     = done
  parListWHNF (x:xs) = x `par` parListWHNF xs
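As a usage sketch (ours, not from the paper): the sparked list is also consumed by the caller, so every spark either runs in parallel or fizzles and can be discarded by the garbage collector. parListWHNF is assumed to be the combinator defined just above.

  import Control.Parallel (pseq)

  -- Spark every element of ys, then consume ys in the calling thread.
  sumSquares :: [Int] -> Int
  sumSquares xs = parListWHNF ys `pseq` sum ys
    where ys = map (^ 2) xs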
Now, as long as the list we pass to parListWHNF is also evaluated by the main program, the sparks will all be garbage collected as usual. The rule of thumb is to always put a variable on the left of par. Reducing the granularity with parListChunk is a common technique. The idea is for each spark to evaluate a fixed-size chunk of the list, rather than a single element. To do this without incurring a space leak means that the sparked list chunks must be concatenated into a new list, and returned to the caller:
3 In fact, this space leak was reported to the GHC team as a bug: http://hackage.haskell.org/trac/ghc/ticket/2185.
4 Why don't all the sparks fizzle in the second version? In fact the runtime does manage to execute a few sparks while it is waiting for IO to happen.
  parListChunkWHNF :: Int -> [a] -> [a]
  parListChunkWHNF n = concat
                     . (`using` parListWHNF)
                     . map (`using` seqList)
                     . chunk n
where chunk :: Int -> [a] -> [[a]] splits a list into chunks of length n. A combinator that we find ourselves using often is parBuffer, which behaves like parList except that it does not traverse the whole list eagerly; it sparks a fixed number of elements initially, and then sparks subsequent elements as the list is consumed. This formulation works particularly well with programs that produce output as a lazy list, since it allows us to retain the constant-space property of the program while taking advantage of parallelism. The disadvantage is that we have to pick a buffer size, and the best choice of buffer size might well depend on how many CPUs we have available. Our modified version of parBuffer that avoids space leaks is parBufferWHNF:

  parBufferWHNF :: Int -> [a] -> [a]
  parBufferWHNF n xs = return xs (start n xs)
    where
      return (x:xs) (y:ys) = y `par` (x : return xs ys)
      return xs     []     = xs

      start !n []     = []
      start 0  ys     = ys
      start !n (y:ys) = y `par` start (n-1) ys
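The chunk helper used by parListChunkWHNF above is not given in the paper; a minimal sketch of a definition with the stated type might be (our assumption):

  -- Hypothetical definition of chunk: split a list into pieces of length n
  -- (the final piece may be shorter).
  chunk :: Int -> [a] -> [[a]]
  chunk _ [] = []
  chunk n xs = as : chunk n bs
    where (as, bs) = splitAt n xs

With such a definition, parListChunkWHNF 100 could, for example, be applied to a long list of results to evaluate it in parallel in chunks of 100 elements.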
Figure 11. The effect of eager black-holing

  4 cores                       7 cores
  Program      ∆ Time (%)       Program      ∆ Time (%)
  gray            -2.1          gray            +7.5
  mandel          -1.0          mandel          +1.8
  matmult         -7.8          matmult         +4.4
  parfib          -1.0          parfib          -8.6
  partree         +5.2          partree         +9.5
  prsa            -0.2          prsa            -0.1
  ray             +1.5          ray             +0.2
  sumeuler        -1.2          sumeuler        +1.7
  Geom. Mean      -0.9          Geom. Mean      +1.9
8. Thunks and black holes
Suppose that two Haskell threads, A and B, begin evaluation of a thunk t simultaneously. Semantically, it is acceptable for them both to evaluate t, since they will get the same answer (Harris et al. 2005); but operationally it is better to avoid this duplicated work. The obvious way to do so is to lock every thunk when starting evaluation, but that is expensive; measurements in the earlier cited work demonstrate an increase in execution time of around 50%. So we considered several variants that trade a reduced overhead against the risk of duplicated work:
EagerBH: immediately on entry, thread A overwrites t with a black hole. If thread B sees a black hole, it blocks until A performs the update (Section 8.1). The “window of vulnerability”, in which a second thread might start a duplicate evaluation, is now just a few instructions wide. The cost compared to sequential execution is an extra memory store on every thunk entry.
7.3 Compositional strategies revisited

We can recover a more compositional approach to strategies by changing their type. The existing Strategy type is defined thus:
RtsBH: enlists the runtime system, using the scheme described in Harris et al. (2005). The idea is to walk a thread’s stack whenever it returns to the scheduler, and “claim” each of the thunks under evaluation using an atomic instruction. If a thread is found to be evaluating a thunk already claimed by another thread, then we suspend the current execution and put the thread to sleep until the evaluation is complete. Since every thread will return to the scheduler at regular intervals (say, to do garbage collection), this ensures that we cannot continue to evaluate the same thunk in multiple threads indefinitely. The overhead is much less than locking every thunk because most thunks are entered, evaluated, and updated during a single scheduler timeslice.
type Strategy a = a -> Done
Suppose that instead we define Strategy as a projection, like this:

  type Strategy a = a -> a
then a Strategy can do some evaluation and sparking, and return a new a. In order to use this new kind of Strategy effectively, we need a new version of the par combinator:

  spark :: Strategy a -> a -> (a -> b) -> b
  spark strat a f = x `par` f x
    where x = strat a `pseq` a
The spark combinator takes a strategy strat, a value a, and a continuation f. It creates a spark to evaluate strat a, and then passes a new object to the continuation with the same value as a. When evaluated, this new object will cause the spark to fizzle and be discarded. Now we can recover compositional parList and seqList combinators:
PostCheck: As Harris et al. (2005) points out, if two threads both succeed in completing the evaluation of the same thunk, and its value itself contains more thunks, there is a danger that an unbounded amount of work can be duplicated. The PostCheck strategy adds a test just before the update to check whether the thunk has already been updated by another thread. This test does not use an atomic instruction, but reduces the chance of further duplicate work taking place.
  parList :: Strategy a -> Strategy [a]
  parList strat xs = foldr f [] xs
    where f x xs = spark strat x $ \x -> xs `pseq` x:xs
In our earlier work (Harris et al. 2005) we measured the overheads of locking every thunk, but said nothing about the overheads or work-duplication of the other strategies. GHC 6.10.1 implements RtsBH by default. Figure 11 shows the additional effect of EagerBH on our benchmark programs, for 4 and 7 cores. As you might expect, the effect is minor, because RtsBH catches almost all the cases that EagerBH does, except for very short-lived thunks which do not matter much anyhow. Figure 12 shows how many times RtsBH catches a duplicate computation in progress, both with and without adding EagerBH. As we can see, without EagerBH there are occasionally a substantial number of duplicate evaluations (e.g. in ray), but EagerBH reduces
  seqList :: Strategy a -> Strategy [a]
  seqList strat xs = foldr seq ys ys
    where ys = map strat xs
and indeed this works quite nicely. Note that parList requires linear stack space; it is also possible to write a version that only requires linear heap space, but that requires two traversals of the list. Here is parListChunk in the new style:

  parListChunk :: Int -> Strategy a -> Strategy [a]
  parListChunk n strat xs = ys `pseq` concat ys
    where ys = parList (seqList strat) $ chunk n xs
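As a usage sketch (ours, not from the paper) of the projection-style combinators: rwhnf is our assumed name for the simplest strategy, which just evaluates its argument to weak-head normal form; parListChunk, parList, seqList and chunk are the combinators discussed above.

  import Control.Parallel (pseq)

  -- Assumed WHNF strategy in the new (projection) style, Strategy a = a -> a.
  rwhnf :: Strategy a
  rwhnf x = x `pseq` x

  -- Evaluate a list in parallel chunks of n elements, then sum it.
  sumChunked :: Int -> [Int] -> Int
  sumChunked n xs = sum (parListChunk n rwhnf xs)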
Figure 12. The number of duplicate computations caught

  Program     RtsBH    RtsBH + EagerBH
  gray            0          0
  mandel          3          0
  matmult         5          5
  parfib         70          1
  partree         3          3
  prsa           45          0
  ray          1261          0
  sumeuler        0          0

Figure 13. Disabling thread migration

  4 cores                       7 cores
  Program      ∆ Time (%)       Program      ∆ Time (%)
  gray            +2.9          gray            +2.9
  mandel          +4.2          mandel          +1.2
  matmult        +42.1          matmult         -0.0
  parfib          -0.3          parfib          +7.5
  partree         +3.3          partree         +7.0
  prsa            -0.5          prsa            -1.5
  ray             -5.4          ray            -14.4
  sumeuler        -0.2          sumeuler        +1.2
  Geom. Mean      +5.0          Geom. Mean      +0.3
that number to almost zero. In ray, although we managed to eliminate a large number of duplicate evaluations using EagerBH, the effect on overall execution time was negligible: this program creates 10^6 tiny sparks, so 1200 duplicate evaluations have little impact. In fact, with the fine granularity in this benchmark, it may be that the cost of suspending the duplicate evaluation and blocking the thread outweighs the cost of just duplicating the computation. To date we have not measured the effect of PostCheck. We expect it to have no effect on these benchmarks, especially in combination with EagerBH. However, we have experienced the effect of unbounded duplicate work in other programs; one good example where it can occur is in this version of parMap:
9. Load balancing and migration
In this section we discuss design choices concerning which HEC should run which Haskell threads.
9.1 Sharing runnable threads
In the current implementation, while we (now) use work-stealing for sparks, we use work-pushing for threads. That is, when a HEC detects that it has more than one thread in its Run Queue and there are other idle HECs, it distributes some of the local threads to the other HECs. The reason for this design is mostly historical; we could without much difficulty represent the Run Queue using a work-stealing queue and thereby use work-stealing for the load-balancing of threads. We measured the effect that automatic thread migration has on the performance of our parallel benchmarks. Figure 13 shows the effect of disabling automatic thread migration, against a baseline of the PARGC3 configuration (Section 6.4). Since these are parallel, rather than concurrent, programs, the only way that multiple threads can exist on the Run Queue of a single CPU is when a thread becomes temporarily blocked (on a black hole, Section 8.1), and then later becomes runnable again. As we can see from the results, often allowing migration makes no difference. Occasionally it is essential: for example matmult on 4 cores. And occasionally, as in ray, allowing migration leads to worse performance, possibly due to lost locality. Whether to allow migration or not is a runtime flag, so the programmer can experiment with both settings to find the best one.
  parMap :: (a -> b) -> [a] -> [b]
  parMap f []     = []
  parMap f (x:xs) = fx `par` (pmxs `par` (fx:pmxs))
    where fx   = f x
          pmxs = parMap f xs
This function sparks both the head and the tail of the list, instead of traversing the whole list sparking each element as in the usual parMap. The duplication problem occurs if two threads evaluate the pmxs thunk: then the tail of the list is duplicated, possibly resulting in a large number of useless sparks being created.
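For contrast, here is a sketch (ours) of what the text calls the usual parMap: the calling thread traverses the spine itself and only the elements are sparked, so the spine can never be evaluated by two threads at once.

  import Control.Parallel (par, pseq)

  -- Conventional parMap sketch: spark each element, build the spine
  -- in the calling thread.
  parMapUsual :: (a -> b) -> [a] -> [b]
  parMapUsual _ []     = []
  parMapUsual f (x:xs) = fx `par` (fxs `pseq` (fx : fxs))
    where fx  = f x
          fxs = parMapUsual f xs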
8.1 Blocking on a black hole
When a thread A tries to evaluate a black hole, it must block until the thread currently evaluating the black hole (thread B) completes the evaluation, and overwrites the thunk with (an indirection to) its value. In earlier implementations (before 6.6) we arranged that thread A would attach its TSO to the thunk, so that thread B could re-awaken A when it performed the update. But that requires expensive synchronisation on every update, in case the thunk by now has a sleeping thread attached to it. Since thunk updates are very common, but collisions (in which a sleeping thread attaches itself to a thunk) are very rare, GHC 6.10 instead optimises for the common case. Instead of attaching itself to the thunk, the blocked thread A simply polls the thunk, waiting for the update. Since a thunk can only be updated once, an update can therefore be performed without any synchronisation whatsoever, provided that writes are not re-ordered. Our earlier work (Harris et al. 2005) discusses these synchronisation issues in much more detail. GHC 6.10 maintains a single, global Black Hole Pool, which the HECs poll when they are otherwise idle, and at least once per GC. We have considered two alternative designs: (a) privatising the Black Hole Pool to each HEC; and (b) using the thread scheduler directly, by making the blocked thread sleep and retry the evaluation when it reawakens. We have not yet measured these alternatives but, since contention is rare (Figure 12), they will probably only differ in extreme cases.
9.2 Migrating on wakeup
A blocked thread can be woken up for various reasons: if it is blocked on a black hole, it is woken up when some HEC notices that the black hole has now been evaluated (Section 8.1); if it is blocked on an empty MVar, then it can be unblocked when another thread performs a putMVar operation on that MVar. When a thread is woken up, if it was previously running on another HEC, we have a choice: it can be placed on the Run Queue of the current HEC (hence migrating it), or we could arrange to awaken it on the HEC it was previously running on. In fact, we could wake it up on any HEC, but typically these two options are the most profitable. Moving the thread to the current HEC might be advantageous if the thread is involved in a series of communications with another thread on this HEC: context-switching between two threads on the same HEC is particularly cheap. However, locality might also be important: the thread might be referencing data that is in the cache of the other HEC. In GHC we take locality seriously, so our default is not to migrate awoken threads to the current CPU. For parallel programs, it is never worthwhile to change this setting, at least with the current implementation of black holes, since it is essentially random which
HEC awakens a blocked thread. If we were to change the implementation of black holes such that a thread can tell when an update should wake a blocked thread (perhaps by using a hash table to map the address of black holes to blocked threads), then there might be some benefit in migrating the blocked thread to the CPU on which the value it was waiting for resides.
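As a small, self-contained illustration (ours, not from the paper) of the MVar wakeup case discussed above: the forked thread blocks on an empty MVar and is woken by the putMVar, at which point the runtime must choose a HEC on which to resume it.

  import Control.Concurrent
  import Control.Concurrent.MVar

  main :: IO ()
  main = do
    mv   <- newEmptyMVar
    done <- newEmptyMVar
    _ <- forkIO $ do
           v <- takeMVar mv      -- blocks until a value arrives
           print (v :: Int)
           putMVar done ()
    putMVar mv 42                -- wakes the blocked thread
    takeMVar done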
10. Benchmarks and experimental setup
Our test system consists of 2 quad-core Intel Xeon(R) E5320 processors at 1.6GHz. Each pair of cores shares 4MB of L2 cache, and there is 16GB of system memory. The system was running Fedora 9. Although the OS was running in 64-bit mode, we used 32-bit binaries for our measurements (programs compiled for 64-bit tend to place more stress on the memory system and garbage collector, resulting in less parallelism). In all cases we ran the programs five times and took the average wall-clock execution time. Our benchmarks consist of a selection of small-to-medium-sized Parallel Haskell programs:
Figure 14. A slow synchronisation
• parfib: the ubiquitous parallel fibonacci function, included here as a sanity test to ensure that our implementation is able to parallelise micro-benchmarks. The parallelism is divide-and-conquer-style, using explicit par and pseq.

• sumeuler: the sum of the value of Euler's function applied to each integer up to a given bound. This is a map/reduce style problem: applications of the Euler function can be performed in parallel, and the results must be summed (Trinder et al. 2002). The parallelism is expressed using parListChunk from the strategies library.

• matmult: A naive matrix-multiply algorithm. The matrix is represented as a [[Int]]. The parallelism is expressed using parListChunk.

• ray: A ray-tracer benchmark5. The parallelism is expressed using parBuffer, and is quite fine-grained (each pixel to be rendered is a separate spark).

• gray: Another ray-tracing benchmark, this time taken from an entry6 in the ICFP'00 programming contest. Only the rendering part of the program has been parallelised, using a parBuffer as above. According to time profiling, the program only spends about 50% of its time in the renderer, so we expect this to limit the parallelism we can achieve. The parallelism is expressed using a single parBuffer in the renderer.

• prsa: A parallel RSA message encoder, encoding a 500KB message. Parallelism is again expressed using parBuffer.

• partree: A parallel map and fold over a tree. The program originates in the GUM benchmark suite, and in fact appears to be badly written: it is quadratic in the size of the tree. Nevertheless, it does appear to run in parallel, so we used the program unmodified for the purposes of benchmarking.

• mandel: this is a Mandelbrot-set program originating in the nofib benchmark suite (Partain 1992). It generates a lazy list of pixel data (for a 1024x1024 scene), in a similar way to the ray tracer, and it was parallelised in the same way with the addition of parBuffer. The difference in this case is that the parallelism is more coarse-grained: each scan-line of the result is a separate spark.

These programs are all small, are mostly easy to parallelise, and are not highly optimised, so the results we report here should be interpreted as suggestive rather than conclusive. Nevertheless, our goal has not been to optimise the programs, but rather to optimise the implementation to make existing programs parallelise better. Furthermore, smaller benchmarks have their uses:

• Small benchmarks show up in high relief interesting differences in the behaviour of our runtime and execution model. These differences would be less marked had we used only large programs.

• We know that most of these programs should parallelise well, so any lack of parallelism is more likely to be as a result of choices made in the language implementation than in the program itself. Indeed, the lack of linear speedup in Figure 1 shows that we still have plenty of room for improvement.

10.1 Profiling

To help understand the behaviour of our benchmark programs we developed a graphical viewer called ThreadScope for event information generated by the runtime system. The viewer is modeled after circuit waveform viewers, with a profile drawn with time on the x-axis and HEC number on the y-axis. In each HEC's timeline, the bar is coloured solid black when the HEC is running Haskell code with the thread ID inside the box, and gray when it is garbage collecting. Events, such as threads blocking or being woken up, are indicated by labels annotating the timeline when the view is zoomed enough. This visualisation of the execution is immensely useful for being able to quickly identify problem areas. For example, when we ran our benchmarks on all 8 cores of our 8-core machine, we experienced inexplicable drops in performance. Figure 14 shows one problem area as seen in the profile viewer. In the middle of the picture there is a long period where one HEC has initiated a GC, and is waiting for the other HECs to stop. The initiating/waiting HEC has a white bar, and the HECs that have already stopped and are ready to GC are shown in gray. One HEC is running (black, with thread ID 164) during this period, indicating that it is apparently running Haskell code and has not responded to the call for a GC. In fact, the OS thread running this HEC has been descheduled by the OS, so does not respond for a relatively long period. The same pattern repeats many times during the execution, having a significant impact on the overall runtime. This experience does illustrate that our runtime is particularly sensitive to problems such as this due to the relatively high frequency of full synchronisations needed for GC, and that tackling independent GC (Section 12.1) should be a high priority.

5 This program has a long history. According to comments in the source code, it was "taken from Paul Kelly's book, adapted by Greg Michaelson for SML, converted to (parallel) Haskell by Kevin Hammond".
6 From the Galois team.
11. Related work
The design space for language features to support implicit parallelism and the underlying run-time system is very large. Here we identify just a few systems that make different design decisions and trade-offs from the GHC run-time system. Like GHC, the Manticore (Fluet et al. 2008b,a) system also supports implicit and explicit fine-grained parallelism, which in turn has been influenced by previous work on data parallel languages like NESL (Blelloch et al. 1994) and Nepal/DPH (Chakravarty et al. 2001). Unlike NESL or Nepal/DPH, GHC also implements support for explicit concurrency, as does Manticore. Many of the underlying implementation choices made for GHC and Manticore are interchangeable, e.g. Manticore uses a partially shared heap whereas GHC uses a totally shared heap. Manticore however presents quite a different programming model based on parallel data structures (e.g. tuples and arrays) which provide a fork-join pattern of computation, as well as a parallel case expression which can introduce non-determinism. Neither GHC nor Manticore supports implicit parallelism without the need for user annotations, which has been implemented in other functional languages like Id (Nikhl 1991), pH (Nikhl and Arvind 2001) and Sisal (Gaudiot et al. 1997). STING (Jagannathan and Philbin 1992) is a system that supports multiple parallel language constructs for a dialect of SCHEME through three layers of process abstraction, as well as special support for specifying scheduling policies. We also intend to modify GHC's infrastructure to allow different scheduling policies to be composed together in a flexible manner (Li et al. 2007a). GpH (Trinder et al. 1998) extended Haskell98 to introduce the parallel (par) and sequential (seq) coordination primitives and provides strategies for controlling the evaluation order. Unlike semi-implicit parallelism annotations in Haskell, which identify opportunities for parallelism, in Eden (Loogen et al. 2005) one explicitly creates processes which are always executed concurrently. GUM (Trinder et al. 1996) targets distributed systems and is based on message passing. In our work we profiled the execution time of Haskell threads and garbage collection. However, we will also need to perform space profiling, and the work on the MLton project on semantic space profiling (Spoonhower et al. 2008) represents an interesting approach for a strict language. Erlang (Armstrong et al. 1996) provides isolated threads which communicate through a mailbox mechanism, with pattern matching used to select messages of interest. These design decisions have a substantial effect on the design of the run-time system. Eden (Loogen et al. 2005) provides special annotations to control parallel evaluation of processes. Cilk (Blumofe et al. 2001) is an imperative programming language based on C which also supports fine-grained parallelism in a fork-join manner by spawning off parallel invocations of procedures. Like GHC, Cilk also performs work-stealing for load balancing. The spawn feature of Cilk, expressions bound with pcase in Manticore, and sparks in GHC can all be considered to be instances of futures.
12. Conclusion and future work

While we have achieved some significant improvements in parallel efficiency, our work clearly has some way to go; several benchmarks do not speed up as much as we might hope. Our focus in the future will therefore continue to be on using profiling tools to identify problem areas, and using those results to direct our attention to appropriate areas of the runtime system and execution model. The work on implicit parallelization described in Harris and Singh (2007) may benefit from the recent changes to the GHC run-time, and we are considering re-running the benchmarks to measure any improvements. In particular we expect benchmarks that perform a lot of garbage collection to benefit from the parallel garbage collector. It is clear from our investigation of the programming model in Section 7 that we should change the GC policy for sparks from ROOT to WEAK, but we must also revisit the Strategies abstraction and develop a new library that works effectively under WEAK.

12.1 Independent GC
Stop-the-world GC will inevitably become more of a bottleneck as the number of cores increases. There are known techniques for doing CPU-independent GC (Doligez and Leroy 1993), and these techniques are used in systems such as Manticore (Fluet et al. 2008a). We fully intend to pursue CPU-independent GC in the future. However this is unlikely to be an easy transition. CPU-independent GC replaces direct sharing by physical separation and explicit communication. This leads to trade-offs; it isn’t a straightforward win. More specifically, CPU-independent GC requires a local-heap invariant, namely that there are no pointers between local heaps, or from the global heap into any local heap. Ensuring and maintaining this invariant introduces new costs and complexities into the runtime execution model. On the other hand, as the number of cores in modern CPUs increases, the illusion of shared memory begins to break down. We are already experiencing severe penalties for losing locality (Section 6.4), and it is likely that these will only get worse in the future. Hence, moving to more explicitly-separate heap regions is a more honest reflection of the underlying memory architecture, and is likely to allow the implementation to make intelligent decisions about data locality.
Acknowledgments We wish to thank Jost Berthold, who helped us identify some of the bottlenecks in the parallel runtime implementation, and built the first implementation of work-stealing queues for spark distribution during an internship at Microsoft Research in the summer of 2008. We also wish to thank Phil Trinder and Tim Harris for their helpful comments on an earlier draft of this paper.
References

J. R. Armstrong, R. Virding, C. Wikström, and M. Williams. Concurrent programming in ERLANG (2nd ed.). Prentice Hall International (UK) Ltd., Hertfordshire, UK, 1996.
Nimar S. Arora, Robert D. Blumofe, and C. Greg Plaxton. Thread scheduling for multiprogrammed multiprocessors. In Proceedings of the Tenth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), Puerto Vallarta, pages 119–129, 1998.
J. Berthold, S. Marlow, A. Al Zain, and K. Hammond. Comparing and optimising parallel Haskell implementations on multicore. In IFL'08: International Symposium on Implementation and Application of Functional Languages (Draft Proceedings), Hatfield, UK, 2008.
Stephen M. Blackburn and Antony L. Hosking. Barriers: friend or foe? In ISMM ’04: Proceedings of the 4th international symposium on Memory management, pages 143–151, New York, NY, USA, 2004. ACM. ISBN 1-58113-945-4. doi: http://doi. acm.org/10.1145/1029873.1029891.
G. E. Blelloch, S. Chatterjee, J. C. Hardwick, J. Sipelstein, and M. Zagha. Implementation of a portable nested data-parallel language. JDPC, 21(1):4–14, 1994.
R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, and C. E. Leiserson. Cilk: an efficient multithreaded runtime system. SIGPLAN Not., 30(8):207–216, 2001.
M. T. Chakravarty, G. Keller, R. Leshinskiy, and W. Pfannenstiel. Nepal - Nested Data Parallelism in Haskell. LNCS, 2150, Aug 2001.
H.-W. Loidl, F. Rubio, N. Scaife, K. Hammond, S. Horiguchi, U. Klusik, R. Loogen, G. J. Michaelson, R. Peña, S. Priebe, Á. J. Rebón, and P. W. Trinder. Comparing parallel functional languages: Programming and performance. Higher Order Symbol. Comput., 16(3):203–251, 2003. ISSN 1388-3690. doi: http://dx.doi.org/10.1023/A:1025641323400.
David Chase and Yossi Lev. Dynamic circular work-stealing deque. In SPAA ’05: Proceedings of the seventeenth annual ACM symposium on Parallelism in algorithms and architectures, pages 21–28, New York, NY, USA, 2005. ACM. ISBN 1-58113-9861. doi: http://doi.acm.org/10.1145/1073970.1073974.
Rita Loogen, Yolanda Ortega-Mallén, and Ricardo Peña. Parallel Functional Programming in Eden. Journal of Functional Programming, 15(3):431–475, 2005.
Damien Doligez and Xavier Leroy. A concurrent, generational garbage collector for a multithreaded implementation of ml. In POPL ’93: Proceedings of the 20th ACM SIGPLAN-SIGACT symposium on Principles of programming languages, pages 113–123, New York, NY, USA, 1993. ACM. ISBN 0-89791560-7. doi: http://doi.acm.org/10.1145/158511.158611.
Simon Marlow, Simon Peyton Jones, and Wolfgang Thaller. Extending the haskell foreign function interface with concurrency. In Proceedings of the ACM SIGPLAN workshop on Haskell, pages 57–68, Snowbird, Utah, USA, September 2004. URL http://www.haskell.org/~simonmar/ papers/conc-ffi.pdf.
Christine Flood, Dave Detlefs, Nir Shavit, and Catherine Zhang. Parallel garbage collection for shared memory multiprocessors. In Usenix Java Virtual Machine Research and Technology Symposium (JVM ’01), Monterey, CA, 2001. URL citeseer.ist. psu.edu/flood01parallel.html.
Simon Marlow, Tim Harris, Roshan P. James, and Simon Peyton Jones. Parallel generational-copying garbage collection with a block-structured heap. In ISMM ’08: Proceedings of the 7th international symposium on Memory management. ACM, June 2008. URL http://www.haskell.org/~simonmar/ papers/parallel-gc.pdf.
Matthew Fluet, Mike Rainey, and John Reppy. A scheduling framework for general-purpose parallel languages. SIGPLAN Not., 43 (9):241–252, 2008a. ISSN 0362-1340. doi: http://doi.acm.org/ 10.1145/1411203.1411239.
R. S. Nikhl. ID language reference manual. Laboratory for Computer Science, MIT, Jul 1991.
Matthew Fluet, Mike Rainey, John Reppy, and Adam Shaw. Implicitly-threaded Parallelism in Manticore. International Conference on Functional Programming, pages 119–130, 2008b.
R. S. Nikhl and Arvind. Implicit Parallel Programming in pH. Morgan Kaufmann Publishers, San Francisco, CA, 2001.
WD Partain. The nofib benchmark suite of Haskell programs. In Functional Programming, Glasgow 1992, Workshops in Computing, pages 195–202. Springer Verlag, 1992.
J. L. Gaudiot, T. DeBoni, J. Feo, W. Bohm, W. Najjar, and P. Miller. The Sisal model of functional programming and its implementation. In As '97, pages 112–123, Los Alamitos, CA, March 1997. IEEE Computer Society Press.
S. Peyton Jones, A. Gordon, and S. Finne. Concurrent Haskell. In Proc. of POPL'96, pages 295–308. ACM Press, 1996.
Simon Peyton Jones, Roman Leshchinskiy, Gabriele Keller, and Manuel M. T. Chakravarty. Harnessing the multicores: Nested data parallelism in Haskell. In FSTTCS 2009: IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science, 2009.
Tim Harris and Satnam Singh. Feedback directed implicit parallelism. In ICFP ’07: Proceedings of the 12th ACM SIGPLAN international conference on Functional programming, pages 251– 264, New York, NY, USA, 2007. ACM. ISBN 978-1-59593815-2. doi: http://doi.acm.org/10.1145/1291151.1291192.
Daniel Spoonhower, Guy E. Blelloch, Robert Harper, and Phillip B. Gibbons. Space profiling for parallel functional programs. In ICFP ’08: Proceedings of the 12th ACM SIGPLAN international conference on Functional programming, pages 253–264, New York, NY, USA, 2008. ACM.
Tim Harris, Simon Marlow, and Simon Peyton Jones. Haskell on a shared-memory multiprocessor. In Haskell ’05: Proceedings of the 2005 ACM SIGPLAN workshop on Haskell, pages 49– 61. ACM Press, September 2005. ISBN 1-59593-071-X. doi: http://doi.acm.org/10.1145/1088348.1088354. URL http:// www.haskell.org/~simonmar/papers/multiproc.pdf.
P. W. Trinder, H.-W. Loidl, and R. F. Pointon. Parallel and distributed Haskells. J. Funct. Program., 12(5):469–510, 2002. ISSN 0956-7968. doi: http://dx.doi.org/10.1017/ S0956796802004343.
S. Jagannathan and J. Philbin. A customizable substrate for concurrent languages. In ACM Conference on Programming Languages Design and Implementation (PLDI’92), pages 55–81. ACM Press, June 1992.
PW Trinder, K Hammond, JS Mattson, AS Partridge, and SL Peyton Jones. GUM: a portable parallel implementation of haskell. In ACM Conference on Programming Languages Design and Implementation (PLDI’96). ACM Press, Philadelphia, May 1996.
P. Li, Simon Marlow, Simon Peyton Jones, and A. Tolmach. Lightweight concurrency primitives for GHC. In Haskell ’07: Proceedings of the 2007 ACM SIGPLAN workshop on Haskell, pages 107–118. ACM Press, September 2007a.
PW Trinder, K Hammond, H-W Loidl, and SL Peyton Jones. Algorithm + strategy = parallelism. Journal of Functional Programming, 8:23–60, January 1998.
Peng Li, Simon Marlow, Simon Peyton Jones, and Andrew Tolmach. Lightweight concurrency primitives for GHC. Haskell '07: Proceedings of the ACM SIGPLAN workshop on Haskell, June 2007b. URL http://www.haskell.org/~simonmar/papers/conc-substrate.pdf.
H-W. Loidl, P.W. Trinder, K. Hammond, S.B. Junaidu, R.G. Morgan, and S.L. Peyton Jones. Engineering Parallel Symbolic Programs in GPH. Concurrency — Practice and Experience, 11:701–752, 1999. URL http://www.cee.hw.ac.uk/~dsg/gph/papers/ps/cpe.ps.gz.
Effective Interactive Proofs for Higher-Order Imperative Programs ∗

Adam Chlipala
Gregory Malecha
Greg Morrisett
Avraham Shinnar
Ryan Wisnesky
Harvard University, Cambridge, MA, USA {adamc, gmalecha, greg, shinnar, ryan}@cs.harvard.edu
Abstract
1. Introduction
We present a new approach for constructing and verifying higherorder, imperative programs using the Coq proof assistant. We build on the past work on the Ynot system, which is based on Hoare Type Theory. That original system was a proof of concept, where every program verification was accomplished via laborious manual proofs, with much code devoted to uninteresting low-level details. In this paper, we present a re-implementation of Ynot which makes it possible to implement fully-verified, higher-order imperative programs with reasonable proof burden. At the same time, our new system is implemented entirely in Coq source files, showcasing the versatility of that proof assistant as a platform for research on language design and verification. Both versions of the system have been evaluated with case studies in the verification of imperative data structures, such as hash tables with higher-order iterators. The verification burden in our new system is reduced by at least an order of magnitude compared to the old system, by replacing manual proof with automation. The core of the automation is a simplification procedure for implications in higher-order separation logic, with hooks that allow programmers to add domain-specific simplification rules. We argue for the effectiveness of our infrastructure by verifying a number of data structures and a packrat parser, and we compare to similar efforts within other projects. Compared to competing approaches to data structure verification, our system includes much less code that must be trusted; namely, about a hundred lines of Coq code defining a program logic. All of our theorems and decision procedures have or build machine-checkable correctness proofs from first principles, removing opportunities for tool bugs to create faulty verifications.
A key goal of type systems is to prevent “bad states” from arising in the execution of programs. However, today’s widely-used type systems lack the expressiveness needed to catch language-level errors, such as a null-pointer dereference or an out-of-bounds array index, let alone library- and application-specific errors such as removing an element from an empty queue, failing to maintain the invariants of a balanced tree, or forgetting to release a critical resource such as a database connection. For safety- and securitycritical code, a type system should ideally let the programmer assign types to libraries such that client code cannot suffer from these problems, and, in the limit, the type system should make it possible for programmers to verify that their code is correct. There are many recent attempts to extend the scope of type systems to address a wider range of safety properties. Representative examples include ESC/Java (Flanagan et al. 2002), Epigram (McBride and McKinna 2004), Spec# (Barnett et al. 2004), ATS (Chen and Xi 2005), Concoqtion (Pasalic et al. 2007), Sage (Gronski et al. 2006), Agda (Norell 2007), and Ynot (Nanevski et al. 2008). Each of these systems integrates some form of specification logic into the type system in order to rule out a wider range of truly bad states. However, in the case of ESC/Java, Spec#, and Sage, the program logic is too weak to support full verification because these systems rely completely upon provers to discharge verification conditions automatically. While there have been great advances in the performance of automated provers, in practice, they can only handle relatively shallow fragments of first-order logic. Thus, programmers are frustrated when correct code is rejected by the type-checker. For example, none of these systems is able to prove that an array index is in bounds when the constraints step outside quantifier-free linear arithmetic. In contrast, Agda, ATS, Concoqtion, Epigram, and Ynot use powerful, higher-order logics that support a much wider range of policies including (partial) correctness. Furthermore, in the case of Ynot, programmers can define and use connectives in the style of separation logic (Reynolds 2002) to achieve simple, modular specifications of higher-order imperative programs. For example, a recent paper (Nanevski et al. 2008) coauthored by some of the present authors describes how Ynot was used to construct fullyverified implementations of data structures such as queues, hash tables, and splay trees, including support for higher-order iterators that take effectful functions as arguments. The price paid for these more powerful type systems is that, in general, programmers must provide explicit proofs to convince the type-checker that code is correct. Unfortunately, explicit proofs can be quite large when compared to the code. For example, in the Ynot code implementing dequeue for imperative queues, only 7 lines
Categories and Subject Descriptors F.3.1 [Logics and meanings of programs]: Mechanical verification; D.2.4 [Software Engineering]: Correctness proofs, formal methods, reliability

General Terms Languages, Verification

Keywords functional programming, interactive proof assistants, dependent types, separation logic

∗ This research was supported in part by a gift from Microsoft Research and a National Science Foundation Graduate Research Fellowship.
of program code are required, whereas the proof of correctness is about 70 lines. This paper reports our experience re-designing and re-implementing Ynot to dramatically reduce the burden of writing and maintaining the necessary proofs for full verification. Like the original Ynot, our system is based on the ideas of Hoare Type Theory (Nanevski et al. 2006) and is realized as an axiomatic extension of the Coq proof assistant (Bertot and Cast´eran 2004). This allows us to inherit the full power of Coq’s dependent types for writing code, specifications, and proofs, and it allows us to use Coq’s facility for extraction to executable ML code. However, unlike in the previous version, we have taken advantage of Coq’s tactic language, Ltac, to implement a set of parameterized procedures for automatically discharging, or at least simplifying, the separation logic-style verification conditions. The careful design of these procedures makes it possible for programmers to teach the prover about new domains as they arise. We describe this new implementation of Ynot and report on our experience implementing and verifying various imperative data structures including stacks, queues, hash tables, binomial trees, and binary search trees. When compared with the previous version of Ynot, we observe roughly an order of magnitude reduction in proof size. In most cases, to realize automation, programmers need only prove key lemmas regarding the abstractions used in their interfaces and plug these lemmas into our extensible tactics. Additionally, we show that the tactics used to generate the proofs are robust to small changes in the code or specifications. In the next section, we introduce the new Ynot in tutorial style. Next, we describe the automation tactics that we built, report on further evaluation of our system via case studies, compare with related work, and conclude. 1.1 Coq as an Extensible Automated Theorem Prover Almost everyone familiar with Coq associates it with a particular style of proof development, which might be called the “video game” approach, after a comment by Xavier Leroy. A theorem is proved in many steps of manual interaction, where Coq tells the user which goals remain to be proved, the user enters a short command that simplifies the current goal somewhat, and the process repeats until no goals remain. One of our ancillary aims in this paper is to expose a broad audience to a more effective proof style. Coq provides very good support for fully automatic proving, via its domain-specific programming language Ltac (Delahaye 2000). This support can be mixed-and-matched with more manual proving, and it is usually the case that a well-written development starts out more manual and gradually transforms to a final form where no sequential proof steps are spelled out beyond which induction principle to use. Proof scripts of that kind often adapt without change to alterations in specifications and implementations. We believe that awareness of this style is one of the crucial missing pieces blocking widespread use of proof assistants. We hope that the reader will agree that some of the examples that follow provide evidence that, for programmers with a few years of training using proof assistants, imperative programming with correctness verification need not be much harder than programming in Haskell.
2. The Ynot Programming Environment

To a first approximation, Coq can be thought of as a functional programming language like Haskell or ML, but with support for dependent types. For instance, one can have operations with types such as:

div : nat -> forall n : nat, n <> 0 -> nat

which uses dependency to capture the fact that div can only be called when a proof can be supplied that the second argument is non-zero. One can also write functions such as:

Definition avg (x : list nat) : nat :=
  let sum := fold plus 0 x in
  let len := length x in
  match eq_nat_dec len 0 with
  | inl (pf1 : len = 0) => 0
  | inr (pf2 : len <> 0) => div sum len pf2
  end.

This function averages the values in a list of natural numbers. It has a normal type like you might find in ML, and its implementation begins in an ML-like way, using a higher-order fold function. The interesting part is the match expression. We match on the result of a call to eq_nat_dec, a dependently-typed natural number comparison function. This function returns a sum type with an equality proof in one branch and an inequality proof in the other. We bind a name for each proof explicitly in the pattern for each match case. The proof that len is not zero is passed to div to justify the safety of the operation.

All Coq functions have to be pure: terminating without side effects. This is necessary to ensure that proofs really are proofs, with no spurious invalid "proofs by infinite loop." Ynot extends Coq with support for side-effecting computations. Similarly to Haskell, we introduce a monadic type constructor ST T which describes computations that might diverge and that might have side effects, but that, if they do return, return values of type T. The ST type family provides a safe way to keep the effectful computations separate from the pure computations. Unlike Haskell's IO monad, the ST type family is parameterized by a pre- and post-condition, which can be used to describe the effects of the computation on a mutable store.

Alternatively, one can think of the axiomatic base of Ynot as a fairly standard Hoare logic. The main difference of our logic from usual presentations is that it is designed to integrate well with Coq's functional programming language. Therefore, instead of defining a language of commands, we formalize a language of expressions in the style of the IO monad. A program derivation is of the form {P} e {Q}, where P is a pre-condition predicate over heaps, and Q is a post-condition predicate over an initial heap, the value that results from evaluating e, and a final heap. For instance, where we write sel and upd for the heap selection and update operations used in the ESC tools (Flanagan et al. 2002), we can derive the following facts, where p1 and p2 are pointer variables bound outside of the commands that we are verifying.

{λ_. ⊤} return(1) {λh, v, h′. h′ = h ∧ v = 1}

and

{λh. sel(h, p1) = p2} x ← !p1; x := 1 {λh, _, h′. sel(h, p1) = p2 ∧ h′ = upd(h, p2, 1)}

Unlike other systems, Ynot does not distinguish between programs and derivations. Rather, the two are combined into one dependent type family, whose indices give the specifications of programs. For instance, the type of the "return" example would be:

ST (fun _ => True) (fun h v h' => h' = h /\ v = 1)

Heaps are represented as functions from pointers to dynamically-typed packages, which are easy to implement in Coq with an inductive type definition. The pointer read rule enforces that the heap value being read has the type that the code expects. The original Ynot paper (Nanevski et al. 2008) contains further details of the base program logic.
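To make the dependency concrete, a caller discharges the side condition by supplying a proof term. The snippet below is our illustration: div is re-declared as a Parameter only so that the snippet stands alone, mirroring the type above.

Parameter div : nat -> forall n : nat, n <> 0 -> nat.

Lemma two_nonzero : 2 <> 0.
Proof. intros H; discriminate H. Qed.

(* The proof object two_nonzero licenses the call. *)
Definition ten_halved : nat := div 10 2 two_nonzero.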
                {P1} e1 {Q1}    ∀x, {P2(x)} e2 {Q2}    ∀x, Q1(x) ⇒ P2(x)
Bind:           ---------------------------------------------------------
                               {P1} x ← e1; e2 {Q2}

Return:         {emp} return(v) {λv′. [v = v′]}

New:            {emp} new(v) {λp. p ↦ v}

Free:           {∃v, p ↦ v} free(p) {λ_. emp}

Read:           {∃v, p ↦ v ∗ P(v)} !p {λv. p ↦ v ∗ P(v)}

Write:          {∃v, p ↦ v} p := v′ {λ_. p ↦ v′}

                P ⇒ P′    {P′} e {Q′}    Q′ ⇒ Q
Consequence:    --------------------------------
                            {P} e {Q}

                     {P} e {Q}
Frame:          --------------------
                {P ∗ R} e {Q ∗ R}
Figure 1. The main rules of the derived separation logic
2.1 A Derived Separation Logic

Direct reasoning about heaps leads to very cumbersome proof obligations, with many sub-proofs that pairs of pointers are not equal. Separation logic (Reynolds 2002) is the standard tool for reducing that complexity. The previous Ynot system built a separation logic on top of the axiomatic foundation, and we do the same here. We introduce no new inductive type of separation logic formulas. Instead, we define functions that operate on arbitrary predicates over heaps, with the intention that we will only apply these functions on separation-style formulas. Nonetheless, it can be helpful to think of our assertion language as defined by:

P ::= [φ] | x ↦ y | P ∗ P | ∃x, P
For any pure Coq proposition φ, [φ] is the heap predicate that asserts that φ is true and the heap is empty. We write emp as an abbreviation for [True], which asserts only that the heap is empty. x ↦ y asserts that the heap contains only a mapping from x to y. P1 ∗ P2 asserts that the heap can be broken into two heaps h1 and h2 with disjoint domains, such that h1 satisfies P1 and h2 satisfies P2. The final clause provides existential quantification.

The embedding in Coq provides much more expressive formulas than in most systems based on separation logic. Not only can any pure proposition be injected with [·], but we can also use arbitrary Coq computation to build impure assertions. For instance, we can model deterministic disjunction with pattern-matching on values of algebraic datatypes, and we can include calls to custom recursive functions that return assertions. We need no special support in the assertion language to accommodate this, and Coq's theorem-proving support for reasoning about pattern-matching recursive functions can be used without modification. If we had defined an inductive type of specifications, we would have needed to encode most of the relevant Coq features explicitly. For instance, to allow pattern matching that produces specifications, our inductive type would need a constructor standing for dependent pattern matching, which is quite a tall order on its own.

Perhaps surprisingly, we have met with general success in implementing realistic examples using just these connectives. Standard uses of other connectives can often be replaced by uses of higher-order features, and the connectives that we do use are particularly amenable to automation. In Section 2.2, we try to give a flavor of how to encode disjunction, in the context of a particular example. Fully-automated systems like Smallfoot (Berdine et al. 2005) build in restrictions similar to ours, but it surprised us that we needed little more to do full correctness verification.

2.1.1 The Importance of Computational Irrelevance

What we have described so far is the same as in the original Ynot work. The primary departure of our new system is that we use a more standard separation logic.
The old Ynot separation logic used binary post-conditions that may refer to both the initial and final heaps. (In both systems, specifications may refer to computation result values, so we avoid counting those in distinguishing between "unary" and "binary" post-conditions.) This is in stark contrast to traditional separation logics, where all assertions are separation formulas over a single heap, and all verification proof obligations are implications between such assertions. The utility of this formalism has been borne out in the wealth of tools that have used separation logic for automated verification. In contrast, proofs of the binary post-conditions in the old Ynot tended to involve at least tens of steps of manual proof per line of program code. Today, even pencil-and-paper proofs about relationships between multiple heaps can draw on no logical formalism that comes close to separation logic in crispness or extent of empirical validation. While binary post-conditions are strictly more expressive than unary post-conditions, the separation logic community has developed standard techniques for mitigating the problem.

To make up for this lost expressiveness, we need, in effect, to move to a richer base logic. The key addition that lets us use a more standard formulation is the feature of computationally-irrelevant variables, which correspond to specification variables (also known as "ghost variables") in standard separation logic. Such variables may be mentioned in assertions and proofs only, and an implementation must enforce that they are not used in actual computation. Coq∗, a system based on the Implicit Calculus of Constructions (Barras and Bernardo 2008), supports this feature natively. From a theoretical standpoint, it would be cleanest to implement Ynot as a Coq∗ library. However, in implementing the original Ynot system, we hesitated to switch to this nonstandard branch of the Coq development tree. In designing the new system, we felt the same trepidation, since we might encounter difficulties using libraries written for the standard Coq system, and the users of our library would need to install an unusual version of Coq. We hope that, in the long term, the new Coq∗ features will become part of the standard Coq distribution. For now, we use an encoding of computationally-irrelevant variables that is effective in standard Coq, modulo some caveats that we discuss below.

Our reimplementation employs the trick of representing specification variables in types that are marked as "proofs" instead of "programs," such that we can take advantage of Coq's standard restrictions on "information flow" from proofs to programs. Concretely, the Coq standard library has for some time contained a type family called inhabited, defined by:

Inductive inhabited (A : Type) : Prop :=
  inhabits : A -> inhabited A.
Module Type STACK.
  Parameter t : Set -> Set.
  Parameter rep (T : Set) : t T -> list T -> hprop.
This code demonstrates Coq's standard syntax for inductive type definitions, which is quite similar to the syntax for algebraic datatype definitions in ML and Haskell. This type family has one parameter A of type Type, which can be thought of as the type of all types.¹ The constructor inhabits lets us inject any value into inhabited. While the original value may have an arbitrary type, the inhabited package has a type in the universe Prop, the universe of logical propositions. Terms whose types are in this universe are considered to be proofs and are erased by program extraction. We will see in the following examples how this encoding necessitates some mildly cumbersome notation around uses of irrelevant variables. Further, to reason effectively about irrelevant variables, we need to assert without proof an axiom stating that the constructor inhabits is injective.
Parameter new T : Cmd emp (fun s : t T => rep s nil). Parameter free T (s : t T) : Cmd (rep s nil) (fun _ : unit => emp). Parameter push T (s : t T) (x : T) (ls : [list T]) : Cmd (ls ~~ rep s ls) (fun _ : unit => ls ~~ rep s (x :: ls)). Parameter pop T (s : t T) (ls : [list T]) : Cmd (ls ~~ rep s ls) (fun xo : option T => ls ~~ match xo with | None => [ls = nil] * rep s ls | Some x => Exists ls’ :@ list T, [ls = x :: ls’] * rep s ls’ end). End STACK.
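The [list T] arguments above are the abstract states that these specifications manipulate, and the way each operation transforms them is ordinary functional programming. The reference functions below are our own illustration of that model; they are not part of the Ynot interface.

Definition model_push (T : Set) (x : T) (ls : list T) : list T := x :: ls.

Definition model_pop (T : Set) (ls : list T) : option T * list T :=
  match ls with
  | nil => (None, nil)
  | x :: ls' => (Some x, ls')
  end.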
Axiom pack_injective : forall (T : Set) (x y : T), inhabits x = inhabits y -> x = y. Our library additionally assumes the standard axiom of function extensionality (“functions are equal if they agree at all inputs”) and the very technical “unicity of equality proofs” axiom that is included in Coq’s standard library. This pair of axioms has been proved consistent for Coq’s logic, and we could avoid appealing to extensionality at the cost of more proving work in the library, by formalizing heaps as lists instead of functions. Such a change would be invisible to most users of the library, who only need to use standard theorems proved about the heap model. However, the pack injectivity axiom contradicts the axiom of proof irrelevance (which we do not use in any of our developments, but which is popular among Coq users), and it is an open question in the Coq community whether this axiom is consistent with Coq’s logic even by itself. Past work built a denotational model for Ynot minus this feature (Petersen et al. 2008), and the architects of that model are now considering how to add irrelevance, which would complete the foundational story for the framework that we use in this paper. We hope that the experiences we report here can help to justify the inclusion of irrelevance as a core Coq feature.
Figure 2. The signature of an imperative stack module
2.1.2 The Rules of the Separation Logic

Figure 1 presents the main rules of our separation logic. The notable divergence from common formulations is in the use of existential quantifiers in the rules for freeing, reading, and writing. These differences make sense because Ynot is implemented within a constructive logic. Coq's constructivity is inspired by the Curry-Howard isomorphism, where programs and proofs can be encoded in the same syntactic class. A more standard, classical separation logic would probably require that, in the rule for free, the value v pointed to by p be provided as an argument to the proof rule. In constructive logic, such a value can only be produced when it can be computed by an algorithm, just as a functional program may only refer to a value that it has said how to compute. Additionally, we would not be able to use any facts implied by the current heap assertion to build one of these rule witnesses, and perhaps the witness can only be proved to exist using such facts. The explicit existential quantifier frees us to reason inside the assertion language in finding the witness.

Because it uses quantification in this way, the "read" rule must also take a kind of explicit framing condition. This condition is parameterized by the value being read from the heap, making it a kind of description of the neighborhood around that value in the heap. More standard separation logics force the exact value being read to be presented as an argument to the proof rule, but here we want to allow verification of programs where the exact value to read cannot be computed from the pieces of pure data that are in scope.

We want to emphasize that the changes we have made in the Ynot separation logic have no effect on the theory behind the systems. In both the old and new systems, a separation logic is defined on top of the base Hoare logic with binary post-conditions, introducing no new axioms. Here, we use the same base logic as in the past work, so the past investigations into its metatheory (Petersen et al. 2008) continue to apply. The sole metatheoretical wrinkle is the one which we discussed above, involving computational irrelevance, which is orthogonal to program logic rules.
In the rest of this section, we will introduce the Ynot programming environment more concretely, via several examples of verified data structure implementations.

2.2 Verifying an Implementation of Imperative Stacks

Figure 2 shows the signature of a Ynot implementation of the stack ADT. The signature is expressed in Coq's ML-like module system. Each implementation contains a type family t, where, for any type T, a value of t(T) represents a stack storing elements of T. The rep component of the interface relates an imperative stack s to a functional list ls in a particular state. Thus, rep s ls is a predicate on heaps (hprop) which can be read as "s represents the list ls" in the current state. Just as abstraction over the type family t allows an implementation to choose different data structures to encode the stack, abstraction over the assertion rep allows an implementation to choose different invariants connecting the concrete representation to an idealized model.

In Section 2.1, we gave a grammar for our "specification language." In contrast to most work on separation logic, our real implementation has no such specification language. Rather, we define the type hprop as heap -> Prop, so that specifications and invariants are arbitrary predicates over heaps. In Figure 2, we see notations involving emp, asserting that the heap is empty; [...], for injecting pure propositions; *, for the standard separating conjunction; and Exists, for standard typed existential quantification.
¹ To avoid the standard soundness problems with including a type of all types, actual Coq type-checking infers numerical indices for all occurrences of Type.
Not shown in this figure is the binary “points-to” operator -->. The relative parsing precedences of the operators place --> highest, followed by * and Exists. We only need to use funny symbols for syntax like Exists x :@ T, P (meaning “there exists x of type T such that P”) to avoid confusing the LL(1) parser that is at the heart of Coq’s syntax extension facilities. Our library defines hprop-valued functions implementing these usual separation logic connectives, but users can define their own “connectives” just as easily. For example, here is how we define Exists:
The type of pop showcases how we avoid the disjunctive connectives of separation logic. The function returns an optional T value, which will be None when the stack is empty and will be Some x when x is at the top of the stack. We use a Coq match expression to give a different post-condition for each case. We can implement a module satisfying this signature. With the type T as a local variable, we can define the type of nodes of the linked lists that we will use. We use the abstract type ptr of untyped pointers from the Ynot library.
Definition hprop_ex (T : Type) (p : T -> hprop) := fun h : heap => exists v : T, p v h.
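In the same style, the lifted pure proposition [φ] and emp can be understood as ordinary functions on heaps. The definitions below are a sketch of the idea only, with the heap type kept abstract so the snippet stands alone; the library's actual definitions may differ in detail.

Section ConnectiveSketch.
  Variable heap : Type.
  Variable empty : heap.  (* assumed: the empty heap *)

  Definition hprop := heap -> Prop.

  (* [phi] holds of a heap iff phi is true and the heap is empty *)
  Definition hprop_inj (phi : Prop) : hprop := fun h => phi /\ h = empty.

  (* emp is the injection of the trivially true proposition *)
  Definition hprop_emp : hprop := hprop_inj True.
End ConnectiveSketch.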
Record node : Set := Node { data : T; next : option ptr }.
Here is how we add a syntax extension (or “macro”) that lets us write existential quantification in the way seen in Figure 2: Notation "’Exists’ v :@ T , p" := (hprop_ex T (fun v : T => p)).
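Because hprop is just a function type, a user-defined "connective" can be built with ordinary Coq pattern matching, in the spirit of the deterministic disjunction mentioned in Section 2.1. The definition below is our own illustration; hprop is kept abstract so the snippet stands alone.

Section IfNil.
  Variable hprop : Type.

  (* Choose one of two assertions depending on whether a list is empty:
     a "deterministic disjunction" computed by pattern matching. *)
  Definition if_nil (A : Set) (ls : list A) (P_nil P_cons : hprop) : hprop :=
    match ls with
    | nil => P_nil
    | _ :: _ => P_cons
    end.
End IfNil.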
To define the representation invariant, we want a recursive function specifying what it means for a possibly-null pointer to represent a functional list. Our code contains a struct annotation that gives a termination argument for the function.
By reading the types of the methods exposed in the STACK signature, we can determine the contract that each method adheres to. The Cmd type family is our parameterized monad of computations with separation logic specifications; the two arguments to Cmd give preconditions and postconditions. Cmd is defined in terms of the more primitive ST parameterized monad, in the same way as in our past work (Nanevski et al. 2008)2 . Our specifications follow the algebraic approach to proofs about data abstraction (as in Liskov and Zilles (1975)), where an abstract notion of state is related to concrete states. Each operation needs a proof that it preserves the relation properly. In Ynot developments, abstract states are manipulated by standard, purely-functional Coq programs, and method specifications include explicit calls to these state transformation functions. Each post-condition requires that the new concrete, imperative state be related to the abstract state obtained by transforming the initial abstract state. The type of the new operation tells us that it expects an empty heap on input, and on output the heap contains just whatever mappings are needed to satisfy the representation invariant between the function return value and the empty list. The free operation takes a stack s as an argument, and it expects the heap to satisfy rep on s and the empty list. The post state shows that all heap values associated with s are freed. The specification for push says that it expects any valid stack as input and modifies the heap so that the same stack that stood for some list l beforehand now stands for the list x :: l, where x is the appropriate function argument. We see an argument ls with type [list T]. The brackets are a notation defined by the Ynot library, standing for computational irrelevance. The syntax [T] expands to inhabited T. To review our discussion from Section 2.1.1, this means that the type-checker should enforce that the value of ls is not needed to execute the function. Rather, such values may only be used in stating specifications and discharging proof obligations. We use Coq’s notation scope mechanism to overload brackets for writing irrelevant types and lifted pure propositions. For an assertion P that mentions the irrelevant variable v, the notation v ~~ P must be used to unpack v explicitly. The type of the unpack operation is such that it may only be applied to assertions and may not be used to allow an irrelevant variable’s value to leak into the computational part of a program. Unpacking has no “logical” meaning; it is only used to satisfy the type-checker in the absence of native support for irrelevance. The notation is defined by this equation, where we write “[v’/v]” informally to denote the substitution of variable v’ for variable v in a Coq term.
Fixpoint listRep (ls : list T) (hd : option ptr) {struct ls} : hprop := match ls with | nil => [hd = None] | h :: t => match hd with | None => [False] | Some hd => Exists p :@ option ptr, hd --> Node h p * listRep t p end end. We can represent stacks as untyped pointers to the heads of linked lists built from Nodes. Definition stack := ptr. We achieve type safety through the representation invariant. Definition rep (s : stack) (ls : list T) : hprop := Exists po :@ option ptr, s --> po * listRep ls po. Before we start implementing the ADT methods, we should set up some proof automation machinery. Systems like Smallfoot (Berdine et al. 2005) have hardcoded support for particular heap predicates like acyclic linked list-ness, cyclic linked list-ness, and so on. These systems perform simplifications on formulas that mention the predicates that they understand. In Ynot, on the other hand, the programmer can define his own new predicates, as we have just done. Not only that, but he can also prove lemmas that correspond to the simplification rules built into automated tools, and he can plug his lemmas into a general separation logic solver. All of this is done with no risk that a mistake by the programmer will lead to a faulty verification; every lemma must be proved from first principles. In a real proof, of course, the human proof architect only learns which automation will be effective in the course of verifying his program. Ynot supports this kind of incremental automation very well, as we hope to demonstrate in the rest of the section, using sample interactive Coq sessions. Due to space constraints, we must skip some steps and go straight to the right answers, but we have tried to include enough iteration to give a flavor for Ynot development. As we progress through the methods, we will be improving a custom tactic, or proof procedure, that we will design specifically for this data structure, and we will call that tactic tac. We begin with a version of tac that delegates all work to the separation logic simplifier sep that is included with Ynot.
v ~~ P  =  (exists v', v = inhabits v' /\ P[v'/v])

² The derived monad is called "STsep" in that past work.
tested is rebound with a non-option type. We use ;; instead of ; after imperative commands that do not bind variables, because attempts to do otherwise confuse Coq’s finicky LL(1) parser.
Ltac tac := sep fail auto.

We will explain each of the two parameters to sep as we find a use for it. We implement each stack method by stating its type as a proof search goal, using tactics to realize the goal step by step. The first method to implement is new, and we do so using the syntax New for the new(·) operation from Figure 1.
Definition pop (s : stack) (ls : [list T]) :
  Cmd (ls ~~ rep s ls)
      (fun xo : option T => ls ~~
         match xo with
         | None => [ls = nil] * rep s ls
         | Some x => Exists ls' :@ list T, [ls = x :: ls'] * rep s ls'
         end).
  refine (fun s ls => hd

  Some hd0 * listRep x (Some hd0) ==> hd0 --> ?1960 * hd0 --> ?1960

We can tell that something has probably gone wrong, since the conclusion of the implication contains an unsatisfiable separation formula that mentions the same pointer twice. Our automated separation simplification is quite aggressive and often simplifies satisfiable formulas to unsatisfiable forms, but the results of this process tend to provide hints about which facts would have been useful. In this case, we see a use of listRep where the pointer is known to be non-null. We can prove a lemma that would help simplify such formulas.
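For reference, the body of pop has roughly the following shape. This sketch is reconstructed from the extracted OCaml in Figure 3 rather than copied from the source; the bind arrow (<-), read (!), write (::=), and sequencing (;;) notations are assumed forms of the Ynot combinators used elsewhere in the text, so treat it as an approximation.

refine (fun s ls =>
  hd <- !s;
  IfNull hd Then
    {{Return None}}
  Else
    nd <- !hd;
    Free hd;;
    s ::= next nd;;
    {{Return (Some (data nd))}}); tac.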
v --> None ==> rep v nil
Theorem listRep_Some : forall (ls : list T) (hd : ptr), listRep ls (Some hd) ==> Exists h :@ T, Exists t :@ list T, Exists p :@ option ptr, [ls = h :: t] * hd --> Node h p * listRep t p. destruct ls; sep fail ltac:(try discriminate). Qed.
The syntax ==> is for implication between heap assertions, and it has lower parsing precedence than any of the other operators that we use. We see that it is important to unfold the definition of the representation predicate, so we modify our tactic definition, and now the proof completes automatically.

Ltac tac := unfold rep; sep fail auto.
We prove that a functional list related to a non-null pointer decomposes in the expected way. All it takes is for us to request a case analysis on the variable ls, followed by a call to the separation solver. Here we put to use the second parameter to sep, which gives a tactic to try applying throughout proof search. The discriminate tactic solves goals whose premises include inconsistent equalities over values of datatypes, like nil = x :: ls; and adding try in front prevents discriminate from signaling an error if no such equality exists. We can modify our tac tactic to take listRep Some into account. First, we define another procedure for simplifying an implication.
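The first version of that procedure presumably resembles the revised one shown later (which also handles the None case via listRep_None); as a reconstruction rather than the paper's literal code, it might read:

Ltac simp_prem :=
  simpl_prem ltac:(apply listRep_Some).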
The definitions of free and push are not much more complicated. We use some new notations, including a Haskell-inspired monadic bind syntax, and all are defined in our library with "Coq macros," as in the example of double braces above.

Definition free (s : stack) : Cmd (rep s nil) (fun _ : unit => emp).
  refine (fun s => {{Free s}}); tac.
Qed.

Definition push (s : stack) (x : T) (ls : [list T]) :
  Cmd (ls ~~ rep s ls) (fun _ : unit => ls ~~ rep s (x :: ls)).
  refine (fun s x ls => hd

sepBind (sepStrengthen (sepRead v)) (fun nd ->
  sepSeq (sepStrengthen (sepFrame (sepFree v)))
    (sepSeq (sepStrengthen (sepFrame (sepWrite s (next nd))))
      (sepWeaken (sepStrengthen (sepFrame (sepReturn (Some (data nd))))))))
| None -> sepWeaken (sepStrengthen (sepFrame (sepReturn None))))
Notice that all specification variables and proofs are eliminated automatically by the Coq extractor. With the erasure of weakening and related operations, we arrive at exactly the kind of monadic code that is standard fare for Haskell, such that the compilation techniques developed for Haskell can be put to immediate use in creating an efficient compilation pipeline for Ynot.

It is also worth pointing out that the sort of tactic construction effort demonstrated here is generally per data structure, not per program. We can verify a wide variety of other list-manipulating programs using the same tac tactic that we developed here. Usually, the tactic work for a new data structure centers on identifying the kind of unfolding lemmas that we proved above.

2.3 Verifying Imperative Queues

It is not much harder to implement and verify a queue structure. We define an alternate list representation, parameterized by head and tail pointers.

Fixpoint listRep (ls : list T) (hd tl : ptr) {struct ls} : hprop :=
  match ls with
  | nil => [hd = tl]
  | h :: t => Exists p :@ ptr, hd --> Node h (Some p) * listRep t p tl
  end.
Figure 3. Sample OCaml code extracted from the stack example

We also suggest to sep that try discriminate may be useful throughout proof search.

Ltac tac := unfold rep; sep simp_prem ltac:(try discriminate).

When we rerun the definition of pop, we have made progress. Only one goal remains to prove:
Record queue : Set := Queue { front : ptr; back : ptr }.
emp ==> [x = nil]

We see that this goal probably has to do with a case where we know that the list being modeled is nil. We were successful at using simpl_prem to deal with the case where we know the list is non-nil, and we can continue with that strategy by proving another lemma.
Definition rep’ (ls : list T) (fr ba : option ptr) := match fr, ba with | None, None => [ls = nil] | Some fr, Some ba => Exists ls’ :@ list T, Exists x :@ T, [ls = ls’ ++ x :: nil] * listRep ls’ fr ba * ba --> Node x None | _, _ => [False] end.
Theorem listRep_None : forall ls : list T,
  listRep ls None ==> [ls = nil].
  destruct ls; sep fail idtac.
Qed.

Now our verification of pop completes, after we modify the definition of simp_prem:
Definition rep (q : queue) (ls : list T) := Exists fr :@ option ptr, Exists ba :@ option ptr, front q --> fr * back q --> ba * rep’ ls fr ba.
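The abstract state for queues is again a functional list; enqueueing appends at the back (the ls ++ x :: nil that appears in enqueue's specification below), while dequeueing removes from the front. As with the stack, the reference functions below are our own illustration, not part of the development.

Definition model_enqueue (T : Set) (x : T) (ls : list T) : list T :=
  ls ++ x :: nil.

Definition model_dequeue (T : Set) (ls : list T) : option T * list T :=
  match ls with
  | nil => (None, nil)
  | x :: ls' => (Some x, ls')
  end.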
Ltac simp_prem := simpl_IfNull; simpl_prem ltac:(apply listRep_None || apply listRep_Some).
For this representation, we prove similar unfolding lemmas to those we proved for stacks, with comparable effort. We also need a new lemma for unfolding a queue from the back.
We complete the implementation of the stack ADT with a trivial definition of the type family t, relying on the representation invariant to ensure proper use.
Lemma rep’_back : forall (ls : list T) (fr ba : ptr), rep’ ls (Some fr) ba ==> Exists nd :@ node, fr --> nd * Exists ls’ :@ list T, [ls = data nd :: ls’] * match next nd with | None => [ls’ = nil] | Some fr’ => rep’ ls’ (Some fr’) ba end.
Definition t (_ : Set) := stack. For our modest efforts, we can now extract an executable OCaml version of our module. Figure 3 shows part of the result of running Coq’s automatic extraction command on our Stack module. In the implementation of pop, we see invocations of functions whose names begin with sep. These come from the Ynot library, and we must provide their OCaml implementations. Any Ynot program that returns a type T may be represented in unit -> T in OCaml, regardless of the specification appearing in the original Coq type. This makes it easy to implement the basic functions, in the spirit of how the Haskell IO monad is implemented. We see calls to explicit weakening, strengthening, and framing rules in the extracted code. In OCaml, these can be implemented as no-ops and erased by an optimizer.
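Concretely, extraction is invoked with Coq's vernacular commands. The invocation below is a typical one and is our illustration, not a command taken from the paper; the exact spelling varies across Coq versions ("Ocaml" in older releases, "OCaml" plus a prior "Require Extraction." in current ones).

Extraction Language OCaml.
Recursive Extraction pop.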
The proof of the lemma relies on some lemmas about pure functional lists. With those available, we prove rep'_back in under 20 lines. When we plug this and the two other unfolding lemmas into the sep procedure, we arrive at quite a robust proof procedure for separation assertions about lists that may be modified at either end. Again, in our final queue implementation, every proof obligation is proved by a tac tactic built from sep. We write under 10 lines of new tactic hints to be applied during proof search, and we
Module Type MEMO. Parameter T : Set. Parameter t : forall (T’ : T -> Set), hprop -> (forall x, T’ x -> Prop) -> Set. Parameter rep : forall (T’ : T -> Set) (inv : hprop) (fpost : forall x, T’ x -> Prop), t inv fpost -> hprop. Parameter create : forall (T’ : T -> Set) (inv : hprop) (fpost : forall x, T’ x -> Prop), (forall x, Cmd inv (fun y : T’ x => [fpost _ y] * inv)) -> Cmd emp (fun m : t inv fpost => rep m). Parameter funcOf : forall (T’ : T -> Set) (inv : hprop) (fpost : forall x, T’ x -> Prop) (m : t inv fpost), forall (x : T), Cmd (rep m * inv) (fun y : T’ x => rep m * [fpost _ y] * inv). End MEMO.
must prove one key lemma by induction. We discover the importance of this lemma while trying to verify an implementation of enqueueing.

Definition enqueue : forall (q : queue) (x : T) (ls : [list T]),
  Cmd (ls ~~ rep q ls) (fun _ : unit => ls ~~ rep q (ls ++ x :: nil)).
  refine (fun q x ls => ba

ls ~~ [res = ls] * listRep ls hd).
refine (Fix2 (fun hd ls => ls ~~ listRep ls hd)
             (fun hd ls res => ls ~~ [res = ls] * listRep ls hd)
             (fun self hd ls =>
                IfNull hd Then
                  {{Return nil}}
                Else fn

listRep (ls ++ x :: nil) fr nd.
  Hint Resolve himp_comm_prem.
  induction ls; tac.
Qed.

To get the original verification to go through, we only need to add this lemma to the hint database, using a built-in Coq command.

Hint Immediate push_listRep.

2.4 Loops

Like with many semi-automated verification systems, we require annotations that are equivalent to loop invariants. Since Coq's programming language is functional, it is more natural to write loops as recursive functions, and the loop invariants become the pre- and post-conditions of these functions. We support general recursion with a primitive fixpoint operator in the base program logic, and it is easy to build a separation logic version on top of that. We can also build multiple-argument recursive and mutually-recursive function forms on top of the single-argument form, without needing to introduce new primitive combinators. An example is a getElements function, defined in terms of the list invariant that we wrote for the stack example. This operation returns the functional equivalent of an imperative list. The task is not so trivial as it may look at first, because the computational irrelevance of the function's second argument prohibits its use to influence the return value. This means that we are not allowed to name the irrelevant argument as one that decreases on each recursive call, which prevents us from using Coq's native recursive function definitions, where every function must be proved to terminate using simple syntactic criteria. Nonetheless, the definition is easy using the general recursion combinators supported by Ynot.
The code demonstrates a use of one of the derived fixpoint combinators, Fix2. Of the three arguments that we pass, the first two give the pre-condition and post-condition in terms of the two "real" arguments (and, for the post-condition, the return value). The third argument is the function body. It takes a recursive self-reference as its first argument, followed by the two "real" arguments. Native Coq recursive function definitions must often include annotations explaining why they terminate, but Ynot deals only with partial correctness, so no such annotations are required for our fixpoint combinators.

The notation x ~~~ e is for building a new computationally-irrelevant value out of an old one. The notation e P is explicit invocation of the frame rule. With the current system, one usually wants to invoke that rule at each function call. The framing assertion can be written as an underscore to ask that it be inferred.

2.5 A Dependently-Typed Memoizing Function

As far as we have been able to determine, all previous tools for data structure verification lack either aggressive automation or support for higher-order features. The original Ynot supported easy integration of higher-order functions and dependent types, but the very manual proof style became even more onerous for such uses. Our reimplemented Ynot maintains the original's higher-order features,
Definition getElements (hd : option ptr) (ls : [list A]) :
and our proof automation integrates very naturally with them. This is a defining advantage of our new framework over all alternatives. For instance, it is easy to define a module supporting memoization of imperative functions. Figure 4 gives the signature of our implementation, which is actually an ML-style functor that produces implementations of this signature when passed appropriate input modules. The type T is the domain of memoizable functions, and types like t inv fpost stand for memo tables. The argument inv is an assertion giving an invariant that the memoized function maintains, and the pure assertion fpost gives a relation between inputs and outputs of the function. The rep predicate captures the heap invariants associated with a memo table. The create function produces a memo table when passed an imperative function with the proper specification. Finally, the funcOf function maps a memo table to a function that consults the table to avoid recomputation. We can implement a MEMO functor in 50 lines when we use a memo table that only caches the most recent input-output pair. Like in the previous examples, we build a specialized automation procedure with a one-line instantiation of library tactics. We give a 7-line definition of rep, give one one-liner proof of a lemma to use in proof search, and include two lines of annotations within the definition of funcOf. All of the rest of the development is no longer or more complicated than in ML. Compared to ML, we have the great benefit of using types to control the behavior of functions to be memoized. A function could easily thwart an ML memoizer by producing unexpected computational effects.
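To make the signature concrete, the parameters a client supplies for a simple memoized function have the following shape; the squaring example and the names are ours, and only the types mirror Figure 4 (the invariant inv would be emp for a computation that owns no extra state).

Definition T : Set := nat.                     (* domain of memoizable functions *)
Definition T' (_ : T) : Set := nat.            (* per-input result type, non-dependent here *)
Definition fpost (x : T) (y : T' x) : Prop :=  (* relation between inputs and outputs *)
  y = x * x.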
The effective range of specifications is too large to be solvable by any particular "magic bullet" tactic. Nonetheless, we have found that, in practice, a specific parameterized proof strategy can discharge most obligations. In contrast to the situation with classical verification tools that are backed by automated first-order theorem provers, when any proof strategy fails in Coq, the user can always program his own new strategy or even move to mostly-manual proof. However, our experience suggests to us that most goals about data structures can be solved by the procedure that we present in this section. That procedure is implemented as the sep tactic that we used in our examples. We do not have space to include the literal Coq code implementing it; we will outline the basic procedure instead.

The implementation is in Coq's Ltac language (Delahaye 2000), a domain-specific, dynamically-typed language for writing proof-generating proof search procedures. All of the proof scripts we have seen so far are really Ltac programs. The full language includes recursive function definitions, which, along with pattern-matching on proof goals, makes it possible to code a wide variety of proof manipulation procedures.

As our examples have illustrated, sep takes two arguments, which we will call unfolder and solver. The task of unfolder is to simplify goals before specification inference, usually by unfolding definitions of recursive predicates, based on known facts about their arguments. The task of solver is to solve all of the goals that remain after generic separation logic reasoning is applied. Coq comes with the standard tactic tauto, for proving propositional tautologies. There is a more general version of tauto called intuition, which will apply a user-supplied tactic to finish off sub-proofs, while taking responsibility for handling propositional structure on its own. The intuition tactic also exhibits the helpful behavior of leaving for the user any subgoals that it could not establish. sep is meant to be an analogue of intuition for separation logic. We also want it to handle easy instantiation of existential quantifiers, since they appear so often in our specifications. We can divide the operation of sep into five main phases. We will sketch the workings of each phase separately.
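The overall architecture, an outer loop that alternates generic simplification with a user-supplied solver and finishes trivial implications by reflexivity, can be caricatured with a few lines of plain Ltac. This is our sketch of the shape, not the library's sep.

Ltac toy_sep solver :=
  repeat progress (simpl; intuition; solver);
  try reflexivity.

(* Example use, supplying eauto as the solver: toy_sep ltac:(eauto). *)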
3. Tactic Support

The examples from the last two sections show how many of the gory details of proofs can be hidden from programmers. In actuality, every command triggers the addition of one or more proof obligations that cannot be discharged effectively by any of the built-in Coq automation tactics. Not only is it hard to prove the obligations, but it is also hard to infer the right intermediate specifications. Our separation logic formulas range well outside the propositional fragment that automated tools tend to handle; specification inference and proving must deal with higher-order features. Here is an example of the proof obligations generated for the code we gave earlier for the stack push method. Numbers prefixed with question marks are unification variables, whose values the sep tactic must infer.
3.1 Simple Constraint Solving

It is trivial to determine the proper value for any unification variable appearing alone on one side of the implication. For instance, given the goal

p --> x * q --> y ==> ?123
ls ~~ rep s ls ==> Exists v :@ option ptr, s --> v * ?200 v
we simply set ?123 to p --> x * q --> y. Given the slightly more complicated goal
forall v : option ptr, s --> v * ?200 v ==> ?192 v
p --> x * q --> y ==> ?123 x
?192 hd ==> ?217 * emp
we abstract over x in the premise to produce fun x’ => p --> x’ * q --> y.
forall v : ptr, ?217 * v --> Node x hd ==> ?206 v
3.2 Intermediate Constraint Solving

When the trivial unification rules are not sufficient, we need to do more work. We introduce names for all existential quantifiers and computationally-irrelevant variables in the premise. For instance, starting with
?206 nd ==> ?234 * (Exists v' :@ ?231, s --> v')

?234 * s --> Some nd ==> ls ~~ rep s (x :: ls)
m ~~ Exists v :@ T, p --> v * rep m v ==> ?123 * Exists x :@ T, p --> x
We can see that each goal, compared to the previous goals, has at most one new unification variable standing for a specification; of the two new variables appearing in the second last line, one stands for a type, which will be easy to infer by standard unification, once the values of prior variables are known. Also, each new specification variable has its value determined by the value of the new variable from the previous goal. This is no accident; we designed our combinators and notations to have this property.
we introduce names to simplify the premise, leading to this goal:

p --> v' * rep m' v' ==> ?123 * Exists x :@ T, p --> x

Now we run the user's unfolder tactic, which might simplify some use of a definition. Let us assume that no such simplification
performed by the user’s solver tactic, until no further progress can be made. Finally, sep discharges all goals of the form P ==> P, by reflexivity. Every step of the overall process is implemented in Ltac, so that only a bug in Coq would allow sep to declare an untrue goal as true, no matter which customization the programmer provides. By construction, every step builds an explicit proof term, which can be validated afterward with an independent checker that is relatively simple, compared to the operation of all of the decision procedures that may have contributed to the proof.
occurs for this example. We notice that the points-to fact on the right mentions the same pointer as a fact on the left, so these two facts may be unified, implying x = v’. Canceling this known information, we are left with rep m’ v’ ==> ?123 which is resolvable almost trivially. We cannot give ?123 a value that mentions the variables m’ and v’, since we introduced them with elimination rules within our proof. These variables are not in scope at the point in the original program where the specification must be inserted. Instead, we remember how each local variable was introduced and re-quantify at the end, like this:
4. Evaluation
m ~~ Exists v :@ T, rep m v ==> ?123
We have used our environment to implement and verify several data structures, including the Stack and Queue examples that appeared in Section 2. We also follow the evaluation of our prior Ynot system in implementing a generic signature of imperative finite maps. We built three very different implementations: a trivial implementation based on pointers to heap-allocated functional association lists, an implementation based on binary search trees, and an implementation based on hash tables. Any of the implementations can be used interchangeably via ML-style functors, and their shared signature is phrased in terms of dependently-typed maps, where the type of data associated with a key is calculated from an arbitrary Coq function over that key. Our largest example, a packrat PEG parser (Ford 2004), uses these finite maps to cache intermediate results. We also verified one more exotic data structure: binomial trees, which are tree structures with a non-trivial rule for determining how many pointers are stored at each node. This data structure is often applied in implementing priority queues. Our implementation is interesting in its use of a dependently-typed recursive function to characterize functional models of such trees. Finally, we chose representative examples from two competing data structure verification systems, Smallfoot (Berdine et al. 2005) and Jahob (Zee et al. 2008), and reimplemented those examples in our new Ynot.

Figure 5 presents code size statistics for our case studies. "Program" code is code that is preserved by extraction. "Specs" are the pre- and post-conditions of every function defined in the module. The core of a Ynot module consists of heap representation "rep" code (e.g., the definitions named rep in our examples), along with proofs (e.g., push_listRep) and tactics (e.g., simp_prem) dealing with these representations. The annotations column counts the number of lines of programmer-specified annotations. The total overhead column sums proofs, tactics, and annotations. We also present type-checking and proving times (in minutes and seconds), as measured on a 2.8 GHz Pentium D with 1 GB of memory. So far, we have not optimized our tactics for running time; they are executed by direct interpretation of programs in a dynamically-typed language.
Now the trivial unification is valid. The crucial part of this process was the matching of the two points-to facts. We have special-case rules for matching conclusion facts under quantifiers, for conclusions that match the pre-conditions of the read, write, and free rules. Beyond that, we apply cancellation of identical terms on the two sides of the implication, when those terms do not fall under the scopes of quantifiers. These simple rules seem to serve well in practice.

3.3 Premise Simplification

After specification inference, the next step is to simplify the premise of the implication. Any emp in the premise may be removed, and any lifted pure formula [φ] may be removed from the implication and added instead to the normal proof context. We also remove existential quantifiers and irrelevant variable unpackings in the same way as in the previous phase.

3.4 Conclusion Simplification

The main sep loop is focused on dealing with parts of the conclusion. We remove occurrences of emp, and we remove any pure formula [φ] that the user's solver tactic is able to prove. An existential formula Exists x :@ T, P(x) in the conclusion is replaced by P(?456), for a fresh unification variable ?456. When no more of these rules apply, we look for a pair of unifiable subformulas on the sides of the implication. All such pairs are unified and crossed out. This may determine the value of a variable introduced for an existential quantifier. For instance, say we begin with this goal.

[m < 17] * p --> m ==> Exists x :@ nat, p --> x * [x < 42]

Premise simplification would move the initial pure fact into the normal proof context, leaving us with this.

p --> m ==> Exists x :@ nat, p --> x * [x < 42]

Conclusion simplification would introduce a name for the existentially-bound variable.
Our previous version of Ynot placed a significant interactive proof burden on the programmer. The previous Ynot hash table, for instance, required around 320 explicit Coq tactic invocations. Each tactic invocation (indicated by a terminating “.” in Coq) represents a manual intervention by the Ynot programmer. These invocations tended to be low-level steps, like choosing which branch of a disjunction to prove. As such, these proofs are brittle in the face of minor changes. In some previous Ynot developments, the ratio of manual proof to program text is over 10 to 1. For comparison, a large scale compiler certification effort (Leroy 2006) has reported a proof-to-code ratio of roughly 6 to 1. In contrast, our new hash table requires only about 70 explicit tactic invocations. These invocations tend to be high level steps, like performing induction or invoking the sep tactic. We have
p --> m ==> p --> ?789 * [?789 < 42]

Next, conclusion simplification would match the two p points-to facts, since their pointers unify trivially.

emp ==> [m < 42]

This goal can be reduced to emp ==> emp by using the normal proof context to deduce the fact inside the brackets.

3.5 Standard Coq Automation

When sep has run out of rules to apply, the remaining subgoal is subjected to standard Coq automation. Propositional structure and calls to recursive functions are simplified where possible. sep ends by running a loop over those simplifications and the simplifications
                               Program  Specs  Rep  Proofs  Tactics  Annotations  Total Overhead  Time (m:s)
Stack                               14      8   14       7        5            0              12        0:12
Queue                               26     12   22      41       25            0              66        1:36
Ref to Functional Finite Map         8     16    2       2        2            0               4        0:05
Hash Table                          34     21    6      70       38           34             142        0:45
BST Finite Map                      31     16    6      22        8            4              34        1:35
Binomial Tree                       19     12   13       0        9            7              16        2:33
Association List                    48     34   17      41       51           10             102        3:10
Linked List Segments                84     34   19      91      208            7             306        2:15
Packrat PEG Parser                 277    110   15     102       55            5             162        1:20
Figure 5. Breakdown of numbers of lines of different kinds of code in the case studies
observed that such tactic-based proofs are significantly easier to maintain. We also made rough comparisons against two verification systems that do not support reasoning about first-class functions. The Jahob (Zee et al. 2008) system allows the specification and verification of recursive, linked data structures in a fragment of Java. We implemented an association list data structure that is included as an example in the Jahob distribution. Code-size-wise, the two implementations are quite similar. For instance, they both require around twenty lines of heap representation code, and they both require about a dozen lines of code for the lookup function’s loop invariant. Our Ynot implementation uses explicit framing conditions in places where Jahob does not, but we speculate that we can probably remove these annotations with additional, custom automation. Our second comparison is against the Smallfoot (Berdine et al. 2005) system, which does completely automated verification of memory safety via separation logic. We implemented Ynot versions of 10 linked list segment functions included with the Smallfoot distribution. In each case, the Ynot and Smallfoot versions differed by no more than a few lines of annotation burden.
5. Related Work

Considering the two automated systems that we just mentioned, Smallfoot uses a very limited propositional logic, and Jahob uses an undecidable higher-order logic. Many interesting program specifications cannot be written in Smallfoot's logic and cannot be proved to hold by Jahob's automated prover. Neither of these systems supports higher-order programs, and neither supports custom-programmed proof procedures, for cases where standard automation is insufficient. The ESC/Java (Flanagan et al. 2002) and Spec# (Barnett et al. 2004) systems tackle some related problems within the classical verification framework. These systems have strictly less support for modeling data structures than Jahob has, so that it is impractical to use them to perform full verifications of many data structures.

A number of systems have been proposed recently to support dependently-typed programming in a setting oriented more towards traditional software development than Coq is. Agda (Norell 2007) and Epigram (McBride and McKinna 2004) are designed to increase the convenience of programming in type theory over what Coq provides, but, out of the box, these systems support neither imperative programming nor custom proof automation. ATS (Chen and Xi 2005) includes novel means for dealing with imperative state, but it includes no proof automation beyond decision procedures for simple base theories like linear arithmetic. This makes it much harder to write verified data structure implementations than in Ynot. Concoqtion (Pasalic et al. 2007) allows the use of Coq for reasoning about segments of general OCaml programs. While those programs may use imperativity, the Coq reasoning is restricted to pure index terms. Sage (Gronski et al. 2006) supports hybrid type-checking, where typing invariants may be specified with boolean-valued program functions and checked at runtime. This approach generally does not enable full static correctness verification.

Partly as a way to support imperative programming in type theory, Swierstra and Altenkirch (2007) have studied pure functional semantics for effectful programming language features, with embeddings in Haskell and Agda. Charguéraud and Pottier (2008) have demonstrated a translation from a calculus of capabilities to a pure functional language. In each case, the authors stated plans to do traditional interactive verification on the pure functional models that they generate. Since such verification is generally done in logics without general recursion, these translations cannot be used to verify general recursive programs without introducing an extra syntactic layer, in contrast to Ynot. Each other approach also introduces restrictions on the shape of the heap, such as the absence of stored impure functions in the case of Swierstra and Altenkirch's work.

Other computer proof assistants are based around pure functional programming languages, with opportunities for encoding and verifying imperative programs. Nonetheless, we see the elegance of our approach as depending on the confluence of a number of features not found in other mature proof assistants. ACL2 (Kaufmann and Moore 1997) does not support higher-order logic or higher-order functional programming. Bulwahn et al. (2008) describe a system for encoding and verifying impure monadic programs in Isabelle/HOL. Their implementation does not support storing functions in the heap. They suggest several avenues for loosening this restriction, and the approaches that support heap storage of impure functions involve restricting attention to functions that are constructive or continuous (properties that hold of all Coq functions), necessitating some extra proof burden or syntactic encoding.

There is closely related work in the field of shape analysis. The TVLA system (Sagiv et al. 2002) models heap shapes with a first-order logic with a built-in transitive closure operation. With the right choices of predicates that may appear in inferred specifications, TVLA is able to verify automatically many programs that involve both heap shape reasoning and reasoning in particular decidable theories such as arithmetic. The Xisa system (Chang and Rival 2008) uses an approach similar to ours, as Xisa is based on user specification of inductive characterizations of shape invariants. Xisa builds this inductive definition mechanism into its framework, while we inherit a more general mechanism from Coq. Xisa is based on hardcoded algorithms for analyzing inductive definitions and determining when and how they should be unfolded. Such heuristics lack theoretical guarantees about how broadly they apply. In the design of our system, we recognize this barrier and allow users to extend the generic solver with custom rules for dealing with custom inductive predicates.

In comparing the new Ynot environment to the above systems and all others that we are aware of, there are a number of common advantages. No other system supports both highly-automated
proofs based on separation logic (when they work) and highly human-guided proofs (when they are needed), let alone combinations of the two. None of the systems with significant automation support the combination of imperative and higher-order features, like we handle in the example of our higher-order memoizer and iterators. We also find no automated systems that deal with dependent types in programs. The first of these advantages seems critical in the verification of imperative programs that would be difficult to prove correct even if refactored to be purely functional. For instance, it seems plausible that our environment could be used eventually to build a verified compiler that uses imperative data structures for efficient dataflow analysis, unification in type inference, and so on. None of the purely-automated tools that we have surveyed could be applied to that purpose without drastic redesign. We are not aware of any previous toolkit for manual proof about imperative programs in proof assistants that would make the task manageable; the manual reasoning about state would overwhelm “the interesting parts” of compiler verification.
6. Conclusions & Future Work

The latest Ynot source distribution, including examples, can be downloaded from the project web site:
http://ynot.cs.harvard.edu/

Concurrency is a big area for future work on Ynot. Systems like Smallfoot (Berdine et al. 2005) do automated separation-logic reasoning about memory safety of concurrent programs. We would like to extend that work to full correctness verification, by designing a monadic version of concurrent separation logic that fits well within Coq. The full potential of the Ynot approach also depends on explicit handling of other computational effects, such as exceptions and input-output. Our prior prototype handled the former, and ongoing work considers supporting the latter.

As with any project in automated theorem proving, there is always room for improvements to automation and inference. A future version of Ynot could benefit greatly in usability by incorporating abstract interpretation to infer specifications, as several automated separation logic tools already do. Nonetheless, our current system already fills a crucial niche in the space of verification tools. We have presented the first tool that performs well empirically in allowing mixes of manual and highly-automated reasoning about heap-allocated data structures, as well as the first tool to provide aggressive automation in proofs of higher-order, imperative programs. We hope that this will form a significant step towards full functional verification of imperative programs with deep correctness theorems.
References
Mike Barnett, K. Rustan M. Leino, and Wolfram Schulte. The Spec# programming system: An overview. In Proc. CASSIS, 2004.

Bruno Barras and Bruno Bernardo. The Implicit Calculus of Constructions as a programming language with dependent types. In Proc. FoSSaCS, 2008.

Josh Berdine, Cristiano Calcagno, and Peter W. O'Hearn. Smallfoot: Modular automatic assertion checking with separation logic. In Proc. FMCO, 2005.

Yves Bertot and Pierre Castéran. Interactive Theorem Proving and Program Development. Coq'Art: The Calculus of Inductive Constructions. Texts in Theoretical Computer Science. Springer Verlag, 2004.

Lukas Bulwahn, Alexander Krauss, Florian Haftmann, Levent Erkök, and John Matthews. Imperative functional programming with Isabelle/HOL. In Proc. TPHOLs, 2008.

Bor-Yuh Evan Chang and Xavier Rival. Relational inductive shape analysis. In Proc. POPL, 2008.

Arthur Charguéraud and François Pottier. Functional translation of a calculus of capabilities. In Proc. ICFP, 2008.

Chiyan Chen and Hongwei Xi. Combining programming with theorem proving. In Proc. ICFP, 2005.

David Delahaye. A tactic language for the system Coq. In Proc. LPAR, 2000.

Cormac Flanagan, K. Rustan M. Leino, Mark Lillibridge, Greg Nelson, James B. Saxe, and Raymie Stata. Extended static checking for Java. In Proc. PLDI, 2002.

Bryan Ford. Parsing expression grammars: A recognition-based syntactic foundation. In Proc. POPL, 2004.

Jessica Gronski, Kenneth Knowles, Aaron Tomb, Stephen N. Freund, and Cormac Flanagan. Sage: Hybrid checking for flexible specifications. In Proc. Scheme Workshop, 2006.

Matt Kaufmann and J. S. Moore. An industrial strength theorem prover for a logic based on Common Lisp. IEEE Trans. Softw. Eng., 23(4), 1997.

Xavier Leroy. Formal certification of a compiler back-end or: programming a compiler with a proof assistant. In Proc. POPL, 2006.

Barbara Liskov and Stephen N. Zilles. Specification techniques for data abstractions. IEEE Trans. Software Eng., 1(1):7–19, 1975.

Conor McBride and James McKinna. The view from the left. J. Functional Programming, 14(1):69–111, 2004.

Aleksandar Nanevski, Greg Morrisett, and Lars Birkedal. Polymorphism and separation in Hoare Type Theory. In Proc. ICFP, 2006.

Aleksandar Nanevski, Greg Morrisett, Avraham Shinnar, Paul Govereau, and Lars Birkedal. Ynot: Reasoning with the awkward squad. In Proc. ICFP, 2008.

Ulf Norell. Towards a practical programming language based on dependent type theory. PhD thesis, Chalmers University of Technology, 2007.

Emir Pasalic, Jeremy Siek, Walid Taha, and Seth Fogarty. Concoqtion: Indexed types now! In Proc. PEPM, 2007.

Rasmus L. Petersen, Lars Birkedal, Aleksandar Nanevski, and Greg Morrisett. A realizability model for impredicative Hoare Type Theory. In Proc. ESOP, 2008.

John C. Reynolds. Separation logic: A logic for shared mutable data structures. In Proc. LICS, 2002.

Mooly Sagiv, Thomas Reps, and Reinhard Wilhelm. Parametric shape analysis via 3-valued logic. ACM TOPLAS, 24, 2002.

Wouter Swierstra and Thorsten Altenkirch. Beauty in the beast: A functional semantics for the awkward squad. In Proc. Haskell Workshop, 2007.

Karen Zee, Viktor Kuncak, and Martin Rinard. Full functional verification of linked data structures. In Proc. PLDI, 2008.
Experience Report: seL4: Formally Verifying a High-Performance Microkernel
Gerwin Klein (NICTA and University of NSW) [email protected]
Philip Derrin (NICTA) [email protected]
Kevin Elphinstone (NICTA and University of NSW) [email protected]
Abstract
We report on our experience using Haskell as an executable specification language in the formal verification of the seL4 microkernel. The verification connects an abstract operational specification in the theorem prover Isabelle/HOL to a C implementation of the microkernel. We describe how this project differs from other efforts, and examine the effect of using Haskell in a large-scale formal verification. The kernel comprises 8,700 lines of C code; the verification more than 150,000 lines of proof script.
Categories and Subject Descriptors D.2.4 [Software Engineering]: Software/Program Verification; D.1.1 [Programming Techniques]: Functional Programming; D.4.5 [Operating Systems]: Reliability—Verification
General Terms Verification, Design, Languages
Keywords Haskell, seL4, microkernel, Isabelle/HOL
Figure 1. Specification layers in the L4.verified project. [The figure shows the abstract specification, the executable specification, and the high-performance C implementation connected by refinement proofs in Isabelle/HOL; the executable specification is produced from the Haskell prototype by automatic translation.]
1. Introduction
We report on our experience using the functional programming language Haskell in the formal verification of the seL4 microkernel (Elphinstone et al. 2007). The seL4 kernel is an evolution of the high-performance L4 microkernel family (Liedtke 1995) for secure, embedded devices. It provides essential operating system services such as threads, inter-process communication, virtual memory, interrupts, and authorisation via capabilities. In earlier work (Derrin et al. 2006), we reported on our experience with Haskell as a specification language for seL4. In this paper, we concentrate on the effect our choice of Haskell had on the formal verification of the kernel, from abstract operational specification down to high-performance C code. To our knowledge this is the first large-scale formal verification project that employs Haskell (or any other functional programming language) in this way.¹ We found that working with Haskell decreased our kernel design time, enabled an iterative prototyping process in an area where usually only top-down and bottom-up approaches are advocated,
and made formal verification towards abstract and concrete levels substantially easier and faster than they would have been otherwise. The basic structure of the verification project is shown in Fig. 1. The left-hand side follows the classic pattern of a traditional refinement. There is an abstract specification at the top, an executable specification in the middle, and a C implementation on the bottom. Elkaduwe et al. (2008) have also created an even more abstract security model with security proof that would be placed above the abstract specification, but it has not yet been formally connected with the rest of the stack. In the setting of a commercial Common Criteria evaluation, the abstract specification is the high-level design and the executable specification is the low-level design. Cock et al. (2008) present details on the proof between abstract and executable level; the proof between executable specification and implementation will appear elsewhere (Winwood et al. 2009). While neither the main property (functional correctness) nor the main proof methodology (refinement) are unusual, the size and scope of the project are. The verification does not stop at a specification, but descends to the implementation level: 8,700 lines of high-performance, manually tuned C code close to hardware. All proofs in this project are machine-checked in the interactive theorem prover Isabelle/HOL (Nipkow et al. 2002). The project is also unusual in the approach it takes to kernel design and implementation. Two teams were involved: a kernel design team with an operating systems background, and a verification team with a formal methods background. The right-hand side of Fig. 1 indicates that the executable specification of seL4 is produced from a kernel prototype written in Haskell. We have implemented an automatic translator that converts our subset of Haskell into Isabelle/HOL. The Haskell prototype is written and maintained by the design team. It is the principal embodiment of their design decisions. It also became, after automatic translation, the starting point for the verification effort. The traditional model for greenfield projects is to work top-down from a high-level specification, and for existing implementations to work bottom-up from the implementation level; beginning verification with an executable specification is unusual. The effects of this choice on the verification are discussed below. The abstract specification and the C implementation were developed manually; both were started before the design — and therefore the executable specification — was completely stable. Both activities fed changes back into the Haskell prototype, but they did not supersede it. The Haskell prototype remained the central reference model throughout the duration of the project.
¹ The ACL2 prover uses LISP as its formal language. Our use of Haskell differs in the sense that our executable kernel prototype in Haskell is an independent program that can stand on its own without theorem prover involvement.
2. Executable and Abstract Specification
Microkernels in the L4 family share a number of basic design principles, set out by Liedtke (1995). They provide only the abstractions that are essential for performance or security — primarily virtual memory, threads, and inter-process communication. They are designed with an emphasis on IPC performance, which is critical to the overall performance of microkernel-based systems. They have, historically, provided only minimal support for managing kernel resources and for controlling access to communication. The seL4 project set out to design a new microkernel based on the same basic principles, but taking a new approach with capability-based resource management and access control. Since this entailed designing a new API that was significantly different to that of previous L4 kernels, there was a large design space to explore. We decided, at the outset, that implementing a new kernel in C from scratch when the design was still uncertain was a risky proposition; much time would be wasted on rewrites and low-level, hardware-dependent debugging before a final design emerged. On the other hand, designing a kernel formally on paper before implementing it might result in an API with limited application to realworld use. We wanted to be able to execute user-level programs to test our proposed designs. Also, since we intended to formally verify safety properties of the kernel (Tuch et al. 2005), we desired a precise specification with well-defined semantics. Thus, we developed an executable model of the design in Haskell. This model was gradually developed into a complete prototype, exploring various design alternatives in the process (see Derrin et al. 2006). We were able to exercise user-level programs on the binary level by attaching a processor and platform simulation to the Haskell prototype. At the same time, we formalised the Haskell prototype by translating it into Isabelle/HOL; the translation was initially performed by hand, but was later automated. When, after a number of iterations, the kernel design in Haskell had begun to stabilise, we constructed an abstract, operational specification with fewer data structure details, and with features like scheduling underspecified so that different implementation choices could be explored in later versions of the kernel. The abstract specification is meant to specify what the kernel does; the executable specification gives details on how it is done. This initial abstract formalisation process provided immediate feedback on correctness and safety to the design team. The feedback increased when we started the refinement proof between the two layers (see Sect. 4). Isabelle/HOL is based on lambda calculus, and can be seen as a functional programming language extended with logical operators. It is less expressive than Haskell in some ways: every function must terminate, which limits use of laziness; type classes cannot have multiple parameters, and there are no constructor type class, so there is no built-in Monad typeclass, nor was there initially dosyntax for lists of sequential operations. However, Isabelle’s syntax is easily extensible, and we were able to define our own do-syntax for the specific monads used by our abstract and executable models. Unsurprisingly, the Isabelle/HOL standard library is geared towards theorem proving and felt therefore limited for implementing
a large-scale functional program. Again, this was easily extended. We were able to implement all of the monad and list functions of the Haskell library that we used in Isabelle. Proving termination for every function was less difficult than we anticipated — for primitive recursion and for functions with a simple lexicographic termination measure the proofs are straightforward and in many cases entirely automatic. However, we avoided complex recursion patterns, such as nested mutual recursion, because they would have been more difficult to translate. This was also desirable because the microkernel has to operate on strictly limited and known stack depth, so recursion ultimately had to be implemented by loops in C anyway. We used only a constrained subset of Haskell that could be translated to Isabelle/HOL. Besides forgoing the use of laziness in any essential way, we made limited use of type classes in Haskell (in particular using only three specific instances of Monad m), and avoided most GHC extensions. The use of Haskell at this stage had two main effects on the verification component of the project. First, it removed the need to interpret vague and inaccurate natural language design specifications, user manuals, or incomprehensible optimised C code. Second, it constrained the design team to a subset of Haskell that could be handled by the automatic translation, leading them to instinctively favour designs suitable for formal specification. Since extending the subset had an obvious cost in terms of modifications to the translator, there was a natural counter-force to increasing this subset too much. The question How would we write this in Haskell?, and therefore How can it be formalised?, was a topic in design team discussions. In a normal kernel design process, it would not have been.
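To illustrate the flavour of such a restricted subset, the following is a small made-up sketch (not code from the seL4 prototype; the state type and function names are hypothetical). A kernel operation written with a concrete state monad, first-order data, and primitive recursion over a list is the kind of definition that translates almost verbatim into Isabelle/HOL, with a trivial termination argument.

```haskell
-- A minimal sketch, assuming a hypothetical kernel state; not seL4 code.
import Control.Monad.State

newtype ThreadId = ThreadId Int deriving (Eq, Show)

data KernelState = KernelState
  { readyQueue :: [ThreadId]   -- threads that are ready to run
  , current    :: ThreadId     -- the currently running thread
  } deriving Show

type Kernel a = State KernelState a

-- Pick the next ready thread, if any. Only a concrete monad, records and
-- a simple case split on a list are used, so both the translation to
-- Isabelle/HOL and the termination proof are straightforward.
chooseNext :: Kernel (Maybe ThreadId)
chooseNext = do
  q <- gets readyQueue
  case q of
    []     -> return Nothing
    t : ts -> do
      modify (\s -> s { readyQueue = ts, current = t })
      return (Just t)
```

Laziness, multi-parameter type classes and more complex recursion patterns are exactly the features such code avoids.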
3. High Performance C Implementation
The goals of the Haskell prototype were twofold: predictable behaviour to provide an easy path to formalisation, and enough detail to provide an easy path to a C implementation. The second requirement, especially, led to an imperative-style Haskell program with extensive use of the StateT and ErrorT monads, including an explicit model of kernel memory addressed by typed pointers. An explicit hardware interface made it easier to connect the prototype to different simulators (M5, qemu, and our own ARMv6 instruction simulator). This interface also became the machine interface of the C kernel. As performance tuning is essential for microkernels, we did not attempt to generate C code from the model, but implemented the kernel manually, following the structure of the executable specification closely. The direct C implementation work was roughly 2 person months in effort, which is insignificant compared to the 20 person years spent on the complete project. The extremely rapid manual implementation was possible thanks to the precise executable specification. Not many implementation choices had to be made, and the structure of the program was clearly laid out already. D. Wheeler’s SLOCCount estimates that the effort for implementing the kernel directly in C would have been 4 person years. The effort for designing, writing and documenting the Haskell prototype was ca 2 person years. Based on this estimate, the use of Haskell reduced the implementation effort by 50%. In the first implementation pass, we did not pay any attention to performance. The result of this initial pass was therefore unsurprisingly slow (on the order of the Mach microkernel), a factor of 3 slower than comparable operations in existing L4 kernels. After a first round of manual optimisations, seL4 IPC performance is now comparable to OKL4 2.1 (2008) on ARMv6. Another consequence of using a functional language as the design source was the structured use of tagged unions in the C code. Verification of unstructured unions in C is unpleasant. Since
unions and structs were used in a principled way, we managed to avoid this additional verification burden entirely. Moreover, we did not trust the compiler to translate C bitfields correctly and with the fine-grained control we required; instead, we generated the C code for these structures and tagged unions automatically from a separate specification language (Cock 2008). We also generated the corresponding Isabelle/HOL proof of code correctness. Our verification framework treats a large, true subset of C99. The main restrictions are: we do not allow the address-of (&) operator on local variables, because the stack is modelled separately from the heap; we do not allow function pointers and goto statements; we make some expressions such as x++ statement forms; and we allow at most one side-effecting sub-expression in any assignment, because execution order is arbitrary otherwise. The function pointer prohibition implies that we did not make heavy use of higher-order functions in Haskell apart from some specific functions that are used to emulate C control structures such as mapM , zipWithM, catchError, and so on. This prohibition could be lifted fairly easily from the C translation, but we found it advantageous to strive for simplicity over features when possible in our Hoare logic framework for C. We translate C types precisely into Isabelle/HOL, including pointers, address arithmetic, finite integers, structs, and padding in structs (Tuch et al. 2007). Our target architecture is ARMv6; the compiler is GCC 4.2.2. Strictly speaking, Fig. 1 is inaccurate in that we do not reason on C directly, but on a translation of C into Isabelle/HOL. In contrast to the Haskell/Isabelle translation, this is a comparatively small translation step, with explicit care taken to map the semantics of C precisely into the theorem prover. Although our experience was in general favourable, efficient kernel code does not always translate well from Haskell. For example, the executable model’s error handling code contains a function that loads a message into a user-level context (setMRs), which is applied to the results of one of several functions that generate messages (as lists of machine words) from various error types. Translating this directly to C leads to implicit allocation of memory to temporarily hold the message, and double copying of the message’s contents; in order to keep the C code efficient, we manually unfolded the definition of setMRs and fused it with the message generation functions before translating. More generally, we found that Haskell at times encouraged coding practices that are inefficient in our subset of C if translated naively: passing large structures as function arguments, throwing and catching exceptions for error handling, and function composition that depends on laziness to be efficient. An interesting observation on the C implementation was that the C program was in parts less verbose than the explicit memory model we used in Haskell, because load-check-modify-store idioms are simply written as pointer accesses and updates in C. The more verbose style for this part in Haskell did not hinder verification. On the contrary, pointers made up a large part of the hidden complexity of the C program that was dealt with explicitly in the executable and abstract models. The additional verbosity was local only. In total, the Haskell prototype comes to 5,700 LOC compared to 8,700 LOC in C (numbers according to SLOCCount). 
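The load-check-modify-store idiom can be made concrete with a small sketch. The following is a hypothetical illustration of an explicit kernel heap with typed pointers, in the general style described above; the names and the representation are invented here and are not the seL4 prototype's. In C, the body of setPriority collapses to a single field assignment through a pointer, guarded by the surrounding invariants.

```haskell
-- A minimal sketch of an explicit kernel heap with typed pointers;
-- hypothetical names, not the seL4 memory model.
import Control.Monad.State
import Data.Word (Word)
import qualified Data.Map as Map

newtype PPtr a = PPtr Word deriving (Eq, Ord, Show)

data TCB = TCB { tcbPriority :: Int, tcbRunnable :: Bool } deriving Show

newtype KernelState = KernelState { ksTCBs :: Map.Map Word TCB }

type Kernel a = State KernelState a

getTCB :: PPtr TCB -> Kernel TCB
getTCB (PPtr p) = gets $ \s ->
  case Map.lookup p (ksTCBs s) of
    Just tcb -> tcb
    Nothing  -> error "getTCB: no object at this pointer"  -- excluded by invariants

setTCB :: PPtr TCB -> TCB -> Kernel ()
setTCB (PPtr p) tcb =
  modify $ \s -> s { ksTCBs = Map.insert p tcb (ksTCBs s) }

-- Load, check, modify, store; the C version is one pointer update.
setPriority :: PPtr TCB -> Int -> Kernel ()
setPriority ptr prio = do
  tcb <- getTCB ptr
  setTCB ptr (tcb { tcbPriority = prio })
```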
To summarise, we restricted our use of Haskell to a suitable subset and were able to manually implement a high-performance C version of the kernel in very little time. The Haskell and C versions have almost identical data and code structures. We exploited this fact heavily in the verification.
4. Formal Verification
As mentioned in the introduction, the formal verification of seL4 consisted of two major refinement steps: between abstract and executable specification, and between executable specification and implementation. Our embedding of Haskell into Isabelle is shallow; the embedding of C into Isabelle is deep for statements and shallow for expressions. The main statement we proved in each of the two steps is formal refinement, reduced to forward simulation: if the initial states are in a system-global state relation R, and the concrete level takes one step, then the abstract level must be able to take a corresponding step such that the resulting states again are in the relation R. Cock et al. (2008) extend this classic notion to state monads, integrating the aspects of failure, non-determinism and exceptions needed in the kernel specifications. The notion implies, and we have proved in Isabelle/HOL, that all Hoare triples that are true on the abstract level are also true on the concrete level, modulo the state relation R.
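The shape of this forward-simulation statement can be sketched for a small list-based nondeterministic state monad; this is only an illustration, not the Isabelle/HOL formulation of Cock et al.

```haskell
-- Forward simulation for nondeterministic state monads: a sketch.
-- `Step s a` returns all possible (result, successor state) pairs.
type Step s a = s -> [(a, s)]

-- corres r rv abstractOp concreteOp holds at R-related initial states if
-- every step the concrete operation can take is matched by some step of
-- the abstract operation that re-establishes r and relates the results.
corres :: (sa -> sc -> Bool)      -- the state relation R
       -> (a -> c -> Bool)        -- relation on return values
       -> Step sa a -> Step sc c
       -> sa -> sc -> Bool
corres r rv abstractOp concreteOp sa sc =
  and [ or [ r sa' sc' && rv ra rc | (ra, sa') <- abstractOp sa ]
      | (rc, sc') <- concreteOp sc ]
```

Hoare triples proved about the abstract operation then transfer to the concrete one modulo the state relation, as described above.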
4.1 First Refinement Step
Refinement step one in the verification took ca 8 person years in total and manually produced 117,000 lines of Isabelle/HOL proof script. This step contains the conceptually interesting part of the proof, reasoning about the design aspects of execution safety and correctness. We cannot go into the details of this proof here for space reasons, but the simpler and higher-level data structures of the abstract specification require invariants on their more detailed counterparts on the executable level to show correspondence. Basic preconditions of the correspondence proof are that each operation is well defined, that memory accesses are correctly typed, that assertions do not fail, that objects that are read from do exist, and that partially defined functions (e.g. those with incomplete patterns in Haskell) are used only within their domain. These preconditions for safe execution spawned a number of complex invariants on how the kernel works, how it explicitly re-uses memory, and how it prevents dangling references to deleted objects in any part of the kernel (including all of memory). Reasoning on this level included explicit decoding of binary system call arguments read from user registers and full argument checking to ensure safe operation for any kernel input, be it benign, maliciously crafted, or simply garbage. The effect our use of Haskell had on this proof can be summarised as: the ability to exploit structural similarities, an increased use of library functions, initial increased technical friction in working with generated definitions, and different proof style. We explain each of these in more detail in the following paragraphs. The Haskell prototype existed first, and therefore the abstract specification was inspired by it in structure. We were able to exploit this structural similarity to make the proof easier. Being inspired by the executable level also means that our abstract specification is probably more concrete than it may have been without this input. A higher-level abstract specification would have meant more distance for the refinement proof to the executable level, but possibly less distance to further layers above. For showing specific properties of the kernel that turn out to be too complex for the complete abstract operational specification, we would add a further, more abstract layer to the stack that is specialised to the property under consideration, such as we are currently exploring with the abstract access control model of seL4 (Elkaduwe et al. 2008). Haskell being a fully-featured programming language led the design team to make more extensive use of library functions like mapM and zipWithM than they otherwise may have. At the time, there were no Isabelle versions of these functions. Introducing them saved verification work because we avoided repeating proofs over many similar recursion patterns. On the less positive side, we observed more technical friction in the proofs that were concerned with definitions generated from Haskell than in those that were written in Isabelle directly. This was expected. Programming idioms did not always match up with how rules were phrased in the Isabelle library. The executable
specification was generated Isabelle code that was not as concise as the Haskell source, and not always nice to read. This turned out to be an initial problem only. The new idioms became manageable once the verification team were used to them, and had built up a library of matching rules. The generated code could often be rewritten trivially with associativity and other simple, general statemonad laws to read more nicely. Due to the monadic, imperative style of the Haskell prototype and therefore the executable specification, the majority of the proof took the form of Hoare triples, weakest precondition reasoning, and correspondence calculus reasoning. Apart from the rewrites mentioned above we used only little of the algebraic reasoning that would usually be associated with verifying functional programs. This proof structure is mainly an artefact of the application area and of having C as a target implementation language. We did use induction where recursion was involved in the Haskell program. In C, these were replaced by loops. This first proof lead to around 200 changes in the Haskell prototype and 300 changes in the abstract specification. Less than half of these were genuine bugs or design defects. Most changes were for proof convenience: reshuffling functions to match up more closely, adding assertions to transport information across levels, and adding local checks or re-arranging code to make properties more obviously true. The majority of the bugs we found during verification were mundane: simple typos and some copy & paste errors. We did also find more subtle problems in the initial design like missing argument checking, potential security violations etc. Of course, it does not matter if the defect is mundane or not: the kernel will happily crash or allow a security attack either way. It is interesting to note that the actual discovery of defects does not necessarily occur when they leap from the screen in the form of a counter example or unprovable lemma (although that did happen). Instead, many defects were found when new invariants became necessary and the verification and design teams discussed what these should be, whether they would hold, and, if so, why. One answer a verification team should be wary of is this is never done anywhere in the kernel. This answer usually means proving a new fact about the whole kernel instead of quick local reasoning. It is not clear if the use of Haskell would have been beneficial in just the proof of the first refinement step in isolation. Nicer, more elegant reasoning might have been possible in a more abstract, still executable setting with definitions written directly in Isabelle/HOL. The detailed executable model, and its use as a prototype running user binaries, injected a sometimes unpleasant dose of realism into the proof — forcing us to consider implementation details that are necessary for an efficient kernel, rather than one that simply functions correctly. For example, one part of the first refinement proof that was particularly challenging was the relation between the abstract and executable models’ versions of the capability derivation tree. This is one of the two main metadata structures in the seL4 implementation; it is conceptually a forest of directed trees, and is represented in the abstract model by simple functions that encode a binary relation between pointers to the nodes. 
The executable model represents it the way a real kernel implementation would: as a set of doubly linked lists, each corresponding to a pre-order traversal of a tree. The depth information is implicitly encoded in the nodes, and is available only by comparing two nodes; the depth comparison function requires that its arguments are in the stored order. Furthermore, the lists are not represented by Haskell’s standard list type, but by pairs of pointers stored in separate node objects in the modelled physical memory — as they would be in C. Naturally, the invariants that must be maintained by operations on these structures are complex, and therefore so is the refinement proof for those operations. This realism paid off in the second refinement step.
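The contrast between the two levels can be illustrated with a toy sketch (hypothetical types, far simpler than the seL4 structures): the abstract side records parenthood as a bare function on pointers, while the executable side stores list neighbours and an encoded depth inside the objects themselves, in an explicit heap, much as a C implementation would.

```haskell
-- A toy contrast between an abstract relation and a linked, in-heap
-- realisation; hypothetical, not the seL4 capability derivation tree.
import Data.Word (Word)
import qualified Data.Map as Map

newtype Ptr = Ptr Word deriving (Eq, Ord, Show)

-- Abstract model: the derivation forest as a partial parent function.
type CDTAbs = Ptr -> Maybe Ptr

-- Executable model: each node carries its doubly-linked-list neighbours
-- and a depth tag, stored in an explicit object heap.
data CDTNode = CDTNode
  { cdtPrev  :: Maybe Ptr
  , cdtNext  :: Maybe Ptr
  , cdtDepth :: Int
  } deriving Show

type CDTExec = Map.Map Ptr CDTNode

-- A fragment of the refinement relation: a parent recorded abstractly must
-- exist in the heap and be strictly shallower than its child (a necessary
-- condition only; the real invariants are much richer).
parentOk :: CDTAbs -> CDTExec -> Ptr -> Bool
parentOk parentOf heap child =
  case (parentOf child, Map.lookup child heap) of
    (Nothing, _)        -> True
    (Just p, Just node) -> case Map.lookup p heap of
                             Just pnode -> cdtDepth pnode < cdtDepth node
                             Nothing    -> False
    (Just _, Nothing)   -> False
```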
4.2 Second Refinement Step
One important observation about the first refinement step is that we spent roughly 80% of the proof effort on showing invariants of the abstract and executable levels and only 20% on the correspondence itself. The invariants were necessary preconditions for the correspondence, but they also carry a large amount of information on how precisely the kernel works and why its internal data structures are safe to use. Because the Haskell and C implementations share almost identical data and code structures, we were able to avoid these 80% for the second step. The important invariants had already been proved on the executable specification level, and no complex semantic reasoning was necessary on the C level. The most complex new relationships we had to show on the C level were the implementation of Haskell or Isabelle lists as doubly linked lists, some of which were encoded in existing data structures. The C verification did lead us to prove new invariants on the level of the executable specification, but far fewer than we needed in the first step. They were mainly due to optimisations in C that made use of conditions known to be true over kernel execution. The main challenges in the second step were dealing with C language semantics and data structure encodings, but without complex data refinement. Having to prove at the same time that the C code maintains complex invariants as we had shown in step one would have made this proof much harder. At the time of writing, the proof on the C level is completed for 474 functions out of 518 and we have so far spent 26 person months on this part of the verification. The speed of verification on this level was 3–4 functions per person per week with 3–5 persons working on this body of proofs concurrently. Even though the kernel implementation had in the meantime been used in a number of small student projects, had been ported to the x86 architecture, and had been run through static analysis tools, we still found 97 defects in C during the verification. We had not attempted to test the implementation in great detail, because formal verification was scheduled anyway. For each of the defects we could have found a test case that demonstrated it, but of course the question is whether we would have thought of these beforehand. Unsurprisingly, the defects were concentrated in parts of the kernel that were less used in the student projects and that were complicated to use. Most of them were simple translation errors and typos in the implementation step from Haskell to C, fewer were defects in new data encodings and optimisation. We also observed compiler specific errors: for instance, some functions that we had annotated with GCC’s pure and const attributes to enable optimisations were not in fact pure or const. The compiler did not check the attributes, and neither did we initially in the verification. This lead to unexpected execution behaviour in otherwise already verified code. We have updated the verification framework in the meantime to include such compiler hints and make them proof obligations. The pure and const attributes are now checked automatically. As in the first refinement step, it was crucial for the verification that we were able to change the C code as well as the Haskell source for proof convenience instead of having to prove complex reordering theorems. For some optimisations, we changed the observable behaviour of both the abstract and the executable specifications. 
For instance, we changed the order in which data was stored in global data structures, or the order in which arguments were checked (and therefore which error messages would be reported first). In summary, the verification of seL4 proceeded in two main steps. Step one dealt with mostly semantic content in a shallow embedding; step two was more syntactic and dealt with C, its memory model and specific optimisations. We were able to avoid a large part of the proof in the second step, because of the structural similarity between the C and the Haskell implementations.
5. Conclusion
We have presented our experience in using Haskell in the verification of the seL4 microkernel. The aspects of the verification that are specific to this project are its size, the implementation level it descends to, and its iterative development cycle. As there is not sufficient space to survey related work in this article we refer to the comprehensive overview by Klein (2009). We consider it important for the success of this project that the kernel was designed by a team with an OS background, not by the verification team. The verification team believes it would have designed a much more elegant, but much more useless microkernel. The connection between the two teams was the Haskell prototype. All of the verification and implementation activities fed back into this central reference specification of the project. We could have created the executable specification in Isabelle directly, but that would have left the design team out of the loop. We could also have chosen another functional language, such as ML, rather than Haskell; the primary motivation for our choice was the local availability of experienced Haskell programmers at UNSW, where Haskell is used in several research projects and was used as the introductory undergraduate programming language at the time. In addition, we consider the extensive tool-chain support for Haskell (compiler, foreign function interface, literate Haskell) an important contributor to the success of the Haskell source as a simultaneous binary-compatible prototype, design document, and formal executable specification. The part of the OS team that actively wrote the Haskell code had previous experience with Haskell. The part of the OS team that did not have extensive experience with Haskell was comfortable with the new language after less than one month. The culture shock between the Formal Methods and Operating Systems groups was smaller than expected and greatly alleviated by team members who had gone through advanced courses in both areas. Will this approach work for everything? We believe that for high assurance on the level of kernel and systems code where performance and hardware interaction are important, the approach will work well. On the application level, it might be sufficient to stop at the level of an executable specification, possibly in Haskell or ML, if the compiler, runtime, and translation into Isabelle can be trusted.
Acknowledgments
We thank Timothy Bourke, Michael Norrish, and Thomas Sewell for reading and discussing drafts of this article. NICTA is funded by the Australian Government as represented by the Department of Broadband, Communications and the Digital Economy and the Australian Research Council through the ICT Centre of Excellence program.
References D. Cock. Bitfields and tagged unions in C: Verification through automatic generation. In B. Beckert and G. Klein, editors, Proceedings of the 5th International Verification Workshop (VERIFY’08), volume 372 of CEUR Workshop Proceedings, pages 44–55, Sydney, Australia, Aug 2008. D. Cock, G. Klein, and T. Sewell. Secure microkernels, state monads and scalable refinement. In O. A. Mohamed, C. Mu˜noz, and S. Tahar, editors, 21st TPHOLs, volume 5170 of LNCS, pages 167–182, Montreal, Canada, Aug 2008. Springer. P. Derrin, K. Elphinstone, G. Klein, D. Cock, and M. M. T. Chakravarty. Running the manual: An approach to high-assurance microkernel development. In ACM SIGPLAN Haskell WS, Portland, OR, USA, Sep 2006. D. Elkaduwe, G. Klein, and K. Elphinstone. Verified protection model of the seL4 microkernel. In J. Woodcock and N. Shankar, editors, VSTTE 2008 — Verified Softw.: Theories, Tools & Experiments, volume 5295 of LNCS, pages 99–114, Toronto, Canada, 2008. Springer. K. Elphinstone, G. Klein, P. Derrin, T. Roscoe, and G. Heiser. Towards a practical, verified kernel. In 11th HotOS, pages 117–122, 2007. G. Klein. Operating system verification — an overview. S¯adhan¯a, 34(1): 27–69, Feb 2009. J. Liedtke. On µ-kernel construction. In 15th SOSP, pages 237–250, Copper Mountain, CO, USA, Dec 1995. T. Nipkow, L. Paulson, and M. Wenzel. Isabelle/HOL — A Proof Assistant for Higher-Order Logic, volume 2283 of LNCS. Springer, 2002. Open Kernel Labs. OKL4 v2.1. http://www.ok-labs.com, 2008. H. Tuch, G. Klein, and G. Heiser. OS verification — now! In 10th HotOS, pages 7–12, Santa Fe, NM, USA, Jun 2005. USENIX. H. Tuch, G. Klein, and M. Norrish. Types, bytes, and separation logic. In M. Hofmann and M. Felleisen, editors, 34th POPL, pages 97–108, 2007. S. Winwood, G. Klein, T. Sewell, J. Andronick, D. Cock, and M. Norrish. Mind the gap: A verification framework for low-level C. In S. Berghofer, T. Nipkow, C. Urban, and M. Wenzel, editors, Proc. 22nd International Conference on Theorem Proving in Higher Order Logics (TPHOLs’09), volume 5674 of LNCS. Springer, 2009. To appear.
Biorthogonality, Step-Indexing and Compiler Correctness
Nick Benton (Microsoft Research) [email protected]
Chung-Kil Hur (University of Cambridge) [email protected]
Abstract
to a deeper semantic one (‘does the observable behaviour of the code satisfy this desirable property?’). In previous work (Benton 2006; Benton and Zarfaty 2007; Benton and Tabareau 2009), we have looked at establishing type-safety in the latter, more semantic, sense. Our key notion is that a high-level type translates to a lowlevel specification that should be satisfied by any code compiled from a source language phrase of that type. These specifications are inherently relational, in the usual style of PER semantics, capturing the meaning of a type A as a predicate on low-level heaps, values or code fragments together with a notion of A-equality thereon. These relations express what it means for a source-level abstractions (e.g. functions of type A → B) to be respected by low-level code (e.g. ‘taking A-equal arguments to B-equal results’). A crucial property of our low-level specifications is that they are defined in terms of the behaviour of low-level programs; making no reference to any intensional details of the code produced by a particular compiler or the grammar of the source language. Of course, the specifications do involve low-level details of data representations and calling conventions – these are part of the interface to compiled code – but up to that, code from any source that behaves sufficiently like code generated by the compiler should meet the specification, and this should be independently verifiable. Ideally, one might wish to establish the sense in which a compilation scheme is fully abstract, meaning that the compiled versions of two source phrases of some type are in the low-level relation interpreting that type iff the original source phrases are contextually equivalent. If low-level specifications are used for checking linking between the results of compiling open programs and code from elsewhere1 and full abstraction does not hold, then source level abstractions become ‘leaky’: reasoning about equivalence or encapsulation at the source level does not generally translate to the target, which can lead to unsound program transformations in optimizing compilers and to security vulnerabilities (Abadi 1998; Kennedy 2006). Ahmed and Blume (2008) also argue that fully abstract translation should be the goal, and prove full abstraction for (source to source) typed closure conversion for a polymorphic lambda calculus with recursive and existential types. Later on we will say something about why we believe that ‘full’ full abstraction may not, in practice, be quite the right goal, but we certainly do want ‘sufficiently abstract’ compilation, i.e. the preservation of the reasoning principles that we actually use in an optimizing compiler or in proving security properties. The low-level relations of our previous work are not, however, really even sufficiently abstract, having roughly comparable power to a denotational semantics in continuation-passing style (CPS). This is a very strong and useful constraint on the behaviour of machine-code programs, but does not suffice to prove all the equa-
We define logical relations between the denotational semantics of a simply typed functional language with recursion and the operational behaviour of low-level programs in a variant SECD machine. The relations, which are defined using biorthogonality and stepindexing, capture what it means for a piece of low-level code to implement a mathematical, domain-theoretic function and are used to prove correctness of a simple compiler. The results have been formalized in the Coq proof assistant. Categories and Subject Descriptors F.3.1 [Logics and Meanings of Programs]: Specifying and Verifying and Reasoning about Programs—Mechanical verification, Specification techniques; F.3.2 [Logics and Meanings of Programs]: Semantics of Programming Languages—Denotational semantics, Operational semantics; F.3.3 [Logics and Meanings of Programs]: Studies of Program Constructs—Type structure,Functional constructs; D.3.4 [Programming Languages]: Processors—Compilers; D.2.4 [Software Engineering]: Software / Program Verification—Correctness proofs, Formal methods General Terms
Languages, theory, verification
Keywords Compiler verification, denotational semantics, biorthogonality, step-indexing, proof assistants
1. Introduction
Proofs of compiler correctness have been studied for over forty years (McCarthy and Painter 1967; Dave 2003) and have recently been the subject of renewed attention, firstly because of increased interest in security and certification in a networked world and secondly because of advances in verification technology, both theoretical (e.g. separation logic, step-indexed logical relations) and practical (e.g. developments in model checking and improvements in interactive proof assistants). There are many notions of correctness or safety that one might wish to establish of a compiler. For applying language-based techniques in operating systems design, as in proof-carrying code, one is primarily interesting in broad properties such as type-safety, memory-safety or resource-boundedness. Although these terms are widely used, they are subject to a range of interpretations. For example, type-safety sometimes refers to a simple syntactic notion (‘is the generated code typable using these rules?’) and sometimes
¹ This is obviously important for foreign function interfaces and multi-language interoperability, but can be an issue even for separate compilation using the same compiler. It also covers the simpler and ubiquitous case of handcrafted implementations of standard library routines.
yield complete systems. The plugging −◦− : P ×C → S might be effected by appending bits of program, substituting terms, applying continuations to arguments or composing processes in parallel. In any such situation, there is contravariant map (·)⊥ : P(P) → P(C) given by
tions we might like between low-level programs: even something as simple as the commutativity of addition does not hold for arbitrary integer computations (just as it doesn’t in a lambda calculus with control), even though it does in our pure source language. To understand how to refine our low-level relations further, it is natural to look at logical relations between low-level code and elements of the domains arising in a standard, direct-style, denotational model of our language, which is what we’ll do here. Given such a typed relation between high-level semantics and low-level programs, a low-level notion of typed equivalence can be generated by considering pairs of low-level programs that are related to some common high-level denotational value. The relations we define will establish a full functional correctness theorem for a simple compiler, not merely a semantic type safety theorem. Just as for type-safety, there are several approaches to formulating such correctness theorems in the literature. A common one, used for example by Leroy (2006), is to define an operational semantics for both high- and low-level languages and then establish a simulation (or bisimulation) result between source programs and their compiled versions, allowing one to conclude that if a closed high-level program terminates with a particular observable result, then its compiled version terminates with the same result, and often the converse too (Hardin et al. 1998; Leroy and Grall 2009). The limitation of these simulation-based theorems is that they are not as compositional (modular) or extensional (behavioural) as we would like, in that they only talk about the behaviour of compiled code in contexts that come from the same compiler, and usually specify a fairly close correspondence between the (non-observable) intermediate states of the source and target. We would rather have maximally permissive specifications that capture the full range of pieces of low-level code that, up to observations, behave like, or realize, a particular piece of high-level program. A slogan here is that ‘the ends justify the means’: we wish to allow low-level programs that extensionally get the right answers, whilst intensionally using any means necessary. The main results here will involve logical relations between a cpo-based denotational semantics of a standard CBV lambda calculus with recursion and programs in an extended SECD-like target, chosen to be sufficiently low-level to be interesting, yet simple enough that the important ideas are not lost in detail. The relations will involve both biorthogonality and step-indexing, and in the next section we briefly discuss these constructions in general terms, before turning to the particular use we make of them. The results in the paper have been formalized and proved in the Coq proof assistant, building on a formalization of domain theory and denotational semantics that we describe elsewhere (Benton et al. 2009). The scripts are available from the authors’ webpages.
2. Orthogonality and Step-Indexing
2.1 Biorthogonality
P⊥ = {c ∈ C | ∀p ∈ P, p ◦ c ∈ O}
and a homonymous one in the other direction, (·)⊥ : P(C) → P(P),
C⊥ = {p ∈ P | ∀c ∈ C, p ◦ c ∈ O},
yielding a contravariant Galois connection, so that, amongst many other things, (·)⊥⊥ is a closure operator (inflationary and idempotent) on P(P), and that any set of the form C⊥ is (·)⊥⊥-closed.² The binary version of this construction starts with a binary relation (e.g. an equivalence relation) on S and proceeds in the obvious way. For compiler correctness, we want the interpretations of source-level types or, indeed, source-level values, to be compositional and extensional: properties of low-level program fragments that we can check independently and that make statements about the observable behavior of the complete configurations that arise when we plug the fragments into ‘appropriate’ contexts. This set of ‘appropriate’ contexts can be thought of as a set of tests: a low-level fragment is in the interpretation of a source type, or correctly represents a source value, just when it passes all these tests. Thus these low-level interpretations will naturally be (·)⊥⊥-closed sets. For a simply typed source language, we can define these sets by induction on types, either positively, starting with an over-intensional set and then taking its (·)⊥⊥-closure, or negatively, by first giving an inductive definition of a set of contexts and then taking its orthogonal. See Vouillon and Melliès (2004) for more on the use of biorthogonality in giving semantics to types.
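As a toy illustration of the construction (finite, explicitly enumerated candidate sets standing in for the paper's predicates; the plugging function and the observation are parameters, and none of this is the paper's Coq development):

```haskell
-- (·)⊥ and the biorthogonal closure over explicitly enumerated candidates.
orth :: (p -> c -> s)    -- plugging a program fragment into a context
     -> (s -> Bool)      -- the observation O on complete systems
     -> [c]              -- all candidate contexts
     -> [p]              -- the set P of program fragments
     -> [c]              -- P⊥: contexts that pass every fragment in P
orth plug obs contexts ps =
  [ c | c <- contexts, all (\p -> obs (plug p c)) ps ]

-- P⊥⊥, computed over a universe of candidate programs; it contains P,
-- mirroring the closure operator described above.
biorth :: (p -> c -> s) -> (s -> Bool) -> [p] -> [c] -> [p] -> [p]
biorth plug obs universeP contexts ps =
  [ q | q <- universeP
      , all (\c -> obs (plug q c)) (orth plug obs contexts ps) ]
```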
2.2 Step-Indexing
Logical predicates and relations can be used with many different styles of semantics. When dealing with languages with recursion and working denotationally, however, we often need extra admissibility properties, such as closure under limits of ω-chains. Operational logical relations for languages with recursion generally need to satisfy some analogue of admissibility. One such that has often been used with operational semantics based on lambda terms (Plotkin 1977; Pitts and Stark 1998) considers replacing the recursion construct rec f x = M (or fixpoint combinator) with a family of finite approximations: recₙ f x = M, for n ∈ N, unfolds the recursive function n times in M and thereafter diverges. Appel and McAllester (2001) introduced step-indexed logical relations, which have since been refined and successfully applied by various authors to operational reasoning problems for both high and low level languages, many of which involve challenging language features (Ahmed 2006; Appel et al. 2007; Benton and Tabareau 2009; Ahmed et al. 2009). Step-indexing works with small-step operational semantics and N-indexed sets of values, with (n, v) ∈ P (or v ∈ Pₙ) meaning ‘value v has property P for n steps of reduction’. An interesting feature of step-indexing is that one usually works directly with this family of approximants; the limits that one feels are being approximated (like {v | ∀n, v ∈ Pₙ}) do not play much of a rôle.
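A minimal executable reading of a step-indexed predicate (a deterministic toy with an abstract step function; not the relations defined later in the paper):

```haskell
-- safeFor n x: configuration x can run for n steps without getting stuck.
safeFor :: (cfg -> Maybe cfg)   -- one step; Nothing means stuck
        -> (cfg -> Bool)        -- acceptable terminal configurations
        -> Int -> cfg -> Bool
safeFor _    _    0 _ = True     -- every configuration is safe for 0 steps
safeFor step done n x
  | done x    = True
  | otherwise = case step x of
      Nothing -> False
      Just x' -> safeFor step done (n - 1) x'

-- The family is downward closed: safety for n steps implies safety for any
-- m <= n. This is the property step-indexed relations exploit in place of
-- admissibility.
```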
Biorthogonality is a powerful and rather general idea that has been widely used in diffent kinds of semantics in recent years, beginning with the work of Pitts and Stark (1998) and of Krivine (1994). One way of understanding the basic idea is as a way of ‘contextualizing’ properties of parts of a system, making them compositional and behavioural. In the unary case, we start with some set of systems S (e.g. configurations of some machine, lambda terms, denotations of programs, processes) and some predicate O ⊆ S, which we call an observation, over these systems (e.g. those that execute without error, those that terminate, those that diverge). Then there is some way of combining, or plugging, program fragments p ∈ P in whose properties we are interested (bits of machine code, terms, denotations of terms, processes) with complementary contexts c ∈ C (frame stacks, terms with holes, continuations, other processes) to
2.3 On Using Both
Amongst the useful properties of the operational (·)⊥⊥-closure operation used by Pitts and Stark (1998) is that it yields admissible
relations (in the recn sense). The same is true of its natural denotational analogue (Abadi 2000). Our earlier work on low-level interpretations of high-level types (Benton and Tabareau 2009) used both step-indexing and orthogonality, but there was some question as to whether the step-indexing was really necessary. Maybe our closed sets are automatically already appropriately ‘admissible’, just by construction, and there is no need to add extra explicit indexing? Slightly to our surprise, it turns out that there are good reasons to use both (·)⊥⊥ -closure and step-indexing, for which we now try to give some intuition. The aim is to carve out interpretations of high-level types and values as ‘well-behaved’ subsets of low-level, untyped programs. The essence of these interpretations generally only depends upon these well-behaved subsets: we’ll (roughly) interpret a function type A → B as the set of programs that when combined with a good argument of type A and a good continuation expecting something of type B, yield good behaviour. So exactly what range of impure, potentially type-abstraction violating, operations are available in the untyped language (the range of ‘any means necessary’ above) does not seem to affect the definitions or key results. Programs that use low-level features in improper ways will simply not be in the interpretations of high-level entities, and nothing more needs to be said. For a simply-typed total language without recursion, this intuition is quite correct: a Krivine-style realizability interpretation is essentially unaffected by adding extra operations to the untyped target. Even though orthogonality introduces quantification over bigger sets of contexts, nothing relies explicitly on properties of the untyped language as a whole. In the presence of recursion, the situation changes. The fact that Pitts and Stark’s (·)⊥⊥ -closed relations are admissible depends on a ‘compactness of evaluation’ result, sometimes called an ‘unwinding theorem’, saying that any complete program p terminates iff there is some n such that for all m ≥ n, p with all the recs replaced by recm s terminates, which is clearly a global property of all untyped programs. In the denotational case, attention is already restricted to operations that can be modelled by continuous functions, i.e. ones that behave well with respect to approximation, in the chosen domains. But realistic targets often support operations that can violate these global properties. Examples of such egregiously non-functional operations include the ‘reflection’ features of some high-level languages (such as Java or C]) or, more interestingly, the ability of machine code programs to switch on code pointers or read executable machine instructions.3 We have found that the presence of such seriously non-functional operations does not just make the proofs harder, but can actually make the ‘natural’ theorems false. Appendix A shows how the addition of equality testing on lambda terms to an untyped lambda calculus breaks a ‘standard’ syntactic interpretation of simple types as sets of untyped terms in the presence of term-level recursion in the source. Fortunately, as we will see, step-indexing sidesteps this problem. In place of appealing to a global property that holds of all untyped programs, we build a notion of approximation, and the requirement to behave well with respect to it, directly into the definition of our logical relations. 
In fact, we will also do something very similar on the denotational side, closing up explicitly under limits of chains.
² Pitts and Stark, and some other non-Gallic authors, tend to write (·)⊤ rather than (·)⊥.
3. Source Language
Our high-level language PCFv is a conventional simply-typed, call-by-value functional language with types built from integers and booleans by products and function spaces, with type contexts Γ defined in the usual way:
t := Int | Bool | t → t′ | t × t′
Γ := x1 : t1, . . . , xn : tn
We separate syntactic values (canonical forms), ranged over by V, from general expressions, ranged over by M, and restrict the syntax to ANF, with explicit sequencing of evaluation by let and explicit inclusion of values into expressions by [·]. The typing rules for values and for expressions are shown in Figure 1. Note that there are really two forms of judgement, but we refrain from distinguishing them syntactically. The symbol ? stands for an arbitrary integer-valued binary operation on integers, whilst > is a representative boolean-valued one. PCFv has the obvious CBV operational semantics, which we elide here, and a conventional, and computationally adequate, denotational semantics in the category of ω-cpos and continuous maps, which we now briefly summarize to fix notation. Types and environments are interpreted as cpos:
⟦Int⟧ = N
⟦Bool⟧ = B
⟦t → t′⟧ = ⟦t⟧ ⇒ ⟦t′⟧⊥
⟦x1 : t1, . . . , xn : tn⟧ = ⟦t1⟧ × · · · × ⟦tn⟧
where ⇒ is the cpo of continuous functions and × is the Cartesian product cpo. Typing judgements for values and expressions are then interpreted as continuous maps
⟦Γ ⊢ V : t⟧ : ⟦Γ⟧ → ⟦t⟧
⟦Γ ⊢ M : t⟧ : ⟦Γ⟧ → ⟦t⟧⊥
defined by induction. So, for example,
⟦Γ ⊢ Fix f x = M : A → B⟧ ρ = μ df. λ dx ∈ ⟦A⟧. ⟦Γ, f : A → B, x : A ⊢ M : B⟧ (ρ, df, dx).
We write [·] : D → D⊥ for the unit of the lift monad. We elide the full details of the denotational semantics as they are essentially the same as those found in any textbook, such as that of Winskel (1993). The details can also be found, along with discussion of the Coq formalization of the semantics, in Benton et al. (2009).
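For orientation, the grammar above transcribes directly into Haskell data types; this is only a sketch, and the paper's actual development is carried out in Coq.

```haskell
-- PCFv abstract syntax, following the grammar in Section 3.
data Ty = TInt | TBool | Ty :-> Ty | Ty :* Ty
  deriving (Eq, Show)

type Var = String

data Val
  = VVar Var
  | VInt Integer
  | VBool Bool
  | VFix Var Var Expr        -- Fix f x = M
  | VPair Val Val
  deriving Show

data Expr
  = EVal Val                 -- [V]
  | ELet Var Expr Expr       -- let x = M in N
  | EApp Val Val             -- V1 V2
  | EOp Val Val              -- V1 ? V2, an integer-valued operation
  | EGt Val Val              -- V1 > V2
  | EIf Val Expr Expr
  | EFst Val
  | ESnd Val
  deriving Show
```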
4. Target Language and Compilation
4.1 An SECD Machine
The low-level target is a variant of an SECD virtual machine (Landin 1964). We have chosen such a machine rather than a lowerlevel assembly language, such as that of our previous work, so as to keep the formal development less cluttered with detail. But we are emphatically not interested in the SECD machine as something that is inherently “for” executing programs in a language like our source. We have included an equality test instruction, Eq, that works on arbitrary values, including closures, so a counterexample to a naive semantics of types like that in Appendix A can be constructed. Furthermore, the logical relations we present have been carefully constructed with applicability to lower-level machines in mind.
³ Real implementations, even of functional languages, can make non-trivial use of such features. For example, interpreting machine instructions that would normally be executed in order to advance to a safe-point for interruption, building various runtime maps keyed on return addresses, doing emulation, JIT-compilation or SFI.
[Figure 1 appears here: the typing rules for PCFv, with rules for values (T-VAR, T-BOOL, T-INT, T-FIX, T-PAIR) and for expressions (T-VAL, T-APP, T-LET, T-OP, T-GT, T-IF, T-FST, T-SND).]
Figure 1. Typing rules for PCFv
The inductive type Instruction, ranged over by i, is defined by
i := Swap | Dup | PushV n | Op ? | PushC c | PushRC c | App | Ret | Sel (c1, c2) | Join | MkPair | Fst | Snd | Eq
where c ranges over Code, the set of lists of instructions, n ranges over integers, and ? over binary operations on integers. The set Value of runtime values, ranged over by v, is defined by
v := n | CL (e, c) | RCL (e, c) | PR (v1, v2)
where e ranges over Env, defined to be list Value. So a Value is either an integer literal, a closure containing an environment and some code, a recursive closure, or a pair of values. We also define
Stack = list Value
Dump = list (Code × Env × Stack)
CESD = Code × Env × Stack × Dump
CESD is the set of configurations of our virtual machine. A configuration hc, e, s, di comprises the code to be executed, c, the current environment, e, an evaluation stack s and a call stack, or dump, d. The deterministic one-step transition relation 7→ between configurations is defined in Figure 2. There are many configurations with no successor, such as those in which the next instruction is Swap but the the stack depth is less than two; we say such configurations are stuck or terminated. (So there is no a priori distinction between normal and abnormal termination.) We write cesd 7→k to mean that the configuration cesd takes at least k steps without getting stuck, and say it diverges, written cesd 7→ω if it can always take a step:
(c, s) o n hc0 , e0 , s0 , d0 i = hc ++ c0 , e0 , s ++ s0 , d0 i We also have an operation · ^ · that appends an element of Env onto the environment component of a configuration:
def
cesd 7→ω ⇐⇒ (∀k, cesd 7→k ) Conversely, we say cesd terminates, and write cesd 7→∗ does not diverge: cesd 7→∗
Logical Relations
In this section we define logical relations between components of the low-level SECD machine and elements of semantic domains, with the intention of capturing just when a piece of low-level code realizes a semantic object. In fact, there will be two relations, one defining when a low-level component approximates a domain element, and one saying when a domain element approximates a lowlevel component. These roughly correspond to the soundness and adequacy theorems that one normally proves to show a correspondence between an operational and denotational semantics, but are rather more complex. Following the general pattern of biorthogonality sketched above, we work with (predicates on) substructures of complete configurations. On the SECD machine side, complete configurations are elements of CESD, whilst the substructures will be elements of Value and of Comp, which is defined to be Code × Stack. If v : Value then we define vb : Comp to be ([], [v]), the computation comprising an empty instruction sequence and a singleton stack with v on. Similarly, if c : Code then b c : Comp is (c, []). The basic plugging operation on the low-level side is · o n ·, taking an element of Comp, the computation under test, and an element of CESD, thought of as a context, and combining them to yield a configuration in CESD:
n | CL (e, c) | RCL (e, c) | PR (v1 , v2 )
:=
Compiling PCFv to SECD
The compiler comprises two mutually-inductive functions mapping (typed) PCFv values and expressions into Code. We overload L· M for both of these functions, whose definitions are shown in Figure 3.
e ^ hc0 , e0 , s0 , d0 i = hc0 , e ++ e0 , s0 , d0 i 5.1
, if it
Approximating Denotational By Operational
The logical relation expressing what it means for low-level computations to approximate denotational values works with step-indexed entities. We write iValue for N × Value, iComp for N × Comp and
def
⇐⇒ ¬(cesd 7→ω ).
  ⟨Swap :: c, e, v1 :: v2 :: s, d⟩            ↦  ⟨c, e, v2 :: v1 :: s, d⟩
  ⟨Dup :: c, e, v :: s, d⟩                    ↦  ⟨c, e, v :: v :: s, d⟩
  ⟨PushV n :: c, [v1, . . . , vk], s, d⟩      ↦  ⟨c, [v1, . . . , vk], vn :: s, d⟩
  ⟨PushN n :: c, e, s, d⟩                     ↦  ⟨c, e, n :: s, d⟩
  ⟨PushC bod :: c, e, s, d⟩                   ↦  ⟨c, e, CL(e, bod) :: s, d⟩
  ⟨PushRC bod :: c, e, s, d⟩                  ↦  ⟨c, e, RCL(e, bod) :: s, d⟩
  ⟨App :: c, e, v :: CL(e′, bod) :: s, d⟩     ↦  ⟨bod, v :: e′, [], (c, e, s) :: d⟩
  ⟨App :: c, e, v :: RCL(e′, bod) :: s, d⟩    ↦  ⟨bod, v :: RCL(e′, bod) :: e′, [], (c, e, s) :: d⟩
  ⟨Op ⋆ :: c, e, n2 :: n1 :: s, d⟩            ↦  ⟨c, e, n1 ⋆ n2 :: s, d⟩
  ⟨Ret :: c, e, v :: s, (c′, e′, s′) :: d⟩    ↦  ⟨c′, e′, v :: s′, d⟩
  ⟨Sel(c1, c2) :: c, e, v :: s, d⟩            ↦  ⟨c1, e, s, (c, [], []) :: d⟩   (if v ≠ 0)
  ⟨Sel(c1, c2) :: c, e, 0 :: s, d⟩            ↦  ⟨c2, e, s, (c, [], []) :: d⟩
  ⟨Join :: c, e, s, (c′, e′, s′) :: d⟩        ↦  ⟨c′, e, s, d⟩
  ⟨MkPair :: c, e, v1 :: v2 :: s, d⟩          ↦  ⟨c, e, PR(v2, v1) :: s, d⟩
  ⟨Fst :: c, e, PR(v1, v2) :: s, d⟩           ↦  ⟨c, e, v1 :: s, d⟩
  ⟨Snd :: c, e, PR(v1, v2) :: s, d⟩           ↦  ⟨c, e, v2 :: s, d⟩
  ⟨Eq :: c, e, v1 :: v2 :: s, d⟩              ↦  ⟨c, e, 1 :: s, d⟩   (if v1 = v2)
  ⟨Eq :: c, e, v1 :: v2 :: s, d⟩              ↦  ⟨c, e, 0 :: s, d⟩   (if v1 ≠ v2)

Figure 2. Operational Semantics of SECD Machine

Values:

  L x1 : t1, . . . , xn : tn ⊢ xi : ti M  =  [PushV i]
  L Γ ⊢ true : Bool M                     =  [PushN 1]
  L Γ ⊢ false : Bool M                    =  [PushN 0]
  L Γ ⊢ n : Int M                         =  [PushN n]
  L Γ ⊢ ⟨V1, V2⟩ : t1 × t2 M              =  LΓ ⊢ V1 : t1M ++ LΓ ⊢ V2 : t2M ++ [MkPair]
  L Γ ⊢ Fix f x = M : t → t′ M            =  [PushRC (LΓ, f : t → t′, x : t ⊢ M : t′M ++ [Ret])]

Expressions:

  L Γ ⊢ [V] : t M                         =  LΓ ⊢ V : tM
  L Γ ⊢ let x = M in N : t′ M             =  [PushC (LΓ, x : t ⊢ N : t′M ++ [Ret])] ++ LΓ ⊢ M : tM ++ [App]
  L Γ ⊢ V1 V2 : t′ M                      =  LΓ ⊢ V1 : t → t′M ++ LΓ ⊢ V2 : tM ++ [App]
  L Γ ⊢ if V then M1 else M2 : t M        =  LΓ ⊢ V : BoolM ++ [Sel ((LΓ ⊢ M1 : tM ++ [Join]), (LΓ ⊢ M2 : tM ++ [Join]))]
  L Γ ⊢ V1 ⋆ V2 : Int M                   =  LΓ ⊢ V1 : IntM ++ LΓ ⊢ V2 : IntM ++ [Op ⋆]
  L Γ ⊢ V1 > V2 : Bool M                  =  LΓ ⊢ V1 : IntM ++ LΓ ⊢ V2 : IntM ++ [Op (λ(n1, n2). n1 > n2 ⊃ 1 | 0)]

Figure 3. Compiler for PCFv
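The following is a small executable sketch, in OCaml, of the machine of Figure 2. The type and constructor names are ours, only representative transition rules are shown, the integer operations are represented symbolically, and the environment-indexing convention of PushV is an assumption; it illustrates the rules rather than reproducing the paper's Coq development.

  type binop = Add | Sub | Gt            (* representative integer operations *)

  type instr =
    | Swap | Dup
    | PushV of int                       (* push an environment entry *)
    | PushN of int                       (* push an integer literal *)
    | Op of binop
    | PushC of code | PushRC of code
    | App | Ret
    | Sel of code * code | Join
    | MkPair | Fst | Snd | Eq
  and code = instr list

  type value =
    | Num of int
    | CL of env * code                   (* closure *)
    | RCL of env * code                  (* recursive closure *)
    | PR of value * value                (* pair *)
  and env = value list

  type stack = value list
  type dump = (code * env * stack) list
  type cesd = code * env * stack * dump

  let eval_op op n1 n2 =
    match op with
    | Add -> n1 + n2
    | Sub -> n1 - n2
    | Gt  -> if n1 > n2 then 1 else 0

  (* One transition step of Figure 2; None means the configuration is stuck,
     i.e. terminated.  PushV n is taken to fetch the n-th entry (1-indexed) of
     the environment, an assumption the paper leaves to its formal development. *)
  let step : cesd -> cesd option = function
    | Swap :: c, e, v1 :: v2 :: s, d -> Some (c, e, v2 :: v1 :: s, d)
    | Dup :: c, e, v :: s, d -> Some (c, e, v :: v :: s, d)
    | PushV n :: c, e, s, d ->
        (match List.nth_opt e (n - 1) with
         | Some v -> Some (c, e, v :: s, d)
         | None -> None)
    | PushN n :: c, e, s, d -> Some (c, e, Num n :: s, d)
    | PushC bod :: c, e, s, d -> Some (c, e, CL (e, bod) :: s, d)
    | PushRC bod :: c, e, s, d -> Some (c, e, RCL (e, bod) :: s, d)
    | App :: c, e, v :: CL (e', bod) :: s, d -> Some (bod, v :: e', [], (c, e, s) :: d)
    | App :: c, e, v :: (RCL (e', bod) as f) :: s, d ->
        Some (bod, v :: f :: e', [], (c, e, s) :: d)
    | Op op :: c, e, Num n2 :: Num n1 :: s, d ->
        Some (c, e, Num (eval_op op n1 n2) :: s, d)
    | Ret :: _, _, v :: _, (c', e', s') :: d -> Some (c', e', v :: s', d)
    | Sel (c1, c2) :: c, e, v :: s, d ->
        Some ((if v <> Num 0 then c1 else c2), e, s, (c, [], []) :: d)
    | Join :: _, e, s, (c', _, _) :: d -> Some (c', e, s, d)
    | MkPair :: c, e, v1 :: v2 :: s, d -> Some (c, e, PR (v2, v1) :: s, d)
    | Fst :: c, e, PR (v1, _) :: s, d -> Some (c, e, v1 :: s, d)
    | Snd :: c, e, PR (_, v2) :: s, d -> Some (c, e, v2 :: s, d)
    | Eq :: c, e, v1 :: v2 :: s, d -> Some (c, e, Num (if v1 = v2 then 1 else 0) :: s, d)
    | _ -> None

  let rec run conf = match step conf with Some conf' -> run conf' | None -> conf

  (* For instance, run ([PushN 2; PushN 3; Op Add], [], [], []) terminates
     with Num 5 on top of the evaluation stack. *)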
We write iValue for N × Value, iComp for N × Comp, and iCESD for N × CESD, and define an Env-parameterized observation O_e over pairs of indexed computations from iComp and indexed contexts from iCESD:

  O_e (i, comp) (j, cesd) ⟺def (comp ⋈ (e ^ cesd)) ↦^{min(i,j)}

So, given an environment e, O_e holds of an indexed computation and an indexed context just when the configuration that results from appending the environment e and the code and stack from the computation onto the corresponding components of the context steps for at least the minimum of the indices of the iComp and the iCESD. We also define an observation on pairs of indexed values and indexed contexts by lifting values to computations:

  O_e (i, v) (j, cesd) ⟺def O_e (i, v̂) (j, cesd)

Now we follow the general pattern of orthogonality, but with some small twists. We actually have a collection of observations, indexed by environments, made over step-indexed components of configurations. And each observation gives rise to two, closely related, Galois connections: one between predicates on (indexed) values and predicates on (indexed) contexts, and the other between those on (indexed) computations and (indexed) contexts. So there are four contravariant maps associated with each e : Env. Our definitions use two of them:

  ↓_e(·) : P(iValue) → P(iCESD)
  ↓_e(P) = {jcesd | ∀iv ∈ P, O_e iv jcesd}

  ⇑_e(·) : P(iCESD) → P(iComp)
  ⇑_e(Q) = {icomp | ∀jcesd ∈ Q, O_e icomp jcesd}

To explain the notation: down arrows translate positive predicates (over values and computations) into negative ones (over contexts), whilst up arrows go the other way. We use single arrows for the operations relating to values and double arrows for those relating to computations.
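Two standard consequences of this construction, not spelled out in the text but useful to keep in mind, follow directly from the definitions of ↓_e and ⇑_e: both maps are antitone, and lifted values are recovered by the composite:

  P ⊆ P′ ⟹ ↓_e(P′) ⊆ ↓_e(P)        Q ⊆ Q′ ⟹ ⇑_e(Q′) ⊆ ⇑_e(Q)
  (i, v) ∈ P ⟹ (i, v̂) ∈ ⇑_e(↓_e(P))

These are the usual facts about the Galois connection induced by a relation (here the observation O_e) and are stated only for orientation.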
Now we can start relating the low-level machine to the high-level semantics. If D is a cpo and RD_i ⊆ Value × D is an N-indexed relation between machine values and elements of D then define the indexed relation [RD]_n ⊆ Value × D⊥ by

  [RD]_n = {(v, dv) | ∃d ∈ D, [d] = dv ∧ (v, d) ∈ RD_n}

If, furthermore, S is a cpo and RS_i ⊆ Env × S an indexed relation between machine environments and elements of S, then we define an indexed relation

  (RS →^C RD⊥)_i ⊆ Comp × (S ⇒ D⊥)
  (RS →^C RD⊥)_i = {(comp, df) | ∀k ≤ i, ∀(e, de) ∈ RS_k, (k, comp) ∈ ⇑_e(↓_e({(j, v) | (v, df de) ∈ [RD]_j}))}

which one should see as the relational action of the lift monad, relative to a relation on environments. The definition looks rather complex, but the broad shape of the definition is ‘logical’: relating machine computations to denotational continuous maps just when RS-related environments yield [RD]-related results. Then there is a little extra complication caused by threading the step indices around, but this is also of a standard form: computations are in the relation at i when they take k-related arguments to k-related results for all k ≤ i. Finally, we use biorthogonality to close up on the right hand side of the arrow; rather than making an intensional direct-style definition that the computation ‘yields’ a value v that is related to the denotational application df de, we take the set of all such related results, flip it across to an orthogonal set of contexts with ↓_e(·) and then take that back to a set of computations with ⇑_e(·). A special case of indexed relations between machine environments and denotational values is that for the empty environment. We define I_i ⊆ Env × 1, where 1 is the one-point cpo, by I_i = {([], ∗)}.
We can now define the ‘real’ indexed logical relation of approximation between machine values and domain elements

  E_i^t ⊆ Value × JtK

where t is a type and i is a natural number index, like this:

  E_i^Int    = {(n, n) | n ∈ N}
  E_i^Bool   = {(0, false)} ∪ {(n + 1, true) | n ∈ N}
  E_i^{t×t′} = {(PR(v1, v2), (dv1, dv2)) | (v1, dv1) ∈ E_i^t ∧ (v2, dv2) ∈ E_i^{t′}}
  E_i^{t→t′} = {(f, df) | ∀k ≤ i, ∀(v, dv) ∈ E_k^t, (([App], [v, f]), λ∗ : 1. df dv) ∈ (I →^C (E^{t′})⊥)_k}

This says that machine integers approximate the corresponding denotational ones, the machine zero approximates the denotational ‘false’ value, and all non-zero machine integers approximate the denotational ‘true’ value, reflecting the way in which the low-level conditional branch instruction works and the way in which we compile source-level booleans. Pair values on the machine approximate denotational pairs pointwise. As usual, the interesting case is that for functions. The definition says that a machine value f and a semantic function df are related at type t → t′ if whenever v is related to dv at type t, the computation whose code part is a single application instruction and whose stack part is the list [v, f] is related to the constantly (df dv) function, of type 1 ⇒ Jt′K⊥, by the monadic lifting of the approximation relation at type t′.
Having defined the relation for values, we lift it to environments in the usual pointwise fashion. If Γ is x1 : t1, . . . , xn : tn then E_i^Γ ⊆ Env × JΓK is given by

  E_i^Γ = {([v1, . . . , vn], (d1, . . . , dn)) | ∀l, (vl, dl) ∈ E_i^{tl}}

For computations in context, we define E_i^{Γ,t} ⊆ Comp × (JΓK ⇒ JtK⊥) using the monadic lifting again:

  E_i^{Γ,t} = (E^Γ →^C (E^t)⊥)_i

These relations are antimonotonic in the step indices and monotone in the domain-theoretic order (we also switch to infix notation for relations at this point):

Lemma 1.
1. If v E_i^t d, d ⊑ d′ and j ≤ i then v E_j^t d′.
2. If e E_i^Γ ρ, ρ ⊑ ρ′ and j ≤ i then e E_j^Γ ρ′.
3. If comp E_i^{Γ,t} f, f ⊑ f′ and j ≤ i then comp E_j^{Γ,t} f′.

The non-indexed versions of the approximation relations are then given by universally quantifying over the indices:

  v ≼^t d        ⟺def ∀i, v E_i^t d
  e ≼^Γ ρ        ⟺def ∀i, e E_i^Γ ρ
  comp ≼^{Γ,t} df ⟺def ∀i, comp E_i^{Γ,t} df

Note that the relation on computations extends that on values:

Lemma 2. If v ≼^t d then v̂ ≼^{[],t} (λ∗ : 1.[d]).

5.2 Approximating Operational By Denotational

Our second logical relation captures what it means for a denotational value to be ‘less than or equal to’ a machine computation. This way around we will again use biorthogonality, but this time with respect to the observation of termination. This is intuitively reasonable, as showing that the operational behaviour of a program is at least as defined as some domain element will generally involve showing that reductions terminate. We will not use operational step-indexing to define the relation this way around, but an explicit admissible closure operation will play a similar rôle. For e ∈ Env, comp ∈ Comp and cesd ∈ CESD our termination observation is defined by

  T_e comp cesd ⟺def (comp ⋈ (e ^ cesd)) ↦∗

which we again lift to values v ∈ Value:

  T_e v cesd ⟺def T_e v̂ cesd

and again the observations generate two e-parameterized Galois connections, one between predicates on values and predicates on contexts, and the other between predicates on computations and predicates on contexts. Once more we use two of the four maps:

  ↓_e(·) : P(Value) → P(CESD)
  ↓_e(P) = {cesd | ∀v ∈ P, T_e v cesd}

  ⇑_e(·) : P(CESD) → P(Comp)
  ⇑_e(Q) = {comp | ∀cesd ∈ Q, T_e comp cesd}

to define a relational action for the lift monad. If S and D are cpos, RS ⊆ Env × S and RD ⊆ Value × D, then define

  (RS →^B RD⊥) ⊆ Comp × (S ⇒ D⊥)
  (RS →^B RD⊥) = {(comp, df) | ∀(e, de) ∈ RS, ∀d, [d] = (df de) ⟹ comp ∈ ⇑_e(↓_e({v | (v, d) ∈ RD}))}

which follows a similar pattern to our earlier definition, in using biorthogonality on the right hand side of the arrow: starting with all values related to d, flipping that over to the set of contexts that terminate when plugged with any of those values and then coming back to the set of all computations that terminate when plugged into any of those contexts.
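To make the shape of this action concrete, instantiating RS with the empty-environment relation I = {([], ∗)} (the non-indexed analogue of I_i above) gives, for a closed computation comp and df ∈ 1 ⇒ D⊥:

  (comp, df) ∈ (I →^B RD⊥)  ⟺  ∀d, [d] = df ∗ ⟹ comp ∈ ⇑_[](↓_[]({v | (v, d) ∈ RD}))

that is, whenever df ∗ is a non-bottom element [d], comp must terminate when plugged into every context that terminates on every machine value RD-related to d. This unfolding is a direct specialization of the definition just given, spelled out only for orientation; it is the case used in the function clause of the relation defined next.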
Now we can define the second logical relation

  D^t ⊆ Value × JtK

by induction on the type t:

  D^Int    = {(n, n) | n ∈ N}
  D^Bool   = {(0, false)} ∪ {(n + 1, true) | n ∈ N}
  D^{t×t′} = {(PR(v1, v2), (dv1, dv2)) | (v1, dv1) ∈ D^t ∧ (v2, dv2) ∈ D^{t′}}
  D^{t→t′} = {(f, df) | ∀(v, dv) ∈ D^t, (([App], [v, f]), λ∗ : 1. df dv) ∈ (I →^B (D^{t′})⊥)}

This relation also lifts pointwise to environments. For Γ = x1 : t1, . . . , xn : tn define D^Γ ⊆ Env × JΓK by

  D^Γ = {([v1, . . . , vn], (dv1, . . . , dvn)) | ∀l, (vl, dvl) ∈ D^{tl}}

and then for computations in context, D^{Γ,t} ⊆ Comp × (JΓK ⇒ JtK⊥) is given by

  D^{Γ,t} = (D^Γ →^B (D^t)⊥)

The D relations are all down-closed on the denotational side:

Lemma 3.
1. If v D^t d and d′ ⊑ d then v D^t d′.
2. If e D^Γ ρ and ρ′ ⊑ ρ then e D^Γ ρ′.
3. If comp D^{Γ,t} f and f′ ⊑ f then comp D^{Γ,t} f′.

However, for a fixed v, it turns out that {d | v D^t d} is not always closed under taking limits of chains (see Appendix B for a more detailed explanation), which would prevent our compiler correctness proof going through in the case of recursion. We solve this problem by taking the Scott-closure. The closed subsets of a cpo D are those that are both down-closed and closed under limits of ω-chains. The closure Clos(P) of a subset P ⊆ D is the smallest closed subset of D containing P. So we define

  v ⊵^t dv         ⟺def dv ∈ Clos({d | v D^t d})
  e ⊵^Γ de         ⟺def de ∈ Clos({d | e D^Γ d})
  comp ⊵^{Γ,t} df  ⟺def df ∈ Clos({d | comp D^{Γ,t} d})

5.3 Realizability and Equivalence

Having defined the two logical relations, we can clearly put them together to define relations expressing that a machine value, environment or computation realizes a domain element:

  v ⊨^t d          ⟺def v ≼^t d ∧ v ⊵^t d
  e ⊨^Γ de         ⟺def e ≼^Γ de ∧ e ⊵^Γ de
  comp ⊨^{Γ,t} df  ⟺def comp ≼^{Γ,t} df ∧ comp ⊵^{Γ,t} df.

Again, the relation for computations extends that for values:

Lemma 4. If v ⊨^t d then v̂ ⊨^{[],t} (λ∗ : 1.[d]).

and these relations naturally induce typed relations between machine components. We just give the case for computations:

  Γ ⊢ comp1 ∼ comp2 : t ⟺def ∃df ∈ (JΓK ⇒ JtK⊥), comp1 ⊨^{Γ,t} df ∧ comp2 ⊨^{Γ,t} df

and we overload this notation to apply to simple pieces of code too:

  Γ ⊢ c1 ∼ c2 : t ⟺def Γ ⊢ ĉ1 ∼ ĉ2 : t

There seems no general reason to expect that the relations just defined are transitive (though we have no concrete counterexample), so we apply a transitive closure operation to get a low-level notion of equivalence: from now on we write Γ ⊢ comp ∼ comp′ : t and Γ ⊢ c ∼ c′ : t for the transitive closures of the relations above.
So now we have a notion of what it means for a piece of SECD machine code to be in the semantic interpretation of a source language type, and what it means for two pieces of code to be equal when considered at that type. Clearly, we are going to use this to say something about the correctness of our compilation scheme, but note that the details of just what code the compiler produces have not really shown up at all in the definitions of our logical relations: it is only the interfaces to compiled code – the way in which integers, booleans and pairs are encoded and the way in which function values are tested via application – that are mentioned in the logical relation. In fact, equivalence classes of the ∼ relations can be seen as defining a perfectly good compositional ‘denotational’ semantics for the source language in their own right. A feature of this semantics is that there are no statements about what (non-observable) intermediate configurations should look like. For example, we never say that when a function is entered with a call stack that looks like ‘x’, then eventually one reaches a configuration that looks like ‘y’. At the end of the day, all we ever talk about is termination and divergence of complete configurations, which we need to connect with the intended behaviour of closed programs of ground type; this we do by considering a range of possible external test contexts, playing the role of top-level continuations. If c ∈ Code then we say c diverges unconditionally if

  ∀cesd, (ĉ ⋈ cesd) ↦^ω

The following says that if a piece of code realizes the denotational bottom value at any type, then it diverges unconditionally:

Lemma 5 (Adequacy for bottom). For any c ∈ Code and type t, if c ⊨^{[],t} (λ∗ : 1.⊥_JtK) then c diverges unconditionally.

For ground type observations, we say a computation comp converges to a particular integer value n if plugging it into an arbitrary context equiterminates with plugging n into that context:

  ∀cesd, (n̂ ⋈ cesd ↦∗ ⟹ comp ⋈ cesd ↦∗) ∧ (n̂ ⋈ cesd ↦^ω ⟹ comp ⋈ cesd ↦^ω).

And we can then show that if a piece of code realizes a non-bottom element [n] of JIntK⊥ in the empty environment, then it converges to n:

Lemma 6 (Ground termination adequacy). For any c ∈ Code, if c ⊨^{[],Int} (λ∗ : 1.[n]) then ĉ converges to n.

Adequacy also holds for observation at the boolean type, with a definition of convergence to a value b that quantifies over those test contexts cesd that terminate or diverge uniformly for all machine values representing b. We finally show the compositionality of our realizability semantics.

Lemma 7 (Compositionality for application). For any cf, cx ∈ Code and df ∈ JΓK ⇒ (JtK ⇒ Jt′K⊥)⊥, dx ∈ JΓK ⇒ JtK⊥, if

  cf ⊨^{Γ,t→t′} df ∧ cx ⊨^{Γ,t} dx

then

  cf ++ cx ++ [App] ⊨^{Γ,t′} λ de : JΓK. (df de) ⊛ (dx de)

where ⊛ denotes the lifted (Kleisli) application.
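For the reader's convenience, the lifted application can be unfolded explicitly; this is just the standard Kleisli extension for the lift monad and is not an additional assumption:

  ⊥ ⊛ dx = ⊥        df ⊛ ⊥ = ⊥        [g] ⊛ [d] = g d

so the conclusion of Lemma 7 assigns ⊥ whenever either the function part or the argument part denotes ⊥, and otherwise applies the underlying continuous function.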
6. Applications

In this section, we illustrate the kind of results one can establish using the logical relations of the previous section.

6.1 Compiler Correctness

Our motivating application was establishing the functional correctness of the compiler that we presented earlier.

Theorem 1.
1. For all Γ, V, t, if Γ ⊢ V : t then LΓ ⊢ V : tM ≼^{Γ,t} [ JΓ ⊢ V : tK ]
2. For all Γ, M, t, if Γ ⊢ M : t then LΓ ⊢ M : tM ≼^{Γ,t} JΓ ⊢ M : tK

The two parts are proved simultaneously by induction on typing derivations, as in most logical relations proofs. In the case for recursive functions, there is a nested induction over the step indices.

Theorem 2.
1. For all Γ, V, t, if Γ ⊢ V : t then LΓ ⊢ V : tM ⊵^{Γ,t} [ JΓ ⊢ V : tK ]
2. For all Γ, M, t, if Γ ⊢ M : t then LΓ ⊢ M : tM ⊵^{Γ,t} JΓ ⊢ M : tK

This is another simultaneous induction on typing derivations. This time, the proof for recursive functions involves showing that each of the domain elements in the chain whose limit is the denotation is in the relation and then concluding that the fixpoint is in the relation by admissibility.

Corollary 1.
1. For all Γ, V, t, if Γ ⊢ V : t then LΓ ⊢ V : tM ⊨^{Γ,t} [ JΓ ⊢ V : tK ]
2. For all Γ, M, t, if Γ ⊢ M : t then LΓ ⊢ M : tM ⊨^{Γ,t} JΓ ⊢ M : tK

So the compiled code of a well-typed term always realizes the denotational semantics of that term. A consequence is that compiled code for whole programs has the correct operational behaviour according to the denotational semantics:

Corollary 2. For any M with [] ⊢ M : Int,
• If J[] ⊢ M : IntK = ⊥ then L[] ⊢ M : IntM diverges unconditionally.
• If J[] ⊢ M : IntK = [n] then L[] ⊢ M : IntM converges to n.

which follows by the adequacy lemmas above. And of course, by composing with the result that the denotational semantics is adequate with respect to the operational semantics of the source language, one obtains another corollary, that the operational semantics of complete source programs agrees with that of their compiled versions. It is this last corollary that is normally thought of as compiler correctness, but it is Corollary 1 that is really interesting, as it is that which allows us to reason about the combination of compiled code with code from elsewhere.

6.2 Low-level Equational Reasoning

In this section we give some simple examples of typed equivalences one can prove on low-level code. We first define some macros for composing SECD programs. If c, cf, cx ∈ Code then define

  LAMBDA(c)   = [PushC (c ++ [Ret])]
  APP(cf, cx) = cf ++ cx ++ [App]

6.2.1 Example: Commutativity of addition

Define the following source term

  plussrc = (Fix f x = [Fix g y = x + y])

and its compiled code

  pluscode(Γ) = LΓ ⊢ plussrc : Int → Int → IntM.

Now we can show the following:

Lemma 8. For any Γ, for any c1, c2 ∈ Code, and dc1, dc2 ∈ JΓK ⇒ JIntK⊥, if

  c1 ⊨^{Γ,Int} dc1   and   c2 ⊨^{Γ,Int} dc2

then

  Γ ⊢ APP(APP(pluscode Γ, c1), c2) ∼ APP(APP(pluscode Γ, c2), c1) : Int

In other words, for any code fragments c1, c2 that are in the interpretation of the source language type Int, manually composing those fragments with the code produced by the compiler for the curried addition function in either order yields equivalent behaviour of type Int.

6.2.2 Example: First projection

Define the source term

  projfstsrc = (Fix f x = [Fix g y = x])

and the compiled code

  projfstcode(Γ, t, t′) = LΓ ⊢ projfstsrc : t → t′ → tM

Lemma 9. For any Γ, t, t′, for any c1, c2 ∈ Code and dc1 ∈ JΓK ⇒ JtK⊥ and dc2 ∈ JΓK ⇒ Jt′K⊥, if

  c1 ⊨^{Γ,t} dc1   and   c2 ⊨^{Γ,t′} dc2

and furthermore

  ∀de ∈ JΓK, ∃dv ∈ Jt′K, (dc2 de) = [dv]

which says that the code c2 realizes some total denotational computation of type t′ in context Γ, then

  Γ ⊢ APP(APP(projfstcode(Γ, t, t′), c1), c2) ∼ c1 : t.

This says that the compiled version of projfstsrc behaves like the first projection, provided that the second argument does not diverge.

6.2.3 Example: Optimizing iteration

Our last example is slightly more involved, and makes interesting use of the non-functional equality test in the target language. We start by compiling the identity function on integers

  idsrc     = Fix id x = x
  idcode(Γ) = LΓ ⊢ idsrc : Int → IntM

and then define a higher-order function appnsrc that takes a function f from integers to integers, an iteration count n and an integer v, and returns f applied n times to v. We present the definition in ML-like syntax rather than our ANF language to aid readability (an OCaml rendering follows below):

  appnsrc = fun f => letrec apf n = fun v =>
              if n > 0 then f (apf (n-1) v) else v
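As a sketch only, the same iterator can be written directly in OCaml; the name appn is ours and this is merely an executable rendering of the intended source-level behaviour:

  let appn (f : int -> int) : int -> int -> int =
    (* apply f to v, n times *)
    let rec apf n v = if n > 0 then f (apf (n - 1) v) else v in
    apf

  (* For example, appn (fun x -> x + 1) 3 0 evaluates to 3. *)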
and we let

  appncode(Γ) = LΓ ⊢ appnsrc : (Int → Int) → Int → Int → IntM

Now we define a handcrafted optimized version, appnoptcode(Γ), in the SECD language, which would, if one could write it, correspond to ML-like source code looking something like this:

  fun f => fun n => fun v =>
    if f =α idcode(Γ) then v else appnsrc f n v

The optimized code checks to see if it has been passed the literal closure corresponding to the identity function, and if so simply returns v without doing any iteration. We are then able to show that for any Γ,

  Γ ⊢ appnoptcode(Γ) ∼ appncode(Γ) : (Int → Int) → Int → Int → Int

showing that the optimized version, which could not be written in the source language, is equivalent to the unoptimized original.

7. Discussion

We have given a realizability relation between the domains used to model a simply-typed functional language with recursion and the low-level code of an SECD machine with non-functional features. This relation was used to establish a semantic compiler correctness result and to justify typed equational reasoning on handcrafted low-level code. The relations make novel use of biorthogonality and step-indexing, and the work sheds interesting new light on the interaction of these two useful constructions.
As we said in the introduction, there are many other compiler correctness proofs in the literature, but they tend not to be so compositional or semantic in character as the present one. The classic work on the VLISP verified Scheme compiler by Guttman et al. (1995) is very close in character, being based on relating a denotational semantics to, ultimately, real machine code. The untyped PreScheme language treated in that work is significantly more realistic than the toy language of the present work, though the proofs were not mechanized. The denotational semantics used there was in CPS and the main emphasis was on the behaviour of complete programs. Chlipala (2007) has also used Coq to formalize a correctness relation between a high-level functional language and low-level code, though in that case the source language is total and so can be given an elementary semantics in Sets.
For ML-like languages (pure or impure), contextual equivalence at higher types is highly complex, depending subtly on exactly what primitives one allows. This is why, as we said in the introduction, we feel that fully abstract compilation might not be quite the right thing to aim for. For example, Longley (1999) shows that there are extensionally pure (and even useful) functionals that are only definable in the presence of impure features, such as references. Adding such functionals is not conservative – they refine contextual equivalence at order four and above in pure CBV languages – yet it seems clear that they will be implementable in many low-level machines. Complicating one's specifications to rule out these exotic programs in pursuit of full abstraction is not obviously worthwhile: it seems implausible that any compiler transformations or information flow policies will be sensitive to the difference. As another example, the presence of strong reflective facilities at the low level, such as being able to read machine code instructions, might well make parallel-or definable; this would obviously break full abstraction with respect to the source language, but we might well wish to allow it.⁴ The case against full abstraction becomes stronger when one considers that one of our aims is to facilitate semantically type safe linking of code produced from different programming languages. There is a fair degree of ‘wiggle room’ in deciding just how strong the semantic assumptions and guarantees should be across these interfaces. They should be strong enough to support sound and useful reasoning from the point of view of one language, but not insist on what one might almost think of as ‘accidental’ properties of one language that might be hard to interpret or ensure from the point of view of others.
The Coq formalization of these results was pleasantly straightforward. Formalizing the SECD machine, compiler, logical relations and examples, including the compiler correctness theorem, took a little over 4000 lines, not including the library for domain theory and the semantics of the source language. The extra burden of mechanized proof seems fairly reasonable in this case, and the results, both about the general theory and the examples, are sufficiently delicate that our confidence in purely paper proofs would be less than complete.
There are many obvious avenues for future work, including the treatment of richer source languages and type systems, and of lower-level target languages. We intend particularly to look at source languages with references and polymorphism and at a target machine like the idealized assembly language of our previous work. We would also like to give low-level specifications that are more independent of the source language – the current work doesn't mention source language terms, but does still talk about particular cpos. We would like to express essentially the same constraints in a more machine-oriented relational Hoare logic which might be more language neutral and better suited for independent verification. It would also be interesting to look at type systems that are more explicitly aimed at ensuring secure information flow.

⁴ Discussing the sense in which these two candidate full-abstraction-breaking operations are incompatible would take us too far afield, however.

Acknowledgments

Thanks to Andrew Kennedy, Neel Krishnaswami, Nicolas Tabareau and Carsten Varming for many useful discussions. Extra thanks to Andrew and Carsten for their work on the Coq formalization of the source language and its denotational semantics.

References

M. Abadi. ⊤⊤-closed relations and admissibility. Mathematical Structures in Computer Science, 10(3), 2000.
M. Abadi. Protection in programming-language translations. In 25th International Colloquium on Automata, Languages and Programming (ICALP), volume 1443 of Lecture Notes in Computer Science, 1998.
A. Ahmed. Step-indexed syntactic logical relations for recursive and quantified types. In 15th European Symposium on Programming (ESOP), volume 3924 of Lecture Notes in Computer Science, 2006.
A. Ahmed and M. Blume. Typed closure conversion preserves observational equivalence. In 13th ACM SIGPLAN International Conference on Functional Programming (ICFP), 2008.
A. Ahmed, D. Dreyer, and A. Rossberg. State-dependent representation independence. In 36th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL), 2009.
A. Appel and D. McAllester. An indexed model of recursive types for foundational proof-carrying code. ACM Transactions on Programming Languages and Systems (TOPLAS), 23(5), 2001.
A. W. Appel, P.-A. Melliès, C. D. Richards, and J. Vouillon. A Very Modal Model of a Modern, Major, General Type System. In 34th ACM SIGPLAN-
SIGACT Symposium on Principles of Programming Languages (POPL), 2007.
N. Benton. Abstracting allocation: The new new thing. In 20th International Workshop on Computer Science Logic (CSL), volume 4207 of LNCS, 2006.
N. Benton and N. Tabareau. Compiling functional types to relational specifications for low level imperative code. In 4th ACM SIGPLAN Workshop on Types in Language Design and Implementation (TLDI), 2009.
N. Benton and U. Zarfaty. Formalizing and verifying semantic type soundness of a simple compiler. In 9th ACM SIGPLAN International Symposium on Principles and Practice of Declarative Programming (PPDP), 2007.
N. Benton, A. Kennedy, and C. Varming. Some domain theory and denotational semantics in Coq. In 22nd International Conference on Theorem Proving in Higher Order Logics (TPHOLs), volume 5674 of Lecture Notes in Computer Science, 2009.
A. Chlipala. A certified type-preserving compiler from lambda calculus to assembly language. In ACM SIGPLAN 2007 Conference on Programming Language Design and Implementation (PLDI), 2007.
M. Dave. Compiler verification: a bibliography. ACM SIGSOFT Software Engineering Notes, 28(6), 2003.
J. Guttman, J. Ramsdell, and M. Wand. VLISP: A verified implementation of Scheme. Lisp and Symbolic Computation, 8(1/2), 1995.
T. Hardin, L. Maranget, and B. Pagano. Functional runtime systems within the lambda-sigma calculus. Journal of Functional Programming, 8, 1998.
A. Kennedy. Securing the .NET programming model. Theoretical Computer Science, 364(3), 2006.
J. L. Krivine. Classical logic, storage operators and second-order lambda calculus. Annals of Pure and Applied Logic, 1994.
P. Landin. The mechanical evaluation of expressions. The Computer Journal, 6(4), 1964.
X. Leroy. Formal certification of a compiler back-end, or: programming a compiler with a proof assistant. In 33rd ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL), 2006.
X. Leroy and H. Grall. Coinductive big-step operational semantics. Information and Computation, 207(2), 2009.
J. Longley. When is a functional program not a functional program? In 4th ACM SIGPLAN International Conference on Functional Programming (ICFP), 1999.
J. McCarthy and J. Painter. Correctness of a compiler for arithmetic expressions. Proceedings of Symposia in Applied Mathematics, 19:33–41, 1967.
A. M. Pitts and I. D. B. Stark. Operational reasoning for functions with local state. In Higher Order Operational Techniques in Semantics. CUP, 1998.
G. D. Plotkin. LCF considered as a programming language. Theoretical Computer Science, 5, 1977.
J. Vouillon and P.-A. Melliès. Semantic types: A fresh look at the ideal model for types. In 31st ACM Symposium on Principles of Programming Languages (POPL), 2004.
G. Winskel. The Formal Semantics of Programming Languages. MIT Press, 1993.

A. The Problem With Realizing Recursion

This appendix gives a concrete example of how defining semantics for functional types as sets of untyped programs can run into problems in the case that the source language includes recursion and the untyped target includes non-functional operations. By non-functional operations, we particularly mean operations that examine the actual syntax or code of functional terms. The example we take here is a syntactic equality test: an operation that takes two arguments and returns true if the two (possibly functional) terms are syntactically equal and otherwise returns false.

Definition of VULE. As an example of a target language, we consider the call-by-value untyped lambda calculus with a syntactic equality test (VULE). We give the formal definition and operational semantics of VULE below. We fix a countable set V of variables. The set of values Val and the set of terms Term with variables in V are mutually inductively defined by the following rules:

  Val  ::= x | λx. t
  Term ::= v | t s | u ≡α v | ERROR

where x ∈ V, u, v ∈ Val, t, s ∈ Term. As usual, we assume that application is left-associative. FVar(t) for any term t denotes the set of free variables in t, and t[x ↦ s] the capture-avoiding substitution of the term s for the variable x in the term t. We also define some syntactic sugar for boolean operations and recursion (implemented via a CBV fixed point combinator):

  TRUE  ≜ λx. λy. x y
  FALSE ≜ λx. λy. y x
  if t then s1 else s2 ≜ t (λx. s1) (λx. s2)                              where x ∉ FVar(s1) ∪ FVar(s2)
  rec t ≜ λx. (λy. t (λz. y y z)) (λy. t (λz. y y z)) x                   where x, y ∉ FVar(t), z ≠ y

The small-step call-by-value operational semantics of VULE is given as follows:

  t ⇝ t′  ⟹  t s ⇝ t′ s
  s ⇝ s′  ⟹  v s ⇝ v s′
  (λx. t) v ⇝ t[x ↦ v]
  u ≈α v  ⟹  (u ≡α v) ⇝ TRUE
  u ≉α v  ⟹  (u ≡α v) ⇝ FALSE
  ERROR v ⇝ ERROR

where x ∈ V, u, v ∈ Val, t, t′, s, s′ ∈ Term and where u ≈α v means that u is alpha-equivalent to v. For convenience, we define the multi-step relation by setting t ⇝⁺ t′ iff the term t reaches t′ in one or more steps. Although we have used lambda-encodings of booleans, conditionals and recursion, so as to be both concrete and minimal, the argument we shall make does not depend on these details. In fact we will only rely on some elementary properties of our encodings. The following hold for any v ∈ Val and t, s1, s2 ∈ Term:

  (Prop 1) TRUE, FALSE and rec t are values;
  (Prop 2) if TRUE then s1 else s2 ⇝⁺ s1;
  (Prop 3) if FALSE then s1 else s2 ⇝⁺ s2;
  (Prop 4) (rec t) v ⇝⁺ t (rec t) v.

The problem. We now define the problem. Let Type be the set of simple types defined by the rule T ∈ Type ::= Bool | T → T. As usual, the type constructor → is right-associative. Now we consider what happens if we try to give a semantics to these types as
sets of closed VULE terms that admit recursively-defined functions. Whatever the details, we would expect the following, very minimal, conditions to be satisfied by the interpretation of types J−K : Type → P(Term):

  (Asm 1) TRUE, FALSE ∈ JBoolK.
  (Asm 2) Given a value u ∈ JA1 → . . . → An → BK and values v1 ∈ JA1K, . . . , vn ∈ JAnK, the application of u to the vi's does not go wrong: u v1 . . . vn does not reach ERROR.
  (Asm 3) For a value u, if u v ⇝⁺ v for all values v ∈ JAK, then u ∈ JA → AK.

We now show that, rather distressingly, for any semantics J−K satisfying the above conditions, the fixed point combinator Y defined by

  Y ≜ λf. rec f

is not in the set

  J((Bool → Bool) → Bool → Bool) → Bool → BoolK.

Let the value F be defined by

  F ≜ λg. λf. if f ≡α (rec g) then ERROR else f

and observe the following facts about the behaviour of F:

  (Obs 1): For any value v ∈ Val, (rec F) v ⇝⁺ F (rec F) v, which reduces in one or more steps to ERROR if v ≈α rec rec F, and to v otherwise.
  (Obs 2): (rec rec F) TRUE ⇝⁺ (rec F) (rec rec F) TRUE ⇝⁺ ERROR TRUE ⇝⁺ ERROR.
  (Obs 3): Y (rec F) TRUE ⇝⁺ (rec rec F) TRUE ⇝⁺ ERROR.

From these observations, we can conclude that

  (Con 1): rec rec F ∉ JBool → BoolK, because TRUE ∈ JBoolK (Asm 1), (rec rec F) TRUE ⇝⁺ ERROR (Obs 2), and well-typed applications don't go wrong (Asm 2).
  (Con 2): rec F ∈ J(Bool → Bool) → Bool → BoolK, because rec F behaves as the identity on all values except rec rec F (Obs 1), and since rec rec F is not in JBool → BoolK (Con 1), rec F must behave as the identity on all values that are in JBool → BoolK; we get the conclusion by (Asm 3).
  (Con 3): Now we see that Y ∉ J((Bool → Bool) → Bool → Bool) → Bool → BoolK because

    Y (rec F) TRUE ⇝⁺ ERROR                          (Obs 3)
    rec F ∈ J(Bool → Bool) → Bool → BoolK            (Con 2)
    TRUE ∈ JBoolK                                    (Asm 1)

  and well-typed applications don't go wrong (Asm 2).

This is not at all what we wanted! One would expect (Con 2) to be false, since F is clearly highly suspicious, and then the blameless Y could have the expected type. Our analysis of the problem is that (Asm 3) is the only one of our assumptions that could be modified in order to get the expected result. In short, just testing with ‘good’ arguments is actually insufficient grounds for concluding that a function is good: we need some extra tests on ‘partially good’ values, which is just what step-indexing will supply.

B. On the non-closure of D^t

The reason {d | v D^t d} is not always closed under taking limits of chains is essentially that (·)^⊥⊥ does not preserve meets. In particular

  ⋂_i P_i^⊥⊥ = (⋃_i P_i^⊥)^⊥ ⊃ (⋂_i P_i)^⊥⊥

with the last inclusion following from contravariance and the strict inclusion

  ⋃_i P_i^⊥ ⊂ (⋂_i P_i)^⊥.

Now let ⟨df_i⟩ be a chain of elements of Jt1 → t2K with ⊔_i df_i = df, such that for all i and for all dv ∈ Jt1K, df_i dv ≠ ⊥. Then we claim that in general, there is a strict inclusion

  {f | ∀i, f D^{t1→t2} df_i} ⊃ {f | f D^{t1→t2} df}

Expanding the definitions, f being in the right-hand set above means exactly that for all v, dv such that v D^{t1} dv,

  (([App], [v, f]), λ∗. df dv) ∈ (I →^B (D^{t2})⊥)

which expands to

  ∀d, [d] = df dv ⇒ ([App], [v, f]) ∈ ⇑(↓({w | w D^{t2} d}))

or, equivalently (because of the non-⊥ assumption),

  ([App], [v, f]) ∈ ⇑(↓({w | ∀d, [d] = df dv ⇒ w D^{t2} d}))

which is to say, ([App], [v, f]) is in

  ⇑(⇓({ŵ | ∀d, [d] = df dv ⇒ w D^{t2} d})).                                   (1)

Using similar reasoning, one can deduce that f being in the left-hand side of our inclusion is equivalent to saying that for all v, dv such that v D^{t1} dv, the computation ([App], [v, f]) is in

  ⋂_i ⇑(⇓({ŵ | ∀d, [d] = df_i dv ⇒ w D^{t2} d})).                             (2)

But there is in general a strict inclusion between the set in (1) and that in (2). By down-closure,

  {ŵ | ∀d, [d] = df dv ⇒ w D^{t2} d} ⊆ ⋂_i {ŵ | ∀d, [d] = df_i dv ⇒ w D^{t2} d}

so

  ⇑(⇓({ŵ | ∀d, [d] = df dv ⇒ w D^{t2} d}))
    ⊆ ⇑(⇓(⋂_i {ŵ | ∀d, [d] = df_i dv ⇒ w D^{t2} d}))
    ⊂ ⋂_i ⇑(⇓({ŵ | ∀d, [d] = df_i dv ⇒ w D^{t2} d}))

by the general property of biorthogonals above. To address this issue, we can either add in ‘just enough’ limits or close up under all limits. The first version of our definitions looked like

  v ⊵^t dv ⟺def ∃⟨d_i⟩, dv ⊑ ⊔_i d_i ∧ ∀i, v D^t d_i

which adds limits of arbitrary chains to (and down-closes) the right-hand side of the relation. The resulting relation may still not be closed under limits of chains, but actually works perfectly well for our purposes, as it does contain limits of all chains arising in the semantics of functions in our language. (And we even proved all the theorems in Coq using this definition. . . ) Nevertheless, it seems mathematically more natural to work with Scott-closed sets, so we have now adopted the definition using Clos(·) shown in the main text.
Scribble: Closing the Book on Ad Hoc Documentation Tools

Matthew Flatt, University of Utah and PLT ([email protected])
Eli Barzilay, Northeastern University and PLT ([email protected])
Robert Bruce Findler, Northwestern University and PLT ([email protected])
Abstract Scribble is a system for writing library documentation, user guides, and tutorials. It builds on PLT Scheme’s technology for language extension, and at its heart is a new approach to connecting prose references with library bindings. Besides the base system, we have built Scribble libraries for JavaDoc-style API documentation, literate programming, and conference papers. We have used Scribble to produce thousands of pages of documentation for PLT Scheme; the new documentation is more complete, more accessible, and better organized, thanks in large part to Scribble’s flexibility and the ease with which we cross-reference information across levels. This paper reports on the use of Scribble and on its design as both an extension and an extensible part of PLT Scheme. Categories and Subject Descriptors I.7.2 [Document and Text Processing]: Document Preparation—Languages and systems General Terms
Design, Documentation, Languages

1. Documentation as Code
Most existing documentation tools fall into one of three categories: LATEX-like tools that know nothing about source code; JavaDoclike tools that extract documentation from annotations in source code; and WEB-like literate-programming tools where source code is organized around a prose presentation. Scribble is a new documentation infrastructure for PLT Scheme that can support and integrate all three kinds of tools. Like the LATEX category, Scribble is suitable for producing stand-alone documents. Like the other two categories, Scribble creates a connection between documentation and the program that it describes—but without restricting the form of the documentation like JavaDoc-style tools, and with a well-defined connection to the language’s scoping that is lacking in WEB-like tools. Specifically, Scribble leverages lexical scoping as supplied by the underlying programming language, instead of ad hoc textual manipulation, to connect documentation and code. This connection supports abstractions across the prose and code layers, and it enables a precise and consistent association (e.g., via hyperlinks) of references in code fragments to specifications elsewhere in the documentation. For example, @scheme[circle] in a document source generates the output text circle. If the source form appears within a lexical context that imports the slideshow library, then the rendered circle is hyperlinked to the documentation for the
slideshow library—and not to the documentation of, say, the htdp/image library, which exports a circle binding for a different GUI library. Moreover, the hyperlink is correct even if @scheme[circle] resides in a function that is used to generate documentation, and even if the lexical context of the call does not otherwise mention slideshow. Such lexically scoped fragments of documentation are built on the same technology as Scheme's lexically scoped macros, and they provide the same benefits for documentation abstraction and composition as for ordinary programs.

Figure 1: DrScheme with binding arrows and documentation links on Scribble code

To support documentation in the style of JavaDoc, a Scribble program can “include” a source library and extract its documentation. Bindings in the source are reflected naturally as cross-references in the documentation. Similarly, a source program can use module-level imports to introduce and compose literate-programming forms; in other words, the module system acts as the language that Ramsey (1994) has suggested to organize the composition of noweb extensions. Scribble's capacity to span documentation-tool categories is a consequence of PLT Scheme's extensibility. Extensibility is an obstacle for JavaDoc-style tools, which parse a program's source text and would have to be extended in parallel to the language. Scribble, in contrast, plugs into the main language's extensibility machinery, so it both understands language extensions and is itself extensible. Similarly, Scheme macros accommodate a WEB-like
organization of a library’s implementation, and the same macros can simultaneously organize the associated documentation. Indeed, Scribble documents are themselves Scheme programs, which means that PLT Scheme tools can work on Scribble sources. Figure 1 shows this paper’s source opened in DrScheme. After clicking Check Syntax, then a right-click on a use of emph directly accesses the documentation of the emph function, even though the surface syntax of the document source does not look like Scheme. Such documentation links are based on the same lexical information and program-expansion process that the compiler uses, so the links point precisely to the right documentation. We developed Scribble primarily for stand-alone documentation, but we have also developed a library for JavaDoc-style extraction of API documentation, and we have created a WEB-style tool for literate programming. In all forms, Scribble’s connection between documentation and source plays a crucial role in crossreferencing, in writing examples within the documentation, and in searching the documentation from within the programming environment. These capabilities point the way toward even more sophisticated extensions, and they illustrate the advantages of treating documentation as code.
2.
The initial #lang scribble/doc line declares that the module uses Scribble’s documentation syntax, as opposed to using #lang scheme for S-expression syntax. At the same time, the #lang line also imports all of the usual PLT Scheme functions and syntax. The @(require scribble/manual) form imports additional functions and syntactic forms specific to typesetting a user manual. The remainder of the module represents the document content. The semantics of the document body is essentially that of Scheme, where most of the text is represented as Scheme strings. Although we build Scribble on Scheme, a LATEX-style syntax works better than nested S-expressions, because it more closely resembles the resulting textual layout. First, although all of the text belongs in a section, it is implicitly grouped by the section title, instead of explicitly grouped into something like a section function call. Second, the default parsing mode is “text” instead of “expression,” so that commas, periods, quotes, paragraphs, and sections behave in the usual way for prose, while the @ notation provides a uniform way to escape to a Scheme function call with text-mode arguments. Third, various automatic rules convert ASCII to more sophisticated typeset forms, such as the conversion of --to an em-dash and ‘‘...’’ to curly quotes. Although LATEX and Scribble use a similar syntax, the semantics are completely different. For example, itemize is a function that accepts document fragments created by the item function, instead of a text-parsing macro like LATEX’s itemize environment. The square brackets after itemize in the document source reflect that it accepts item values, whereas item and many other functions are followed by curly braces that indicate text arguments. The @notation is simply another way of writing S-expressions, as we describe in detail in Section 4.
Scribbling Prose
The beginning of the PLT Scheme overview documentation demonstrates several common typesetting forms: 1
Welcome to PLT Scheme
Depending on how you look at it, PLT Scheme is • a programming language — a descendant of Scheme,
3.
which is a dialect of Lisp;
Scribbling Code
The PLT Scheme tutorial “Quick: An Introduction to PLT Scheme with Pictures” starts with a few paragraphs of prose and then shows the following example interaction:
• a family of programming languages — variants of Scheme,
and more; or • a set of tools for using a family of programming lan-
guages. > 5 5 > "art gallery" "art gallery"
Where there is no room for confusion, we use simply “Scheme” to refer to any of these facets of PLT Scheme. The Scribble syntax for generating this document fragment is reminiscent of LATEX, using @ (like texinfo) instead of \:
The > represents the Scheme prompt. The first 5 is a constant value in an expression, so it is colored green in the tutorial, while the second 5 is a result, so it is colored blue. In Scheme, the syntax for output does not always match the syntax for expressions, so the different colors are useful hints to readers—but only if they have the consistency of an automatic annotation. The source code for the first example is simply
#lang scribble/doc @(require scribble/manual) @section{Welcome to PLT Scheme} Depending on how you look at it, @bold{PLT Scheme} is
@interaction[5 "art gallery"]
@itemize[ @item{a @emph{programming language} --- a descendant of Scheme, which is a dialect of Lisp;}
where interaction is provided by a Scribble library to both evaluate examples and typeset the expressions and results with syntax coloring. Since example expressions are evaluated when the document is built, examples never contain bugs where evaluation does not match the predicted output. The second example in the “Quick” tutorial shows a more interesting evaluation:
@item{a @emph{family} of programming languages --- variants of Scheme, and more; or} @item{a set of @emph{tools} for using a family of programming languages.} ]
> (circle 10)
Where there is no room for confusion, we use simply ‘‘Scheme’’ to refer to any of these facets of PLT Scheme.
Here, again, the expression is colored in the normal way for PLT Scheme code. More importantly, the circle identifier is hyperlinked to the definition of the circle function in the slideshow library, so an interested reader can follow the link to learn more about circle. Meanwhile, the result is shown as a circle image, just as it would be shown when evaluating the expression in DrScheme. The source code for the second example is equally simple:
Alternatively, instead of writing the documentation for circle in a stand-alone document—where there is a possibility that the documented contract does not match the contract in the implementation—the documentation could be written with the implementation of circle. In that case, the documentation would look slightly different, since it would be part of the module’s export declarations: (provide/doc [circle ([diameter real?] . -> . pict?) @{Creates an unfilled ellipse.}])
@mr-interaction[(circle 10)]
The author of the tutorial had to implement the mr-interaction syntactic form, because interaction does not currently support picture results. The syntax coloring, hyperlinking, and evaluation of (circle 10), however, is implemented by expanding to interaction. In particular, circle is correctly hyperlinked because the module containing the above source also includes
With provide/doc, the single contract specification for circle is used in two ways: at run time to check arguments and results for circle, and when building the documentation to show the expected arguments and results of circle. Although defproc and provide/doc are provided with Scribble, they are not built into the core typesetting engine. They are written in separate libraries, and Scribble users could have implemented these forms. We describe this approach to extending Scribble further in Section 8.
@(require (for-label slideshow))
which causes the circle binding to be imported from the slideshow module for the purposes of hyperlinking. Based on this import and a database mapping bindings to definition sites, Scribble can automatically insert the hyperlink. A module that is imported only with for-label is not run when the documentation is built, because the time at which a document is built may not be a suitable time to actually run a module. As an extreme example, an author might want to document a module whose job is to erase all files on the disk. More practically, executing a GUI library might require a graphics terminal, while the documentation for the graphics library can be built using only a text terminal. Pervasive and precise hyperlinking of identifiers greatly improves the quality of documentation, and it relieves a document author from much tedious cross-referencing work, much like automatic hyperlinking in wikis. The author need not specify where circle is documented, but instead merely import for-label a module that supplies circle, and the documentation system is responsible for correlating the use and the definition. Furthermore, since hyperlinks are used in examples everywhere, an author can expect readers to follow them, instead of explicitly writing “for more information on the circle procedure used above, see ...” These benefits are crucial when a system’s documentation runs to thousands of pages. Indeed, PLT Scheme’s documentation has 57,997 links between manuals, which is roughly 15 links per printed page (and which does not count the additional 105,344 intra-manual links). Clicking the circle hyperlink leads to its documentation in a standard format:
4. @s and []s and {}s, Oh My! Users of a text-markup language experience first and foremost the language’s concrete syntax. The same is true of any language, but in the case of text, authors with different backgrounds have arrived at a remarkably consistent view of the appropriate syntax: it should use blank lines to indicate paragraph breaks, double-quote characters should not be special, and so on. At the same time, a programmable mark-up language needs a natural escape to the programming layer and back. From the perspective of a programming language, conventional notations for string literals are terrible for writing text. The quoting rules tend to be complex, and they usually omit an escape for arbitrarily nested expressions. “Here strings” and string interpolation can alleviate some of the quoting and escape problems, but they are insufficient for writing large amounts of text with frequent nested escapes to the programming language. More importantly, building text in terms of string escapes and operations like string-append distracts from the business of writing prose, which is about text and markup rather than strings and function calls. Indeed, many documentation systems, like JavaDoc, avoid the limitations of string literals in the language by defining a completely new syntax that is embedded within comments. Of course, this approach sacrifices any connection between the text and the programming language. For Scribble, our solution is the @-notation, which is a textfriendly alternative to traditional S-expression syntax. More precisely, the @-notation is another way to write down arbitrary Sexpressions, but it is tuned for writing blocks of free-form text. The @-expression notation is a strict extension of PLT Scheme’s S-expression syntax; the @ character has no special meaning in Scheme strings, in comments, or in the middle of Scheme identifiers. Furthermore, since it builds on the existing S-expression parser, it inherits all of the existing source-location support (e.g., for error messages).
(circle diameter) → pict? diameter : real? Creates an unfilled ellipse. In this definition, real? and pict? are contracts for the function argument and result. Naturally, they are in turn hyperlinked to their definitions, because suitable libraries are imported for-label in the documentation source. The above documentation of circle is implemented using defproc:
4.1
@-expressions as S-expressions
The grammar of an @-expression is roughly as follows (where @, [, ], {, and } are literal, and x? means that x is optional): hat-expri hopi hS-expri htexti
@defproc[(circle [diameter real?]) pict?]{ Creates an unfilled ellipse. }
::= ::= ::= ::=
@hopi? [hS-expri*]? {htexti}? hS-expri that does not start with [ or { any PLT Scheme S-expression text with balanced {...} and with @-exprs
An @-expression maps to an S-expression as follows:
Another way to describe the @-expression syntax is simply @hopi[...]{...} where each of the three parts is optional. When hopi is included but both kinds of arguments are missing, then hopi can produce a value to use directly instead of a function to call. The hopi in an @-expression is not constrained to be an identifier; it can be any S-expression that does not start with { or [. For example, an argumentless @(require scribble/manual) is equivalent to the S-expression (require scribble/manual). The spectrum of @-expression forms enables a document author to use whichever variant is most convenient. For a given operation, however, one particular variant is typically used. In general, @hopi{...} or @hopi[...] is used to imply a typesetting operation, whereas @hopi more directly implies an escape to Scheme. Hence, the form @emph{Yes!} is preferred to the equivalent @(emph "Yes!"), while @(require scribble/manual) is preferred to the equivalent @require[scribble/manual]. A combination of S-expression and text-mode arguments is often useful to “customize” an operation that consumes text. The @title[#:style ’toc]{Contracts} example illustrates this combination, where the optional ’toc style customizes the typeset result of the title function. In other cases, an operation that specifically leverages S-expression notation may also have a text component. For example,
• An @hopi{...} sequence combines hopi with text-mode argu-
ments. For example, @emph{Yes!}
is equivalent to the S-expression (emph "Yes!")
Also, since @ keeps its meaning inside text-mode arguments, @section{Country @emph{and} Western}
is equivalent to the S-expression (section "Country " (emph "and") " Western") • An @hopi[...] sequence combines hopi with S-expression ar-
guments. For example, @itemize[(item "a") (item "b")]
@defproc[(circle [diameter real?]) pict?]{ Creates an unfilled ellipse. }
is equivalent to the S-expression (itemize (item "a") (item "b"))
is equivalent to
• An @hopi[...]{...} sequence combines S-expression argu-
(defproc (circle [diameter real?]) pict? "Creates an unfilled ellipse.")
ments and text-mode arguments. For example, @title[#:style ’toc]{Contracts}
but as the description of the procedure becomes more involved, using text mode for the description becomes much more convenient. An @ works both as an escape from text mode and as a form constructor in S-expression contexts. As a result, @-forms keep their meaning whether they are used in a Scheme expression or in a Scribble text part. This equivalence significantly reduces the need for explicit quoting and unquoting operations, and it helps avoid bugs due to incorrect quoting levels. For example, instead of @itemize[(item "a") (item "b")], an itemization is normally written @itemize[@item{a} @item{b}], since items for an itemization are better written in text mode than as conventional strings; in this case, @item{a} can be used directly without first switching back to text mode. Overall, @-expressions are crucial to Scribble’s flexibility in the same way that S-expressions are crucial to Scheme’s flexibility— and, in the same way, the benefit is difficult to quantify. Furthermore, just as S-expressions can be used for more than writing Scheme programs, the @ notation can be used for purposes other than documentation, and the @-notation parser is available for use in PLT Scheme separate from the rest of the Scribble infrastructure. We use it as an alternative to HTML for building the plt-scheme.org web pages, more generally in a template system supported by the PLT Scheme web server, and also as a text preprocessor language similar in spirit to m4 for generating plaintext files.
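As a concrete instance of this preference, the itemization mentioned above can be spelled either way; both of the following read as the S-expression (itemize (item "a") (item "b")):

    @itemize[(item "a") (item "b")]
    @itemize[@item{a} @item{b}]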
4.2 Documentation-Specific Decoding

The @ notation supports local text transformations and mark-up, but it does not directly address some other problems specific to organizing a document's source:
• Section content should be grouped implicitly via section, subsection, etc. declarations, instead of explicitly nesting section constructions.

• Paragraph breaks should be determined by empty lines in the source text, instead of explicitly constructing paragraph values.

• A handful of ASCII character sequences should be converted automatically to more sophisticated typesetting elements, such as converting ‘‘ and ’’ to curly quotes or --- to an em-dash.

These transformations are specific to typesetting, and they are not appropriate for other contexts where the @ notation is useful. Therefore, the @ parser in Scribble faithfully preserves the original text in Scheme strings, and a separate decode layer in Scribble provides additional transformations. Functions like bold and emph apply decode-content to their arguments to perform ASCII transformations, and item calls decode-flow to transform ASCII sequences and form paragraphs between empty lines. In contrast, tt and verbatim do not call the decode layer, and they instead typeset text exactly as it is given.
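A small sketch of the difference, using the forms named above: emph decodes its argument, so the three hyphens below typeset as an em-dash, while verbatim leaves them exactly as written.

    @emph{An em-dash: ---}
    @verbatim{An em-dash: ---}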
For example, the source document

    #lang scribble/doc
    @(require scribble/manual)

    @title{Tubers}

    @section{Problem}

    You say ‘‘potato.’’
    I say ‘‘potato.’’

    @section{Solution}

    Call the whole thing off.

invokes the decode layer, producing a module that is roughly equivalent to the following (where a part is a generic section):

    #lang scheme/base
    (require scribble/struct)
    (provide doc)

    (define doc
      (make-part
       (list "Tubers")
       (list
        (make-part
         (list "Problem")
         (list (make-paragraph
                (list "You say \u201Cpotato.\u201D"))
               (make-paragraph
                (list "I say \u201Cpotato.\u201D"))))
        (make-part
         (list "Solution")
         (list (make-paragraph
                (list "Call the whole thing off.")))))))

5. Document Modules

Like all PLT Scheme programs, Scribble documents are organized into modules, each in its own file. A #lang line starts a module, and most PLT Scheme modules start with #lang scheme or #lang scheme/base. A Scribble document normally starts with #lang scribble/doc to use a prose-oriented notation with @ syntax, but a Scribble document can be written in any notation and using any helper functions and syntax, as long as it exports a doc binding whose value is an instance of the Scribble part structure type. For example,

    #lang scheme
    (require scribble/decode)
    (define doc (decode '("Hello, world!")))
    (provide doc)

implements in Scheme notation a Scribble document that contains only the text “Hello, world!”

Larger documents are typically split across modules/files along section boundaries. Subsections are incorporated into a larger section using the include-section form, which expands to a require to import the sub-section module and an expression that produces the doc part exported by the module. Since document inclusion corresponds to module importing, all of the usual PLT Scheme tools for building and executing modules apply to Scribble documents.

When a large document source is split into multiple modules, most of the modules need the same basic typesetting functions as well as the same “standard” bindings for examples. In Scribble, both sets of bindings can be packaged together; since for-label declarations build on the module system's import mechanisms, they work with the module system's re-exporting mechanisms. For example, the documentation for a library that builds on the scheme/base library might use this "common.ss" library:

    #lang scheme/base
    (require scribble/manual
             (for-label lang/htdp-beginner))
    (provide (all-from-out scribble/manual)
             (for-label (all-from-out lang/htdp-beginner)))

Then, each part of the document can be implemented as

    #lang scribble/doc
    @(require "common.ss")
    ....

instead of separately requiring scribble/manual and (for-label lang/htdp-beginner) in every file.
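A sketch of a top-level document that ties these pieces together, assuming hypothetical file names for the common library and the section modules:

    #lang scribble/doc
    @(require "common.ss")

    @title{A Hypothetical Guide}

    @include-section["intro.scrbl"]
    @include-section["contracts.scrbl"]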
6. Modules and Bindings

As an embedded domain-specific language, Scribble follows a long tradition of using Lisp- and Scheme-style macros to implement little languages. In particular, Scribble relies heavily on the Scheme notion of syntax objects (Sperber 2007), which are fragments of code that have lexical-binding information attached. Besides using syntax objects in the usual way to implement macros, Scribble uses syntax objects to carry lexical information all the way through document rendering. For example, @scheme[lambda] expands to roughly (typeset-id #'lambda), where #'lambda is similar to 'lambda but produces a syntax object (with its lexical information intact) instead of a symbol.

At the same time, many details of Scribble's implementation rely on PLT Scheme extensions to Scheme macros. Continuing the above example, the typeset-id function applies PLT Scheme's identifier-label-binding function to the given syntax object to determine the source module of its binding. The typeset-id function can then construct a cross-reference key based on the identifier and the source module; the documentation for the binding pairs the same identifier and source module to define the target of the cross-reference.

A deeper dependence of Scribble on PLT Scheme relates to #lang parsing. The #lang notation organizes reader extensions of Scheme (i.e., changes to the way that raw text is converted to S-expressions) to allow new forms of surface syntax.
The identifier after #lang in the original source acts as the “language” of a module. To parse a #lang line, the identifier after #lang is used as the name of a library collection that contains a "lang/reader.ss" module. The collection's "lang/reader.ss" module must export a read-syntax function, which takes an input stream and produces a syntax object. The "lang/reader.ss" module for scribble/doc parses the given input stream in @-notation text mode, and then wraps the result in a module form. For example,

    #lang scribble/doc
    @(require scribble/manual)
    It was a @bold{dark} and
    @italic{stormy} night.

in a file named "hello.scrbl" reads as

    (module hello scribble/doclang
      doc ()
      "\n" (require scribble/manual) "\n"
      "It was a " (bold "dark") " and "
      (italic "stormy") " night." "\n")

where doc is inserted by the scribble/doc reader as the identifier to export from the module, and the () is a convenience explained below.

The module form is PLT Scheme's core module form, and it generalizes the standard library form (Sperber 2007) to give macros more control in transforming the body of a module. Within a module, the first identifier is the relative name of the module, and the second identifier indicates a module to supply initial bindings for the module body. In particular, the initial import of a module is responsible for supplying a #%module-begin macro that is implicitly applied to the entire content of the module. In the case of scribble/doclang, the #%module-begin macro lifts out all import and definition forms in the body, passes all remaining content to the decode function, and binds the result to an exported doc identifier. Thus, macro expansion converts the hello module to the following:

    (module hello scheme/base
      (require scribble/doclang
               scribble/manual)
      (provide doc)
      (define doc
        (decode
         "\n" "\n"
         "It was a " (bold "dark") " and "
         (italic "stormy") " night." "\n")))

A subtlety in the process of lifting out import and definition forms is that they might not appear directly, but instead appear in the process of macro expansion. For example, include-section expands to a require of the included document plus a reference to the document. The #%module-begin macro of scribble/doclang therefore relies on a PLT Scheme facility for forcing the expansion of sub-forms. Specifically, #%module-begin uses local-expand to expand each sub-form just far enough to determine whether it is an import form, definition form, or expression. If the sub-form is an import or definition, then #%module-begin suspends further work and lifts out the import or definition immediately; the import or definition can then supply bindings for further expansion of the module body. The need to suspend and continue lifting explains the () inserted in the body of a module by the scribble/doc reader; #%module-begin uses that position to track the sub-forms that have been expanded already to expressions.
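To make the shape of this expansion concrete, the following is a deliberately simplified sketch of a doclang-style module language (not Scribble's actual implementation): it binds an exported doc as described above, but omits the local-expand-based lifting of imports and definitions.

    #lang scheme/base
    (require (for-syntax scheme/base)
             scribble/decode)
    ;; Decode the entire module body into a single exported doc.
    (define-syntax (simple-module-begin stx)
      (syntax-case stx ()
        [(_ body ...)
         #'(#%plain-module-begin
            (provide doc)
            (define doc (decode (list body ...))))]))
    ;; A language module would export this as #%module-begin.
    (provide (rename-out [simple-module-begin #%module-begin]))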
Aside from (1) the ability to force the expansion of nested forms and (2) the ability of macros to expand into new imports, macro expansion of a module body is essentially the same as for libraries in the current Scheme standard (Sperber 2007). Where the standard allows choice in the separation of phases, we have chosen maximal separation in PLT Scheme, so that compilation and expansion are as consistent as possible (Flatt 2002). That is, bindings and module instantiations needed during the compilation of a module are kept separate from the bindings and instantiations needed when executing a module for rendering.

Furthermore, to support the connection between documentation and library bindings, PLT Scheme introduces a new phase that is orthogonal to compile time or run time: the label phase level. As noted in Section 3, a for-label import introduces bindings for documentation without triggering the execution of the imported module. In PLT Scheme, the same identifier can have different bindings in different phases. For example, when documenting the Intermediate Scheme pedagogical language, a document author would like uses of lambda to link to the lambda specification for Intermediate Scheme, while procedures used to implement the document itself will more likely use the full PLT Scheme language, which is a different lambda. The two different uses of lambda are kept straight naturally and automatically by separate bindings in separate phases.
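As a small illustration, the two documents just described would differ only in their for-label imports (a sketch; lang/htdp-intermediate names the Intermediate Scheme teaching language, following the lang/htdp-beginner import shown in Section 5):

    @(require (for-label lang/htdp-intermediate))
    @(require (for-label scheme/base))

With the first import, @scheme[lambda] links to Intermediate Scheme's lambda; with the second, it links to the full language's lambda.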
7. Core Scribble Datatypes

The doc binding that a Scribble module exports is a description of a document. Various tools, such as the scribble command-line program, can take this description of a document and render it to a specific format, such as LaTeX or HTML. In particular, Scribble defers detailed typesetting work to LaTeX or to HTML browsers, and Scribble's plug-in architecture accommodates new rendering back-ends. Scribble's documentation abstraction reflects a least-common denominator among such document formats. For example, Scribble has a baked-in notion of itemization, since LaTeX, HTML, and other document formats provide specific support to typeset itemizations. For many other layout tasks, such as formatting Scheme code, Scribble documents fall back to a generic “table” abstraction. Similarly, Scribble itself resolves most forms of cross-references and document dependencies, since different formats provide different levels of automatic support; tables of contents and indexes are mostly built within Scribble, instead of the back-end.

A Scribble document is a program that generates an instance of a part structure type. A part can represent a section or a book, and it can have sub-parts that represent sub-sections or chapters. This paper, for example, is generated by a Scribble document whose resulting part represents the whole paper, and it contains sub-parts for individual sections. The part produced by a Scheme document for a reference manual is rendered as a book, where the immediate sub-parts are chapters. Figure 2 summarizes the structure of a document under part in a UML-like diagram. When a field contains a list, the diagram shows a double arrow, and when a field contains a list of lists, the diagram shows a triple arrow. The dashed arrows call attention to delayed fields, which are explained below.

Each part has a flow that is typeset before its sub-parts (if any), and that represents the main content of a section. A flow is a list of blocks, where each block is one of the following:

• a paragraph, which contains a list of elements that are typeset inline with automatic line breaks;

• a table, which contains a list of rows, where each row is a list of flows, one per cell in the table;
• an itemization, which contains a list of flows, one per item;

• a blockquote, which contains a single flow that is typically typeset with more indentation than its surrounding flow; or

• a delayed-block, which eventually expands to another block, using information gathered elsewhere in the document. Accordingly, the block field of a delayed-block is not just a block, but a function that computes a block when given that other information. For example, a delayed-block is used to implement a table of contents.

Figure 2: Scribble's core document representation

A Scribble document can construct other kinds of blocks that are implemented in terms of the above built-in kinds. For example, a defproc block that describes a procedure is implemented in terms of a table. An element within a paragraph can be one of the following:

• a plain string;

• an instance of the element structure type, which wraps a list of elements with a typesetting style, such as 'bold, whose detailed interpretation depends on the back-end format;

• a target-element, which associates a cross-reference tag with a list of elements, and where the typeset elements are the target for cross-references using the tag;

• a link-element, which associates a cross-reference tag to a list of elements, where the tag designates a cross-reference from the elements to elsewhere in the document (which is rendered in HTML as a hyperlink from the elements);

• a delayed-element, which eventually expands to a list of elements. Like a delayed-block, it typically generates the elements using information gathered from elsewhere in the document. A delayed-element often generates a link-element after a suitable target for cross-referencing is located.

• a collect-element, which is the complement of delayed-element: it includes an immediate list of elements, but also a procedure to record information that might be used elsewhere in the document. A collect-element often includes a target-element, in which case its procedure might register the target's cross-reference tag for discovery by delayed-element instances.

• a few other element types that support more specialized tasks, such as communicating between phases and specifying tooltips.

A document as represented by a part instance is an immutable value. This value is transformed in several passes to eliminate delayed-block instances, delayed-element instances, and collect-element instances. The result is a simplified part instance and associated cross-reference information. Once the cross-reference information has been computed, it is saved for use when building other documents that have cross-references to this one. Finally, the part instance is consumed by a rendering back-end to produce the final document.

In the current implementation of Scribble, all documents are transformed in only two passes: a collect pass that collects information about the document (e.g., through collect-elements), and a resolve pass that turns delayed blocks and elements into normal elements. We could easily generalize to multiple passes, but so far, two passes have been sufficient within a single document. When multiple documents that refer to each other are built separately, these passes are iterated as explained in Section 9.

In some cases, the output of Scribble needs customization that is specific to a back-end. Users of Scribble provide the customization information by supplying a mapping from the contents of the style field in the various structures to the style's back-end rendering. For HTML output, a CSS fragment can extend or override the default Scribble style sheet. For LaTeX output, a ".tex" file can extend or redefine the default Scribble LaTeX commands.

8. Scribble's Extensibility

Scribble's foundation on PLT Scheme empowers programmers to implement a number of features as libraries that ordinarily must be built into a documentation tool. More importantly, users can experiment with new and more interesting ways to write documentation without having to modify Scribble's implementation. In this section, we describe several libraries that we have already built atop Scribble: for stand-alone API documentation, for automatically running examples when building documentation, for combining code with documentation in the style of JavaDoc, and for literate programming.

8.1 API Specification

Targets for code hyperlinks are defined by defproc (for functions), defform (for syntactic forms), defstruct (for structure types), defclass (for classes in the object system), and other such forms—one for each form of binding. When a library defines a new form of binding, an associated documentation library can define a new form for documenting the bindings. As we demonstrated in Section 3, the defproc form documents a function given its name, information about its arguments,
and a contract expression for its result. Information for each argument includes a contract expression, the keyword (if any) for the argument, and the default value (if any). For example, a louder function that consumes and produces a string might be documented as follows:

    @defproc[(louder [str string?]) string?]{
      Adds ‘‘!’’ to the end of @scheme[str].
    }

The description of the function refers back to the formal argument str using the scheme form. In the typeset result, the reference to str is typeset in a slanted font both in the function prototype and description:

    (louder str) → string?
      str : string?
    Adds “!” to the end of str.

As usual, lexical scope provides the connection between the formal-argument str and the reference. The defproc form expands to a combination of Scribble functions to construct a table representing the documentation and Scheme local-macro bindings to control the expansion and typesetting of the procedure description.

For the above defproc, the for-label binding of louder partly determines the library binding that is documented by this defproc form. A single binding, however, can be re-exported by many modules. On the reference side, the scheme and schemeblock forms follow re-export chains to discover the first exporting module for which a binding is documented; on the definition side, defproc needs a declaration of the module that is being documented. The module declaration is no extra burden on the document author, because the reader of the document needs some indication of which module is being documented. The defmodule form both generates the user-readable explanation of the module being documented and declares that all definitions within the enclosing section (and sub-sections, unless overridden) correspond to exports from the declared module. Thus, if louder is exported by the comics/string library, it is documented as follows:

    #lang scribble/doc
    @(require scribble/manual
              (for-label scheme/base
                         comics/string))

    @title{String Manipulations}

    @defmodule[comics/string]

    @defproc[(louder [str string?]) string?]{
      Adds ‘‘!’’ to the end of @scheme[str].
    }

The defproc form is implemented by a scribble/manual layer of Scribble, which provides many functions and forms for typesetting PLT Scheme documentation. The scribble/manual layer is separate from the core Scribble engine, however, and other libraries can build up defproc-like abstractions on top of the core typesetting and cross-referencing capabilities described in Section 7.

8.2 Examples and Tests

In the documentation for a function or syntactic form, concrete examples help a reader understand how a function works, but only if the examples are reliable. Ensuring that examples are correct is a significant burden in a conventional approach to documentation, because the example expressions must be carefully checked against the implementation (often by manual cut and paste), and a small edit can easily introduce a bug. The examples form of the scribble/eval library typesets an example along with its result using the style of a read-eval-print loop. For example,

    @examples[(/ 1 2) (/ 1 2.0) (/ 1 +inf.0)]

produces the output

    Examples:
    > (/ 1 2)
    1/2
    > (/ 1 2.0)
    0.5
    > (/ 1 +inf.0)
    0.0

Since building the documentation runs the examples every time, the typeset results are reliable. When an author makes a mistake, or when an implementation changes so that the documentation is out of sync, the example remains correct—though it may not reflect what the author intended. For example, if we misspell +inf.0 in the example, then the output is still accurate, though unhelpful in describing the behavior of division:

    Example:
    > (/ 1 +infinity.0)
    reference to undefined identifier: +infinity.0

To guard against such mistakes, an example expression can be wrapped with eval:check to combine it with an expected result:

    @examples[(eval:check (/ 1 +infinity.0) 0.0)]

Instead of typesetting an error message, this checked example will raise an exception when the document is built, because the expression does not produce the expected result 0.0. In this way, documentation source can serve partly as a test suite.

Evaluation of example code mingles two phases that we have otherwise worked to keep separate: the time at which a library is executed, and the time at which its documentation is produced. For simple functional expressions, such as (/ 1 2), the separation does not matter, and examples could simply duplicate its argument in both an expression position and a typeset position. More generally, however, examples involve temporary definitions and side-effects. To prevent examples from interfering with each other while building a large document, examples uses a sandboxed environment, for which PLT Scheme provides extensive support (Flatt et al. 1999; Flatt and PLT Scheme 2009, §13).
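Because examples is an ordinary Scribble form, it can also appear inside the body of a defproc entry; the following sketch combines the pieces introduced above (louder is the function documented in the previous subsection, and the sketch assumes an example evaluator in which louder is available):

    @defproc[(louder [str string?]) string?]{
      Adds ‘‘!’’ to the end of @scheme[str].
      @examples[(eval:check (louder "hi") "hi!")]
    }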
8.3 In-Code Documentation

For some libraries, the programmer may want to write documentation with the source instead of in a separate document. To support such documentation, we have created a Scheme/Scribble extension that is used to document some libraries in the PLT Scheme distribution.
Using this extension, the comics/string module could be implemented as follows:

    #lang at-exp scheme/base
    (require scheme/contract
             scribble/srcdoc)
    (require/doc scheme/base scribble/manual)

    (define (louder s)
      (string-append s "!"))

    (provide/doc
     [louder ([str string?] . -> . string?)
      @{Adds ‘‘!’’ to the end of @scheme[str].}])

The #lang at-exp scheme/base line declares that the module uses the scheme/base language extended with @-notation. The imported scribble/srcdoc library binds require/doc and provide/doc. The require/doc form imports bindings into a “documentation” phase, such as the scheme form that is used in the description of louder. The provide/doc form exports louder, annotates it with a contract for run-time checking, and records the contract and description for inclusion in documentation. The description is an expression in the documentation phase; it is dropped by normal compilation of the module, but combined with the require/doc imports and inferred (require (for-label ...)) imports to generate the module's documentation.

The documentation part of this module is extracted using include-extracted, which is provided by the scribble/extract module in cooperation with scribble/srcdoc. The extracted documentation might provide the entire text of the document directly, or it might be incorporated into a larger document:

    #lang scribble/doc
    @(require scribble/manual
              scribble/extract
              (for-label comics/string))

    @title{Strings}

    @defmodule[comics/string]

    The @schememodname[comics/string] library
    provides functions for creating punchlines.

    @include-extracted[comics/string]

An advantage of using scribble/srcdoc and scribble/extract is that the description of the function is with the implementation, and the function contract need not be duplicated in the source and documentation. Similarly, the fact that string? in the contract gets its binding from scheme/base is specified once in the code and inferred for the documentation. At the same time, a phase separation prevents document-generating expressions from polluting the library's run-time execution, and vice versa.

8.4 Literate Programming

The techniques used for in-source documentation extend to the creation of WEB-like literate programming tools. Figure 3 shows an example use of our literate-programming library; the left-hand side shows a screenshot of DrScheme editing the source code for a short, literate discussion of the Collatz conjecture, while the right-hand side shows the rendered output. Literate programs written with our library look like ordinary Scribble documents, except that they start with #lang scribble/lp and use chunk to introduce a piece of the implementation. A use of chunk consists of a name followed by definitions and/or expressions:

    @chunk[<name>
           ... definitions ...
           ... expressions ...]

The definitions and expressions in a chunk can refer to other chunks by their names. Unlike a normal Scribble program, running a scribble/lp program ignores the prose exposition and instead evaluates the program in the chunks. In literate programming terminology, this process is called tangling the program. Thus, to a client module, a literate program behaves just like its illiterate variant. The compiled form of a literate program does not contain any of the documentation, nor does it depend on the runtime support for Scribble, just as if an extra-linguistic tangler had been used. Consequently, the literate implementation suffers no overhead due to the prose. To recover the prose, the

    @lp-include[filename]

form extracts a literate view of the program from filename. In literate programming terminology, this process is called weaving the program. The right-hand side of Figure 3 shows the woven version of the code in the screenshot.

Both weaving and tangling with scribble/lp work at the level of syntactic extensions, and not in terms of manipulating source text. As a result, the language for writing prose is extensible, because Scribble libraries such as scribble/manual can be imported into the document. The language for implementing the program is also obviously extensible, because a chunk can include imports from other PLT Scheme libraries. Finally, even the bridge between the prose and the implementation is extensible, because the document author can create new syntactic forms that expand to a mixture of prose, implementation, and uses of chunk.

Tangling via syntactic extension also enables many tools for Scheme programs to automatically apply to literate Scheme programs. The arrows in Figure 3's screenshot demonstrate how DrScheme can draw arrows from chunk bindings to chunk references, and from the binding occurrence of an identifier to its bound occurrences, even across chunks. These latter arrows are particularly helpful with literate programs, where lexical scope is sometimes obscured by the way that textually disparate fragments of a program are eventually tangled into the same scope. DrScheme's interactive REPL, test-case coverage support, module browser, executable generation, and other tools also work on literate programs.

To gain some experience with non-trivial literate programming in Scribble, we have written a 34-page literate program that describes our implementation of the Chat Noir game, which is distributed with PLT Scheme. The source is included in the distribution as "chat-noir-literate.ss", and the rendered output is in the help system and online at http://docs.plt-scheme.org/games/chat-noir.html.
Figure 3: Literate programming example

9. Building and Installing Documentation

PLT Scheme documentation resides with the source code. The setup process that builds bytecode from Scheme source also renders HTML documentation from Scribble source. The HTML output is accompanied by cross-reference information that is used both for building more documentation when new libraries are installed and for online help in the programming environment. Although many existing PLT Scheme tools help in building documents, the process of generating HTML is significantly different from compilation tasks. The main difference is that cyclic dependencies are common in documentation, whereas library dependencies are strictly layered.
For example, the core language reference contains many pointers into the overview and a few pointers to the GUI library and other extensions; all documents, meanwhile, refer back to the core reference. Resolving mutual dependencies directly would require loading all documents into memory at once, which is impractical for the scale of the PLT Scheme documentation. The setup process therefore builds documents one at a time, reading and writing serialized cross-reference information until it arrives at a fixed point for all documents. A fixed point usually requires two iterations, so that all documents see the information collected from all other documents. A handful of documents require a third pass, because they contain section titles from other documents, where each section title is based on still other documents (e.g., by using an identifier whose typesetting depends on whether it is documented as a procedure or syntactic form).
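The iteration just described can be pictured as a small fixed-point loop over serialized cross-reference tables. The following is only an illustrative sketch; render-one and merge-xref are stand-ins for the setup tools, not Scribble's actual API:

    #lang scheme/base
    ;; Stand-ins: render one document against the current cross-reference
    ;; table and return the information it collects; merge two tables.
    (define (render-one doc xref) (hash))
    (define (merge-xref a b)
      (for/fold ([h a]) ([(k v) (in-hash b)]) (hash-set h k v)))
    ;; Rebuild every document until the cross-reference table stops changing.
    (define (build-all docs)
      (let loop ([xref (hash)])
        (define new-xref
          (for/fold ([acc xref]) ([d (in-list docs)])
            (merge-xref acc (render-one d acc))))
        (if (equal? new-xref xref)
            xref
            (loop new-xref))))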
Another challenge in building a unified set of documentation is that individual users might augment the main installation with user-specific libraries. The main installation includes a table of contents that is the default starting point for reading documentation, and this table is updated when a new package is placed into the main installation. When a user-specific library is installed, in contrast, its document is built so that hyperlink references go into the main installation's documentation, and a user-specific table of contents is created. When a user opens the documentation via DrScheme's Help menu, a user-specific table of contents is opened (if it exists).

Instead of explicitly installing a library, a user can implicitly install a package from the PLaneT repository (Matthews 2006) by using a library reference of the form (planet ....). When a library is installed in this way, its documentation is installed as the library is compiled. PLaneT supports library versioning, and multiple versions of a package can be installed at a time. In that case, multiple versions of the documentation are installed; document references work with versions just as reliably as library references, since they use the same underlying module-import mechanisms to precisely identify the origin of a binding.
10. Experience

Scribble is part of the PLT Scheme distribution as of version 4.0, which was released in June 2008, and all PLT Scheme documentation is created with Scribble. Developing Scribble, porting old PLT Scheme documentation, and writing new documentation took about a year, but the @ notation evolved earlier through years of experimentation. The documentation at http://docs.plt-scheme.org/ is built nightly by Scribble from a snapshot of the PLT Scheme source repository. The same documentation is available in PDF form at http://pre.plt-scheme.org/docs/pdf/. At the time of this writing, the 70 PDF files of current documentation total 3778 pages in a relatively sparse format, which we estimate would fit in around 1000 pages if compressed into a conference-style, two-column layout. This total includes documentation only for libraries that are bundled with PLT Scheme; additional libraries for download via PLaneT are also documented using Scribble.

PLT Scheme documentation was previously written in LaTeX and converted to HTML via tex2page (Sitaram 2007). Although tex2page was a dramatic improvement over our original use of latex2html, the build process relied on layers of fragile LaTeX macros, HTML hacks, and pre- and post-processing scripts, which made the documentation all but impossible to build except by its authors. Consequently, most library documentation used a plain-text format that was easier to write but inconsistent in style and difficult to index. The documentation index distinguished identifier names from general terms, but it did not attach a source module to each identifier name, so online help relied on textual search.

The Scribble-based documentation system is accessible to all PLT Scheme users, who write their own documentation using Scribble and often supply patches to the main document sources. Scribble produces output that is more consistent and easier to navigate than the old documentation, and the resulting documentation works better with online help. More importantly, the smooth path from API documentation to stand-alone documents has let us produce much more tutorial and overview documentation, helping users find their way through the volumes of available information.

11. Related Work

As noted in the introduction, most existing documentation tools fall into one of three categories: LaTeX-like tools, JavaDoc-like tools, and WEB-like tools. The LaTeX category includes general word-processing tools like Microsoft Word, but LaTeX offers the crucial advantage of programmability, where macros enable automatic formatting of API details. Systems like Skribe (Gallesio and Serrano 2005) improve on LaTeX by offering a sane programming language. Even in a programmable documentation language, however, a lack of connection to source code means that information is duplicated in documentation and source, and binding and evaluation rules inherent to the source language are not automatically reflected in documentation and in examples related to those bindings.

The JavaDoc category includes perldoc for Perl, RDoc for Ruby, Haddock (Marlow 2002) for Haskell, OCamlDoc (Leroy 2007), Doxygen (van Heesch 2007) for various languages (including Java, C++, C#, and Fortran), and many others. Such tools improve on the LaTeX category in that they provide a closer connection to the programs that they document. In particular, they are specifically designed for library API documentation, where they shine in automatic extraction of API details from the source code. These tools are not suitable for other kinds of stand-alone documents, such as overview documents, tutorials, and papers (like this one), where prose and document structuring are more central than API details.

Literate programming tools such as WEB (Knuth 1984) and noweb (Ramsey 1994) are designed for documenting the implementation of a library as much as the API that a library exports. In a sense, these tools are an extreme version of the JavaDoc category, where the information communicated to a reader is drawn from both the prose and the executable source. In doing so, unfortunately, the tools typically revert to a textual slice-and-dice of the program and prose sources, instead of a programmable layer that spans the two halves. Simonis and Weiss (2003) provide a more complete overview of existing systems and add ProgDoc, which is similar to noweb in the way that it uses a pipeline of tools. Scribble builds on many ideas from these predecessors, but fits them into an extensible framework backed by an expressive programming language.

Skribe (categorized above in the LaTeX group) is by far the system most closely related to Scribble. Like Scribble, Skribe builds on Scheme to construct representations of documents using Scheme functions and macros, and it uses an extension of Scheme syntax to make it more suitable for working with literal text. (Skribe uses square brackets to quote strings, and within square brackets, a comma followed by an open parenthesis escapes back into Scheme.) Skribe's format-independent document structure and its use of passes to render a document influenced the design of Scribble. Skribe, however, lacks an integration with lexical binding and the module system that is the heart of Scribble. For example, a scheme form that typesets and links an identifier in a lexically sensitive way is not possible to implement in Skribe without building a PLT Scheme-style module and macro layer on top of Skribe.

Scribble builds on a long line of work in Lisp-style language extensibility, including traditional Lisp macros, lexically scoped
macros in Scheme (Dybvig et al. 1993), and readtable-based syntactic extension as in Common Lisp. Phase-sensitive binding through for-label is specific to PLT Scheme, as is the disciplined approach to reader extension embodied by #lang.

The SLaTeX (Sitaram 2007) system provides automatic formatting of Scheme code within a LaTeX document. To identify syntactic forms and constants, SLaTeX relies on defkeyword and defconstant declarations. In this mode, the author of a work in progress must constantly add another “standard” binding to SLaTeX's list; SLaTeX's built-in table of syntactic forms is small compared to the number of syntactic forms available in PLT Scheme. More generally, the problem is the usual one for “standards”: there are many to choose from. Scribble solves this problem with for-label imports and by directly using the namespace-management functionality of PLT Scheme modules.

Many systems follow the Lisp tradition of docstrings, in which documentation is associated to run-time values and used for online help. Python supports docstrings, and its doctest module even extracts and executes examples as tests, analogous to Scribble's examples form. Scribble supports a docstring-like connection between run-time bindings and documentation, but using lexical-binding information instead of the value associated with a binding. For example, (help cons) in PLT Scheme's read-eval-print loop opens documentation for cons based on its binding as imported from scheme/base, and not based on the procedure obtained by evaluating cons.

Smalltalk programming environments (Kay 1993) have always encouraged programmers to use the source (with its comments) as documentation, and environments like Eclipse and Visual Studio now make code navigation similarly convenient for other languages. Such tools do not supplant the need for external documentation, however, such as guides and tutorials.

In terms of surface syntax, many documentation systems build on either S-expression notation (or its cousin XML) as a way to encode both document structure and program structure. Such representations are especially appropriate for an intermediate representation of documentation, as in DocBook (Walsh and Muellner 2008). S-expression encodings of documentation are especially common in Lisp projects, where data and code are mingled easily.

12. Conclusion

A documentation language should be designed not by piling escape conventions on top of a comment syntax, but by removing the weaknesses and restrictions of the programming language that make a separate documentation language appear necessary. Scribble demonstrates that a small number of rules for forming documentation, with no restrictions on how they are composed, suffice to form a practical and efficient documentation language that is flexible enough to support the major documentation paradigms in use today.

— Clinger's introduction to the RnRS standards, adapted for Scribble

Our design for Scribble relies on a thread of language-extension work that starts in Lisp macros, runs through Scheme's introduction of lexically scoped syntax, and continues with PLT Scheme innovations on modules, phases, and an open syntax expander. Meanwhile, LaTeX and Skribe demonstrate the advantages of building a document system on top of a programming language, and tools like JavaDoc demonstrate the power of leveraging the information available in a program's source to automate and link documentation about the program. Scribble combines all of these threads for the first time, producing a tool (or library, or language, depending on how you look at it) that spans and integrates document categories. We are aware of no programming system besides PLT Scheme that is distributed with tutorials, programmer guides, and detailed API documentation, all extensively and precisely cross-referenced. We also know of no other system that makes it so easy for third parties to add new documentation of all kinds with the same level of integration, to say nothing of being able to extend the documentation system itself.

Trying Scribble

To install an HTML version of this paper where Scheme and Scribble identifiers are hyperlinked to their documentation, first install PLT Scheme 4.1.5 or later from http://plt-scheme.org/. Then, start DrScheme, enter the program

    #lang scheme
    (require (planet mflatt/scribble-paper))

and click Run. Running the program installs the paper and directs your default browser to the starting page. To view the document source, click Check Syntax and then right-click on mflatt/scribble-paper to open its source.

Acknowledgements: We would like to thank Matthias Felleisen and the anonymous reviewers for helpful feedback on this paper. This work is supported in part by the NSF.

Bibliography

R. Kent Dybvig, Robert Hieb, and Carl Bruggeman. Syntactic Abstraction in Scheme. Lisp and Symbolic Computation 5(4), pp. 295–326, 1993.

Matthew Flatt. Compilable and Composable Macros, You Want it When? In Proc. ACM Intl. Conf. Functional Programming, pp. 72–83, 2002.

Matthew Flatt, Robert Bruce Findler, Shriram Krishnamurthi, and Matthias Felleisen. Programming Languages as Operating Systems (or Revenge of the Son of the Lisp Machine). In Proc. ACM Intl. Conf. Functional Programming, pp. 138–147, 1999.

Matthew Flatt and PLT Scheme. Reference: PLT Scheme. PLT Scheme Inc., PLT-TR2009-reference-v4.2, 2009.

Erick Gallesio and Manuel Serrano. Skribe: a Functional Authoring Language. J. Functional Programming 15(5), pp. 751–770, 2005.

Alan C. Kay. The Early History of Smalltalk. ACM SIGPLAN Notices 28(3), 1993.

Donald E. Knuth. Literate Programming. Computer Journal 27(2), pp. 97–111, 1984.

Xavier Leroy. The Objective Caml System, release 3.10. 2007.

Simon Marlow. Haddock, a Haskell Documentation Tool. In Proc. ACM Wksp. Haskell, pp. 78–89, 2002.

Jacob Matthews. Component Deployment with PLaneT: You Want it Where? In Proc. Wksp. Scheme and Functional Programming, 2006.

Norman Ramsey. Literate Programming Simplified. IEEE Software 11(5), pp. 97–105, 1994.

Volker Simonis and Roland Weiss. ProgDOC — A New Program Documentation System. In Proc. Perspectives of System Informatics, Lecture Notes in Computer Science volume 2890, pp. 438–449, 2003.

Dorai Sitaram. TeX2page. 2007. http://www.ccs.neu.edu/home/dorai/tex2page/tex2page-doc.html

Dorai Sitaram. How to Use SLaTeX. 2007. http://www.ccs.neu.edu/home/dorai/slatex/

Michael Sperber (Ed.). The Revised^6 Report on the Algorithmic Language Scheme. 2007.

Norman Walsh and Leonard Muellner. DocBook: The Definitive Guide. O'Reilly & Associates, Inc., 2008.

Dimitri van Heesch. Doxygen Source Code Documentation Generator Tool. 2007. http://www.stack.nl/~dimitri/doxygen/
Lambda, the Ultimate TA: Using a Proof Assistant to Teach Programming Language Foundations

Benjamin C. Pierce
University of Pennsylvania
http://www.cis.upenn.edu/∼bcpierce
Abstract

Ambitious experiments using proof assistants for programming language research and teaching are all the rage. In this talk, I’ll report on one now underway at the University of Pennsylvania and several other sites: a one-semester graduate course in the theory of programming languages presented entirely—every lecture, every homework assignment—in Coq. I’ll try to give a sense of what the course is like for both instructors and students, describe some of the most interesting challenges in developing it, and explain why I now believe such machine-assisted courses are the way of the future.

Categories and Subject Descriptors K.3.2 [Computer and Information Science Education]

General Terms Languages, Theory

Keywords Programming languages, pedagogy, proof assistants
A Universe of Binding and Computation

Daniel R. Licata∗ and Robert Harper∗
Carnegie Mellon University
{drl,rwh}@cs.cmu.edu
Abstract

We construct a logical framework supporting datatypes that mix binding and computation, implemented as a universe in the dependently typed programming language Agda 2. We represent binding pronominally, using well-scoped de Bruijn indices, so that types can be used to reason about the scoping of variables. We equip our universe with datatype-generic implementations of weakening, substitution, exchange, contraction, and subordination-based strengthening, so that programmers need not reimplement these operations for each individual language they define. In our mixed, pronominal setting, weakening and substitution hold only under some conditions on types, but we show that these conditions can be discharged automatically in many cases. Finally, we program a variety of standard difficult test cases from the literature, such as normalization-by-evaluation for the untyped λ-calculus, demonstrating that we can express detailed invariants about variable usage in a program's type while still writing clean and clear code.

Categories and Subject Descriptors F.3.3 [Logics and Meanings Of Programs]: Studies of Program Constructs—Type structure

General Terms Languages, Verification

1. Introduction

There has been a great deal of research on programming languages for computing with binding and scope (bound variables, α-equivalence, capture-avoiding substitution). These languages are useful for a variety of tasks, such as implementing domain-specific languages and formalizing the metatheory of programming languages. Functional programming with binding and scope involves two different notions of function: functions-as-data and functions-as-computation. Functions-as-data, used to represent abstract syntax with variable binding, have an intensional, syntactic character, in the sense that they can be inspected in ways other than function application. For example, many algorithms that process abstract syntax recur under binders, treating variables symbolically. On the other hand, functions-as-computation, the usual functions of functional programming, have an extensional character—a function from A to B is a black box that, when given an A, delivers a B. A function-as-data determines a function-as-computation by substitution (plugging a value in for a variable), but not every function-as-computation determines a function-as-data, because the syntax appropriate for a particular problem may not allow the expression of every black box.

In previous work (Licata et al., 2008), we began to study a programming language that provides support for both functions-as-data and functions-as-computation as two different types. Our framework provides one type constructor ⇒ for functions-as-data, used to represent variable binding, and another type constructor ⊃ for functions-as-computation, used for functional programming. This permits representations that mix the two function spaces. As a simple example of such integration, consider a syntax for arithmetic expressions constructed out of (1) variables, (2) numeric constants, (3) let binding, and (4) arbitrary binary primitive operations, represented by functions-as-computation of type nat ⊃ nat ⊃ nat. In SML, we would represent this syntax with the following datatype:

    datatype arith = Var of var
                   | Num of nat
                   | Letbind of arith * (var * arith)
                   | Binop of arith * (nat -> nat -> nat) * arith

We use ML functions(-as-computation) to represent the primops. However, because SML provides no support for functions-as-data, we must represent variable binding explicitly (with a type var), and code notions such as α-equivalence and substitution ourselves. In contrast, our framework naturally supports mixed datatypes such as this one. We specify it by the following constructors:

    num     : arith ⇐ nat
    letbind : arith ⇐ arith ⊗ (arith ⇒ arith)
    binop   : arith ⇐ arith ⊗ (nat ⊃ nat ⊃ nat) ⊗ arith

The symbol ⇐ is used for datatype constructors, which have the form D ⇐ A, for a datatype name D and a type A. We use ⇒ (functions-as-data) to represent the body of the letbind, and ⊃ (functions-as-computation) to represent the primops. Our framework takes a pronominal approach to the variables introduced by functions-as-data: variables are thought of as pronouns that refer to a designated binding site, and thus are intrinsically scoped. This is in contrast to the nominal approach taken by languages such as FreshML (Pitts and Gabbay, 2000; Pottier, 2007; Shinwell et al., 2003), where variables are thought of as nouns—they are pieces of data that exist independently of any scope. The pronominal approach inherently requires some notion of context to be present in the language's type system, so that variables have something to refer to; we write ⟨Ψ⟩ A as the classifier of a program of type A with variables Ψ. The practical advantage of these contextual types is that they permit programmers to express useful invariants about variable-manipulating code using the type system, such as the fact that a λ-calculus evaluator maps closed terms to closed terms.
∗ This research was sponsored in part by the National Science Foundation under grant number CCF-0702381 and by the Pradeep Sindhu Computer Science Fellowship. The views and conclusions contained in this document are those of the author and should not be interpreted as representing the official policies, either expressed or implied, of any sponsoring institution, the U.S. government or any other entity.
exist. As a rough rule of thumb, one can weaken with types that do not appear to the left of a computational arrow in the type being weakened, and similarly for substitution. Our framework implements the structural properties generically but conditionally, providing programmers with the structural properties “for free” in many cases. This preserves one of the key benefits of working in LF, where weakening and substitution are always defined. In our previous work (Licata et al., 2008), we investigated the logical foundations of a pronominal approach to mixing binding and computation. In the present paper, we give an implementation of (a slight variant of) our framework, and we demonstrate the viability of our approach by programming some standard difficult test cases from the literature. For example, we implement normalization-by-evaluation (Berger and Schwichtenberg, 1991; Martin-Löf, 1975) for the untyped λ-calculus, an example considered in FreshML by Shinwell et al. (2003). Our version of this algorithm makes essential use of a datatype mixing binding and computation, and our type system verifies that evaluation maps closed terms to closed terms. Rather than implementing a new language from scratch, we construct our type theory as a universe in Agda 2 (Norell, 2007), a dependently typed functional programming language that provides good support for programming with inductive families, in the style of Epigram (McBride and McKinna, 2004). This means that we (a) give a syntax for the types of our type theory and (b) give a function mapping the types of our language to certain Agda types; the programs of our language are then the Agda programs of those types. This implementation strategy allows us to reuse the considerable implementation effort that has gone into Agda, and to exploit generic programming within dependently typed programming (Altenkirch and McBride, 2003) to implement the structural properties; additionally, it permits programs written using our framework to interact with existing Agda code. Also, our development provides a successful example of prototyping a new language with an interesting type system using a dependently typed programming language. In our Agda implementation, we have chosen to represent variable binding using well-scoped de Bruijn indices. In summary, we make the following technical contributions: (1) We show that our previous type theory for integrating binding and computation can be implemented as a universe in Agda. The types of the universe permit concise, “point-free” descriptions of contextual types: a type in the universe acts as a function from contexts to Agda types. (2) We implement a variety of structural properties for the universe, including weakening, substitution, exchange, contraction, and subordination-based strengthening (Virga, 1999), all using a single generic map function for datatypes that mix binding and computation. (3) We define the structural properties’ preconditions computationally, so that our framework can discharge these conditions automatically in many cases. This gives the programmer free access to weakening, substitution, etc. (when they hold). (4) We program a variety of examples, and demonstrate that we can express detailed invariants about variable usage in a program’s type while still writing clean and clear code. In this paper, we consider only a simply-typed universe, for writing ML-like programs that manipulate binding in a well-scoped manner; we leave dependent types to future work. 
Also, the companion code for this paper (see http://www.cs.cmu.edu/~drl/) is written in “Agda minus termination checking,” as many of our examples require non-termination; we discuss which parts of our code pass the termination checker below. The remainder of this paper is organized as follows: In Section 2, we introduce our language and its semantics in Agda. In Section 3, we present examples. In Section 4, we discuss the structural properties. In Sections 5 and 6, we discuss related work and conclude. Appendix A contains a brief introduction to Agda.
In a pronominal setting, the interaction of functions-as-data and functions-as-computation has interesting consequences for the structural properties of variables, such as weakening (introducing a new variable that does not occur) and substitution (plugging a value in for a variable). For example, one might expect that it would be possible to weaken a value of type A to a function-as-data of type D ⇒ A. However, this is not necessarily possible when A itself is a computational function type: Contextual computational functions of type ⟨Ψ⟩ A ⊃ B are essentially interpreted as functions from ⟨Ψ⟩ A to ⟨Ψ⟩ B, and ⟨Ψ⟩ D ⇒ A classifies values of type A in an extended context Ψ, x : D. Now, suppose we are given a function f of type ⟨Ψ⟩ A ⊃ B; we try to construct a function of type ⟨Ψ⟩ D ⇒ (A ⊃ B). This requires a function from ⟨Ψ, x : D⟩ A to ⟨Ψ, x : D⟩ B. Since f is a black box, we can only hope to achieve this by pre- and post-composing with appropriate functions. The post-composition must take ⟨Ψ⟩ B to ⟨Ψ, x : D⟩ B, which would be a recursive application of weakening. However, the pre-composition has a contravariant flip: we require a strengthening function from ⟨Ψ, x : D⟩ A to ⟨Ψ⟩ A in order to call f—and such a strengthening function does not in general exist, because the value of type A might be that very variable x. Similarly, substitution of terms for variables is not necessarily possible, because substitution requires weakening. Put differently, computational functions permit the expression of side conditions that inspect the context, which causes the structural properties to fail. As a concrete example, consider computational functions of type ⟨·⟩ (arith ⊃ arith), which are defined by case-analysis over closed arithmetic expressions, giving cases for constants and binops and let-binding—but not for variables, because there are no variables in the empty context. Weakening such a function to type arith ⇒ (arith ⊃ arith) enlarges its domain, asking it to handle cases that it does not cover.

What are we to do about this interaction of binding and computation? One option is to work in a less general setting, where it does not come up. For example, in nominal languages such as FreshML (Shinwell et al., 2003), the type of names is kept open-ended (it is considered to have infinitely many inhabitants). Thus, any computational function on syntax with binding must account for arbitrarily many names, and is therefore weakenable. However, many functions on syntax are only defined for certain classes of contexts (e.g., only closed arithmetic expressions can be evaluated to a numeral), and the nominal approach does not allow these invariants to be expressed in a program's type (though they can be reasoned about externally using a specification logic (Pottier, 2007)). Alternatively, in languages based on the LF logical framework, such as Twelf (Pfenning and Schürmann, 1999), Delphin (Poswolsky and Schürmann, 2008), and Beluga (Pientka, 2008), the structural properties always hold, because computational functions cannot be used in LF representations of logical systems.

In our framework, we take a more general approach, which requires admitting that weakening and substitution may not always be defined. Thus, we should be more careful with terminology, and say that the type D ⇒ A classifies values of type A with a free variable of type D. In some cases, D ⇒ A determines a function given by substitution, but in some cases it does not.
In this sense, our approach is similar to representations of binding using wellscoped de Bruijn indices (Altenkirch and Reus, 1999; Bellegarde and Hook, 1994; Bird and Paterson, 1999), which are pronominal, because variables are represented by pointers into a context, but make no commitment to weakening and substitution. However, our framework improves upon such representations by observing that weakening and substitution are in fact definable generically, not for every type D ⇒ A, but under certain conditions on the types D and A. For example, returning to our failed attempt to weaken A ⊃ B above, if variables of type D could never appear in terms of type A, then the required strengthening operation would
2. Language Definition

2.1 Types

The grammar for the types of our language is as follows:

  Defined atoms  D ::= . . .
  Var. Types     C ::= (a subset of D)
  Contexts       Ψ ::= [] | (Ψ, C)
  Types          A ::= 0+ | 1+ | A ⊗ B | A ⊕ B | list A | A ⊃ B
                     | D+ D | C# | Ψ ⇒* A | 2A
                     | ∀c ψ.A | ∃c ψ.A | ∀≤C A | ∃≤C A

The language is parametrized by a class of defined atoms D, which are the names of datatypes. A subset of these names are variable types, which are allowed to appear in contexts. This distinguishes certain types C which may be populated by variables from other types D which may not. This definition of VarType permits only variables of base type, rather than the full language of higher-order rules that we considered in previous work (Licata et al., 2008). Contexts are lists of variable types, written with 'cons' on the right.

The types on the first line have their usual meaning. The type D+ D is the datatype named by D. Following Delphin (Poswolsky and Schürmann, 2008), we include a type C# classifying only the variables of type C. The type Ψ ⇒* A classifies inhabitants of A in the current context extended with Ψ. The type 2A classifies closed inhabitants of A. The types ∀c and ∃c classify universal and existential context quantification; ∀≤C A and ∃≤C A provide bounded quantification over contexts containing only the type C.

A rule, which is the type of a datatype constructor, pairs the defined atom being constructed with a single premise type (no/multiple premises can be encoded using 1+ and ⊗):

  data Rule : Set where
    _⇐_ : DefAtom -> Type -> Rule

We will make use of a few derived forms:
• We write (∀⇒ A) for (∀c \Ψ -> Ψ ⇒* A), and similarly for ∃⇒ (note that \ x -> e introduces an anonymous function). This type quantifies over a context Ψ and immediately binds it around A. Similarly, we write [ Ψ ]* A for 2 (Ψ ⇒* A).
• We write (C ⇒ A) for ⇒* with a single premise.
• We write (C +) for (D+ C) when C is a variable type.
• We write bool for 1+ ⊕ 1+ and A option for A ⊕ 1+.

2.1.1 Agda implementation
We now represent these types in Agda. Those readers who are not fluent in dependent programming can find a review of Agda syntax, well-scoped de Bruijn indices, and universes in Appendix A. We represent defined atoms, variable types and contexts as follows:
DefAtom = DefinedAtoms.Atom

data VarType : Set where
  . : (D : DefAtom) {_ : Check (DefinedAtoms.world D)} -> VarType

Vars = List VarType

DefinedAtoms.Atom is a parameter that we will instantiate later. DefinedAtoms.world returns true when D is allowed to appear in the context; Check turns this boolean into a proposition (Check True is the unit type; Check False is the empty type; see Appendix A for an introduction). A VarType is thus a pair of an atom along with the credentials allowing it to appear in contexts. We represent the syntax of types in Agda as follows:

data Type : Set where
  -- types that have their usual meaning
  1+    : Type
  _⊗_   : Type -> Type -> Type
  0+    : Type
  _⊕_   : Type -> Type -> Type
  list_ : Type -> Type
  _⊃_   : Type -> Type -> Type
  -- datatypes and context manipulation
  D+   : DefAtom -> Type
  _#   : VarType -> Type
  _⇒*_ : Vars -> Type -> Type
  2    : Type -> Type
  ∀c   : (Vars -> Type) -> Type
  ∃c   : (Vars -> Type) -> Type
  ∀≤   : VarType -> Type -> Type
  ∃≤   : VarType -> Type -> Type

The only subtlety in this definition is that we represent the bodies of ∀c and ∃c by computational functions in Agda. This choice has some trade-offs: on the one hand, it means that the bodies of quantifiers can be specified by any Agda computation (e.g. by recursion over the domain). On the other hand, it makes it difficult to analyze the syntax of Types, because there is no way to inspect the body of the quantifier. Indeed, this caused problems for our implementation of the structural properties, which we solved by adding certain instances of the quantifiers (∀⇒ and ∃⇒, discussed below), which would otherwise be derived forms, as separate constructors. In future work, we may pursue a more syntactic treatment of the quantifiers (which would of course be easier if we had good support for variable binding. . . ).

2.2 Semantics

A universe is specified by an inductive datatype of codes for types, along with a function mapping each code to a Set. In this case, the Types above are the codes, and the semantics is specified in Figure 1 by a function < Ψ > A, mapping a context and a Type to an Agda Set. The first six cases interpret the basic types of the simply-typed λ-calculus as their Agda counterparts, pushing the context inside to the recursive calls. The next two cases interpret datatypes. We define an auxiliary datatype called Data which represents all of the data types defined in the universe. Data is indexed by a context and a defined atom, with the idea that the Agda set Data Ψ D represents the values of datatype D in context Ψ. For example, the values of Data Ψ arith will represent the arithmetic expressions defined by the signature given in the introduction. There are two ways to construct a datatype: (1) apply a datatype constructor to an argument and (2) choose a variable from Ψ. Constants are declared in a signature, represented with a predicate on rules InΣ : Rule -> Set, where InΣ R is inhabited when the rule R is in the signature. The first constructor, written as infix ·, pairs a constant with the interpretation of the constant's premise. The second constructor, ., injects a variable from Ψ into Data.1 See the appendix for the definition of the type ∈, which represents well-scoped de Bruijn indices (Altenkirch and Reus, 1999; Bellegarde and Hook, 1994; Bird and Paterson, 1999). A DefAtom D is in the context if there exist credentials c for which the VarType formed by (. D {c}) is in the list Ψ. Finally, we provide a collection of types that deal with the context: Ψ ⇒* A extends the context (we write + for append); 2 A clears the context. The quantifiers ∀c and ∃c are interpreted as the corresponding Agda dependent function and pair types. Finally,
1 Agda allows overloading of datatype constructors between different types, and we tend to use . for injections from one type to another, as with VarType above.
the types ∀≤ D A and ∃≤ D A quantify over contexts Ψ' for which AllEq Ψ' D holds. The type AllEq says that every variable type in Ψ is equal to the given type D (List.all is true when its argument is true on all elements of the list; eqVarType is a boolean-valued equality function for variable types). (We could internalize AllEq Ψ' D as a type alleq D—given meaning by < Ψ > (alleq D) = AllEq Ψ D—in which case the bounded quantifier could be expressed as a derived form, but we have not needed alleq D in a positive position in the examples we have coded so far.)

An Agda datatype is strictly positive if it does not appear to the left of any Agda function types (->) in its own definition; this positivity condition ensures that the user does not define general recursive types (e.g. µD.D → D), which can be used to inhabit any type and to write non-terminating code. The above type Data does not pass the positivity checker: it is defined mutually with the semantic function <_>_, which occurs to the left of an Agda function type in the meaning of ⊃. In this paper, we wish to program with general recursive types, so we will ignore this failure of positivity checking. An interesting direction for future work would be to consider a total variant of our framework, which admits only strictly positive types. This would require a more refined explanation of the construction of the defined atoms in the universe, e.g. using containers (Abbott et al., 2005), because the positivity of a defined atom D depends on the rules for D in the signature InΣ.

We also define versions of 2 and ∀⇒ that construct Agda Sets, so that we do not need to write < [] > 2 A and so on as the Agda type of a term. (We intentionally use a very similar notation for these; to a first approximation, one can read our examples without keeping this distinction in mind.)

mutual
  data Data (Ψ : Vars) (D : DefAtom) : Set where
    _·_ : {A : Type} -> InΣ (D ⇐ A) -> < Ψ > A -> Data Ψ D
    .   : {c : _} -> (. D {c}) ∈ Ψ -> Data Ψ D

  <_>_ : Vars -> Type -> Set
  -- basic types
  < Ψ > 1+       = Unit
  < Ψ > 0+       = Void
  < Ψ > (A ⊗ B)  = (< Ψ > A) × (< Ψ > B)
  < Ψ > (A ⊕ B)  = Either (< Ψ > A) (< Ψ > B)
  < Ψ > (list A) = List (< Ψ > A)
  < Ψ > (A ⊃ B)  = (< Ψ > A) -> (< Ψ > B)
  -- data types
  < Ψ > (D+ D)   = Data Ψ D
  < Ψ > (D #)    = D ∈ Ψ
  -- context manipulation
  < Ψ > (Ψnew ⇒* A) = < Ψ + Ψnew > A
  < _ > (2 A)    = < [] > A
  < Ψ > (∃c τ)   = Σ \ Ψ' -> < Ψ > (τ Ψ')
  < Ψ > (∀c τ)   = (Ψ' : Vars) -> < Ψ > (τ Ψ')
  < Ψ > (∀≤ D A) = (Ψ' : Vars) -> AllEq Ψ' D -> < Ψ + Ψ' > A
  < Ψ > (∃≤ D A) = Σ \ (Ψ' : Vars) -> AllEq Ψ' D × < Ψ + Ψ' > A

AllEq : Vars -> VarType -> Set
AllEq Ψ D = Check (List.all (eqVarType D) Ψ)

Figure 1. Semantics

subst : (A : Type) {D : VarType} {_ : Check (canSubst (un. Cut) A)}
        -> (∀⇒ (D ⇒ A) ⊃ (D +) ⊃ A)

weaken : (A : Type) {D : VarType} {_ : Check (canWeaken (un. D) A)}
         -> (∀⇒ A ⊃ (D ⇒ A))

strengthen : (A : Type) {D : VarType} {_ : Check (canStrengthen (un. D) A)}
             -> ∀⇒ (D ⇒ A) ⊃ A

exchange2 : (A : Type) {D1 D2 : VarType}
            -> (∀⇒ (D2 ⇒ D1 ⇒ A) ⊃ (D1 ⇒ D2 ⇒ A))

contract2 : (A : Type) {D : VarType}
            -> (∀⇒ (D ⇒ D ⇒ A) ⊃ (D ⇒ A))

weaken*/bounded : (A : Type) (Ψ : Vars) {D : VarType} -> (AllEq Ψ D)
                  -> {canw : Check (canWeaken (un. D) A)}
                  -> (∀⇒ A ⊃ (Ψ ⇒* A))

Figure 2. Type signatures of structural properties

2.3 Structural Properties

In Figure 2, we present the type signatures for the structural properties; this is the interface that users of our framework see. For example, the type of substitution should be read as follows: for any A and D, if the conditions for substitution hold, then there is a function of type (∀⇒ (D ⇒ A) ⊃ (D +) ⊃ A) (for any context, given a term of type A with a free variable, and something of type D + to plug in, there is a term of type A without the free variable). Weakening coerces a term of type A to a term with an extra free variable; strengthening does the reverse; exchange swaps two variables; contraction substitutes a variable for a variable. We also include an n-ary version of weakening for use with the bounded quantifier: if A can be weakened with D, then A can be weakened with a whole context comprised entirely of occurrences of D. We discuss the meaning of the conditions (canSubst, etc.) below; in all of our examples, they will be discharged automatically by our implementation.
2_ : Type -> Set
2_ A = < [] > A

∀⇒_ : Type -> Set
∀⇒_ A = (Ψ : Vars) -> < Ψ > A

3. Examples

In this section, we illustrate programming in our framework, adapting a number of examples that have been considered in the literature (Pientka, 2008; Poswolsky and Schürmann, 2008; Shinwell et al., 2003). Throughout this section, we compare the examples coded in our framework with how they are/might be represented in Twelf, Delphin, Beluga, and FreshML. We endeavor to keep these comparisons objective, focusing on what invariants of the code are expressed, and what auxiliary functions the programmer needs to define. Aside from Twelf, we are not expert users of these other systems, and we welcome feedback from those who are. Several additional examples are available in the companion Agda code, including a translation from λ-terms to combinators, a type checker for simply-typed λ-calculus terms, an evaluator for λ-calculus with mutable references (using variables to represent locations), and an alternate version of normalization-by-evaluation, which has simpler types at the expense of slightly more-complicated code. To use our framework, we give a type DefAtom representing the necessary datatype names, along with a datatype

data InΣ : Rule -> Set where

defining the datatype constructors.
We use the following naming convention: Defined atoms are given names that end in A; e.g., for the signature for arithmetic expressions given in the introduction, we will define natA and arithA. For each type of variables, we define the name ending in C to be the corresponding atom injected into VarType:
checker, which recognizes certain substitution instances as smaller. We have not yet investigated how to explain this induction principle to Agda. 3.2
arithC = . arithA
We define each undecorated name to be the Type constructed by applying D+ to the corresponding atom; e.g.:

nat   = D+ natA
arith = D+ arithA
lam     : InΣ (expA ⇐ (expC ⇒ exp))
app     : InΣ (expA ⇐ exp ⊗ exp)
closure : InΣ (closA ⇐ (∃⇒ (expC ⇒ exp) ⊗ (expC # ⊃ 2 clos)))

3.1 Evaluating Arithmetic Expressions
We define a signature for the arithmetic example mentioned above:

zero    : InΣ (natA ⇐ 1+)
succ    : InΣ (natA ⇐ nat)
num     : InΣ (arithA ⇐ nat)
letbind : InΣ (arithA ⇐ arith ⊗ (arithC ⇒ arith))
binop   : InΣ (arithA ⇐ arith ⊗ (nat ⊃ nat ⊃ nat) ⊗ arith)

Natural numbers are specified by zero and successor. Arithmetic expressions are given as a mixed datatype, with ⇒ used to represent the body of the letbind and ⊃ used to represent primops. Next, we define an evaluation function that reduces an expression to a number:

eval : (arith ⊃ nat)
eval (num · n) = n
eval (letbind · (e1 , e2)) = eval (subst arith _ e2 e1)
eval (binop · (e1 , f , e2)) = f (eval e1) (eval e2)
eval (. ())

Closure-based Evaluator

Next, we implement a closure-based evaluator for the untyped λ-calculus. λ-terms and closures are represented by types exp and clos as follows:
Evaluation maps closed arithmetic expressions to natural numbers (the type expression (arith ⊃ nat) reduces to the Agda function type Data [] arithA → Data [] natA). Constants evaluate to themselves; binops are evaluated by applying their code to the values of the arguments; let-binding is evaluated by substituting the expression e1 into the letbind’s body e22 and then evaluating the result. A simple variation would be to evaluate e1 first and then substitute its value into e2. The final clause covers the case for variables with a refutation pattern: there are no variables in the empty context.
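The variation mentioned above, evaluating e1 before substituting its value, could look roughly as follows (a sketch, not the companion code; evalv is a hypothetical name, and the computed numeral is injected back into arith with num before substitution):

evalv : (arith ⊃ nat)
evalv (num · n) = n
evalv (letbind · (e1 , e2)) = evalv (subst arith _ e2 (num · (evalv e1)))
evalv (binop · (e1 , f , e2)) = f (evalv e1) (evalv e2)
evalv (. ())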
Expressions are defined by the usual signature, as in LF. The type of closures, clos, is a recursive type with one constructor closure. The premise of closure should be read as follows: a closure is constructed from a triple (Ψ , e , σ), where (1) Ψ is an existentially quantified context; (2) e is an expression in Ψ with an extra free variable, which represents the body of a λ-abstraction; and (3) σ is a substitution of of closed closures for all the variables in Ψ. We represent a substitution as a function that maps each expression variable in the context (classified by the type expC #) to a closure. The type of the premise provides a succinct description of all of this: ∃⇒ introduces the variables in the existentially quantified context into scope without explicitly naming the context; ⇒ extends the context with an additional variable; (expC #) ranges over all of the variables in scope. For comparison, in Ψ this type reduces to the Agda type Σ \(Ψ’ : Vars) -> (Data (Ψ + Ψ’ „ expC) expA) × (expC ∈ (Ψ + Ψ’) -> Data [] closA)
(where we write „ for cons on the right). In this example, unlike the above evaluator for closed arithmetic expressions, we recur over open expressions, so eval is quantified over an unknown context Ψ using ∀⇒. Evaluation takes two further arguments: (1) an expression with free variables in Ψ, and (2) an environment, represented by a function that yields a closed closure for each expression variable in Ψ; eval returns a closed closure.

env : Type
env = (expC #) ⊃ 2 clos
Comparison. This example provides a nice illustration of the benefits of our approach: Substitution is provided “for free” by the framework, which infers that it is permissible to substitute for arithC variables in arith. The type system enforces the invariant that evaluation produces a closed natural number. It is not possible to define the type arith in Twelf/Delphin/Beluga, as LF representations cannot use computational functions. One could program this example in FreshML, but it would be necessary to implement substitution directly for arith, as FreshML does not provide a generic substitution operation. Agda checks that eval’s pattern matching is exhaustive. However, Agda is not able to verify the termination of this function, as it recurs on a substitution-instance of one of the inputs. Setting aside the computational functions in binop, it would be possible to get the call-by-value version of this code to pass Twelf’s termination 2 The arith argument to subst is the type A in the D ⇒ A argument to substitution; Agda’s type reconstruction procedure requires this annotation. The underscore is the context argument instantiating the ∀⇒ in the type of subst; this could be eliminated by adding an implicit context quantifier (whose meaning is { Ψ : Vars } -> ...) to the universe. The credentials for performing substitution are marked as an implicit argument, so there is no evidence of it visible in the call to subst.
eval : (∀⇒ exp ⊃ env ⊃ 2 clos)
eval Ψ (. x) σ = σ x
eval Ψ (lam · e) σ = closure · (Ψ , e , σ)
eval Ψ (app · (e1 , e2)) σ with eval Ψ e1 σ
... | closure · (Ψ' , e' , σ') =
      eval (Ψ' „ expC) e' (extend{(2 clos)} _ σ' (eval Ψ e2 σ))
... | . x = impossible x
A variable is evaluated by applying the substitution. A lam evaluates to the obvious closure. To evaluate an application, we first evaluate the function position. To a first approximation, the reader may think of Agda’s with syntax as a case statement in the body of the clause, with each branch marked by ... |. Case-analyzing the evaluation of e1 gives two cases: (1) the value is constructed by the constructor closure; (2) the value is a variable. In the first case, we evaluate the body of the closure in an extended environment. The call to the function extend extends the environment σ’ so that the last variable is mapped to the value of e2. The definition of extend is as follows: extend : {A : Type} {D : VarType} -> (∀⇒ (D # ⊃ A) ⊃ A ⊃ (D ⇒ D #) ⊃ A) extend Ψ σ new i0 = new extend Ψ σ new (iS i) = σ i
At the call site of extend, we must explicitly supply the type A (in this case 2 clos) to help out type reconstruction. The underscore stands for the instantiation of the ∀⇒ , which is marked as an explicit argument, but can in this case be inferred. The second case is contradicted using the function impossible, which refutes the existence of a variable at a non-VarType—which clos is, because we never wish to have clos variables. The context argument Ψ to eval does not play an interesting role in the code, but Agda’s type reconstruction requires us to supply it explicitly at each recursive call. In future work, we may consider whether this argument can be inferred. Agda is unable to verify the termination of this evaluator for the untyped λ-calculus, as one would hope. When writing this code, one mistake a programmer might make is to evaluate the body of the closure in σ instead of σ’, which would give dynamic scope. If we make this mistake, Agda highlights the occurrence of σ and helpfully reports the type error that Ψ’ != Ψ, indicating that the context of the expression does not match the context of the substitution.
The type of size expresses that it returns a closed natural number. For comparison, we implement a second version that does not make this invariant explicit:

size' : (∀≤ expC (exp ⊃ nat))
size' Ψ bound (. x) = succ · (zero · _)
size' Ψ bound (app · (e1 , e2)) =
  succ · (plus' Ψ bound (size' Ψ bound e1) (size' Ψ bound e2))
  where plus' : (∀≤ expC (nat ⊃ nat ⊃ nat))
        plus' Ψ b = weaken*/bounded (nat ⊃ nat ⊃ nat) Ψ b [] plus
size' Ψ bound (lam · e) = strengthen nat _ (size' (Ψ „ expC) bound e)
Comparison. In Twelf, one cannot represent substitutions σ using computational functions, because these are not available for use in LF encodings. However, because the domain of the substitution is finite, a first-order representation of substitutions could be used. Additionally, Twelf does not provide the 2 and ∃⇒ connectives that we use here to describe the contexts of closures. While it should be possible for the programmer to express the necessary context invariants using explicit contexts (Crary, 2008), this is a fairly heavy encoding technique. Because of these two limitations, the resulting Twelf code would be more complicated than the above. One would hope for better Delphin and Beluga implementations than a port of the Twelf code, but Delphin lacks existential context quantification and 2, and Beluga lacks the parameter type exp #, so our definition of clos cannot be straightforwardly ported to either of these languages.3 One could implement this example in FreshML (Shinwell et al., 2003), but the type system would not enforce the invariant that closures are in fact closed. To our knowledge, a proof of this property for this example has not been attempted in Pure FreshML (Pottier, 2007), though we know of no reason why it would not be possible.
Without the 2, size must return a number in context Ψ: in the application case, we must weaken plus into Ψ, and in the lam case we must strengthen the extra expC variable out of the recursive call. Strengthening expression variables from natural numbers is permitted by our implementation of the structural properties because natural numbers cannot mention expressions; we use a subordinationlike analysis to determine this (Virga, 1999). To ensure that these weakenings and strengthenings are permitted, we type size’ with a bounded quantifier over exp. Comparison. The first version is similar to what one writes in FreshML, except in that setting there is no need to pass around a context Ψ. In the second version, the strengthening of the recursive result in the lam case is analogous to the need, in FreshML 2000 (Pitts and Gabbay, 2000), to observe that nat is pure (always has empty support); FreshML (Shinwell et al., 2003) does not require this. In Beluga, one can express either the first or second versions. In Twelf and Delphin, one can only express the second variation, as these languages do not provide 2. However, the Twelf/Delphin/Beluga syntax for weakening and strengthening is terser than what we have been able to construct in Agda: weakening is handled by world subsumption and is not marked in the proof term; strengthening is marked by pattern-matching the result of the recursive call and marking those variables that do occur, which in this case does not include the expression variable. For example, the lam case of size in Twelf looks like this:
3.3
- : size (lam ([x] E x)) (succ N) exp) in Twelf, we plan to consider a derived induction principle that covers this case in future work.
Agda successfully termination-checks these functions.

3 Beluga provides a built-in type of substitutions, written [Ψ']Ψ, so one might hope to represent closures as ∃ψ.([ψ, x : exp]exp) × [.]ψ; however, the second component of this pair associates an expression with each expression variable in ψ, whereas, in this example, we need to associate a closure with each expression variable in ψ.
Comparison. Pattern-matching on variables is represented using higher-order metavariables in Twelf/Delphin/Beluga and using equality tests on names in FreshML. The exchange needed in the lam case is written as a substitution in the Twelf/Delphin/Beluga version of this clause. In Twelf one would write:
The metavariable F:exp is bound outside the scope of x, and thus stands only for terms that do not mention x. (To allow it to mention x, we would bind F:exp -> exp and write (F x) in place of F.) Unfortunately, Agda does not provide this sort of pattern matching for our encoding—pattern variables are always in the scope of all enclosing local binders—so we must explicitly call a strengthening function that checks whether the variable occurs:
- : cnt ([x] lam ([y] E x y)) N

contract-η : ∀⇒ exp ⊃ exp option
contract-η Ψ (lam · (app · (f , . i0))) = strengthen? Ψ f
contract-η Ψ _ = Inr

We conjecture that strengthen? could be implemented datatype-generically for all purely positive types (no ⊃ or ∀c or ∀≤)—it is not possible to decide whether a variable occurs in the values of these computational types (cf. FreshML, where it is not possible to decide whether a name is in the support of a function). This strengthening function is not an instance of the generic map that we define below, as it changes the type of the term (exp to exp option); in future work, we plan to consider a more general traversal that admits this operation.

remove : (∀⇒ (D ⇒ list (D #)) ⊃ list (D #))
remove Ψ [] = []
remove Ψ (i0 :: ns) = remove Ψ ns
remove Ψ ((iS i) :: ns) = i :: (remove Ψ ns)

fvs : (∀⇒ exp ⊃ list (expC #))
fvs Ψ (. x) = [ x ]
fvs Ψ (lam · e) = remove Ψ (fvs (Ψ „ expC) e)
fvs Ψ (app · (e1 , e2)) = (fvs Ψ e1) ++ (fvs Ψ e2)
3.4
In the lam case, we use the helper function remove to remove the lam-bound variable from the recursive result. The function remove takes a list of variables, itself with a distinguished free variable, and produces a list of variables without the distinguished variable. If the programmer were to make a mistake in the second clause by accidentally including i0 in the result, he would get a type error. Agda successfully termination-checks this example. Comparison. For comparison with FreshML (Shinwell et al., 2003), the type given to remove here is analogous to their Figure 6:
([[A arrow B]] in Ψ) = ∀Ψ0. ([[A]] in Ψ,Ψ0) ⊃ ([[B]] in Ψ,Ψ0)

That is, the meaning of A arrow B in Ψ is a function that, for any future extension of the context, maps the meaning of A in that extension to the meaning of B in that extension. In our type theory, we represent (a simply-typed version of) this logical relation as a datatype sem. The datatype constructor corresponding to the above clause would have the following type:
remove : ( (name list)) -> name list
where τ is a nominal abstractor. The authors comment that they prefer the version of remove in their Figure 5: remove : name -> (name list) -> name list
where the name to be removed is specified by the first argument, rather than using a binder. Using dependent types, we can type this second version of remove as follows:
sem ⇐ (∀⇒ sem ⊃ sem)
However, for the argument to go through, we must ensure that the context extension Ψ’ consists only of variables of a specific type neu, so we use a bounded context quantifier below. We represent the semantics by the datatypes neu and sem in Figure 3. The type neu (neutral terms) consists of variables or neutral terms applied to semantic arguments (napp); these are the standard neutral proofs in natural deduction. A sem (semantic term) is either a neutral term or a semantic function. A semantic function of type (∀≤ neuC (sem ⊃ sem)) is a computational function that works in any extension of the context consisting entirely of neu variables. We define reification first, via two mutually recursive functions, reifyn (for neutral terms) and reify (for semantic terms). It is typical in logical relations arguments to use two independent contexts, one for the syntax and one for the semantics. Thus, we
remove : (Ψ : Vars) (i : exp ∈ Ψ) -> List (exp ∈ Ψ) -> List (exp ∈ (Ψ - i))
where Ψ - i removes the indicated element from the list. This type is of course expressible in Agda, but we have not yet integrated dependent types into our universe.
Normalization by Evaluation
In Figure 3, we present a serious example mixing binding and computation, β-normalization-by-evaluation for the untyped λcalculus. NBE works by giving the syntax a semantics in terms of computational functions (evaluation) and then reading back a normal form (reification). The NBE algorithm is similar to a Kripke logical relations argument, where one defines a type- and context-indexed family of relations [[A]] in Ψ. The key clause of this definition is:
η-Contraction
In Twelf/Delphin/Beluga, one can recognize η-redices by writing a meta-variable that is not applied to all enclosing locally bound variables. E.g. in Twelf one would write - : contract (lam [x] app F x) F.
napp : InΣ (neuA ⇐ neu ⊗ sem)
neut : InΣ (semA ⇐ neu)
slam : InΣ (semA ⇐ (∀≤ neuC (sem ⊃ sem)))
mantic context is extended with a new neu variable. We instantiate the semantic function ϕ, which anticipates extensions of the context, with this one-variable extension ([ x ] constructs a singleton list), and apply it to the variable. The library function extendv2v makes the "parallel" extension of a var2var in the obvious way, mapping the one new variable to the other:
extendv2v : {D1 D2 : VarType} -> (Ψs : Vars) ->
            ∀⇒ (var2var D1 Ψs D2) ⊃ D2 ⇒ (var2var D1 (D1 :: Ψs) D2)
extendv2v Ψs Ψe σ (i0) = i0
extendv2v Ψs Ψe σ (iS i) = iS (σ i)

reifyn : ∀⇒ ∀c \ Ψs -> (var2var neuC Ψs expC) ⊃ [ Ψs ]* neu ⊃ exp
reifyn Ψe Ψs σ (. x) = . (σ x)
reifyn Ψe Ψs σ (napp · (n , s)) = app · (reifyn Ψe Ψs σ n , reify Ψe Ψs σ s)

reify : ∀⇒ ∀c \ Ψs -> (var2var neuC Ψs expC) ⊃ [ Ψs ]* sem ⊃ exp
reify Ψe Ψs σ (slam · ϕ) =
  lam · reify (Ψe „ expC) (Ψs „ neuC) (extendv2v Ψs Ψe σ) (ϕ [ neuC ] _ (neut · (. i0)))
reify Ψe Ψs σ (neut · n) = reifyn Ψe Ψs σ n
reify Ψe Ψs σ (. x) = impossible x

appsem : ∀⇒ sem ⊃ sem ⊃ sem
appsem _ (slam · ϕ) s2 = ϕ [] _ s2
appsem _ (neut · n) s2 = neut · (napp · (n , s2))
appsem _ (. x) _ = impossible x

evalenv : Vars -> Type
evalenv Ψs = (expC #) ⊃ ([ Ψs ]* sem)

eval : ∀⇒ ∀c \ Ψs -> evalenv Ψs ⊃ exp ⊃ ([ Ψs ]* sem)
eval Ψe Ψs σ (. x) = σ x
eval Ψe Ψs σ (app · (e1 , e2)) = appsem Ψs (eval Ψe Ψs σ e1) (eval Ψe Ψs σ e2)
eval Ψe Ψs σ (lam · e) = slam · ϕ
  where ϕ : < Ψs > (∀≤ neuC (sem ⊃ sem))
        ϕ Ψ' ctxinv s' = eval (Ψe „ expC) (Ψs + Ψ') σ' e
          where σ' : < Ψe > (expC ⇒ (evalenv (Ψs + Ψ')))
                σ' i0 = s'
                σ' (iS i) = weaken*/bounded sem Ψ' ctxinv Ψs (σ i)
Figure 3. Normalization by evaluation parametrize these functions by two contexts, one consisting for neu variables for the semantics, and the other consisting of exp variables for the syntax. We will write Ψs for the former and Ψe for the latter. In the type of reify, we must name one of these contexts, because each context scopes over two disconnected parts of the type. We choose to name the semantic context and let the expression context be the ambient one. The outer ∀⇒ thus binds the expression context, whereas we use the full binding form ∀c for the semantic context. The type of reify then says that, under some condition expressed by the type var2var, reify maps semantics in the semantic context (recall that [ Ψ ]* A stands for 2 (Ψ ⇒* A); lexically, [ Ψ ]* A binds more tightly than ⊃) to expressions (in the ambient expression context). The type var2var C1 Ψ1 C2 means that every variable of type C1 in Ψ1 maps to a variable of type C2 in the ambient context. It is defined in a library as follows:
The neutral-to-semantic coercion is reified recursively, and we disallow sem variables from the context. To define evaluation, we first define an auxiliary function appsem that applies one semantic term to another. This requires a case-analysis of the function term: when it is an slam (i.e. the application is a β-redex), we apply the embedded computational function, choosing the nil context extension, and letting the argument be s2. When the function term is neutral, we make a longer neutral term. The type of eval is symmetric to reify, except the environment that we carry along in the induction maps expression variables to semantic terms rather than just variables. The type evalenv Ψs means that every expression variable in the ambient context is mapped to a semantic value in Ψs. A variable is evaluated by looking it up; an application is evaluated by combining the recursive results with semantic application. A lam is evaluated to an slam whose body ϕ has the type indicated in the figure. When given a context extension Ψ’ and an argument s’ in that extension, ϕ evaluates the original body e in an extended substitution. The new substitution σ’ maps the λ-bound variable i0 to the provided semantic value, and defers to σ on all other variables. However, σ provides values in Ψs, which must be weakened into the extension Ψ’. Fortunately, the bounded quantifier provides sufficient evidence to show that weakening can be performed in this case, because sem’s can be weakened with neu variables. Normalization is defined by composing evaluation and reification. We define a normalizer for closed λ-terms as follows: emptyv2v : emptyevalenv :
(var2var neuC [] expC) (evalenv [])
norm : (exp ⊃ exp) norm e = reify [] [] emptyv2v (eval [] [] emptyevalenv e)
Our type system has verified the scope-correctness of this code, proving that it maps closed terms to closed terms. Amusingly, Agda accepts the termination of this evaluator for the untyped λcalculus, provided that we have told it to ignore its issues with our universe itself—a nice illustration of the need for the positivity check on datatypes. Our companion code includes an alternate version of NBE, which has simpler types (it does not maintain separate contexts Ψe for expressions and Ψs for semantics) at the expense of more-complicated code (various appeals to weakening and strengthening are necessary). Comparison. The type sem is a truly mixed datatype: the premise (∀≤ neuC (sem ⊃ sem)) uses both ⇒ and ⊃ (recall that there is a ⇒ buried in the definition of ∀≤). Because it uses ⊃ in a recursive datatype, it is not representable in LF. Because it uses ⇒, it would not even be representable in Delphin/Beluga extended with standard recursive types (that did not interact with the LF part of the language). Despite the fact that our implementation enforces strong invariants about the scope of variables, the code is essentially as simple as the FreshML version described by Shinwell et al. (2003), aside from the need to pass the contexts Ψe and Ψs along. Invariants about variable scoping can be proved in Pure FreshML (Pot-
var2var : VarType -> Vars -> VarType -> Type var2var C1 Ψ1 C2 = ([ Ψ1 ]* (C1 #)) ⊃ (C2 #)
Even though reify is given a precise type describing the scoping of variables, its code is as simple as one could want. To reify neutral terms: The reification of a variable is the variable given in the substitution. The reification of an application is the application of the reifications. To reify semantic terms: The reification of a function (slam · ϕ) is the λ-abstraction of the reification of an instance of ϕ. In the recursive call, the expression context is extended with a new exp variable (which is bound by the lam) and the se-
map : (A : Type) {Ψ Ψ' : Vars} -> (Co A Ψ Ψ') -> < Ψ > A -> < Ψ' > A
map (D+ Dat) co (. x) = (snd (compat co) x)
map (Dat #) co x = ((compat co) x)
map (A ⊃ B) co e =
  \ y -> (map B (snd (compat co)) (e (map' A (fst (compat co)) y)))
map (Ψ0 ⇒* A) co e = map A (compat co) e
map (list A) co [] = []
map (list A) co (x :: xs) = map A (compat co) x :: map (list A) co xs
map (D+ Dat) co (_·_ {A} c e) = c · map A (fst (compat co) c) e
map (2 A) co e = e
-- ... more cases
tier, 2007), but we would like to enforce these invariants within a type system, not using an external specification logic. Relative to a direct implementation in Agda, our framework provides the weakening function needed in the final case of eval for free.
4. Structural Properties
The structural properties are implemented by instantiating a generic traversal for < Ψ > A. The generic traversal has the following type:

map : (A : Type) {Ψ Ψ' : Vars} -> (Co A Ψ Ψ') -> < Ψ > A -> < Ψ' > A
This should be read as follows: for every A Ψ Ψ’, under the condition Co A Ψ Ψ’, there is a map from terms of type A in Ψ to terms of type A in Ψ’. Co : Type -> Vars -> Vars -> Set is a variable relation, a type-indexed family of relations between two contexts. Co is in fact a (module-level) parameter to the generic map; it must provide (1) a variable or term in Ψ’ for each variable in Ψ that the traversal runs into; and (2) enough information to keep the traversal going inductively. We will instantiate Co with a specific relation for each traversal; e.g., for weakening with a variable of type D, Co will relate Ψ to (Ψ „ D) under appropriate conditions on D and A. For expository purposes, we present a slightly simplified version of the traversal first; the generalization is described with weakening below. 4.1
Figure 4. Map (excerpt) types ensure that the traversal will only find certain variables, and thus that only those variables need realizations. Compatibility ensures that Co provides enough information for Contra to process the contravariant positions to the left of a computational arrow. Additionally, it permits conditional traversals: below, we will instantiate Co so that it is uninhabited for certain A. 4.2
Map
Suppose that Co and Contra are compatible, and assume a function map’ : (A : Type) {Ψ Ψ’ : Vars} -> (Contra A Ψ Ψ’) -> < Ψ > A -> < Ψ’ > A
Compatibility
We ensure that Co provides the two pieces of information mentioned above using the notion of compatibility. Suppose that Co and Contra are variable relations. We say that Co and Contra are compatible iff there is a term
that is the equivalent of map for the Contravariant positions. Then we implement map in Figure 4. In the first and second cases, the compatibility of Co induces the map on variables that we need. In the third case, we pre-compose the function with map’ and post-compose with map. In all other cases, map simply commutes with constructors, or stops early if it hits a boxed term.
compat : ({A : Type} {Ψ Ψ’ : Vars} -> Co A Ψ Ψ’ -> Compat A Ψ Ψ’)
4.3
where Compat is defined as follows:
Exchange/Contraction
Exchange and contraction are implemented by one instantiation of map. In this case, we take
Compat : Type -> Vars -> Vars -> Set
Compat (D #) Ψ Ψ' = (D ∈ Ψ) -> (D ∈ Ψ')
Compat (D+ D) Ψ Ψ' =
  ({A : Type} -> (c : InΣ (D ⇐ A)) -> Co A Ψ Ψ')
  × ({ch : _} -> (. D {ch}) ∈ Ψ -> < Ψ' > D+ D)
Compat (A ⊃ B) Ψ Ψ' = Contra A Ψ' Ψ × Co B Ψ Ψ'
Compat (Ψ0 ⇒* A) Ψ Ψ' = Co A (Ψ + Ψ0) (Ψ' + Ψ0)
Compat (list A) Ψ Ψ' = Co A Ψ Ψ'
Compat (2 A) Ψ Ψ' = Unit
-- ...
Co A Ψ Ψ’ = Contra A Ψ Ψ’ = (Ψ ⊆ Ψ’ × Ψ’ ⊆ Ψ)
where ⊆ means every variable in one context is in the other. It is simple to show that these relations are compatible, because Co (a) provides the required action on variables directly and (b) ignores its type argument, so the compatibility cases for the type constructors are easy. Exchange is defined by instantiating the generic map with Co, where map' is taken to be map itself, which works because Co = Contra.
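One way the ⊆ relation could be realized (a sketch, assuming it is the usual pointwise mapping of variables between contexts; the companion code may phrase it differently):

_⊆_ : Vars -> Vars -> Set
Ψ ⊆ Ψ' = {D : VarType} -> D ∈ Ψ -> D ∈ Ψ'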
Compat imposes certain conditions on Co and Contra. For example, for variable types D #, it says that Co (D #) Ψ Ψ’ induces a map from variables of type D in Ψ to variables in Ψ’. For defined atoms D+ D, Compat says that Co (D+ D) Ψ Ψ’ induces a map from variables in Ψ to terms in Ψ’, and that Co A Ψ Ψ’ holds for every premise A of every constant inhabiting D. In all other cases, Compat provides enough information to keep the induction going in map below. This amounts to insisting that Co (or Contra) holds on the subexpressions of a type in all appropriate contexts. For example, the condition for Ψ0 ⇒* A is that Co holds for A in the contexts extended with Ψ0. In the usual monadic traversals of syntax (Altenkirch and Reus, 1999), Co _ Ψ Ψ’ is taken to be (D : VarType) -> D ∈ Ψ -> < Ψ’ > D—i.e. a realization of every variable in Ψ as a term in Ψ’. In our setting, this does not suffice to define a traversal, because (1) it does not provide for the contravariant flip necessary to process the domains of computational functions and (2) it does not allow us to express a conditional traversal, where conditions on the
4.4 Strengthening
Next, we define a traversal that strengthens away variables that, based on type information, cannot possibly occur. The invariant for strengthening is the following:4

Co : Type -> Vars -> Vars -> Set
Co A Ψ Ψ' = Σ \(D : VarType) -> Σ \(i : D ∈ Ψ) ->
            Check (irrel (un. D) A) × Id Ψ' (Ψ - i)
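The context Ψ - i used in this invariant removes the strengthened variable from Ψ. A minimal sketch of such an operation (hypothetical; the companion code may define it differently), assuming the ∈ type from Appendix A:

_-_ : (Ψ : Vars) {D : VarType} -> D ∈ Ψ -> Vars
(_ :: Ψ) - i0 = Ψ
(D' :: Ψ) - (iS i) = D' :: (Ψ - i)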
Here i, a pointer into the initial context Ψ, is the variable to be strengthened away; the propositional equality constraint represented by the Identity says that the final context Ψ' is the initial context with i removed. The type Check(irrel (un. D) A)

4 For concision, we suppress some details arising from the implementation of irrel, which takes a visited list as an extra argument; see the companion code for details.
types bound by A, so that the term being plugged in for the variable can be weakened as substitution goes under binders. Substitution takes about 220 lines to define and prove compatible.
computes to Unit when strengthening is possible, and Void when it is not. Here un. simply peels off the injection of a defined atom into a VarType. The crucial property of irrel is that Check(irrel (un. D) (D+ D)) computes to Void. This forbids strengthening a variable of type D out of a term of type D. This is necessary because we cannot satisfy the usual compatibility condition for (D+ D), which would require mapping all variables—including the variable-to-bestrengthened i—to a term of type D that does not mention i. More generally, Check(irrel (un. D) A) means that variables of type D can never be used to construct terms of type A, which ensures that strengthening never runs into variables of the type being strengthened. The function irrel D A is defined by traversing the graph structure of types (i.e., it unrolls the definitions of defined atoms) and checks not (DefinedAtoms.eq D Dat) for each defined atom Dat it finds. To account for contravariance, we must define strengthening simultaneously with weakening by irrelevant assumptions, which is similar. About 250 lines of Agda code shows that these two relations together are compatible. Their traversals are then defined by instantiating map twice, mutually recursively—each is passed to the other as map’ for the contravariant recursive calls. 4.5
5.
We have provided comparisons with several other systems throughout the paper: Relative to LF-based systems such as Twelf (Pfenning and Schürmann, 1999), Delphin (Poswolsky and Schürmann, 2008), and Beluga (Pientka, 2008), our framework permits definitions that mix binding and computation; this is essential for defining the datatype sem in the NBE example. Relative to FreshML (Pottier, 2007; Shinwell et al., 2003), our framework enforces invariants about variable scoping in the type system. Such invariants can be proved in Pure FreshML (Pottier, 2007), but we would like to enforce these invariants within a type system, not using an external specification logic. Aydemir et al. (2008) provide a nice overview of various techniques that are used to implement variable binding, including named, de Bruijn, {locally / globally} {named /nameless}, and weak higher-order abstract syntax (Bucalo et al., 2006; Despeyroux et al., 1995). More recently, Chlipala (2008) has advocated the use of parametric higher-order abstract syntax. We have chosen well-scoped de Bruijn indices (Altenkirch and Reus, 1999; Bellegarde and Hook, 1994; Bird and Paterson, 1999) for our Agda implementation, a simple representation that makes the pronoun structure of variables explicit. It would be interesting to investigate whether any benefits can be obtained by implementing our universe with a different representation. Relative to these techniques for representing binding, the advantage of our framework is that it provides datatype-generic implementations of the structural properties, including substitution. Both the Hybrid frameworks (Ambler et al., 2002; Capretta and Felty, 2007; Momigliano et al., 2007), Hickey et al. (2006)’s work, and Lambda Tamer (Chlipala, 2007) describe languages or tools for specifying data with binding, providing generic implementations of the structural properties. However, to the best of our knowledge, these logical frameworks do not make the computational functions of the meta-language available for use in the framework (except inasmuch as they are, in some cases, used to represent binding itself). In contrast, our universe includes both ⇒ and ⊃. In this work, we have created a universe of contextual types in Agda. Contextual types appear in Miller and Tiu’s work (Miller and Tiu, 2003), as well as in contextual modal type theory (Nanevski et al., 2007). Miller and Tiu’s self-dual ∇ connective is closely related to ⇒, also capturing the notion of a scoped constant. However, the ∇ proof theory adopts a logic-programming-based distinction between propositions and types, and ∇ binds a scoped term constant in a proposition. In our setting, ⇒ allows the meaning of certain propositions (defined atoms) to vary. Fiore et al. (1999) and Hofmann (1999) give semantic accounts of variable binding. In a sense, the present paper gives a semantics for our type theory, where binding is represented by an indexed inductive definition. However, this semantics does not shed any new light on the datatype-generic definition of the structural properties; it would be interesting to explore a semantic characterization of the conditions under which weakening and substitution are definable.
Weakening
In addition to weakening by irrelevant types (e.g. weakening a nat with an exp), we can weaken by types that do not appear to the left of a computational arrow (e.g., weakening an exp with an exp). For a simple version of weakening, the variable relation is similar to strengthening, but uses a different computed condition, and flips the role of Ψ and Ψ' (now Ψ' is bigger):

Co : Type -> Vars -> Vars -> Set
Co A Ψ Ψ' = Σ \(D : VarType) -> Σ \(i : D ∈ Ψ') ->
            Check (canWeaken (un. D) A) × Id Ψ (Ψ' - i)
The function canWeaken is a different graph traversal than before: this time, we check irrel (un. D) A for the left-hand side of each computational arrow A ⊃ B. Weakening can then be defined using strengthening in contravariant positions, as irrel is exactly the condition that strengthening requires. This suffices for a simple version of weakening. However, we can be more clever, and observe that types of the form ∀⇒ A are always weakenable, because their proofs are explicitly parametrized over arbitrary extensions of the context. Similarly, ∀≤ C A is weakenable with any context composed entirely of C’s. Capitalizing on this observation requires a slight generalization of the traversal described above: computationally, weakening ∀⇒ A does not recursively traverse the proof of A, like map usually does, but stops the traversal and instantiates the context quantifier appropriately. Thus, our actual implementation of map is parametrized so that, for each type A, either it is given sufficient information to transform A directly (a function < Ψ > A -> < Ψ’ > A), or it has enough information to continue recursively, as in the compatibility conditions described above. We use the former only for weakening the quantifiers (map < Ψ - i > (∀⇒ A) to < Ψ > (∀⇒ A)). We refer the reader to our Agda code for details. All told, weakening takes about 210 lines of Agda code to define and prove compatible. 4.6
Related Work
6.
Substitution
Conclusion
In this paper, we have constructed a logical framework supporting datatypes that mix binding and computation: Our framework is implemented as a universe in the dependently typed programming language Agda. Binding is represented in a pronominal manner, so the type system can be used to reason about the scoping of variables. Our implementation provides datatype-generic implementations of
Substitution is similar to weakening and strengthening. Its invariant has the same form, using a condition canSubst (un. D) A. This condition ensures two things: (1) that D is irrelevant to the lefthand-sides of any computational arrow, so that substitution can be defined using weakening-with-irrelevant-assumptions in the contravariant position, and (2) that D is weakenable with all variable
e.g., we do not explicitly apply append to the Set argument A. Agda attempts to infer implicit function arguments and reports an error if they cannot be reconstructed. Indexed datatypes are defined using a notation similar to GADTs in GHC. For example, we define a datatype ∈ representing indices into a list:
the structural properties (weakening, subordination-based strengthening, exchange, contraction, and substitution). We have used the framework to program a number of examples, including a scopecorrect version of the normalization-by-evaluation challenge problem discussed by Shinwell et al. (2003). We believe that these examples demonstrate the viability of our approach for simply-typed programming. We hope also to have clarified the gap between LF-based systems for programming with binding, such as Twelf, Delphin, and Beluga, and a generic dependently typed programming language like Agda. For simply-typed programming, the benefits of the LFbased systems that we were unable to mimic include: (1) the ability to write pronominal variables with a named syntax; and (2) a convenient syntax for applying the structural properties. For example, the syntax of weakening and strengthening is relatively heavy in our setting. In Twelf, weakening is silent, and strengthening (including strengthen? used in the η-contraction example) is marked by saying which variables do occur, using a non-linear higher-order pattern. In our Agda implementation, weakening must be marked explicitly, and strengthening requires one to enumerate those variables that do not occur instead. However, the more convenient syntax seems within reach for a standalone implementation of our framework; e.g., weakening could be implemented using a form of coercive subtyping. Of course, one way in which all of the LF-based systems outpace ours is that they support dependent types, which are crucial for representing logics and for mechanizing metatheory. Our most pressing areas of future work are to investigate a dependently typed extension of our universe, and to address the termination issues that we have deferred here. One key issue for the dependently typed version will be the equational behavior of the structural properties, which we have not yet investigated. We would hope that they have the right behavior up to propositional equality (otherwise there is a bug in the code presented here), but it remains to be seen whether we can get Agda’s definitional equality to mimic the equations proved automatically by, e.g., Twelf. That said, the fact that the map function defined in Section 4 commutes with all term constructors definitionally in Agda gives us some hope in this regard.
data _∈_ {A : Set} : A -> List A -> Set where
  i0 : {x : A} {xs : List A} -> x ∈ (x :: xs)
  iS : {x y : A} {xs : List A} -> y ∈ xs -> y ∈ (x :: xs)
For any Set A, and terms x and xs of type A and List A, there is a type x ∈ xs. The first constructor, i0, creates a proof of x ∈ (x :: xs)—i.e. x is the first element of the list. The second constructor, iS, creates a proof of x ∈ (y :: xs) from a proof that x is in the tail. As a simple example of dependent pattern matching, we define an n-ary version of iS:

skip : {A : Set} (xs : List A) {ys : List A} {y : A}
       -> y ∈ ys -> y ∈ (append xs ys)
skip [] i = i
skip (x :: xs) i = iS (skip xs i)
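For illustration (not from the paper; ex is a hypothetical name), skip can weaken a membership proof past a one-element prefix:

ex : {A : Set} {x y : A} -> y ∈ (x :: (y :: []))
ex {_} {x} = skip (x :: []) i0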
We use an implicit-quantifier for all arguments but the list xs; explicit-quantifiers are written with parentheses instead of curlybraces. The fact that this code type-checks depends on the computational behavior of append; e.g., in the first case, the expression append [] ys reduces to ys, so we can return the index i unchanged. Well-scoped syntax for the untyped λ-calculus is defined as follows: data Term (Γ . : ∈ Lam : Term App : Term
: List Γ -> ( :: Γ ->
Unit) : Set where Term Γ Γ) -> Term Γ Term Γ -> Term Γ
In this section, we review Agda’s syntax, we show a simple example of well-scoped de Bruijn indices, and we give a simple example of a universe. We refer the reader to the Agda Wiki (http://wiki.portal.chalmers.se/agda/) for more introductory materials.
The type Unit is defined to be the record type with no fields, with inhabitant written . We represent variables as indices into a list Γ containing elements of the one-element type Unit. (Such lists are isomorphic to natural numbers, but this illustrates the pattern for variables of more than one type.) The constructor . makes a term from an index into Γ, which represents a variable. The body of Lam can refer to all of the variables in Γ, as well as a new bound variable represented by extending Γ to ( :: Γ). The K combinator λx.λy.x is represented as follows: Lam (Lam (. (iS i0))). The values of Term Γ correspond exactly to the λ-terms with free variables in Γ.
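For illustration (idTm and appTm are hypothetical names, not from the paper), two more closed terms in this representation are the identity combinator λx.x and the term λx.λy.x y:

idTm : Term []
idTm = Lam (. i0)

appTm : Term []
appTm = Lam (Lam (App (. (iS i0)) (. i0)))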
A.1
A.2
A.
Agda Overview
Well-scoped de Bruijn indices in Agda
Universes
A universe is specified by a inductive datatype of codes for types, along with a function mapping each code to a Set. For example, a simple universe with an empty type, a unit type, and binary products is specified as follows:
We review the representation of well-scoped de Bruijn indices as an indexed inductive definition (Altenkirch and Reus, 1999; Bellegarde and Hook, 1994; Bird and Paterson, 1999). Agda data types are introduced as follows:
data Type : Set where 0+ : Type 1+ : Type _⊗_ : Type -> Type -> Type
data List (A : Set) : Set where [] : List A _::_ : A -> List A -> List A
Set classifies Agda classifiers, like the kind type in ML or Haskell. Mixfix constructors are declared by using _ in an identifier; e.g., :: can now be used infix as in Zero :: (Zero :: []). Functions are defined by pattern-matching:
Element Element Element Element
append : {A : Set} -> List A -> List A -> List A append [] ys = ys append (x :: xs) ys = x :: (append xs ys)
: Type -> Set 0+ = Void 1+ = Unit (τ 1 ⊗ τ 2) = (Element τ 1) × (Element τ 2)
In the right-hand side of Element, we write A × B for the Agda pair type, etc. Datatype-generic programs are implemented by recursion over the codes; e.g, every element of the universe can be converted to a string:
The curly-braces mark an implicit dependent function space. Applications to implicit arguments are not marked in the program;
133
show show show show "<
A. Chlipala. A certified type-preserving compiler from λ-calculus to assembly language. In ACM SIGPLAN Conference on Programming Language Design and Implementation, 2007.
: (τ : Type) -> Element τ -> String 0+ () 1+ = "" (τ 1 ⊗ τ 2) (e1 , e2) = " ^ (show τ 1 e1) ^ " , " ^ (show τ 2 e2) ^ " >"
A. Chlipala. Parametric higher-order abstract syntax for mechanized semantics. In ACM SIGPLAN International Conference on Functional Programming. ACM, 2008.
In the first clause, the empty parentheses are a refutation pattern, telling Agda to check that the type in question (in this case Element 0+ ) is uninhabited, and allowing the programmer to elide the right-hand side. As another example, we will often view booleans as a twoelement universe, with only True inhabited:
K. Crary. Explicit contexts in LF. In International Workshop on Logical Frameworks and Meta-Languages: Theory and Practice, 2008. J. Despeyroux, A. Felty, and A. Hirschowitz. Higher-order abstract syntax in Coq. In M. Dezani-Ciancaglini and G. Plotkin, editors, International Conference on Typed Lambda Calculi and Applications, volume 902 of Lecture Notes in Computer Science, pages 124–138, Edinburgh, Scotland, 1995. Springer-Verlag.
data Bool : Set where True : Bool False : Bool
M. Fiore, G. Plotkin, and D. Turi. Abstract syntax and variable binding. In IEEE Symposium on Logic in Computer Science, 1999.
Check : Bool -> Set Check True = Unit Check False = Void
J. Hickey, A. Nogin, X. Yu, and A. Kopylov. Mechanized meta-reasoning using a hybrid HOAS/de Bruijn representation and reflection. In ACM SIGPLAN International Conference on Functional Programming, pages 172–183, New York, NY, USA, 2006. ACM.
Because Agda implements extensionality for Unit (there is only one record with no fields), terms of type Check True can be left implicit and inferred.
M. Hofmann. Semantical analysis of higher-order abstract syntax. In IEEE Symposium on Logic in Computer Science, 1999. D. R. Licata, N. Zeilberger, and R. Harper. Focusing on binding and computation. In IEEE Symposium on Logic in Computer Science, 2008.
Acknowledgements We thank Noam Zeilberger for discussions about this work, and we thank the anonymous reviewers for their helpful feedback on an earlier version of this article.
P. Martin-Löf. An intuitionistic theory of types: Predicative part. In H. Rose and J. Shepherdson, editors, Logic Colloquium. Elsevier, 1975. C. McBride and J. McKinna. The view from the left. Journal of Functional Programming, 15(1), 2004.
References
D. Miller and A. F. Tiu. A proof theory for generic judgments: An extended abstract. In IEEE Symposium on Logic in Computer Science, pages 118– 127, 2003.
M. Abbott, T. Altenkirch, and N. Ghani. Containers: constructing strictly positive types. Theoretic Computer Science, 342(1):3–27, 2005.
A. Momigliano, A. Martin, and A. Felty. Two-level hybrid: A system for reasoning using higher-order abstract syntax. In International Workshop on Logical Frameworks and Meta-Languages: Theory and Practice, 2007.
T. Altenkirch and C. McBride. Generic programming within dependently typed programming. In IFIP TC2 Working Conference on Generic Programming, Schloss Dagstuhl, 2003.

T. Altenkirch and B. Reus. Monadic presentations of lambda terms using generalized inductive types. In CSL 1999: Computer Science Logic. LNCS, Springer-Verlag, 1999.
A. Nanevski, F. Pfenning, and B. Pientka. Contextual modal type theory. Transactions on Computational Logic, 2007. To appear.

U. Norell. Towards a practical programming language based on dependent type theory. PhD thesis, Chalmers University of Technology, 2007.
S. Ambler, R. L. Crole, and A. Momigliano. Combining higher order abstract syntax with tactical theorem proving and (co)induction. In International Conference on Theorem Proving in Higher-Order Logics, pages 13–30, London, UK, 2002. Springer-Verlag.
F. Pfenning and C. Schürmann. System description: Twelf - a meta-logical framework for deductive systems. In H. Ganzinger, editor, International Conference on Automated Deduction, pages 202–206, 1999.
B. Aydemir, A. Charguéraud, B. C. Pierce, R. Pollack, and S. Weirich. Engineering formal metatheory. In ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pages 3–15, 2008.
B. Pientka. A type-theoretic foundation for programming with higher-order abstract syntax and first-class substitutions. In ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pages 371–382, 2008.
F. Bellegarde and J. Hook. Substitution: A formal methods case study using monads and transformations. Science of Computer Programming, 23(2– 3):287–311, 1994.
A. M. Pitts and M. J. Gabbay. A metalanguage for programming with bound names modulo renaming. In R. Backhouse and J. N. Oliveira, editors, Mathematics of Program Construction, volume 1837 of Lecture Notes in Computer Science, pages 230–255. Springer-Verlag, Heidelberg, 2000.
U. Berger and H. Schwichtenberg. An inverse of the evaluation functional for typed λ-calculus. In IEEE Symposium on Logic in Computer Science, 1991.
A. Poswolsky and C. Schürmann. Practical programming with higherorder encodings and dependent types. In European Symposium on Programming, 2008.
R. S. Bird and R. Paterson. De Bruijn notation as a nested datatype. Journal of Functional Programming, 9(1):77–91, 1999.

A. Bucalo, M. Hofmann, F. Honsell, M. Miculan, and I. Scagnetto. Consistency of the theory of contexts. Journal of Functional Programming, 16(3):327–395, May 2006.
F. Pottier. Static name control for FreshML. In IEEE Symposium on Logic in Computer Science, 2007.

M. R. Shinwell, A. M. Pitts, and M. J. Gabbay. FreshML: Programming with binders made simple. In ACM SIGPLAN International Conference on Functional Programming, pages 263–274, August 2003.
V. Capretta and A. Felty. Combining de Bruijn indices and higher-order abstract syntax in Coq. In Proceedings of TYPES 2006, volume 4502 of Lecture Notes in Computer Science, pages 63–77. Springer-Verlag, 2007.
R. Virga. Higher-Order Rewriting with Dependent Types. PhD thesis, Carnegie Mellon University, 1999.
Non-Parametric Parametricity

Georg Neis, MPI-SWS, [email protected]
Derek Dreyer, MPI-SWS, [email protected]
Andreas Rossberg, MPI-SWS, [email protected]
Abstract
Type abstraction and intensional type analysis are features seemingly at odds—type abstraction is intended to guarantee parametricity and representation independence, while type analysis is inherently non-parametric. Recently, however, several researchers have proposed and implemented “dynamic type generation” as a way to reconcile these features. The idea is that, when one defines an abstract type, one should also be able to generate at run time a fresh type name, which may be used as a dynamic representative of the abstract type for purposes of type analysis. The question remains: in a language with non-parametric polymorphism, does dynamic type generation provide us with the same kinds of abstraction guarantees that we get from parametric polymorphism?

Our goal is to provide a rigorous answer to this question. We define a step-indexed Kripke logical relation for a language with both non-parametric polymorphism (in the form of type-safe cast) and dynamic type generation. Our logical relation enables us to establish parametricity and representation independence results, even in a non-parametric setting, by attaching arbitrary relational interpretations to dynamically-generated type names. In addition, we explore how programs that are provably equivalent in a more traditional parametric logical relation may be “wrapped” systematically to produce terms that are related by our non-parametric relation, and vice versa. This leads us to a novel “polarized” form of our logical relation, which enables us to distinguish formally between positive and negative notions of parametricity.

Categories and Subject Descriptors D.3.3 [Programming Languages]: Language Constructs and Features—Abstract data types; F.3.1 [Logics and Meanings of Programs]: Specifying and Verifying and Reasoning about Programs

General Terms Languages, Theory, Verification

Keywords Parametricity, intensional type analysis, representation independence, step-indexed logical relations, type-safe cast

1. Introduction

When we say that a language supports parametric polymorphism, we mean that “abstract” types in that language are really abstract—that is, no client of an abstract type can guess or depend on its underlying implementation [20]. Traditionally, the parametric nature of polymorphism is guaranteed statically by the language’s type system, thus enabling the so-called type-erasure interpretation of polymorphism by which type abstractions and instantiations are erased during compilation.

However, some modern programming languages include a useful feature that appears to be in direct conflict with parametric polymorphism, namely the ability to perform intensional type analysis [12]. Probably the simplest and most common instance of intensional type analysis is found in the implementation of languages supporting a type Dynamic [1]. In such languages, any value v may be cast to type Dynamic, but the cast from type Dynamic to any type τ requires a runtime check to ensure that v’s actual type equals τ. Other languages such as Acute [25] and Alice ML [23], which are designed to support dynamic loading of modules, require the ability to check dynamically whether a module implements an expected interface, which in turn involves runtime inspection of the module’s type components. There have also been a number of more experimental proposals for languages that employ a typecase construct to facilitate polytypic programming (e.g., [32, 29]).

There is a fundamental tension between type analysis and type abstraction. If one can inspect the identity of an unknown type at run time, then the type is not really abstract, so any invariants concerning values of that type may be broken [32]. Consequently, languages with a type Dynamic often distinguish between castable and non-castable types—with types that mention user-defined abstract types belonging to the latter category—and prohibit values with non-castable types from being cast to type Dynamic. This is, however, an unnecessarily severe restriction, which effectively penalizes programmers for using type abstraction. Given a user-defined abstract type t—implemented internally, say, as int—it is perfectly reasonable to cast a value of type t → t to Dynamic, so long as we can ensure that it will subsequently be cast back only to t → t (not to, say, int → int or int → t), i.e., so long as the cast is abstraction-safe. Moreover, such casts are useful when marshalling (or “pickling”) a modular component whose interface refers to abstract types defined in other components [23].

That said, in order to ensure that casts are abstraction-safe, it is necessary to have some way of distinguishing (dynamically, when a cast occurs) between an abstract type and its implementation. Thus, several researchers have proposed that languages with type analysis facilities should also support dynamic type generation [24, 21, 29, 22]. The idea is simple: when one defines an abstract type, one should also be able to generate at run time a “fresh” type name, which may be used as a dynamic representative of the abstract type for purposes of type analysis.1 (We will see a concrete example of this in Section 2.) Intuitively, the freshness of type name generation ensures that user-defined abstract types are viewed dynamically in the same way that they are viewed statically—i.e., as distinct from all other types.
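As a concrete, if simplified, illustration of this kind of type Dynamic mechanism, here is a small Haskell sketch using the standard Data.Dynamic library; the names d, asInt, and asBool are ours, and the sketch only illustrates the general feature, not the calculus studied in this paper.

  import Data.Dynamic (Dynamic, toDyn, fromDynamic)

  -- Any value of a Typeable type can be injected into Dynamic.
  d :: Dynamic
  d = toDyn (42 :: Int)

  -- Projecting it back out involves a runtime type check:
  asInt :: Maybe Int     -- evaluates to Just 42
  asInt = fromDynamic d

  asBool :: Maybe Bool   -- evaluates to Nothing: the stored type is Int
  asBool = fromDynamic d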
1 In languages with simple module mechanisms, such as Haskell, it is possible to generate unique type names statically. However, this is not sufficient in the presence of functors and local or first-class modules.
The question remains: how do we know that dynamic type generation works? In a language with intensional type analysis— i.e., non-parametric polymorphism—can the systematic use of dynamic type generation provably ensure abstraction safety and provide us with the same kinds of abstraction guarantees that we get from traditional parametric polymorphism? Our goal is to provide a rigorous answer to this question. We study an extension of System F, supporting (1) a type-safe cast mechanism, which is essentially a variant of Girard’s J operator [9], and (2) a facility for dynamic generation of fresh type names. For brevity, we will call this language G. As a practical language mechanism, the cast operator is somewhat crude in comparison to the more expressive typecase-style constructs proposed in the literature,2 but it nonetheless renders polymorphism non-parametric. Our main technical result is that, in a language with non-parametric polymorphism, parametricity may be provably regained via judicious use of dynamic type generation. The rest of the paper is structured as follows. In Section 2, we present our language under consideration, G, and also give an example to illustrate how dynamic type generation is useful. In Section 3, we explain informally the approach that we have developed for reasoning about G. Our approach employs a stepindexed Kripke logical relation, with an unusual form of possible world that is a close relative of Sumii and Pierce’s [26]. This section is intended to be broadly accessible to readers who are generally familiar with the basic idea of relational parametricity but not with the details of (advanced) logical relations techniques. In Section 4, we formalize our logical relation for G and show how it may be used to reason about parametricity and representation independence. A particularly appealing feature of our formalization is that the non-parametricity of G is encapsulated in the notion of what it means for two types to be logically related to each other when viewed as data. The definition of this type-level logical relation is a one-liner, which can easily be replaced with an alternative “parametric” version. In Sections 5–8, we explore how terms related by the parametric version of our logical relation may be “wrapped” systematically to produce terms related by the non-parametric version (and vice versa), thus clarifying how dynamic type generation facilitates parametric reasoning. This leads us to a novel “polarized” form of our logical relation, which enables us to distinguish formally between positive and negative notions of parametricity. In Section 9, we extend G with iso-recursive types to form Gμ and adapt the previous development accordingly. Then, in Section 10, we discuss how the abovementioned “wrapping” function can be seen as an embedding of System F (+ recursive types) into Gμ , which we conjecture to be fully abstract. Finally, in Section 11, we discuss related work, including recent work on the relevant concepts of dynamic sealing [27] and multilanguage interoperation [13], and in Section 12, we conclude and suggest directions for future work.
Types           τ ::= α | b | τ → τ | τ × τ | ∀α.τ | ∃α.τ
Values          v ::= x | c | λx:τ.e | v1, v2 | λα.e | pack τ, v as τ
Terms           e ::= v | e e | e1, e2 | e.1 | e.2 | e τ | pack τ, e as τ | unpack α, x=e in e | cast τ τ | new α≈τ in e
Stores          σ ::= ε | σ, α≈τ
Config's        ζ ::= σ; e
Type Contexts   Δ ::= ε | Δ, α | Δ, α≈τ
Value Contexts  Γ ::= ε | Γ, x:τ

Δ; Γ ⊢ e : τ    ···

  (E CAST)   Δ ⊢ τ1    Δ ⊢ τ2
             ─────────────────────────────────
             Δ; Γ ⊢ cast τ1 τ2 : τ1 → τ2 → τ2

  (E NEW)    Δ, α≈τ; Γ ⊢ e : τ′    Δ ⊢ τ    Δ ⊢ τ′
             ──────────────────────────────────────
             Δ; Γ ⊢ new α≈τ in e : τ′

  (E CONV)   Δ; Γ ⊢ e : τ    Δ ⊢ τ ≈ τ′
             ─────────────────────────
             Δ; Γ ⊢ e : τ′

Δ ⊢ τ    ···

  (T NAME)   α≈τ ∈ Δ
             ────────
             Δ ⊢ α

Δ ⊢ τ ≈ τ    ···

  (C NAME)   α≈τ ∈ Δ
             ──────────
             Δ ⊢ α ≈ τ

⊢ ζ : τ

  (CONF)     ⊢ σ    σ; · ⊢ e : τ
             ───────────────────
             ⊢ σ; e : τ

  σ; (λx:τ.e) v                      →  σ; e[v/x]
  σ; v1, v2.i                        →  σ; vi
  σ; (λα.e) τ                        →  σ; e[τ/α]
  σ; unpack α, x=(pack τ, v) in e    →  σ; e[τ/α][v/x]
  σ; new α≈τ in e                    →  σ, α≈τ; e                (α ∉ dom(σ))
  σ; cast τ1 τ2                      →  σ; λx1:τ1.λx2:τ2.x1      (τ1 = τ2)
  σ; cast τ1 τ2                      →  σ; λx1:τ1.λx2:τ2.x2      (τ1 ≠ τ2)

(. . . plus standard “search” rules . . . )

Figure 1. Syntax and Semantics of G (excerpt)

2. The Language G

Figure 1 defines our non-parametric language G. For the most part, G is a standard call-by-value λ-calculus, consisting of the usual types and terms from System F [9], including pairs and existential types.3 We also assume an unspecified set of base types b, along with suitable constants c of those types. Two additional, non-standard constructs isolate the essential aspects of the class of languages we are interested in:

• cast τ1 τ2 v1 v2 converts v1 from type τ1 to τ2. It checks that those two types are the same at the time of evaluation. If so, the operator succeeds and returns v1. Otherwise, it fails and defaults to v2, which acts as an else clause of the target type τ2.

• new α≈τ in e generates a fresh abstract type name α. Values of type α can be formed using its representation type τ. Both types are deemed compatible, but not equivalent. That is, they are considered equal as classifiers, but not as data. In particular, cast α τ v v′ will not succeed (i.e., it will return v′).
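For readers who think in Haskell terms, the behavior (though not the exact typing) of cast can be approximated with the standard Data.Typeable interface; the helper castOrElse below is a hypothetical name of ours, not part of any library or of G.

  import Data.Typeable (Typeable, cast)
  import Data.Maybe (fromMaybe)

  -- Roughly: return the converted value if the two types agree at run time,
  -- otherwise fall back to the supplied default of the target type.
  castOrElse :: (Typeable a, Typeable b) => a -> b -> b
  castOrElse v1 v2 = fromMaybe v2 (cast v1)

For example, castOrElse (5 :: Int) (0 :: Int) yields 5, whereas castOrElse True (0 :: Int) yields the default 0.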
Our cast operator is essentially the same as Harper and Mitchell’s TypeCond operator [11], which was itself a variant of the nonparametric J operator that Girard studied in his thesis [9]. Our new construct is similar to previously proposed constructs for dynamic type generation [21, 29, 22]. However, we do not require explicit term-level type coercions to witness the isomorphism between an abstract type name α and its representation τ . Instead, our type system is simple enough that we perform this conversion implicitly.
2 That said, the implementation of dynamic modules in Alice ML, for instance, employs a very similar construct [23].
3 We could use a Church encoding of existentials through universals, but distinguishing them gives us more leeway later (cf. Section 5).
For convenience, we will occasionally use expressions of the form let x=e1 in e2 , which abbreviate the term (λx:τ1 .e2 ) e1 (with τ1 being an appropriate type for e1 ). We omit the type annotation for existential packages where clear from context. Moreover, we take the liberty to generalize binary tuples to n-ary ones where necessary and to use pattern matching notation to decompose tuples in the obvious manner.
Type preservation can be expressed using the typing rule CONF for configurations. We formulate this rule by treating the type store as a type context, which is possible because type stores are a syntactic subclass of type contexts. (In a similar manner, we can write σ for well-formedness of store σ, by viewing it as a type context.) It is worth noting that the representation types in the store are actually never inspected by the dynamic semantics. They are only needed for specifying well-formedness of configurations and proving type soundness.
2.1 Typing Rules The typing rules for the System F fragment of G are completely standard and thus omitted from Figure 1. We focus on the nonstandard rules related to cast and new. Full formal details of the type system appear in the expanded version of this paper [16]. Typing of casts is straightforward (Rule E CAST): cast τ1 τ2 is simply treated as a function of type τ1 → τ2 → τ2 . Its first argument is the value to be converted, and its second argument is the default value returned in the case of failure. The rule merely requires that the two types be well-formed. For an expression new α≈τ in e, which binds α in e, Rule E NEW checks that the body e is well-typed under the assumption that α is implemented by the representation type τ . For that purpose, we enrich type contexts Δ with entries of the form α≈τ that keep track of the representation types tied to abstract type names. Note that τ may not mention α. Syntactically, type names are just type variables. When viewed as data, (i.e., when inspected by the cast operator), types are considered equivalent iff they are syntactically equal. In contrast, when viewed as classifiers for terms, knowledge about the representation of type names may be taken into account. Rule E CONV says that if a term e has a type τ , it may be assigned any other type that is compatible with τ . Type compatibility, in turn, is defined by the judgment Δ τ1 ≈ τ2 . We only show the rule C NAME, which discharges a compatibility assumption α≈τ from the context; the other rules implement the congruence closure of this axiom. The important point here is that equivalent types are compatible, but compatible types are not necessarily equivalent. Finally, Rule E NEW also requires that the type τ of the body e does not contain α (i.e., τ must be well formed in Δ alone). A type of this form can always be derived by applying E CONV to convert τ to τ [τ /α].
2.3 Motivating Example

Consider the following attempt to write a simple functional “binary semaphore” ADT [17] in G. Following Mitchell and Plotkin [15], we use an existential type, as we would in System F:

  τsem := ∃α. α × (α → α) × (α → bool)
  esem := pack int, 1, λx:int.(1 − x), λx:int.(x ≠ 0) as τsem

A semaphore essentially is a flag that can be in two states: either locked or unlocked. The state can be toggled using the first function of the ADT, and it can be polled using the second. Our little module uses an integer value for representing the state, taking 1 for locked and 0 for unlocked. It is an invariant of the implementation that the integer never takes any other value—otherwise, the toggle function would no longer operate correctly. In System F, the implementation invariant would be protected by the fact that existential types are parametric: there is no way to inspect the witness of α after opening the package, and hence no client could produce values of type α other than those returned by the module (nor could she apply integer operations to them). Not so in G. The following program uses cast to forge a value s of the abstract semaphore type α:

  eclient := unpack α, s0, toggle, poll = esem in
             let s = cast int α 666 s0 in poll s, poll (toggle s)

Because reduction of unpack simply substitutes the representation type int for α, the consecutive cast succeeds, and the whole expression evaluates to true, true—although the second component should have toggled s and thus be different from the first. The way to prevent this in G is to create a fresh type name as witness of the abstract type:
2.2 Dynamic Semantics
  esem1 := new α′ ≈ int in pack α′, 1, λx:int.(1 − x), λx:int.(x ≠ 0) as τsem
The operational semantics has to deal with generation of fresh type names. To that end, we introduce a type store σ to record generated type names. Hence, reduction is defined on configurations (σ; e) instead of plain terms. Figure 1 shows the main reduction rules. We omit the standard “search” rules for descending into subterms according to call-by-value, left-to-right evaluation order. The reduction rules for the F fragment are as usual and do not actually touch the store. However, types occurring in F constructs can contain type names bound in the store. Reducing the expression new α≈τ in e creates a new entry for α in the type store. We rely on the usual hygiene convention for bound variables to ensure that α is fresh with respect to the current store (which can always be achieved by α-renaming).4 The two remaining rules are for casts. A cast takes two types and checks that they are equivalent (i.e., syntactically equal). In either case, the expression reduces to a function that will return the appropriate one of the additional value arguments, i.e., the value to be converted in case of success, and the default value otherwise. In the former case, type preservation is ensured because source and target types are known to be equivalent.
After replacing the initial semaphore implementation with this one, eclient will evaluate to true, false as desired—the cast expression will no longer succeed, because α will be substituted by the dynamic type name α′, and α′ ≠ int. (Moreover, since α′ is only visible statically in the scope of the new expression, the client has no access to α′, and thus cannot convert from int to α′ either.) Now, while it is clear that new ensures proper type abstraction in the client program eclient, we want to prove that it does so for any client program. A standard way of doing so is by showing a more general property, namely representation independence [20]: we show that the module esem1 is contextually equivalent to another module of the same type, meaning that no G program can observe any difference between the two modules. By choosing that other module to be a suitable reference implementation of the ADT in question, we can conclude that the “real” one behaves properly under all circumstances. The obvious candidate for a reference implementation of the semaphore ADT is the following:

  esem2 := new α′ ≈ bool in pack α′, true, λx:bool.¬x, λx:bool.x as τsem
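The role played by new here is loosely analogous to introducing a nominally distinct type in a language with runtime type analysis. As a rough Haskell sketch (the names SemSyn, SemNew, forgeSyn, and forgeNew are ours, and this is only an analogy, not the mechanism studied in this paper), a type synonym leaves the representation castable, whereas a newtype (a statically generated fresh name, in the sense of footnote 1) blocks the forging cast:

  import Data.Dynamic (toDyn, fromDynamic)

  type SemSyn = Int              -- transparent: the "abstract" type is literally Int
  newtype SemNew = SemNew Int    -- a fresh name, distinct from Int at run time

  forgeSyn :: Maybe SemSyn
  forgeSyn = fromDynamic (toDyn (666 :: Int))   -- Just 666: the cast succeeds

  forgeNew :: Maybe SemNew
  forgeNew = fromDynamic (toDyn (666 :: Int))   -- Nothing: the cast fails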
4 A well-known alternative approach would omit the type store in favor of using scope extrusion rules for new binders, as in Rossberg [21].
say that two functions are logically related at the type τ1 → τ2 if, when passed arguments that are logically related at τ1 , either they both diverge or they both converge to values that are logically related at τ2 . The fundamental theorem of logical relations states that the logical relation is a congruence with respect to the constructs of the language. Together with what Pitts [17] calls adequacy—i.e., the fact that logically related terms have equivalent termination behavior—the fundamental theorem implies that logically related terms are contextually equivalent, since contextual equivalence is defined precisely to be the largest adequate congruence. Traditionally, the parametric nature of polymorphism is made clear by the definition of the logical relation for universal and existential types. Intuitively, two type abstractions, λα.e1 and λα.e2 , are logically related at type ∀α.τ if they map related type arguments to related results. But what does it mean for two type arguments to be related? Moreover, once we settle on two related type arguments τ1 and τ2 , at what type do we relate the results e1 [τ1 /α] and e2 [τ2 /α]? One approach would be to restrict “related type arguments” to be the same type τ . Thus, λα.e1 and λα.e2 would be logically related at ∀α.τ iff, for any (closed) type τ , it is the case that e1 [τ /α] and e2 [τ /α] are logically related at the type τ [τ /α]. A key problem with this definition, however, is that, due to the quantification over any argument type τ , the type τ [τ /α] may in fact be larger than the type ∀α.τ , and thus the definition of the logical relation is no longer inductive in the structure of the type. Another problem is that this definition does not tell us anything about the parametric nature of polymorphism. Reynolds’ alternative approach is a generalization of Girard’s “candidates” method for proving strong normalization for System F [9]. The idea is simple: instead of defining two type arguments to be related only if they are the same, allow any two different type arguments to be related by an (almost) arbitrary relational interpretation (subject to certain admissibility constraints). That is, we parameterize the logical relation at type τ by an interpretation function ρ, which maps each free type variable of τ to a pair of types τ1 , τ2 together with some (admissible) relation between values of those types. Then, we say that λα.e1 and λα.e2 are logically related at type ∀α.τ under interpretation ρ iff, for any closed types τ1 and τ2 and any relation R between values of those types, it is the case that e1 [τ1 /α] and e2 [τ2 /α] are logically related at type τ under interpretation ρ, α → (τ1 , τ2 , R). The miracle of Reynolds/Girard’s method is that it simultaneously (1) renders the logical relation inductively well-defined in the structure of the type, and (2) demonstrates the parametricity of polymorphism: logically related type abstractions must behave the same even when passed completely different type arguments, so their behavior may not analyze the type argument and behave in different ways for different arguments. Dually, we can show that two ADTs pack τ1 , v1 as ∃α.τ and pack τ2 , v2 as ∃α.τ are logically related (and thus contextually equivalent) by exhibiting some relational interpretation R for the abstract type α, even if the underlying type representations τ1 and τ2 are different. This is the essence of what is meant by “representation independence”. 
Unfortunately, in the setting of G, Reynolds/Girard’s method is not directly applicable, precisely because polymorphism in G is not parametric! This essentially forces us back to the first approach suggested above, namely to only consider type arguments to be logically related if they are equal. Moreover, it makes sense: the cast operator views types as data, so types may only be logically related if they are indistinguishable as data. The natural questions, then, are: (1) what metric do we use to define the logical relation inductively, since the structure of the type no longer suffices, and (2) how do we establish that dynamic
Here, the semaphore state is represented directly by a Boolean flag and does not rely on any additional invariant. If we can show that esem1 is contextually equivalent to esem2 , then we can conclude that esem1 ’s type representation is truly being held abstract. 2.4 Contextual Equivalence In order to be able to reason about representation independence, we need to make precise the notion of contextual equivalence. A context C is an expression with a single hole [ ], defined in the usual manner. Typing of contexts is defined by a judgment form C : (Δ; Γ; τ ) (Δ ; Γ ; τ ), where the triple (Δ; Γ; τ ) indicates the type of the hole. The judgment implies that for any expression e with Δ; Γ e : τ we have Δ ; Γ C[e] : τ . The rules are straightforward, the key rule being the one for holes: Γ ⊆ Γ Δ ⊆ Δ [ ] : (Δ; Γ; τ ) (Δ ; Γ ; τ ) We can now define contextual approximation and contextual equivalence as follows (with σ; e ↓ asserting that σ; e terminates): Definition 2.1 (Contextual Approximation and Equivalence) Let Δ; Γ e1 : τ and Δ; Γ e2 : τ . Δ; Γ e1 e2 : τ ⇔ ∀C, τ , σ. σ ∧ C : (Δ; Γ; τ ) (σ; ; τ ) ∧ σ; C[e1 ] ↓ =⇒ σ; C[e2 ] ↓ def Δ; Γ e1 e2 : τ ⇔ Δ; Γ e1 e2 : τ ∧ Δ; Γ e2 e1 : τ def
That is, contextual approximation Δ; Γ e1 e2 : τ means that for any well-typed program context C with a hole of appropriate type, the termination of C[e1 ] implies the termination of C[e2 ]. Contextual equivalence Δ; Γ e1 e2 : τ is just approximation in both directions. Considering that G does not explicitly contain any recursive or looping constructs, the reader may wonder why termination is used as the notion of “distinguishing observation” in our definition of contextual equivalence. The reason is that the cast operator, together with impredicative polymorphism, makes it possible to write well-typed non-terminating programs [11]. (This was Girard’s reason for studying the J operator in the first place [9].) Moreover, using cast, one can encode arbitrary recursive function definitions. Other forms of observation may then be encoded in terms of (non-)termination. See the expanded version of this paper for details [16].
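To see why such casts yield non-termination even without any explicit recursion, here is a rough Haskell analogue using Data.Dynamic (the names selfApply and loop are ours; the paper's actual encoding goes through G's cast operator rather than through Dynamic):

  import Data.Dynamic (Dynamic, toDyn, fromDynamic)

  -- Neither definition below is recursive, yet `loop` diverges: the runtime
  -- type check lets a function on Dynamic be applied to (a packaging of) itself.
  selfApply :: Dynamic -> Dynamic
  selfApply d = case (fromDynamic d :: Maybe (Dynamic -> Dynamic)) of
                  Just f  -> f d
                  Nothing -> d

  loop :: Dynamic
  loop = selfApply (toDyn selfApply)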
3. A Logical Relation for G: Main Ideas Following Reynolds [20] and Mitchell [14], our general approach to reasoning about parametricity and representation independence is to define a logical relation. Essentially, logical relations give us a tractable way of proving that two terms are contextually equivalent, which in turn gives us a way of proving that abstract types are really abstract. Of course, since polymorphism in G is non-parametric, the definition of our logical relation in the cases of universal and existential types is somewhat unusual. To place our approach in context, we first review the traditional approach to defining logical relations for languages with parametric polymorphism, such as System F. 3.1 Logical Relations for Parametric Polymorphism Although the technical meaning of “logical relation” is rather woolly, the basic idea is to define an equivalence (or approximation) relation on programs inductively, following the structure of their types. To take the canonical example of arrow types, we would
type generation regains a form of parametricity? We address these questions in the next two sections, respectively.
We now must show that the values pack α , 1, λx: int .(1 − x), λx: int .(x = 0) as τsem
3.2 Step-Indexed Logical Relations for Non-Parametricity
and pack α , true, λx: bool .¬x, λx: bool .x as τsem
First, in order to provide a metric for inductively defining the logical relation, we employ step-indexing. Step-indexed logical relations were proposed originally by Appel and McAllester [7] as a way of giving a simple operational-semantics-based model for general recursive types in the context of foundational proofcarrying code. In subsequent work by Ahmed and others [3, 6], the method has been adapted to support relational reasoning in a variety of settings, including untyped and imperative languages. The key idea of step-indexed logical relations is to index the definition of the logical relation not only by the type of the programs being related, but also by a natural number n representing (intuitively) “the number of steps left in the computation”. That is, if two terms e1 and e2 are logically related at type τ for n steps, then if we place them in any program context C and run the resulting programs for n steps of computation, we should not be able to produce observably different results (e.g., C[e1 ] evaluating to 5 and C[e2 ] evaluating to 7). To show that e1 and e2 are contextually equivalent, then, it suffices to show that they are logically related for n steps, for any n. To see how step-indexing helps us, consider how we might define a step-indexed logical relation for G in the case of universal types: two type abstractions λα.e1 and λα.e2 are logically related at ∀α.τ for n steps iff, for any type argument τ , it is the case that e1 [τ /α] and e2 [τ /α] are logically related at τ [τ /α] for n − 1 steps. This reasoning is sound because the only way a program context can distinguish between λα.e1 and λα.e2 in n steps is by first applying them to a type argument τ —which incurs a step of computation for the β-reduction (λα.ei ) τ → ei [τ /α]—and then distinguishing between e1 [τ /α] and e2 [τ /α] within the next n − 1 steps. Moreover, although the type τ [τ /α] may be larger than ∀α.τ , the step index n − 1 is smaller, so the logical relation is inductively well-defined.
are logically related in the world w. Since G’s logical relation for existential types is non-parametric, the two packages must have the same type representation, but of course the whole point of using new was to ensure that they do (namely, it is α ). The remainder of the proof is showing that the value components of the packages are related at the type α × (α → α ) × (α → bool) under the interpretation ρ = α → (int, bool, R) derived from the world w. This last part is completely analogous to what one would show in a standard representation independence proof. In short, the possible worlds in our Kripke logical relations bring back the ability to assign arbitrary relational interpretations R to abstract types, an ability that was seemingly lost when we moved to a non-parametric logical relation. The only catch is that we can only assign arbitrary interpretations to dynamic type names, not to static, universally/existentially quantified type variables. There is one minor technical matter that we glossed over in the above proof sketch but is worth mentioning. Due to nondeterminism of type name allocation, the evaluation of esem1 and esem2 may result in α being replaced by α1 in the former and α2 in the latter (for some fresh α1 = α2 ). Moreover, we are also interested in proving equivalence of programs that do not necessarily allocate exactly the same number of type names in the same order. Consequently, we also include in our possible worlds a partial bijection η between the type names of the first program and the type names of the second program, which specifies how each dynamically generated abstract type is concretely represented in the stores of the two programs. We require them to be in 1-1 correspondence because the cast construct permits the program context to observe equality on type names, as follows: def
equal? : ∀α.∀β. bool = Λα.Λβ. cast ((α → α) → bool) ((β → β) → bool) (λx:(α → α). true)(λx:(β → β). false)(λx:β.x)
3.3 Kripke Logical Relations for Dynamic Parametricity Second, in order to establish the parametricity properties of dynamic type generation, we employ Kripke logical relations, i.e., logical relations that are indexed by possible worlds.5 Kripke logical relations are appropriate when reasoning about properties that are true only under certain conditions, such as equivalence of modules with local mutable state. For instance, an imperative ADT might only behave according to its specification if its local data structures obey certain invariants. Possible worlds allow one to codify such local invariants on the machine store [18]. In our setting, the local invariant we want to establish is what a dynamically generated type name means. That is, we will use possible worlds to assign relational interpretations to dynamically generated type names. For example, consider the programs esem1 and esem2 from Section 2. We want to show they are logically related at ∃α. α × (α → α) × (α → bool) in an empty initial world w0 (i.e., under empty type stores). The proof proceeds roughly as follows. First, we evaluate the two programs. This will have the effect of generating a fresh type name α , with α ≈ int extending the type store of the first program and α ≈ bool extending the type store of the second program. At this point, we correspondingly extend the initial world w0 with a mapping from α to the relation R = {(1, true), (0, false)}, thus forming a new world w that specifies the semantic meaning of α .
We then consider types to be logically related if they are the same up to this bijection. For instance, in our running example, when extending w0 to w, we would not only extend its relational interpretation with α → (int, bool, R) but also extend its η with α → (α1 , α2 ). Thus, the type representations of the two existential packages, α1 and α2 , though syntactically distinct, would still be logically related under w.
4. A Logical Relation for G: Formal Details Figure 2 displays our step-indexed Kripke logical relation for G in full gory detail. It is easiest to understand this definition by making two passes over it. First, as the step indices have a way of infecting the whole definition in a superficially complex—but really very straightforward—way, we will first walk through the whole definition ignoring all occurrences of n’s and k’s (as well as auxiliary functions like the ·n operator). Second, we will pinpoint the few places where step indices actually play an important role in ensuring that the logical relation is inductively well-founded. 4.1 Highlights of the Logical Relation The first section of Figure 2 defines the kinds of semantic objects that are used in the construction of the logical relation. Relations R are sets of atoms, which are pairs of terms, e1 and e2 , indexed by a possible world w. The definition of Atom[τ1 , τ2 ] requires that e1 and e2 have the types τ1 and τ2 under the type stores w.σ1 and w.σ2 , respectively. (We use the dot notation w.σi to denote the i-th
5 In fact, step-indexed logical relations may already be understood as a special case of Kripke logical relations, in which the step index serves as the notion of possible world, and where n is a future world of m iff n ≤ m.
Atomn [τ1 , τ2 ] Reln [τ1 , τ2 ] SomeReln Interpn Conc Worldn (σ1 , σ2 , η, ρ)n ρn (τ1 , τ2 , R)n Rn R
Vn [[α]]ρ Vn [[b]]ρ Vn [[τ × τ ]]ρ
def
Vn [[τ → τ ]]ρ
def
= = def = def
=
def
Vn [[∀α.τ ]]ρ
=
def
Vn [[∃α.τ ]]ρ
=
def
En [[τ ]]ρ
=
def
= = def = def = def = def = def
def
{(k, w, e1 , e2 ) | k < n ∧ w ∈ Worldk ∧ w.σ1 ; e1 : τ1 ∧ w.σ2 ; e2 : τ2 } {R ⊆ Atomval n [τ1 , τ2 ] | ∀(k, w, v1 , v2 ) ∈ R. ∀(k , w ) (k, w). (k , w , v1 , v2 ) ∈ R} {r = (τ1 , τ2 , R) | fv(τ1 , τ2 ) = ∅ ∧ R ∈ Reln [τ1 , τ2 ]} fin {ρ ∈ TVar → SomeReln } fin {η ∈ TVar → TVar × TVar | ∀α, α ∈ dom(η). α = α ⇒ η 1 (α) = η 1 (α ) ∧ η 2 (α) = η 2 (α )} {w = (σ1 , σ2 , η, ρ) | σ1 ∧ σ2 ∧ η ∈ Conc ∧ ρ ∈ Interpn ∧ dom(η) = dom(ρ) ∧ ∀α ∈ dom(ρ). σ1 ρ1 (α) ≈ η 1 (α) ∧ σ2 ρ2 (α) ≈ η 2 (α)}
= = def = def =
(σ1 , σ2 , η, ρn ) {α→rn | ρ(α) = r} (τ1 , τ2 , Rn ) {(k, w, e1 , e2 ) ∈ R | k < n}
def
{(k, w, e1 , e2 ) | k = 0 ∨ (k − 1, wk−1 , e1 , e2 ) ∈ R}
def
=
(k , w ) (k, w)
⇔
η η ρ ρ
⇔ def ⇔
def
def
k ≤ k ∧ w ∈ Worldk ∧ w .η w.η ∧ w .ρ w.ρk ∧ ∀i ∈ {1, 2}. w .σi ⊇ w.σi ∧ rng(w .η i ) − rng(w.η i ) ⊆ dom(w .σi ) − dom(w.σi ) ∀α ∈ dom(η). η (α) = η(α) ∀α ∈ dom(ρ). ρ (α) = ρ(α)
ρ(α).Rn {(k, w, c, c) ∈ Atomn [b, b]} {(k, w, v1 , v1 , v2 , v2 ) ∈ Atomn [ρ1 (τ × τ ), ρ2 (τ × τ )] | (k, w, v1 , v2 ) ∈ Vn [[τ ]]ρ ∧ (k, w, v1 , v2 ) ∈ Vn [[τ ]]ρ} {(k, w, λx:τ1 .e1 , λx:τ2 .e2 ) ∈ Atomn [ρ1 (τ → τ ), ρ2 (τ → τ )] | ∀(k , w , v1 , v2 ) ∈ Vn [[τ ]]ρ. (k , w ) (k, w) ⇒ (k , w , e1 [v1 /x], e2 [v2 /x]) ∈ En [[τ ]]ρ} {(k, w, λα.e1 , λα.e2 ) ∈ Atomn [ρ1 (∀α.τ ), ρ2 (∀α.τ )] | ∀(k , w ) (k, w). ∀(τ1 , τ2 , r) ∈ Tk [[Ω]]w . (k , w , e1 [τ1 /α], e2 [τ2 /α]) ∈ En [[τ ]]ρ, α→r} {(k, w, pack τ1 , v1 , pack τ2 , v2 ) ∈ Atomn [ρ1 (∃α.τ ), ρ2 (∃α.τ )] | ∃r. (τ1 , τ2 , r) ∈ Tk [[Ω]]w ∧ (k, w, v1 , v2 ) ∈ Vn [[τ ]]ρ, α→r} {(k, w, e1 , e2 ) ∈ Atomn [ρ1 (τ ), ρ2 (τ )] | ∀j < k. ∀σ1 , v1 . (w.σ1 ; e1 →j σ1 ; v1 ) ⇒ ∃w , v2 . (k − j, w ) (k, w) ∧ w .σ1 = σ1 ∧ (w.σ2 ; e2 →∗ w .σ2 ; v2 ) ∧ (k − j, w , v1 , v2 ) ∈ Vn [[τ ]]ρ}
def
{(w.η 1 (τ ), w.η 2 (τ ), (w.ρ1 (τ ), w.ρ2 (τ ), Vn [[τ ]]w.ρ)) | fv(τ ) ⊆ dom(w.ρ)}
Gn [[]]ρ Gn [[Γ, x:τ ]]ρ
def
Dn [[]]w Dn [[Δ, α]]w
def
Dn [[Δ, α≈τ ]]w
def
{(k, w, ∅, ∅) | k < n ∧ w ∈ Worldk } {(k, w, (γ1 , x→v1 ), (γ2 , x→v2 )) | (k, w, γ1 , γ2 ) ∈ Gn [[Γ]]ρ ∧ (k, w, v1 , v2 ) ∈ Vn [[τ ]]ρ} {(∅, ∅, ∅)} {((δ1 , α→τ1 ), (δ2 , α→τ2 ), (ρ, α→r)) | (δ1 , δ2 , ρ) ∈ Dn [[Δ]]w ∧ (τ1 , τ2 , r) ∈ Tn [[Ω]]w} {((δ1 , α→β1 ), (δ2 , α→β2 ), (ρ, α→r)) | (δ1 , δ2 , ρ) ∈ Dn [[Δ]]w ∧ ∃α . w.ρ(α ) = r ∧ w.η(α ) = (β1 , β2 ) ∧ w.σ1 (β1 ) = δ1 (τ ) ∧ w.σ2 (β2 ) = δ2 (τ ) ∧ r.R = Vn [[τ ]]ρ}
Tn [[Ω]]w
= = =
def
= =
def
=
def
Δ; Γ e1 e2 : τ ⇔
Δ; Γ e1 : τ ∧ Δ; Γ e2 : τ ∧ ∀n ≥ 0. ∀w0 ∈ Worldn . ∀(δ1 , δ2 , ρ) ∈ Dn [[Δ]]w0 . ∀(k, w, γ1 , γ2 ) ∈ Gn [[Γ]]ρ. (k, w) (n, w0 ) ⇒ (k, w, δ1 γ1 (e1 ), δ2 γ2 (e2 )) ∈ En [[τ ]]ρ Figure 2. Logical Relation for G
type store component of w, and analogous notation for projecting out the other components of worlds.) Rel[τ1 , τ2 ] defines the set of admissible relations, which are permitted to be used as the semantic interpretations of abstract types. For our purposes, admissibility is simply monotonicity—i.e., closure under world extension. That is, if a relation in Rel relates
two values v1 and v2 under a world w, then the relation must relate those values in any future world of w. (We discuss the definition of world extension below.) Monotonicity is needed in order to ensure that we can extend worlds with interpretations of new dynamic type names, without interfering somehow with the interpretations of the old ones.
Worlds w are 4-tuples (σ1 , σ2 , η, ρ), which describe a set of assumptions under which pairs of terms are related. Here, σ1 and σ2 are the type stores under which the terms are typechecked and evaluated. The finite mappings η and ρ share a common domain, which can be understood as the set of abstract type names that have been generated dynamically. These “semantic” type names do not exist in either store σ1 or σ2 .6 Rather, they provide a way of referring to an abstract type that is represented by some type name α1 in σ1 and some type name α2 in σ2 . Thus, for each name α ∈ dom(η) = dom(ρ), the concretization η maps the “semantic” name α to a pair of “concrete” names from the stores σ1 and σ2 , respectively. (See the end of Section 3.3 for an example of such an η.) As the definition of Conc makes clear, distinct semantic type names must have distinct concretizations; consequently, η represents a partial bijection between σ1 and σ2 . The last component of the world w is ρ, which assigns relational interpretations to the aforementioned semantic type names. Formally, ρ maps each α to a triple r = (τ1 , τ2 , R), where R is a monotone relation between values of types τ1 and τ2 . (Again, see the end of Section 3.3 for an example of such a ρ.) The final condition in the definition of World stipulates that the closed syntactic types in the range of ρ and the concrete type names in the range of η are compatible. As a matter of notation, we will write η i and ρi to denote the type substitutions {α → αi | η(α) = (α1 , α2 )} and {α → τi | ρ(α) = (τ1 , τ2 , R)}, respectively. The second section of Figure 2 displays the definition of world extension. In order for w to extend w (written w w), it must be the case that (1) w specifies semantic interpretations for a superset of the type names that w interprets, (2) for the names that w interprets, w must interpret them in the same way, and (3) any new semantic type names that w interprets may only correspond to new concrete type names that did not exist in the stores of w. Although the third condition is not strictly necessary, we have found it to be useful when proving certain examples (e.g., the “order independence” example in Section 4.4). The last section of Figure 2 defines the logical relation itself. V [[τ ]]ρ is the logical relation for values, E[[τ ]]ρ is the one for terms, and T [[Ω]]w is the one for types as data, as described in Section 3 (here, Ω represents the kind of types). V [[τ ]]ρ relates values at the type τ , where the free type variables of τ are given relational interpretations by ρ. Ignoring the step indices, V [[τ ]]ρ is mostly very standard. For instance, at certain points (namely, in the → and ∀ cases), when we quantify over logically related (value or type) arguments, we must allow them to come from an arbitrary future world w in order to ensure monotonicity. This kind of quantification over future worlds is commonplace in Kripke logical relations. The only really interesting bit in the definition of V [[τ ]]ρ is the use of T [[Ω]]w to characterize when the two type arguments (resp. components) of a universal (resp. existential) are logically related. As explained in Section 3.3, we consider two types to be logically related in world w iff they are the same up to the partial bijection w.η. Formally, we define T [[Ω]]w as a relation on triples (τ1 , τ2 , r), where τ1 and τ2 are the two logically related types and r is a relation telling us how to relate values of those types. 
To be logically related means that τ1 and τ2 are the concretizations (according to w.η) of some “semantic” type τ . Correspondingly, r is the logical relation V [[τ ]]w.ρ at that semantic type. Thus, when we write E[[τ ]]ρ, α → r in the definition of V [[∀α.τ ]]ρ, this is roughly equivalent to writing E[[τ [τ /α]]]ρ (which our discussion in Section 3.2 might have led the reader to expect to see here instead). The reason for our present formulation is that E[[τ [τ /α]]]ρ is not quite right:
the free variables of τ are interpreted by ρ, but the free variables of τ are dynamic type names whose interpretations are given by w.ρ. It is possible to merge ρ and w.ρ into a unified interpretation ρ , but we feel our present approach is cleaner. Another point of note: since r is uniquely determined from τ1 and τ2 , it is not really necessary to include it in the T [[Ω]]w relation. However, as we shall see in Section 6, formulating the logical relation in this way has the benefit of isolating all of the nonparametricity of our logical relation in the definition of T [[Ω]]w. The term relation E[[τ ]]ρ is very similar to that in previous stepindexed Kripke logical relations [6]. Briefly, it says that two terms are related in an initial world w if whenever the first evaluates to a value under w.σ1 , the second evaluates to a value under w.σ2 , and the resulting stores and values are related in some future world w . The remainder of the definitions in Figure 2 serve to formalize a logical relation for open terms. G[[Γ]]ρ is the logical relation on value substitutions γ, which asserts that related γ’s must map variables in dom(Γ) to related values. D[[Δ]]w is the logical relation on type substitutions. It asserts that related δ’s must map variables in dom(Δ) to types that are related in w. For type variables α bound as α ≈ τ , the δ’s must map α to a type name whose semantic interpretation in w is precisely the logical relation at τ . Analogously to T [[Ω]]w, the relation D[[Δ]]w also includes a relational interpretation ρ, which may be uniquely determined from the δ’s. Finally, the open logical relation Δ; Γ e1 e2 : τ is defined in a fairly standard way. It says that for any starting world w0 , and any type substitutions δ1 and δ2 related in that world, if we are given related value substitutions γ1 and γ2 in any future world w, then δ1 γ1 e1 and δ2 γ2 e2 are related in w as well. 4.2 Why and Where the Steps Matter As we explained in Section 3.2, step indices play a critical role in making the logical relation well-founded. Essentially, whenever we run into an apparent circularity, we “go down a step” by defining an n-level property in terms of an (n−1)-level one. Of course, this trick only works if, at all such “stepping points”, the only way that an adversarial program context could possibly tell whether the nlevel property holds or not is by taking one step of computation and then checking whether the underlying (n−1)-level property holds. Fortunately, this is the case. Since worlds contain relations, and relations contain sets of tuples that include worlds, a na¨ıve construction of these objects would have an inconsistent cardinality. We thus stratify both worlds and relations by a step index: n-level worlds w ∈ Worldn contain n-level interpretations ρ ∈ Interpn , which map type variables to n-level relations; n-level relations R ∈ Reln [τ1 , τ2 ] only contain atoms indexed by a step level k < n and a world w ∈ Worldk . Although our possible worlds have a different structure than in previous work, the technique of mutual world and relation stratification is similar to that used in Ahmed’s thesis [2], as well as recent work by Ahmed, Dreyer and Rossberg [6]. Intuitively, the reason this works in our setting is as follows. Viewed as a judgment, our logical relation asserts that two terms e1 and e2 are logically related for k steps in a world w at a type τ under an interpretation ρ (whose domain contains the free type variables of τ ). 
Clearly, in order to handle the case where τ is just a type variable α, the relations r in the range of ρ must include atoms at step index k (i.e., the r’s must be in SomeRelk+1 ). But what about the relations in the range of w.ρ? Those relations only come into play in the universal and existential cases of the logical relation for values. Consider the existential case (the universal one is analogous). There, w.ρ pops up in the definition of the relation r that comes from Tk [[Ω]]w. However, that r is only needed in defining the relatedness of the values v1 and v2 at step level k−1 (note the definition of R in the second section of Figure 2). Con-
6 In fact, technically speaking, we consider dom(η) = dom(ρ) to be bound variables of the world w.
sequently, we only need r to include atoms at step k−1 and lower (i.e., r must be in SomeRelk ), so the world w from which r is derived need only be in Worldk . As this discussion suggests, it is imperative that we “go down a step” in the universal and existential cases of the logical relation. For the other cases, it is not necessary to go down a step, although we have the option of doing so. For example, we could define k-level relatedness at pair type τ1 × τ2 in terms of (k−1)-level relatedness at τ1 and τ2 . But since the type gets smaller, there is no need to. For clarity, we have only gone down a step in the logical relation at the points where it is absolutely necessary, and we have used the notation to underscore those points.
4.3 Key Properties

The main results concerning our logical relation are as follows:

Theorem 4.1 (Fundamental Property for ≾)
If Δ; Γ ⊢ e : τ, then Δ; Γ ⊢ e ≾ e : τ.

Theorem 4.2 (Soundness of ≾ w.r.t. Contextual Approximation)
If Δ; Γ ⊢ e1 ≾ e2 : τ, then Δ; Γ ⊢ e1 ≼ e2 : τ.

These theorems establish that our logical relation provides a sound technique for proving contextual equivalence of G programs. The proofs of these theorems rely on many technical lemmas, most of which are standard and straightforward to prove. We highlight a few of them here, and refer the reader to the expanded version of this paper for full details of the proofs [16]. One key lemma we have mentioned already is the monotonicity lemma, which states that the logical relation for values is closed under world extension, and therefore belongs to the Rel class of relations. Another key lemma is transitivity of world extension. There is also a group of lemmas—Pitts terms them compatibility lemmas [17]—which show that the logical relation is a precongruence with respect to the constructs of the G language. Of particular note among these are the ones for cast and new. For cast, we must show that cast τ1 τ2 is logically related to itself under a type context Δ, assuming that τ1 and τ2 are well-formed in Δ. This boils down to showing that, for logically related type substitutions δ1 and δ2, it is the case that δ1τ1 = δ1τ2 if and only if δ2τ1 = δ2τ2. This follows easily from the fact that δ1 and δ2, by virtue of being logically related, map the variables in dom(Δ) to types that are syntactically identical up to some bijection on type names. For new, we must show that, if Δ, α≈τ; Γ ⊢ e1 ≾ e2 : τ′, then Δ; Γ ⊢ new α≈τ in e1 ≾ new α≈τ in e2 : τ′ (assuming Δ ⊢ Γ and Δ ⊢ τ). The proof involves extending the η and ρ components of some given initial world w0 with bindings for the fresh dynamically-generated type name α. The η is extended with α ↦ (α1, α2), where α1 and α2 are the concrete fresh names that are chosen when evaluating the left and right new expressions. The ρ is extended so that the relational interpretation of α is simply the logical relation at type τ. The proof of this lemma is highly reminiscent of the proof of compatibility for ref (reference allocation) in a language with mutable references [6]. Finally, another important compatibility property is type compatibility, i.e., that if Δ ⊢ τ1 ≈ τ2 and (δ1, δ2, ρ) ∈ Dn[[Δ]]w, then Vn[[τ1]]ρ = Vn[[τ2]]ρ and En[[τ1]]ρ = En[[τ2]]ρ. The interesting case is when τ1 is a variable α bound in Δ as α ≈ τ2, and the result in this case follows easily from the definition of D[[Δ, α ≈ τ]]w.

4.4 Examples

Semaphore. We now return to our semaphore example from Section 2 and show how to prove representation independence for the two different implementations esem1 and esem2. Recall that the former uses int, the latter bool. To show that they are contextually equivalent, it suffices by Soundness to show that each logically approximates the other. We prove only one direction, namely ⊢ esem1 ≾ esem2 : τsem; the other is proven analogously. Expanding the definitions, we need to show (k, w, esem1, esem2) ∈ En[[τsem]]∅. Note how each term generates a fresh type name αi in one step, resulting in a package value. Hence all we need to do is come up with a world w′ satisfying
• (k − 1, w′) ⊒ (k, w),
• w′.σ1 = w.σ1, α1≈int and w′.σ2 = w.σ2, α2≈bool,
• (k − 1, w′, pack ⟨α1, v1⟩, pack ⟨α2, v2⟩) ∈ Vn[[τsem]]∅,
where vi is the term component of esemi's implementation. We construct w′ by extending w with mappings that establish the relation between the new type names:
R := {(k′, w′′, vint, vbool) ∈ Atomval k−1 [int, bool] | (vint, vbool) = (1, true) ∨ (vint, vbool) = (0, false)}
r := (int, bool, R)
w′ := w|k−1, (α1≈int, α2≈bool, α ↦ (α1, α2), α ↦ r)
The first two conditions above are satisfied by construction. To show that the packages are related we need to show the existence of an r′ with (α1, α2, r′) ∈ Tk−1[[Ω]]w′ such that (k − 2, w′|k−2, v1, v2) ∈ Vn[[τ′sem]]ρ, α↦r′, where τ′sem = α × (α → α) × (α → bool). Since αi = w′.ηi(α), r′ must be (int, bool, Vk−1[[α]]w′.ρ) by definition of T[[Ω]]. Of course, we defined w′ the way we did so that this r′ is exactly r. The proof of (k − 2, w′|k−2, v1, v2) ∈ Vn[[τ′sem]]ρ, α↦r decomposes into three parts, following the structure of τ′sem:
1. (k − 2, w′|k−2, 1, true) ∈ Vn[[α]]ρ, α↦r
   This holds because Vn[[α]]ρ, α↦r = R.
2. (k − 2, w′|k−2, λx:int.(1 − x), λx:bool.¬x) ∈ Vn[[α → α]]ρ, α↦r
   • Suppose we are given related arguments in a future world: (k′, w′′, v1, v2) ∈ Vn[[α]]ρ, α↦r = R.
   • Hence either (v1, v2) = (1, true) or (v1, v2) = (0, false).
   • Consequently, 1 − v1 and ¬v2 will evaluate in one step, without effects, to values again related by R.
   • In other words, (k′, w′′, 1 − v1, ¬v2) ∈ En[[α]]ρ, α↦r.
3. (k − 2, w′|k−2, λx.(x = 0), λx.x) ∈ Vn[[α → bool]]ρ, α↦r
   Like in the previous part, the arguments v1 and v2 will be related by R in some future (k′, w′′). Therefore v1 = 0 will reduce in one step without effects to v2, which already is a value. Because of the definition of the logical relation at type bool, this implies (k′, w′′, v1 = 0, v2) ∈ En[[bool]]ρ, α↦r.

Partly Benign Effects. When side effects are introduced into a pure language, they often falsify various equational laws concerning repeatability and order independence of computations. In this section, we offer some evidence that the effect of dynamic type generation is partly benign in that it does not invalidate some of these equational laws. First, consider the following functions:
v1 := λx:(unit → τ). let x′ = x () in x ()
v2 := λx:(unit → τ). x ()
The only difference between v1 and v2 is whether the argument x is applied once or twice. Intuitively, either x () diverges, in which case both programs diverge, or else the first application of x terminates, in which case so should the second.
Second, consider the following functions:
v1′ := λx:(unit → τ). λy:(unit → τ). let y′ = y () in ⟨x (), y′⟩
v2′ := λx:(unit → τ). λy:(unit → τ). ⟨x (), y ()⟩
The only difference between v1′ and v2′ is the order in which they call their argument callbacks x and y. Those calls may both result in the generation of fresh type names, but the order in which the names are generated should not matter. Using our logical relation, we can prove that v1 and v2 are contextually equivalent, and so are v1′ and v2′. (Due to space considerations, we refer the interested reader to the expanded version of this paper for full proof details [16].) However, as we shall see in the example of e1 and e2 in the next section, our G language does not enjoy referential transparency. This is to be expected, of course, since new is an effectful operation and (in-)equality of type names is observable in the language.

Wr±τ(e) ≝ let x = e in Wr±τ(x)   (if e not a value)
Wr±α(v) ≝ v
Wr±b(v) ≝ v
Wr±τ1×τ2(v) ≝ ⟨Wr±τ1(v.1), Wr±τ2(v.2)⟩
Wr±τ1→τ2(v) ≝ λx1:τ1. Wr±τ2(v Wr∓τ1(x1))
Wr±∀α.τ(v) ≝ λα. new∓ α in Wr±τ(v α)
Wr±∃α.τ(v) ≝ unpack ⟨α, x⟩ = v in new± α in pack ⟨α, Wr±τ(x)⟩ as ∃α.τ
new+ α in e ≝ new α′ ≈ α in e[α′/α]
new− α in e ≝ e
Figure 3. Wrapping Figure 3 defines a pair of wrapping operators that correspond to these two dual requirements: Wr+ protects an expression e : τe from being used in a non-parametric way, by inserting fresh names for each existential quantifier. Dually, Wr− forces e to behave parametrically by creating a fresh name for each polymorphic instantiation. The definitions extend to other types in the usual functorial manner. Both definitions are interdependent, because roles switch for function arguments. These operators are similar to the typedirected translation that Sumii and Pierce suggest for establishing type abstraction in an untyped language [27] (they propose the descriptive terms “firewall” for Wr+ , and “sandbox” for Wr− ). However, their use of dynamic sealing instead of type generation results in the insertion of runtime coercions to seal/unseal each individual value of abstract type, while our wrapping leaves such values alone. Given these operators, we can go back to our semaphore example: esem1 can now be obtained as Wr+ τsem (esem ) (modulo some harmless η-expansions). This generalises to any ADT: wrapping its implementation positively will guarantee abstraction by making it parametric. We prove that in the next section. Positive wrapping is reminiscent of module sealing (or opaque signature ascription) in ML-style module languages. If we view e as a module and its type τe as a signature, then Wr+ τe (e) corresponds to the sealing operation e :> τe . While module sealing typically only performs static abstraction, wrapping describes the dynamic equivalent [22]. In fact, positive wrapping is precisely how sealing is implemented in Alice ML [23], where the module language is non-parametric otherwise. The correspondence to module sealing motivates our treatment of existential types. Notice that Wr+ causes a fresh type name to be created only once for each existentially quantified type—that is, corresponding to each existential introduction. Another option would be to generate type names with each existential elimination. In fact, such a semantics would arise naturally were we to use a Church encoding of existentials in conjunction with our wrapping for universals. However, in such a semantics, unpacking an existential value twice would have the effect of producing two distinct abstract types. While this corresponds intuitively to the “generativity” of unpack in System F, it is undesirable in the context of dynamic, first-class modules. In particular, in order for an abstract type t defined by some dynamic module M to have some permanent identity (so that it can be referenced by other dynamic modules), it is important that each unpacking of M yields a handle to the same name for t. Moreover, as we show in the next section, our approach to defining wrapping is sufficient to ensure abstraction safety.
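To make this concrete, here is how the positive wrapping unfolds at the semaphore type τsem = ∃α. α × (α → α) × (α → bool). This is only a sketch derived from the definitions in Figure 3, writing x.1, x.2, x.3 informally for the three components of the package body and treating the ternary product component-wise:

Wr+τsem(v) = unpack ⟨α, x⟩ = v in new α′ ≈ α in
             pack ⟨α′, ⟨x.1, λy:α′. x.2 y, λy:α′. x.3 y⟩⟩ as τsem

Applied to esem, this yields esem1 up to the η-expansion of the two functions, as claimed above.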
5. Wrapping We have seen that parametricity can be re-established in G by introducing name generation in the right place. But what is the “right place” in general? That is, given an arbritrary expression e with polymorphic type τe , how can we systematically transform it into an expression e of the same type τe that is parametric? One obvious—but unfortunately bogus—idea is the following: transform e such that every existential introduction and every universal elimination creates a fresh name for the respective witness or instance type. Formally, apply the following rewrite rules to e: pack τ, e as τ new α≈τ in pack α, e as τ eτ new α≈τ in e α Obviously, this would make every quantified type abstract, so that any cast that tries to inspect it would fail. Or would it? Perhaps surprisingly, the answer is no. To see why, consider the following expressions of type (∃α.τ ) × (∃α.τ ): e1 := let x = pack τ, v in x, x e2 := pack τ, v, pack τ, v They are clearly equivalent in a parametric language (and in fact they are even equivalent in G). Yet rewriting yields: e1 := let x = (new α≈τ in pack α, v) in x, x e2 := new α≈τ in pack α, v, new α≈τ in pack α, v The resulting expressions are not equivalent anymore, because they perform different effects. Here is one distinguishing context: let p = [ ] in unpack α1 , x1 = p.1 in unpack α2 , x2 = p.2 in equal? α1 α2 Although the representation type τ is not disclosed as such, sharing between the two abstract types in e1 is. In a parametric language, that would not be possible. In order to introduce effects uniformly, and to hide internal sharing, the transformation we are looking for needs to be defined on the structure of types, not terms. Roughly, for each quantifier occurring in τe we need to generate one fresh type name. That is, instead of transforming e itself, we simply wrap it with some expression that introduces the necessary names at the boundary, by induction on the type τe . In fact, we can refine the problem further. When looking at a G expression e, what do we actually mean by “making it parametric”? We can mean two different things: either ensuring that e behaves parametrically, or dually, that any context treats e parametrically. In the former case, we are protecting the context against e, in the latter we protect e against malicious contexts. The latter is what is sometimes referred to as abstraction safety.
6. Parametric Reasoning The logical relation developed in Section 4 enables us to do nonparametric reasoning about equivalence of G programs. It also
Tn◦[[Ω]]w ≝ {(τ1, τ2, (τ1′, τ2′, R)) | ⊢ τi ∧ w.σi ⊢ τi ≈ τi′ ∧ R ∈ Reln[τ1′, τ2′]}
(everything else as in Figure 2)
Figure 4. Parametric Logical Relation
enables us to do parametric reasoning, but only indirectly: we have to explicitly deal with the effects of new and to define worlds containing relations between type names. It would be preferable if we were able to do parametric reasoning directly. For example, given two expressions e1 , e2 that do not use casts, and assuming that the context does not do so either, we should be able to reason about equivalence of e1 and e2 in a manner similar to what we do when reasoning about System F.
6.2 Examples
Semaphore. Consider our running example of the semaphore module again. Using the parametric relation, we can prove that the two implementations are related without actually reasoning about type generation. That aspect is covered once and for all by the Wrapping Theorem. Recall the two implementations, here given in unwrapped form:
esem1 := pack ⟨int, ⟨1, λx:int.(1 − x), λx:int.(x = 0)⟩⟩ as τsem
esem2 := pack ⟨bool, ⟨true, λx:bool.¬x, λx:bool.x⟩⟩ as τsem
6.1 A Parametric Logical Relation
We can prove ⊢ esem1 ≾◦ esem2 : τsem using conventional parametric reasoning about polymorphic terms. Now define e′sem1 = Wr+τsem(esem1) and e′sem2 = Wr+τsem(esem2), which are semantically equivalent to the original definitions in Section 2.3. The Wrapping Theorem then immediately tells us that ⊢ e′sem1 ≾ e′sem2 : τsem.
Thanks to the modular formulation of our logical relation in Figure 2, it is easy to modify it so that it becomes parametric. All we need to do is swap out the definition of T [[Ω]]w, which relates types as data. Figure 4 gives an alternative definition that allows choosing an arbitrary relation between arbitrary types. Everything else stays exactly the same. We decorate the set of parametric logical relations thus obtained with ◦ (i.e., V ◦ , E ◦ , etc.) to distinguish them from the original ones. Likewise, we write ◦ for the notion of parametric logical approximation defined as in Figure 2 but in terms of the parametric relations. For clarity, we will refer to the original definition as the non-parametric logical relation. This modification gives us a seemingly parametric definition of logical approximation for G terms. But what does that actually mean? What is the relation between parametric and non-parametric logical approximation and, ultimately, contextual approximation? Since the language is not parametric, clearly, parametrically equivalent terms generally are not contextually equivalent. The answer is given by the wrapping functions we defined in the previous section. The following theorem connects the two notions of logical relation and approximation that we have introduced:
A Free Theorem. We can use the parametric relation for proving free theorems [30] in G. For example, for any g : ∀α.α → α in G it holds that Wr−(g) either diverges for all possible arguments τ and v : τ, or it returns v in all cases. We first apply the Fundamental Property for ≾ to relate g to itself in E, then transfer this to E◦ for Wr−(g) using the Wrapping Theorem. From there the proof proceeds in the usual way.
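To see where the negative wrapping enters, it helps to unfold it at this type. The following is a sketch obtained from the definitions in Figure 3 (eliding the let-binding that the figure's first clause inserts for non-values):

Wr−∀α.α→α(g) = λα. new α′ ≈ α in λx:α′. g α′ x

Every instantiation of the wrapped g thus happens at a freshly generated name α′, so its behavior cannot depend on the concrete type it is instantiated with; this is exactly what the free theorem exploits.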
7. Syntactic vs. Semantic Parametricity The primary motivation for our parametric relation in the previous section was to enable more direct parametric reasoning about the result of (positively) wrapping System F terms. However, it is also possible to use our parametric relation to reason about terms that are syntactically, or intensionally, non-parametric (i.e., that use cast’s), so long as they are semantically, or extensionally, parametric (i.e., the use of cast is not externally observable). For example, consider the following two polymorphic functions of type ∀α.τα (here, let b2i = λx:bool. if x then 1 else 0):
Theorem 6.1 (Wrapping for ≾◦)
1. If ⊢ e1 ≾◦ e2 : τ, then ⊢ Wr+τ(e1) ≾ Wr+τ(e2) : τ.
2. If ⊢ e1 ≾ e2 : τ, then ⊢ Wr−τ(e1) ≾◦ Wr−τ(e2) : τ.
τα := ∃β. (α × α → β) × (β → α) × (β → α) g1 := λα. pack α × α, λp.p, λx.(x.1), λx.(x.2) as τα g2 := λα. cast τbool τα (pack int, λp:(bool × bool). b2i(p.1) + 2×b2i (p.2), λx:int. x mod 2 = 0, λx:int. x div 2 = 0 as τbool ) (g1 α)
This theorem justifies the definition of the parametric logical relation. At the same time it can be read as a correctness result for the wrapping operators: it says that whenever we can relate two terms using parametric reasoning, then the positive wrappings of the first term contextually approximates the positive wrapping of the second. Dually, once any properly related terms are wrapped negatively, they can safely be passed to any term that depends on its context behaving parametrically. What can we say about the content of the parametric relation? Obviously, it cannot contain arbitrary non-parametric G terms— e.g., cast τ1 τ2 is not even related to itself in E ◦ . However, we still obtain the following restricted form of the fundamental property:
These two functions take a type argument α and return a simple generic ADT for pairs over α. But g2 is more clever about it and specializes the representation for α = bool. In that case, it packs both components into the two least significant bits of a single integer. For all other types, g2 falls back to the generic implementation from g1 . Using the parametric relation, we will be able to show that Wr+ (g1 ) Wr+ (g2 ) : ∀α.τα . One might find this surprising, since g2 is syntactically non-parametric, returning different implementations for different instantiations of its type argument. However, since the two possible implementations g2 returns are extensionally equivalent to each other, g2 is semantically indistinguishable from the syntactically parametric g1 . Formally: Assume that τ1 , τ2 are the types and Rα ∈ Rel[τ1 , τ2 ] is the relation the context picks, parametrically, for α. If τ2 = bool, the rest of the proof is straightforward. Otherwise, we do not know
Theorem 6.2 (Fundamental Property for ≾◦)
If Δ; Γ ⊢ e : τ and e is cast-free, then Δ; Γ ⊢ e ≾◦ e : τ.
In particular, this implies that any well-typed System F term is parametrically related to itself. The relation will also contain terms with cast, but only if the use of cast does not violate parametricity. (We discuss this further in Section 7.) Along the same lines, we can show that our parametric logical relation is sound w.r.t. contextual approximation, if the definition of the latter is limited to quantifying only over cast-free contexts.
Vn±[[α]]ρ ≝ ρ(α).Rn
Vn±[[b]]ρ ≝ {(k, w, c, c) ∈ Atomn[b, b]}
Vn±[[τ × τ′]]ρ ≝ {(k, w, ⟨v1, v1′⟩, ⟨v2, v2′⟩) ∈ Atomn[ρ1(τ × τ′), ρ2(τ × τ′)] |
    (k, w, v1, v2) ∈ Vn±[[τ]]ρ ∧ (k, w, v1′, v2′) ∈ Vn±[[τ′]]ρ}
Vn±[[τ → τ′]]ρ ≝ {(k, w, λx:τ1.e1, λx:τ2.e2) ∈ Atomn[ρ1(τ → τ′), ρ2(τ → τ′)] |
    ∀(k′, w′, v1, v2) ∈ Vn∓[[τ]]ρ. (k′, w′) ⊒ (k, w) ⇒ (k′, w′, e1[v1/x], e2[v2/x]) ∈ En±[[τ′]]ρ}
Vn±[[∀α.τ]]ρ ≝ {(k, w, λα.e1, λα.e2) ∈ Atomn[ρ1(∀α.τ), ρ2(∀α.τ)] |
    ∀(k′, w′) ⊒ (k, w). ∀(τ1, τ2, r) ∈ Tk′∓[[Ω]]w′. (k′, w′, e1[τ1/α], e2[τ2/α]) ∈ En±[[τ]]ρ, α↦r}
Vn±[[∃α.τ]]ρ ≝ {(k, w, pack ⟨τ1, v1⟩, pack ⟨τ2, v2⟩) ∈ Atomn[ρ1(∃α.τ), ρ2(∃α.τ)] |
    ∃r. (τ1, τ2, r) ∈ Tk±[[Ω]]w ∧ (k, w, v1, v2) ∈ Vn±[[τ]]ρ, α↦r}
En±[[τ]]ρ ≝ {(k, w, e1, e2) ∈ Atomn[ρ1(τ), ρ2(τ)] |
    ∀j < k. ∀σ1′, v1. (w.σ1; e1 →j σ1′; v1) ⇒
    ∃w′, v2. (k − j, w′) ⊒ (k, w) ∧ w′.σ1 = σ1′ ∧ (w.σ2; e2 →∗ w′.σ2; v2) ∧ (k − j, w′, v1, v2) ∈ Vn±[[τ]]ρ}
Tn+[[Ω]]w ≝ Tn◦[[Ω]]w        Tn−[[Ω]]w ≝ Tn[[Ω]]w
Dn+[[Δ]]w ≝ Dn◦[[Δ]]w        Dn−[[Δ]]w ≝ Dn[[Δ]]w
Δ; Γ ⊢ e1 ≾± e2 : τ ⇔
    Δ; Γ ⊢ e1 : τ ∧ Δ; Γ ⊢ e2 : τ ∧ ∀n ≥ 0. ∀w0 ∈ Worldn. ∀(δ1, δ2, ρ) ∈ Dn∓[[Δ]]w0.
    ∀(k, w, γ1, γ2) ∈ Gn∓[[Γ]]ρ. (k, w) ⊒ (n, w0) ⇒ (k, w, δ1γ1(e1), δ2γ2(e2)) ∈ En±[[τ]]ρ
Figure 5. Polarized Logical Relations
anything about τ1 and Rα , because τ1 and τ2 are related in T ◦ . Nevertheless, we can construct a suitable relational interpretation Rβ ∈ Rel[τ1 × τ1 , int] for the type β:
Here is a somewhat contrived example to illustrate the point. Consider the following two polymorphic functions of type ∀α.τα : τα := ∃β. (α → β) × (β → α) f1 := λα. cast τint τα (pack int, λx:int.x+1, λx:int.x as τint ) (pack α, λx:α.x, λx:α.x as τα ) f2 := λα. cast τint τα (pack int, λx:int.x, λx:int.x+1 as τint ) (pack α, λx:α.x, λx:α.x as τα )
Rβ := {(k, w, ⟨v, v′⟩, 0) | (k, w, v, false), (k, w, v′, false) ∈ Rα}
    ∪ {(k, w, ⟨v, v′⟩, 1) | (k, w, v, true), (k, w, v′, false) ∈ Rα}
    ∪ {(k, w, ⟨v, v′⟩, 2) | (k, w, v, false), (k, w, v′, true) ∈ Rα}
    ∪ {(k, w, ⟨v, v′⟩, 3) | (k, w, v, true), (k, w, v′, true) ∈ Rα}
As it turns out, we do not need to know much about the structure of Rα to define Rβ. What we are relying on here is only the knowledge that all values in Rα are well-typed, which is built into our definition of Rel. From that we know that there can never be any other value than true or false on the right side of the relation Rα. Hence we can still enumerate all possible cases to define Rβ, and do a respective case distinction when proving equivalence of the projection operations. Interestingly, it seems that our proof relies critically on the fact that our logical relations are restricted to syntactically well-typed terms. Were we to lift this restriction, we would be forced (it seems) to extend the definition of Rβ with a "junk" case, but the calls to b2i in g2 would get stuck if applied to non-boolean values. We leave further investigation of this observation to future work.
These functions take a type argument α and return a simple ADT β. Values of type α can be injected into β, and projected out again. However, both functions specialize the behavior of this ADT for type int—for integers, injecting n and projecting again will give back not n, but rather n + 1. This is true for both functions, but they implement it in a different way. We want to prove that both implementations are equivalent under wrapping using a form of parametric reasoning. However, we cannot do that using the parametric relation from the previous section—since the functions do not behave parametrically (i.e., they return observably different packages for different instantiations of their type argument), they will not be related in E ◦ . To support that kind of reasoning, we need a more refined treatment of parametricity in the logical relation. The idea is to separate the two aforementioned aspects of parametricity. Consequently, we are going to have a pair of separate relations, E + and E − . The former enforces parametric usage, the latter parametric behavior. Figure 5 gives the definition of these relations. We call them polarized, because they are mutually dependent and the polarity (+ or −) switches for contravariant positions, i.e., for function arguments and for universal quantifiers. Intuitively, in these places, term and context switch roles. Except for the consistent addition of polarities, the definition of the polarized relations again only represents a minor modification of the original one.7 We merely refine the definition of the type re-
8. Polarized Logical Relations The parametric relation is useful for proving parametricity properties about (the positive wrappings of) G terms. However, it is all-ornothing: it can only be used to prove parametricity for terms that expect to be treated parametrically and also behave parametrically— cf. the two dual aspects of parametricity described in Section 5. We might also be interested in proving representation independence for terms that do not behave parametrically themselves (in either the syntactic or semantic sense considered in the previous section). One situation where this might show up is if we want to show representation independence for generic ADTs that (like the ones in Section 7) return different results for different instantiations of their type arguments, but where (unlike in Section 7) the difference is not only syntactic but also semantic.
7 In fact, all four relations can easily be formulated in a single unified definition indexed by ι ::= · | ◦ | + | −. We refrained from doing so here for the sake of clarity; see the expanded version of this paper for details [16].
If τ1 = int, then we know from the definition of T− that τ2 = int, too. We hence know that both sides will evaluate to the specialized version of the ADT. Since we are in E+, we get to pick some (τ1′, τ2′, r′) ∈ T+[[Ω]]w as the interpretation of β, where the choice of r′ is up to us. The natural choice is to use τ1′ = τ2′ = int with the relation r′ = (int, int, {(k, w, n + 1, n) | n ∈ Z}). The rest of the proof is then straightforward. If τ1 ≠ int, we similarly know that τ2 ≠ int from the definition of T−. Hence, both sides use the default implementations, which are trivially related in E+, thanks to Corollary 8.3. Finally, applying the Wrapping Theorem 8.1, we can conclude that ⊢ Wr+(f1) ≾ Wr+(f2) : ∀α.τα, and hence by Soundness, ⊢ Wr+(f1) ≼ Wr+(f2) : ∀α.τα. Note how we relied on the knowledge that τ1 and τ2 can only be int at the same time. This holds for types related in T− but not in T+ or T◦. If we had tried to do this proof in E◦, the types τ1 and τ2 would have been related by T◦ only, which would give us too little information to proceed with the necessary case distinction.
Figure 6. Relating the Relations (diagram): unlabeled arrows denote inclusions among E−, E, E◦, and E+; arrows labeled Wr+ or Wr− denote the wrapping that maps one relation to the other; the ∈-annotations indicate which class of terms (eG or eF) inhabits each relation.
lation T[[Ω]]w to distinguish polarity: in the positive case it behaves parametrically (i.e., allowing an arbitrary relation) and in the negative case non-parametrically (i.e., demanding r be the logical relation at some type). Thus, existential types behave parametrically in E+ but non-parametrically in E−, and vice versa for universals.
9. Recursive Types
8.1 Key Properties
We now add iso-recursive types to G and call the result Gμ :
The way in which polarities switch in the polarized relations mirrors what is going on in the definition of wrapping. That of course is no accident, and we can show the following theorem that relates the polarized relations with the non-parametric and parametric ones through uses of wrapping:
Theorem 8.1 (Wrapping for ≾±)
1. If ⊢ e1 ≾+ e2 : τ, then ⊢ Wr+τ(e1) ≾ Wr+τ(e2) : τ.
2. If ⊢ e1 ≾ e2 : τ, then ⊢ Wr−τ(e1) ≾− Wr−τ(e2) : τ.
3. If ⊢ e1 ≾+ e2 : τ, then ⊢ Wr−τ(e1) ≾◦ Wr−τ(e2) : τ.
4. If ⊢ e1 ≾◦ e2 : τ, then ⊢ Wr+τ(e1) ≾− Wr+τ(e2) : τ.

Types   τ ::= . . . | μα.τ
Values  v ::= . . . | roll v as τ
Terms   e ::= . . . | roll e as τ | unroll e

The extensions to the semantics are standard and therefore omitted—they do not affect the type store. Also, the definition of contextual equivalence does not change (except there are more contexts).
9.1 Extending the Logical Relations The step-indexing that we used in defining our logical relations makes it very easy to adapt them to Gμ . There are two natural ways in which we could define the value relation at a recursive type:
Moreover, we can show that the inverse directions of these implications require no wrapping at all:
1. Vnι[[μα.τ]]ρ ≝ {(k, w, roll v1, roll v2) ∈ Atomn[. . .] | (k, w, v1, v2) ∈ Vkι[[τ]]ρ, α↦Vkι[[μα.τ]]ρ}
2. Vnι[[μα.τ]]ρ ≝ {(k, w, roll v1, roll v2) ∈ Atomn[. . .] | (k, w, v1, v2) ∈ Vkι[[τ[μα.τ/α]]]ρ}
Theorem 8.2 (Inclusion for ≾±)
1. If e1 ≾ e2 : τ or e1 ≾◦ e2 : τ, then e1 ≾+ e2 : τ.
2. If e1 ≾− e2 : τ, then e1 ≾ e2 : τ and e1 ≾◦ e2 : τ.
For ι ∈ {, ◦}—i.e., for the non-parametric and parametric forms of the logical relation—the above two formulations are equivalent due to the validity of a standard substitution property. Unfortunately, though, we do not have such a property for the polarized relation. In fact, for ι ∈ {+, −}, the first definition wrongly records a fixed polarity for α. It is thus crucial that we choose the second one; only then do all key properties continue to hold in Gμ .
This theorem can equivalently be stated as: E− ⊆ E ⊆ E+ and E− ⊆ E◦ ⊆ E+. Note that Theorem 6.1 follows directly from Theorems 8.1 and 8.2. Similarly, the following property follows from Theorem 8.2 together with Theorem 4.1:
Corollary 8.3 (Fundamental Property for ≾+)
If Δ; Γ ⊢ e : τ, then Δ; Γ ⊢ e ≾+ e : τ.
9.2 Extending the Wrapping
Interestingly, compatibility does not hold for ± (consider the polarities in the rule for application), which has the consequence that we cannot show Corollary 8.3 directly. For a similar reason, we cannot show any such property for − at all. Figure 6 depicts all of the above properties in a single diagram. Unlabeled arrows denote inclusion, while labeled arrows denote the wrapping that maps one relation to the other. The ∈-operators show the fundamental properties for the respective relations, i.e., which class of terms are included (G terms or F terms).
How can we upgrade the wrapping to account for recursive types? Given an argument of type μα.τ , the basic idea is to first unfold it to type τ [μα.τ /α], then wrap it at that type, and finally fold the result back to type μα.τ . Of course, since τ [μα.τ /α] may be larger than μα.τ , a direct implementation of this idea will not result in a well-founded definition. The solution is to use a fixed-point (definable in terms of recursive types, of course), which gives us a handle on the wrapping function we are in the middle of defining. Figure 7 shows the new definition. We first index the wrapping by an environment ϕ that maps recursive type variables α to wrappings for those variables. Roughly, the wrapping at type μα.τ under environment ϕ is a recursive function F , defined in terms of the wrapping at type τ under environment ϕ, α → F . Since the bound variable of a recursive type may occur in positions of different polarity, we actually need two mutually recursive functions and then select the right one depending on the polarity. The cognoscenti will recognize this as a
8.2 Example
Getting back to our motivating example from the beginning of the section, it is essentially straightforward to prove that ⊢ f1 ≾+ f2 : ∀α.τα. The proof proceeds as usual, except that we have to make a case distinction when we want to show that the function bodies are related in E+. At that point, we are given a triple (τ1, τ2, r) ∈ T−[[Ω]]w.
relies on parametricity (especially for existential types) to hide an ADT’s representation from clients [15]. The latter approach is typically employed in untyped languages, which do not have the ability to place static restrictions on clients. Consequently, data hiding has to be enforced on the level of individual values. For that, languages provide means for generating unique names and using them as keys for dynamically sealing values. A value sealed by a given key can only be inspected by principals that have access to the key [27]. Dynamic type generation as we employ it [21, 29, 22] can be seen as a middle ground, because it bears resemblance to both approaches. As in the dynamic approach, we cannot rely on parametricity and instead generate dynamic names to protect abstractions. However, these are type-level names, not term-level names, and they only “seal” type information. In particular, individual values of abstract type are still directly represented by the underlying representation type, so that crossing abstraction boundaries has no runtime cost. In that sense, we are closer to the static approach. Another approach to reconciling type abstraction and type analysis has been proposed by Washburn and Weirich [31]. They introduce a type system that tracks information flow for terms and types-as-data. By distinguishing security levels, the type system can statically prevent unauthorized inspection of types by clients.
Wr±α;ϕ(v) ≝ v                  (if α ∉ dom(ϕ))
Wr±α;ϕ(v) ≝ ϕ±(α) v            (if α ∈ dom(ϕ))
Wr±μα.τ;ϕ(v) ≝ letrec f+ = λx. roll (Wr+τ;ϕ′(unroll x)[μα.τ/α])
               and f− = λx. roll (Wr−τ;ϕ′(unroll x)[μα.τ/α])
               in f± v          (where ϕ′ = ϕ, α↦(f+, f−))
(other cases as before, except for the consistent addition of ϕ)
Figure 7. Wrapping for Gμ

polarized variant of the so-called syntactic projection function associated with a recursive type [8]. Note that the environment only plays a role for recursive types, and that for any τ that does not involve recursive types, Wr±τ;∅ is the same as our old wrapping Wr±τ from Section 5. Taking Wr±τ to be shorthand for Wr±τ;∅, all our old wrapping theorems for G continue to hold for Gμ. Full proofs of these theorems are given in the expanded version of this paper [16].
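As a small illustration of the fixed-point (our own example, not one from the paper), instantiating the definition above at the recursive type σ = μα. int × α gives:

Wr±σ;∅(v) = letrec f+ = λx. roll ⟨(unroll x).1, f+ ((unroll x).2)⟩
            and f− = λx. roll ⟨(unroll x).1, f− ((unroll x).2)⟩
            in f± v

Since base types and the bound variable α are wrapped by the identity (the latter via the environment ϕ), the resulting wrapper is just a deep traversal; fresh names are only generated when a quantifier is reached.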
10. Towards Full Abstraction
Multi-Language Interoperation. The closest work to ours is that of Matthews and Ahmed [13]. They describe a pair of mutually recursive logical relations that deal with the interoperation between a typed language (“ML”) and an untyped language (“Scheme”). Unlike in G, parametric behavior is hard-wired into their ML side: polymorphic instantiation unconditionally performs a form of dynamic sealing to protect against the non-parametric Scheme side. (In contrast, we treat new as its own language construct, orthogonal to universal types.) Dynamic sealing can then be defined in terms of the primitive coercion operators that bridge between the ML and Scheme sides. These coercions are similar to our (metalevel) wrapping operators, but ours perform type-level sealing, not term-level sealing. The logical relations in Matthews and Ahmed’s formalism are somewhat reminiscent of E ◦ and E, although theirs are distinct logical relations for two languages, while ours are for a single language and differ only in the definition of T [[Ω]]w. In order to prove the fundamental property for their relations, they prove a “bridge lemma” transferring relatedness in one language to the other via coercions. This is analogous to our Wrapping Theorem for ◦ , but the latter is an independent theorem, not a lemma. Also, they do not propose anything like our polarized logical relations. A key technical difference is that their formulation of the logical relations does not use possible worlds to capture the type store (the latter is left implicit in their operational semantics). Unfortunately, this resulted in a significant flaw in their paper [4]. They have since reportedly fixed the problem—independently of our work—using a technique similar to ours, but they have yet to write up the details.
The definition of the parametric relation E◦ (including the extension for recursive types) is largely very similar to that of a typical step-indexed logical relation EFμ for Fμ, i.e., System F extended with pairs, existentials and iso-recursive types [3]. The main difference is the presence of worlds, but they are not actually used in a particularly interesting way in E◦. Therefore, one might expect that any two Fμ terms related by the hypothetical EFμ would also be related by E◦ and vice versa. However, this is not obvious: Gμ is more expressive than Fμ, i.e., terms in the parametric relation can contain non-trivial uses of casts (e.g., the generic ADT for pairs from Section 7), and there is no evident way to back-translate these terms into Fμ, as would be needed for function arguments. That invalidates a proof approach like the one taken by Ahmed and Blume [5]. Ultimately, the property we would like to be able to show is that the embedding of Fμ into Gμ by positive wrapping is fully abstract:
e1 ≅Fμ e2 : τ  ⟺  Wr+τ(e1) ≅ Wr+τ(e2) : τ
This equivalence is even stronger than the one about logical relatedness in EFμ and E◦, because logical relatedness is only sound w.r.t. contextual approximation, not complete. Since Fμ is a fragment of Gμ, and Fμ contexts cannot observe any difference between an Fμ term and its wrapping, the direction from right to left, called equivalence reflection, is not hard to show.
Theorem 10.1 (Equivalence Reflection)
If Δ; Γ ⊢Fμ e1 : τ and Δ; Γ ⊢Fμ e2 : τ and Δ; Γ ⊢ Wr+τ(e1) ≅ Wr+τ(e2) : τ, then Δ; Γ ⊢ e1 ≅Fμ e2 : τ.
Proof Methods. Logical relations in various forms are routinely used to reason about program equivalence and type abstraction [20, 14, 17, 3]. In particular, Ahmed, Dreyer and Rossberg recently applied step-indexed logical relations with possible worlds to reason about type abstraction for a language with higher-order state [6]. State in G is comparatively benign, but still requires a circular definition of worlds that we stratify using steps. Pitts and Stark used logical relations to reason about program equivalence in a language with (term-level) name generation [18] and subsequently generalized their technique to handle mutable references [19]. Sumii and Pierce use them for proving secrecy results for a language with dynamic sealing [26], where generated names are used as keys. Their logical relation uses a form of possible world very similar to ours, but tying relational interpretations to
Unfortunately, it is not known to us whether the other direction, equivalence preservation, holds as well. We conjecture that it does, but are not aware of any suitable technique to prove it. Note that while equivalence reflection also holds for F and G— i.e., in the absence of recursive types—equivalence preservation does not, because non-termination is encodable in G but not in F.
11. Related Work Type Generation vs. Other Forms of Data Abstraction. Traditionally, authors have distinguished between two complementary forms of data abstraction, sometimes dubbed the static and the dynamic approach [13]. The former is tied to the type system and
term-level private keys instead of to type names. Their worlds come into play in the interpretation of the type bits of encrypted data, whereas in our setup the worlds are important in the interpretation of universal and existential types. In another line of work, Sumii and Pierce have used bisimulations to establish abstraction results for both untyped and polymorphic languages [27, 28]. However, none of the languages they investigate mixes the two paradigms. Grossman, Morrisett and Zdancewic have proposed the use of abstraction brackets for syntactically tracing abstraction boundaries [10] during program execution. However, this is a comparatively weak method that does not seem to help in proving parametricity or representation independence results.
[6] Amal Ahmed, Derek Dreyer, and Andreas Rossberg. State-dependent representation independence. In POPL, 2009.
[7] Andrew W. Appel and David McAllester. An indexed model of recursive types for foundational proof-carrying code. TOPLAS, 23(5):657–683, 2001.
[8] Karl Crary and Robert Harper. Syntactic logical relations for polymorphic and recursive types. In Computation, Meaning and Logic: Articles dedicated to Gordon Plotkin. 2007.
[9] Jean-Yves Girard. Interprétation fonctionnelle et élimination des coupures de l'arithmétique d'ordre supérieur. PhD thesis, Université Paris VII, 1972.
[10] Dan Grossman, Greg Morrisett, and Steve Zdancewic. Syntactic type abstraction. TOPLAS, 22(6):1037–1080, 2000.
12. Conclusion and Future Work
[11] Robert Harper and John C. Mitchell. Parametricity and variants of Girard’s J operator. Information Processing Letters, 1999.
In traditional static languages, type abstraction is established by parametric polymorphism. This approach no longer works when dynamic typing features like casts, typecase, or reflection are added to the mix. Dynamic type generation addresses this problem. In this paper, we have shown that dynamic type generation succeeds in recovering type abstraction. More specifically: (1) we presented a step-indexed logical relation for reasoning about program equivalence in a non-parametric language with cast and type generation; (2) we showed that parametricity can be re-established systematically using a simple type-directed wrapping, which then can be reasoned about using a parametric variant of the logical relation; (3) we showed that parametricity can be refined into parametric behavior and parametric usage and gave a polarized logical relation that distinguishes these dual notions, thereby handling more subtle examples. The concept of a polarized logical relation seems novel, and it remains to be seen what else it might be useful for. Interestingly, all our logical relations can be defined as a single family differing only in the interpretation T of types-as-data. An open question is whether the wrapping, when seen as an embedding of Fμ into Gμ , is fully abstract. We conjecture that it is, but we were only able to show equivalence reflection, not equivalence preservation. Proving full abstraction remains an interesting challenge for future work. On the practical side, we would like to scale our logical relation to handle a more realistic language like ML. Unfortunately, wrapping cannot easily be extended to a type of mutable references. However, we believe that our approach still scales to a large class of languages, so long as we instrument it with a distinction between module and core levels. Specifically, note that wrapping only does something “interesting” for universal and existential types, and is the identity (modulo η-expansion) otherwise. Thus, for a language like Standard ML, which does not support firstclass polymorphism—or Alice ML, which supports modules-asfirst-class-values, but not existentials—wrapping could be confined to the module level (as part of the implementation of opaque signature ascription). For core-level types it could just be the identity. This is a real advantage of type generation over dynamic sealing since, for the latter, the need to seal/unseal individual values of abstract type precludes any attempt to confine wrapping to modules.
[12] Robert Harper and Greg Morrisett. Compiling polymorphism using intensional type analysis. In POPL, 1995.
[13] Jacob Matthews and Amal Ahmed. Parametric polymorphism through run-time sealing, or, theorems for low, low prices! In ESOP, 2008.
[14] John C. Mitchell. Representation independence and data abstraction. In POPL, 1986.
[15] John C. Mitchell and Gordon D. Plotkin. Abstract types have existential type. TOPLAS, 10(3):470–502, 1988.
[16] Georg Neis. Non-parametric parametricity. Master's thesis, Universität des Saarlandes, 2009.
[17] Andrew Pitts. Typed operational reasoning. In Benjamin C. Pierce, editor, Advanced Topics in Types and Programming Languages, chapter 7. MIT Press, 2005. [18] Andrew Pitts and Ian Stark. Observable properties of higher order functions that dynamically create local names, or: What’s new? In MFCS, volume 711 of LNCS, 1993. [19] Andrew Pitts and Ian Stark. Operational reasoning for functions with local state. In HOOTS, 1998. [20] John C. Reynolds. Types, abstraction and parametric polymorphism. In Information Processing, 1983. [21] Andreas Rossberg. Generativity and dynamic opacity for abstract types. In PPDP, 2003. [22] Andreas Rossberg. Dynamic translucency with abstraction kinds and higher-order coercions. In MFPS, 2008. [23] Andreas Rossberg, Didier Le Botlan, Guido Tack, Thorsten Brunklaus, and Gert Smolka. Alice ML through the looking glass. In TFP, volume 5, 2004. [24] Peter Sewell. Modules, abstract types, and distributed versioning. In POPL, 2001. [25] Peter Sewell, James Leifer, Keith Wansbrough, Francesco Zappa Nardelli, Mair Allen-Williams, Pierre Habouzit, and Viktor Vafeiadis. Acute: High-level programming language design for distributed computation. JFP, 17(4&5):547–612, 2007. [26] Eijiro Sumii and Benjamin C. Pierce. Logical relations for encryption. JCS, 11(4):521–554, 2003. [27] Eijiro Sumii and Benjamin C. Pierce. A bisimulation for dynamic sealing. TCS, 375(1–3):161–192, 2007.
References
[1] Martín Abadi, Luca Cardelli, Benjamin Pierce, and Didier Rémy. Dynamic typing in polymorphic languages. JFP, 5(1):111–130, 1995.
[28] Eijiro Sumii and Benjamin C. Pierce. A bisimulation for type abstraction and recursion. JACM, 54(5):1–43, 2007.
[2] Amal Ahmed. Semantics of Types for Mutable State. PhD thesis, Princeton University, 2004.
[29] Dimitrios Vytiniotis, Geoffrey Washburn, and Stephanie Weirich. An open and shut typecase. In TLDI, 2005.
[3] Amal Ahmed. Step-indexed syntactic logical relations for recursive and quantified types. In ESOP, 2006.
[30] Philip Wadler. Theorems for free! In FPCA, 1989. [31] Geoffrey Washburn and Stephanie Weirich. Generalizing parametricity using information flow. In LICS, 2005.
[4] Amal Ahmed. Personal communication, 2009.
[32] Stephanie Weirich. Type-safe cast. JFP, 14(6):681–695, 2004.
[5] Amal Ahmed and Matthias Blume. Typed closure conversion preserves observational equivalence. In ICFP, 2008.
Finding Race Conditions in Erlang with QuickCheck and PULSE

Koen Claessen  Michał Pałka  Nicholas Smallbone
Chalmers University of Technology, Gothenburg, Sweden
[email protected] [email protected] [email protected]

John Hughes  Hans Svensson  Thomas Arts
Chalmers University of Technology and Quviq AB
[email protected] [email protected] [email protected]

Ulf Wiger
Erlang Training and Consulting
[email protected]

Abstract

We address the problem of testing and debugging concurrent, distributed Erlang applications. In concurrent programs, race conditions are a common class of bugs and are very hard to find in practice. Traditional unit testing is normally unable to help finding all race conditions, because their occurrence depends so much on timing. Therefore, race conditions are often found during system testing, where due to the vast amount of code under test, it is often hard to diagnose the error resulting from race conditions. We present three tools (QuickCheck, PULSE, and a visualizer) that in combination can be used to test and debug concurrent programs in unit testing with a much better possibility of detecting race conditions. We evaluate our method on an industrial concurrent case study and illustrate how we find and analyze the race conditions.

Categories and Subject Descriptors  D.2.5 [Testing and Debugging]: Distributed debugging

General Terms  Verification

Keywords  QuickCheck, Race Conditions, Erlang

1. Introduction

Concurrent programming is notoriously difficult, because the nondeterministic interleaving of events in concurrent processes can lead software to work most of the time, but fail in rare and hard-to-reproduce circumstances when an unfortunate order of events occurs. Such failures are called race conditions. In particular, concurrent software may work perfectly well during unit testing, when individual modules (or "software units") are tested in isolation, but fail later on during system testing. Even if unit tests cover all aspects of the units, we still can detect concurrency errors when all components of a software system are tested together. Timing delays caused by other components lead to new, previously untested, schedules of actions performed by the individual units. In the worst case, bugs may not appear until the system is put under heavy load in production. Errors discovered in these late stages are far more expensive to diagnose and correct, than errors found during unit testing. Another cause of concurrency errors showing up at a late stage is when well-tested software is ported from a single-core to a multi-core processor. In that case, one would really benefit from a hierarchical approach to testing legacy code in order to simplify debugging of faults encountered. The Erlang programming language (Armstrong 2007) is designed to simplify concurrent programming. Erlang processes do not share memory, and Erlang data structures are immutable, so the kind of data races which plague imperative programs, in which concurrent processes race to read and write the same memory location, simply cannot occur. However, this does not mean that Erlang programs are immune to race conditions. For example, the order in which messages are delivered to a process may be nondeterministic, and an unexpected order may lead to failure. Likewise, Erlang processes can share data, even if they do not share memory—the file store is one good example of shared mutable data, but there are also shared data-structures managed by the Erlang virtual machine, which processes can race to read and write. Industrial experience is that the late discovery of race conditions is a real problem for Erlang developers too (Cronqvist 2004). Moreover, these race conditions are often caused by design errors, which are particularly expensive to repair. If these race conditions could be found during unit testing instead, then this would definitely reduce the cost of software development. In this paper, we describe tools we have developed for finding race conditions in Erlang code during unit testing. Our approach is based on property-based testing using QuickCheck (Claessen and Hughes 2000), in a commercial version for Erlang developed by Quviq AB (Hughes 2007; Arts et al. 2006). Its salient features are described in section 3. We develop a suitable property for testing parallel code, and a method for generating parallel test cases, in section 4. To test a wide variety of schedules, we developed a randomizing scheduler for Erlang called PULSE, which we explain in section 5. PULSE records a trace during each test, but interpreting the traces is difficult, so we developed a trace visualizer which is described in section 6. We evaluate our tools by applying them to an industrial case study, which is introduced in section 2, then used as a running example throughout the paper. This code was already known to contain bugs (thanks to earlier experiments with QuickCheck in 2005), but we were previously unable to diagnose the problems. Using the tools described here, we were able to find and fix two race conditions, and identify a fundamental flaw in the API.
2. Introducing our case study: the process registry
We begin by introducing the industrial case that we apply our tools and techniques to. In Erlang, each process has a unique, dynamically-assigned identifier (“pid”), and to send a message to
a process, one must know its pid. To enable processes to discover the pids of central services, such as error logging, Erlang provides a process registry—a kind of local name server—which associates static names with pids. The Erlang VM provides operations to register a pid with a name, to look up the pid associated with a name, and to unregister a name, removing any association with that name from the registry. The registry holds only live processes; when registered processes crash, then they are automatically unregistered. The registry is heavily used to provide access to system services: a newly started Erlang node already contains 13 registered processes. However, the built-in process registry imposes several, sometimes unwelcome, limitations: registered names are restricted to be atoms, the same process cannot be registered with multiple names, and there is no efficient way to search the registry (other than by name lookup). This motivated Ulf Wiger (who was working for Ericsson at the time) to develop an extended process registry in Erlang, which could be modified and extended much more easily than the one in the virtual machine. Wiger’s process registry software has been in use in Ericsson products for several years (Wiger 2007). In our case study we consider an earlier prototype of this software, called proc_reg, incorporating an optimization that proved not to work. The API supported is just: reg(Name,Pid) to register a pid, where(Name) to look up a pid, unreg(Name) to remove a registration, and send(Name,Msg) to send a message to a registered process. Like the production code, proc_reg stores the association between names and pids in Erlang Term Storage (“ETS tables”)—hash tables, managed by the virtual machine, that hold a set of tuples and support tuple-lookup using the first component as a key (cf. Armstrong 2007, chap 15). It also creates a monitor for each registered process, whose effect is to send proc_reg a “DOWN” message if the registered process crashes, so it can be removed from the registry. Two ETS table entries are created for each registration: a “forward” entry that maps names to pids, and a “reverse” entry that maps registered pids to the monitor reference. The monitor reference is needed to turn off monitoring again, if the process should later be unregistered. Also like the production code, proc_reg is implemented as a server process using Erlang’s generic server library (cf. Armstrong 2007, chap 16). This library provides a robust way to build clientserver systems, in which clients make “synchronous calls” to the server by sending a call message, and awaiting a matching reply1 . Each operation—reg, where, unreg and send—is supported by a different call message. The operations are actually executed by the server, one at a time, and so no race conditions can arise. At least, this is the theory. In practice there is a small cost to the generic server approach: each request sends two messages and requires two context switches, and although these are cheap in Erlang, they are not free, and turn out to be a bottleneck in system start-up times, for example. The prototype proc_reg attempts to optimize this, by moving the creation of the first “forward” ETS table entry into the clients. If this succeeds (because there is no previous entry with that name), then clients just make an “asynchronous” call to the server (a so-called cast message, with no reply) to inform it that it should complete the registration later. 
This avoids a context switch, and reduces two messages to one. If there is already a registered process with the same name, then the reg operation fails (with an exception)—unless, of course, the process is dead. In this case, the process will soon be removed from the registry by the server; clients ask the server to “audit” the dead process to hurry this along, then complete their registration as before.
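To make the optimized protocol concrete, here is a sketch of what the client side of reg might look like. The table name proc_reg_tab and the {new_reg,...}/{audit,...} messages are our own illustrative names, not the actual proc_reg code:

%% Client-side registration: create the forward ETS entry ourselves, then
%% asynchronously ask the server to finish the job (monitor + reverse entry).
reg(Name, Pid) when is_pid(Pid) ->
    case ets:insert_new(proc_reg_tab, {{reg, Name}, Pid}) of
        true ->
            gen_server:cast(proc_reg, {new_reg, Name, Pid}),
            true;
        false ->
            [{_, OldPid}] = ets:lookup(proc_reg_tab, {reg, Name}),
            case is_process_alive(OldPid) of
                true ->
                    erlang:error(badarg);   % name taken by a live process
                false ->
                    %% Dead process still in the table: ask the server to
                    %% "audit" (clean up) the old entry, then try again.
                    ok = gen_server:call(proc_reg, {audit, OldPid}),
                    reg(Name, Pid)
            end
    end.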
This prototype was one of the first pieces of software to be tested using QuickCheck at Ericsson. At the time, in late 2005, it was believed to work, and indeed was accompanied by quite an extensive suite of unit tests—including cases designed specifically to test for race conditions. We used QuickCheck to generate and run random sequences of API calls in two concurrent processes, and instrumented the proc_reg code with calls to yield() (which cedes control to the scheduler) to cause fine-grain interleaving of concurrent operations. By so doing, we could show that proc_reg was incorrect, since our tests failed. But the failing test cases we found were large, complex, and very hard to understand, and we were unable to use them to diagnose the problem. As a result, this version of proc_reg was abandoned, and development of the production version continued without the optimization. While we were pleased that QuickCheck was able to reveal bugs in proc_reg, we were unsatisfied that it could not help us to find them. Moreover, the QuickCheck property we used to test it was hard-to-define and ad hoc—and not easily reusable to test any other software. This paper is the story of how we addressed these problems—and returned to apply our new methods successfully to the example that defeated us before.
3. An Overview of Quviq QuickCheck
QuickCheck (Claessen and Hughes 2000) is a tool that tests universally quantified properties, instead of single test cases. QuickCheck generates random test cases from each property, tests whether the property is true in that case, and reports cases for which the property fails. Recent versions also “shrink” failing test cases automatically, by searching for similar, but smaller test cases that also fail. The result of shrinking is a “minimal”2 failing case, which often makes the root cause of the problem very easy to find. Quviq QuickCheck is a commercial version that includes support for model-based testing using a state machine model (Hughes 2007). This means that it has standard support for generating sequences of API calls using this state machine model. It has been used to test a wide variety of industrial software, such as Ericsson’s Media Proxy (Arts et al. 2006) among others. State machine models are tested using an additional library, eqc_statem, which invokes call-backs supplied by the user to generate and test random, wellformed sequences of calls to the software under test. We illustrate eqc_statem by giving fragments of a (sequential) specification of proc_reg. Let us begin with an example of a generated test case (a sequence of API calls). [{set,{var,1},{call,proc_reg_eqc,spawn,[]}}, {set,{var,2},{call,proc_reg,where,[c]}}, {set,{var,3},{call,proc_reg_eqc,spawn,[]}}, {set,{var,4},{call,proc_reg_eqc,kill,[{var,1}]}}, {set,{var,5},{call,proc_reg,where,[d]}}, {set,{var,6},{call,proc_reg_eqc,reg,[a,{var,1}]}}, {set,{var,7},{call,proc_reg_eqc,spawn,[]}}]
eqc_statem test cases are lists of symbolic commands represented by Erlang terms, each of which binds a symbolic variable (such as {var,1}) to the result of a function call, where {call,M,F,Args} represents a call of function F in module M with arguments Args3 . Note that previously bound variables can be used in later calls. Test cases for proc_reg in particular randomly spawn processes (to use as test data), kill them (to simulate crashes at random times), or pass them to proc_reg operations. Here proc_reg_eqc is the module containing the specification of proc_reg, in which we define local 2 In
the sense that it cannot shrink to a failing test with the shrinking algorithm used. 3 In Erlang, variables start with an uppercase character, whereas atoms (constants) start with a lowercase character.
1 Unique
identifiers are generated for each call, and returned in the reply, so that no message confusion can occur.
versions of reg and unreg which just call proc_reg and catch any exceptions. This allows us to write properties that test whether an exception is raised correctly or not. (An uncaught exception in a test is interpreted as a failure of the entire test). We model the state of a test case as a list of processes spawned, processes killed, and the {Name,Pid} pairs currently in the registry. We normally encapsulate the state in a record:
postcondition(S,{call,_,reg,[Name,Pid]},Res) -> case Res of true -> register_ok(S,Name,Pid); {’EXIT’,_} -> not register_ok(S,Name,Pid) end; postcondition(S,{call,_,unreg,[Name]},Res) -> case Res of true -> unregister_ok(S,Name); {’EXIT’,_} -> not unregister_ok(S,Name) end; postcondition(S,{call,_,where,[Name]},Res) -> lists:member({Name,Res},S#state.regs); postcondition(_S,{call,_,_,_},_Res) -> true.
-record(state,{pids=[],regs=[],killed=[]}).
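The local versions of reg and unreg mentioned above need only turn exceptions into inspectable values; a minimal sketch (our own formulation, not the paper's code) is:

reg(Name, Pid) -> catch proc_reg:reg(Name, Pid).
unreg(Name)    -> catch proc_reg:unreg(Name).

Here catch converts a raised exception into a term of the form {'EXIT',Reason}, which is exactly the {'EXIT',_} shape that the postconditions match on.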
eqc_statem generates random calls using the call-back function command that we supply as part of the state machine model, with the test case state as its argument: command(S) -> oneof( [{call,?MODULE,spawn,[]}] ++ [{call,?MODULE,kill, [elements(S#state.pids)]} || S#state.pids/=[]] ++ [{call,?MODULE,reg,[name(),elements(S#state.pids)]} || S#state.pids/=[]] ++ [{call,?MODULE,unreg,[name()]}] ++ [{call,proc_reg,where,[name()]}]).
unregister_ok(S,Name) -> lists:keymember(Name,1,S#state.regs).
Note that reg(Name,Pid) and unreg(Name) are required to return exceptions if Name is already used/not used respectively, but that reg always returns true if Pid is dead, even though no registration is performed! This may perhaps seem a surprising design decision, but it is consistent. As a comparison, the built-in process registry sometimes returns true and sometimes raises an exception when registering dead processes. This is due to the fact that a context switch is required to clean up. State machine models can also specify a precondition for each call, which restricts test cases to those in which all preconditions hold. In this example, we could have used preconditions to exclude test cases that we expect to raise exceptions—but we prefer to allow any test case, and check that exceptions are raised correctly, so we define all preconditions to be true. With these four call-backs, plus another call-back specifying the initial state, our specification is almost complete. It only remains to define the top-level property which generates and runs tests:
name() -> elements([a,b,c,d]).
The function oneof is a QuickCheck generator that randomly uses one element from a list of generators; in this case, the list of candidates to choose from depends on the test case state. ([X||P] is a degenerate list comprehension, that evaluates to the empty list if P is false, and [X] if P is true—so reg and kill can be generated only if there are pids available to pass to them.) We decided not to include send in test cases, because its implementation is quite trivial. The macro ?MODULE expands to the name of the module that it appears in, proc_reg_eqc in this case. The next_state function specifies how each call is supposed to change the state: next_state(S,V,{call,_,spawn,_}) -> S#state{pids=[V|S#state.pids]}; next_state(S,V,{call,_,kill,[Pid]}) -> S#state{killed=[Pid|S#state.killed], regs=[{Name,P} || {Name,P} case register_ok(S,Name,Pid) andalso not lists:member(Pid,S#state.killed) of true -> S#state{regs=[{Name,Pid}|S#state.regs]}; false -> S end; next_state(S,_V,{call,_,unreg,[Name]}) -> S#state{regs=lists:keydelete(Name,1,S#state.regs)}; next_state(S,_V,{call,_,where,[_]}) -> S.
prop_proc_reg() -> ?FORALL(Cmds,commands(?MODULE), begin {ok,ETSTabs} = proc_reg_tabs:start_link(), {ok,Server} = proc_reg:start_link(), {H,S,Res} = run_commands(?MODULE,Cmds), cleanup(ETSTabs,Server), Res == ok end).
Here ?FORALL binds Cmds to a random list of commands generated by commands, then we initialize the registry, run the commands, clean up, and check that the result of the run (Res) was a success. Here commands and run_commands are provided by eqc_statem, and take the current module name as an argument in order to find the right call-backs. The other components of run_commands’ result, H and S, record information about the test run, and are of interest primarily when a test fails. This is not the case here: sequential testing of proc_reg does not fail.
register_ok(S,Name,Pid) -> not lists:keymember(Name,1,S#state.regs).
Note that the new state can depend on the result of the call (the second argument V), as in the first clause above. Note also that killing a process removes it from the registry (in the model), and that registering a dead process, or a name that is already registered (see register_ok), should not change the registry state. We do allow the same pid to be registered with several names, however. When running tests, eqc_statem checks the postcondition of each call, specified via another call-back that is given the state before the call, and the actual result returned, as arguments. Since we catch exceptions in each call, which converts them into values of the form {’EXIT’,Reason}, our proc_reg postconditions can test that exceptions are raised under precisely the right circumstances:
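For concreteness, a property like this would be run from the Erlang shell roughly as follows (an illustrative session, assuming the standard Quviq eqc entry points eqc:quickcheck/1 and eqc:numtests/2 and the module name used above):

eqc:quickcheck(proc_reg_eqc:prop_proc_reg()).
%% or, to run more than the default number of tests:
eqc:quickcheck(eqc:numtests(1000, proc_reg_eqc:prop_proc_reg())).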
4. Parallel Testing with QuickCheck

4.1 A Parallel Correctness Criterion
In order to test for race conditions, we need to generate test cases that are executed in parallel, and we also need a specification of the correct parallel behavior. We have chosen, in this paper, to use a specification that just says that the API operations we are testing should behave atomically. How can we tell from test results whether or not each operation "behaved atomically"? Following Lamport (1979) and Herlihy and Wing (1987), we consider a test to have passed if the observed results are the same as some possible sequential execution of the operations in the test—that is, a possible interleaving of the parallel processes in the test. Of course, testing for atomic behavior is just a special case, and in general we may need to test other properties of concurrent code too—but we believe that this is a very important special case. Indeed, Herlihy and Wing claim that their notion of linearizability "focuses exclusively on a subset of concurrent computations that we believe to be the most interesting and useful"; we agree. In particular, atomicity is of great interest for the process registry.

One great advantage of this approach is that we can reuse the same specification of the sequential behavior of an API, to test its behavior when invocations take place in parallel. We need only find the right linearization of the API calls in the test, and then use the sequential specification to determine whether or not the test has passed. We have implemented this idea in a new QuickCheck module, eqc_par_statem, which takes the same state-machine specifications as eqc_statem, but tests the API in parallel instead. While state machine specifications require some investment to produce in real situations, this means that we can test for race conditions with no further investment in developing a parallel specification. It also means that, as the code under test evolves, we can switch freely to and fro between sequential testing, to ensure the basic behavior still works, and race condition testing using eqc_par_statem.

The difficulty with this approach is that, when we run a test, then there is no way to observe the sequence in which the API operations take effect. (For example, a server is under no obligation to service requests in the order in which they are made, so observing this order would tell us nothing.) In general, the only way to tell whether there is a possible sequentialization of a test case which can explain the observed test results, is to enumerate all possible sequentializations. This is prohibitively expensive unless care is taken when test cases are generated.

4.2 Generating Parallel Test Cases

Our first approach to parallel test case generation was to use the standard Quviq QuickCheck library eqc_statem to generate sequential test cases, then execute all the calls in the test case in parallel, constrained only by the data dependencies between them (which arise from symbolic variables, bound in one command, being used in a later one). This generates a great deal of parallelism, but sadly also an enormous number of possible serializations—in the worst case in which there are no data dependencies, a sequence of n commands generates n! possible serializations. It is not practically feasible to implement a test oracle for parallel tests of this sort.

Instead, we decided to generate parallel test cases of a more restricted form. They consist of an initial sequential prefix, to put the system under test into a random state, followed by exactly two sequences of calls which are performed in parallel. Thus the possible serializations consist of the initial prefix, followed by an interleaving of the two parallel sequences. (Lu et al. (2008) gives clear evidence that it is possible to discover a large fraction of the concurrency related bugs by using only two parallel threads/processes.) We generate parallel test cases by parallelizing a suffix of an eqc_statem test case, separating it into two lists of commands of roughly equal length, with no mutual data dependencies, which are non-interfering according to the sequential specification. By non-interference, we mean that all command preconditions are satisfied in any interleaving of the two lists, which is necessary to prevent tests from failing because a precondition was unsatisfied—not an interesting failure. We avoid parallelizing too long a suffix (longer than 16 commands), to keep the number of possible interleavings feasible to enumerate (about 10,000 in the worst case).

Finally, we run tests by first running the prefix, then spawning two processes to run the two command-lists in parallel, and collecting their results, which will be non-deterministic depending on the actual parallel scheduling of events. We decide whether a test has passed by attempting to construct a sequentialization of the test case which explains the results observed. We begin with the sequential prefix of the test case, and use the next_state function of the eqc_statem model to compute the test case state after this prefix is completed. Then we try to extend the sequential prefix, one command at a time, by choosing the first command from one of the parallel branches, and moving it into the prefix. This is allowed only if the postcondition specified in the eqc_statem model accepts the actual result returned by the command when we ran the test. If so, we use the next_state function to compute the state after this command, and continue. If the first commands of both branches fulfilled their postconditions, then we cannot yet determine which command took effect first, and we must explore both possibilities further. If we succeed in moving all commands from the parallel branches into the sequential prefix, such that all postconditions are satisfied, then we have found a possible sequentialization of the test case explaining the results we observed. If our search fails, then there is no such sequence, and the test failed.

This is a greedy algorithm: as soon as a postcondition fails, then we can discard all potential sequentializations with the failing command as the next one in the sequence. This happens often enough to make the search reasonably fast in practice. As a further optimization, we memoize the search function on the remaining parallel branches and the test case state. This is useful, for example, when searching for a sequentialization of [A, B] and [C, D], if both [A, C] and [C, A] are possible prefixes, and they lead to the same test state—for then we need only try to sequentialize [B] and [D] once. We memoize the non-interference test in a similar way, and these optimizations give an appreciable, though not dramatic, speed-up in our experiments—of about 20%. With these optimizations, generating and running parallel tests is acceptably fast.
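A condensed sketch of this search (our illustration under the stated assumptions, not the actual eqc_par_statem implementation) takes the model module, the state after the sequential prefix, and the two parallel branches as lists of {Call, ActualResult} pairs, and omits the memoization described above:

%% Illustrative sketch only. P and Q are the two parallel branches,
%% given as lists of {Call, Result} pairs observed during the run.
linearizable(_Mod, _S, [], []) ->
    true;
linearizable(Mod, S, P, Q) ->
    try_branch(Mod, S, P, Q) orelse try_branch(Mod, S, Q, P).

%% Try to move the first command of the first branch into the prefix.
try_branch(_Mod, _S, [], _Other) ->
    false;
try_branch(Mod, S, [{Call, Res} | Rest], Other) ->
    Mod:postcondition(S, Call, Res) andalso
        linearizable(Mod, Mod:next_state(S, Res, Call), Rest, Other).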
4.3 Shrinking Parallel Test Cases

When a test fails, QuickCheck attempts to shrink the failing test by searching for a similar, but smaller test case which also fails. QuickCheck can often report minimal failing examples, which is a great help in fault diagnosis. eqc_statem already has built-in shrinking methods, of which the most important tries to delete unnecessary commands from the test case, and eqc_par_statem inherits these methods. But we also implement an additional shrinking method for parallel test cases: if it is possible to move a command from one of the parallel suffixes into the sequential prefix, then we do so. Thus the minimal test cases we find are "minimally parallel"—we know that the parallel branches in the failing tests reported really do race, because everything that can be made sequential, is sequential. This also assists fault diagnosis.

4.4 Testing proc_reg for Race Conditions

To test the process registry using eqc_par_statem, it is only necessary to modify the property in Section 2 to use eqc_par_statem rather than eqc_statem to generate and run test cases.

prop_proc_reg_parallel() ->
    ?FORALL(Cmds,eqc_par_statem:commands(?MODULE),
        begin
            {ok,ETSTabs} = proc_reg_tabs:start_link(),
            {ok,Server} = proc_reg:start_link(),
            {H,{A,B},Res} = eqc_par_statem:run_commands(?MODULE,Cmds),
            cleanup(ETSTabs,Server),
            Res == ok
        end).
The type returned by run_commands is slightly different (A and B are lists of the calls made in each parallel branch, paired with the results returned), but otherwise no change to the property is needed. When this property is tested on a single-core processor, all tests pass. However, as soon as it is tested on a dual-core, tests begin to fail. Interestingly, just running on two cores gives us enough finegrain interleaving of concurrent processes to demonstrate the presence of race conditions, something we had to achieve by instrumenting the code with calls to yield() to control the scheduler when we first tested this code in 2005. However, just as in 2005, the reported failing test cases are large, and do not shrink to small examples. This makes the race condition very hard indeed to diagnose. The problem is that the test outcome is not determined solely by the test case: depending on the actual interleaving of memory operations on the dual core, the same test may sometimes pass and sometimes fail. This is devastating for QuickCheck’s shrinking, which works by repeatedly replacing the failed test case by a smaller one which still fails. If the smaller test happens to succeed—by sheer chance, as a result of non-deterministic execution—then the shrinking process stops. This leads QuickCheck to report failed tests which are far from minimal. Our solution to this is almost embarrassing in its simplicity: instead of running each test only once, we run it many times, and consider a test case to have passed only if it passes every time we run it. We express this concisely using a new form of QuickCheck property, ?ALWAYS(N,Prop), which passes if Prop passes N times in a row4 . Now, provided the race condition we are looking for is reasonably likely to be provoked by test cases in which it is present, then ?ALWAYS(10,...) is very likely to provoke it—and so tests are unlikely to succeed “by chance” during the shrinking process. This dramatically improves the effectiveness of shrinking, even for quite small values of N. While we do not always obtain minimal failing tests with this approach, we find we can usually obtain a minimal example by running QuickCheck a few times. When testing the proc_reg property above, we find the following simple counterexample:
5. PULSE: A User-level Scheduler
At this point, we have found a simple test case that fails, but we do not know why it failed—we need to debug it. A natural next step would be to turn on Erlang’s tracing features and rerun the test. But when the bug is caused by a race condition, then turning on tracing is likely to change the timing properties of the code, and thus interfere with the test failure! Even simply repeating the test may lead to a different result, because of the non-determinism inherent in running on a multi-core. This is devastating for debugging. What we need is to be able to repeat the test as many times as we like, with deterministic results, and to observe what happens during the test, so that we can analyze how the race condition was provoked. With this in mind, we have implemented a new Erlang module that can control the execution of designated Erlang processes and records a trace of all relevant events. Our module can be thought of as a user-level scheduler, sitting on top of the normal Erlang scheduler. Its aim is to take control over all sources of non-determinism in Erlang programs, and instead take those scheduling decisions randomly. This means that we can repeat a test using exactly the same schedule by supplying the same random number seed: this makes tests repeatable. We have named the module PULSE, short for ProTest User-Level Scheduler for Erlang. The Erlang virtual machine (VM) runs processes for relatively long time-slices, in order to minimize the time spent on context switching—but as a result, it is very unlikely to provoke race conditions in small tests. It is possible to tune the VM to perform more context switches, but even then the scheduling decisions are entierly deterministic. This is one reason why tricky concurrency bugs are rarely found during unit testing; it is not until later stages of a project, when many components are tested together, that the standard scheduler begins to preempt processes and trigger race conditions. In the worst case, bugs don’t appear until the system is put under heavy load in production! In these later stages, such errors are expensive to debug. One other advantage (apart from repeatability) of PULSE is that it generates much more fine-grain interleaving than the built-in scheduler in the Erlang virtual machine (VM), because it randomly chooses the next process to run at each point. Therefore, we can provoke race conditions even in very small tests. Erlang’s scheduler is built into its virtual machine—and we did not want to modify the virtual machine itself. Not only would this be difficult—it is a low-level, fairly large and complex C program—but we would need to repeat the modifications every time a new version of the virtual machine was released. We decided, therefore, to implement PULSE in Erlang, as a user-level scheduler, and to instrument the code of the processes that it controls so that they cooperate with it. As a consequence, PULSE can even be used in conjunction with legacy or customized versions of the Erlang VM (which are used in some products). The user level scheduler also allows us to restrict our debugging effort to a few processes, whereas we are guaranteed that the rest of the processes are executed normally.
{[{set,{var,5},{call,proc_reg_eqc,spawn,[]}},
  {set,{var,9},{call,proc_reg_eqc,kill,[{var,5}]}},
  {set,{var,15},{call,proc_reg_eqc,reg,[a,{var,5}]}}],
 {[{set,{var,19},{call,proc_reg_eqc,reg,[a,{var,5}]}}],
  [{set,{var,18},{call,proc_reg_eqc,reg,[a,{var,5}]}}]}}
This test case first creates and kills a process, then tries to register it (which should have no effect, because it is already dead), and finally tries to register it again twice, in parallel. Printing the diagnostic output from run_commands, we see:

Sequential:
 [{{state,[],[],[]},},
  {{state,[],[],[]},ok},
  {{state,[],[],[]},true}]
Parallel:
 {[{{call,proc_reg_eqc,reg,[a,]},
    {'EXIT',{badarg,[{proc_reg,reg,2},...]}}}],
  [{{call,proc_reg_eqc,reg,[a,]},true}]}
Res: no_possible_interleaving
5.1 Overall Design
The central idea behind developing PULSE was to provide absolute control over the order of relevant events. The first natural question that arises is: What are the relevant events? We define a side-effect to be any interaction of a process with its environment. Of particular interest in Erlang is the way processes interact by message passing, which is asynchronous. Message channels, containing messages that have been sent but not yet delivered, are thus part of the environment and explicitly modelled as such in PULSE. It makes sense to separate side-effects into two kinds: outward side-effects, that influence only the environment (such as sending a message over a channel, which does not block and cannot fail, or printing a message), and inward side-effects, that allow the environment to
(where the ellipses replace an uninteresting stack trace). The values displayed under "Parallel:" are the results A and B from the two parallel branches—they reveal that one of the parallel calls to reg raised an exception, even though trying to register a dead process should always just return true! How this happened, though, is still quite mysterious—but will be explained in the following sections.
(Footnote 4: in fact, we need only repeat tests during shrinking.)
PULSE needs to be able to control the order in which those interactions take place. Since we are not interested in controlling the order in which pure functions are called we allow the programmer to specify which external functions have side-effects. Each call of a side-effecting function is then instrumented with code that yields before performing the real call and PULSE is free to run another process at that point. Side-effecting functions are treated as atomic which is also an important feature that aids in testing systems built of multiple components. Once we establish that a component contains no race conditions we can remove the instrumentation from it and mark its operations as atomic side-effects. We will then be able to test other components that use it and each operation marked as sideeffecting will show up as a single event in a trace. Therefore, it is possible to test a component for race conditions independently of the components that it relies on.
influence the behavior of the process (such as receiving a message from a channel, or asking for the system time). We do not want to take control over purely functional code, or side-effecting code that only influences processes locally. PULSE takes control over some basic features of the Erlang RTS (such as spawning processes, message sending, linking, etc.), but it knows very little about standard library functions – it would be too much work to deal with each of these separately! Therefore, the user of PULSE can specify which library functions should be dealt with as (inward) side-effecting functions, and PULSE has a generic way of dealing with these (see subsection 5.3). A process is only under the control of PULSE if its code has been properly instrumented. All other processes run as normal. In instrumentation, occurrences of side-effecting actions are replaced by indirections that communicate with PULSE instead. In particular, outward side-effects (such as sending a message to another process) are replaced by simply sending a message to PULSE with the details of the side-effect, and inward side-effects (such as receiving a message) are replaced by sending a request to PULSE for performing that side-effect, and subsequently waiting for permission. To ease the instrumentation process, we provide an automatic instrumenter, described in subsection 5.4. 5.2
Inner Workings
The PULSE scheduler controls its processes by allowing only one of them to run at a time. It employs a cooperative scheduling method: At each decision point, PULSE randomly picks one of its waiting processes to proceed, and wakes it up. The process may now perform a number of outward side-effects, which are all recorded and taken care of by PULSE, until the process wants to perform an inward side-effect. At this point, the process is put back into the set of waiting processes, and a new decision point is reached. The (multi-node) Erlang semantics (Svensson and Fredlund 2007) provides only one guarantee for message delivery order: that messages between a pair of processes arrive in the same order as they were sent. So as to adhere to this, PULSE’s state also maintains a message queue between each pair of processes. When process P performs an outward side-effect by sending a message M to the process Q, then M is added to the queue hP, Qi. When PULSE wants to wake up a waiting process Q, it does so by randomly picking a non-empty queue hP 0 , Qi with Q as its destination, and delivering the first message in that queue to Q. Special care needs to be taken for the Erlang construct receive . . . after n -> . . . end, which allows a receiving process to only wait for an incoming message for n milliseconds before continuing, but the details of this are beyond the scope of this paper. As an additional benefit, this design allows PULSE to detect deadlocks when it sees that all processes are blocked, and there exist no message queues with the blocked processes as destination. As a clarification, the message queues maintained by PULSE for each pair of processes should not be confused with the internal mailbox that each process in Erlang has. In our model, sending a message M from P to Q goes in four steps: (1) P asynchronously sends off M , (2) M is on its way to Q, (3) M is delivered to Q’s mailbox, (4) Q performs a receive statement and M is selected and removed from the mailbox. The only two events in this process that we consider side-effects are (1) P sending of M , and (3) delivering M to Q’s mailbox. In what order a process decides to process the messages in its mailbox is not considered a side-effect, because no interaction with the environment takes place. 5.3
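As a rough sketch of this delivery step (our illustration, not PULSE's actual implementation), the scheduler state can be viewed as a list of {{From,To}, Queue} pairs plus the set of processes currently waiting to receive:

%% Illustrative sketch: pick a random non-empty queue whose destination
%% is waiting, and deliver the first message in it.
deliver_step(Queues, Waiting) ->
    Ready = [{FT, Q} || {{_From, To} = FT, Q} <- Queues,
                        Q =/= [],
                        lists:member(To, Waiting)],
    case Ready of
        [] ->
            deadlock;   % no waiting process has a deliverable message
        _ ->
            {FT, [Msg | Rest]} =
                lists:nth(random:uniform(length(Ready)), Ready),
            {deliver, FT, Msg,
             lists:keyreplace(FT, 1, Queues, {FT, Rest})}
    end.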
5.4 Instrumentation
The program under test has to cooperate with PULSE, and the relevant processes should use PULSE's API to send and receive messages, spawn processes, etc., instead of Erlang's built-in functionality. Manually altering an Erlang program so that it does this is tedious and error-prone, so we developed an instrumenting compiler that does this automatically. The instrumenter is used in exactly the same way as the normal compiler, which makes it easy to switch between PULSE and the normal Erlang scheduler. It's possible to instrument and load a module at runtime by typing in a single command at the Erlang shell. Let us show the instrumentation of the four most important constructs: sending a message, yielding, spawning a process, and receiving a message.

5.4.1 Sending

If a process wants to send a message, the instrumenter will redirect this as a request to the PULSE scheduler. Thus, Pid ! Msg is replaced by

scheduler ! {send, Pid, Msg},
Msg
The result value of sending a message is always the message that was sent. Since we want the instrumented send to yield the same result value as the original one, we add the second line.
5.4.2 Yielding
A process yields when it wants to give up control to the scheduler. Yields are also introduced just before each user-specified side-effecting external function. After instrumentation, a yielding process will instead give up control to PULSE. This is done by telling it that the process yields, and waiting for permission to continue. Thus, yield() is replaced by

scheduler ! yield,
receive
    {scheduler, go} -> ok
end
In other words, the process notifies PULSE and then waits for the message go from the scheduler before it continues. All control messages sent by PULSE to the controlled processes are tagged with {scheduler, _} in order to avoid mixing them up with "real" messages.

5.4.3 Spawning

A process P spawning a process Q is considered an outward side-effect for P, and thus P does not have to block. However, PULSE must be informed of the existence of the new process Q, and Q
External Side-effects
In addition to sending and receiving messages between themselves, the processes under test can also interact with uninstrumented code.
needs to be brought under its control. The spawned process Q must therefore wait for PULSE to allow it to run. Thus, spawn(Fun) is replaced by
            SRes = scheduler:start([{seed,Seed}],
                     fun() ->
                         {ok,ETSTabs} = proc_reg_tabs:start_link(),
                         {ok,Server} = proc_reg:start_link(),
                         eqc_par_statem:run_commands(?MODULE,Cmds),
                         cleanup(ETSTabs,Server)
                     end),
            {H,AB,Res} = scheduler:get_result(SRes),
            Res == ok
          end))).
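The seed() generator used in this property is not shown in the paper; one plausible definition (our hypothetical sketch, assuming the three-element seeds taken by Erlang's random module) would be:

%% Hypothetical generator for scheduler seeds (not from the paper).
seed() ->
    {choose(1, 1 bsl 30), choose(1, 1 bsl 30), choose(1, 1 bsl 30)}.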
Pid = spawn(fun() ->
                receive
                    {scheduler, go} -> Fun()
                end
            end),
scheduler ! {spawned, Pid},
Pid
In other words, the process spawns an altered process that waits for the message go from the scheduler before it does anything. The scheduler is then informed of the existence of the spawned process, and we continue.
PULSE uses a random seed, generated by seed(). It also takes a function as an argument, so we create a lambda-function which initializes and runs the tests. The result of running the scheduler is a list of things, thus we need to call scheduler:get result to retrieve the actual result from run commands. We should also remember to instrument rather than compile all the involved modules. Note that we still use ?ALWAYS in order to run the same test data with different random seeds, which helps the shrinking process in finding smaller failing test cases that would otherwise be less likely to fail. When testing this modified property, we find the following counterexample, which is in fact simpler than the one we found in Section 4.4:
5.4.4 Receiving
Receiving in Erlang works by pattern matching on the messages in the process' mailbox. When a process is ready to receive a new message, it will have to ask PULSE for permission. However, it is possible that an appropriate message already exists in its mailbox, and receiving this message would not be a side-effect. Therefore, an instrumented process will first check if it is possible to receive a message with the desired pattern, and proceed if this is possible. If not, it will tell the scheduler that it expects a new message in its mailbox, and blocks. When woken up again on the delivery of a new message, this whole process is repeated if necessary. We need a helper function that implements this checking-waiting-checking loop. It is called receiving:
{[{set,{var,9},{call,proc_reg_eqc,spawn,[]}},
  {set,{var,10},{call,proc_reg_eqc,kill,[{var,9}]}}],
 {[{set,{var,15},{call,proc_reg_eqc,reg,[c,{var,9}]}}],
  [{set,{var,12},{call,proc_reg_eqc,reg,[c,{var,9}]}}]}}
When prompted, PULSE provides quite a lot of information about the test case run and the scheduling decisions taken. Below we show an example of such information. However, it is still not easy to explain the counterexample; in the next section we present a method that makes it easier to understand the scheduler output.
receiving(Receiver) ->
    Receiver(fun() ->
                 scheduler ! block,
                 receive
                     {scheduler, go} -> receiving(Receiver)
                 end
             end).
-> calls scheduler:process_flag [priority,high] returning normal. -> sends ’{call,{attach,}, ,#Ref}’ to . -> blocks. *** unblocking by delivering ’{call,{attach,}, , #Ref}’ sent by . ...
receiving gets a receiver function as an argument. A receiver function is a function that checks if a certain message is in its mailbox, and if not, executes its argument function. The function receiving turns this into a loop that only terminates once PULSE has delivered the right message. When the receiver function fails, PULSE is notified by the block message, and the process waits for permission to try again. Code of the form receive Pat -> Exp end
is then replaced by

receiving(fun (Failed) ->
              receive
                  Pat -> Exp
              after 0 -> Failed()
              end
          end)
6. Visualizing Traces

PULSE records a complete trace of the interesting events during test execution, but these traces are long, and tedious to understand. To help us interpret them, we have, utilizing the popular GraphViz package (Gansner and North 1999), built a trace visualizer that draws the trace as a graph. For example, Figure 1 shows the graph drawn for one possible trace of the following program:
In the above, we use the standard Erlang idiom (receive . . . after 0 -> . . . end) for checking if a message of a certain type exists. It is easy to see how receive statements with more than one pattern can be adapted to work with the above scheme.
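For instance, extrapolating the same scheme (this is our illustration, not code from the paper), a receive with two patterns

receive Pat1 -> Exp1; Pat2 -> Exp2 end

would be instrumented as

receiving(fun (Failed) ->
              receive
                  Pat1 -> Exp1;
                  Pat2 -> Exp2
              after 0 -> Failed()
              end
          end)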
procA() ->
    PidB = spawn(fun procB/0),
    PidB ! a,
    process_flag(trap_exit, true),
    link(PidB),
    receive
        {'EXIT',_,Why} -> Why
    end.
5.5 Testing proc_reg with PULSE

To test the proc_reg module using both QuickCheck and PULSE, we need to make a few modifications to the QuickCheck property in Section 4.4.

prop_proc_reg_scheduled() ->
    ?FORALL(Cmds,eqc_par_statem:commands(?MODULE),
      ?ALWAYS(10,
        ?FORALL(Seed,seed(),
          begin
procB() ->
    receive
        a -> exit(kill)
    end.
Figure 1. A simple trace visualization.

Figure 2. An alternative possible execution.
The function procA starts by spawning a process, and subsequently sends it a message a. Later, procA links to the process it spawned, which means that it will get notified when that process dies. The default behavior of a process when such a notification happens is to also die (in this way, one can build hierarchies of processes). Setting the process flag trap_exit to true changes this behaviour, and the notification is delivered as a regular message of the form {'EXIT',_,_} instead.

In the figure, each process is drawn as a sequence of state transitions, from a start state drawn as a triangle, to a final state drawn as an inverted triangle, all enclosed in a box and assigned a unique color. (Since the printed version of the diagrams may lack these colors, we reference diagram details by location and not by color. However, the diagrams are even more clear in color.) The diagram shows the two processes, procA (called root) which is shown to the left (in red), and procB (called procA.PidB, a name automatically derived by PULSE from the point at which it was spawned) shown to the right (in blue). Message delivery is shown by gray arrows, as is the return of a result by the root process. As explained in the previous section, processes make transitions when receiving a message (if messages are consumed from a process mailbox out-of-order, we show the delivery of a message to the mailbox, and its later consumption, as separate transitions), or when calling a function that the instrumenter knows has a side-effect.

From the figure, we can see that the root process spawned PidB and sent the message a to it, but before the message was delivered then the root process managed to set its trap_exit process flag, and linked to PidB. PidB then received its message, and killed itself, terminating with reason kill. A message was sent back to root, which then returned the exit reason as its result. Figure 2 shows an alternative trace, in which PidB dies before root creates a link to it, which generates an exit message with a different exit reason. The existence of these two different traces indicates a race condition when using spawn and link separately (which is the reason for the existence of an atomic spawn_link function in Erlang).

The diagrams help us to understand traces by gathering together all the events that affect one process into one box; in the original traces, these events may be scattered throughout the entire trace. But notice that the diagrams also abstract away from irrelevant information—specifically, the order in which messages are delivered to different processes, which is insignificant in Erlang. This abstraction is one strong reason why the diagrams are easier to understand than the traces they are generated from. However, we do need to know the order in which calls to functions with side-effects occur, even if they are made in different processes. To make this order visible, we add dotted black arrows to our diagrams, from one side-effecting call to the next. Figure 3 illustrates one possible execution of this program, in which two processes race to write to the same file:

write_race() ->
    Pid = spawn(fun() ->
                    file:write_file("a.txt","a")
                end),
    file:write_file("a.txt","b").

Figure 3. A race between two side-effects.

In this diagram, we can see that the write_file in the root process preceded the one in the spawned process write_race.Pid. If we draw these arrows between every side-effect and its successor, then our diagrams rapidly become very cluttered. However,
it is only necessary to indicate the sequencing of side-effects explicitly if their sequence is not already determined. For each pair of successive side-effect transitions, we thus compute Lamport's "happens before" relation (Lamport 1978) between them, and if this already implies that the first precedes the second, then we draw no arrow in the diagram. Interestingly, in our examples this eliminates the majority of such arrows, and those that remain tend to surround possible race conditions—where the message passing (synchronization) does not enforce a particular order of side-effects. Thus black dotted arrows are often a sign of trouble.
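A minimal sketch of that check (ours, assuming the trace has already been reduced to a set of directed edges between events, covering both process order and message delivery) is plain graph reachability:

%% Illustrative sketch: A happens before B if B is reachable from A
%% along process-order and message-delivery edges.
happens_before(A, B, Edges) ->
    reach(A, B, Edges, []).

reach(B, B, _Edges, _Seen) ->
    true;
reach(A, B, Edges, Seen) ->
    case lists:member(A, Seen) of
        true  -> false;
        false ->
            Succs = [Y || {X, Y} <- Edges, X =:= A],
            lists:any(fun(Y) -> reach(Y, B, Edges, [A | Seen]) end, Succs)
    end.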
message overtaking even on a single “many-core” processor; thus we consider it an advantage that our scheduler allows this behavior, and can provoke race conditions that it causes. It should be noted that exactly the same scenario can be triggered in an alternative way (without parallel processes and multicore!); namely if the BPid above is preempted between its call to ets:insert new and sending the cast-message. However, the likelihood for this is almost negligible, since the Erlang scheduler prefers running processes for relatively long time-slices. Using PULSE does not help triggering the scenario in this way either. PULSE is not in control at any point between ets:insert new and sending the cast-message, meaning that only the Erlang scheduler controls the execution. Therefore, the only feasible way to repeatedly trigger this faulty scenario is by delaying the cast-message by using PULSE (or a similar tool).
6.1 Analyzing the proc_reg race conditions

Interestingly, as we saw in Section 5.5, when we instrumented proc_reg and tested it using PULSE and QuickCheck, we obtained a different—even simpler—minimal failing test case, than the one we had previously discovered using QuickCheck with the built-in Erlang scheduler. Since we need to use PULSE in order to obtain a trace to analyze, then we must fix this bug first, and see whether that also fixes the first problem we discovered. The failing test we find using PULSE is this one:
6.2 A second race condition in proc_reg

Having corrected the bug in proc_reg we repeated the QuickCheck test. The property still fails, with the same minimal failing case that we first discovered (which is not so surprising since the problem that we fixed in the previous section cannot actually occur with today's VM). However, we were now able to reproduce the failure with PULSE, as well as the built-in scheduler. As a result, we could now analyze and debug the race condition. The failing case is:
{[{set,{var,9},{call,proc_reg_eqc,spawn,[]}},
  {set,{var,10},{call,proc_reg_eqc,kill,[{var,9}]}}],
 {[{set,{var,15},{call,proc_reg_eqc,reg,[c,{var,9}]}}],
  [{set,{var,12},{call,proc_reg_eqc,reg,[c,{var,9}]}}]}}
{[{set,{var,4},{call,proc_reg_eqc,spawn,[]}},
  {set,{var,7},{call,proc_reg_eqc,kill,[{var,4}]}},
  {set,{var,12},{call,proc_reg_eqc,reg,[b,{var,4}]}}],
 {[{set,{var,18},{call,proc_reg_eqc,reg,[b,{var,4}]}}],
  [{set,{var,21},{call,proc_reg_eqc,reg,[b,{var,4}]}}]}}
In this test case, we simply create a dead process (by spawning a process and then immediately killing it), and try to register it twice in parallel, and as it happens the first call to reg raises an exception. The diagram we generate is too large to include in full, but in Figure 4 we reproduce the part showing the problem. In this diagram fragment, the processes are, from left to right, the proc_reg server, the second parallel fork (BPid), and the first parallel fork (APid). We can see that BPid first inserted its argument into the ETS table, recording that the name c is now taken, then sent an asynchronous message to the server ({cast,{..}}) to inform it of the new entry. Thereafter APid tried to insert an ETS entry with the same name—but failed. After discovering that the process being registered is actually dead, APid sent a message to the server asking it to “audit” its entry ({call,{..},_,_})—that is, clean up the table by deleting the entry for a dead process. But this message was delivered before the message from BPid! As a result, the server could not find the dead process in its table, and failed to delete the entry created by BPid, leading APid’s second attempt to create an ETS entry to fail also—which is not expected to happen. When BPid’s message is finally received and processed by the server, it is already too late. The problem arises because, while the clients create “forward” ETS entries linking the registered name to a pid, it is the server which creates a “reverse” entry linking the pid to its monitoring reference (created by the server). It is this reverse entry that is used by the server when asked to remove a dead process from its tables. We corrected the bug by letting clients (atomically) insert two ETS entries into the same table: the usual forward entry, and a dummy reverse entry (lacking a monitoring reference) that is later overwritten by the server. This dummy reverse entry enables the server to find and delete both entries in the test case above, thus solving the problem. In fact, the current Erlang virtual machine happens to deliver messages to local mailboxes instantaneously, which means that one message cannot actually overtake another message sent earlier— the cause of the problem in this case. This is why this minimal failing test was not discovered when we ran tests on a multi-core, using the built-in scheduler. However, this behavior is not guaranteed by the language definition, and indeed, messages between nodes in a distributed system can overtake each other in this way. It is expected that future versions of the virtual machine may allow
In this test case we also create a dead process, but we try to register it once in the sequential prefix, before trying to register it twice in parallel. Once again, one of the calls to reg in the parallel branches raised an exception. Turning again to the generated diagram, which is not included in the paper for space reasons, we observed that both parallel branches (APid and BPid) fail to insert b into the ETS table. They fail since the name b was already registered in the sequential part of the test case, and the server has not yet processed the DOWN message generated by the monitor. Both processes then call where(b) to see if b is really registered, which returns undefined since the process is dead. Both APid and BPid then request an “audit” by the server, to clean out the dead process. After the audit, both processes assume that it is now ok to register b, there is a race condition between the two processes, and one of the registrations fails. Since this is not expected, an exception is raised. (Note that if b were alive then this would be a perfectly valid race condition, where one of the two processes successfully registers the name and the other fails, but the specification says that the registration should always return true for dead processes). This far into our analysis of the error it became clear that it is an altogether rather unwise idea ever to insert a dead process into the process registry. To fix the error we added a simple check (is_process_alive(Pid)) before inserting into the registry. The effect of this change on the performance turned out to be negligible, because is_process_alive is very efficient for local processes. After this change the module passed 20 000 tests, and we were satisfied.
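The shape of that fix is roughly the following (our paraphrase of the description above, with do_reg standing in for the original registration path; this is not the actual proc_reg source):

%% Sketch of the added guard (paraphrased from the text above).
reg(Name, Pid) ->
    case is_process_alive(Pid) of
        true  -> do_reg(Name, Pid);  % original registration code
        false -> true                % dead processes are never inserted
    end.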
7. Discussion and Related Work
Figure 4. A problem caused by message overtaking.

Actually, the "fix" just described does not really remove all possible race conditions. Since the diagrams made us understand the algorithm much better, we can spot another possible race condition: if APid and BPid try to register the same pid at the same time, and that process dies just after APid and BPid have checked that it is alive, then the same problem we have just fixed will arise. The reason that our tests succeeded even so, is that a test must contain three parallel branches to provoke the race condition in its new form—two processes making simultaneous attempts to register, and a third process to kill the pid concerned at the right moment. Because our parallel test cases only run two concurrent branches, then they can never provoke this behavior.

The best way to fix the last race condition problem in proc_reg would seem to be to simplify its API, by restricting reg so that a process may only register itself. This, at a stroke, eliminates the risk of two processes trying to register the same process at the same time, and guarantees that we can never try to register a dead process. This simplification was actually made in the production version of the code.
Parallelism in test cases

We could, of course, generate test cases with three, four, or even more concurrent branches, to test for this kind of race condition too. The problem is, as we explained in section 4.2, that the number of possible interleavings grows extremely fast with the number of parallel branches. The number of interleavings of K sequences of length N is shown in Figure 5.

Figure 5. Possible interleavings of parallel branches

          K=2      K=3      K=4      K=5
  N=1       2        6       24      120
  N=2       6       90     2520   113400
  N=3      20     1680   369600     10^8
  N=4      70    34650   6*10^7   3*10^11
  N=5     252   756756    10^10   6*10^14
  ...
  N=8   12870    10^10    10^17   8*10^24

The practical consequence is that, if we allow more parallel branches in test cases, then we must restrict the length of each branch correspondingly. The bold entries in the table show the last "feasible" entry in each column—with three parallel branches, we would need to restrict each branch to just three commands; with four branches, we could only allow two; with five or more branches, we could allow only one command per branch. This is in itself a restriction that will make some race conditions impossible to detect. Moreover, with more parallel branches, there will be even more possible schedules for PULSE to explore, so race conditions depending on a precise schedule will be correspondingly harder to find. There is thus an engineering trade-off to be made here: allowing greater parallelism in test cases may in theory allow more race conditions to be discovered, but in practice may reduce the probability of finding a bug with each test, while at the same time increasing the cost of each test. We decided to prioritize longer sequences over more parallelism in the test case, and so we chose K = 2. However, in the future we plan to experiment with letting QuickCheck randomly choose K and N from the set of feasible combinations. To be clear, note that K only refers to the parallelism in the test case, that is, the number of processes that make calls to the API. The system under test may have hundreds of processes running, many of them controlled by PULSE, independently of K.

The problem of detecting race conditions is well studied and can be divided into runtime detection, also referred to as dynamic detection, and analysis of the source code, so-called static detection. Most results refer to race conditions in which two threads or processes write to shared memory (data race condition), which in Erlang cannot happen. For us, a race condition appears if there are two schedules of occurring side effects (sending a message, writing to a file, trapping exits, linking to a process, etc.) such that in one schedule our model of the system is violated and in the other schedule it is not. Of course, writing to a shared ETS table and writing in shared memory is related, but in our example it is allowed that two processes call ETS insert in parallel. By the atomicity of insert, one will succeed, the other will fail. Thus, there is a valid race condition that we do not want to detect, since it does not lead to a failure. Even in this slightly different setting, known results on race conditions still indicate that we are dealing with a hard problem. For example, Netzer and Miller (1990) show for a number of relations on traces of events that ordering these events on 'could have been a valid execution' is an NP-hard problem (for a shared memory model). Klein et al. (2003) show that statically detecting race conditions is NP-complete if more than one semaphore is used. Thus, restricting eqc_par_statem to execute only two processes in parallel is a pragmatic choice. Three processes may be feasible, but real scalability is not in sight. This pragmatic choice is also supported by recent studies (Lu et al. 2008), where it is concluded that: "Almost all (96%) of the examined concurrency bugs are guaranteed to manifest if certain partial order between 2 threads is enforced."
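The counts in Figure 5 match the multinomial coefficient (K*N)!/(N!)^K; a small helper to reproduce the table (our addition, for checking the figures, not part of the tool):

%% Number of interleavings of K sequences of length N: (K*N)! / (N!)^K.
%% For example, interleavings(2,8) = 12870 and interleavings(3,3) = 1680.
interleavings(K, N) ->
    fact(K * N) div pow(fact(N), K).

fact(0) -> 1;
fact(M) -> M * fact(M - 1).

pow(_, 0) -> 1;
pow(X, E) -> X * pow(X, E - 1).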
nothing to guarantee that functions called in the same sequence will return the same values in different runs. The user still has to make sure that the state of the system is reset properly before each run. Note that the same arguments apply to QuickCheck testing; it is crucial for shrinking and re-testing that input is deterministic and thus it works well to combine QuickCheck and PULSE. False positives In contrast to many race finding methods, that try to spot common patterns leading to concurrency bugs, our approach does not produce false positives and not even does it show races that result in correct execution of the program. This is because we employ property-based testing and classify test cases based on whether the results satisfy correctness properties and report a bug only when a property is violated. Related tools Park and Sen (2008) study atomicity in Java. Their approach is similar to ours in that they use a random scheduler both for repeatability and increased probability of finding atomicity violations. However, since Java communication is done with shared objects and locks, the analysis is rather different. It is quite surprising that our simple randomized scheduler—and even just running tests on a multi-core—coupled with repetition of tests to reduce non-determinism, should work so well for us. After all, this can only work if the probability of provoking the race condition in each test that contains one is reasonably high. In contrast, race conditions are often regarded as very hard to provoke because they occur so rarely. For example, Sen used very carefully constructed schedules to provoke race conditions in Java programs (Sen 2008)—so how can we reasonably expect to find them just by running the same test a few times on a multi-core? We believe two factors make our simple approach to race detection feasible. • Firstly, Erlang is not Java. While there is shared data in Erlang
Hierarchical approach
programs, there is much less of it than in a concurrent Java program. Thus there are many fewer potential race conditions, and a simpler approach suffices to find them.
Note that our tools support a hierarchical approach to testing larger systems. We test proc_reg under the assumption that the underlying ets operations are atomic; PULSE does not attempt to (indeed, cannot) interleave the executions of single ETS operations, which are implemented by C code in the virtual machine. Once we have established that the proc_reg operations behave atomically, then we can make the same assumption about them when testing code that makes use of them. When testing for race conditions in modules that use proc_reg, then we need not, and do not want to, test for race conditions in proc_reg itself. As a result, the PULSE schedules remain short, and the simple random scheduling that we use suffices to find schedules that cause failures.
• Secondly, we are searching for race conditions during unit test-
ing, where each test runs for a short time using only a relatively small amount of code. During such short runs, there is a fair chance of provoking race conditions with any schedule. Finding race conditions during whole-program testing is a much harder problem. Chess, developed by Musuvathi et al. (2008), is a system that shares many similarities with PULSE. Its main component is a scheduler capable of running the program deterministically and replaying schedules. The key difference between Chess and PULSE is that the former attempts to do an exhaustive search and enumerate all the possible schedules instead of randomly probing them. Several interesting techniques are employed, including prioritizing schedules that are more likely to trigger bugs, making sure that only fair schedules are enumerated and avoiding exercising schedules that differ insignificantly from already visited ones.
Model Checking One could argue that the optimal solution to finding race conditions problem would be to use a model checker to explore all possible interleavings. The usual objections are nevertheless valid, and the rapidly growing state space for concurrent systems makes model checking totally infeasible, even with a model checker optimized for Erlang programs, such as McErlang (Fredlund and Svensson 2007). Further it is not obvious what would be the property to model check, since the atomicity violations that we search for can not be directly translated into an LTL model checking property.
Visualization Visualization is a common technique used to aid understanding software. Information is extracted statically from source code or dynamically from execution and displayed in graphical form. Of many software visualization tools a number are related to our work. Topol et al. (1995) developed a tool that visualizes executions of parallel programs and shows, among other things, a trace of messages sent between processes indicating the happened-before relationships. Work of Jerding et al. (1997) is able to show dynamic
Input non-determinism

PULSE provides deterministic scheduling. However, in order for tests to be repeatable we also need the external functions to behave consistently across repeated runs. While marking them as side-effects will ensure that they are only called serially, PULSE does
call-graphs of object-oriented programs and interaction patterns between their components. Arts and Fredlund (2002) describe a tool that visualizes traces of Erlang programs in form of abstract state transition diagrams. Artho et al. (2007) develop a notation that extends UML diagrams to also show traces of concurrent executions of threads, Maoz et al. (2007) create event sequence charts that can express which events “must happen” in all possible scenarios.
Cyrille Artho, Klaus Havelund, and Shinichi Honiden. Visualization of concurrent program executions. In COMPSAC ’07: Proc. of the 31st Annual International Computer Software and Applications Conference, pages 541–546, Washington, DC, USA, 2007. IEEE Computer Society. ˚ Fredlund. Trace analysis of Erlang programs. Thomas Arts and Lars-Ake SIGPLAN Notices, 37(12):18–24, 2002. Thomas Arts, John Hughes, Joakim Johansson, and Ulf Wiger. Testing Telecoms Software with Quviq QuickCheck. In ERLANG ’06: Proc. of the 2006 ACM SIGPLAN workshop on Erlang. ACM, 2006. Koen Claessen and John Hughes. QuickCheck: a lightweight tool for random testing of Haskell programs. In ICFP ’00: Proc. of the fifth ACM SIGPLAN international conference on Functional programming, pages 268–279, New York, NY, USA, 2000. ACM. Mats Cronqvist. Troubleshooting a large Erlang system. In ERLANG ’04: Proc. of the 2004 ACM SIGPLAN workshop on Erlang, pages 11–15, New York, NY, USA, 2004. ACM. ˚ Fredlund and Hans Svensson. McErlang: a model checker for Lars-Ake a distributed functional programming language. SIGPLAN Not., 42(9): 125–136, 2007. Emden R. Gansner and Stephen C. North. An open graph visualization system and its applications. Software - Practice and Experience, 30: 1203–1233, 1999. M. P. Herlihy and J. M. Wing. Axioms for concurrent objects. In POPL ’87: Proc. of the 14th ACM SIGACT-SIGPLAN symposium on Principles of Prog. Lang., pages 13–26, New York, NY, USA, 1987. ACM. John Hughes. QuickCheck Testing for Fun and Profit. In 9th Int. Symp. on Practical Aspects of Declarative Languages. Springer, 2007. Dean F. Jerding, John T. Stasko, and Thomas Ball. Visualizing interactions in program executions. In In Proc. of the 19th International Conference on Software Engineering, pages 360–370, 1997. Klein, Lu, and Netzer. Detecting race conditions in parallel programs that use semaphores. Algorithmica, 35:321–345, 2003. Leslie Lamport. How to make a multiprocessor computer that correctly executes multiprocess programs. IEEE Transactions on Computers, 28 (9):690–691, 1979. Leslie Lamport. Time, clocks, and the ordering of events in a distributed system. Commun. ACM, 21(7):558–565, 1978. Shan Lu, Soyeon Park, Eunsoo Seo, and Yuanyuan Zhou. Learning from mistakes: a comprehensive study on real world concurrency bug characteristics. SIGARCH Comput. Archit. News, 36(1):329–339, 2008. Shahar Maoz, Asaf Kleinbort, and David Harel. Towards trace visualization and exploration for reactive systems. In VLHCC ’07: Proc. of the IEEE Symposium on Visual Languages and Human-Centric Computing, pages 153–156, Washington, DC, USA, 2007. IEEE Computer Society. Madanlal Musuvathi, Shaz Qadeer, Thomas Ball, G´erard Basler, Piramanayagam Arumuga Nainar, and Iulian Neamtiu. Finding and reproducing heisenbugs in concurrent programs. In OSDI, pages 267–280, 2008. Robert H. B. Netzer and Barton P. Miller. On the complexity of event ordering for shared-memory parallel program executions. In In Proc. of the 1990 Int. Conf. on Parallel Processing, pages 93–97, 1990. Chang-Seo Park and Koushik Sen. Randomized active atomicity violation detection in concurrent programs. In SIGSOFT ’08/FSE-16: Proc. of the 16th ACM SIGSOFT International Symposium on Foundations of software engineering, pages 135–145, New York, NY, USA, 2008. ACM. Koushik Sen. Race directed random testing of concurrent programs. SIGPLAN Not., 43(6):11–21, 2008. ˚ Fredlund. A more accurate semantics for distributed H. Svensson and L.-A. Erlang. In Erlang ’07: Proc. 
of the 2007 SIGPLAN Erlang Workshop, pages 43–54, New York, NY, USA, 2007. ACM. B. Topol, J.T. Stasko, and V. Sunderam. Integrating visualization support into distributed computing systems. Proc. of the 15th Int. Conf. on: Distributed Computing Systems, pages 19–26, May-Jun 1995. Ulf T. Wiger. Extended process registry for Erlang. In ERLANG ’07: Proc. of the 2007 SIGPLAN workshop on ERLANG Workshop, pages 1–10, New York, NY, USA, 2007. ACM.
Conclusions
Concurrent code is hard to debug and therefore hard to get correct. In this paper we presented an extension to QuickCheck, a user-level scheduler for Erlang (PULSE), and a tool for visualizing concurrent executions that together help in debugging concurrent programs. The tools allow us to find concurrency errors at the module-testing level; industrial experience is that most such errors slip through to system-level testing, because the standard scheduler is deterministic but behaves differently in different timing contexts.
We contributed eqc_par_statem, an extension of the QuickCheck state machine library that enables parallel execution of a sequence of commands. We generate a sequential prefix to bring the system into a given state and then execute a suffix of independent commands in parallel. As a result we can provoke concurrency errors and at the same time get good shrinking behavior from the test cases.
We contributed PULSE, a user-level scheduler that can schedule any concurrent Erlang program in such a way that an execution can be repeated deterministically. By randomly choosing different schedules, we are able to explore more execution paths than without such a scheduler. In combination with QuickCheck, the repeatability of test cases in addition yields even better shrinking behavior.
We contributed a graph visualization method and tool that enabled us to analyze concurrency faults much more easily than when we had to stare at the produced traces. The visualization tool depends on the output produced by PULSE, but using the "happens before" relation to simplify the graph is a general principle.
We evaluated the tools on a real industrial case study and detected two race conditions. The first was found using eqc_par_statem alone; the fault had been noticed before, but we no longer needed to instrument the code under test with yield() commands. Both race conditions could easily be provoked by using PULSE. The traces recorded by PULSE were visualized and helped us clearly identify the sources of the two race conditions. By analyzing the graphs we could even identify a third possible race condition, which we could provoke by allowing three instead of two parallel processes in eqc_par_statem.
Our contributions help Erlang software developers get their concurrent code right and enable them to ship technologically more advanced solutions. Products that might otherwise have remained prototypes, because they were neither fully understood nor tested enough, can now make it into production. PULSE and the visualization tool are available under the Simplified BSD License, and a commercially supported version is part of Quviq QuickCheck.
Acknowledgments This research was sponsored by EU FP7 Collaborative project ProTest, grant number 215868.
Partial Memoization of Concurrency and Communication
Lukasz Ziarek    KC Sivaramakrishnan    Suresh Jagannathan
Department of Computer Science Purdue University {lziarek,chandras,suresh}@cs.purdue.edu
Abstract
Memoization is a well-known optimization technique used to eliminate redundant calls for pure functions. If a call to a function f with argument v yields result r, a subsequent call to f with v can be immediately reduced to r without the need to re-evaluate f ’s body. Understanding memoization in the presence of concurrency and communication is significantly more challenging. For example, if f communicates with other threads, it is not sufficient to simply record its input/output behavior; we must also track inter-thread dependencies induced by these communication actions. Subsequent calls to f can be elided only if we can identify an interleaving of actions from these call-sites that lead to states in which these dependencies are satisfied. Similar issues arise if f spawns additional threads. In this paper, we consider the memoization problem for a higher-order concurrent language whose threads may communicate through synchronous message-based communication. To avoid the need to perform unbounded state space search that may be necessary to determine if all communication dependencies manifest in an earlier call can be satisfied in a later one, we introduce a weaker notion of memoization called partial memoization that gives implementations the freedom to avoid performing some part, if not all, of a previously memoized call. To validate the effectiveness of our ideas, we consider the benefits of memoization for reducing the overhead of recomputation for streaming, server-based, and transactional applications executed on a multi-core machine. We show that on a variety of workloads, memoization can lead to substantial performance improvements without incurring high memory costs.
Categories and Subject Descriptors D.3.3 [Language Constructs and Features]: Concurrent programming structures; D.1.3 [Concurrent Programming]; D.3.1 [Formal Definitions and Theory]: Semantics
General Terms Design, Experimentation, Languages, Performance, Theory, Algorithms
Keywords Concurrent Programming, Partial Memoization, Software Transactions, Concurrent ML, Multicore Systems

1. Introduction

Eliminating redundant computation is an important optimization supported by many language implementations. One important instance of this optimization class is memoization (Liu and Teitelbaum 1995; Pugh and Teitelbaum 1989; Acar et al. 2003), a well-known dynamic technique that can be used to avoid performing a function application by recording the arguments and results of previous calls. If a call is supplied an argument that has been previously cached, the execution of the function body can be elided and the corresponding result returned immediately.
When functions perform effectful computations, leveraging memoization becomes significantly more challenging. Two calls to a function f that performs some stateful computation need not generate the same result if the contents of the state f uses to produce its result differ at the two call-sites. Concurrency and communication introduce similar complications. If a thread calls a function f that communicates with functions invoked in other threads, then the memo information recorded with f must include the outcome of these actions. If f is subsequently applied with a previously seen argument, and its communication actions at this call-site are the same as its effects at the original application, re-evaluation of the pure computation in f's body can be avoided. Because of thread interleavings, synchronization, and the non-determinism introduced by scheduling choices, making such decisions is non-trivial.
Nonetheless, we believe memoization can be an important component of a concurrent programming language runtime. Our belief is reinforced by the widespread emergence of multi-core platforms, and renewed interest in streaming (Gordon et al. 2006), speculative (Pickett and Verbrugge 2005) and transactional (Harris and Fraser 2003; Adl-Tabatabai et al. 2006) abstractions to program these architectures. For instance, optimistic concurrency abstractions rely on efficient control and state restoration mechanisms. When a speculation fails because a previously available computation resource becomes unavailable, or when a transaction aborts due to a serializability violation and must be retried (Harris et al. 2005), their effects must be undone. Failure represents wasted work, both in terms of the operations performed whose effects must now be erased, and in terms of the overheads incurred to implement state restoration; these overheads include logging costs, read and write barriers, contention management, etc. One way to reduce this overhead is to avoid re-executing those function calls previously executed by the failed computation whose results are unchanged. The key issue is understanding when utilizing memoized information is safe, given the possibility of concurrency, communication, and synchronization among threads.
In this paper, we consider the memoization problem for a higher-order concurrent language in which threads communicate through synchronous message-passing primitives (e.g. Concurrent
ML (Reppy 1999)). A synchronization event acknowledges the existence of an external action performed by another thread willing to send or receive data. If such events occur within a function f whose applications are memoized, then avoiding re-execution at a call-site c is only possible if these actions are guaranteed to succeed at c. In other words, using memo information requires discovery of interleavings that satisfy the communication constraints imposed by a previous call. If we can identify a global state in which these constraints are satisfied, the call at c can be avoided; if there exists no such state, then the call must be performed.
Because finding such a state can be expensive (it may require an unbounded state-space search), we consider a weaker notion of memoization: by recording the context in which a memoization constraint was generated, implementations can always choose to simply resume execution of the function at the program point associated with the constraint, using the saved context. In other words, rather than requiring global execution to reach a state in which all constraints in a memoized application are satisfied, partial memoization gives implementations the freedom to discharge some fraction of these constraints, performing the rest of the application as normal. Although our description and formalization are developed in the context of message-based communication, the applicability of our solution naturally extends to shared-memory communication as well, given the simple encoding of the latter in terms of the former (Reppy 1999).
Whenever a constraint built during memoization is discharged on a subsequent application, a side effect on the global state occurs. For example, consider a communication constraint associated with a memoized version of a function f that expects a thread T to receive data d on channel c. To use this information at a subsequent call, we must identify the existence of T and, having done so, must propagate d along c for T to consume. Thus, whenever a constraint is satisfied, an effect that reflects the action represented by that constraint is performed. We consider the set of constraints built during memoization as forming an ordered log, with each entry in the log representing a condition that must be satisfied to utilize the memoized version, and an effect that must be performed if the condition holds. The point of memoization for our purposes is thus to avoid performing the pure computations that execute between these effectful operations.
1.1 Contributions
Besides providing a formal characterization of these ideas, we also present a performance evaluation of two parallel benchmarks. We consider the effect of memoization on improving the performance of multi-threaded CML applications executing on a multi-core architecture. Our results indicate that memoization can lead to substantial runtime performance improvement over a non-memoized version of the same program, with only modest increases in memory overhead (15% on average). To the best of our knowledge, this is the first attempt to formalize a memoization strategy for effectful higher-order concurrent languages, and to provide an empirical evaluation of its impact on improving wall-clock performance for multi-threaded workloads.
The paper is organized as follows. The programming model is defined in Section 2. Motivation for the problem is given in Section 3. The formalization of our approach is given in Sections 4 and 5. A detailed description of our implementation and our results are given in Sections 6 and 7. We discuss previous work and provide conclusions in Section 8.
2. Programming Model
Our programming model is a simple synchronous message-passing dialect of ML similar to CML. Threads communicate using dynamically created channels through which they produce and consume values. Since communication is synchronous, a thread wishing to communicate on a channel that has no ready recipient must block until one exists, and all communication on channels is ordered. Our formal treatment does not consider references, but handling them introduces no additional complications; our implementation supports all of Standard ML.
In this context, deciding whether a function application can be avoided based on previously recorded memo information depends upon the value of its arguments, its communication actions, the channels it creates, the threads it spawns, and the return value it produces. Thus, the memoized result of a call to a function f can be used at a subsequent call if (a) the argument given matches the argument previously supplied; (b) recipients for values sent by f on channels in an earlier memoized call are still available on those channels; (c) values consumed by f on channels in an earlier call are again ready to be sent by other threads; (d) channels created in an earlier call have the same actions performed on them; and (e) threads created by f can be spawned with the same arguments supplied in the memoized version. Ordering constraints on all sends and receives performed by the procedure must also be enforced.
A successful application of a memoized call yields a new state in which the effects captured within the constraint log have been performed; thus, the values sent by f are received by waiting recipients, senders on channels from which f expects to receive values propagate these values on those channels, and the channels and threads that f is expected to create are created.
To avoid making a call, a send action performed within the applied function, for example, will need to be paired with a receive operation executed by some other thread. Unfortunately, there may be no thread currently scheduled that is waiting to receive on this channel. Consider an application that calls a memoized function f which (a) creates a thread T that receives a value on channel c, and (b) sends a value on c, computed through values received on other channels, that is then consumed by T. To safely use the memoized return value for f nonetheless still requires that T be instantiated, and that the communication events executed in the first call can still be satisfied (e.g., the values f previously read on other channels are still available on those channels). Ensuring these actions can succeed may involve an exhaustive exploration of the execution state space to derive a schedule that allows us to consider the call in the context of a global state in which these conditions are satisfied. Because such an exploration may be infeasible in practice, our formulation also supports partial memoization. Rather than requiring global execution to reach a state in which all constraints in a memoized application are satisfied, partial memoization gives implementations the freedom to discharge some fraction of these constraints, performing the rest of the application as normal.
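As a small, self-contained illustration of this model (ours, not a listing from the paper), the fragment below creates a channel, spawns a producer thread, and synchronously receives the value it sends; it uses the same unqualified CML operations (channel, spawn, send, recv) that appear in the paper's listings and assumes it is run under a CML runtime.

    (* A minimal sketch of the programming model: lightweight threads
       communicating over a dynamically created synchronous channel. *)
    fun echoOnce () : int =
      let
        val c = channel ()                       (* dynamically created channel       *)
        val _ = spawn (fn () => send (c, 42))    (* producer thread                   *)
      in
        recv c                                   (* blocks until paired with the send *)
      end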
3. Motivation
As a motivating example, we consider how memoization can be profitably utilized in a concurrent message-passing red-black tree implementation. The data structure supports concurrent insertion, deletion, and traversal operations. A node in the tree is a tuple containing the node's color, an integer value, and links to its left and right children. Associated with every node is a thread that reads from an input channel and outputs the node's data on an output channel, effectively encoding a server. Accessing and modifying a tree node's data is thus accomplished through a communication protocol with the node's input and output channels. Abstractly, a read corresponds to a receive operation (recv) from a node's output channel, and a write to a send (send) on a node's input channel.
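The read/write protocol can be phrased as two one-line helpers. The sketch below is ours (writeT is a hypothetical name; readT mirrors the paper's recvT) and uses the rbtree representation defined in the listing that follows.

    fun readT  (N {cOut = out, cIn = _})      = recv out       (* read:  receive the node's current data *)
    fun writeT (N {cOut = _,  cIn = inp}, n)  = send (inp, n)  (* write: send a replacement node value   *)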
When a node is initially created, we instantiate two new channels (channel()) and spawn a server. The function server, the concrete representation of a node, chooses between accepting a communication on the incoming channel cIn (corresponding to a write operation) and sending a communication on the outgoing channel cOut (corresponding to a read operation), synchronizes on the chosen communication, and then updates the server through a recursive call.
    datatype rbtree = N of {cOut: node chan, cIn: node chan}
    and node = Node of {color: color, value: int, left: rbtree, right: rbtree}
             | EmptyNode

    fun node (c: color, v: int, l: rbtree, r: rbtree) : rbtree =
      let val (cOut, cIn) = (channel(), channel())
          val node = Node {color=c, value=v, left=l, right=r}
          fun server (node) =
            sync (choose [wrap (sendEvt (cOut, node), fn x => server (node)),
                          wrap (recvEvt (cIn), fn x => server (x))])
      in spawn (fn () => server (node));
         N {cOut=cOut, cIn=cIn}
      end
Figure 1. Memoization can be used to avoid redundant tree traversals. In this example, two threads traverse a red-black tree. Each node in the tree is represented as a process that outputs the current value of the node, and inputs new values. The shaded path illustrates memoization potential.
For example, the procedure contains queries the tree to determine whether a node containing its integer argument is present. It takes as input the number being searched for and the root of the tree from which to begin the traversal. The recvT operation reads the value from each node and, based on this value, the auxiliary function check navigates the tree:
    fun recvT (N {cOut, cIn}: rbtree) = recv (cOut)

    fun contains (n: int, tree: rbtree) : bool =
      let fun check (n, tree) =
            (case recvT (tree) of
                 Node {color, value, left, right} =>
                   (case Int.compare (value, n) of
                        EQUAL   => true
                      | GREATER => check (n, left)
                      | LESS    => check (n, right))
               | EmptyNode => false)
      in check (n, tree)
      end
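A brief usage sketch follows; it is ours, not from the paper. The color datatype and the leaf helper (which wraps EmptyNode in its own server, mirroring the node constructor above) are assumptions, and the declarations are meant to be evaluated under a CML runtime.

    datatype color = Red | Black

    fun leaf () : rbtree =
      let val (cOut, cIn) = (channel(), channel())
          fun server n = sync (choose [wrap (sendEvt (cOut, n), fn _ => server n),
                                       wrap (recvEvt (cIn),     fn x => server x)])
      in spawn (fn () => server EmptyNode);
         N {cOut=cOut, cIn=cIn}
      end

    val tree  = node (Black, 76, node (Red, 48, leaf (), leaf ()), leaf ())
    val found = contains (76, tree)    (* should evaluate to true  *)
    val miss  = contains (99, tree)    (* should evaluate to false *)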
Figure 2. Even if the constraints stored in a memoized call cannot be fully satisfied at a subsequent call, it may be possible to discharge some fraction of these constraints, retaining some of the optimization benefits memoization affords.
Because of the actions of T2, subsequent calls to contains with argument 79 cannot avail of the information recorded during evaluation of the initial memoized call. As a consequence of T2’s update, a subsequent traversal by T1 would observe a change to the tree. Note however, that even though the immediate parent of 79 has changed in color, the path leading up to node 85 has not (see Fig. 2). By leveraging partial memoization on the earlier memoized call to contains , a traversal attempting to locate node 79 can avoid traversing this prefix, if not the entire path. Notice that if the node with value 85 was later recolored, and assuming structural equality of nodes is used to determine memoization feasibility, full memoization would once again be possible.
Memoization can be leveraged to avoid redundant traversals of the tree. Consider the red/black tree shown in Fig. 1. Triangles represent abstracted portions of the tree, red nodes are unbolded, and black nodes are marked as bold circles. Suppose we memoize the call to contains which finds the node containing the value 79. Whenever a subsequent call to contains attempts to find the node containing 79, the traversal can directly use the result of the previous memoized call if both the structure of the tree along the path to the node, and the values of the nodes on the path remain the same. Both of these properties are reflected in the values transmitted by node processes. The path is depicted by shading relevant nodes in gray in Fig. 1. More concretely, we can avoid recomputing the traversal if communication with node processes remains unchanged. Informally, memo information associated with a function f can be leveraged to avoid subsequent applications of f if communication actions performed by the memoized call can be satisfied in these later applications. Thus, to successfully leverage memo information for a call to contains with input 79, we would need to ensure a subsequent call of the function with the same argument would be able to receive the sequence of node values: (red, 48), (black, 76), (red, 85), and (black, 79) in that order during a traversal. In Fig. 1 thread T1 can take advantage of memoization, while thread T2 subsequently recolors the node containing 85.
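To make the recorded information concrete, the sketch below shows one possible shape for the log of communications that such a memoized traversal would have to replay; the record type and field names are ours (the actual representation lives inside the modified CML runtime described later), and the channel labels are purely illustrative.

    (* one logged communication: a receive of a (color, value) pair from a node server *)
    type entry = {action : string, color : string, value : int}

    val containsMemo : entry list =
      [ {action = "recv", color = "red",   value = 48},
        {action = "recv", color = "black", value = 76},
        {action = "recv", color = "red",   value = 85},
        {action = "recv", color = "black", value = 79} ]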
As this example suggests, a key requirement for effectively utilizing memoized function applications is the ability to track communication (and other effectful) actions performed by previous instantiations. Provided that the global state would permit these same actions (or some subset thereof) to succeed if a function is re-executed with the same inputs, memoization can be employed to avoid recomputing applications, or to reduce the amount of the application that is executed. We note that although the example presented dealt with memoization of a function that operates over base integer values, our solution detailed in the following sections considers memoization in the context of any value that can appear on a channel, including closures, vectors, and datatype instances.
4. Syntax and Semantics
Our semantics is defined in terms of a core call-by-value functional language with threading and communication primitives. Communication between threads is achieved using synchronous channels. For perspicuity, we first present a simple multi-threaded language with synchronous channel-based communication. We then extend this core language with memoization primitives, and subsequently consider refinements of this semantics to support partial memoization. Although the core language has no support for selective communication, extending it to support choice is straightforward: memoization would simply record the result of the choice, and replay would only be possible if the recorded choice was satisfiable.
In the following, we write ᾱ to denote a sequence of zero or more elements, β.ᾱ to denote sequence concatenation, and ∅ to denote an empty sequence. Metavariables x and y range over variables, t ranges over thread identifiers, l ranges over channels, v ranges over values, and α, β denote tags that label individual actions in a program's execution. We use P to range over program states, E for evaluation contexts, and e for expressions.
Our communication model is a message-passing system with synchronous send and receive operations. We do not impose a strict ordering of communications on channels; communication actions on the same channel by different threads are paired non-deterministically. To model asynchronous sends, we simply spawn a thread to perform the send. Spawning an expression (that evaluates to a thunk) creates a new thread in which the application of the thunk is performed.

4.1 Language

The syntax and semantics of the language are given in Fig. 3. Expressions are either variables, locations that represent channels, λ-abstractions, function applications, thread creation operations, or communication actions that send and receive messages on channels. We do not consider references in this core language as they can be modeled in terms of operations on channels (Reppy 1999). A thread context t[E[e]] denotes an expression e available for execution by a thread with identifier t within context E. Evaluation is specified via a relation (⟼) that maps a program state (P) to another program state. Evaluation rules are applied up to commutativity of parallel composition (∥). An evaluation step is marked with a tag that indicates the action performed by that step. As shorthand, we write P ⟼^ᾱ P′ to represent the sequence of actions ᾱ that transforms P to P′.
Application (rule App) substitutes the argument value for free occurrences of the parameter in the body of the abstraction, and channel creation (rule Channel) results in the creation of a new location that acts as a container for message transmission and reception. A spawn action (rule Spawn), given an expression e that evaluates to a thunk, changes the global state to include a new thread in which the thunk is applied. A communication event (rule Comm) synchronously pairs a sender attempting to transmit a value along a specific channel in one thread with a receiver waiting on the same channel in another thread.

4.2 Partial Memoization

The core language presented above provides no facilities for memoization of the functions it executes. To support memoization, we must record, in addition to argument and return values, synchronous communication actions, thread spawns, channel creation, etc. as part of the memoized state. These actions define a log of constraints (C) that must be satisfied at subsequent applications of a memoized function, and whose associated effects must be discharged if the constraint is satisfied. To record constraints, we augment our semantics to include a memo store (σ), a map that, given a function identifier and an argument value, returns the set of constraints and the result value previously recorded for a call to that function with that argument. If the set of constraints returned by the memo store is satisfied in the current state (and their effects performed), then the return value can be used and the application elided. The memo store contains only one function/value pair for simplicity of the presentation; we can envision extending the memo store to contain multiple memoized versions of a function based on its arguments or constraints. We utilize two thread contexts, t[e] and tC[e]: the former indicates that evaluation of terms should be captured within the memo store, and the latter indicates that previously recorded constraints should be discharged. We elaborate on their use below.
The definition of the language augmented with memoization support is given in Fig. 4. We now define evaluation using a new relation (⟹) defined over two global configurations. In one case, a configuration consists of a program state (P) and a memo store (σ); this configuration is used when evaluation does not leverage memoized information. The second configuration is defined in terms of a thread id and constraint sequence pair ((t, C)), a program state (P), and a memo store (σ); transitions use this configuration when discharging constraints recorded from a previous memoized application. In addition, a thread state is augmented to hold an additional structure: the memo state (θ) records the function identifier (δ), the argument (v) supplied to the call, the context (E) in which the call is performed, and the sequence of constraints (C) built during the evaluation of the application being memoized.
Constraints built during a memoized function application define actions that must be satisfied at subsequent call-sites in order to avoid complete re-evaluation of the function body. For a communication action, a constraint records the location being operated upon, the value sent or received, the action performed (R for receive and S for send), and the continuation immediately prior to the action being performed; the application resumes evaluation from this point if the corresponding constraint could not be discharged. For a spawn operation, the constraint records the action (Sp), the expression being spawned, and the continuation immediately prior to the spawn. For a channel creation operation (Ch), the constraint records the location of the channel and the continuation immediately prior to the creation operation. Returns are also modeled as constraints (Rt, v), where v is the return value of the application being memoized.
Consider an application of function f to value v that has been memoized. Since subsequent calls to f with v may not be able to discharge all constraints, we need to record the program points for all communication actions within f that represent potential resumption points from which normal evaluation of the function body proceeds; these continuations are recorded as part of any constraint that can fail¹ (communication actions, and return constraints as described below). But, since the calling contexts at these other call-sites are different from the original, we must be careful not to include them within the saved continuations recorded within these constraints. Thus, the contexts recorded as part of a saved constraint during memoization only define the continuation of the action up to the return point of the function; the memo state (θ) stores the evaluation context representing the caller's continuation. This context is restored once the application completes (see rule Ret).

¹ We also record continuations on non-failing constraints; while not strictly necessary, doing so simplifies our correctness proofs.
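The constraint forms just described can be pictured as an ML datatype. The sketch below is ours and is deliberately simplified (channels are specialised to integers and continuations are represented as thunks rather than evaluation contexts); it is an illustration, not the formal definition from Fig. 4.

    datatype dir = R | S                                       (* receive / send *)

    datatype constr =
        Comm  of dir * int CML.chan * int * (unit -> unit)     (* location, value, resumption point *)
      | Spawn of (unit -> unit) * (unit -> unit)                (* spawned thunk, resumption point   *)
      | Chan  of int CML.chan * (unit -> unit)                  (* created channel, resumption point *)
      | Ret   of int                                            (* return value of the memoized call *)

    (* the ordered log built while one application of a memoized function runs *)
    type memoLog = constr list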
Syntax:
    P ::= P ∥ P | t[e]
    e ∈ Exp ::= x | y | v | e(e) | spawn(e) | mkCh() | send(e, e) | recv(e)
    v ∈ Val ::= unit | λx.e | l

Program states:
    P ∈ Process,  t ∈ Tid,  x, y ∈ Var,  l ∈ Channel,  α, β ∈ Tag = {App, Ch, Spn, Com}

Evaluation contexts:
    E ::= [ ] | E(e) | v(E) | spawn(E) | send(E, e) | send(l, E) | recv(E)

(App)      P ∥ t[E[λx.e (v)]]  ⟼^App  P ∥ t[E[e[v/x]]]

(Channel)  if l fresh then  P ∥ t[E[mkCh()]]  ⟼^Ch  P ∥ t[E[l]]

(Spawn)    if t′ fresh then  P ∥ t[E[spawn(λx.e)]]  ⟼^Spn  P ∥ t[E[unit]] ∥ t′[e[unit/x]]

(Comm)     if P = P′ ∥ t[E[send(l, v)]] ∥ t′[E′[recv(l)]] then  P  ⟼^Com  P′ ∥ t[E[unit]] ∥ t′[E′[v]]
Figure 3. A concurrent language with synchronous communication.

If function f calls function g, then actions performed by g must be satisfiable in any call that attempts to leverage the memoized version of f. Consider the following program fragment:

    let fun f(...) = ...
            let fun g(...) = ... send(c,v) ...
            in ... end
    in ... g(...) ... end
If we encounter an application of f after it has been memoized, then g's communication action (i.e., the send of v on c) must be satisfiable at the point of the application to avoid performing the call. We therefore associate a call stack of constraints (θ) with each thread that defines the constraints seen thus far, requiring the constraints computed for an inner application to be satisfiable for any memoization of an outer one. The propagation of constraints to the memo states of all active calls is given by the constraint-addition operation shown in Fig. 4.
Channels created within a memoized function must be recorded in the constraint sequence for that function (rule Channel). Consider a function that creates a channel and subsequently initiates communication on that channel: if a call to this function was memoized, later applications that attempt to avail of memo information must still ensure that the generative effect of creating the channel is not omitted. Function evaluation now associates a label with each function that is used to index the memo store (rule Fun). If a new thread is spawned within a memoized application, a spawn constraint is added to the memo state, and a new global state is created that starts memoization of the actions performed by the newly spawned thread (rule Spawn). A communication action performed by two functions currently being memoized is also appropriately recorded in the corresponding memo states of the threads that are executing these functions (rule Comm). When a memoized application completes, its constraints are recorded in the memo store (rule Ret).
When a function f is applied to argument v, and there exists no previous invocation of f applied to v recorded in the memo store, the function's effects are tracked and recorded (rule App). Until an application of a function being memoized is complete, the constraints induced by its evaluation are not immediately added to the memo store. Instead, they are maintained as part of the memo state (θ) associated with the thread in which the application occurs.
The most interesting rule is the one that deals with determining how much of an application of a memoized function can be elided (rule MemoApp). If an application of function f with argument v has been recorded in the memo store, then the application can be potentially avoided; if not, its evaluation is memoized by rule App. If a memoized call is applied, we must examine the set of associated constraints that can be discharged. To do so, we employ an auxiliary relation ℑ defined in Fig. 5. Abstractly, ℑ checks the global state to determine which communication, channel creation, and spawn creation constraints (the possible effectful actions in our language) can be satisfied, and returns a set of failed constraints, representing those actions that could not be satisfied. The thread context (tC[e]) is used to signal the utilization of memo information; the failed constraints are added to the original thread context. Rule MemoApp yields a new global configuration whose thread id and constraint sequence ((t, C)) correspond to the constraints satisfiable in the current global state (defined as C″) for thread t as determined by ℑ. These constraints, when discharged, will leave the thread performing the memoized call in a new state in which the evaluation of the call is the expression associated with the first failed constraint returned by ℑ. As we describe below in Sec. 4.3, there is always at least one such constraint, namely Rt, the return constraint that holds the return value of the memoized call.
We also introduce a rule to allow the completion of memo information use (rule EndMemo). The rule installs the continuation of the first currently unsatisfied constraint; no further constraints are subsequently examined. In this formulation, the other failed constraints are simply discarded; we present an extension of this semantics in Section 4.6 that makes use of them.
4.3 Constraint Matching
The constraints built as a result of evaluating these rules are discharged by the rules shown in Fig. 6. Each rule in Fig. 6 is defined with respect to a thread id and constraint sequence. Thus, at any given point in its execution, a thread is either building up memo constraints (i.e., the thread is of the form t[e]) within an application for subsequent calls to utilize, or attempting to discharge these constraints (i.e., the thread is of the form tC [e]) for applications indexed in the memo store. The function ℑ leverages the evaluation rules defined in Fig. 6 by examining the global state and determining which constraints can be discharged, except for the return constraint. ℑ takes a constraint set (C) and a program state (P) and returns a sequence of unmatchable constraints (C0 ). Send and receive constraints are matched with threads blocked in the global program state on the opposite communication action. Once a thread has been matched with a constraint it is no longer a candidate for future communication since its communication action is consumed by the constraint. This guarantees that
the candidate function will communicate at most once with each thread in the global state. Although a thread may in fact be able to communicate more than once with the candidate function, determining this requires arbitrary look-ahead and is infeasible in practice. We discuss the implications of this restriction in Section 4.5.

[Figure 4 appears here; its typeset inference rules (the extended syntax and program states, constraint addition, and the rules Channel, Fun, Spawn, Comm, Ret, App, MemoApp, and EndMemo) are not reproduced in this text.]

Figure 4. A concurrent language supporting memoization of synchronous communication and dynamic thread creation.
Thus, a spawn constraint (rule MSpawn) is always satisfied, and leads to the creation of a new thread of control. Observe that the application evaluated by the new thread is now a candidate for memoization if the thunk was previously applied and its result is recorded in the memo store. A channel constraint of the form ((Ch, l), E[e]) (rule MCh) creates a new channel location l′, and replaces all occurrences of l found in the remaining constraint sequence for this thread with l′; the channel location may be embedded within send and receive constraints, either as the target of the operation, or as the argument value being sent or received. Thus, discharging a channel constraint ensures that the effect of creating a new channel performed within an earlier memoized call is preserved on subsequent applications. The renaming operation ensures that later send and receive constraints refer to the new channel location. Both channel creation and thread creation actions never fail: they modify the global state with a new thread and channel, respectively, but impose no preconditions on the state in order for these actions to succeed.
There are two communication constraint matching rules (labeled MCom), both of which may indeed fail. If the current constraint expects to receive value v on channel l, and there exists a thread able to send v on l, evaluation proceeds to a state in which the communication succeeds (the receiving thread now evaluates in a context in which the receipt of the value has occurred), and the constraint is removed from the set of constraints that need to be matched (rule MRecv). Note also that the sender records the fact that a communication with a matching receive took place in the thread's memo state, and the receiver does likewise. Any memoization of the sender must consider the receive action that synchronized with the send, and the application in which the memoized call is being examined must record the successful discharge of the receive action. In this way, the semantics permits consideration of multiple nested memoization actions. If the current constraint expects to send a value v on channel l, and there exists a thread waiting on l, the constraint is also satisfied (rule MSend). A send operation can match with any waiting receive action on that channel. The semantics of synchronous communication allows us the freedom to consider pairings of sends with receives other than the one it communicated with in the original memoized execution; this is because a receive action places no restriction on either the value it reads or the specific sender that provides the value. If there is no matching receiver, the constraint fails.
Observe that there is no evaluation rule for the Rt constraint that can consume it. This constraint contains the return value of the memoized function (see rule Ret). If all other constraints have
    ℑ(((S, l, v), e).C, P ∥ ⟨θ, t[E[recv(l)]]⟩)     = ℑ(C, P)
    ℑ(((R, l, v), e).C, P ∥ ⟨θ, t[E[send(l, v)]]⟩)  = ℑ(C, P)
    ℑ((Rt, v), P)                                   = (Rt, v)
    ℑ(((Ch, l), e).C, P)                            = ℑ(C, P)
    ℑ(((Sp, e′), e).C, P)                           = ℑ(C, P)
    ℑ(C, P)                                         = C, otherwise

Figure 5. The function ℑ yields the set of constraints C which are not satisfiable in program state P.

    let fun f() = (send(c,1); send(c,2))
        fun g() = (recv(c); recv(c))
    in spawn(g); f(); ...; spawn(g); f()
    end

Figure 9. The second application of f can only be partially memoized up to the second send since only the first receive made by g is blocked in the global state.
v2 or v3 on channels c1 and c2 would be included as well. For the sake of discussion, assume that the send of v2 by h was consumed by g and the send of v3 was paired with the receive in f when f() was originally executed. Consider the memoizability constraints built during the first call to f . The send constraint on f ’s application can be satisfied by matching it with the corresponding receive constraint associated with the application of g ; observe g() loops forever, consuming values on channels c1 and c2 . The receive constraint associated with f can be discharged if g receives the first send by h , and f receives the second. A schedule that orders the execution of f and g in this way, and additionally pairs i with a send operation on c2 in the let -body would allow the second call to f to fully leverage the memo information recorded in the first. Doing so would enable eliding the pure computation in f (abstracted by . . .) in its definition, performing only the effects defined by the communication actions (i.e., the send of v1 on c1 , and the receipt of v3 on c2 ).
    let val (c1, c2) = (mkCh(), mkCh())
        fun f () = (send(c1,v1); ...; recv(c2))
        fun g () = (recv(c1); ...; recv(c2); g())
        fun h () = (...; send(c2,v2); send(c2,v3); h())
        fun i () = (recv(c2); i())
    in spawn(g); spawn(h); spawn(i);
       f(); ...; send(c2, v4); ...; f()
    end
Figure 7. Determining if an application can fully leverage memo information may require examining an arbitrary number of possible thread interleavings. g() c1
v1
c1
f()
h()
i()
c2
4.5
v2 c2
c2
As this example illustrates, utilizing memo information completely may require forcing a schedule that pairs communication actions in a specific way, making a solution that requires all constraints to be satisfied infeasible in practice. Hence, rule M EMO A PP allows evaluation to continue within an application that has already been memoized once a constraint cannot be matched. As a result, if during the second call to f , i indeed received v3 from h , the constraint associated with the recv operation in f would not be satisfied, and the rules would obligate the call to block on the recv , waiting for h or the main body to send a value on c2 . Nonetheless, the semantics as currently defined does have limitations. For example, the function ℑ does not examine future actions of threads and thus can only match a constraint with a thread if that thread is able to match the constraint in the current state. Hence, the rules do not allow leveraging memoization information for function calls involved in a producer/consumer relation. Consider the simple example given in Fig. 9. The second application of f can take advantage of memoized information only for the first send on channel c. This is because the global state in which constraints are checked only has the first recv made by g blocked on the channel. The second recv only occurs if the first is successfully paired with a corresponding send. Although in this simple example the second recv is guaranteed to occur, consider if g contained a branch which determined if g consumed a second value from the channel c. In general, constraints can only be matched against the current communication action of a thread. Secondly, exploiting memoization may lead to starvation since subsequent applications of the memoized call will be matched based on the constraints supplied by the initial call. Consider the example shown in Fig. 10. If the initial application of f pairs with the send performed by g, subsequent calls to f that use this memoized version will also pair with g since h produces different values. This leads to starvation of h. Although this behavior is certainly legal, one might reasonably expect a scheduler to interleave the sends of g and h.
c2
v3 c2
v4
Figure 8. The communication pattern of the code in Fig 7. Circles represent operations on channels. Gray circles are sends and white circles are receives. Double arrows represent communications that are captured as constraints during memoization. been satisfied, it is this return value that replaces the call in the current context (see the consequent of rule M EMO A PP). 4.4
Issues
Example
The program fragment shown in Fig. 7 applies functions f, g, h and i. The calls to g, h, and i are evaluated within separate threads of control, while the applications of f takes place in the original thread. These different threads communicate with one other over shared channels c1 and c2. The communication pattern is depicted in Fig. 8. Separate threads of control are shown as rectangles. Communication actions are represented as circles; gray for send actions and white for receives. The channel on which the communication action takes place is annotated on the circle and the value which is sent on the arrow. Double arrows represent communication actions for which constraints are generated. To determine whether the second call to f can be elided we must examine the constraints that would be added to the thread state of the threads in which these functions are applied. First, spawn constraints would be added to the main thread for the threads executing g, h, and i. Second, a send constraint followed by a receive constraint, modeling the exchange of values v1 and either
C
MCom
MCom
(t0 ,C.C), Pkts ktr , σ =⇒ (t0 ,C), Pkts0 ktr0 , σ
(t0 ,C.C), Pkts ktr , σ =⇒ (t0 ,C), Pkts0 ktr0 , σ
Figure 6. Constraint matching is defined by four rules. Communication constraints are matched with threads performing the opposite communication action of the constraint. straint and another on a receive constraint both of which match on the channel and value. In this case, the constraints on both sender and receiver can be safely discharged. This allows calls which attempt to use previously memoized constraints to match against constraints extant in other calls that also attempt to exploit memoized state.
    let fun f() = recv(c)
        fun g() = (send(c,1); g())
        fun h() = (send(c,2); h())
    in spawn(g); spawn(h); f(); ...; f()
    end
Figure 10. Memoization of the function f can lead to the starvation of either of g or h, depending on which value the original application of f consumed from channel c.
5. Soundness
We can relate the states produced by memoized evaluation to the states constructed by the non-memoizing evaluator. To do so, we first define a transformation function T that transforms process states (and terms) defined under memo evaluation to process states (and terms) defined under non-memoized evaluation (see Fig. 12). Since memo evaluation stores evaluation contexts in θ, they must be extracted and restored; this is done in the opposite order from which they were pushed onto the stack θ, since the top represents the most recent call. Functions currently in the process of utilizing memo information must be resumed from the expression captured in the first non-discharged constraint. Similarly, threads which are currently paused must also be resumed. Our safety theorem ensures memoization does not yield states which could not be realized under non-memoized evaluation:

Theorem [Safety]. If
    P ∥ ⟨θ, t[E[λ^δ x.e (v)]]⟩, σ   ⟹^(MemS.β̄₁.MemE.β̄₂)   P′ ∥ ⟨θ′, t[E[v′]]⟩, σ′
then there exist α₁, …, αₙ ∈ {App, Ch, Spn, Com} such that
    T(P ∥ ⟨θ, t[E[λ^δ x.e (v)]]⟩)   ⟼^(α₁) … ⟼^(αₙ)   T(P′ ∥ ⟨θ′, t[E[v′]]⟩).
4.6 Schedule Aware Partial Memoization
To address the limitations in the previous section, we define two new symmetric rules to pause and resume memoization (see Fig. 11). Pausing memoization (rule PAUSE M EMO) is similar to the rule E ND M EMO in Fig. 4 except the failed constraints are not discarded and the thread context is not given an expression to evaluate. Instead the thread retains its log of currently unsatisfied constraints which prevents its further evaluation. This effectively pauses the evaluation of this thread but allows regular threads to continue normal evaluation. Notice we only pause a thread utilizing memo information once it has correctly discharged its constraints. We could envision an alternative definition which pauses non-deterministically on any constraint and moves the nondischarged constraints back to the thread context which holds unsatisfied constraints. For the sake of simplicity we opted for greedy semantics which favors the utilization of memoization. We can resume the paused thread, enabling it to discharge other constraints using the rule R ESUME M EMO, which begins constraint discharge anew for a paused thread. Thus, if a thread context has a set of constraints that were not previously satisfied and evaluation is not utilizing memo information, we can once again apply our ℑ function. Note that the use of memo information can be ended at any time (rule E ND M EMO can be applied instead of PAUSE M EMO). We can, therefore, change a thread in a paused state into a bona fide thread by first applying R ESUME M EMO. If ℑ does not indicate we can discharge any additional constraints, we simply apply the rule E ND M EMO. We also extend our evaluation rules to allow constraints to be matched against other constraints (rule MC OM). This is accomplished by matching constraints between two paused threads. Of course, it is possible that two threads, both of which were paused on a constraint that was not satisfiable may nonetheless satisfy one another. This happens when one thread is paused on a send con-
The proof² is by induction on the length of β̄₁.MemE.β̄₂. Each of the elements comprising β̄₁.MemE.β̄₂ corresponds to an action necessary to discharge previously recorded memoization constraints. We can show that every β step taken under memoization corresponds to some number of pure steps, and zero or one side-effecting steps, under non-memoized evaluation: zero steps for returns and memo actions (e.g. MemS, MemE, MemP, and MemR), and one step for core evaluation and effectful actions (e.g., MCh, MSpawn, MRecv, MSend, and MCom). Since a function which
http://www.cs.purdue.edu/homes/lziarek/memoproof.pdf for full details.
168
(PAUSE M EMO ) MemP
/ P, σ =⇒ P, σ (t, 0), (R ESUME M EMO )
(MC OM )
ℑ(C, P) = C0
C = ((S, l, v), ) C0 = ((R, l, v), ) ts = hθ, tC.C [λδ x.e (v)]i tr = hθ0 , t0 0 0 [λδ1 x.e0 (v0 )]i C .C θ,C θ00 θ0 ,C0 θ000 00 000 ts0 = hθ , tC [λδ x.e (v)]i tr0 = hθ , t0 0 [λδ1 x.e0 (v0 )]i
C = C00 .C0
C
MemR
MCom
Pkhθ, tC [E[λδ x.e (v)]]i, σ =⇒ (t,C00 ), Pkhθ, tC0 [E[λδ x.e (v)]]i, σ
Pkts ktr , σ =⇒ Pkts0 ktr0 , σ
Figure 11. Schedule Aware Partial Memoization.
T ((t,C), Pkhθ, tC0 [E[λδ x.e (v)]]i) T ((P1 kP2 )) T (hθ, t[e]i) T (hθ, t( ,e0 ).C [e]i) T (λδ x.e) T (e1 (e2 )) T (spawn(e)) T (send(e1 , e2 )) T (recv(e))
= = = = = = = = =
e
T (Pkhθ, tC.C0 [E[λδ x.e (v)]]i) T (P1 )kT (P2 ) t[T (En [. . . E1 [e]])] θi = (δi , vi , Ei ,Ci ) ∈ θ t[T (En [. . . E1 [e0 ]])] θi = (δi , vi , Ei ,Ci ) ∈ θ λ x.e
T (e1 )(T (e2 )) spawn(T (e)) send(T (e1 ), T (e2 )) recv(T (e)) otherwise
Figure 12. T defines an erasure property on program states. The first four rules remove memo information and restore evaluation contexts. 6.2
is utilizing memoized information does not execute pure code (rule App under ⟼), it may correspond to a number of App transitions under ⟼.
6.
Our parallel implementation of CML is based on Reppy’s parallel model of CML (Reppy and Xiao 2008). We utilize low level locks implemented with compare and swap to provide guarded access to channels. Whenever a thread wishes to perform an action on a channel, it must first acquire the lock associated with the channel. Since a given thread may only utilize one channel at a time, there is no possibility of deadlock.
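A per-channel guard of this kind can be sketched as a spin lock. The sketch below is ours: the paper does not show the lock code, and the compare-and-swap primitive's name and signature in Multi-MLton are not given, so `cas` is passed in as a parameter rather than invented as a library call.

    (* Spin-lock sketch for the channel guard described above. *)
    fun lock (cas : int ref * int * int -> bool) (flag : int ref) =
      let fun spin () = if cas (flag, 0, 1) then () else spin ()
      in spin () end

    fun unlock (flag : int ref) = flag := 0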
Implementation
Our implementation is incorporated within Multi-MLton, an extension of MLton (MLton), a whole-program optimizing compiler for Standard ML, that provides support for parallel thread execution. The main extensions to Multi-MLton to support partial memoization involve insertion of read and write barriers to track accesses and updates of references, barriers to monitor function arguments and return values, and hooks to the Concurrent ML library to monitor channel based communication. 6.1
Parallel CML and hooks
The underlying CML library was also modified to make memoization efficient. The bulk of the changes were hooks to monitor channel communication and spawns, additional channel queues to support constraint matching on synchronous operations, and to log successful communication (including selective communication and complex composed events). The constraint matching engine required a modification to the channel structure. Each channel is augmented with two additional queues to hold send and receive constraints. When a constraint is being tested for satisfiability, the opposite queue is first checked (e.g. a send constraint would check the receive constraint queue). If no match is found, the regular queues are checked for satisfiability. If the constraint cannot be satisfied immediately it is added to the appropriate queue.
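Our reconstruction of that augmented channel structure is sketched below; the type and field names are ours, and the constraint queues are modeled simply as mutable lists rather than the library's actual queue representation.

    (* A memo-aware channel: besides the ordinary CML channel it carries the
       pending send and receive constraints that the matching engine consults
       before falling back to the regular queues. *)
    type 'a memoChan =
      { chan   : 'a CML.chan,        (* the underlying CML channel                    *)
        sendCs : 'a list ref,        (* send constraints waiting to be discharged     *)
        recvCs : 'a list ref }       (* receive constraints waiting to be discharged  *)

    fun mkMemoChan () : 'a memoChan =
      { chan = CML.channel (), sendCs = ref [], recvCs = ref [] }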
Multi-MLton
To support parallel execution, we modified the MLton compiler to support parallel threads. A POSIX pthread executes on each processor. Each pthread manages a lightweight Multi-MLton thread queue. Each pthread switches between lightweight MLton threads on its queue when it is preempted. Pthreads are spawned and managed by Multi-MLton’s runtime. Currently, our implementation does not support migration of Multi-MLton threads to different thread queues.
6.3
The underlying garbage collector also supports parallel allocation. Associated with every processor is a private memory region used by threads it executes; allocation within this region requires no global synchronization. These regions are dynamic and growable. All pthreads must synchronize when garbage collection is triggered. Data shared by threads found on different processors are copied to a shared memory region that requires synchronization to access.
Supporting Memoization
Any SML function can be annotated as a candidate for memoization. For such annotated functions, its arguments and return values at different call-sites, the communication it performs, and information about the threads it spawns are recorded in a memo table. Memoization information is logged through hooks to the CML runtime and stored by the underlying client code. In addition, to support partial memoization, the continuations of logged communication events are also saved.
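The argument/result part of such a memo table can be pictured with the wrapper below. It is illustrative only and is not the paper's API: the real mechanism also logs communication, spawn, and channel-creation constraints via runtime hooks, whereas this sketch covers just the cached argument-to-result mapping for an integer function.

    fun memoCandidate (f : int -> int) : int -> int =
      let
        val table : (int * int) list ref = ref []
      in
        fn v =>
          case List.find (fn (v', _) => v' = v) (!table) of
              SOME (_, r) => r                           (* previously seen argument *)
            | NONE        => let val r = f v
                             in table := (v, r) :: !table; r end
      end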
Our memoization implementation extended CML channels to be aware of memoization constraints. Each channel structure contained a queue of constraints waiting to be solved on the channel. Because it will not be readily apparent if a memoized version of a CML function can be utilized at a call site, we delay a function application to see if its constraints can be matched; these constraints must be satisfied in the order in which they were generated. Constraint matching can certainly fail on a receive constraint. A receive constraint obligates a receive operation to read a specific value from a channel. Since channel communication is blocking, a receive constraint that is being matched can choose from all values whose senders are currently blocked on the channel. This does not violate the semantics of CML since the values blocked on a channel cannot be dependent on one another; in other words, a schedule must exist where the matched communication occurs prior to the first value blocked on the channel. Unlike a receive constraint, a send constraint can only fail if there are (a) no matching receive constraints on the sending channel that expect the value being sent, or (b) no receive operations on that same channel. A CML receive operation (not receive constraint) is ambivalent to the value it removes from a channel; thus, any receive on a matching channel will satisfy a send constraint. If no receives or sends are enqueued on a constraint’s target channel, a memoized execution of the function will block. Therefore, failure to fully discharge constraints by stalling memoization on a presumed unsatisfiable constraint does not compromise global progress. This observation is critical to keeping memoization overheads low. Our memoization technique relies on efficient equality tests, and approximate equality on reals and functions; the latter is modeled as closure equality. Memoization data is discarded during garbage collection. This prevents unnecessary build up of memoization meta-data during execution. As a heuristic, we also enforce an upper bound for the amount of memo data stored for each function, and the space that each memo entry can take. A function that generates a set of constraints whose size exceeds the memo entry space bound is not memoized. For each memoized function, we store a list of memo meta-data. When the length of the list reaches the upper limit but new memo data is acquired upon an application of the function to previously unseen arguments, one entry from the list is removed at random. 6.4
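The eviction heuristic can be sketched as follows. The bound and the way the random victim index is produced are our assumptions (the paper gives neither the constant nor the random source), so the index is supplied by the caller.

    val maxEntries = 16   (* assumed bound on memo records kept per function *)

    fun insertMemo (victimIndex : int) (entries : 'a list) (new : 'a) : 'a list =
      if List.length entries < maxEntries
      then new :: entries
      else
        let fun dropNth (_, []) = []
              | dropNth (i, x :: xs) =
                  if i = victimIndex then xs else x :: dropNth (i + 1, xs)
        in new :: dropNth (0, entries) end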
6.4 Benchmarks
We examined three benchmarks to measure the effectiveness of partial memoization in a parallel setting. The first benchmark is a streaming algorithm for approximating a k-clustering of points on a geometric plane. The second is a port of the STMBench7 benchmark (Guerraoui et al. 2007); STMBench7 utilizes channel-based communication instead of shared memory and bears resemblance to the red-black tree program presented in Section 3. The third is Swerve (Ziarek et al. 2006), a highly-tuned webserver written in CML.
Similar to most streaming algorithms (Mccutchen and Khuller 2008), the k-clustering application defines a number of worker threads connected in a pipeline fashion. Each worker maintains a cluster of points that sit on a geometric plane. A stream generator creates a randomized data stream of points. A point is passed to the first worker in the pipeline. The worker computes a convex hull of its cluster to determine whether a smaller cluster could be constructed from the newly received point. If the new point results in a smaller cluster, the outlier point from the original cluster is passed to the next worker thread. On the other hand, if the received point does not alter the configuration of the cluster, it is passed on to the next worker thread. The result defines an approximation of n clusters (where n is the number of workers) of size k (points that compose the cluster).
STMBench7 (Guerraoui et al. 2007) is a comprehensive, tunable multi-threaded benchmark designed to compare different STM implementations and designs. Based on the well-known 007 database benchmark (Carey et al. 1993), STMBench7 simulates the data storage and access patterns of CAD/CAM applications that operate over complex geometric structures. At its core, STMBench7 builds a tree of assemblies whose leaves contain bags of components; these components have a highly connected graph of atomic parts and design documents. Indices allow components, parts, and documents to be accessed via their properties and IDs. Traversals of this graph can begin from the assembly root or any index and sometimes manipulate multiple pieces of data. STMBench7 was originally written in Java. Our port, besides translating the assembly tree to use a CML-based server abstraction (as discussed in Section 3), also involved building an STM implementation to support atomic sections, loosely based on the techniques described in (Saha et al. 2006; Adl-Tabatabai et al. 2006). All nodes in the complex assembly structure and atomic parts graph are represented as servers with one receiving channel and handles to all other adjacent nodes; the handles are simply the channels themselves. Each server thread waits for a message to be received, performs the requested computation, and then asynchronously sends the subsequent part of the traversal to the next node. A transaction can thus be implemented as a series of communications with various server nodes (a sketch appears at the end of this subsection).
Swerve is a web server written entirely in CML. It consists of a collection of modules that communicate using CML's message-passing operations. There are three critical modules of interest: (a) a listener that processes incoming requests; (b) a file processor that handles access to the underlying file system; and (c) a timeout manager that regulates the amount of time allocated to serve a request. There is a dedicated listener for every distinct client, and each request received by a listener is delegated to a server thread responsible for managing that request. Requested files are broken into chunks and packaged as messages sent back to the client.
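To make the server abstraction used in the STMBench7 port concrete, here is a minimal CML sketch of a graph-node server. The message type, doLocalWork, and the synchronous send are simplifications of our own; the real implementation sends asynchronously and speaks a richer traversal protocol.

(* Minimal sketch of a graph-node server as used in the STMBench7 port.
   The msg type and doLocalWork are placeholders, not the actual code. *)
type msg = int

(* Placeholder: perform the node-local part of a traversal and choose the
   neighbour to which the remainder is forwarded. *)
fun doLocalWork (m : msg, neighbours : msg CML.chan list) : msg * msg CML.chan =
  (m, List.hd neighbours)

fun serverNode (input : msg CML.chan, neighbours : msg CML.chan list) =
  let
    fun loop () =
      let
        val m = CML.recv input                 (* wait for a request *)
        val (m', next) = doLocalWork (m, neighbours)
      in
        CML.send (next, m');                   (* forward the rest of the traversal;
                                                  the real server does so asynchronously *)
        loop ()
      end
  in
    (* a transaction is then a series of communications with such servers *)
    CML.spawn loop
  end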
7. Results
Our benchmarks were executed on a 16-way AMD Opteron 865 server with 8 processors, each containing two symmetric cores, and 32 GB of total memory, with each CPU having its own local 4 GB memory. Access to non-local memory is mediated by a hyper-transport layer that coordinates memory requests between processors. To measure the effectiveness of our memoization technique, we executed two configurations (one memoized, the other non-memoized) of our k-clustering algorithm, STMBench7, and Swerve, and measured overheads and performance by averaging results over ten executions. Non-memoized executions utilized a clean version of Multi-MLton without memoization hooks and barriers. The k-clustering algorithm utilizes memoization to avoid redundant computations based on previously witnessed points as well as redundant computations of clusters. For STMBench7, the non-memoized configuration uses our STM implementation without any memoization, whereas the memoized configuration implements partial memoization of aborted transactions. Swerve uses memoization to avoid reading and processing previously requested files from disk.
For k-clustering, we computed 16 clusters of size 60 out of a stream of 10K randomly generated points. This resulted in the creation of 16 worker threads, one stream-generating thread, and a sink thread which aggregates the computation results. STMBench7 was executed on a graph in which there were approximately 280k complex assemblies and 140k assemblies whose bags referenced one of 100 components; by default, each component contained a parts graph of 100 nodes. STMBench7 creates a number of threads proportional to the number of nodes in the underlying data structure; this is roughly 400K threads for our configuration. Our experiments on Swerve were executed using the default server configuration and were exercised using HTTPerf, a well-known tool for measuring webserver performance.
In Swerve, we observe an increase in performance correlated with the size of the file being requested by HTTPerf (see Fig. 13(c)); performance gains are capped at roughly 80% for file sizes greater than 8 MB. For each requested file, we build constraints corresponding to the file chunks read from disk. As long as no errors are encountered, the Swerve file processor sends the file chunks to be processed into HTTP packets by another module. After each chunk has been read from disk, the file processor polls other modules for timeouts and other error conditions. If an error is encountered, subsequent file processing is stopped and the request is terminated. Partial memoization is particularly well suited to Swerve's file-processing semantics because control is reverted to the error-handling mechanism precisely at the point an error is detected; this corresponds to a failed constraint.
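A simplified sketch of the file-processor loop just described (forward a chunk, then poll for a timeout or error) might look as follows in CML; the channel and event parameters are placeholders of our own, not Swerve's actual module interfaces.

(* Sketch of the Swerve file-processor loop: after each chunk is sent on, the
   processor polls for an abort (timeout or error); aborting here is exactly
   where a failed constraint stops a memoized run.  Interfaces are ours. *)
fun fileProcessor (chunks : string list,
                   out    : string CML.chan,
                   abort  : unit CML.event) =
  let
    fun aborted () =                           (* non-blocking poll of the abort event *)
      CML.select [ CML.wrap (abort, fn () => true),
                   CML.wrap (CML.alwaysEvt (), fn () => false) ]
    fun loop [] = ()
      | loop (c :: rest) =
          ( CML.send (out, c)                  (* hand the chunk to the packaging module *)
          ; if aborted () then () else loop rest )
  in
    loop chunks
  end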
The benchmarks represent three very different programming models – pipeline stream-based parallelism (k-clustering), dynamically established communication links (Swerve), and software transactions (STMBench7) – and leverage different execution models: k-clustering makes use of long-lived worker threads, while STMBench7 utilizes many lightweight server threads; Swerve utilizes lightweight server threads in conjunction with long-lived worker threads. Each run of the benchmarks has an execution time that ranges between 1 and 3 minutes.
For all benchmarks, memory overheads are proportional to cache sizes and averaged roughly 15% for caches of size eight. The cache size defines the number of different memoized calls maintained for a function; thus, a cache size of eight means that every memoized function records effects for eight different arguments.
For k-clustering we varied the number of repeated points generated by the stream. Configurations in which there is a high degree of repeated points offer the best performance gain (see Fig. 13(b)). For example, an input in which 50% of the input points are repeated yields roughly 50% performance gain. However, we also observe roughly 17% performance improvement even when all points are randomized. This is because the cluster’s convex hull shrinks as the points which comprise the cluster become geometrically compact. Thus, as the convex hull of the cluster shrinks, the likelihood of a random point being contained within the convex hull of the cluster is reduced. Memoization can take advantage of this phenomenon by avoiding recomputation of the convex hull as soon as it can be determined that the input point resides outside the current cluster. Although we do not envision workloads that have high degrees of repeatability, memoization nonetheless leads to a 30% performance gain on a workload in which only 10% of the inputs are repeated.
In STMBench7, the utility of memoization is closely correlated with the number and frequency of aborted transactions. Our tests varied the read-only/read-write ratio within transactions (see Fig. 13(a)). Only transactions that modify values can cause aborts. Thus, an execution in which all transactions are read-only cannot be accelerated, but one in which transactions frequently abort (because of conflicts due to concurrent writes) offers potential opportunities for memoization. Consequently, the cost of supporting memoization is most visible when 100% of transactions are read-only; in this case, the overhead is roughly 11%. These overheads arise from the cost of capturing memo information (storing arguments, continuations, etc.) and the cost associated with trying to utilize the memo information (discharging constraints).
Notice that as the number of transactions which perform modifications to the underlying data structure increases, so do memoization gains. For example, when the percentage of read-only transactions is 60%, we see an 18% improvement in runtime performance compared to a non-memoizing implementation of STMBench7.
We expected to see a roughly linear trend correlated with the increase in transactions which perform an update. However, we noticed that performance degrades about 10% from a read/write ratio of 20 to a read/write ratio of zero. This phenomenon occurs because memoized transactions are more likely to complete on their first try when there are fewer modifications to the structure. Since a non-memoized transaction takes longer to complete, it has a greater chance of aborting when there are frequent updates. This difference becomes muted as the number of changes to the data structure increases.
8. Related Work
Memoization, or function caching (Liu and Teitelbaum 1995; Pugh 1988; Heydon et al. 2000; Swadi et al. 2006), is a well-understood method for reducing the overheads of redundant function execution. Memoization of functions in a concurrent setting is significantly more difficult and usually highly constrained (Pickett and Verbrugge 2005). We are unaware of any existing techniques or implementations that apply memoization to the problem of reducing transaction overheads in languages that support selective communication and dynamic thread creation. Our approach also bears resemblance to procedure-summary construction for concurrent programs (Qadeer et al. 2004). However, such approaches tend to be based on a static analysis (e.g., the cited reference leverages procedure summaries to improve the efficiency of model checking) and thus are obligated to make use of memoization greedily. Because our motivation is quite different, our approach can consider lazy alternatives, ones that leverage synchronization points to stall the use of memo information, resulting in potentially improved runtime benefit.
Recently, software transactions (Harris and Fraser 2003; Saha et al. 2006) have emerged as a new method to safely control concurrent execution. There has also been work on applying these techniques in a functional programming setting (Harris et al. 2005; Ringenburg and Grossman 2005). These proposals usually rely on an optimistic concurrency control model that checks for serializability violations prior to committing a transaction, aborting when a violation is detected. Our benchmark results suggest that partial memoization can help reduce the overheads of aborting optimistic transactions.
Self-adjusting mechanisms (Acar et al. 2008; Ley-Wild et al. 2008) leverage memoization along with change propagation to automatically adapt a program's execution to a change of inputs given an existing execution run. Memoization is used to identify parts of the program which have not changed from the previous execution, while change propagation is harnessed to install changed values where memoization cannot be applied. There has also been recent work on using change propagation in a parallel setting (Hammer et al. 2007). That programming model assumes fork/join parallelism, and is therefore not suitable for the kinds of contexts we consider. We believe our memoization technique is synergistic with current self-adjusting techniques and can be leveraged along with self-adjusting computation to create self-adjusting programs which utilize message passing.
Figure 13. Normalized runtime % speedup for k-clustering, STMBench7, and Swerve benchmarks of memoized evaluation compared to non-memoized execution [(a) STMBench7, (b) k-clustering, (c) Swerve].
Our technique also shares some similarity with transactional events (Donnelly and Fluet 2006; Effinger-Dean et al. 2008). Transactional events explore a state space of possible communications, finding matching communications to ensure atomicity of a collection of actions. To do so, transactional events require arbitrary lookahead in evaluation to determine if a complex composed event can commit. Partial memoization also explores potential matching communication actions to satisfy memoization constraints. However, partial memoization avoids the need for arbitrary lookahead – failure to discharge memoization constraints simply causes execution to proceed as normal.
Acknowledgements
Thanks to Matthew Fluet and Dan Spoonhower for their help in the design and development of Multi-MLton. This work is supported by the National Science Foundation under grants CCF-0701832 and CCF-0811631, and an Intel graduate fellowship.
References
Umut A. Acar, Guy E. Blelloch, and Robert Harper. Selective Memoization. In POPL, pages 14–25, 2003.
Umut A. Acar, Amal Ahmed, and Matthias Blume. Imperative Self-Adjusting Computation. In POPL, pages 309–322, 2008.
Ali-Reza Adl-Tabatabai, Brian T. Lewis, Vijay Menon, Brian R. Murphy, Bratin Saha, and Tatiana Shpeisman. Compiler and Runtime Support for Efficient Software Transactional Memory. In PLDI, pages 26–37, 2006.
Michael J. Carey, David J. DeWitt, and Jeffrey F. Naughton. The 007 Benchmark. SIGMOD Record, 22(2):12–21, 1993.
Kevin Donnelly and Matthew Fluet. Transactional Events. In ICFP, pages 124–135, 2006.
Laura Effinger-Dean, Matthew Kehrt, and Dan Grossman. Transactional Events for ML. In ICFP, pages 103–114, 2008.
Michael I. Gordon, William Thies, and Saman Amarasinghe. Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in Stream Programs. In ASPLOS-XII, pages 151–162, 2006.
Rachid Guerraoui, Michal Kapalka, and Jan Vitek. STMBench7: A Benchmark for Software Transactional Memory. In EuroSys, pages 315–324, 2007.
Matthew Hammer, Umut A. Acar, Mohan Rajagopalan, and Anwar Ghuloum. A Proposal for Parallel Self-Adjusting Computation. In Workshop on Declarative Aspects of Multicore Programming, 2007.
Tim Harris and Keir Fraser. Language Support for Lightweight Transactions. In OOPSLA, pages 388–402, 2003.
Tim Harris, Simon Marlow, Simon Peyton-Jones, and Maurice Herlihy. Composable Memory Transactions. In Proceedings of the ACM Conference on Principles and Practice of Parallel Programming, pages 48–60, 2005.
Allan Heydon, Roy Levin, and Yuan Yu. Caching Function Calls Using Precise Dependencies. In PLDI, pages 311–320, 2000.
Ruy Ley-Wild, Matthew Fluet, and Umut A. Acar. Compiling Self-Adjusting Programs with Continuations. In ICFP, pages 321–334, 2008.
Yanhong A. Liu and Tim Teitelbaum. Caching Intermediate Results for Program Improvement. In PEPM, pages 190–201, 1995.
Richard Matthew Mccutchen and Samir Khuller. Streaming Algorithms for k-Center Clustering with Outliers and with Anonymity. In APPROX '08 / RANDOM '08, pages 165–178, 2008.
MLton. http://www.mlton.org.
Christopher J. F. Pickett and Clark Verbrugge. Software Thread Level Speculation for the Java Language and Virtual Machine Environment. In Proceedings of the International Workshop on Languages and Compilers for Parallel Computing, 2005.
William Pugh. An Improved Replacement Strategy for Function Caching. In LFP, pages 269–276, 1988.
W. Pugh and T. Teitelbaum. Incremental Computation via Function Caching. In POPL, pages 315–328, 1989.
Shaz Qadeer, Sriram K. Rajamani, and Jakob Rehof. Summarizing Procedures in Concurrent Programs. In POPL, pages 245–255, 2004.
John Reppy and Yingqi Xiao. Towards a Parallel Implementation of Concurrent ML. In DAMP, January 2008.
John H. Reppy. Concurrent Programming in ML. Cambridge University Press, 1999.
Michael F. Ringenburg and Dan Grossman. AtomCaml: First-Class Atomicity via Rollback. In Proceedings of the ACM SIGPLAN International Conference on Functional Programming, pages 92–104, 2005.
Bratin Saha, Ali-Reza Adl-Tabatabai, Richard L. Hudson, Chi Cao Minh, and Benjamin Hertzberg. McRT-STM: A High-Performance Software Transactional Memory System for a Multi-Core Runtime. In PPoPP, pages 187–197, 2006.
Kedar Swadi, Walid Taha, Oleg Kiselyov, and Emir Pasalic. A Monadic Approach for Avoiding Code Duplication When Staging Memoized Functions. In PEPM, pages 160–169, 2006.
Lukasz Ziarek, Philip Schatz, and Suresh Jagannathan. Stabilizers: A Modular Checkpointing Abstraction for Concurrent Functional Programs. In ACM International Conference on Functional Programming, pages 136–147, 2006.
Free Theorems Involving Type Constructor Classes
Functional Pearl
Janis Voigtländer
Institut für Theoretische Informatik, Technische Universität Dresden, 01062 Dresden, Germany
[email protected]
Abstract
Free theorems are a charm, allowing the derivation of useful statements about programs from their (polymorphic) types alone. We show how to reap such theorems not only from polymorphism over ordinary types, but also from polymorphism over type constructors restricted by class constraints. Our prime application area is that of monads, which form probably the most popular type constructor class of Haskell. To demonstrate the broader scope, we also deal with a transparent way of introducing difference lists into a program, endowed with a neat and general correctness proof.
Categories and Subject Descriptors F.3.1 [Logics and Meanings of Programs]: Specifying and Verifying and Reasoning about Programs—Invariants; D.1.1 [Programming Techniques]: Applicative (Functional) Programming; D.3.3 [Programming Languages]: Language Constructs and Features—Polymorphism
General Terms Languages, Verification
Keywords relational parametricity
1. Introduction
One of the strengths of functional languages like Haskell is an expressive type system. And yet, some of the benefits this strength should hold for reasoning about programs seem not to be realised to their full extent. For example, Haskell uses monads (Moggi 1991) to structure programs by separating concerns (Wadler 1992; Liang et al. 1995) and to safely mingle pure and impure computations (Peyton Jones and Wadler 1993; Launchbury and Peyton Jones 1995). A lot of code can be kept independent of a concrete choice of monad. This observation pertains to functions from the Prelude (Haskell's standard library) like
sequence :: Monad µ ⇒ [µ α] → µ [α] ,
but also to many user-defined functions. Such abstraction is certainly a boon for modularity of programs. But also for reasoning?
Let us consider a more specific example, say functions of the type Monad µ ⇒ [µ Int] → µ Int. Here are some:1
f1 = head
f2 ms = sequence ms >>= return ◦ sum
f3 = f2 ◦ reverse
f4 [ ] = return 0
f4 (m : ms) = do i ← m
                 let l = length ms
                 if i > l then return (i + l)
                          else f4 (drop i ms)
As we see, there is quite a variety of such functions. There can be simple selection of one of the monadic computations from the input list (as in f1 ), there can be sequencing of these monadic computations (in any order) and some action on the encapsulated values (as in f2 and f3 ), and the behaviour, in particular the choice which of the computations from the input list are actually performed, can even depend on the encapsulated values themselves (as in f4 , made a bit artificial here). Further possibilities are that some of the monadic computations from the input list are performed repeatedly, and so forth. But still, all these functions also have something in common. They can only combine whatever monadic computations, and associated effects, they encounter in their input lists, but they cannot introduce new effects of any concrete monad, not even of the one they are actually operating on in a particular application instance. This limitation is determined by the function type. For if an f were, on and of its own, to cause any additional effect to happen, be it by writing to the output, by introducing additional branching in the nondeterminism monad, or whatever, then it would immediately fail to get the above type parametric over µ. In a language like Haskell, should we not be able to profit from this kind of abstraction for reasoning purposes? If so, what kind of insights can we hope for? One thing to expect is that in the special case when the concrete computations in an input list passed to an f :: Monad µ ⇒ [µ Int] → µ Int correspond to pure values (e.g., are values of type IO Int that do not perform any actual input or output), then the same should hold of f 's result for that input list. This statement is quite intuitive from the above observation about f being unable to cause new effects on
1 The functions head, sum, reverse, length, and drop are all from the Prelude. Their general types and explanation can be found via Hoogle (http://haskell.org/hoogle). The notation ◦ is for function composition, while >>= and do are two different syntaxes for performing computations in a monad one after the other. Finally, return embeds pure values into a monad.
can only ever contain elements from the input list. For the function, not knowing the element type of the lists it operates over, cannot possibly make up new elements of any concrete type to put into the output, such as 42 or True, or even id , because then f would immediately fail to have the general type [α] → [α].2
its own. But what about more interesting statements, for example the preservation of certain invariants? Say we pass to f a list of stateful computations and we happen to know that they do depend on, but do not alter (a certain part of) the state. Is this property preserved throughout the evaluation of a given f ? Or say the effect encapsulated in f ’s input list is nondeterminism but we would like to simplify the program by restricting the computation to a deterministically chosen representative from each nondeterministic manifold. Under what conditions, and for which kind of representativeselection functions, is this simplification safe and does not lead to problems like a collapse of an erstwhile nonempty manifold to an empty one from which no representative can be chosen at all? One could go and study these questions for particular functions like the f1 to f4 given further above. But instead we would like to answer them for any function of type Monad µ ⇒ [µ Int] → µ Int in general, without consulting particular function definitions. And we would not like to restrict to the two or three scenarios depicted in the previous paragraph. Rather, we want to explore more abstract settings of which statements like the ones in question above can be seen, and dealt with, as particular instances. And, of course, we prefer a generic methodology that applies equally well to other types than the specific one of f considered so far in this introduction. These aims are not arbitrary or far-fetched. Precedent has been set with the theorems obtained for free by Wadler (1989) from relational parametricity (Reynolds 1983). Derivation of such free theorems, too, is a methodology that applies not only to a single type, works independently of particular function definitions, and applies to a diverse range of scenarios: from simple algebraic laws to powerful program transformations (Gill et al. 1993), to meta-theorems about whole classes of algorithms (Voigtl¨ander 2008b), to specific applications in software engineering and databases (Voigtl¨ander 2009). Unsurprisingly then, we do build on Reynolds’ and Wadler’s work. Of course, the framework that is usually considered when free theorems are derived needs to be extended to deal with types like Monad µ ⇒ . . . . But the ideas needed to do so are there for the taking. Indeed, both relational parametricity extended for polymorphism over type constructors rather than over ordinary types only, as well as relational parametricity extended to take class constraints into account, are in the folklore. However, these two strands of possible extension have not been combined before, and not been used as we do. Since we are mostly interested in demonstrating the prospects gained from that combination, we refrain here from developing the folklore into a full-fledged formal apparatus that would stand to blur the intuitive ideas. This is not an overly theoretical paper. Also on purpose, we do not consider Haskell intricacies, like those studied by Johann and Voigtl¨ander (2004) and Stenger and Voigtl¨ander (2009), that do affect relational parametricity but in a way orthogonal to what is of interest here. Instead, we stay with Reynolds’ and Wadler’s simple model (but consider the extension to general recursion in Appendix C). For the sake of accessibility, we also stay close to Wadler’s notation.
2.
So for any input list l (over any element type) the output list f l consists solely of elements from l. But how can f decide which elements from l to propagate to the output list, and in which order and multiplicity? The answer is that such decisions can only be made based on the input list l. For in a pure functional language f has no access to any global state or other context based on which to decide. It cannot, for example, consult the user in any way about what to do. And the means by which to make decisions based on l are limited as well. In particular, decisions cannot possibly depend on any specifics of the elements of l. For the function is ignorant of the element type, and so is prevented from analysing list elements in any way (be it by patternmatching, comparison operations, or whatever). In fact, the only means for f to drive its decision-making is to inspect the length of l, because that is the only element-independent “information content” of a list. So for any pair of lists l and l0 of same length (but possibly over different element types) the lists f l and f l0 are formed by making the same position-wise selections of elements from l and l0 , respectively. Now consider the following standard Haskell function: map :: (α → β) → [α] → [β] map g [ ] = [] map g (a : as) = (g a) : (map g as) Clearly, map g for any g preserves the lengths of lists. So if l0 = map g l, then f l and f l0 are of the same length and contain, at each position, position-wise exactly corresponding elements from l and l0 , respectively. Since, moreover, any two position-wise corresponding elements, one from l and one from l0 = map g l, are related by the latter being the g-image of the former, we have that at each position f l0 contains the g-image of the element at the same position in f l. So for any list l and (type-appropriate) function g, we have f (map g l) = map g (f l). Note that during the reasoning leading up to that statement we did not (need to) consider the actual definition of f at all. The methodology of deriving free theorems a` la Wadler (1989) is a way to obtain statements of this flavour for arbitrary function types, and in a more disciplined (and provably sound) manner than the mere handwaving performed above. The key to doing so is to interpret types as relations. For example, given the type signature f :: [α] → [α], we take the type and replace every quantification over type variables, including implicit quantification (note that the type [α] → [α], by Haskell convention, really means ∀α. [α] → [α]), by quantification over relation variables: ∀R. [R] → [R]. Then, there is a systematic way of reading such expressions over relations as relations themselves. In particular,
Free Theorems, in Full Beauty
So what is the deal with free theorems? Why should it be possible to derive statements about a function’s behaviour from its type alone? Maybe it is best to start with a concrete example. Consider the type signature
• base types like Int are read as identity relations, • for relations R and S, we have
f :: [α] → [α] .
R → S = {(f, g) | ∀(a, b) ∈ R. (f a, g b) ∈ S} ,
What does it tell us about the function f ? For sure that it takes lists as input and produces lists as output. But we also see that f is polymorphic, due to the type variable α, and so must work for lists over arbitrary element types. How, then, can elements for the output list come into existence? The answer is that the output list
and 2 The
situation is more complicated in the presence of general recursion. For further discussion, see Appendix C.
• for types τ and τ 0 with at most one free variable, say α, and
Sheard (1996), Kuˇcan (1997), Takeuti (2001), and Vytiniotis and Weirich (2009). Regarding class constraints, Wadler (1989, Section 3.4) directs the way by explaining how to treat the type class Eq in the context of deriving free theorems. The idea is to simply restrict the relations chosen as interpretation for type variables that are subject to a class constraint. Clearly, only relations between types that are instances of the class under consideration are allowed. Further restrictions are obtained from the respective class declaration. Namely, the restrictions must precisely ensure that every class method (seen as a new constant in the language) is related to itself by the relational interpretation of its type. This relatedness then guarantees that the overall result (i.e., that the relational interpretation of every closed type is an identity relation) stays intact (Mitchell and Meyer 1985). The same approach immediately applies to type constructor classes as well. Consider, for example, the Monad class declaration:
a function F on relations such that every relation R between closed types τ1 and τ2 , denoted R : τ1 ⇔ τ2 , is mapped to a relation F R : τ [τ1 /α] ⇔ τ 0 [τ2 /α], we have ∀R. F R = {(u, v) | ∀τ1 , τ2 , R : τ1 ⇔ τ2 . (uτ1 , vτ2 ) ∈ F R} (Here, uτ1 :: τ [τ1 /α] is the instantiation of u :: ∀α. τ to the type τ1 , and similarly for vτ2 . In what follows, we will always leave type instantiation implicit.) Also, every fixed type constructor is read as an appropriate construction on relations. For example, the list type constructor maps every relation R : τ1 ⇔ τ2 to the relation [R] : [τ1 ] ⇔ [τ2 ] defined by (the least fixpoint of) [R] = {([ ], [ ])}∪{(a : as, b : bs) | (a, b) ∈ R, (as, bs) ∈ [R]} , the Maybe type constructor maps every relation R : τ1 ⇔ τ2 to the relation Maybe R : Maybe τ1 ⇔ Maybe τ2 defined by
class Monad µ where
  return :: α → µ α
  (>>=) :: µ α → (α → µ β) → µ β
Maybe R = {(Nothing, Nothing)} ∪ {(Just a, Just b) | (a, b) ∈ R} ,
Since the type of return is ∀µ. Monad µ ⇒ (∀α. α → µ α), we expect that (return, return) ∈ ∀F. Monad F ⇒ (∀R. R → F R), and similarly for >>=. The constraint “Monad F” on a relational action is now defined in precisely such a way that both conditions will be fulfilled.
and similarly for other user-definable types. The key insight of relational parametricity à la Reynolds (1983) now is that any expression over relations that can be built as above, by interpreting a closed type, denotes the identity relation on that type. For the above example, this insight means that any f :: ∀α. [α] → [α] satisfies (f, f ) ∈ ∀R. [R] → [R], which by unfolding some of the above definitions is equivalent to having for every τ1 , τ2 , R : τ1 ⇔ τ2 , l :: [τ1 ], and l′ :: [τ2 ] that (l, l′ ) ∈ [R] implies (f l, f l′ ) ∈ [R], or, specialised to the function level (R ↦ g, and thus [R] ↦ map g), for every g :: τ1 → τ2 and l :: [τ1 ] that f (map g l) = map g (f l). This proof finally provides the formal counterpart to the intuitive reasoning earlier in this section. And the development is algorithmic enough that it can be performed automatically. Indeed, an online free theorems generator (Böhme 2007) is accessible at our homepage (http://linux.tcs.inf.tu-dresden.de/~voigt/ft/).
3.
Definition 1. Let κ1 and κ2 be type constructors that are instances of Monad and let F : κ1 ⇔ κ2 be a relational action. If • (return κ1 , return κ2 ) ∈ ∀R. R → F R and • ((>>=κ1 ), (>>=κ2 )) ∈ ∀R. ∀S. F R → ((R → F S) →
F S), then F is called a Monad-action.3 (While we have decided to generally leave type instantiation implicit, we explicitly retain instantiation of type constructors in what follows, except for some examples.) For example, given the following standard Monad instance definitions: instance Monad Maybe where return a = Just a Nothing >>= k = Nothing Just a >>= k = k a
The Extension to Type Constructor Classes
We now want to deal with two new aspects: with quantification over type constructor variables (rather than just over type variables) and with class constraints (Wadler and Blott 1989). For both aspects, the required extensions to the interpretation of types as relations appear to be folklore, but have seldom been spelled out and have not been put to use before as we do in this paper. Regarding quantification over type constructor variables, the necessary adaptation is as follows. Just as free type variables are interpreted as relations between arbitrarily chosen closed types (and then quantified over via relation variables), free type constructor variables are interpreted as functions on such relations tied to arbitrarily chosen type constructors. Formally, let κ1 and κ2 be type constructors (of kind ∗ → ∗). A relational action for them, denoted F : κ1 ⇔ κ2 , is a function F on relations between closed types such that every R : τ1 ⇔ τ2 (for arbitrary τ1 and τ2 ) is mapped to an F R : κ1 τ1 ⇔ κ2 τ2 . For example, the function F that maps every R : τ1 ⇔ τ2 to
instance Monad [] where return a = [a] as >>= k = concat (map k as) the relational action F : Maybe ⇔ [] given above is not a Monadaction, because it is not the case that ((>>=Maybe ), (>>=[] )) ∈ ∀R. ∀S. F R → ((R → F S) → F S). To see this, consider R = S = id Int , m1 = Just 1 , m2 = [1, 2] , k1 = λi → if i > 1 then Just i else Nothing , and k2 = λi → reverse [2..i] . Clearly, (m1 , m2 ) ∈ F id Int and (k1 , k2 ) ∈ id Int → F id Int , but (m1 >>=Maybe k1 , m2 >>=[] k2 ) = (Nothing, [2]) ∈ / F id Int . On the other hand, the relational action F 0 : Maybe ⇔ [] with
F R = {(Nothing, [ ])} ∪ {(Just a, b : bs) | (a, b) ∈ R, bs :: [τ2 ]}
F 0 R = {(Nothing, [ ])} ∪ {(Just a, [b]) | (a, b) ∈ R}
is a relational action F : Maybe ⇔ []. The relational interpretation of a type quantifying over a type constructor variable is now performed in an analogous way as explained for quantification over type (and then, relation) variables above. In different formulations and detail, the same basic idea is mentioned or used by Fegaras and
is a Monad-action.
3 It is worth noting that “dictionary translation” (Wadler and Blott 1989) would be an alternative way of motivating this definition.
inside functions of that type. Just consider the function f2 from the introduction. During that function’s computation, the monadic bind operation (>>=) is used to combine a µ-encapsulated integer list (viz., sequence ms :: µ [Int]) with a function to a µ-encapsulated single integer (viz., return ◦ sum :: [Int] → µ Int). Clearly, the same or similarly modular code could not have been written at type f2 :: IntMonad µ ⇒ [µ] → µ, because there is no way to provide a function like sequence for the IntMonad class (or any single monomorphised class), not even when we are content with making sequence less flexible by fixing the α in its current type to be Int. So again, proving results about all functions of type f :: Monad µ ⇒ [µ Int] → µ Int covers more ground than might at first appear to be the case. Having rationalised our choice of example function type, let us now get some actual work done. As a final preparation, we need to mention three laws that Monad instances κ are often expected to satisfy:5
We are now ready to derive free theorems involving (polymorphism over) type constructor classes. For example, functions f :: Monad µ ⇒ [µ Int] → µ Int as considered in the introduction will necessarily always satisfy (f, f ) ∈ ∀F . Monad F ⇒ [F id Int ] → F id Int , i.e., for every choice of type constructors κ1 and κ2 that are instances of Monad, and every Monad-action F : κ1 ⇔ κ2 , we have (fκ1 , fκ2 ) ∈ [F id Int ] → F id Int . In the next section we prove several theorems by instantiating the F here, and provide plenty of examples of interesting results obtained for concrete monads. An important role will be played by a notion connecting different Monad instances on a functional, rather than relational, level. Definition 2. Let κ1 and κ2 be instances of Monad and let h :: κ1 α → κ2 α. If • h ◦ return κ1 = return κ2 and • for every choice of closed types τ and τ 0 , m :: κ1 τ , and
k :: τ → κ1 τ 0 ,
return κ a >>=κ k = k a (1) m >>=κ return κ = m (2) (m >>=κ k) >>=κ q = m >>=κ (λa → k a >>=κ q) (3)
h (m >>=κ1 k) = h m >>=κ2 h ◦ k , then h is called a Monad-morphism.
Since Haskell does not enforce these laws, and it is easy to define Monad instances violating them, we will explicitly keep track of where the laws are required in our statements and proofs.
The two notions of Monad-action and Monad-morphism are strongly related, in that Monad-actions are closed under pointwise composition with Monad-morphisms or the inverses thereof, depending on whether the composition is from the left or from the right (Filinski and Støvring 2007, Proposition 3.7(2)).
4.
4.1
Purity Preservation
As mentioned in the introduction, one first intuitive statement we naturally expect to hold of any f :: Monad µ ⇒ [µ Int] → µ Int is that when all the monadic values supplied to f in the input list are actually pure (not associated with any proper monadic effect), then f ’s result value, though of some monadic type, should also be pure. After all, f itself, being polymorphic over µ, cannot introduce effects from any specific monad. This statement is expected to hold no matter what monad the input values live in. For example, if the input list consists of computations in the list monad, defined in the previous section and modelling nondeterminism, but all the concretely passed values actually correspond to deterministic computations, then we expect that f ’s result value also corresponds to a deterministic computation. Similarly, if the input list consists of IO computations, but we only pass ones that happen to have no side-effect at all, then f ’s result, though living in the IO monad, should also be side-effect-free. To capture the notion of “purity” independently of any concrete monad, we use the convention that the pure computations in any monad are those that may be the result of a call to return. Note that this does not mean that the values in the input list must syntactically be return-calls. Rather, each of them only needs to be semantically equivalent to some such call. The desired statement is now formalised as follows. It is proved in Appendix A, and is a corollary of Theorem 3 (to be given later).
One Application Field: Reasoning about Monadic Programs
For most of this section, we focus on functions f :: Monad µ ⇒ [µ Int] → µ Int. However, it should be emphasised that results of the same spirit can be systematically obtained for other function types involving quantification over Monad-restricted type constructor variables just as well. And note that the presence of the concrete type Int in the function signature makes any results we obtain for such f more, rather than less, interesting. For clearly there are strictly, and considerably, fewer functions of type Monad µ ⇒ [µ α] → µ α than there are of type Monad µ ⇒ [µ Int] → µ Int 4 , so proving a statement for all functions of the latter type demonstrates much more power than proving the same statement for all functions of the former type only. In other words, telling f what type of values are encapsulated in its monadic inputs and output entails more possible behaviours of f that our reasoning principle has to keep under control. Also, it is not the case that using Int in most examples in this section means that we might as well have monomorphised the monad interface as follows: class IntMonad µ where return :: Int → µ (>>=) :: µ → (Int → µ) → µ
Theorem 1. Let f :: Monad µ ⇒ [µ Int] → µ Int, let κ be an instance of Monad satisfying law (1), and let l :: [κ Int]. If every element in l is a return κ -image, then so is fκ l.
and thus are actually just proving results about a less interesting type IntMonad µ ⇒ [µ] → µ without any higher-orderedness (viz., quantifying only over a type variable rather than over a type constructor variable). This impression would be a misconception, as we do indeed prove results for functions critically depending on the use of higher-order polymorphism. That the type under consideration is Monad µ ⇒ [µ Int] → µ Int does in no way mean that monadic encapsulation is restricted to only integer values
We can now reason for specific monads as follows. Example 1. Let l :: [[Int]], i.e., l :: [κ Int] for κ = []. We might be interested in establishing that when every element
4 After all, any function (definition) of the type polymorphic over α can also
be given the more specific type, whereas of the functions f1 to f4 given in the introduction as examples for functions of the latter type only f1 can be given the former type as well.
5 Indeed, only a Monad instance satisfying these laws constitutes a “monad” in the mathematical sense of the word.
in l is (evaluated to) a singleton list, then the result of applying any f :: Monad µ ⇒ [µ Int] → µ Int to l will be a singleton list as well. While this propagation is easy to see for f1 , f2 , and f3 from the introduction, it is maybe not so immediately obvious for the f4 given there. However, Theorem 1 tells us without any further effort that the statement in question does indeed hold for f4 , and for any other f of the same type.
Assume we are interested in applying an f :: Monad µ ⇒ [µ Int] → µ Int to an l :: [Writer Int], yielding a monadic result of type Writer Int. Assume further that for some particular purpose during reasoning about the overall program, we are only interested in the actual integer value encapsulated in that result, as extracted by the following function:
Likewise, we obtain the statement about side-effect-free computations in the IO monad envisaged above. All we rely on then is that the IO monad, like the list monad, satisfies monad law (1).
Intuition suggests that then the value of p (f l) should not depend on any logging activity of elements in l. That is, if l were replaced by another l0 :: [Writer Int] encapsulating the same integer values, but potentially attached with different logging information, then p (f l0 ) should give exactly the same value. Since the given p fulfils the required conditions, Theorem 2 confirms this intuition.
4.2
p :: Writer α → α p (Writer (a, s)) = a
Safe Value Extraction
A second general statement we are interested in is to deal with the case that the monadic computations provided as input are not necessarily pure, but we have a way of discarding the monadic layer and recovering underlying values. Somewhat in the spirit of unsafePerformIO :: IO α → α, but for other monads and hopefully safe. Then, if we are interested only in a thus projected result value of f , can we show that it only depends on likewise projected input values, i.e., that we can discard any effects from the monadic computations in f ’s input list when we are not interested in the effectful part of the output computation? Clearly, it would be too much to expect this reduction to work for arbitrary “projections”, or even arbitrary monads. Rather, we need to devise appropriate restrictions and prove that they suffice. The formal statement is as follows.
It should also be instructive here to consider a negative example. Example 3. Recall the list monad defined in Section 3. It is tempting to use head :: [α] → α as an extraction function and expect that for every f :: Monad µ ⇒ [µ Int] → µ Int, we can factor head ◦ f as g ◦ (map head ) for some suitable g :: [Int] → Int. But actually this factorisation fails in a subtle way. Consider, for example, the (for the sake of simplicity, artificial) function f5 :: Monad µ ⇒ [µ Int] → µ Int f5 [ ] = return 0 f5 (m : ms) = do i ← m f5 (if i > 0 then ms else tail ms)
Theorem 2. Let f :: Monad µ ⇒ [µ Int] → µ Int, let κ be an instance of Monad, and let p :: κ α → α. If
Then for l = [[1], [ ]] and l0 = [[1, 0], [ ]], both of type [[Int]], we have map head l = map head l0 , but head (f5 l) 6= head (f5 l0 ). In fact, the left-hand side of this inequation leads to an “head of empty list”-error, whereas the righthand side delivers the value 0. Clearly, this means that the supposed g cannot exist for f5 and head . An explanation for the observed failure is provided by the conditions imposed on p in Theorem 2. It is simply not true that for every m and k, head (m >>= k) = head (k (head m)). More concretely, the failure for f5 observed above arises from this equation being violated for m = [1, 0] and k = λi → if i > 0 then [ ] else [0].
• p ◦ return κ = id and • for every choice of closed types τ and τ 0 , m :: κ τ , and
k :: τ → κ τ 0 ,
p (m >>=κ k) = p (k (p m)) , then p ◦ fκ gives the same result for any two lists of same length whose corresponding elements have the same pimages, i.e., p ◦ fκ can be “factored” as g ◦ (map p) for some suitable g :: [Int] → Int.6
The theorem is proved in Appendix B. Also, it is a corollary of Theorem 4. Note that no monad laws at all are needed in Theorem 2 and its proof. The same will be true for the other theorems we are going to provide, except for Theorem 5. But first, we consider several example applications of Theorem 2.
Since the previous (counter-)example is a bit peculiar in its reliance on runtime errors, let us consider a related setting without empty lists, an example also serving to further emphasise the predictive power of the conditions on p in Theorem 2.
Example 2. Consider the well-known writer, or logging, monad (specialised here to the String monoid):
Example 4. Assume, just for the scope of this example, that the type constructor [] yields (the types of) nonempty lists only. Clearly, it becomes an instance of Monad by just the same definition as given in Section 3. There are now several choices for a never failing extraction function p :: [α] → α. For example, p could be head , could be last, or could be the function that always returns the element in the middle position of its input list (and, say, the left one of the two middle elements in the case of a list of even length). But which of these candidates are “good” in the sense of providing, for
newtype Writer α = Writer (α, String)
instance Monad Writer where
  return a = Writer (a, “”)
  Writer (a, s) >>= k = Writer (case k a of Writer (a′ , s′ ) → (a′ , s ++ s′ ))
is explicitly given as follows: unId ◦ fId ◦ (map Id), using the type constructor Id and its Monad instance definition from Appendix A.
• (return κ2 , return κ1 ) ∈ ∀R. R → F R, since for ev-
ery R and (a, b) ∈ R, (return κ2 a, h (return κ1 b)) = (return κ2 a, return κ2 b) ∈ κ2 R by (return κ2 , return κ2 ) ∈ ∀R. R → κ2 R (which holds due to return κ2 :: ∀α. α → κ2 α), and • ((>>=κ2 ), (>>=κ1 )) ∈ ∀R. ∀S. F R → ((R → F S) → F S), since for every R, S, (m1 , m2 ) ∈ (κ2 R) ; h−1 , and (k1 , k2 ) ∈ R → ((κ2 S) ; h−1 ),
every f :: Monad µ ⇒ [µ Int] → µ Int, a factorisation of p ◦ f into g ◦ (map p) ? The answer is provided by the two conditions on p in Theorem 2, which specialised to the (nonempty) list monad require that • for every a, p [a] = a, and • for every choice of closed types τ and τ 0 , m :: [τ ], and
k :: τ → [τ 0 ], p (concat (map k m)) = p (k (p m)).
(m1 >>=κ2 k1 , h (m2 >>=κ1 k2 )) = (m1 >>=κ2 k1 , h m2 >>=κ2 h ◦ k2 ) ∈ κ2 S
From these conditions it is easy to see that now p = head is good (in contrast to the situation in Example 3), and so is p = last, while the proposed “middle extractor” is not. It does not fulfil the second condition above, roughly because k does not necessarily map all its inputs to equally long lists. (A concrete counterexample f6 , of appropriate type, can easily be produced from this observation.)
4.3
by ((>>=κ2 ), (>>=κ2 )) ∈ ∀R. ∀S. κ2 R → ((R → κ2 S) → κ2 S), (m1 , h m2 ) ∈ κ2 R, and (k1 , h ◦ k2 ) ∈ R → κ2 S. Hence, (fκ2 , fκ1 ) ∈ [F id Int ] → F id Int . Given that we have F id Int = (κ2 id Int ) ; h−1 = h−1 , this implies the claim. (Note that κ2 id Int is the relational interpretation of the closed type κ2 Int, and thus itself denotes id κ2 Int .) Using Theorem 3, we can indeed prove the statements mentioned for the list and exception monads above. Here, for diversion, we instead prove some results about more stateful computations.
Monad Subspacing
Next, we would like to tackle reasoning not about the complete absence of (à la Theorem 1), or disregard for (à la Theorem 2), monadic effects, but about finer nuances. Often, we know certain computations to realise only some of the potential effects to which they would be entitled according to the monad they live in. If, for example, the effect under consideration is nondeterminism à la the standard list monad, then we might know of some computations in that monad that they realise only none-or-one nondeterminism, i.e., never produce more than one answer, but may produce none at all. Or we might know that they realise only nonfailing nondeterminism, i.e., always produce at least one answer, but may produce more than one. Then, we might want to argue that the respective nature of nondeterminism is preserved when combining such computations using, say, a function f :: Monad µ ⇒ [µ Int] → µ Int. This preservation would mean that applying any such f to any list of empty-or-singleton lists always gives an empty-or-singleton list as result, and that applying any such f to any list of nonempty lists only gives a nonempty list as result for sure. Or, in the case of an exception monad (Either String), we might want to establish that an application of f cannot possibly lead to any exceptional value (error description string) other than those already present somewhere in its input list. Such “invariants” can often be captured by identifying a certain “subspace” of the monadic type in question that forms itself a monad, or, indeed, by “embedding” another, “smaller”, monad into the one of interest. Formal counterparts of the intuition behind the previous sentence and the vague phrases occurring therein can be found in Definition 2 and the following theorem, as well as in the subsequent examples.
Example 5. Consider the well-known reader monad: newtype Reader ρ α = Reader (ρ → α) instance Monad (Reader ρ) where return a = Reader (λr → a) Reader g >>= k = Reader (λr → case k (g r) of Reader g 0 → g 0 r) Assume we are given a list of computations in a Reader monad, but it happens that all present computations depend only on a certain part of the environment type. For example, for some closed types τ1 and τ2 , l :: [Reader (τ1 , τ2 ) Int], and for every element Reader g in l, g (x, y) never depends on y. We come to expect that the same kind of independence should then hold for the result of applying any f :: Monad µ ⇒ [µ Int] → µ Int to l. And indeed it does hold by Theorem 3 with the following Monad-morphism: h :: Reader τ1 α → Reader (τ1 , τ2 ) α h (Reader g) = Reader (g ◦ fst)
It is also possible to connect more different monads, even involving the IO monad. Example 6. Let l :: [IO Int] and assume that the only sideeffects that elements in l have consist of writing strings to the output. We would like to use Theorem 3 to argue that the same is then true for the result of applying any f :: Monad µ ⇒ [µ Int] → µ Int to l. To this end, we need to somehow capture the concept of “writing (potentially empty) strings to the output as only side-effect of an IO computation” via an embedding from another monad. Quite naturally, we reuse the Writer monad from Example 2. The embedding function is as follows: h :: Writer α → IO α h (Writer (a, s)) = putStr s >> return a
Theorem 3. Let f :: Monad µ ⇒ [µ Int] → µ Int, let h :: κ1 α → κ2 α be a Monad-morphism, and let l :: [κ2 Int]. If every element in l is an h-image, then so is fκ2 l. Proof. We prove that for every l0 :: [κ1 Int], fκ2 (map h l0 ) = h (fκ1 l0 ) .
(4)
To do so, we first show that F : κ2 ⇔ κ1 with F R = (κ2 R) ; h−1 ,
What is left to do is to show that h is a Monad-morphism. But this property follows from putStr “” = return (),
where “;” is (forward) relation composition and “−1 ” gives the inverse of a function graph, is a Monad-action. Indeed,
putStr (s + + s0 ) = putStr s >> putStr s0 , and monad laws (1) and (3) for the IO monad.
Example 8. Consider the well-known exception monad: instance Monad (Either String) where return a = Right a Left err >>= k = Left err Right a >>= k = k a
Similarly to the above, it would also be possible to show that when the IO computations in l do only read from the input (via, possibly repeated, calls to getChar ), then the same is true of f l. Instead of exercising this through, we turn to general state transformers.
We would like to argue that if we are only interested in whether the result of f for some input list over the type Either String Int is an exceptional value or not (and which ordinary value is encapsulated in the latter case), but do not care what the concrete error description string is in the former case, then the answer is independent of the concrete error description strings potentially appearing in the input list. Formally, let l1 , l2 :: [Either String Int] be of same length, and let corresponding elements either be both tagged with Left (but not necessarily containing the same strings) or be identical Right-tagged values. Then for every f :: Monad µ ⇒ [µ Int] → µ Int, f l1 and f l2 either are both tagged with Left or are identical Right-tagged values. This statement holds by Theorem 4 with the following Monad-morphism:
Example 7. Consider the well-known state monad: newtype State σ α = State (σ → (α, σ)) instance Monad (State σ) where return a = State (λs → (a, s)) State g >>= k = State (λs → let (a, s0 ) = g s in case k a of State g 0 → g 0 s0 ) Intuitively, this monad extends the reader monad by not only allowing a computation to depend on an input state, but also to transform the state to be passed to a subsequent computation. A natural question now is whether being a specific state transformer that actually corresponds to a read-only computation is an invariant that is preserved when computations are combined. That is, given some closed type τ and l :: [State τ Int] such that for every element State g in l, snd ◦ g = id , is it the case that for every f :: Monad µ ⇒ [µ Int] → µ Int, also f l is of the form State g for some g with snd ◦ g = id ? The positive answer is provided by Theorem 3 with the following Monad-morphism:
h :: Either String α → Maybe α h (Left err ) = Nothing h (Right a) = Just a
4.5
h :: Reader τ α → State τ α h (Reader g) = State (λs → (g s, s))
fmap :: Monad µ ⇒ (α → β) → µ α → µ β fmap g m = m >>= return ◦ g
Similarly to the above, we can show preservation of the invariant that a computation transforms the state “in the background”, while the primary result value is independent of the input state. That is, if for every element State g in l, there exists an i :: Int with fst ◦ g = const i, then the same applies to f l. It should also be possible to transfer the above kind of reasoning to the ST monad (Launchbury and Peyton Jones 1995). 4.4
A More Polymorphic Example
Just to reinforce that our approach is not specific to our pet type alone, we end this section by giving a theorem obtained for another type, the one of sequence from the introduction, also showing that mixed quantification over both type constructor variables and ordinary type variables can very well be handled. The theorem’s statement involves the following function:
Theorem 5. Let f :: Monad µ ⇒ [µ α] → µ [α] and let h :: κ1 α → κ2 α be a Monad-morphism. If κ2 satisfies law (2), then for every choice of closed types τ1 and τ2 and g :: τ1 → τ2 ,
Effect Abstraction
fκ2 ◦ map (fmapκ2 g) ◦ map h = fmapκ2 (map g) ◦ h ◦ fκ1.⁷
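As an illustrative aside (not part of the paper's development): maybeToList is a Monad-morphism from Maybe to [], and the list monad satisfies the monad laws, so Theorem 5 with f = sequence, κ1 = Maybe, κ2 = [], and g = show predicts an equality that can be spot-checked in GHC:

import Data.Maybe (maybeToList)

lhs, rhs :: [Maybe Int] -> [[String]]
lhs = sequence . map (fmap show) . map maybeToList   -- fκ2 ◦ map (fmapκ2 g) ◦ map h
rhs = fmap (map show) . maybeToList . sequence       -- fmapκ2 (map g) ◦ h ◦ fκ1

main :: IO ()
main = do
  print (lhs [Just 1, Just 2]  == rhs [Just 1, Just 2])    -- True
  print (lhs [Just 1, Nothing] == rhs [Just 1, Nothing])   -- True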
As a final statement about our pet type, Monad µ ⇒ [µ Int] → µ Int, we would like to show that we can abstract from some aspects of the effectful computations in the input list if we are interested in the effects of the final result only up to the same abstraction. For mediating between the full effect space and its abstraction, we again use Monad-morphisms.
Intuitively, this theorem means that any f of type Monad µ ⇒ [µ α] → µ [α] commutes with both transformations on the monad structure and transformations on the element level. The occurrences of map and fmap are solely there to bring those transformations h and g into the proper positions with respect to the different nestings of the type constructors µ and [] on the input and output sides of f. Note that by setting either g or h to id, we obtain the specialised versions
Theorem 4. Let f :: Monad µ ⇒ [µ Int] → µ Int and let h :: κ1 α → κ2 α be a Monad-morphism. Then h ◦ fκ1 gives the same result for any two lists of same length whose corresponding elements have the same h-images.
fκ2 ◦ map h = h ◦ fκ1

Proof. Let l1, l2 :: [κ1 Int] be such that map h l1 = map h l2. Then h (fκ1 l1) = h (fκ1 l2) by statement (4) from the proof of Theorem 3.
⁷ For the curious reader: the proof derives this statement from (fκ2, fκ1) ∈ [F g⁻¹] → F [g⁻¹] for the same Monad-action F : κ2 ⇔ κ1 as used in the proof of Theorem 3.
by emptyR, consR, and appendR (and/or applying rep to explicit lists). But this implicit assumption is not immediately in reach for formal grasp. So it would be nice to be able to provide a single, conclusive correctness statement for transformations like the one above. One way to do so was presented by Voigtländer (2002), but it requires a certain restructuring of code that can hamper compositionality and flexibility by introducing abstraction at fixed program points (via lambda-abstraction and so-called vanish-combinators). This also brings us to the second problem with the simple approach above. When, and how, should we switch between the original and the alternative representations of lists during program construction? If we first write the original version of flatten and only later, after observing a quadratic runtime overhead, switch manually to the flatten′-version, then this rewriting is quite cumbersome, in particular when it has to be done repeatedly for different functions. Of course, we could decide to always use emptyR, consR, and appendR from the beginning, to be on the safe side. But actually this strategy is not so safe, efficiency-wise, because the representation of lists by functions carries its own (constant-factor) overhead. If a function does not use appends in a harmful way, then we do not want to pay this price. Hence, using the alternative representation in a particular situation should be a conscious decision, not a default. And assume that later on we change the behaviour of flatten, say, to explore only a single path through the input tree, so that no appends at all arise. Certainly, we do not want to have to go and manually switch back to the, now sufficient, original list representation. The cure to our woes here is almost obvious, and has often been applied in similar situations: simply use overloading. Specifically, we can declare a type constructor class as follows:
and fκ ◦ map (fmapκ g) = fmapκ (map g) ◦ fκ. (5) Further specialising the latter by choosing the identity monad for κ, we would also essentially recover the free theorem derived for f :: [α] → [α] in Section 2.
5. Another Application: Difference Lists, Transparently
It is a well-known problem that computations over lists sometimes suffer from a quadratic runtime blow-up due to left-associatively nested appends. For example, this is so for flattening a tree of type

data Tree α = Leaf α | Node (Tree α) (Tree α)

using the following function:

flatten :: Tree α → [α]
flatten (Leaf a) = [a]
flatten (Node t1 t2) = flatten t1 ++ flatten t2

An equally well-known solution is to switch to an alternative representation of lists as functions, by abstraction over the list end, often called difference lists. In the formulation of Hughes (1986), but encapsulated as an explicitly new data type:

newtype DList α = DL {unDL :: [α] → [α]}

rep :: [α] → DList α
rep l = DL (l ++)

abs :: DList α → [α]
abs (DL f) = f []
class ListLike δ where
  empty :: δ α
  cons :: α → δ α → δ α
  append :: δ α → δ α → δ α
emptyR :: DList α
emptyR = DL id

consR :: α → DList α → DList α
consR a (DL f) = DL ((a :) ◦ f)
and code flatten in the following form:

flatten :: Tree α → (∀δ. ListLike δ ⇒ δ α)
flatten (Leaf a) = cons a empty
flatten (Node t1 t2) = append (flatten t1) (flatten t2)
appendR :: DList α → DList α → DList α
appendR (DL f) (DL g) = DL (f ◦ g)

Then, flattening a tree into a list in the new representation can be done using the following function:
Then, with the obvious instance definitions

instance ListLike [] where
  empty = []
  cons = (:)
  append = (++)
flatten′ :: Tree α → DList α
flatten′ (Leaf a) = consR a emptyR
flatten′ (Node t1 t2) = appendR (flatten′ t1) (flatten′ t2)

and
and a more efficient variant of the original function, with its original type, can be recovered as follows:
instance ListLike DList where
  empty = emptyR
  cons = consR
  append = appendR

we can use the single version of flatten above both to produce ordinary lists and to produce difference lists. The choice between the two will be made automatically by the type checker, depending on the context in which a call to flatten occurs. For example, in
flatten :: Tree α → [α]
flatten = abs ◦ flatten′

There are two problems with this approach. One is correctness. How do we know that the new flatten is equivalent to the original one? We could try to argue by "distributing" abs over the definition of flatten′, using abs emptyR = [], abs (consR a as) = a : abs as, and

abs (appendR as bs) = abs as ++ abs bs.    (6)

last (flatten t)    (7)
the ordinary list representation will be used, due to the input type of last. Actually, (7) will compile (under GHC, at least) to exactly the same code as last (flatten t) for the original definition of flatten from the very beginning of this section. Any overhead related to the type class abstraction is simply eliminated by a standard optimisation. In particular, this means that where the original representation of lists would have perfectly sufficed, programming against the abstract interface provided by the ListLike class does
But actually the last equation does not hold in general. The reason is that there are as :: DList τ that are not in the image of rep. Consider, for example, as = DL reverse. Then neither is as = rep l for any l, nor does (6) hold for every bs. Any argument "by distributing abs" would thus have to rely on the implicit assumption that a certain discipline has been exercised when going from the original flatten to flatten′ by replacing [], (:), and (++)
by (a, b) ∈ R and (f l1, bs ++ l2) ∈ [R] (which holds due to (f, (bs ++)) ∈ [R] → [R] and (l1, l2) ∈ [R]), and
• (appendR, (++)) ∈ ∀R. F R → (F R → F R), since for every R, (f, as) ∈ ([R] → [R]) ; (++)⁻¹, (g, bs) ∈ ([R] → [R]) ; (++)⁻¹, and (l1, l2) ∈ [R],
no harm either. On the other hand, (7) of course still suffers from the same quadratic runtime blow-up as with the original definition of flatten. But now we can switch to the better behaved difference list representation without touching the code of flatten at all, by simply using

last (abs (flatten t)).    (8)

Here the (input) type of abs determines flatten to use emptyR, consR, and appendR, leading to linear runtime. Can we now also answer the correctness question more satisfactorily? Given the forms of (7) and (8), it is tempting to simply conjecture that abs t = t for any t. But this conjecture cannot be quite right, as abs has different input and output types. Also, we have already observed that some t of abs's input type are problematic by not corresponding to any actual list. The coup now is to consider only those t that use only the ListLike interface, rather than any specific operations related to DList as such. That is, we will indeed prove that for every closed type τ and t :: ListLike δ ⇒ δ τ,
(unDL (appendR (DL f) (DL g)) l1, (as ++ bs) ++ l2) = (f (g l1), as ++ (bs ++ l2)) ∈ [R] by (f, (as ++)) ∈ [R] → [R], (g, (bs ++)) ∈ [R] → [R], and (l1, l2) ∈ [R]. Hence, (tDList, t[]) ∈ F idτ. Given that we have F idτ = unDL ; ([idτ] → [idτ]) ; (++)⁻¹ = unDL ; (++)⁻¹, this implies (9). Note that the ListLike-action F : DList ⇔ [] used in the above proof is the same as F R = (DList R) ; rep⁻¹,
abs tDList = t[] .
given that DList R = unDL ; ([R] → [R]) ; DL. This connection suggests the following more general theorem, which can actually be proved much like above.
Since the polymorphism over δ in the type of t is so important, we follow Voigtländer (2008a) and make it an explicit requirement in a function that we will use instead of abs for switching from the original to the alternative representation of lists:
Theorem 7. Let t :: ListLike δ ⇒ δ τ for some closed type τ , let κ1 and κ2 be instances of ListLike, and let h :: κ1 α → κ2 α. If
improve :: (∀δ. ListLike δ ⇒ δ α) → [α]
improve t = abs t

Now, when we observe the problematic runtime overhead in (7), we can replace it by
• h emptyκ1 = emptyκ2,
• for every closed type τ, a :: τ, and as :: κ1 τ, h (consκ1 a as) = consκ2 a (h as), and
• for every closed type τ and as, bs :: κ1 τ, h (appendκ1 as bs) = appendκ2 (h as) (h bs),
last (improve (flatten t)).

That this replacement does not change the semantics of the program is established by the following theorem, which provides the sought-after general correctness statement.
then h tκ1 = tκ2 .
Theorem 6. Let t :: ListLike δ ⇒ δ τ for some closed type τ . Then improve t = t[] .
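The following self-contained sketch (again purely illustrative; it restates the definitions of this section in one module, renames abs to absD to avoid the clash with Prelude.abs, and uses GHC's RankNTypes extension) observes Theorem 6 on one example tree:

{-# LANGUAGE RankNTypes #-}

class ListLike d where
  empty  :: d a
  cons   :: a -> d a -> d a
  append :: d a -> d a -> d a

instance ListLike [] where
  empty  = []
  cons   = (:)
  append = (++)

newtype DList a = DL { unDL :: [a] -> [a] }

instance ListLike DList where
  empty                = DL id
  cons a (DL f)        = DL ((a :) . f)
  append (DL f) (DL g) = DL (f . g)

absD :: DList a -> [a]
absD (DL f) = f []

improve :: (forall d. ListLike d => d a) -> [a]
improve t = absD t

data Tree a = Leaf a | Node (Tree a) (Tree a)

flatten :: ListLike d => Tree a -> d a
flatten (Leaf a)     = cons a empty
flatten (Node t1 t2) = append (flatten t1) (flatten t2)

main :: IO ()
main = do
  let t = Node (Node (Leaf 1) (Leaf 2)) (Leaf (3 :: Int))
  print (improve (flatten t) == flatten t)   -- True, as Theorem 6 predicts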
Theorem 6 is a special case of this theorem by setting κ1 = [], κ2 = DList, and h = rep, and observing that
• rep [] = emptyR,
Proof. We prove
• for every closed type τ , a :: τ , and as :: [τ ], rep (a : as) =
unDL tDList = (t[] ++),    (9)
consR a (rep as),
• for every closed type τ and as, bs :: [τ], rep (as ++ bs) =
which by the definitions of improve and abs, and by t[] ++ [] = t[], implies the claim. To do so, we first show that F : DList ⇔ [] with F R = unDL ; ([R] → [R]) ; (++)⁻¹ is a ListLike-action, where the latter concept is defined as any relational action F : κ1 ⇔ κ2 for type constructors κ1 and κ2 that are instances of ListLike such that
appendR (rep as) (rep bs), and
• abs ◦ rep = id,
all of which hold by easy calculations. One key observation here is that the third of the above observations does actually hold, in contrast to its faulty “dual” (6) considered earlier in this section. Of course, free theorems can now also be derived for other types than those considered in Theorems 6 and 7. For example, for every closed type τ , f :: ListLike δ ⇒ δ τ → δ τ , and h as in Theorem 7, we get that:
fκ2 ◦ h = h ◦ fκ1.

• (emptyκ1, emptyκ2) ∈ ∀R. F R,
• (consκ1, consκ2) ∈ ∀R. R → (F R → F R), and
• (appendκ1, appendκ2) ∈ ∀R. F R → (F R → F R).

Indeed,
• (emptyR, []) ∈ ∀R. F R, since for every R and (l1, l2) ∈ [R], (unDL emptyR l1, [] ++ l2) = (l1, l2) ∈ [R],
• (consR, (:)) ∈ ∀R. R → (F R → F R), since for every R, (a, b) ∈ R, (f, bs) ∈ ([R] → [R]) ; (++)⁻¹, and (l1, l2) ∈ [R],

6. Discussion and Related Work
Of course, statements like that of Theorem 7 are not an entirely new revelation. That statement can be read as a typical fusion law for compatible morphisms between algebras over the signature described by the ListLike class declaration. (For a given τ , consider ListLike δ ⇒ δ τ as the corresponding initial algebra, κ1 τ and κ2 τ as two further algebras, and the operation ·κi of instantiating
(unDL (consR a (DL f)) l1, (b : bs) ++ l2) = (a : f l1, b : bs ++ l2) ∈ [R]
a t :: ListLike δ ⇒ δ τ to a tκi :: κi τ as initial algebra morphism, or catamorphism. Then the conditions on h in Theorem 7 make it an algebra morphism and the theorem’s conclusion, also expressible as h ◦ ·κ1 = ·κ2 , is “just” that of the standard catamorphism fusion law.) But being able to derive such statements directly from the types in the language, based on its built-in abstraction facilities, immediately as well for more complicated types (like ListLike δ ⇒ δ τ → δ τ instead of ListLike δ ⇒ δ τ ), and all this without going through category-theoretic hoops, is new and unique to our approach. There has been quite some interest recently in enhancing the state of the art in reasoning about monadic programs. Filinski and Støvring (2007) study induction principles for effectful data types. These principles are used for reasoning about functions on data types involving specific monadic effects (rather than about functions that are parametric over some monad), and based on the functions’ defining equations (rather than based on their types only), and thus are orthogonal to our free theorems. But for their example applications to formal models of backtracking, Filinski and Støvring also use a form of relational reasoning very close to the one appearing in our invocation of relational parametricity. In particular, our Definition 1 corresponds to their Definition 3.3. They also use monad morphisms (not to be confused with their monadalgebra morphisms, or rigid functions, playing the key role in their induction principles). The scope of their relational reasoning is different, though. They use it for establishing the observational equivalence of different implementations of the same monadic effect. This is, of course, one of the classical uses of relational parametricity: representation independence in different realizations of an abstract data type. But it is only one possible use, and our treatment of full polymorphism opens the door to other uses also in connection with monadic programs. Rather than only relating different, but semantically equivalent, implementations of the same monadic effect (as hard-wired into Filinski and Støvring’s Definition 3.5), we actually connect monads embodying different effects. These connections lead to applications not previously in reach, such as our reasoning about preservation of invariants. It is worth pointing out that Filinski (2007) does use monad morphisms for “subeffecting”, but only for the discussion of hierarchies inside each one of two competing implementations of the same set of monadic effects; the relational reasoning (via Monad-actions and so forth) is then orthogonal to these hierarchies and again can only lead to statements about observational equivalence of the two realizations overall, rather than to more nuanced statements about programs in one of them as such. The reason again, as with Filinski and Støvring (2007), is that no full polymorphism is considered, but only parametrisation over same-effect-monads on top-level. Interestingly, though, the key step in all our proofs in Section 4, namely finding a suitable Monad-action, can be streamlined in the spirit of Proposition 3.7 of Filinski and Støvring (2007) or Lemmas 45, 46 of Filinski (2007). It seems fair to mention that the formal accounts of Filinski and Støvring are very complex, but that this is necessarily so because they deal with general recursion at both term and type level, while we have completely dodged such issues. 
Treating general recursion in a semantic framework typically involves a good deal of domain theory such as considered by Birkedal et al. (2007). We only provide a very brief sketch of what interactions we expect between general recursion and our developments from the previous sections in Appendix C. Swierstra (2008) proposes to code against modularly assembled free monads, where the assembling takes place by building coproducts of signature functors corresponding to the term languages of free monads. The associated type signatures are able to convey some of the information captured by our approach. For example, a monadic type Term PutStr Int can be used to describe com-
putations whose only possible side-effect is that of writing strings to the output. Passing a list of values of that type to a function f :: Monad µ ⇒ [µ Int] → µ Int clearly results in a value of type Term PutStr Int as well. Thus, if it is guaranteed (note the proof obligation) that “execution” of such a term value, on a kind of virtual machine (Swierstra and Altenkirch 2007) or in the actual IO monad, does indeed have no other side effect than potential output, then one gets a statement in the spirit of our Example 6. On the other hand, statements like the one in our Example 8 (also, say, reformulated for exceptions in the IO monad) are not in reach with that approach alone. Moreover, Swierstra’s approach to “subeffecting” depends very much on syntax, essentially on term language inclusion along with proof obligations on the execution functions from terms to some semantic space. This dependence prevents directly obtaining statements roughly analogous to our Examples 5 and 7 using his approach. Also, depending on syntactic inclusion is a very strong restriction indeed. For example, putStr “” is semantically equivalent to return (), and thus without visible side-effect. But nevertheless, any computation syntactically containing a call to putStr would of necessity be assigned a type in a monad Term g with g “containing” (with respect to Swierstra’s functor-level relation :≺:) the functor PutStr, even when that call’s argument would eventually evaluate to the empty string. Thus, such a computation would be banned from the input list in a statement like the one we give below Example 6. It is not so with our more semantical approach. Dealing more specifically with concrete monads is the topic of recent works by Hutton and Fulger (2008), using point-free equational reasoning, and by Nanevski et al. (2008), employing an axiomatic extension of dependent type theory. On the tool side, we already mentioned the free theorems generator at http://linux.tcs.inf.tu-dresden.de/~voigt/ft/. It deals gracefully with ordinary type classes (in the offline, shellbased version even with user-defined ones), but has not yet been extended for type constructor classes. There is also another free theorems generator, written by Andrew Bromage, running in Lambdabot (http://haskell.org/haskellwiki/Lambdabot). It does not know about type or type constructor classes, but deals with type constructors by treating them as fixed functors. Thus, it can, for example, derive the statement (5) for functions fκ :: [κ α] → κ [α], but not more general and more interesting statements like those given in Theorem 5 and earlier, connecting different Monad instances, concerning the beyond-functor aspects of monads, or our results about ListLike.
Acknowledgments I would like to thank the anonymous reviewers of more than one version of this paper who have helped to improve it through their criticism and suggestions. Also, I would like to thank Helmut Seidl, who inspired me to consider free theorems involving type constructor classes in the first place by asking a challenging question regarding the power of type(-only)-based reasoning about monadic programs during a train trip through Munich quite some time ago. (The answer to his question is essentially Example 7.)
References

L. Birkedal, R.E. Møgelberg, and R.L. Petersen. Domain-theoretical models of parametric polymorphism. Theoretical Computer Science, 388(1–3):152–172, 2007.
S. Böhme. Free theorems for sublanguages of Haskell. Master's thesis, Technische Universität Dresden, 2007.
N.A. Danielsson, R.J.M. Hughes, P. Jansson, and J. Gibbons. Fast and loose reasoning is morally correct. In Principles of Programming Languages, Proceedings, pages 206–217. ACM Press, 2006.
L. Fegaras and T. Sheard. Revisiting catamorphisms over datatypes with embedded functions (or, Programs from outer space). In Principles of Programming Languages, Proceedings, pages 284–294. ACM Press, 1996.
P. Wadler. Theorems for free! In Functional Programming Languages and Computer Architecture, Proceedings, pages 347–359. ACM Press, 1989. P. Wadler. The essence of functional programming (Invited talk). In Principles of Programming Languages, Proceedings, pages 1–14. ACM Press, 1992. P. Wadler and S. Blott. How to make ad-hoc polymorphism less ad hoc. In Principles of Programming Languages, Proceedings, pages 60–76. ACM Press, 1989.
A. Filinski. On the relations between monadic semantics. Theoretical Computer Science, 375(1–3):41–75, 2007. A. Filinski and K. Støvring. Inductive reasoning about effectful data types. In International Conference on Functional Programming, Proceedings, pages 97–110. ACM Press, 2007. A. Gill, J. Launchbury, and S.L. Peyton Jones. A short cut to deforestation. In Functional Programming Languages and Computer Architecture, Proceedings, pages 223–232. ACM Press, 1993.
A. Proof of Theorem 1
We prove that for every l0 :: [Int],
R.J.M. Hughes. A novel representation of lists and its application to the function “reverse”. Information Processing Letters, 22(3):141–144, 1986.
fκ (map returnκ l0) = returnκ (unId (fId (map Id l0))), where
G. Hutton and D. Fulger. Reasoning about effects: Seeing the wood through the trees. In Trends in Functional Programming, Draft Proceedings, 2008.
newtype Id α = Id {unId :: α}

instance Monad Id where
  return a = Id a
  Id a >>= k = k a

To do so, we first show that F : κ ⇔ Id with
P. Johann and J. Voigtländer. Free theorems in the presence of seq. In Principles of Programming Languages, Proceedings, pages 99–110. ACM Press, 2004.
J. Kučan. Metatheorems about Convertibility in Typed Lambda Calculi: Applications to CPS Transform and "Free Theorems". PhD thesis, Massachusetts Institute of Technology, 1997.
F R = returnκ⁻¹ ; R ; Id, where ";" is (forward) relation composition and "⁻¹" gives the inverse of a function graph, is a Monad-action. Indeed,
J. Launchbury and S.L. Peyton Jones. State in Haskell. Lisp and Symbolic Computation, 8(4):293–341, 1995. S. Liang, P. Hudak, and M.P. Jones. Monad transformers and modular interpreters. In Principles of Programming Languages, Proceedings, pages 333–343. ACM Press, 1995.
• (return κ , return Id ) ∈ ∀R. R → F R, since for every R and
J.C. Mitchell and A.R. Meyer. Second-order logical relations (Extended abstract). In Logic of Programs, Proceedings, volume 193 of LNCS, pages 225–236. Springer-Verlag, 1985.
• ((>>=κ ), (>>=Id )) ∈ ∀R. ∀S. F R → ((R → F S) →
(a, b) ∈ R, (returnκ a, returnId b) = (returnκ a, Id b) ∈ returnκ⁻¹ ; R ; Id, and
F S), since for every R, S, (a, b) ∈ R, and (k1, k2) ∈ R → F S, (returnκ a >>=κ k1, Id b >>=Id k2) = (k1 a, k2 b) ∈ F S. (Note the use of monad law (1) for κ.)
E. Moggi. Notions of computation and monads. Information and Computation, 93(1):55–92, 1991. A. Nanevski, G. Morrisett, A. Shinnar, P. Govereau, and L. Birkedal. Ynot: Dependent types for imperative programs. In International Conference on Functional Programming, Proceedings, pages 229–240. ACM Press, 2008.
Hence, by what we derived towards the end of Section 3, (fκ, fId) ∈ [F idInt] → F idInt. Given that we have F idInt = returnκ⁻¹ ; Id = (returnκ ◦ unId)⁻¹, this implies the claim.
S.L. Peyton Jones and P. Wadler. Imperative functional programming. In Principles of Programming Languages, Proceedings, pages 71–84. ACM Press, 1993.
B. Proof of Theorem 2
We prove that for every l :: [κ Int],
J.C. Reynolds. Types, abstraction and parametric polymorphism. In Information Processing, Proceedings, pages 513–523. Elsevier, 1983.
p (fκ l) = unId (fId (map (Id ◦ p) l)) , where the type constructor Id and its Monad instance definition are as in the proof of Theorem 1. To do so, we first show that F : κ ⇔ Id with F R = p ; R ; Id is a Monad-action. Indeed,
F. Stenger and J. Voigtländer. Parametricity for Haskell with imprecise error semantics. In Typed Lambda Calculi and Applications, Proceedings, volume 5608 of LNCS, pages 294–308. Springer-Verlag, 2009.
W. Swierstra. Data types à la carte. Journal of Functional Programming, 18(4):423–436, 2008.
W. Swierstra and T. Altenkirch. Beauty in the beast — A functional semantics for the awkward squad. In Haskell Workshop, Proceedings, pages 25–36. ACM Press, 2007.
• (return κ , return Id ) ∈ ∀R. R → F R, since for every R and
I. Takeuti. The theory of parametricity in lambda cube. Manuscript, 2001.
• ((>>=κ ), (>>=Id )) ∈ ∀R. ∀S. F R → ((R → F S) →
J. Voigtländer. Concatenate, reverse and map vanish for free. In International Conference on Functional Programming, Proceedings, pages 14–25. ACM Press, 2002.
F S), since for every R, S, (m, b) ∈ p ; R, and (k1 , k2 ) ∈ R → F S, (m >>=κ k1 , Id b >>=Id k2 ) ∈ p ; S ; Id by p (m >>=κ k1 ) = p (k1 (p m)) and (k1 (p m), k2 b) ∈ p ; S ; Id (which holds due to (k1 , k2 ) ∈ R → F S and (p m, b) ∈ R).
(a, b) ∈ R, (return κ a, b) ∈ p ; R by p (return κ a) = a, and
J. Voigtländer. Asymptotic improvement of computations over free monads. In Mathematics of Program Construction, Proceedings, volume 5133 of LNCS, pages 388–403. Springer-Verlag, 2008a.
Hence, (fκ, fId) ∈ [F idInt] → F idInt. Given that we have F idInt = p ; Id = Id ◦ p = p ; unId⁻¹, this implies the claim.
J. Voigtländer. Much ado about two: A pearl on parallel prefix computation. In Principles of Programming Languages, Proceedings, pages 29–35. ACM Press, 2008b.
J. Voigtländer. Bidirectionalization for free! In Principles of Programming Languages, Proceedings, pages 165–176. ACM Press, 2009.

C. Free Theorems, the Ugly Truth
Free theorems as described in Section 2 are beautiful. And very nice. Almost too good to be true. And actually they are not. At least not unrestricted and in a setting more closely resembling a modern
D. Vytiniotis and S. Weirich. Type-safe cast does no harm: Syntactic parametricity for Fω and beyond. Manuscript, 2009.
For the extension to the setting with type constructor classes (cf. Section 3), we will need to mandate that any relational action, now denoted F : κ1 ⇔̇ κ2, must preserve strictness, i.e., map R : τ1 ⇔̇ τ2 to F R : κ1 τ1 ⇔̇ κ2 τ2. Apart from that, Definition 1, for example, is expected to remain unchanged (except that R and S will now range over strict relations, of course). Under these assumptions, we can investigate the impact of the presence of general recursion on the results seen in the main body of this paper. Consider Theorem 1, for example. In order to have F : κ ⇔̇ Id in its proof, we need to change the definition of F R as follows:
functional language than the plain polymorphic lambda-calculus for which relational parametricity was originally conceived. In particular, problems are caused by general recursion with its potential for nontermination. We have purposefully ignored this issue throughout the main body of the paper, so as to be able to explain our ideas and new abstractions in the most basic surrounding. In a sense, our reasoning has been “up to ⊥”, or “fast and loose, but morally correct” (Danielsson et al. 2006). We leave a full formal treatment of free theorems involving type constructor classes in the presence of partiality as a challenge for future work, but use this appendix to outline some refinements that are expected to play a central role in such a formalisation. So what is the problem with potential nontermination? Let us first discuss this question based on the simple example
F R = {(⊥, ⊥)} ∪ (returnκ⁻¹ ; R ; Id). For this relational action to be a Monad-action, we would need the additional condition that ⊥ >>=κ k1 = k1 ⊥ for any choice of k1. Then, (fκ, fId) ∈ [F idInt] → F idInt would allow us to derive the following variant, valid in the presence of general recursion and ⊥.
f :: [α] → [α] from Section 2. There, we argued that the output list of any such f can only ever contain elements from the input list. But this claim is not true anymore now, because f might just as well choose, for some element position of its output list, to start an arbitrary looping computation. That is, while f certainly (and still) cannot possibly make up new elements of any concrete type to put into the output, such as 42 or True, it may very well put ⊥ there, even while not knowing the element type of the lists it operates over, because ⊥ does exist at every type. So the erstwhile claim that for any input list l the output list f l consists solely of elements from l has to be refined as follows.
Theorem 1’. Let f :: Monad µ ⇒ [µ Int] → µ Int, let κ be an instance of Monad satisfying law (1) and ⊥ >>=κ k = k ⊥ for every (type-appropriate) k, and let l :: [κ Int]. If every element in l is a return κ -image or ⊥, then so is fκ l.
Note that the Reader monad, for example, satisfies the conditions for applying the thus adapted theorem. Similar repairs are conceivable for the other statements we have derived, or one might want to derive. Just as another sample, we expect Example 7 to change as follows.
For any input list l the (potentially partial or infinite) output list f l consists solely of elements from l and/or ⊥. The decisions about which elements from l to propagate to the output list, in which order and multiplicity, and where to put ⊥ can again only be made based on the input list l, and only by inspecting its length (or running into an undefined tail or an infinite list).
Example 7’. Let f :: Monad µ ⇒ [µ Int] → µ Int, let τ be a closed type, and let l :: [State τ Int]. If for every element State g in l, the property P (g) defined as
So for any pair of lists l and l0 of same length (refining this notion to take partial and infinite lists into account) the lists f l and f l0 are formed by making the same position-wise selections of elements from l and l0 , respectively, and by inserting ⊥ at the same positions, if any.
P (g) := ∀s. snd (g s) = s ∨ snd (g s) = ⊥ holds, then also f l is of the form State g for some g with P (g).
For any l0 = map g l, we then still have that f l and f l0 are of the same length and contain position-wise exactly corresponding elements from l and l0 = map g l, at those positions where f takes over elements from its input rather than inserting ⊥. For those positions where f does insert ⊥, which will then happen equally for f l and f l0 , we may only argue that the element in f l0 contains the g-image of the corresponding element in f l if indeed ⊥ is the g-image of ⊥, that is, if g is a strict function.
Note that even if we had kept the stronger precondition that snd ◦ g = id for every element State g in l, it would be impossible to prove snd ◦ g = id instead of the weaker P (g) for f l = State g. Just consider the case that f invokes an immediately looping computation, i.e., f l = ⊥ = State ⊥.8 The g = ⊥ here satisfies P (g), but not snd ◦ g = id .
So for any list l and, importantly, strict function g, we have f (map g l) = map g (f l). The formal counterpart to the extra care exercised above regarding potential occurrences of ⊥ is the provision of Wadler (1989, Section 7) that only strict and continuous relations should be allowed as interpretations for types. In particular, when interpreting quantification over type variables by quantification over relation variables, those quantified relations are required to contain the pair (⊥, ⊥), also signified via the added dot in the new notation R : τ1 ⇔̇ τ2. With straightforward changes to the required constructions on relations, such as explicitly including the pair (⊥, ⊥) in [R] : [τ1] ⇔̇ [τ2] and Maybe R : Maybe τ1 ⇔̇ Maybe τ2, and replacing the least by the greatest fixpoint in the definition of [R], we get a treatment of free theorems that is sound even for a language including general recursion, and thus nontermination.
⁸ The equality ⊥ = State ⊥ holds by the semantics of newtype in Haskell.
Experience Report: Haskell in the "Real World"
Writing a Commercial Application in a Lazy Functional Language

Curt J. Sampson
Starling Software, Tokyo, Japan
[email protected]
Abstract
as C# and Python). When we embarked on this project, we had no significant functional programming experience. In the mid-2000s a fellow programmer extolled to me the virtues of Scheme and, following up on this, I found that people such as Paul Graham were also making convincing arguments (in essays such as "Beating the Averages"² (3)) that functional programming would increase productivity and reduce errors. Working through The Little Schemer (1), I was impressed by the concision and adaptability of what was admittedly toy code. Upon founding a software development company soon after this, my co-founder and I agreed that pursuing functional programming was likely to be beneficial, and thus we should learn a functional language and develop a software project in it. There things sat for about two years, until an opportunity arose. In the spring of 2008 we were approached by a potential client to write a financial application. He had previous experience working on a similar application in Java, and approached us with the idea of doing something similar.
I describe the initial attempt of experienced business software developers with minimal functional programming background to write a non-trivial, business-critical application entirely in Haskell. Some parts of the application domain are well suited to a mathematically-oriented language; others are more typically done in languages such as C++. I discuss the advantages and difficulties of Haskell in these circumstances, with a particular focus on issues that commercial developers find important but that may receive less attention from the academic community. I conclude that, while academic implementations of “advanced” programming languages arguably may lag somewhat behind implementations of commercial languages in certain ways important to businesses, this appears relatively easy to fix, and that the other advantages that they offer make them a good, albeit long-term, investment for companies where effective IT implementation can offer a crucial advantage to success. Categories and Subject Descriptors D.1.1 [Programming Techniques]: Applicative (Functional) Programming; D.2.3 [General]: Coding Tools and Techniques; D.3.2 [Programming Languages]: Language Classifications—Applicative (functional) Languages General Terms Performance
1.2
Experimentation, Human Factors, Languages,
Keywords functional programming, Haskell, commercial programming, financial systems
1. Introduction
1.1
Selling the Client
We felt that the client’s application domain could benefit from using a more powerful language, and we saw this as an opportunity to explore functional programming. Convincing the client to do this took some work, especially since we had no real experience with functional languages at that point. We took a two-pronged approach to selling functional programming for this job. First, we explained that we had significant programming experience in more traditional languages, especially Java. We pointed out that this made us familiar with the limitations of these languages: so familiar, in fact, that in the case of Java we had already moved away from the language to escape those very limitations. We explained that, for us, switching languages every few years was a normal technology upgrade (much as any other industry will incorporate new advancements in the state of the art) and we had previous experience with working through these sorts of changes. Second, we argued that we had been for some time looking to move on to a new language, had done significant research in that direction, and had already identified many of the specific things needed to make the change successful and profitable. As well as general promises of better productivity and fewer bugs, we pointed to specific features we wanted in a new language for which the client himself was looking. In particular, he had expressed a preference for a system where he would have the ability to examine and modify the financial algorithms himself; we explained and demonstrated that the languages we were looking at offered much
Background and Beginnings 1
We are reasonably smart but not exceptional programmers with several decades of professional experience amongst us. Our main working languages were C in the 1990s, Java in the very late 1990s and the first part of the 2000s, and Ruby (7) after that. Other developers on the team have experience in similar languages (such

¹ Despite this paper having one author, I am reporting on the combined experiences of a multi-person developer and customer team. Thus, I use "we" throughout, except when relating distinctly personal experiences.
2 Particularly fascinating to us was, “In business, there is nothing more valuable than a technical advantage your competitors don’t understand.” We are still considering the importance of ourselves understanding the technical advantages we’ve gained.
to the platform facilities themselves (system calls and so on) is important. At times being able to do things such as set operating-system-specific socket options for individual network connections can make the difference between an application working well or poorly. In some cases, applications need to be able to control memory layout and alignment in order to marshal specific data structures used by the system or external libraries. In our case performance was an important consideration. The application reads and parses market data messages that arrive on a network connection and responds with orders. The response must occur within a few dozen milliseconds for the system to work reasonably well, and the faster the response, the greater the chance the system's algorithms will be able to make profitable trades. Finally, though this is not true of all commercial developers, we prefer to use a free, open source implementation of a language, rather than commercial tools. There are three reasons for this beyond the purchase cost. First, we've found that commercial support for proprietary software is not significantly better, and is often worse, than community support for open source software under active development. Second, the access to source code and ability to rebuild it can be extremely helpful when it comes to debugging problems (the burden of which usually falls on the customer). Third, clients often feel more comfortable with open source when using niche products (as most functional language platforms are) as it ensures that they can have continued access to the tools needed to build their system, as well as the system itself.
more concision and a more mathematical notation than Java, and showed how this would enable him to more easily participate in the programming process.
2. Selection of Language and Tools

2.1 Desired Language Features
The world of functional programming offers a wide, even bewildering variety of features and technologies, spread across many different languages. In some cases a language has strong support for an imperative or object-oriented coding style, with merely the possibility of doing functional programming (such as JavaScript); in other cases one must abandon wholesale one's non-functional style and adopt a completely different one. As we already had extensive experience with Java and Ruby and had become unhappy with the level of expressive power that either offered, we chose to pursue what we felt were more "advanced" language features, at the cost of an increased learning curve and greater risk. Functional language features of particular interest to us were a sophisticated type system (Hindley-Milner is the obvious example here), and advanced, modular structures for control flow. More generally, we were looking for minimal syntax, concise yet expressive code, powerful abstractions, and facilities that would allow us easily to use parallelism. The "concise yet expressive code" requirement was particularly important to us, both because we've found that reading and understanding existing code is a substantial part of the programming process, and because we wanted to be able to develop code that, at least for the parts in his domain, our client could read and perhaps modify.

2.2
2.3
Languages Examined
With the above criteria in mind, we looked at some of the more popular free functional language implementations available. Our ability to compare these was limited: we simply didn’t know enough about functional languages in general, and had no extensive experience with any functional language implementation. Thus, we were forced to rely on what we could learn from textbooks and information on the web. (Blog entries documenting interesting features of the various languages were particularly influential.) Especially, we had no way of judging some of the more esoteric and complex language features as we simply didn’t understand them. This no doubt skewed our evaluation process, but we saw no reasonable way of dealing with this other than spending several months building a non-trivial test application in each of several different languages. This point should be kept in mind by language promoters: the information you need to communicate to convince non-functional-programmers to try a language is of a substantially different sort than what sways those who are already familiar with at least one functional programming language. We looked most seriously at the Glasgow Haskell Compiler(2), Objective Caml(6), and Scala(8). (We considered looking at LISP and Scheme implementations, but these seemed to be both lacking in certain language features and we had (what are in retrospect perhaps undeserved) concerns about the available free implementations.) All three of these language implementations we examined share some features in common:
Considerations of Commercial Software Developers
As commercial software developers, there were also several other factors influencing our selection of tools. A primary criterion is reliability: bugs in the toolchain or runtime system can derail a project irrecoverably, and may not appear until well into development. Good support can help to mitigate these problems, but that still provides no guarantee that any particular issue can be solved in a timely manner. Generally, widespread use on a particular platform provides the best hope that most serious problems will have already been encountered and debugged. For developers with sufficient technical knowledge, having the source for the system can also provide the ability to debug problems in more detail than the systems supporters are perhaps willing to do, increasing the chances of finding a solution, even if the solution is implemented by someone else. The compiler or interpreter is only one part of a developer’s environment; just as important are the methods of code storage (flat files or otherwise), source code control systems, editors, build tools (such as make), testing frameworks, profilers, and other tools, many of these often home-grown. These represent a substantial investment in both learning and development time, and are usually important to the success of a project. Thus, being able to use a new language implementation with existing and familiar tools is a huge benefit. If the implementation is very nearly a drop-in replacement for a language implementation already in use (as with GHC or Objective Caml for GCC in typical Unix development projects, or F# in .NET environments), much of the previously-built development environment can be reused. If, on the other hand, everything needs to be replaced (as with some Smalltalk environments, or Lisp machines), this imposes a tremendously increased burden. For many commercial developers, integration with other code running on the platform (such as C libraries) and complete access
• expressive static type systems;
• concise code, and syntax suited to building embedded domain-specific languages without macros;
• compilers known to be fairly reliable and produce reasonably fast code;
• tools that fit with our current Unix development systems;
• the ability to interface with code compiled from other languages; and
• an open source implementation under active development.
2.3.1 The Glasgow Haskell Compiler
with more due to appear, blog entries describing interesting uses of Haskell and various techniques were plentiful, and the #haskell channel on IRC was particularly responsive to questions.
Haskell was the language that most appealed to us; it seemed to have the most advanced set of features of any language out there. Especially, being able easily to guarantee the purity of parts of the code promised to make testing easier, which is an important point for us as we rely heavily on automated testing. The other significantly different language feature of Haskell, lazy evaluation, had mild appeal for potential performance benefits, but we really had no idea what the true ramifications of lazy evaluation were. The extensive use of monads was interesting to us, but, as with lazy evaluation, we didn’t know enough about it to have any sort of informed opinion. We simply believed what we’d read that monads were a good thing. There were things about Haskell we found easier to understand. Type classes we found relatively simple, and they seemed to offer more flexibility than the standard object-oriented inheritance-based approach. Two things we could understand very well about Haskell were that it was relatively popular, with a strong community, and that the Foreign Function Interface offered us an escape into other languages should we encounter any overwhelming problems, performance or otherwise. The books available for learning Haskell, though at times mystifying, seemed to provide the promise of wonderful things. The Haskell School of Expression(4), in particular, we found fascinating. There are several implementations of Haskell available; we chose the Glasgow Haskell Compiler(2) as that is widely recognized as the most mature Haskell compiler available. 2.3.2
• The compiler, while perhaps at the time not as good as the Objective Caml compiler, appeared to be under more active development.
3. Successes and Advantages

3.1 Concise and Readable Code
After a week or two of adjustment, we found Haskell syntax to be remarkably clean and expressive. An initial test was to work with the client to code some of the mathematical algorithms used by the trading system; many were straightforward translations of the mathematical equations into similar Haskell syntax. This obvious advantage of Haskell proved a particularly good selling point to the client early on in the process, reinforcing the idea that the client, though not a programmer, would be able easily to understand some of the more important domain-related aspects of his application. Another early area of work was the parser for the market data feed. Initially we used Parsec for this, and were quite impressed by how expressive Haskell and a combinator-based approach can be. However, we soon decided to experiment with writing our own parser, for two reasons. First, Parsec did not allow us to keep certain items of state that we felt we needed for error-checking and recovery purposes. Second, the version of Parsec that we were using at the time used String rather than ByteString, and some brief profiling experiments showed that the performance advantages of ByteString were considerable.3 As it turned out, after some study (the “Functional Parsers” chapter of Hutton’s Programming in Haskell(5) was particularly helpful here), we found that parsers in Haskell are particularly easy to write. Learning about control structures beyond simple recursion, particularly monadic ones, took considerably more time, but also proved fertile ground for finding ways to improve our code’s clarity and concision. After a year or so we are extremely happy with our improved ability (over object-oriented languages, and well beyond just parsers) to build combinators, manage state, and deal with complex control flow through monadic code. To summarize, though the learning effort for the various techniques available to us ranged from low (for mathematical formulae) to moderately high (monadic control structures), there seemed to be almost no area in which Haskell code was not more clear and considerably more concise than doing the equivalent in Ruby or, particularly, Java.
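To give a flavour of what writing such a parser by hand amounts to, here is a minimal monadic parser in the same spirit (an illustrative sketch only: it works over String rather than ByteString, keeps no extra error-recovery state, and is of course not the trading system's actual code):

newtype Parser a = Parser { runParser :: String -> Maybe (a, String) }

instance Functor Parser where
  fmap f (Parser p) = Parser (\s -> fmap (\(a, rest) -> (f a, rest)) (p s))

instance Applicative Parser where
  pure a = Parser (\s -> Just (a, s))
  Parser pf <*> Parser pa = Parser (\s ->
    do (f, s')  <- pf s
       (a, s'') <- pa s'
       Just (f a, s''))

instance Monad Parser where
  Parser p >>= k = Parser (\s ->
    do (a, s') <- p s
       runParser (k a) s')

satisfy :: (Char -> Bool) -> Parser Char
satisfy ok = Parser (\s -> case s of
                             c : rest | ok c -> Just (c, rest)
                             _               -> Nothing)

digit :: Parser Char
digit = satisfy (`elem` ['0' .. '9'])

-- runParser (digit >>= \a -> digit >>= \b -> pure [a, b]) "42x" == Just ("42", "x")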
Objective Caml
To us, Objective Caml(6) appeared to be a more conservative alternative to GHC, offering many of GHC’s and Haskell’s features, but without being so radical as to introduce enforced purity or lazy evaluation. Advantages particular to the implementation were that it was well known, mature, reputedly produced very fast code, and we knew it to be used for large amounts of production code by other companies in the financial community. OCaml did have several things that put us off. The language syntax didn’t seem nearly as clean as Haskell, which was partially inherent (such as double-semicolons to end declarations) and partly due to the inability to overload functions (due to the lack of type classes). As well, the books we had available were just not as good. At that time we’d just finished reading a fairly recent book on OCaml that, unlike the Haskell books we’d read, did not show us any impressive new programming techniques but instead appeared to treat the language in a very imperative fashion.
3.2
The Type System and Unit Tests
We considered Scala(8) mainly because we thought we might have to use Java libraries. This turned out not to be the case, and as we otherwise preferred to run native code, we didn’t investigate the language very deeply.
It's not unusual, when writing in languages with little compile-time type-checking such as Ruby, for about a third of our code to be unit tests. We had anticipated that we might write fewer unit tests in Haskell, but the extent to which the type system reduced the need for unit tests surprised us. As a comparison, in a reasonably complex Ruby application of similar line count (though rather less
2.4
3 ByteString
2.3.3
Scala
Selection of Haskell
is, I think, from an academic point of view a rather trivial optimization of some fairly ordinary routines. But from our point of view it was a huge win, and we are eternally grateful to Don Stewart and Duncan Coutts for writing this library. This is worth considering when designing a new system: you need not immediately supply efficient things for basic commercial needs, but if you can give others the ability (through mechanisms such as the OverloadedStrings language extension) to easily bring those in later, you open up more possibilities for success in the commercial arena. Haskell has not been bad in this respect, but there are many times I’ve wished for improvements such as making String a typeclass.
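Purely to illustrate the mechanism mentioned in the footnote (this is not code from our system): with the OverloadedStrings extension, string literals are resolved through the IsString class, so code written against literals can later be moved to ByteString, or another representation, without touching the call sites:

{-# LANGUAGE OverloadedStrings #-}

import Data.ByteString.Char8 (ByteString)
import qualified Data.ByteString.Char8 as B

greeting :: ByteString
greeting = "hello, market data"   -- a literal, but a strict ByteString value

main :: IO ()
main = B.putStrLn greeting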
After examining the three options above, we chose the Glasgow Haskell Compiler, version 6.8, as our development platform. We felt that GHC offered the following advantages:
• Haskell appeared to be a more powerful and more interesting language than Objective Caml.
• The Haskell community offered good support and was (and still is) growing. There were several good books available
functionality), we have about 5200 lines of production code and just over 1500 lines of unit tests. In contrast, in the version of the trading system as of early 2009, we had over 5600 lines of production code and less than 500 lines of unit tests. The functional tests show similar differences. Further, in most areas where we used the type system to show program correctness, we came out with much more confidence that the program was indeed correct than if we had used testing. Unlike tests, a good type system and compiler will often force the developer to deal with cases that might have been forgotten in testing, both during the initial development and especially when later modifying the code. On one last note, the difference in the amount of test code we used was far larger between Ruby and Haskell than between Ruby and Java, though Java, as with Haskell, also offers static type checking. We attribute this to a large difference in expressive power between the type systems of Haskell and Java.
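As a hypothetical illustration (not taken from the actual system) of the kind of property we let the compiler enforce rather than test: wrapping domain quantities in distinct newtypes makes it a type error, rather than a runtime bug to be caught by a test, to confuse them.

newtype Price    = Price    Rational deriving (Eq, Ord, Show)
newtype Quantity = Quantity Integer  deriving (Eq, Ord, Show)

notional :: Price -> Quantity -> Rational
notional (Price p) (Quantity q) = p * fromInteger q

main :: IO ()
main = print (notional (Price 101.5) (Quantity 200))
-- notional (Quantity 200) (Price 101.5) would be rejected at compile time.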
allowed us to code this entirely in Haskell without having to write a single line of C. Further, we were able to make extensive use of the C header files supplied for the library, resorting only minimally to error-prone supplying of type information by hand. Being able to use the Haskell type checker to write C-like code was particularly enjoyable. Type-checking memory allocations and the like is a well known and impressive technique in the Hindley-Milner type system community, but it wasn't until we'd used it ourselves in anger that we realized that programming this close to the machine could be quite a relaxing thing. This was a significant change from any other language we've used, and from what we believe we would have needed to do in, say, Objective Caml. Every other foreign interface we've seen would have required us to write at least some code in C and would have provided us with significantly less type safety.
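The DDE binding itself is Windows-specific and far too large to reproduce here, but the FFI mechanism we relied on is the standard one; a minimal, portable sketch (importing a C function everyone has, rather than anything from the Microsoft library) looks like this:

{-# LANGUAGE ForeignFunctionInterface #-}

import Foreign.C.Types (CDouble)

foreign import ccall unsafe "math.h sin"
  c_sin :: CDouble -> CDouble

main :: IO ()
main = print (c_sin 0)   -- 0.0, computed by the C library
-- Marshalling structs, allocation, and callbacks use Foreign.Ptr,
-- Foreign.Marshal.Alloc, and "foreign import ccall wrapper" in the same style.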
3.3
No development environment or set of tools is without its problems, and GHC turned out to be no exception. However, while the nature of the problems we ran into was different from other systems, the general number and level of difficulty of the problems was not significantly different from any other system, especially taking into account how different the language is. The largest problems for us were issues with learning the language itself, dealing with lazy evaluation, and the performance issue of space leaks.
4.
Speed
With the exception of space leak problems (see next section), speed was never an issue for us. The application is sensitive to how fast it can do the calculations to build a pricing model for the current market, but the code generated by GHC 6.8.3 turned out to be significantly faster than the client’s previous Java application. We’ve to this point seen no significant difference between the current system’s performance and what we believe we could achieve in C or C++. We use multi-core systems, and make extensive use of multithreaded code for both concurrency and parallelism. This has worked well for us. However, when used for parallelism, with lazy evaluation one has to be careful about in which thread computations are really occurring: i.e., that one is sending a result, and not an unevaluated thunk, to another thread. This has brought up problems and used solutions similar in style to the space leak issues. We have yet to make extensive use of explicitly parallel computation based on the par function. However, it seems clear that if and when we move in to this area, implementation will be considerably simpler than when using a thread-based model. Haskell’s ”purity by default” model helps greatly here. 3.4
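A small sketch of the point about thunks and threads (illustrative only, not our production code): force the result in the producing thread before writing it to the shared variable, so the computation really runs where it was intended to:

import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Exception (evaluate)

main :: IO ()
main = do
  box <- newEmptyMVar
  _ <- forkIO $ do
    r <- evaluate (sum [1 .. 1000000 :: Int])   -- evaluated here, in the worker thread
    putMVar box r
  result <- takeMVar box                        -- the consumer receives a value, not a thunk
  print result
-- For structured results, forcing more deeply (for example with the deepseq
-- library) plays the same role.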
4.1
Language Issues
Haskell's syntax is quite simple and consistent, and this is a great advantage. However, syntax is a relatively small part of learning a language, even in languages where it is significantly more complex. To use a language well, and take good advantage of it, one must also learn its structures, idioms and style. Haskell is well known as a language that's very different from others in this regard, and even after many months of programming in it, we still don't feel we've progressed particularly far in this direction, aside from the use of monads for handling state and control flow. (For example, we use applicative style in very limited ways, and as yet make no use of Control.Applicative.) We found that to use many common Haskell structures, such as functors, monoids and monads, we needed to think in significantly different ways than we did in other languages, even other functional languages such as Scheme. Further, the level of abstract thought required to understand these structures (in great part due to their generality) is noticeably higher than in other languages, and in our experience not in the background of many typical commercial programmers. This is not an insurmountable problem, given appropriate learning materials: we built our first monadic parser from scratch after only a few days of study. But learning these things, especially on one's own, can be pretty rough going. This is mitigated to some degree by being able to fall back easily to more commonly known structures from other languages. We spent a couple of days as an extended interview working with a programmer unfamiliar with Haskell but with extensive experience in LISP, and while the code we produced made little use of typical Haskell idioms, it was certainly clear and workable.
Portability
One pleasant surprise was that, once we'd made a few modifications to our build system, building and running our application under Microsoft Windows was no trouble at all. While not an initial requirement, this was a nice facility to have, as it saved our client from setting up a separate Unix machine to run the simulator for himself. Eventually, we ended up being able to write in Haskell a small amount of Windows code we'd initially planned to write in C++, which saved us considerable time and effort (see below).
Problems and Disadvantages
The Foreign Function Interface
One part of our application involved interfacing with the Microsoft Windows DDE facility in order to transfer hundreds of data values to Excel to be displayed and updated several times per second. We had originally planned to write this in C++, but in view of the portability we'd discovered, we decided to see how much of this we could do in Haskell. This involved a non-trivial interface with a Microsoft C library. In addition to simple calls into C code, we had to deal with low-level data structures involving machine words that combined bit fields and integers of non-standard bit sizes, special memory allocation and deallocation schemes specific to the library, and callbacks from the library back into our code. This might well have been the biggest and most pleasant surprise we encountered: GHC's excellent Foreign Function Interface
4.2
Refactoring
One of our reasons for doing extensive automated testing is to support refactoring, a technique which we use extensively. As compared to Ruby, refactoring “in the small,” that is, within a single module or a small change crossing a few modules, was more difficult. The main issue was that where, with a language such as Ruby, one can change a small handful of functions and leave the
rest "broken" while testing the change, in Haskell one must fix every function in the module before even being able to compile. There are certain techniques to help work around this, such as copying the module and removing the code not currently of interest (which is tedious), or minimizing the use of type declarations (which doesn't always help, and brings its own problems), but we didn't find these to be very satisfactory. We feel that a good refactoring tool could significantly ease this problem, but there seems to be nothing like this in common use in the Haskell community at the moment. That said, the major refactorings we do on a regular basis ("major" ones being restructuring that crosses many modules and moves significant amounts of code between them) were less affected by this problem, and did not seem to take significantly longer than implementing the same sort of change in any other language. As well, being able to do more work with the type checker and less with testing increased our confidence that our refactorings had not broken parts of the system.
over time. This would allow us to monitor the distribution of work to see where more strictness needs to be applied in order to make the best use of our multi-core machines. This needs to be available for non-profiled builds so that we can monitor how the application runs in the real environment, as well as when it’s specially built for profiling. We feel that this would be a great step forward in fulfilling pure functional programming’s promise of much easier use of multi-core CPUs than conventional languages. A last note about a particular problem for us with GHC: profiling builds must run using the non-parallel runtime. For an application that requires more than a full core even for builds not slowed down by profiling, this can at times be entirely useless: the problems that show up in the profile output are due entirely to the lack of CPU power available to run the application within its normal time constraints. We are contemplating modifying our stock exchange simulator and trading system to run at a small fraction of “real-time” speed in order to see if this might provide more realistic profiling results.
4.3
4.4
Profiling Tools, and Deficiencies Thereof
Lazy Evaluation and Space Leaks
Lazy evaluation turned out to have good and bad points in the end. However, it caused, and continues to cause, us pain in two particular areas.
The first is that, when distributing computation amongst multiple co-operating threads, it's important to ensure that data structures passed between threads are evaluated to head normal form when necessary before being used by the receiving thread. Initially we found figuring out how to do this to be difficult; even determining when this is necessary can be a challenge in itself. (One clue is a large amount of heap use connected with communications queues between threads, as the sending thread overruns the receiver's ability to evaluate the thunks it's receiving.) A profiler that did not slow down the program and that could help with this would be highly welcome.
The second is the dreaded "space leak," when unevaluated computations retain earlier versions of data in the heap and cause it to grow to enormous proportions, slowing and eventually halting the program as it consumes all memory and swap space. This proved to be a particular problem in our application, which is essentially several loops (processing messages and doing calculations) executed millions of times over the course of a many-hour run. The profiler that comes with GHC is of more help here, but we have yet to work out a truly satisfactory way of dealing with this.
A particularly common cause of space leaks for us was when using data structures (such as lists and Data.Map) that have many small changes over time. As one inserts and removes data, old, no-longer-referenced versions of the data structure are kept around to enable the evaluation of thunks that will create the new version of the data structure. As mentioned previously, naïve approaches to the problem can backfire, and better approaches were not always obvious. For Data.Map and similar structures, we resorted to doing an immediate lookup and forced evaluation of data just inserted; this seemed to fix the problem while avoiding the massive performance penalty of doing an rnf on the root of the data structure.
Another issue relates to interactive programs. Certain lazy idioms (such as using the getContents function) can effectively deadlock such programs, rendering them unresponsive. This is not immediately obvious to programmers new to lazy evaluation, and caused us early on to spend some time figuring out what was going on. Knowing what we know now, we are still hesitant about the value of lazy evaluation; every time we feel we've finally got a good handle on it, another problem seems to crop up. We feel that the subject needs, at the very least, a good book that would both cover all of the issues well and help the reader develop a good intuition for the behaviour of lazily-evaluated systems.
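The Data.Map workaround mentioned above can be expressed roughly as follows; this is a sketch based on the description, not the application's code:

import qualified Data.Map as Map

-- Insert, then immediately look the value back up and force it, so the
-- newly built entry is evaluated without an expensive rnf over the whole map.
insertForced :: Ord k => k -> v -> Map.Map k v -> Map.Map k v
insertForced k v m =
  let m' = Map.insert k v m
  in case Map.lookup k m' of
       Just v' -> v' `seq` m'
       Nothing -> m'   -- cannot happen, but keeps the function total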
Due especially to lazy evaluation (of which more is mentioned later), understanding the runtime behaviour of one's application is rather important when working in Haskell. GHC's profiling tools are good (certainly as good as anything in Java), but unfortunately not good enough that we don't have some significant complaints. First, they are rather subtle: it takes time and some careful examination of the documentation and the output to learn to use them well (though this is very often true of any profiling tool). We have found that, as of mid-2009, there are no good tutorials available for this; frustrating and time-consuming experimentation has been the rule. Better descriptions of the output of the summarization tools (such as hp2ps, which generates a graph of heap usage from raw profile data) may not be the only solution to this: providing more documentation of the raw profiling data itself might not only enable developers to build their own profile analysis tools, but provide more insight as to just what's going on inside the runtime system.
One of the major problems, directly connected to lazy evaluation, is that where the work happens is quite a slippery concept. One module may define and apply all of the functions for the work to be done, but in actuality leave only a few thunks in the heap: the real work of the computation occurs in some other place where the data are used. If several other modules use the results of, say, a parser, which module or function actually ends up doing the computational work of the parsing (evaluating the thunks) can change from run to run of the program. This problem is only exacerbated by doing work in multiple threads. When one hopes to spread the work more or less evenly amongst all the threads in the application, inter-thread communication mechanisms that transfer thunks as happily as they do data are more of a liability than an asset. It's quite possible to have a final consumer of "work" actually doing all of the work that one wanted to have done in several other threads running on different cores.
Naïve approaches to fixing the problem can easily backfire. At one point we made some of our data structures members of the NFData class from Control.Parallel.Strategies and used the rnf function at appropriate points to ensure that evaluation had been fully forced. This turned out to do more harm than good, causing the program's CPU usage to shoot through the roof. While we didn't investigate this in detail, we believe that the time spent repeatedly traversing large data structures, even when already mostly evaluated, was the culprit. (This would certainly play havoc with the CPU's cache.) We would particularly like to have a way at runtime to monitor the CPU usage of each individual Haskell thread within our application no matter which operating system threads it might run on
5.
Conclusions
Our experience with GHC has shown that, while it has its quirks and issues, these are no worse than those of other commonly-used programming language implementations when used for commercial software development. However, some of the issues are different enough that one should be prepared to spend a little more time learning how to deal with them than when moving between more similar systems. We found significant advantages to using Haskell, and it's clear to us that there is much more expressive power available of which we've yet to make use. Learning to take advantage of Haskell takes a fair amount of work, but benefits are seen fairly quickly and continuously as one improves. We were lucky with our choice of Haskell and GHC and, in light of our experience, we would make the same choice again. However, we note that, given our lack of experience with other functional languages and platforms, we cannot really say whether or not it is significantly better than, say, OCaml or Scala. It is our opinion that, overall, our switch to a functional language in a commercial environment has been successful, and we are convinced we will continue to see further benefit over the long term.

References
[1] Friedman, Daniel P., and Felleisen, Matthias. The Little Schemer, Fourth Edition. MIT Press, 1996. ISBN-13: 978-0-262-56099-3.
[2] The Glasgow Haskell Compiler. http://www.haskell.org/ghc/
[3] Graham, Paul. "Beating the Averages", from Hackers and Painters. O'Reilly, 2004. ISBN-13: 978-0-596-00662-4. Also available from http://paulgraham.com/avg.html
[4] Hudak, Paul. The Haskell School of Expression: Learning Functional Programming Through Multimedia. Cambridge University Press, 2000. ISBN-13: 978-0-521-64408-2.
[5] Hutton, Graham. Programming in Haskell. Cambridge University Press, 2007. ISBN-13: 978-0-521-69269-4.
[6] Objective Caml. http://caml.inria.fr/ocaml/index.en.html
[7] Ruby Programming Language. http://www.ruby-lang.org/en/
[8] The Scala Programming Language. http://www.scala-lang.org/
[9] Coutts, Duncan, Stewart, Don, and Leshchinskiy, Roman. "Rewriting Haskell Strings." http://www.cse.unsw.edu.au/~dons/papers/CSL06.html
Beautiful Differentiation Conal M. Elliott LambdaPix
[email protected]
Abstract
Automatic differentiation (AD) is a precise, efficient, and convenient method for computing derivatives of functions. Its forward-mode implementation can be quite simple even when extended to compute all of the higher-order derivatives as well. The higher-dimensional case has also been tackled, though with extra complexity. This paper develops an implementation of higher-dimensional, higher-order, forward-mode AD in the extremely general and elegant setting of calculus on manifolds and derives that implementation from a simple and precise specification. In order to motivate and discover the implementation, the paper poses the question "What does AD mean, independently of implementation?" An answer arises in the form of naturality of sampling a function and its derivative. Automatic differentiation flows out of this naturality condition, together with the chain rule. Graduating from first-order to higher-order AD corresponds to sampling all derivatives instead of just one. Next, the setting is expanded to arbitrary vector spaces, in which derivative values are linear maps. The specification of AD adapts to this elegant and very general setting, which even simplifies the development.

d (u + v)     ≡ d u + d v
d (u · v)     ≡ d v · u + d u · v
d (−u)        ≡ −d u
d (e^u)       ≡ d u · e^u
d (log u)     ≡ d u / u
d (√u)        ≡ d u / (2 · √u)
d (sin u)     ≡ d u · cos u
d (cos u)     ≡ d u · (−sin u)
d (sin⁻¹ u)   ≡ d u / √(1 − u²)
d (cos⁻¹ u)   ≡ −d u / √(1 − u²)
d (tan⁻¹ u)   ≡ d u / (u² + 1)
d (sinh u)    ≡ d u · cosh u
d (cosh u)    ≡ d u · sinh u
d (sinh⁻¹ u)  ≡ d u / √(u² + 1)
d (cosh⁻¹ u)  ≡ −d u / √(u² − 1)
d (tanh⁻¹ u)  ≡ d u / (1 − u²)

Figure 1. Some rules for symbolic differentiation
Categories and Subject Descriptors G.1.4 [Mathematics of Computing]: Numerical Analysis—Quadrature and Numerical Differentiation
General Terms    Algorithms, Design, Theory

Keywords    Automatic differentiation, program derivation

1. Introduction

Derivatives are useful in a variety of application areas, including root-finding, optimization, curve and surface tessellation, and computation of surface normals for 3D rendering. Considering the usefulness of derivatives, it is worthwhile to find software methods that are
• simple to implement,
• simple to prove correct,
• convenient,
• accurate,
• efficient, and
• general.
One differentiation method is numeric approximation, using simple finite differences. This method is based on the definition of (scalar) derivative:

    d f x ≡ lim(h → 0) (f (x + h) − f x) / h        (1)

(The left-hand side reads "the derivative of f at x".) The approximation method uses

    d f x ≈ (f (x + h) − f x) / h

for a small value of h. While very simple, this method is often inaccurate, due to choosing either too large or too small a value for h. (Small values of h lead to rounding errors.) More sophisticated variations improve accuracy while sacrificing simplicity.
A second method is symbolic differentiation. Instead of using the limit-based definition directly, the symbolic method uses a collection of rules, such as those in Figure 1. There are two main drawbacks to the symbolic approach to differentiation. One is simply the inconvenience of symbolic methods, requiring access to and transformation of the source code of the computation, and placing restrictions on that source code. A second drawback is that implementations tend to be quite expensive and in particular perform redundant computation.
A third method is the topic of this paper (and many others), namely automatic differentiation (also called "algorithmic differentiation"), or "AD". There are forward and reverse variations ("modes") of AD, as well as mixtures of the two. This paper considers only the forward mode. The idea of AD is to simultaneously manipulate values and derivatives. Overloading of the standard numerical operations makes this combined manipulation as
data D a = D a a deriving (Eq, Show)

constD :: Num a ⇒ a → D a
constD x = D x 0

idD :: Num a ⇒ a → D a
idD x = D x 1

instance Num a ⇒ Num (D a) where
  fromInteger x   = constD (fromInteger x)
  D x x′ + D y y′ = D (x + y) (x′ + y′)
  D x x′ ∗ D y y′ = D (x ∗ y) (y′ ∗ x + x′ ∗ y)
  negate (D x x′) = D (negate x) (negate x′)
  signum (D x _)  = D (signum x) 0
  abs (D x x′)    = D (abs x) (x′ ∗ signum x)

instance Fractional x ⇒ Fractional (D x) where
  fromRational x = constD (fromRational x)
  recip (D x x′) = D (recip x) (−x′ / sqr x)

sqr :: Num a ⇒ a → a
sqr x = x ∗ x

instance Floating x ⇒ Floating (D x) where
  π             = constD π
  exp  (D x x′) = D (exp x) (x′ ∗ exp x)
  log  (D x x′) = D (log x) (x′ / x)
  sqrt (D x x′) = D (sqrt x) (x′ / (2 ∗ sqrt x))
  sin  (D x x′) = D (sin x) (x′ ∗ cos x)
  cos  (D x x′) = D (cos x) (x′ ∗ (−sin x))
  asin (D x x′) = D (asin x) (x′ / sqrt (1 − sqr x))
  acos (D x x′) = D (acos x) (x′ / (−sqrt (1 − sqr x)))
  ...

Figure 2. First-order, scalar, functional automatic differentiation

• It is simple to verify informally, because of its similarity to the differentiation laws.
• It is convenient to use, as shown with f1 above.
• It is accurate, as shown above, producing exactly the same result as the symbolically differentiated code, f2.
• It is efficient, involving no iteration or redundant computation.

The formulation in Figure 2 does less well with generality:
• It computes only first derivatives.
• It applies (correctly) only to functions over a scalar (one-dimensional) domain.
Moreover, proving correctness is hampered by lack of a precise specification. Later sections will address these shortcomings.

This paper's technical contributions include the following.
• A prettier formulation of first-order and higher-order forward-mode AD using function-based overloading (Sections 2, 3 and 4).
• A simple formal specification for AD (Section 5).
• A systematic derivation of first-order and higher-order forward-mode AD from the specification (Sections 5.1 and 6).
• Reformulation of AD to general vector spaces including (but not limited to) Rm → Rn, from the perspective of calculus on manifolds (CoM) (Spivak 1971), and adaptation of the AD derivation to this new setting (Section 10).
• General and efficient formulations of linear maps and bases of vector spaces (using associated types and memo tries), since the notion of linear map is at the heart of CoM (Appendix A).

2.
Friendly and precise
To start, let's make some cosmetic improvements, which will be carried forward to the more general formulations as well. Figure 1 has an informality that is typical of working math notation, but we can state these properties more precisely. For now, give differentiation the following higher-order type:
convenient and elegant as manipulating values without derivatives. Moreover, the implementation of AD can be quite simple as well. For instance, Figure 2 gives a simple, functional (forward-mode) AD implementation, packaged as a data type D and a collection of numeric type class instances. Every operation acts on a regular value and a derivative value in tandem. (The derivatives for abs and signum need more care at 0.) As an example, define
d :: (a → a) → (a → a)
-- first attempt
Then Figure 1 can be made more precise. For instance, the sum rule is short-hand for d (λx → u x + v x ) ≡ λx → d u x + d v x
f1 :: Floating a ⇒ a → a f1 z = sqrt (3 ∗ sin z )
and the log rule means d (λx → log (u x )) ≡ λx → d u x / u x
and try it out in GHCi:
These more precise formulations are tedious to write and read. Fortunately, there is an alternative to replacing Figure 1 with more precise but less human-friendly forms. We can instead make the human-friendly form become machine-friendly. The trick is to add numeric overloadings for functions, so that numeric operations apply point-wise. For instance,
*Main> f1 (D 2 1) D 1.6516332160855343 (-0.3779412091869595) To test correctness, here is a symbolically differentiated version: f2 :: Floating a ⇒ a → D a f2 x = D (f1 x ) (3 ∗ cos x / (2 ∗ sqrt (3 ∗ sin x )))
u + v ≡ λx → u x + v x log u ≡ λx → log (u x )
Try it out in GHCi: *Main> f2 2 D 1.6516332160855343 (-0.3779412091869595)
Then the “informal” laws in Figure 1 turn out to be well-defined and exactly equivalent to the “more precise” long-hand versions above. The Functor and Applicative (McBride and Paterson 2008) instances of functions (shown in Figure 3) come in quite handy. Figure 4 shows the instances needed to make Figure 1 welldefined and correct exactly as stated, by exploiting the Functor and Applicative instances in Figure 3. In fact, these instances work
This AD implementation satisfies most of our criteria very well: • It is simple to implement. The code matches the familiar laws
given in Figure 1. There are, however, some stylistic improvements to be made in Section 4.
≡ { d sin ≡ cos } (cos ◦ u) ∗ d u ≡ { cos on functions } cos u ∗ d u
instance Functor ((→) t) where
  fmap f g = f ◦ g

instance Applicative ((→) t) where
  pure  = const
  f ~ g = λt → (f t) (g t)
The first two rules cannot be explained in terms of the scalar chain rule, but can be explained via the generalized chain rule in Section 10. We can implement the scalar chain rule simply, via a new infix operator, (./), whose arguments are a function and its derivative.
Consequently, liftA2 h u v ≡ λx → h (u x ) (v x ) liftA3 h u v w ≡ λx → h (u x ) (v x ) (w x ) ...
infix 0 ./
(./) :: Num a ⇒ (a → a) → (a → a) → (D a → D a)
(f ./ f′) (D a a′) = D (f a) (a′ ∗ f′ a)
Figure 3. Standard Functor and Applicative instances for functions

instance Num b ⇒ Num (a → b) where
  fromInteger = pure ◦ fromInteger
  negate      = fmap negate
  (+)         = liftA2 (+)
  (∗)         = liftA2 (∗)
  abs         = fmap abs
  signum      = fmap signum

instance Fractional b ⇒ Fractional (a → b) where
  fromRational = pure ◦ fromRational
  recip        = fmap recip

instance Floating b ⇒ Floating (a → b) where
  π    = pure π
  sqrt = fmap sqrt
  exp  = fmap exp
  log  = fmap log
  sin  = fmap sin
  ...

Figure 4. Numeric overloadings for function types

This chain rule removes repetition from our instances. For instance,

instance Floating x ⇒ Floating (D x) where
  π    = D π 0
  exp  = exp ./ exp
  log  = log ./ recip
  sqrt = sqrt ./ λx → recip (2 ∗ sqrt x)
  sin  = sin ./ cos
  cos  = cos ./ λx → −sin x
  asin = asin ./ λx → recip (sqrt (1 − sqr x))
  acos = acos ./ λx → −recip (sqrt (1 − sqr x))
  ...

4.
5.
What is automatic differentiation, really?
The preceding sections describe what AD is, informally, and they present plausible implementations. Let’s now take a deeper look at AD, in terms of three questions:
for any applicative functor—a point that will become important in Section 10. We’ll soon see how to exploit this simple, precise notation to improve the style of the definitions from Figure 2.
3.
Prettier derivatives via function overloading
Section 2 gave numeric overloadings for functions in order to make the derivative laws in Figure 1 precise, while retaining their simplicity. We can use these overloadings to make the derivative implementation simpler as well. With the help of (./) and the overloadings in Figure 4, the code in Figure 2 can be simplified to that in Figure 5.
1. What does it mean, independently of implementation? 2. How do the implementation and its correctness flow gracefully from that meaning? 3. Where else might we go, guided by answers to the first two questions?
A scalar chain rule
Many of the laws in Figure 1 look similar: d (f u) = d u ∗ f′ u for some function f′. The f′ is not just some function; it is the derivative of f. The reason for this pattern is that these laws follow from the scalar chain rule for derivatives.
5.1
A model for automatic differentiation
How do we know whether this AD implementation is correct? We can’t even begin to address this question until we first answer a more fundamental one: what exactly does its correctness mean? In other words, what specification must our implementation obey? AD has something to do with calculating a function’s values and derivative values simultaneously, so let’s start there.
d (g ◦ f ) x ≡ d g (f x ) ∗ d f x Using the (∗) overloading in Figure 4, the chain rule can also be written as follows: d (g ◦ f ) ≡ (d g ◦ f ) ∗ d f
toD :: (a → a) → (a → D a) toD f = λx → D (f x ) (d f x )
All but the first two rules in Figure 1 then follow from the chain rule. For instance,
Or, in point-free form,
d (sin u) ≡ { sin on functions } d (sin ◦ u) ≡ { reformulated chain rule } (d sin ◦ u) ∗ d u
toD f = liftA2 D f (d f ) thanks to the Applicative instance in Figure 3. We have no implementation of d , so this definition of toD will serve as a specification, not an implementation.
instance Num a ⇒ Num (D a) where
  fromInteger = constD ◦ fromInteger
  D x0 x′ + D y0 y′ = D (x0 + y0) (x′ + y′)
  x@(D x0 x′) ∗ y@(D y0 y′) = D (x0 ∗ y0) (x′ ∗ y + x ∗ y′)
  negate = negate ./ −1
  abs    = abs ./ signum
  signum = signum ./ 0

instance Fractional a ⇒ Fractional (D a) where
  fromRational = constD ◦ fromRational
  recip = recip ./ −sqr recip
5.2.1
Constants and identity function
Value/derivative pairs for constant functions and the identity function are specified as such: constD :: Num a ⇒ a → D a idD :: Num a ⇒ a → D a constD x ≡ toD (const x ) ⊥ idD ≡ toD id To derive implementations, expand toD and simplify. constD x ≡ { specification } toD (const x ) ⊥ ≡ { definition of toD } D (const x ⊥) (d (const x ) ⊥) ≡ { definition of const and its derivative } Dx 0
instance Floating a ⇒ Floating (D a) where
  π     = constD π
  exp   = exp ./ exp
  log   = log ./ recip
  sqrt  = sqrt ./ recip (2 ∗ sqrt)
  sin   = sin ./ cos
  cos   = cos ./ −sin
  asin  = asin ./ recip (sqrt (1 − sqr))
  acos  = acos ./ recip (−sqrt (1 − sqr))
  atan  = atan ./ recip (1 + sqr)
  sinh  = sinh ./ cosh
  cosh  = cosh ./ sinh
  asinh = asinh ./ recip (sqrt (1 + sqr))
  acosh = acosh ./ recip (−sqrt (sqr − 1))
  atanh = atanh ./ recip (1 − sqr)
idD x ≡ { specification } toD (id x ) ≡ { definition of toD } D (id x ) (d id x ) ≡ { definition of id and its derivative } Dx 1 In (Karczmarczuk 2001) and elsewhere, idD is called dVar and is sometimes referred to as the “variable” of differentiation, a term more suited to symbolic differentiation than to AD. 5.2.2
Figure 5. Simplified derivatives using the scalar chain rule and function overloadings
Addition
Specify addition on D by requiring that toD preserves its structure: toD (u + v ) ≡ toD u + toD v Expand toD, and simplify both sides, starting on the left: toD (u + v ) ≡ { definition of toD } liftA2 D (u + v ) (d (u + v )) ≡ { d (u + v ) from Figure 1 } liftA2 D (u + v ) (d u + d v ) ≡ { liftA2 on functions from Figure 3 } λx → D ((u + v ) x ) ((d u + d v ) x ) ≡ { (+) on functions from Figure 4 } λx → D (u x + v x ) (d u x + d v x )
Since AD is structured as type class instances, one way to specify its semantics is by relating it to a parallel set of standard instances, by a principle of type class morphisms, as described in (Elliott 2009c,b), which is to say that the interpretation preserves the structure of every method application. For AD, the interpretation function is toD. The Num, Fractional , and Floating morphisms provide the specifications of the instances: toD toD toD toD toD
(u + v ) ≡ toD u + toD v (u ∗ v ) ≡ toD u ∗ toD v (negate u) ≡ negate (toD u) (sin u) ≡ sin (toD u) (cos u) ≡ cos (toD u) ...
Then start over with the right-hand side: toD u + toD v ≡ { (+) on functions from Figures 3 and 4 } λx → toD u x + toD v x ≡ { definition of toD } λx → D (u x ) (d u x ) + D (v x ) (d v x )
Note here that the numeric operations are applied to values of type a → a on the left, and to values of type a → D a on the right. These (morphism) properties exactly define correctness of any implementation of AD, answering the first question:
We need a (+) on D that makes these two final forms equal, i.e., λx → D (u x + v x ) (d u x + d v x ) ≡ λx → D (u x ) (d u x ) + D (v x ) (d v x )
What does it mean, independently of implementation? 5.2
An easy choice is
Deriving an AD implementation
D a a′ + D b b′ = D (a + b) (a′ + b′)
Equipped with a simple, formal specification of AD (numeric type class morphisms), we can try to prove that the implementation above satisfies the specification. Better yet, let’s do the reverse, using the morphism properties to derive (discover) the implementation, and prove it correct in the process. The derivations will then provide a starting point for more ambitious forms of AD.
This definition provides the missing link, completing the proof that toD (u + v ) ≡ toD u + toD v The multiplication case is quite similar (Elliott 2009a).
5.2.3
liftA2 D (u + v ) (toD (d (u + v ))) ≡ { d (u + v ) } liftA2 D (u + v ) (toD (d u + d v )) ≡ { induction for toD / (+) } liftA2 D (u + v ) (toD (d u) + toD (d v )) ≡ { definition of liftA2 and (+) on functions } λx → D (u x + v x ) (toD (d u) x + toD (d v ) x )
Sine
The specification: toD (sin u) ≡ sin (toD u) Simplify the left-hand side: toD (sin u) ≡ { definition of toD } liftA2 D (sin u) (d (sin u)) ≡ { d (sin u) } liftA2 D (sin u) (d u ∗ cos u) ≡ { liftA2 on functions } λx → D ((sin u) x ) ((d u ∗ cos u) x ) ≡ { sin, (∗) and cos on functions } λx → D (sin (u x )) (d u x ∗ cos (u x ))
and then the right: toD u + toD v ≡ { (+) on functions } λx → toD u x + toD v x ≡ { definition of toD } λx → D (u x ) (toD (d u x )) + D (v x ) (toD (d v x )) Again, we need a definition of (+) on D that makes the LHS and RHS final forms equal, i.e.,
and then the right:
λx → D (u x + v x ) (toD (d u) x + toD (d v ) x ) ≡ λx → D (u x ) (toD (d u) x ) + D (v x ) (toD (d v ) x )
sin (toD u) ≡ { sin on functions } λx → sin (toD u x ) ≡ { definition of toD } λx → sin (D (u x ) (d u x ))
Again, an easy choice is D a0 a 0 + D b0 b 0 = D (a0 + b0 ) (a 0 + b 0 )
So a sufficient definition is
The “induction” step above can be made more precise in terms of fixed-point introduction or the generic approximation lemma (Hutton and Gibbons 2001). Crucially, the morphism properties are assumed more deeply inside of the representation.
sin (D a a′) = D (sin a) (a′ ∗ cos a)
Or, using the chain rule operator,
sin = sin ./ cos
6.2
Simplifying on the left:
The whole implementation can be derived in exactly this style, answering the second question:
toD (u ∗ v ) ≡ { definition of toD } liftA2 D (u ∗ v ) (toD (d (u ∗ v ))) ≡ { d (u ∗ v ) } liftA2 D (u ∗ v ) (toD (d u ∗ v + d v ∗ u)) ≡ { induction for toD / (+) } liftA2 D (u ∗ v ) (toD (d u ∗ v ) + toD (d v ∗ u)) ≡ { induction for toD / (∗) } liftA2 D (u ∗ v ) (toD (d u) ∗ toD v + toD (d v ) ∗ toD u) ≡ { liftA2 , (∗), (+) on functions } λx → liftA2 D (u x ∗ v x ) (toD (d u) x ∗ toD v x + toD (d v ) x ∗ toD u x )
How do the implementation and its correctness flow gracefully from that meaning?
6.
Higher-order derivatives
Let’s now turn to the third question: Where else might we go, guided by answers to the first two questions? Our next destination will be higher-order derivatives, followed in Section 10 by derivatives over higher-dimensional domains. Jerzy Karczmarczuk (2001) extended the D representation above to an infinite “lazy tower of derivatives”.
and then on the right:
data D a = D a (D a)
toD u ∗ toD v ≡ { definition of toD } liftA2 D u (toD (d u)) ∗ liftA2 D v (toD (d v )) ≡ { liftA2 and (∗) on functions } λx → D (u x ) (toD (d u) x ) ∗ D (v x ) (toD (d v ) x )
The toD function easily adapts to this new D type: toD :: (a → a) → (a → D a) toD f x = D (f x ) (toD (d f ) x ) or
A sufficient definition is
toD f = liftA2 D f (toD (d f ))
a@(D a0 a′) ∗ b@(D b0 b′) = D (a0 ∗ b0) (a′ ∗ b + b′ ∗ a)
The definition of toD comes from simplicity and type-correctness. Similarly, let’s adapt the previous derivations and see what arises. 6.1
Multiplication
because toD u x ≡ D (u x ) (toD (d u) x ) toD v x ≡ D (v x ) (toD (d v ) x )
Addition
Specification:
Note the new element here. The entire D value (tower) is used in building the derivative.
toD (u + v ) ≡ toD u + toD v Simplify the left-hand side:
6.3
toD (u + v ) ≡ { definition of toD }
Sine
As usual sin shows a common pattern that applies to other unary functions. Simplifying on the left-hand side:
8.
toD (sin u) ≡ { definition of toD } liftA2 D (sin u) (toD (d (sin u))) ≡ { d (sin u) } liftA2 D (sin u) (toD (d u ∗ cos u)) ≡ { induction for toD / (∗) } liftA2 D (sin u) (toD (d u) ∗ toD (cos u)) ≡ { induction for toD / cos } liftA2 D (sin u) (toD (d u) ∗ cos (toD u)) ≡ { liftA2 , sin, cos and (∗) on functions } λx → D (sin (u x )) (toD (d u) x ∗ cos (toD u x ))
Section 6 showed how easily and beautifully one can construct an infinite tower of derivative values in Haskell programs, while computing plain old values. The trick (from (Karczmarczuk 2001)) was to overload numeric operators to operate on the following (co)recursive type: data D b = D b (D b)
and then the right: sin (toD u) ≡ { definition of toD } sin (liftA2 D u (toD (d u))) ≡ { liftA2 and sin on functions } λx → sin (D (u x ) (toD (d u) x )) To make the left and right final forms equal, define sin a@(D a0 a 0 ) ≡ D (sin a0 ) (a 0 ∗ cos a) 6.4
A higher-order, scalar chain rule
The derivation above for sin shows the form of a chain rule for scalar derivative towers. It is very similar to the formulation in Section 3. The only difference is that the second argument (the derivative) gets applied to the whole tower instead of a regular value, and so has type D a → D a instead of a → a.

infix 0 ./
(./) :: Num a ⇒ (a → a) → (D a → D a) → (D a → D a)
(f ./ f′) a@(D a0 a′) = D (f a0) (a′ ∗ f′ a)

With this new definition of (./), all of the chain-rule-based definitions in Figure 5 (first-order derivatives) carry over without change to compute infinite derivative towers. For instance,
This representation, however, works only when differentiating functions from a one-dimensional domain. The reason for this limitation is that only in those cases can the type of derivative values be identified with the type of regular values. Consider a function f :: R2 → R. The value of f at a domain value (x , y) has type R, but the derivative of f consists of two partial derivatives. Moreover, the second derivative consists of four partial second-order derivatives (or three, depending how you count). A function f :: R2 → R3 also has two partial derivatives at each point (x , y), each of which is a triple. That pair of triples is commonly written as a three-by-two matrix. Each of these situations has its own derivative shape and its own chain rule (for the derivative of function compositions), using plainold multiplication, scalar-times-vector, vector-dot-vector, matrixtimes-vector, or matrix-times-matrix. Second derivatives are more complex and varied. How many forms of derivatives and chain rules are enough? Are we doomed to work with a plethora of increasingly complex types of derivatives, as well as the diverse chain rules needed to accommodate all compatible pairs of derivatives? Fortunately, not. There is a single, simple, unifying generalization. By reconsidering what we mean by a derivative value, we can see that these various forms are all representations of a single notion, and all the chain rules mean the same thing on the meanings of the representations. Let’s now look at unifying view of derivatives, which is taken from calculus on manifolds (Spivak 1971). To get an intuitive sense of what’s going on with derivatives in general, we’ll look at some examples. 8.1
instance Floating a ⇒ Floating (D a) where exp = exp ./ exp log = log ./ recip sqrt = sqrt ./ recip (2 ∗ sqrt) sin = sin ./ cos cos = cos ./ −sin ...
One dimension
Start with a simple function on real numbers: f1 :: R → R f1 x = x 2 + 3 ∗ x + 1 Writing the derivative of a function f as d f , let’s now consider the question: what is d f1 ? We might say that d f1 x = 2 ∗ x + 3
Now the operators and literals on the right of (./) are overloaded for the type D a → D a. For instance, in the definition of sqrt,
so e.g., d f1 5 = 13. In other words, f1 is changing 13 times as fast as its argument, when its argument is passing 5. Rephrased yet again, if dx is a very tiny number, then f1 (5 + dx ) − f1 5 is very nearly 13 ∗ dx . If f1 maps seconds to meters, then d f1 5 is 13 meters per second. So already, we can see that the range of f (meters) and the range of d f (meters/second) disagree.
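With the first-order instances of Figure 2 in scope, this is easy to check in GHCi; the session below is illustrative:

*Main> let f1 x = x * x + 3 * x + 1 in f1 (D 5 1)
D 41 13

The regular value 41 is f1 5, and 13 is d f1 5.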
2 :: D a → D a recip :: (D a → D a) → (D a → D a) (∗) :: (D a → D a) → (D a → D a) → (D a → D a)
7.
What is a derivative, really?
8.2
Optimizing zeros
Two dimensions in and one dimension out
As a second example, consider a two-dimensional domain:
The derivative implementations above are simple and powerful, but have an efficiency problem. For polynomial functions (constant, linear, quadratic, etc), all but a few derivatives are zero. Considerable wasted effort can go into multiplying and adding zero derivatives, especially with higher-order derivatives. To optimize away the zeros, wrap Maybe around the derivative in D.
f2 :: R2 → R f2 (x , y) = 2 ∗ x ∗ y + 3 ∗ x + 5 ∗ y + 7 Again, let’s consider some units, to get a guess of what kind of thing d f2 (x , y) really is. Suppose that f2 measures altitude of terrain above a plane, as a function of the position in the plane. You can guess that d f (x , y) is going to have something to do with how fast the altitude is changing, i.e. the slope, at (x , y). But there is no single slope. Instead, there’s a slope for every possible compass direction (a hiker’s degrees of freedom).
data D a = D a (Maybe (D a))

The implementation can stay very simple and readable, as shown in (Elliott 2009a).
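For instance, addition might treat a missing derivative as zero along these lines; this is a sketch of the idea, not the code from (Elliott 2009a):

D x dx + D y dy = D (x + y) (addD dx dy)
  where
    addD Nothing    b          = b
    addD a          Nothing    = a
    addD (Just dx') (Just dy') = Just (dx' + dy')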
Now consider the conventional answer to what is d f2 (x , y). Since the domain of f2 is R2 , it has two partial derivatives:
Now what about the different chain rules, saying to combine derivative values via various kinds of products (scalar/scalar, scalar/vector, vector/vector dot, matrix/vector)? Each of these products implements the same abstract notion, which is composition of linear maps.
d f2 (x , y) = (2 ∗ y + 3, 2 ∗ x + 5)
In our example, these two pieces of information correspond to two of the possible slopes. The first is the slope if heading directly east, and the second if directly north (increasing x and increasing y, respectively). What good does it do our hiker to be told just two of the infinitude of possible slopes at a point? The answer is perhaps magical: for well-behaved terrains, these two pieces of information suffice to calculate all (infinitely many) slopes, with just a bit of math. Every direction can be described as partly east and partly north (negatively for westish and southish directions). Given a direction angle ϑ (where east is zero and north is 90 degrees), the east and north components are cos ϑ and sin ϑ, respectively. When heading in the direction ϑ, the slope will be a weighted sum of the north-going slope and the east-going slope, where the weights are these north and east components. Instead of angles, our hiker may prefer thinking directly about the north and east components of a tiny step from the position (x , y). If the step is small enough and lands dx to the east and dy to the north, then the change in altitude, f2 (x + dx , y + dy) − f2 (x , y), is very nearly equal to (2 ∗ y + 3) ∗ dx + (2 ∗ x + 5) ∗ dy. If we use () to mean dot (inner) product, then this change in altitude is d f2 (x , y) (dx , dy). From this second example, we can see that the derivative value is not a range value, but also not a rate-of-change of range values. It's a pair of such rates plus the know-how to use those rates to determine output changes.
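As a small concrete check (plain Haskell, not code from the paper), the predicted change in altitude for a step (dx, dy) from (x, y) is the dot product just described:

df2 :: (Double, Double) -> (Double, Double)
df2 (x, y) = (2 * y + 3, 2 * x + 5)

-- Dot product of d f2 (x, y) with a small step (dx, dy).
approxChange :: (Double, Double) -> (Double, Double) -> Double
approxChange (x, y) (dx, dy) = let (ex, ey) = df2 (x, y) in ex * dx + ey * dy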
9.
Linear maps (transformations) lie at the heart of generalized differentiation. Talking about linearity requires a few simple operations, which are encapsulated in the abstract interface known from math as a vector space. Vector spaces specialize the more general notion of a group, which has an associative and commutative binary operator, an identity, and inverses. For convenience, we'll specialize to an additive group, which provides addition-friendly names:

class AdditiveGroup v where
  0      :: v
  (+)    :: v → v → v
  negate :: v → v

Next, given a field s, a vector space over s adds a scaling operation:

class AdditiveGroup v ⇒ Vector s v where
  (·) :: s → v → v

Instances include Float, Double, and Complex, as well as tuples of vectors, and functions with vector ranges. (By "vector" here, I mean any instance of Vector, recursively.) For instance, here are instances for functions:

instance AdditiveGroup v ⇒ AdditiveGroup (a → v) where
  0      = pure 0
  (+)    = liftA2 (+)
  negate = fmap negate

instance Vector s v ⇒ Vector s (a → v) where
  (·) s = fmap (s·)
Two dimensions in and three dimensions out
Next, imagine moving around on a surface in space, say a torus, and suppose that the surface has grid marks to define a two-dimensional parameter space. As our hiker travels around in the 2D parameter space, his position in 3D space changes accordingly, more flexibly than just an altitude. The hiker’s type is then
These method definitions have a form that can be used with any applicative functor. Other useful operations can be defined in terms of these methods, e.g., subtraction for additive groups, and linear interpolation for vector spaces. Several familiar types are vector spaces:
f3 :: R2 → R3 At any position (s, t) in the parameter space, and for every choice of direction through parameter space, each of the coordinates of the position in 3D space has a rate of change. Again, if the function is mathematically well-behaved (differentiable), then all of these rates of change can be summarized in two partial derivatives. This time, however, each partial derivative has components in X, Y , and Z, so it takes six numbers to describe the 3D velocities for all possible directions in parameter space. These numbers are usually written as a 3-by-2 matrix m (the Jacobian of f3 ). Given a small parameter step (dx , dy), the resulting change in 3D position is equal to the product of the derivative matrix and the difference vector, i.e., m ‘timesVec‘ (dx , dy). 8.4
The general setting: vector spaces
• Trivially, the unit type is an additive group and is a vector space
over every field. • Scalar types are vector spaces over themselves, with (·) ≡ (∗). • Tuples add and scale component-wise. • Functions add and scale point-wise, i.e., on their range.
Appendix A gives an efficient representation of linear maps via an associated type (Chakravarty et al. 2005) of bases of vector spaces. Without regard to efficiency, we could instead represent linear maps as a simple wrapper around functions, with the invariant that the contained function is indeed linear:
A unifying perspective
The examples above use different representations for derivatives: scalar numbers, a vector (pair of numbers), and a matrix. Common to all of these representations is the ability to turn a small step in the function’s domain into a resulting step in the range.
newtype u ( v = LMap (u → v ) -- simple & inefficient deriving (AdditiveGroup, Vector )
• In f1 , the (scalar) derivative c means (c∗), i.e., multiply by c.
Assume the following abstract interface, where linear and lapply convert between linear functions and linear maps, and id L and (◦·) are identity and composition.
• In f2 , the (vector) derivative v means (v ). • In f3 , the (matrix) derivative m means (m‘timesVec‘).
linear :: (Vector s u, Vector s v ) ⇒ (u → v ) → (u ( v ) lapply :: (Vector s u, Vector s v ) ⇒ (u ( v ) → (u → v ) id L :: (Vector s u) ⇒ u ( u
So, the common meaning of these derivative representations is a function, and not just any function, but a linear function–often called a “linear map” or “linear transformation”.
(◦·)
Either a b, a → b, tries, and syntactic expressions (Elliott 2009b). Could these same definitions work on a . b, as an implementation of AD? Consider one example:
:: (Vector s u, Vector s v ) ⇒ (v ( w ) → (u ( v ) → (u ( w )
Another operation plays the role of dot products, as used in combining partial derivatives. () :: (Vector s u, Vector s v , Vector s w ) ⇒ (u ( w ) → (v ( w ) → ((u, v ) ( w )
sin = fmap sin For now, assume this definition and look at the corresponding numeric morphism property, i.e.,
Semantically, (l m) ‘lapply‘ (da, db) ≡ l ‘lapply‘ da + m ‘lapply‘ db
toD x (sin u) ≡ sin (toD x u)
which is linear in (da, db). Compare with the usual definition of dot products:
Expand the definitions of sin on each side, remembering that the left sin is on functions, as given in Figure 4.
(s 0 , t 0 ) (da, db) = s 0 · da + t 0 · db
toD x (fmap sin u) ≡ fmap sin (toD x u)
Dually to (), another way to form linear maps is by “zipping”:
which is a special case of the Functor morphism property for toD x . Therefore, proving the Functor morphism property will cover all of the definitions that use fmap. The other two definition styles (using pure and liftA2 ) work out similarly. For example, if toD x is an Applicative morphism, then
(?) :: (w ( u) → (w ( v ) → (w ( (u, v )) which will reappear in generalized form in Section 10.3.
10.
Generalized derivatives
toD x (fromInteger n) ≡ { fromInteger for a . b } toD x (pure (fromInteger n)) ≡ { toD x is an Applicative morphism } pure (fromInteger n) ≡ { fromInteger for functions } fromInteger n
We’ve seen what AD means and how and why it works for a specialized case of the derivative of a → a for a one-dimensional (scalar) type a. Now we’re ready to tackle the specification and derivation of AD in the much broader setting of vector spaces. Generalized differentiation introduces linear maps: d :: (Vector s u, Vector s v ) ⇒ (u → v ) → (u → (u ( v ))
toD x (u ∗ v ) ≡ { (∗) for a . b } toD x (liftA2 (∗) u v ) ≡ { toD x is an Applicative morphism } liftA2 (∗) (toD x u) (toD x v ) ≡ { (∗) on functions } toD x u ∗ toD x v
In this setting, there is a single, universal chain rule (Spivak 1971): d (g ◦ f ) x ≡ d g (f x ) ◦· d f x where (◦·) is composition of linear maps. More succinctly, d (g ◦ f ) ≡ (d g ◦ f ) ˆ◦· d f using lifted composition:
Now we can see why these definitions succeed so often: For applicative functors F and G, and function µ :: F a → G a, if µ is a morphism on Functor and Applicative, and the numeric instances for both F and G are defined as in Figure 4, then µ is a numeric morphism. Thus, we have only to come up with Functor and Applicative instances for (.) u such that toD x is a Functor and Applicative morphism.
(ˆ◦·) = liftA2 (◦·) 10.1
First-order generalized derivatives
The new type of value/derivative pairs has two type parameters: data a . b = D b (a ( b) As in Section 5.1, the AD specification centers on a function, toD, that samples a function and its derivative at a point. This time, it will be easier to swap the parameters of toD:
10.2
Functor
First look at Functor . The morphism condition (naturality), ηexpanded, is
toD :: (Vector s u, Vector s v ) ⇒ u → (u → v ) → u . v toD x f = D (f x ) (d f x )
fmap g (toD x f ) ≡ toD x (fmap g f )
In Sections 5 and 6, AD algorithms were derived by saying that toD is a morphism over numeric types. The definitions of these morphisms and their proofs involved one property for each method. In the generalized setting, we can instead specify and prove three simple morphisms, from which all of the others follow effortlessly. We already saw in Figure 4 that the numeric methods for functions have a simple, systematic form. They’re all defined using fmap, pure, or liftA2 in a simple, regular pattern, e.g.,
Using the definition of toD on the left gives fmap g (D (f x ) (d f x )) Simplifying the RHS, toD x (fmap g f ) ≡ { definition of toD } D ((fmap g f ) x ) (d (fmap g f ) x ) ≡ { definition of fmap for functions } D ((g ◦ f ) x ) (d (g ◦ f ) x ) ≡ { generalized chain rule } D (g (f x )) (d g (f x ) ◦· d f x )
fromInteger = pure ◦ fromInteger (∗) = liftA2 (∗) sin = fmap sin ...
So the morphism condition is equivalent to
Numeric instances for many other applicative functors can be given exactly the same method definitions. For instance, Maybe a,
fmap g (D (f x ) (d f x )) ≡ D (g (f x )) (d g (f x ) ◦· d f x )
Filling in the definition of toD,
Now it’s easy to find a sufficient definition: fmap g (D fx dfx ) = D (g fx ) (d g fx ◦· dfx )
unit ≡ D (unit x ) (d unit x ) D (f x ) (d f x ) ? D (g x ) (d g x ) ≡ D ((f ? g) x ) (d (f ? g) x )
This definition is not executable, however, since d is not. Fortunately, all uses of fmap in the numeric instances involve functions g whose derivatives are known statically and so can be statically substituted for applications of d . To make the static substitution more apparent, refactor the fmap definition, as in Section 3.
The reason for switching from Applicative to Monoidal is that differentiation is very simple with the latter: d unit ≡ const 0 d (f ? g) ≡ d f ˆ ?d g
instance Functor ((.a)) where fmap g = g ./ d g (./) :: (Vector s u, Vector s v , Vector s w ) ⇒ (v → w ) → (v → (v ( w )) → (u . v ) → (u . w ) (g ./ dg) (D fx dfx ) = D (g fx ) (dg fx ◦· dfx )
The (ˆ ?) on the right is a variant of (?), lifted to work on functions (or other applicative functors) that yield linear maps:
This new definition makes it easy to transform the fmap-based definitions systematically into effective versions. After inlining this definition of fmap, the fmap-based definitions look like
(ˆ ?) = liftA2 (?) We cannot simply pair linear maps to get a linear map. Instead, (?) pairs linear maps point-wise. Now simplify the morphism properties, using unit and (?) for functions, and their derivatives:
sin = sin ./ d sin sqrt = sqrt ./ d sqrt ...
unit ≡ D () 0 D (f x ) (d f x ) ? D (g x ) (d g x ) ≡ D (f x , g x ) (d f x ? d g x )
Every remaining use of d is applied to a function whose derivative is known, so we can replace each use. sin = sin ./ cos sqrt = sqrt ./ recip (2 ∗ sqrt) ...
So the following simple definitions suffice: unit = D () 0 D fx dfx ? D gx dgx = D (fx , gx ) (dfx ? dgx )
Now we have an executable implementation again. 10.3
The switch from Applicative to Monoidal introduced fmap app (in the definition of (~)). Because of the meaning of fmap on u .v , we will need a derivative for app. Fortunately, app is fairly easy to differentiate. Allowing only x to vary (while holding f constant), f x changes just as f changes at x , so the second partial derivative of app at (f , x ) is d f x . Allowing only f to vary, f x is linear in f , so it (considered as a linear map) is its own derivative. That is, using ($) as infix function application,
Applicative/Monoidal
Functor is handled, which leaves just Applicative (McBride and Paterson 2008): class Functor f ⇒ Applicative f where pure :: a → f a (~) :: f (a → b) → f a → f b The morphism properties will be easier to demonstrate in terms of a type class for (strong) lax monoidal functors:
d ($ x ) f ≡ linear ($ x ) d (f $) x ≡ d f x
class Monoidal f where unit :: f () (?) :: f a → f b → f (a, b)
As mentioned in Section 9, () takes the place of dot product for combining contributions from partial derivatives, so d app (f , x ) = linear ($ x ) d f x
For instance, the function instance is instance Monoidal ((→) a) where unit = const () f ? g = λx → (f x , g x )
Alternatively, define liftA2 via fmap and (?) instead of app. Either way, the usual derivative rules for (+) and (∗) follow from liftA2 , and so needn’t be implemented explicitly (Elliott 2009a).
The Applicative class is equivalent to Functor +Monoidal (McBride and Paterson 2008, Section 7). To get from Functor and Monoidal to Applicative, define
10.4
Fun with rules
Let’s back up to our more elegant method definitions: (∗) = liftA2 (∗) sin = fmap sin sqrt = fmap sqrt ...
pure a = fmap (const a) unit fs ~ xs = fmap app (fs ? xs) where app :: (a → b, a) → b app (f , x ) = f x
Section 10.2 made these definitions executable in spite of their appeal to the non-executable d by (a) refactoring fmap to split the d from the residual function (./), (b) inlining fmap, and (c) rewriting applications of d with known derivative rules. Instead, we could get the compiler to do these steps for us, by specifying the derivatives of known functions as rewrite rules (Peyton Jones et al. 2001), e.g.,
I’ve kept Monoidal independent of Functor , unlike (McBride and Paterson 2008), because the linear map type has unit and (?) but is not a functor. (Only linear functions can be mapped over a linear map.) The shift from Applicative to Monoidal makes the specification of toD simpler, again as a morphism:
d log = recip d sqrt = recip (2 ∗ sqrt) ...
unit ≡ toD x unit toD x f ? toD x g ≡ toD x (f ? g)
D ≡ D ≡ D ≡ D ≡ D
Notice that these definitions are simpler and more modular than the standard differentiation rules, as they do not have the chain rule mixed in. With these rules in place, we can use the incredibly simple fmap-based definitions of our methods. The definition of fmap must get inlined so as to reveal the d applications, which then get rewritten according to the rules. Fortunately, the fmap definition is tiny, which encourages its inlining. The current implementation of rewriting in GHC is somewhat fragile, so it may be a while before this sort of technique is robust enough for everyday use.
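Such rules might be written as GHC rewrite-rule pragmas along these lines; this is a sketch only, assuming d is exposed as an ordinary identifier the simplifier can match on, and using the function overloadings of Figure 4 on the right-hand sides:

{-# RULES
"d/log"  d log  = recip
"d/sqrt" d sqrt = recip (2 * sqrt)
"d/sin"  d sin  = cos
  #-}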
((g ◦ f ) x ) (toD x (d (g ◦ f ))) { Generalized chain rule } (g (f x )) (toD x ((d g ◦ f ) ˆ◦· d f )) { liftA2 morphism, (ˆ◦·) ≡ liftA2 (◦·) } (g (f x )) (toD x (d g ◦ f ) ˆ◦· toD x (d f ))) { fmap ≡ (◦) on functions } (g (f x )) (toD x (fmap (d g) f ) ˆ◦· toD x (d f ))) { fmap morphism } (g (f x )) (fmap (d g) (toD x f ) ˆ◦· toD x (d f ))
Summarizing, toD x is a Functor morphism iff 10.5
A problem with type constraints
fmap g (D (f x ) (toD x (d f ))) ≡ D (g (f x )) (fmap (d g) (toD x f ) ˆ◦· toD x (d f ))
There is a problem with the Functor and Monoidal (and hence Applicative) instances derived above. In each case, the method definitions type-check only for type parameters that are vector spaces. The standard definitions of Functor and Applicative in Haskell do not allow for such constraints. The problem is not in the categorical notions, but in their specialization (often adequate and convenient) in the standard Haskell library. For this reason, we'll need a variation on these standard classes, either specific for use with vector spaces or generalized. The definitions above work with the following variations, parameterized over a scalar type s:
Given this form of the morphism condition, and recalling the definition of toD, it’s easy to find a fmap definition for (.∗ ): fmap g fxs@(D fx dfx ) = D (g fx ) (fmap (d g) fxs ˆ◦· dfx ) The only change from (.) is (ˆ◦·) in place of (◦·). Again, this definition can be refactored, followed by replacing the non-effective applications of d with known derivatives. Alternatively, replace arbitrary functions with differentiable functions (u ,→ v ), as in Section 10.5, so as to make this definition executable as is. The Monoidal derivation goes through as before, adding an induction step. The instance definition is exactly as with (.) above. The only difference is using (ˆ ?) in place of (?).
class Functor s f where fmap :: (Vector s u, Vector s v ) ⇒ (u → v ) → (f u → f v ) class Functor s f ⇒ Monoidal s f where unit :: f () (?) :: (Vector s u, Vector s v ) ⇒ f u → f v → f (u, v )
instance Vector s u ⇒ Monoidal s ((.∗) u) where
  unit = D () 0
  D u u′ ? D v v′ = D (u, v) (u′ ˆ? v′)

To optimize out zeros in either u . v or u .∗ v, add a Maybe around the derivative part of the representation, as mentioned in Section 7 and detailed in (Elliott 2009a). The zero-optimizations are entirely localized to the definitions of fmap and (?). To handle the Nothing vs Just cases, add an extra fmap in the fmap definition, and another liftA2 in the (?) definition.
While we are altering the definition of Functor , we can make another change. Rather than working with any function, limit the class to accepting only differentiable functions. A simple representation of a differentiable function is a function and its derivative: data u ,→ v = FD (u → v ) (u → (u ( v )) This representation allows a simple and effective implementation of d :
11.
d :: (Vector s u, Vector s v ) ⇒ (u ,→ v ) → (u → (u ( v )) d (FD f 0 ) = f 0 With these definitions, the simple numeric method definitions (via fmap and liftA2 ) are again executable, provided that the functions passed to fmap are replaced by differentiable versions. 10.6
Related work
Jerzy Karczmarczuk (2001) first demonstrated the idea of an infinite "lazy tower of derivatives", giving a lazy functional implementation. His first-order warm-up was similar to Figure 2, with the higher-order (tower) version somewhat more complicated by the introduction of streams of derivatives. Building on Jerzy's work, this paper implements the higher-order case with the visual simplicity of the first-order formulation (Figure 2). It also improves on that simplicity by means of numeric instances for functions, leading to Figure 5. Another improvement is optimizing zeros without cluttering the implementation (Section 7). In contrast, (Karczmarczuk 2001) and others had twice as many cases to handle for unary methods, and four times as many for binary. Jerzy's AD implementation was limited to scalar types, although he hinted at a vector extension in (Karczmarczuk 1999), using an explicit list of partial derivatives. These hints were later fleshed out for the higher-order case in (Foutz 2008), replacing lists with (nested) sparse arrays (represented as fast integer maps). The constant-optimizations there complicated matters but had an advantage over the version in this paper. In addition to having constant vs non-constant constructors (and hence many more cases to define), each sparse array can have any subset of its partial derivatives missing, avoiding still more multiplications and additions with zero. To get the same benefit, one might use a linear map representation based on partial functions. Pearlmutter and Siskind (2007) also extended higher-order forward-mode AD to the multi-dimensional case. They remarked:
10.6 Generalized derivative towers
To compute infinitely many derivatives, begin with a derivative tower type and a theoretical means of constructing towers from a function: data u .∗ v = D v (u .∗ (u ( v )) toD :: (Vector s u, Vector s v ) ⇒ u → (u → v ) → u .∗ v toD x f = D (f x ) (toD x (d f )) The naturality (functor morphism) property is fmap g ◦ toD x ≡ toD x ◦ fmap g As before, let’s massage this specification into a form that is easy to satisfy. First, η-expand, and fill in the definition of toD: fmap g (D (f x ) (toD x (d f ))) ≡ D ((fmap g f ) x ) (toD x (d (fmap g f ))) Next, simplify the right side, inductively assuming the Functor and Applicative morphisms inside the derivative part of D.
Conal Elliott. Denotational design with type class morphisms. Technical Report 2009-01, LambdaPix, March 2009b. URL http://conal.net/papers/type-class-morphisms.
The crucial additional insight here, both for developing the extension and for demonstrating its correctness, involves reformulating Karczmarczuk’s method using Taylor expansions instead of the chain rule.
Conal Elliott. Push-pull functional reactive programming. Proceedings of the Haskell Symposium, 2009c.
The expansions involve introducing non-standard “ε” values, which come from dual numbers. Each ε must be generated, managed, and carefully distinguished from others, so as to avoid problems of nested use described and addressed in (Siskind and Pearlmutter 2005, 2008). In contrast, the method in this paper is based on the chain rule, while still handling multi-dimensional AD. I don’t know whether the tricky nesting problem arises with the formulation in this paper based on linear maps. Henrik Nilsson (2003) extended higher-order AD to work on a generalized notion of functions that includes Dirac impulses, allowing for more elegant functional specification of behaviors involving instantaneous velocity changes. These derivatives were for functions over a scalar domain (time). Doug McIlroy (2001) demonstrated some beautiful code for manipulating infinite power series. He gave two forms, Horner and Maclaurin, and their supporting operations. The Maclaurin form is especially similar, under the skin, to working with lazy derivative towers. Doug also examined efficiency and warned that “the product operators for Maclaurin and Horner forms respectively take O(2ⁿ) and O(n²) coefficient-domain operations to compute n terms.” He goes on to suggest computing products by conversion to and from the Horner representation. I think the exponential complexity can apply in the formulations in (Karczmarczuk 2001) and in this paper. I am not aware of work on AD for general vector spaces, nor on deriving AD from a specification.
Jason Foutz. Higher order multivariate automatic differentiation in Haskell. Blog post, February 2008. URL http://metavar.blogspot.com/2008/02/higher-order-multivariate-automatic.html.
Ralf Hinze. Generalizing generalized tries. Journal of Functional Programming, 10(4):327–351, 2000.
Graham Hutton and Jeremy Gibbons. The generic approximation lemma. Information Processing Letters, 79(4):197–201, 2001.
Jerzy Karczmarczuk. Functional coding of differential forms. In Scottish Workshop on Functional Programming, 1999.
Jerzy Karczmarczuk. Functional differentiation of computer programs. Higher-Order and Symbolic Computation, 14(1), 2001.
Conor McBride and Ross Paterson. Applicative programming with effects. Journal of Functional Programming, 18(1):1–13, 2008.
M. Douglas McIlroy. The music of streams. Information Processing Letters, 77(2-4):189–195, 2001.
Henrik Nilsson. Functional automatic differentiation with Dirac impulses. In Proceedings of the Eighth ACM SIGPLAN International Conference on Functional Programming, pages 153–164, Uppsala, Sweden, August 2003. ACM Press.
Barak A. Pearlmutter and Jeffrey Mark Siskind. Lazy multivariate higher-order forward-mode AD. In Proceedings of the 2007 Symposium on Principles of Programming Languages, pages 155–160, Nice, France, January 2007.
12. Future work
Reverse and mixed mode AD. Forward-mode AD uses the chain rule in a particular way: in compositions g ◦ f , g is always a primitive function, while f may be complex. Reverse-mode AD uses the opposite decomposition, with f being primitive, while mixed-mode combines styles. Can the specification in this paper be applied, as is, to these other AD modes? Can the derivations be successfully adapted to yield general, efficient, and elegant implementations of reverse and mixed mode AD, particularly in the general setting of vector spaces?
Simon Peyton Jones, Andrew Tolmach, and Tony Hoare. Playing by the rules: rewriting as a practical optimisation technique in GHC. In Haskell Workshop. ACM SIGPLAN, 2001.
Jeffrey Mark Siskind and Barak A. Pearlmutter. Nesting forward-mode AD in a functional framework. Higher-Order and Symbolic Computation, 21(4):361–376, 2008.
Jeffrey Mark Siskind and Barak A. Pearlmutter. Perturbation confusion and referential transparency: Correct functional implementation of forward-mode AD. In Implementation and Application of Functional Languages, IFL’05, September 2005.
Richer manifold structure. Calculus on vector spaces is the foundation of calculus on rich manifold structures stitched together out of simpler pieces (ultimately vector spaces). Explore differentiation in the setting of these rich structures.
Michael Spivak. Calculus on Manifolds: A Modern Approach to Classical Theorems of Advanced Calculus. HarperCollins Publishers, 1971.
Efficiency analysis. Forward-mode AD for Rn → R is described as requiring n passes and therefore inefficient. The method in this paper makes only one pass. That pass manipulates linear maps instead of scalars, which could be as expensive as n passes, but it might not need to be.
13. Acknowledgments
A. Efficient linear maps
A.1 Basis types
A basis of a vector space V is a subset B of V , such that the elements of B span V and are linearly independent. That is to say, every element (vector) of V is a linear combination of elements of B , and no element of B is a linear combination of the other elements of B . Moreover, every basis determines a unique decomposition of any member of V into coordinates relative to B . Since Haskell doesn’t have subtyping, we can’t represent a basis type directly as a subset. Instead, for an arbitrary vector space v , represent a distinguished basis as an associated type (Chakravarty et al. 2005), Basis v , and a function that interprets a basis representation as a vector. Another method extracts coordinates (coefficients) for a vector with respect to basis elements.
I’m grateful for comments from Anonymous, Barak Pearlmutter, Mark Rafter, and Paul Liu.
References
Manuel M. T. Chakravarty, Gabriele Keller, and Simon Peyton Jones. Associated type synonyms. In ICFP ’05: Proceedings of the tenth ACM SIGPLAN international conference on Functional programming, pages 241–253. ACM Press, 2005.
Conal Elliott. Beautiful differentiation (extended version). Technical Report 2009-02, LambdaPix, March 2009a. URL http://conal.net/papers/beautiful-differentiation.
lapply :: (Vector s u, Vector s v ) ⇒ (u ( v ) → (u → v ) lapply uv u = sumV [coord u e · uv e | e ← enumerate ]
class (Vector s v , Enumerable (Basis v )) ⇒ HasBasis s v where type Basis v :: ∗ basisValue :: Basis v → v coord :: v → (Basis v → s)
or lapply lm = linearCombo ◦ fmap (first lm) ◦ decompose
The Enumerable constraint enables enumerating basis elements for application of linear maps (Section A.2). It has one method that enumerates all of the elements in a type:
The inverse function is easier. Convert a function f , presumed linear, to a linear map representation: linear :: (Vector s u, Vector s v , HasBasis u) ⇒ (u → v ) → (u ( v )
class Enumerable a where
  enumerate :: [a ]
It suffices to apply f to basis values:
A.1.1 Primitive bases
linear f = f ◦ basisValue
Since () is zero-dimensional, its basis is the Void type. The distinguished basis of a one-dimensional space has only one element, which can be represented with no information. Its corresponding value is 1.
The coord method can be changed to return v ( s, which is the dual of v .
instance HasBasis Double Double where
  type Basis Double = ()
  basisValue () = 1
  coord s = const s
A.2.1 Memoization
The idea of the linear map representation is to reconstruct an entire (linear) function out of just a few samples. In other words, we can make a very small sampling of function’s domain, and re-use those values in order to compute the function’s value at all domain values. As implemented above, however, this trick makes function application more expensive, not less. If lm = linear f , then each use of lapply lm can apply f to the value of every basis element, and then linearly combine results. A simple trick fixes this efficiency problem: memoize the linear map. We could do the memoization privately, e.g.,
A.1.2 Composing bases
Given vector spaces u and v , a basis element for (u, v ) will be one basis representation or the other, tagged with Left or Right. The vectors corresponding to these basis elements are (ub, 0) or (0, vb), where ub corresponds to a basis element for u, and vb for v . As expected then, the dimensionality of the cross product is the sum of the dimensions. The decomposition of a vector (u, v ) contains left-tagged decompositions of u and right-tagged decompositions of v .
linear f = memo (f ◦ basisValue) If lm = linear f , then no matter how many times lapply lm is applied, the function f can only get applied as many times as the dimension of the domain of f . However, there are several other ways to make linear maps, and it would be easy to forget to memoize each combining form. So, instead of the function representation above, ensure that the function be memoized by representing it as a memo trie (Hinze 2000; Elliott 2009b).
instance (HasBasis s u, HasBasis s v ) ⇒ HasBasis s (u, v ) where type Basis (u, v ) = Basis u ‘Either ‘ Basis v basisValue (Left a) = (basisValue a, 0) basisValue (Right b) = (0, basisValue b) coord (u, v ) = coord u ‘either ‘ coord v
type u ( v = Basis u → v
Triples, etc., can be handled similarly or reduced to nested pairs. Basis types are usually finite and small, so the decompositions can be memoized for efficiency, e.g., using memo tries (Elliott 2009b).
The conversion functions linear and lapply need just a little tweaking. Split memo into its definition untrie ◦ trie, and then move untrie into lapply. We’ll also have to add HasTrie constraints: linear :: (Vector s u, Vector s v , HasBasis s u, HasTrie (Basis u)) ⇒ (u → v ) → (u ( v ) linear f = trie (f ◦ basisValue) lapply :: (Vector s u, Vector s v , HasBasis s u, HasTrie (Basis u)) ⇒ (u ( v ) → (u → v ) lapply lm = linearCombo ◦ fmap (first (untrie lm)) ◦ decompose
A.2 Linear maps
Semantically, a linear map is a function f :: a → b such that, for all scalar values s and “vectors” u, v :: a, the following properties hold: f (s · u) ≡ s · f u f (u + v ) ≡ f u + f v By repeated application of these properties, f (s1 · u1 + . . . + sn · un ) ≡ s1 · f u1 + . . . + sn · f un
Now we can build up linear maps conveniently and efficiently by using the Functor and Applicative instances for memo tries (Elliott 2009b).
Taking the ui as basis vectors, this form implies that a linear function is determined by its behavior on any basis of its domain type. Therefore, a linear function can be represented simply as a function from a basis, using the representation described in Section A.1. type u ( v = Basis u → v The semantic function converts from (u ( v ) to (u → v ). It decomposes a source vector into its coordinates, applies the basis function to basis representations, and linearly combines the results.
OXenstored: An Efficient Hierarchical and Transactional Database using Functional Programming with Reference Cell Comparisons
Thomas Gazagnaire
Vincent Hanquez
Citrix Systems First Floor Building 101, Cambridge Science Park, Milton Road Cambridge, CB4 0FY, United Kingdom {thomas.gazagnaire,vincent.hanquez}@citrix.com
Abstract
host’s hardware. They have been a very popular architecture since CP/CMS (Creasy 1981), developed at IBM in the 1960s. The XEN hypervisor is very popular and is used, for example, by the Amazon Elastic Compute Cloud project1 , which allows customers to rent computers on which they can run their own applications.
We describe in this paper our implementation of the Xenstored service, which is part of the XEN architecture. Xenstored maintains a hierarchical and transactional database, used for storing and managing configuration values. We demonstrate in this paper that mixing functional data-structures together with reference cell comparison, which is a limited form of pointer comparison, is: (i) safe; and (ii) efficient. This demonstration is based, first, on an axiomatization of operations on the tree-like structure we used to represent the Xenstored database. From this axiomatization, we then derive an efficient algorithm for coalescing concurrent transactions modifying that structure. Finally, we experimentally compare the performance of our implementation, which we call OXenstored, with the C implementation of the Xenstored service distributed with the XEN hypervisor sources: the results show that OXenstored is much more efficient than its C counterpart. As a direct result of this work, OXenstored will be included in future releases of XENSERVER, the virtualization product distributed by Citrix Systems, where it will replace the current implementation of the Xenstored service.
In a virtual architecture, each guest runs securely partitioned from others in a virtualized environment. Using a technique called para-virtualization, which involves replacing privileged processor instructions with calls into the hypervisor, processor efficiency is close to native performance. When operating system modifications are impractical or impossible, the XEN hypervisor leverages the use of special instructions, called VMM instructions. These instructions give the ability to run the guest unmodified while trapping all administrative operations securely. In the XEN architecture, a para-virtualized privileged guest called the “control domain” is in charge of all the I/O needs of the other guests. Consequently, all virtual guests’ devices (disks, network interfaces) have to be handled at this level too. For example, each guest’s virtual disk is associated with at least one process in the control domain. Subsequently, when a new guest starts, it is necessary that the corresponding processes have already been started in the control domain and configured correctly. In the XEN architecture, it is thus necessary to have a specific service in the control domain to exchange control and configuration data between guests. This service is called Xenstored and can be seen as a tuple space system, providing concurrent-safe access to a key-value association database. This service was originally implemented in C and is distributed with the XEN hypervisor sources. We will refer to it as CXenstored.
Categories and Subject Descriptors D.1.1 [Programming Techniques]: Applicative (Functional) Programming; H.2.4 [Database Management]: Systems – Transaction processing; D.3.3 [Programming Languages]: Language Constructs and Features – Data types and structures
General Terms Algorithms, Design, Performance
Keywords Databases, Transactions, Concurrency, Prefix Trees
1. Introduction
This paper describes another implementation of the Xenstored requirements, done using Objective Caml (Leroy et al. 1996), and we will refer to it as OXenstored. This new implementation uses functional data-structures as well as reference cell comparison (Pitts and Stark 1993; Claessen and Sands 1999), which is a limited form of pointer comparison. OXenstored is a fifth of the size of CXenstored (around 2000 lines of code) and significantly improves the performance in several respects: particularly, it increases by a factor of 3 the number of guests per host (up to 160). Moreover, we formalized the new algorithms implemented by OXenstored and we proved that they are safe, that is concurrent accesses always leave the database in a consistent state. Finally,
XEN (Barham et al. 2003) is an open-source type 1 hypervisor, providing the ability to run multiple operating systems, called guests, concurrently on a single physical processor. Type 1 (or native, bare-metal) hypervisors are software systems that run directly on the
1 Part of the Amazon Web Services: http://aws.amazon.com/ec2/
as a result of this work, OXenstored will replace CXenstored in future releases of XENSERVER, the virtualization product commercialized by Citrix Systems2 , which is built on top of the XEN hypervisor. OXenstored sources can be found at: http://xenbits.xen.org/ext/xen-ocaml-tools.hg
Simple database operation The clients of the Xenstored service (which are either guests or components running inside the control domain) have only a small set of atomic operations to set and get contents of the database: • write: create a new path or modify the value of an existing one; • read: get the value associated with a path;
Consequently, we believe that these arguments demonstrate that (i) functional (and thus immutable) data-structures are the most intuitive and simple candidates for manipulating tree-like structured data; and (ii) reference cell comparisons can be used in a very safe way to obtain great performance as well as proven consistency.
• mkdir: create a new node in the database; • getperm: get the permissions of a node; • setperm: set the permissions of a node.
Notification The clients of the Xenstored service can ask to be notified of changes in a node. When a node is changed or created, all the guests watching this node or a parent of this node will be notified asynchronously that a specific path has been modified.
This paper is organized as follows: Section 2 gives a high-level overview of the Xenstored service’s architecture, as well as an example of how it interacts with the XEN framework. Section 3 gives an informal explanation of the algorithm used by OXenstored. Then, Section 4 formalizes the functional data-structure used in OXenstored for modeling the database and Section 5 derives the notion of a transaction in this context. Then, we explain in Section 6 how, unlike CXenstored, OXenstored is able to coalesce concurrent transactions which affect distinct parts of the database. We conduct some experiments in Section 7 to validate our approach and show that, indeed, the simplistic way CXenstored handles transactions can make the system live-lock in a very common situation (starting a lot of guests), as opposed to OXenstored. Finally, Section 8 compares our approach with other work focused on transactional databases and Section 9 discusses the results given in this paper.
Permissions Each node has its own permissions. A permission is made up of an owner, which has all privileges on the node, and an access control list which specifies who is allowed to access or modify the node.
Quotas Unprivileged guests can be limited in the number of nodes they are authorized to create. This permits the Xenstored service’s memory usage to stay at reasonable levels and protects the control domain memory from malicious guests.
Transactions The clients of the Xenstored service have the ability to run a set of multiple operations modifying/reading the database in an atomic fashion. From the start to the end of the transaction, the Xenstored service will ensure that the database stays consistent. If a transaction cannot be made consistent, then the Xenstored service will reply to the client that the transaction needs to be retried. A client can have multiple open transactions at the same time, and thus each transaction needs to be identified by a unique transaction ID – whenever a client sends a request to the Xenstored service, it associates such a transaction ID with the request. A transaction ID of 0 means that the request is not associated with a transaction and will be executed directly.
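To make the role of the transaction ID concrete, here is a minimal OCaml sketch of how a client request might be tagged; the type and constructor names are illustrative only and do not reflect the actual OXenstored wire protocol.

type operation =
  | Write of string * string        (* path, value *)
  | Read of string                  (* path *)
  | Mkdir of string
  | Getperm of string
  | Setperm of string * string      (* path, permissions *)

type request = {
  tid : int;      (* transaction ID; 0 = not part of any transaction *)
  op : operation;
}

(* A request with tid = 0 is executed immediately against the current
   database; any other tid routes it to the corresponding open transaction. *)
let is_transactional r = r.tid <> 0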
2. Xenstored Design and Use-Case
We give in Section 2.1 an overview of the design of the Xenstored service and, in Section 2.2, an example use-case of how it interacts with the XEN hypervisor.
2.1 Overview of the Xenstored Design
The Xenstored service is a single-threaded daemon, running in the control domain. It is the central point of communication, as guests communicate with each other through it. Basically, it is a file-system-like database, where control data is hierarchically organized.
These requirements were originally implemented in CXenstored. However, in CXenstored, transactions are handled in a simplistic way. In a set of concurrent transactions, only one would be able to complete successfully; all the other transactions would have to retry from the beginning. This way of handling multiple concurrent transactions makes it simple to provide consistency; however, it also adversely affects the performance of the system: when the number of concurrent transactions is high, a slow or long transaction will be disadvantaged in an environment where only the first to finish can complete successfully. In case a client of the Xenstored service is doing a small and fast transaction in a loop, this will cause a live-lock and some transactions of other clients will never be able to complete.
There are two different ways a client can connect and communicate with the Xenstored service:
• First, using the ring buffers. These are two circular buffers stored inside the guest’s memory: one for sending requests to the Xenstored service and one for reading its replies.
• Second, using Unix sockets. These are accessible only to processes running inside the control domain. Clients of the Xenstored service use the standard Unix read and write operations to access the database.
2.2 Example Use-Case
From the client point of view, the Xenstored service offers a hierarchical key-value association database which has the following properties:
In this section we give a small example which describes the sequence of steps involved when starting a new guest. This sequence of steps involves three different clients of the Xenstored service:
Structured key database Contents in the database are structured into nodes which are addressed by Unix filesystem-style paths. For example, the ith guest stores its virtual-disk configuration under the path /local/domain/i/device/vbd/ in a XEN-based system. Each path corresponds to a node in the database and it can store a value even if it has some children.
• The Control domain’s Kernel (CK). The kernel of the control
domain is able to set up the virtual disk drivers and the virtual network drivers inside the control domain, for the new guests. • The new Guest Kernel (GK). When running, the kernel of the
guest will try to connect to its virtual disk and network drivers which should be running inside and exported by the control domain.
2 Citrix Systems, http://www.citrix.com.
• The management Tool-Stack (TS). It is running inside the con-
T , the prefix tree described in this figure, is the state of the database just before committing the transaction. When committing (T1 , T2 , b) to T , the system first observes that T1 and T ’s roots are not in the same memory location. However, the subtree at b in T1 is the same as the subtree at b in T . Thus, the new database state becomes the prefix tree T ′ in Figure 1, which is the substitution of the subtree at b in T by the subtree at b in T2 .
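The decision procedure at work in this example can be sketched in OCaml as follows. This is only an illustration under stated assumptions: restrict and subst stand for the trie restriction and substitution defined in Section 4, (==) is OCaml’s physical (reference cell) equality, and none of the names come from the actual OXenstored sources.

type 'trie outcome = Commit of 'trie | Abort

(* (t1, t2, p): the snapshot taken when the transaction started, the local
   copy carrying the transaction's updates, and the path below which all of
   those updates live.  [db] is the current database state at commit time. *)
let commit ~restrict ~subst ~db (t1, t2, p) =
  if t1 == db then Commit t2                    (* nothing changed meanwhile *)
  else if restrict t1 p == restrict db p then
    (* the nodes touched by the transaction were not touched concurrently:
       graft the transaction's version of that subtree into [db] *)
    Commit (subst db p (restrict t2 p))
  else Abort                                    (* conflict: the client retries *)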
trol domain and it receives direct orders from the user. Particularly, when a user wants to start a specific guest, the management tool-stack is in charge of initiating the guest start protocol. Starting a guest can be summarized as follows. Lines are prefixed with the name of the client doing the action (i.e. (CK), (GK) or (TS)). Almost all of these steps should be done atomically (i.e. using transactions) so as not to pollute the configuration database in case one of the three clients fails.
(TS) The management tool-stack queries the X EN hypervisor to create a fresh guest. At this point, the guest is paused. The X EN hypervisor responds to the management tool-stack by giving i, the ID of the created guest. Then, the management tool-stack creates a collection of paths /local/domain/i/devices/... in the database.
(TS) The management tool-stack notifies the control domain’s kernel (whose guest ID is 0) that it wants to set up some kernel devices by atomically writing a collection of configuration keys in /local/domain/0/backend/....
(CK) The control domain’s kernel is watching for modifications on the path /local/domain/0/backend, and thus is notified by the Xenstored service that someone wants it to configure new drivers inside the control domain. Once this is done, the control domain’s kernels write some special values in /local/domain/0/backend to notify the management toolstack that the devices are ready.
Figure 1. An example of how the algorithm implemented by OXenstored works.
4. Database Representation
(TS) The management tool-stack is notified that the devices are ready inside the control domain. It can now ask the X EN hypervisor to really start the guest.
In this section we introduce the data structures we decided to use for internally representing the OXenstored database, namely prefix trees, or simply tries (Fredkin 1960), which are efficient structures for representing dictionaries. We only consider functional tries, i.e. data-structures whose states are not mutable: an update operation on these structures creates a new structure and tries to share as much data as possible with the previous one. We first give in Section 4.1 some general definitions and properties of tries and we explain in Section 4.2 the more precise assumptions and choices we made for implementing a trie library in Objective Caml: this leads to an axiomatization of the trie data-structure library on which the following sections can rely to prove the correctness of our transaction-coalescing algorithm.
(GK) When the guest kernel is booting, it will read the configuration values put in /local/domain/i/devices/... by the management tool-stack. Using these data, it will be able to configure the data-path of its devices correctly, through the drivers of the control domain’s kernel.
3. Informal Description of the Algorithm
Basically, the database of OXenstored is modeled as an immutable prefix tree (Okasaki 1999). Each transaction is associated with a triplet (T1 , T2 , p), where T1 is the root of the database just before the transaction starts, T2 is the current local copy of the database with all updates made by the transaction up to that point in time, and p is the path to the node furthest from the root of T2 whose subtree contains all the updates made by the transaction up to that point. The transaction updates T2 by substitution and copying all the nodes from the root of T2 to the node where the substitution takes place. At the end of the transaction, the system tries to commit. At this point, the system checks if T1 is still the root of the current database. If it is, it just commits by setting T2 to be the current database. If it is not, it checks if the subtree at p in T1 is the same as the subtree at p in the current database. If it is, it means that no-one has yet touched the nodes touched by the transaction, so it just commits by substituting the subtree at p in the database by the subtree at p in T2 . Otherwise, someone must have touched the nodes touched by the transaction, and therefore the transaction must abort to ensure serialisability (Härder and Reuter 1983).
4.1 Basic Materials First, let us fix a finite set K of keys and let us consider the free monoid K∗, i.e. the set of strings over alphabet K, defined as the infinite set of (possibly empty) key sequences ⋃n∈N K^n and the composition law, having the empty sequence ε as neutral element and for which uv = a1 . . . an b1 . . . bm if u = a1 . . . an and v = b1 . . . bm ; and uε = εu = u. Elements of K∗ are called paths. A prefix of u = a1 . . . an is either ε, or a path a1 . . . ak with k ∈ {1 . . . n}. A strict prefix of u is either ε or a path a1 . . . ak with k < n. Moreover, the size of a path u, denoted by |u|, is the number of elements contained in the sequence; more formally, |a1 . . . an | = n and |ε| = 0. In OXenstored, a path is represented as a slash-separated list of names, such as /local/domain/0/device. This path can be understood as the sequence of keys a1 a2 a3 a4 , where a1 = local, a2 = domain, a3 = 0 and a4 = device. Second, let us fix a finite set V of values. A trie is a structure which partially maps paths to values. Thus, it can also be considered as a total function T : K∗ → (V ∪ {⊥}), where ⊥ is a special symbol introduced to denote the fact that a path has no associated value.
Let us now consider a small example to see how this algorithm works. Let us consider an initial database associating 0 to ε, 5 to a and 4 to b, and a transaction trying to associate 7 to b. This transaction is represented by the triplet (T1 , T2 , b), where T1 and T2 are two prefix trees sharing in memory the same node associated to the key a, as described in Figure 1. Moreover, let us assume that
Finally, the singleton trie associating a value x with the path ε and ⊥ to anything else is denoted by {x}. The infinite collection of tries is denoted by T(K, V) or simply by T.
be used to observe value sharing: two values are shared if they have the same location, that is, if they are physically equal. In light of this discussion, Figure 2 redefines the substitution and the restriction in order to enforce the sharing of values: rules (E1) and (E2) focus on the behavior of physical equality only; rules (S1), (S2), (S3) and (S4) focus on the substitution operator and its interactions with the restriction operator; finally, rules (R1), (R2) and (R3) focus on the restriction operator only. These definitions can be considered as axioms that any implementation of tries should satisfy and on which any formal reasoning can rely.
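As a small, self-contained illustration of physical versus structural equality in OCaml (independent of OXenstored itself), compare (==) and (=) and note how a functional update shares the untouched part of a structure:

let xs = [1; 2; 3]
let ys = xs                              (* the very same reference cell *)
let zs = List.init 3 (fun i -> i + 1)    (* same contents, freshly allocated *)

let () =
  assert (xs == ys);                     (* physically equal *)
  assert (xs = zs && not (xs == zs))     (* structurally equal only *)

(* A functional "update" of the head reuses the old tail unchanged: *)
let xs' = 0 :: List.tl xs
let () = assert (List.tl xs' == List.tl xs)   (* the tail is shared *)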
Tree Representation In practice, this hierarchical mapping can be used to optimize space utilization of a trie: indeed, it can then be implemented as a finite tree whose nodes and edges are labelled by values and keys respectively. Such a tree T can be decomposed into a structure x, {ai , Ti }i∈I , where x ∈ V, I = {1 . . . n} and for any i ∈ I, ai ∈ K and Ti ∈ T such that: • T (ε) = x; • For any u = bv ∈ K such that b ∈ K, v ∈ K , we have:
If T (b) = ⊥ then there is no i such that ai = b;
(E1) ∀T ∈ T, T ≡ T
Otherwise T (u) = Ti (v) where i is such that ai = b.
(E2) ∀T1 , T2 ∈ T and ∀u, v ∈ K∗, if T1 |u ≡ T2 |v, then T1 (u) = T2 (v)
(S1) ∀T1 , T2 ∈ T and ∀u, v ∈ K∗, T1 [u/T2 ]|uv ≡ T2 |v
(S2) ∀T1 , T2 ∈ T and ∀u, v ∈ K∗, T1 [uv/T2 ]|u ≢ T1 |u
(S3) ∀T1 , T2 ∈ T and ∀u, v ∈ K∗, if u is not a prefix of v and v is not a prefix of u, then T1 [v/T2 ]|u ≡ T1 |u
(S4) ∀T1 , T2 ∈ T and ∀u, v ∈ K∗, if v ≠ ε, then T1 [uv/T2 ](u) = T1 (u)
(R1) ∀T ∈ T, T |ε ≡ T
(R2) ∀T ∈ T and ∀u, v ∈ K∗, T |u|v ≡ T |uv
(R3) ∀T1 , T2 ∈ T and ∀u ∈ K∗, if T1 ≡ T2 , then T1 |u ≡ T2 |u
Thus, for any path u in K∗ the value associated with u in T is either T (u) ∈ V if there is a node associated with the path u in the tree representing T , or T (u) = ⊥ otherwise. Figure 3 shows such a trie T : paths /vm/0, /vm/1 and /vm/2 share the same prefix /vm and thus the values T (/vm/0), T (/vm/1) and T (/vm/2) are stored in the same subtree at /vm in T . For this trie, we also have T (/vm/3) = ⊥. We are now ready to define basic operations on tries. We consider in this paper two operations, namely substitution and restriction.
Substitution First, the substitution operation consists of replacing any subtree of the original trie by another subtree. More formally, the trie substitution can be defined as follows: given two tries T1 and T2 in T and a path u in K∗, the substitution of T1 on path u by T2 , denoted by T1 [u/T2 ], is a new trie such that, for any paths v and w in K∗, we have T1 [u/T2 ](uw) = T2 (w) and, if u is not a prefix of v, then T1 [u/T2 ](v) = T1 (v). We can also extend these notations to define T [u/x], where T |u can be decomposed into ⟨y, {ai , Ti }⟩, u ∈ K∗ and x ∈ V, to be the substitution T [u/⟨x, {ai , Ti }⟩].
Restriction Second, the restriction operation selects a specific subtree in the initial tree: given a trie T in T and a path u in K∗, the restriction of T to u, denoted by T |u, is a trie such that, for any path v in K∗, (T |u)(v) = T (uv). The trie T |u is also called a sub-trie of T . The restriction operation might also be seen as a partial application on tries. Figure 3 shows an example of trie restrictions: the trie whose nodes are labeled by b, d, e and f is a sub-trie of T , obtained by restricting T to the path /vm, i.e. it is T |/vm.
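As an illustration of these two operations (and not the actual OXenstored implementation), a purely functional trie with lookup, restriction and substitution can be sketched in OCaml as follows; paths are lists of keys, and a missing value plays the role of ⊥:

type ('k, 'v) trie = Node of 'v option * ('k * ('k, 'v) trie) list

let empty = Node (None, [])

(* T(u): the value stored at path u, if any. *)
let rec find (Node (v, children)) = function
  | [] -> v
  | k :: rest ->
      (match List.assoc_opt k children with
       | None -> None
       | Some child -> find child rest)

(* T|u: the sub-trie rooted at u (None when the node does not exist, a
   simplification of the total definition used in the paper). *)
let rec restrict (Node (_, children) as t) = function
  | [] -> Some t
  | k :: rest ->
      (match List.assoc_opt k children with
       | None -> None
       | Some child -> restrict child rest)

(* T[u/t']: replace the sub-trie at u by t'.  Only the nodes on the path to
   u are rebuilt; every sub-trie off that path is reused unchanged. *)
let rec subst (Node (v, children)) path t' =
  match path with
  | [] -> t'
  | k :: rest ->
      let child =
        match List.assoc_opt k children with Some c -> c | None -> empty in
      Node (v, (k, subst child rest t') :: List.remove_assoc k children)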
Figure 2. Axiomatization of the reference cell equality on tries, with respect to substitution and restriction operations.
Physical Equality More informally, first of all, the first two rules are related to the reference cell equality behavior only. Rule (E1) states that a given symbol always physically represents the same trie and rule (E2) states that reference cell equality implies usual structural equality: if the reference cells of two tries are identical, then they also associate the same values to the same keys.
4.2 Axiomatization using Reference Cell Equality
When the question of implementing the substitution and restriction operators defined above in an efficient trie library arises, the programmer still has a lot of freedom: the given definitions do not explain directly how to express the substitution and restriction in terms of tree operations. Indeed, the above definitions hold for path/value associations only and nothing is said about the location of these values. In particular, nothing is specified about subtree sharing. However, every modern compiler of functional languages, such as Objective Caml (Leroy et al. 1996) or Scheme (Serrano 2000), enforces that multiple copies of an immutable structure share the same location in memory. It is thus possible to design a trie library which enforces the sharing of subtrees as much as possible. In order to define more formally the notion of sharing, the reference cell equality (Pitts and Stark 1993; Claessen and Sands 1999), denoted by ≡, has been introduced. This equality, also known as physical equality, is a limited form of pointer equality. It compares the location of values instead of the values themselves and thus can
Substitution Second, the next four rules are related to the substitution operator and its interactions with the restriction one. Rule (S1) states that sub-tries of a substituted trie are physically equivalent to sub-tries of the newly inserted trie. Rule (S2) states that nodes which are on the path of the substitution are never physically equivalent to the respective nodes in the initial trie: they correspond to newly allocated reference cells. Rule (S3) states that nodes which are not related to the substitution are not modified, ie. substitution is a local operation which enforces node sharing between tries produced by substitution. Finally, rule (S4) states that even though (S2) states that nodes which are on the path are newly allocated, the value they contain is preserved and is still equal to the value of the initial trie. Restriction Finally, the last rules are related to the restriction operator. Rule (R1) states that the restriction of any trie to an empty path is the trie itself. Rule (R2) states that it is (physically)
equivalent to restrict a trie twice with two given path that to restrict a trie by the composition of these two paths. Finally, rule (R3) states that if two tries are physically equal, then their restriction to the same path are also physically equal.
Definition 5.1 (transaction). A transaction is a sequence of substitutions. That is, a transaction σ belongs to (K∗ × T)∗ , i.e. either σ is the empty sequence ε or σ = [u1 /T1 ] . . . [un /Tn ], where for any i ∈ {1 . . . n}, ui ∈ K∗ and Ti ∈ T.
Tree Representation The tree representation introduced in Section 4.1 can be then reformulated using the axioms of Figure 2 to be the following: a tree T can be decomposed into a structure x, {ai , Ti }i∈I , where I = {1 . . . n}, x ∈ V and for any i ∈ I, ai ∈ K and Ti ∈ T such that:
Such a sequence can be applied to an initial trie T ∈ T in order to obtain a new trie T ′ ∈ T, which is denoted by T −σ→ T ′. This application consists of the sequential application of each substitution [ui /Ti ] for i from 1 to n, that is: T ′ = ((. . . (T [u1 /T1 ]) . . .) [un /Tn ])
• T (ε) = x;
For example, let us consider the following transaction:
• There is no ai such that T (ai ) = ⊥;
1. Write x ∈ V in the path u1 ∈ K ;
• T |ai ≡ Ti .
2. Read the value associated with the path u2 ∈ K ; 3. Write y ∈ V in the path u3 ∈ K .
In this case, rule (R2) ensures that for any u ∈ K , T |ai u ≡ T |ai |u, that is T |ai u ≡ Ti |u. Finally rule (E2) ensures that T (ai u) = Ti (u).
Let us have the trie T ∈ T representing the current state of the database. Then the above transaction can be defined as: σ = [u1 /(T1 |u1 )][u2 /(T2 |u2 )][u3 /(T3 |u3 )]
For example, Figure 3 gives the graphical representation of a trie T , which can be decomposed as follows:
where T1 is T [u1 /x], T2 is T1 [u2 /T1 (u2 )] and T3 is T2 [u3 /y]. Finally, T ′, the updated state of the database, is such that T −σ→ T ′.
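Using the trie sketch given earlier, a transaction is just a list of (path, trie) substitutions applied by a left fold, and a read can be recorded as the identity substitution T [u/(T |u)] so that the path it touches leaves a trace; again, this is an illustration of the definitions rather than OXenstored’s concrete code.

(* sigma = [u1/T1]...[un/Tn], applied from left to right. *)
let apply_transaction db (sigma : (string list * (string, string) trie) list) =
  List.fold_left (fun acc (path, t') -> subst acc path t') db sigma

(* Recording a read of [path] as an identity substitution. *)
let record_read db path =
  match restrict db path with
  | Some sub -> subst db path sub
  | None -> db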
a, {(vm, b, {(0, d, ∅), (1, e, ∅), (2, f, ∅)}), (vss, c, ∅)}
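With the trie type from the sketch above, this decomposition corresponds to the following value (keys and values are strings here purely for illustration):

let example : (string, string) trie =
  Node (Some "a",
    [ ("vm",  Node (Some "b",
        [ ("0", Node (Some "d", []));
          ("1", Node (Some "e", []));
          ("2", Node (Some "f", [])) ]));
      ("vss", Node (Some "c", [])) ])

(* find example ["vm"; "1"] = Some "e", and find example ["vm"; "3"] = None,
   the sketch's rendering of ⊥. *)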
Furthermore, the Xenstored requirements are that transactions can proceed concurrently. That is, multiple connections to the database can be opened and all of them can start independent transactions. As we consider immutable data-structures, this is not a problem: for each transaction started, the initial database is copied efficiently (as it is sufficient to copy only the location of the trie root, which is done in O(1)) and then modifications done by a transaction are applied directly to their own copy of the initial database.
However, ending (or “committing”) a transaction is more complex. Indeed, the initial database might have been modified, since concurrent transactions or non-transactional events may have updated its state. It is thus necessary to carefully design what happens when a transaction ends.
Figure 3. Example of a tree representation.
First of all, Figure 4 describes the general problem which one has to solve when committing a transaction.
In this Figure, the values associated with the paths /vm and /vm/1 are b and e respectively and T |vm designates the sub-tree whose root is labeled by b.
COMMIT ALGORITHM
5. Transactions
Input:
We stated that the Xenstored database can be represented as an immutable trie; this database can then be updated using the trie substitution operator as follows: if the trie T ∈ T is the current state of the database, then replacing the value associated with u ∈ K by x ∈ V is done by replacing the current state of the database by the trie T [u/x]. However, Xenstored is a transactional database. That is, it is possible to ask for any sequences of access and modification to be done atomically.
• A trie T1 ∈ T (the initial database state)
• A modification sequence σt ∈ (K∗ × T)∗ (the transaction)
• A modification sequence σ1 ∈ (K∗ × T)∗ (the concurrent modifications)
Output:
First of all, in order to simplify the following definitions and results, we consider any reading operation as an identity substitution: for any trie T in T and path u in K∗, getting T (u) does not modify T , but when considering sequences of read/update operations it is necessary to remember that u has been read. So in this case, we write T [u/(T |u)], as rule (S2) of Figure 2 states that in this case T ≢ T [u/(T |u)]. However, it is possible to extend the definitions and results we present in this paper to reading operations which do not update the database state.
• A modification sequence σ2 ∈ (K∗ × T)∗ (the consistent merging of σ1 and σt )
• A value r ∈ {abort, commit} (the result of the merging)
Figure 4. The general commit algorithm.
Figure 5 illustrates this situation. In this figure, the trie T1 is the initial state of the database, i.e. its state just before the transaction
We can now define transactions:
6. Transaction Coalescing
starts; the trie Tt is the state of the database copy associated with the transaction, i.e. it is obtained by applying the transaction σt to T1 ; the trie T2 is the state of the database which might have been updated since the beginning of the transaction (σ1 is empty if nothing has been executed concurrently with σt ); furthermore, the trie T3 is the state of the database after the modifications caused by σt have been taken into account to update T2 accordingly; finally, the dashed lines between Tt and T3 illustrate that this process is asymmetric: the Ti tries, for i ∈ {1, 2, 3}, are associated with visible states of the database, whereas Tt is an internal copy which carries modification information only. Thus, in case it is not possible to merge Tt with T2 , it is safe to completely discard the changes introduced by σt , that is to return σ2 = ε and r = abort, to obtain T3 ≡ T2 .
In the previous section, we defined transactions in terms of sequences of trie modifications. We explain in this section how to merge these transactions with the main state of the database. In order to properly explain how the coalescing algorithm we implemented in OXenstored works, we first introduce in Section 6.1 some basic definitions useful for dealing with trie modifications. In Section 6.2, we explain how to optimize a part of the coalescing algorithm by incrementally updating what we will call the modification prefix of the transaction. Finally, we give the main OXenstored algorithm in Section 6.3 as well as its complexity in Section 6.4.
6.1 Main Results
The first of these definitions is about comparing the modifications done on two tries. More precisely, it is about locating the longest path which address sub-tries that are not physically equal in the two given tries. This longest path is called the modification prefix and can be more formally defined as the following:
Definition 6.1 (modification prefix). Let T1 and Tt be two tries. The modification prefix of T1 and Tt , denoted by π(T1 , Tt ), is an element u in ({⊤} ∪ K∗) such that:
Figure 5. Relationship between tries used in the commit algorithm. Time goes from left to right. Thus, there exist very simple algorithms which are able to merge changes introduced by transactions into the visible database state: the first one is the one which never commits anything, that is which always returns σ2 = ε and r = abort. This behavior is correct, however it is not very useful in practice. Furthermore, CXenstored implements an algorithm which commits the changes only if T1 has not been updated since the transaction has been started, ie. if and only if T1 ≡ T2 . Otherwise, the transaction is aborted and retried after a short delay from the beginning. The hope is that nobody will update the database concurrently this time. In this case, T3 is very simple to compute, as it is exactly Tt when the transaction is committed and T2 otherwise. In practice, this simple algorithm is sufficient when transactions do not occur too often and it was implemented successfully by CXenstored. However, experiments show that in cases where the system is under load, this simple algorithm doesn’t work any more as the transaction abort-and-retry mechanism live-locks (see Section 7). Thus we designed a better algorithm, able to merge (or coalesce) concurrent transactions and implemented it in OXenstored.
• If T1 ≡ Tt then u = ⊤;
• Otherwise, u is the longest path such that, for any v ∈ K∗:
If v is a strict prefix of u, then T1 (v) = Tt (v); If v is a prefix of u, then T1 |v ≡ Tt |v; If T1 |v ≡ Tt |v then either u is a prefix of v or v is a prefix of u. Hence, π(T1 , Tt ) is the longest path such that Tt |π(T1 , Tt ) is not a sub-trie of T1 (and conversely, it the longest path such that T1 |π(T1 , Tt ) is not a sub-trie of Tt ). Moreover, let us remark that in case T1 (ε) = Tt (ε), then π(T1 , Tt ) = ε, as there is no strict prefix of ε and any path has ε as prefix; moreover (E2) and (R1) state that T1 (ε) = Tt (ε) implies that T1 ≡ Tt . Furthermore, let us consider the trie T of Figure 5 and let us consider a new trie Tt obtained by applying the transaction [uv/x][uw/y] to T , where x, y ∈ V. In this case, one can check that π(T, Tt ) is exactly u. The second of these definitions is about merging tries while enforcing a sub-trie sharing policy as much as possible. In order to understand the intuition of this definition, it is useful to consider the diagram shown in Figure 5. However, the following definition is more general and holds for any 3-tuple of tries:
Regarding the the context of utilization of Xenstored, it is important to remark that transactions are localized and are closely related to the hierarchical structure of the database. Indeed, each guest has its own configuration values and the transactions it will create will access and modify only these values (and for security reasons, we do not want it to access or modify configuration values of other guests). For guests whose ID is i these configuration values are stored in specific sub-tries of /local/domain/i. Then accessing information about disk or network configuration can be done in accessing only the sub-tries ./device/vbd and ./device/vif respectively: it is then not necessary to block other transactions accessing different sub-tries to commit. In the following, we use that remark to coalesce concurrent transactions accessing distinct sub-tries of the database.
Definition 6.2 (coalescing trie). Let T1 , T2 and Tt be three tries. The trie T3 is a coalescing trie of T2 and Tt , relative to T1 , if, for any path v in K , it satisfies the following conditions: 1. If T1 |v ≡ Tt |v, then T3 |v ≡ T2 |v; 2. If v is not a strict prefix of π(T1 , Tt ) and T1 |v ≡ T2 |v, then T3 |v ≡ Tt |v; 3. In all cases, either T3 (v) = Tt (v) or T3 (v) = T2 (v). We are now ready to introduce the main result of this paper. The following theorem states how to compute the coalescing trie of three tries organized as in the diagram of Figure 5, ie. with an initial trie T1 representing the initial state of the database, from which two distinct modification sequences lead to the two tries T2 (the database’s current state) and Tt (the local state associated with the current transaction). Basically, in the majority of cases, it is sufficient to substitute the database’s current state by the sub-trie of Tt addressed by the modification prefix of T1 and Tt . However, this theorem is not complete, that is there are some cases where
However, it is important to remember that we are still in a transactional model, that is some transactions will still eventually fail. So, there are still some corner-cases when, under load, the system will live-lock. However, in practice, Xenstored transactions often have a very specific shape (they are localized on some subtries) and thus the algorithm we give in the next section corrects this behavior and leads to more stable performance.
(a) If v = uw: Let us start from (A3): T3 |u ≡ Tt |u ⇒ (T3 |u)|w ≡ (Tt |u)|w ⇒ T3 |v ≡ Tt |v
building this coalesced trie is not possible. In this case, we can simply consider Tt as a trie related to a transaction and T2 as the current state of the database, and as already discussed, in practice it is acceptable to discard the transaction trie Tt and let the database client retry the sequence of modifications.
(using (R3)) (using (R2))
(b) If neither u is a prefix of v nor v is a prefix of u: Using Definition 6.1, we have T1 |v ≡ Tt |v. Then, we start from (A1) to obtain: T3 |v ≡ T2 [u/(Tt |u)]|v (using (R3)) (using (S3)) ≡ T2 |v Thus, if T1 |v ≡ T2 |v, then T3 |v ≡ T1 |v and thus, T3 |v ≡ Tt |v.
Theorem 6.3 (coalescing tries). Let T1 , T2 and Tt be three tries and σt be a transaction such that T1 −σt→ Tt . If π(T1 , Tt ) ≠ ⊤ and T1 |π(T1 , Tt ) ≡ T2 |π(T1 , Tt ), then the coalescing trie of T2 and Tt , relative to T1 , is: T2 [π(T1 , Tt ) / (Tt |π(T1 , Tt ))] Proof. Let us fix u = π(T1 , Tt ). We assume that u ≠ ⊤ and thus we can fix T3 to be T2 [u/(Tt |u)]. Now, we want to check that the three assertions of Definition 6.2 hold for T3 .
Thus, we showed that, in every case, if v is not a prefix of u and T1 |v ≡ T2 |v, then T3 |v ≡ Tt |v.
Before starting the core of the proof, let us show a useful result. From the definition of T3 , we have:
3. Finally, let us prove that all cases, T3 (v) = Tt (v) or T3 (v) = T2 (v). To do this, let us consider the following three possible cases: (a) v is a prefix of u; (b) u is a prefix of v and (c) neither u is a prefix of v nor v is a prefix of u.
T3 |u
≡ T2 [u/(Tt |u)]|u (using (R3)) (using (S1)) ≡ (Tt |u)|ε (using (R2)) ≡ Tt |u Thus, we can fix, within the scope of this proof, the following assumptions:
(a) If u = vw: (A3) states that T3 |v ≡ T2 [u/(Tt |u)]|v. Then using (S4), we obtain that T3 (v) = T1 (v). (b) If v = uw: Let us start from (A3): T3 |v ≡ T2 [u/(Tt |u)]|v
(A1) T3 ≡ T2 [u/(Tt |u)] (A2) T1 |u ≡ T2 |u (A3) T3 |u ≡ Tt |u We are now ready to prove Theorem 6.3. v, w are in K . 1. Following the structure of Definition 6.2, we first need to prove that T1 |v ≡ Tt |v implies that T3 |v ≡ T2 |v. To do this, let us consider the following three possible cases: (a) v is a prefix of u; (b) u is a prefix of v and (c) neither u is a prefix of v nor v is a prefix of u.
⇒ ⇒
T3 |v ≡ Tt |v T3 (v) = Tt (v)
(S3) (E2)
(c) If neither u is a prefix of v nor v is a prefix of u: Let us start from (A3): T3 |v ≡ T2 [u/(Tt |u)]|v ⇒ T3 |v ≡ T2 |v ⇒ T3 (v) = T2 (v)
(S3) (E2)
Hence, for all cases, we showed that either T3 (v) = Tt (v) or T3 (v) = T2 (v).
(a) If u = vw: T1 |v ≡ Tt |v
6.2 Computing the Modification Prefix
⇒ (T1 |v)|w ≡ (Tt |v)|w (using (R3)) (using (R2)) ⇒ T1 |u ≡ Tt |u However, as u is a valid prefix of u, Definition 6.1 states that T1 |u ≡ Tt |u. Contradiction.
In the last section, we explained how to properly coalesce transactions: it suffices to substitute the database’s current state by the transaction state on the modification prefix of the transaction trie and initial database state. However, computing the modification prefix of two tries can be costly if we do not have additional information. Fortunately, the modification sequence of the transaction can be used to efficiently compute that modification prefix: Lemma 6.4. T1 and Tt are two tries and σt = [u1 /Tb1 ] . . . [un /Tbn ]
(b) If v = uw: Using (A1), we have T3 ≡ T2 [u/(Tt |u)]. Then, we have: T3 |v ≡ T2 [u/(Tt |u)]|uw (using (R3)) (using (S1)) ≡ (Tt |u)|w (using (R2)) ≡ Tt |v Thus, T1 |v ≡ Tt |v implies that T1 |v ≡ T3 |v. Then, let us develop (A2): T1 |u ≡ T2 |u ⇒ (T1 |u)|w ≡ (T2 |u)|w (using (R3)) (using (R2)) ⇒ T1 |v ≡ T2 |v Thus, T1 |v ≡ Tt |v implies that T3 |v ≡ T2 |v.
σ
t be a non-empty transaction such that T1 −→ Tt . Then π(T1 , Tt ), the modification prefix of T1 and Tt , is exactly the longest path which is a common prefix of every ui , for i ∈ {1 . . . n}.
Proof. Let us prove Lemma 6.4 by induction on the size of σt . (i) Let us show the first induction step. Let us fix σt = [u1 /Tb1 ], that is, Tt ≡ T1 [u1 /Tb1 ] and let us prove that the longest common prefix of u1 is the modification prefix of T1 and Tt , that is u1 = π(T1 , Tt ). According to Definition 6.1, we have to show three implications: • First, we have to prove that, for any v ∈ K , if v is a strict prefix of u1 , then T1 (v) = Tt (v): if v is a strict prefix of u1 , that is u1 = vw with w = ε, then (S4) states that T1 [u1 /Tb1 ](v) = T1 (v), that is Tt (v) = T1 (v); • Second, we have to prove that, for any v ∈ K , if v is a prefix of u1 , then T1 |v ≡ Tt |v: if v is a prefix of u1 , that is u1 = vw, then (S2) states that T1 [u1 /Tb1 ]|v ≡ T1 |v, that is Tt |v ≡ T1 |v;
(c) If neither u is a prefix of v nor v is a prefix of u: Let us start from (A1): T3 |v ≡ T2 [u/(Tt |u)]|v (using (A1)) (using (S3)) ≡ T2 |v Thus, we showed that, in every case, T1 |v ≡ Tt |v implies that T3 |v ≡ T2 |v. 2. Second, let us prove that if v is not a prefix of u and T1 |v ≡ T2 |v, then T3 |v ≡ Tt |v. To do this, let us consider the following two possible cases: (a) u is a prefix of v and (b) neither u is a prefix of v nor v is a prefix of u.
• Finally, we have to prove that, for any v ∈ K , if T1 |v ≡
• T k ∈ T is a snapshot of the state of the database just before the
Tt |v, then either u1 is a prefix of v or v is a prefix of u1 : if T1 |v ≡ Tt |v, that is T1 |v ≡ T1 [u1 /Tb1 ]|v, then (S3) states that either u is a prefix of v or v is a prefix of u.
transaction started. • Ttk ∈ T is a local trie attached to the transaction and where
the modification it contains are done, while letting the main database state unchanged until the transaction is committed. Its σk − → Ttk ; value is such that T 0 −
(ii) Let us then complete the induction process. Let us fix σt = σ[un /Tbn ], with σ a non-empty transaction of size n − 1 and let σ us consider the trie T such that T1 − → T . Let us then have π = π(T1 , T ). The induction hypothesis gives us that π is also the longest common prefix of {u1 , . . . , un−1 }. Hence, for every v ∈ K :
• π k ∈ ({ } ∪ K ) is either if no modifications yet been
executed (ie. k = 0) or u if u the modification prefix associated with T k and Ttk (ie. π(T k , Ttk )), that is, it is the longest such that Ttk |π k is not a sub-trie of T k .
As already stated in Section 2.1, OXenstored gives a unique identifier to each started transaction. This identifier is created when a client sends to OXenstored a starting request for a new transaction; this identifier is associated internally with the structure ⟨T, T, ⊤⟩, where T is the current state of the database. This identifier is also sent back to the client; the client can then put this identifier in the header of packets it will send later in order to update the state of this specific transaction. Eventually, it can also notify OXenstored that it wants the transaction to commit, i.e. to push the changes introduced by the transaction into the current state of the database.
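A sketch of this per-transaction bookkeeping in OCaml, reusing the trie type and subst function from the earlier sketch, might look as follows; the record, its fields, and the use of None as the “no modification yet” marker are illustrative assumptions, not the actual OXenstored data structures.

type txn_state = {
  snapshot : (string, string) trie;  (* T^k: the database when the transaction started *)
  local    : (string, string) trie;  (* Tt^k: the private copy holding the updates *)
  prefix   : string list option;     (* pi^k: None until a first modification *)
}

let transactions : (int, txn_state) Hashtbl.t = Hashtbl.create 16
let next_id = ref 0

(* Starting a transaction copies only the root pointer of the immutable
   database, an O(1) operation. *)
let start_transaction db =
  incr next_id;
  Hashtbl.replace transactions !next_id
    { snapshot = db; local = db; prefix = None };
  !next_id

(* Recording the substitution [u_{k+1}/T_{k+1}] (Algorithm 1 below): the
   snapshot is kept, the local copy is updated, and the modification prefix
   is maintained incrementally as a longest common prefix. *)
let rec longest_common_prefix u v =
  match u, v with
  | x :: u', y :: v' when x = y -> x :: longest_common_prefix u' v'
  | _ -> []

let update_transaction st (path, t') =
  { st with
    local  = subst st.local path t';
    prefix = Some (match st.prefix with
                   | None -> path
                   | Some p -> longest_common_prefix p path) }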
(a) If v is a strict prefix of π, then T1 (v) = T (v); (b) If v is a prefix of π then T1 |v ≡ T |v; (c) If T1 |v ≡ T |v then either v is a prefix of π or π is a prefix of v. Let u be the longest common prefix of {u1 , . . . , un } and let us show that u is also the modification prefix of T1 and T [un /Tbn ]. According to Definition 6.1, we have to show three implications: • First, we have to prove that, for any v ∈ K , if v is a strict
prefix of u, then T1 (v) = Tt (v): if v is a strict prefix of u, we have u = vw with w = ε, and thus we have un = vw with w = ε. Then (S4) states that T [un /Tbn ](v) = T (v). Using (a), we obtain that Tt (v) = T1 (v), with v being the longest common prefix of {u1 , . . . , un }; • Second, we have to prove that, for any v ∈ K , if v is a prefix of u, then T1 |v ≡ Tt |v: if v is a prefix of u, we have u = vw, for w ∈ K , thus we have un = vw. We can then use (S2) to obtain that T [un /Tbn ]|v ≡ T1 |v. Using (b), we obtain that Tt |v ≡ T1 |v, with v being the longest common prefix of {u1 , . . . , un }; • Finally, we have to prove that, for any v ∈ K , if T1 |v ≡ Tt |v, then either u is a prefix of v or v is a prefix of u: if T1 |v ≡ Tt |v then it is due either to T1 |v ≡ T |v or to T |v ≡ Tt . In the first case, (c) states that either v is a prefix of π or π is a prefix of v; in the second one (S3) states that either v is a prefix of un or un is a prefix of v. Thus, if p is the longest common prefix of π and un , that is the longest common prefix of {u1 , . . . , un }, then we have either v is a prefix of p or p is a prefix of v;
In the following, we give the algorithms used by OXenstored to update the structure associated with a transaction and to commit the changes induced by a transaction into the current database state. We assume here that (i) a client already started a transaction σk , whose associated structure is T k , Ttk , π k ; (ii) the client has been sending the requests to either update or commit a transaction; and (iii) OXenstored has already decoded the packet header of that request and the associated transaction structure is exactly T k , Ttk , π k . Updating a Transaction First of all, let us detail how to update the structure associated with a transaction. Let us assume that OXenstored decoded the packet content and found that the clients want to substitute the trie Tk+1 on path uk+1 . OXenstored has to compute the new transaction structure T k+1 , Ttk+1 , π k+1 . We have: • For any k ∈ {1 . . . n}, T k is identical, as it is a snapshot of the
state just before the beginning of the transaction; • Definition 5.1 states that a transaction is updated by applying to
its local state the substitution [uk+1 /Tk+1 ];
• Lemma 6.4 shows that the modification prefix can be com-
Thus we showed that Lemma 6.4 is valid for any size of σt .
puted online: indeed it is sufficient to compute incrementally the longest prefix of π k and uk+1 .
It is then straightforward to derive from Lemma 6.4 an incremental algorithm to compute the modification prefix of tries associated with a transaction: indeed, at each step of the sequence, it suffices to take the longest common prefix between the path currently modified by the current step and the modification prefix already computed from the beginning of the transaction.
Input: T k , Ttk , π k and the substitution [uk+1 /Tk+1 ] Output: the new transaction structure T k+1 , Ttk+1 , π k+1 T k+1 ← T k ; Ttk+1 ← Ttk [uk+1 /Tk+1 ]; π k+1 ← longest-common-prefix(π k , uk+1 );
6.3 Algorithms OXenstored is able to properly coalesce unrelated transactions, ie. transactions which modify disjoint subtrees of the current store. To do this efficiently, it exploits the functional tree representation of the database. More precisely, consider a transaction σ = [u1 /T1 ] . . . [un /Tn ]. This transaction is executed incrementally so we can also consider a sequence of time tk , for k ∈ {0 . . . n}, where t0 corresponds to a time just before the transaction starts, ie. to the sub-sequence σ0 = ε of σ and each tk corresponds to a time when only the sub-sequence σk = [u1 /T1 ] . . . [uk /Tk ] of σ has been executed. We can now define, for any k ∈ {0 . . . n}, the state of a transaction at the time tk as a structure T k , Ttk , π k , where:
return T k+1 , Ttk+1 , π k+1 Algorithm 1: Updating a transaction. These lead directly to Algorithm 1, that explains how to compute T k+1 , Ttk+1 , π k+1 from T k , Ttk , π k and [uk+1 /Tk+1 ], for any k ∈ {1 . . . (k − 1)}. Note that this algorithm uses the function longest-common-prefix(u,v) which returns either the longest common prefix of u and v if u, v ∈ K or returns v if u = (and u if v = ).
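To make Algorithm 1 concrete, the following OCaml sketch shows one possible shape of the per-transaction structure and of the update step. It is only an illustration under our own assumptions: the trie type, the transaction record and every function name below are ours and simplified, not the actual OXenstored code.

  (* Illustrative model of Section 6.3: paths, immutable tries and the
     per-transaction structure <T k, Ttk, pi k>. *)
  type key = string                       (* one path component *)
  type path = key list                    (* u, v, ... *)
  type 'a trie = Node of 'a option * (key * 'a trie) list

  type 'a transaction = {
    snapshot : 'a trie;                   (* T k: state when the transaction started *)
    local    : 'a trie;                   (* Ttk: local state after k substitutions *)
    prefix   : path option;               (* pi k: None stands for epsilon *)
  }

  (* longest-common-prefix, with the epsilon convention of Algorithm 1 *)
  let rec lcp u v = match u, v with
    | x :: u', y :: v' when x = y -> x :: lcp u' v'
    | _ -> []
  let longest_common_prefix p u =
    match p with None -> Some u | Some p -> Some (lcp p u)

  (* substitution T[u/T']: rebuilds only the nodes along u, shares the rest *)
  let rec subst t u t' = match u, t with
    | [], _ -> t'
    | x :: u', Node (v, children) ->
        let child = try List.assoc x children with Not_found -> Node (None, []) in
        Node (v, (x, subst child u' t') :: List.remove_assoc x children)

  (* Algorithm 1: updating a transaction with the substitution [u(k+1)/T(k+1)] *)
  let update txn u t' =
    { snapshot = txn.snapshot;
      local    = subst txn.local u t';
      prefix   = longest_common_prefix txn.prefix u }

Each call to update only touches the nodes on the modified path, which is what makes the incremental computation of the modification prefix cheap.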
Committing a Transaction Second, let us detail how to commit the structure associated with a transaction into the current state of the database, coalescing concurrent transactions when possible. Let us assume that OXenstored decoded a packet whose header contains an identifier internally associated with the structure ⟨T k, Ttk, π k⟩ and whose content contains a commit order. OXenstored has to compute the new state of the database, which should share, as much as possible, nodes from the current database state T and from the transaction local state Ttk; more precisely, we want to ensure that this new database state is exactly the coalescing trie of T and Ttk, relative to T k (see Definition 6.2). Basically, there are three cases:
• The first case is when no modifications are done concurrently with the current transaction on the entire database. This case can be checked easily. Indeed, rule (S2) states that, in case a substitution is done between T and T k, then there exists u ∈ K such that T|u ≡ T k. Finally, rule (R3) states that T ≡ T k. When this case is detected, it is then safe to replace the database's current state by the local state of the transaction.
• The second case is when no modifications are done concurrently with the current transaction on the sub-trie corresponding to the scope of that transaction. Being able to detect this case efficiently is the main improvement from CXenstored to OXenstored. However, in our setting, this is relatively easy. Indeed, Definition 6.1 states that π k is the longest path such that Ttk|π is not a sub-trie of T k, that is, Ttk|π is the biggest sub-trie modified by the current transaction. Thus, to detect if no modifications are done concurrently with the current transaction, it suffices to check that the corresponding sub-trie in the database's current state has not been modified, ie. that Tt|π ≡ Ttk|π (as stated by rules (S2) and (R3), as in the last bullet). When this case is detected, Theorem 6.3 states that it is safe to replace the sub-trie of the database's current state by the corresponding sub-trie of the transaction's local state.
• The last case relates to aborting the transaction if it is not possible to commit it. In this case, the database's current state remains unchanged and the client is notified that it has to retry the transaction.
These lead to Algorithm 2.

Input: the current database state T and ⟨T k, Ttk, π k⟩
Output: the new database state and the transaction status
  if T ≡ T k then                 /* if the database has not been updated */
    return (Ttk, commit);
  else if T|π k ≡ T k|π k then    /* if the sub-trie has not been updated */
    T ← T[π k/(Ttk|π k)];
    return (T, commit);
  else                            /* otherwise, abort */
    return (T, abort);
  end
Algorithm 2: Committing a transaction.
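A matching sketch of Algorithm 2 is given below, under the same assumptions as the previous snippet (our own simplified types and names, not the OXenstored sources). In OXenstored the equal argument would be instantiated with physical equality ( == ), which is meaningful here because unmodified sub-tries are shared between versions of the store and which costs O(1).

  (* Minimal stand-ins for the structures used above. *)
  type 'a txn = { snapshot : 'a; local : 'a; prefix : string list option }

  (* [commit] returns the new database state together with the outcome.
     [restrict t p] plays the role of t|p and [subst t p t'] of t[p/t']. *)
  let commit ~equal ~restrict ~subst ~current txn =
    if equal current txn.snapshot then
      (txn.local, `Commit)                     (* the whole database is untouched *)
    else match txn.prefix with
      | Some p when equal (restrict current p) (restrict txn.snapshot p) ->
          (* only the sub-trie in the scope of the transaction matters *)
          (subst current p (restrict txn.local p), `Commit)
      | _ ->
          (current, `Abort)                    (* conflict: ask the client to retry *)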
Extension In the extended version of our implementation, where we do not consider that reading operations update the state of the database, the algorithm remains more or less the same. However, instead of keeping a unique modification prefix π k for each transaction, we keep two of them: one for read operations, πR, and one for write operations, πW. Then, for deciding if the sub-tries used by the transaction have been concurrently updated in the database's current state, we check that T|πR ≡ T k|πR and T|πW ≡ T k|πW hold. If this is the case, we only update the sub-trie corresponding to write operations, ie. we set T ← T[πW/(Ttk|πW)]. This could be helpful to reduce the conflict rate between concurrent transactions which read and write in distinct sub-tries and dramatically increase the efficiency of the coalescing in case a transaction only performs read operations.

6.4 Complexity
In order to find the complexities of the algorithms, we can first derive the complexity of the trie operations from the axioms of Figure 2 and the tree representation detailed in Section 4.2.
First of all, using the tree representation of tries and assuming a linear complexity for accessing the children of a node, it is straightforward to obtain that, given tries T, T′ ∈ T(K, V) and a path u ∈ K, the complexity of computing the restriction T|u and the substitution T[u/T′] is O(|u||K|) (thus, independent of the size of T or of T′). Moreover, the complexity of computing the physical equality T ≡ T′ is in O(1).
Thus, the complexity of Algorithm 1 is in O(|uk+1||K|), where |uk+1| is the size of the path which is modified, and the complexity of Algorithm 2 is in O(|π k||K|), where |π k| is the size of the longest common prefix of the paths which are modified. Thus, the complexity of these algorithms does not depend at all on the size of the current database, which gives very stable performance. Moreover, in practice, the number of keys K and the path size are very rarely over 10, which can already lead to databases with 10^10 entries, while keeping a good level of efficiency for modification operations.
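The constant-time equality test above relies on sharing: restriction and substitution rebuild only the nodes along the modified path and leave every other sub-trie physically untouched. The following self-contained sketch (again with our own simplified trie type, not the OXenstored one) makes this observable with OCaml's physical equality ( == ).

  type 'a trie = Node of 'a option * (string * 'a trie) list

  let rec restrict t u = match u, t with            (* T|u *)
    | [], _ -> t
    | x :: u', Node (_, children) -> restrict (List.assoc x children) u'

  let rec subst t u t' = match u, t with            (* T[u/T'] *)
    | [], _ -> t'
    | x :: u', Node (v, children) ->
        let child = try List.assoc x children with Not_found -> Node (None, []) in
        Node (v, (x, subst child u' t') :: List.remove_assoc x children)

  let () =
    let t  = Node (None, [ "a", Node (Some 1, []); "b", Node (Some 2, []) ]) in
    let t' = subst t [ "a" ] (Node (Some 3, [])) in
    (* the untouched sibling "b" is shared, hence recognisable as unchanged in O(1) *)
    assert (restrict t [ "b" ] == restrict t' [ "b" ]);
    assert (not (restrict t [ "a" ] == restrict t' [ "a" ]))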
7. Evaluation
In the previous sections, we explained how the OXenstored algorithm for coalescing transactions works and we argued that, although transactions can still be aborted, the rate of committed transactions should, in practice, be very high. In this section, we validate this statement experimentally, by comparing the performance of CXenstored and OXenstored. Not surprisingly, the results we obtained show that OXenstored scales much better than CXenstored: indeed CXenstored exhibits a live-lock in a very common situation (sequentially starting as many guests as possible), unlike OXenstored.
More precisely, the tests we ran are the following: first, in Section 7.1, we measured the performance under load of CXenstored and OXenstored in an isolated context, by creating random transactions and without interacting with the Xen hypervisor. Then, we measured the performance of CXenstored and OXenstored when they are interacting with the Xen hypervisor, without and with load, in Section 7.2 and Section 7.3 respectively.

7.1 Performance under Load
Experiment Description We first wanted to test the transaction time latency mechanism, on a few workloads, with no interaction with the Xen hypervisor. We wanted to simulate the behavior of starting a guest, as described in Section 2. So, we created 128 processes, each simulating the behavior of a guest and performing the following transaction:
• They all read the value of the same path in the database; and
• They all write in different parts of the database.
Furthermore, we repeated that transaction 500 times on each process. Then, on each process, we measured the time taken by each transaction to commit and we ordered the results.
Experiment Results Part of the results of this experiment are shown in Figure 6. More precisely, these figures show for each process the time taken by:
(a) The fastest transaction to commit;
(b) The average time of the transactions to commit; and
(c) The time taken by the longest transaction to commit.
Graph (a) shows that transactions are always committed faster in OXenstored than in CXenstored. Moreover, in the best case, the commit rate of OXenstored is very stable at around 700 transactions per second, as opposed to the commit rate of CXenstored, which is around 300 transactions per second. Furthermore, in the average case, Graph (b) shows that performance is much more stable for OXenstored than for CXenstored, as the average delays for OXenstored are very low (always under 1 second), unlike CXenstored, for which a few transactions took over 10 seconds to commit. Finally, in the worst case, Graph (c) shows that CXenstored can live-lock, as some commit delays are over 12 minutes. On the other hand, even in the worst case, OXenstored performance remains very stable (around 1 second).
On graphs (a) and (b), there is a small glitch around the 17th process. It is not totally clear what is the cause of that behavior.
[Figure 6: three graphs, (a) minimum delays, (b) average delays and (c) maximum delays, plotting the duration in seconds against the process name for CXenstored and OXenstored.]
Figure 6. Comparison between CXenstored and OXenstored for the time taken by 128 concurrent processes to commit 500 transactions which read from the same path and write on different ones. Processes are ordered following their completion time.
Experiment Conclusions The obtained results clearly indicate that CXenstored cannot deal with more than 100 concurrent transactions, as opposed to OXenstored, which does not show any sign of performance issues. Furthermore, when access to the database is the performance bottleneck, OXenstored is always quicker than CXenstored, even in the case of few concurrent transactions. Finally, OXenstored has a very small variance compared to CXenstored, which means that we can expect that the system will behave in a more predictable way.

7.2 Performance when interacting with the Xen hypervisor
Experiment Description Subsequently, we designed a more realistic test than the one of Section 7.1, to compare the performance of CXenstored and OXenstored to start real guests. For this experiment, we installed one guest running "Windows XP" and then we repeated the following steps 100 times:
1. clone 50 guests from the initial one (a clone is functionally equivalent to a fresh install, but is quicker);
2. sequentially start all the cloned guests;
3. sequentially shut down all the cloned guests; and
4. uninstall (ie. destroy) all the cloned guests.
At the same time, we measured the time taken for starting and shutting down each guest.
Experiment Results The results are shown in Figure 7. These figures show the integral of the distribution probability for the time taken to complete guest start and shutdown. The results for OXenstored and CXenstored are quite similar: half the guests are started in less than 4.5 seconds and are shut down in less than 2 seconds. Furthermore, 90% of the guests are started by OXenstored and CXenstored in less than 7.5 seconds and are shut down in less than 2.9 seconds.
[Figure 7: two graphs, (a) Starting a guest and (b) Shutting down a guest, plotting the cumulative probability against the duration in seconds for CXenstored and OXenstored.]
Figure 7. Integral of the distribution probability of the time taken by CXenstored and OXenstored to start a real guest.
Experiment Conclusions The results we obtained show that there are no major differences between OXenstored and CXenstored in this case, that is, CXenstored is clearly not a bottleneck under normal load.

7.3 Performance under load when interacting with the Xen hypervisor
Experiment Description In this experiment, we wanted to start as many guests as possible and compare the performance of CXenstored and OXenstored. Thus, we needed minimalist guests which do not use many physical resources. Hence, we used a modified version of "mini-OS", a very small operating system distributed with the Xen hypervisor sources, in order to start a lot of very small guests performing long conflicting transactions concurrently. More precisely, we created 160 "mini-OS" guests, with 1 virtual disk and
1 virtual network each, which all do the following when they are started:
1. start a transaction;
2. write a value in /local/domain/X/device/foo (where X is the current guest ID) using the opened transaction;
3. sleep 1 second;
4. write a value in /local/domain/X/device/bar (where X is the current guest ID) using the opened transaction;
5. close the opened transaction;
6. sleep 1 second; and
7. go back to step 1.
These transactions simulate the way some monitors report statistics inside each guest (such as the current memory usage for guests which are not modified to run on top of the Xen hypervisor, such as "Windows" guests). We then sequentially started as many guests as we could on one host and we measured the cumulative time taken to start each of the guests.
Experiment Results Figure 8 shows the cumulative time taken to start as many guests as possible in less than 20 minutes. CXenstored starts the first 30 guests at a constant rate of 2 seconds per guest, but it begins to live-lock around 40 guests. Finally, after 20 minutes, it never commits the transaction which should configure and start the 70th guest. On the other hand, OXenstored keeps starting the guests at a constant rate of one every 2 seconds and it started all the 160 guests in less than 6 minutes.
[Figure 8: one graph plotting the cumulative duration in minutes against the number of guests started, for CXenstored and OXenstored.]
Figure 8. Comparison of the time taken by CXenstored and OXenstored to start as many mini-OS guests as possible, when these guests perform long transactions in a loop.
Experiment Conclusions It is quite clear that OXenstored is not at all influenced by the long transactions, as opposed to CXenstored, which begins to live-lock when more than 40 guests are started. Hence, the performance of OXenstored is more stable and it scales much better than CXenstored.

8. Related Work
Transactional databases have been widely studied in the last few decades (Härder and Reuter 1983; Bernstein et al. 1987). They provide to the user a very simple way of encapsulating a group of actions, called a transaction, with the following properties: if one part of the transaction fails, the entire transaction fails (atomicity); at every moment, the database remains in a consistent state: only valid data are written in the database (consistency); other operations cannot access or see the data of an intermediate state during a transaction (isolation); once the user has been notified that one of its transactions has succeeded, the transaction will persist and not be undone, even in case of a failure (durability). These properties make the concurrent programming of such systems very easy, as the user does not have to worry anymore about using locks to ensure data consistency.
The most common way to ensure atomicity in transactional databases is a mechanism called compensation (Gray and Reuter 1992). In case of failure, compensation consists of executing the compensating actions, corresponding to the executed actions of the failed process, in the reverse order of their execution. This approach has recently gained renewed popularity in the context of general-purpose programming known as "Software Transactional Memory" (Shavit and Touitou 1995; Harris and Fraser 2003; Ennals 2005; Riegel et al. 2006), where the purpose is to minimize the use of locks (by the use of lock-free data structures, for example). The same compensation mechanisms have also been applied successfully to build efficient and robust transactional web-services (Biswas 2004; Biswas et al. 2008). Due to the nature of the compensation mechanism (ie. keeping a list of the actions executed by each started transaction), there is a straightforward but very costly way to merge transactions: when a transaction is committed,
only the started transactions that read values written by the committed transactions are aborted. Note that this may lead to further abortions of other transactions and so on, an effect called cascading aborts. This effect can lead to very hard-to-predict performance.
Thus, our approach based on "optimistic" transaction control, where each transaction has a local copy (called a shadow tree in the database literature) of the database which it is modifying, leads to much more predictable performance on tree-structured databases. Hence, a few other systems have used the same form of "optimistic" transaction control based on shadow trees, the first of them probably being the object store of GemStone (Butterworth et al. 1991), two decades ago. However, our approach is much crisper and simpler, as the applications we consider have a greater degree of locality to be exploited compared to the more general object database applications usually considered.
Furthermore, functional data-structures, including tries, have been widely studied (see Okasaki's book (Okasaki 1999), for example). Our axiomatization is a step towards integrating trie properties in a theorem prover in order to demonstrate the correctness of algorithms on tries. Moreover, to the best of our knowledge, our work is the first to consider the use of tries in a transactional setting. A few other functional structures have been studied in that context, such as functional binary search trees (Trinder 1989). In that work, transactions are pure functions taking an immutable database state as an argument and returning an output and a new database state. Concurrent transactions are serialized by a database manager which then uses a data dependency analysis and the referential transparency property of pure functions to efficiently schedule the transactions' operations. Our approach is more flexible as it allows us to mix functional and imperative features when using transactions (as in Objective Caml).
More recently, the aim of the Harmony project (Foster et al. 2007) is to define a safe way of synchronizing different views of a shared persistent tree data-structure. The synchronization mechanism of Harmony can be seen as the commit phase of a transactional database, but where the only knowledge is the current state of the transaction, rather than a trace of modifications as is the case with compensation. The merging algorithm used by that tool needs to be specified by the users and thus can become very complex. On the other hand, our approach is much simpler and more efficient, but it requires some extra knowledge about the transaction which is committed (ie. the modification prefix) and thus cannot be used in our work.

9. Conclusion
This paper describes the design of OXenstored, a successful application of functional technology in an industrial setting which demonstrates the effectiveness of a limited form of pointer comparison in a functional context. Indeed, the OXenstored database is represented and manipulated using an immutable prefix tree, for which we give a simple and formal semantics. For performance reasons, OXenstored uses reference cell comparison. This is a limited form of pointer comparison which can be elegantly integrated with the prefix-tree semantics, ensuring a safe and efficient way to merge concurrent transactions.
We also demonstrate the validity of our approach by implementing in Objective Caml the algorithms described in this paper and evaluating them against CXenstored, the Xenstored service written in C and distributed with the Xen hypervisor sources. These experimental results show that our transaction processor is not only one third of the size of its C counterpart, but significantly outperforms it. As a direct consequence of these results, OXenstored will replace CXenstored in future releases of XenServer.

Acknowledgements
The authors would like to thank very much Jonathan Davies and Richard Sharp for making useful suggestions. They would also like to thank the anonymous ICFP reviewers whose comments greatly improved some parts of the paper.

References
Paul Barham, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex Ho, Rolf Neugebauer, Ian Pratt, and Andrew Warfield. Xen and the Art of Virtualization. In SOSP '03: Proceedings of the nineteenth ACM symposium on Operating systems principles, pages 164–177. ACM Press, 2003.
Philip A. Bernstein, Vassos Hadzilacos, and Nathan Goodman. Concurrency Control and Recovery in Database Systems. Addison-Wesley, 1987.
Debmalya Biswas. Compensation in the World of Web Services Composition. In SWSWPC, pages 69–80. LNCS 3387, 2004.
Debmalya Biswas, Thomas Gazagnaire, and Blaise Genest. Small logs for transactional services: Distinction is much more accurate than (positive) discrimination. IEEE International Symposium on High-Assurance Systems Engineering, pages 97–106, 2008.
Paul Butterworth, Allen Otis, and Jacob Stein. The GemStone object database management system. Commun. ACM, 34(10):64–77, 1991.
Koen Claessen and David Sands. Observable Sharing for Functional Circuit Description. In Asian Computing Science Conference, pages 62–73. Springer-Verlag, 1999.
Robert J. Creasy. The Origin of the VM/370 Time-Sharing System. IBM Journal of Research and Development, 25(5):483–490, 1981.
Robert Ennals. Efficient software transactional memory. Technical Report IRC-TR-05-051, Intel Research Cambridge, 2005.
J. Nathan Foster, Michael B. Greenwald, Jonathan T. Moore, Benjamin C. Pierce, and Alan Schmitt. Combinators for bidirectional tree transformations: A linguistic approach to the view-update problem. ACM Trans. Program. Lang. Syst., 29(3):17, 2007.
Edward Fredkin. Trie memory. Commun. ACM, 3(9):490–499, 1960.
Jim Gray and Andreas Reuter. Transaction Processing: Concepts and Techniques. Morgan Kaufmann Publishers Inc., 1992.
Theo Härder and Andreas Reuter. Principles of transaction-oriented database recovery. ACM Comput. Surv., 15(4):287–317, 1983.
Tim Harris and Keir Fraser. Language Support for Lightweight Transactions. In OOPSLA, 2003.
Xavier Leroy, Jérôme Vouillon, Damien Doligez, et al. The Objective Caml system. http://caml.inria.fr/ocaml/, 1996.
Chris Okasaki. Purely Functional Data Structures. Cambridge University Press, 1999.
Andrew Pitts and Ian Stark. Observable properties of higher order functions that dynamically create local names, or: What's new. In Mathematical Foundations of Computer Science, Proc. 18th Int. Symp., pages 122–141. Springer-Verlag, 1993.
Torvald Riegel, Christof Fetzer, and Pascal Felber. Snapshot Isolation for Software Transactional Memory. In Proceedings of the First ACM SIGPLAN Workshop on Languages, Compilers, and Hardware Support for Transactional Computing, 2006.
Manuel Serrano. Bee: an integrated development environment for the Scheme programming language. Journal of Functional Programming, 10(4):353–395, 2000.
Nir Shavit and Dan Touitou. Software Transactional Memory. In Symposium on Principles of Distributed Computing, pages 204–213, 1995.
Phil Trinder. A functional database. PhD thesis, 1989.
Experience Report: Using Objective Caml to Develop Safety-Critical Embedded Tools in a Certification Framework
Bruno Pagano (1), Olivier Andrieu (1), Thomas Moniot (1), Benjamin Canou (2,3), Emmanuel Chailloux (2), Philippe Wang (2), Pascal Manoury (3), Jean-Louis Colaço * (4)
(1) Esterel Technologies, 8, rue Blaise Pascal, 78890 Elancourt, France. {Bruno.Pagano,Olivier.Andrieu,Thomas.Moniot}@esterel-technologies.com
(2) Laboratoire d'Informatique de Paris 6 (LIP6 - CNRS UMR 7606), Université Pierre et Marie Curie, UPMC, 104, avenue du Président Kennedy, 75016 Paris, France. {Benjamin.Canou,Emmanuel.Chailloux,Philippe.Wang}@lip6.fr
(3) Laboratoire Preuves, Programmes et Systèmes (PPS - CNRS UMR 7126), Université Pierre et Marie Curie, UPMC, 175 rue du Chevaleret, 75013 Paris, France. [email protected]
(4) Prover Technology S.A.S, 21 Rue Alsace Lorraine, 31000 Toulouse, France. [email protected]
Abstract
High-level tools have become unavoidable in industrial software development processes. Safety-critical embedded programs don't escape this trend. In the context of safety-critical embedded systems, the development processes follow strict guidelines and requirements. The development quality assurance applies as much to the final embedded code as to the tools themselves. The French company Esterel Technologies decided in 2006 to base its new SCADE SUITE 6TM certifiable code generator on Objective Caml. This paper outlines how challenging this has been in the context of safety-critical software development ruled by rigorous norms such as DO-178B, IEC 61508 and EN 50128.
Categories and Subject Descriptors D.1.1 [Applicative (Functional) Programming]; D.2.1 [Requirements/specifications]: Tools; D.2.5 [Testing and Debugging]: Testing tools
General Terms Reliability, Experimentation, Measurement, Verification
Keywords safety critical, DO-178B, Objective Caml, SCADE SUITE 6TM code generator

1. Introduction
The civil avionics authorities defined a couple of decades ago the certification requirements for aircraft embedded code. The DO-178B standard (RTCA/DO-178B 1992) defines all the constraints ruling aircraft software development. This procedure is included in the global certification process of an aircraft, and now has equivalents in other industrial sectors concerned by critical software (FDA Class III for the medical industry, EN 50128 for railway applications, IEC 61508 for the car industry, etc.).
The Esterel Technologies company markets SCADE SUITE 6TM (Berry 2003; Camus and Dion 2003), a model-based development environment dedicated to safety-critical embedded software. The code generator (KCG) of this suite, which translates models into embedded C code, is DO-178B compliant and allows to shorten the certification process of avionics projects which make use of it. Using such a code generator allows the end user (the one that develops the critical embedded application) to reduce the development costs by avoiding the verification that the generated code implements the SCADE model (considered here as a specification). The verification and validation activities are reduced to providing evidence that the model meets the functional requirements of the embedded application. In this way, a large part of the certification charge weighs on the SCADE framework and this charge is shared (through the tool provider) between all the projects that make use of this technology.
The first release of the compiler was implemented in C and was available in 1999. It was based on a code generator written in an Eiffel dialect (LOVE) (ECMA 2005) and was, at that time, rewritten in the mainstream C language to avoid the risk of being rejected by certification authorities. Then, since 2001, Esterel Technologies has investigated new compiling techniques (Colaço and Pouzet 2003) and language extensions (Colaço et al. 2005). The aim was to demonstrate that using an academic approach for the specification of the language (type systems, etc.) and Objective Caml (OCaml) for its implementation was also an efficient and clean approach for an industrial project. The project quickly led to the expected good technical results but took some time to convince managers that such an approach should be accepted by a reasonable certification authority. It has now appeared that OCaml allowed to significantly reduce the distance between the specifications and the implementation of an engineering tool, and to have a better traceability between a formal description of the input language of the tool and its compiler implementation. Thus, Esterel Technologies has designed its new SCADE SUITE 6TM in OCaml. This paper describes the specific development activities performed by Esterel Technologies to certify KCG with the several norms DO-178B, IEC 61508 and EN 50128. The differences and particularities of these standards are not in the scope of this paper; for convenience, we focus on the FAA standard (DO-178B, level A).
* This work started while the author was at Esterel Technologies.
1 SCADE stands for Safety Critical Application Development Environment.
2 KCG stands for qualifiable Code Generator.
2. Certification of safety critical code
The well known V-cycle dear to the software engineering industry is the traditional framework of any certified/qualifiable development project. Constraints are reinforced by DO-178B but the principles stay the same: the product specifications are written by successive refinements, from high level requirements to low level design and then implementation. Each step involves several independent verification activities: checking complete traceability of requirements between successive steps, testing each stage of code production with adequate coverage, code and coding rules reviews.
[Figure 1 shows the V-cycle: System Requirements Analysis, Software High Level Specifications, Architectural Design, Detailed Design and Coding on the descending branch, facing Software Receipt, Validation tests, Integration testing and Unitary testing on the ascending branch.]
Figure 1. V-cycle
2.1 The programming language in the development process
Traceability is one of the keywords of the compliance to DO-178B: any part of any activity of the project cycle pertains to other parts of the previous and of the following activities. For instance, any requirement of the detailed design has to be related to one or several requirements of the architectural design and to the lines of code implementing this requirement. Furthermore, the relation has to exist with the corresponding verification activity. In our example, the detailed design requirement has to be related to the unitary tests that exercise this requirement. The evidence of these relations is one of the most important documents of a certification file.
In a DO-178B compliant project, to ensure that the software satisfies all the requirements and that any single line of the software is necessary to its purpose, nothing can appear in the code without being clearly specified and identified first, with a good traceability to high-level requirements (the specifications). These traceability links pass through architectural design and detailed design requirements. The choice of a programming language close and adapted to the software to develop is very important, since a well-suited language leads to a simpler and more direct way to encode the software requirements and, consequently, a better and simpler traceability. In the same vein, when the programming language is adapted to the developed software, the architecture of the software is close to the functional description of the software. The links between architecture and specifications, and between architecture and detailed design, are simpler to establish and verify. The code is tested, but it is also reviewed by other developers. To ease this verification, the code must be short (in the sense that it contains more about fundamental algorithms than about resource management) and readable. Furthermore, the libraries and especially the system library have to be treated in the same way as the main source code: it is mandatory to have the same traceability and verifications on any specific part of the code. So, the choice of a suitable programming language is relevant for the various verification activities required in DO-178B compliant projects. This is, of course, always true, but becomes crucial when one has to defend a project in front of a certification authority.
2.2 Code coverage
The DO-178B defines several verification activities and, among these, a test suite has to be constituted to cover the set of specifications of the software in order to verify and to establish the conformity of their implementation. As for any activity during a DO-178B compliant development process, the verification activities are evaluated. Some criteria must be reached to decide that the task has been completed. One of these criteria is the activation of any part of the code during a functional test. On this particular point, more than a complete structural exploration of the code, the DO-178B standard requires that a complete exploration of the control flow has to be achieved, following the Modified Condition/Decision Coverage (MC/DC) measurement that we explain below.
• A decision is the Boolean expression evaluated in a test instruction to determine the branch to be executed. It is covered if tests exist in which it is evaluated to true and to false.
• A condition is an atomic subexpression of a decision. It is covered if there exist tests in which it is evaluated to true and to false.
• The MC/DC requires that, for each condition c of a decision, there exist two tests which change the decision value while keeping the same valuations for all conditions but c. It ensures that each condition can affect the outcome of the decision and that all contribute to the implemented function (no dead code is wanted).
The MC/DC is properly defined on an abstract Boolean data flow language (Hayhurst et al. 2001) with a classical automata point of view. The measure is extended to imperative programming languages, especially the C language, and is implemented in verification tools able to compute this measure. The challenging consequences of the choice of OCaml instead of the usual C or ADA on MC/DC test campaigns are described in Section 4.
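To make these three notions concrete, here is a small example of a decision with three conditions and a test set achieving MC/DC; the OCaml function and the test vectors are purely illustrative and are not taken from KCG.

  (* decision with three conditions: (door_open && engine_on) || override *)
  let alarm ~door_open ~engine_on ~override =
    (door_open && engine_on) || override

  (* Four tests reach MC/DC:
       door_open engine_on override -> decision
     1. true      true      false  -> true
     2. false     true      false  -> false   (pair with 1: door_open decides)
     3. true      false     false  -> false   (pair with 1: engine_on decides)
     4. true      false     true   -> true    (pair with 3: override decides)
     Tests 1 and 2 alone already give decision coverage; tests 1 to 4 also make
     each condition take both truth values, i.e. condition coverage. *)
  let () =
    assert (alarm ~door_open:true  ~engine_on:true  ~override:false = true);
    assert (alarm ~door_open:false ~engine_on:true  ~override:false = false);
    assert (alarm ~door_open:true  ~engine_on:false ~override:false = false);
    assert (alarm ~door_open:true  ~engine_on:false ~override:true  = true)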
2.3 Source to object code traceability
The code verification takes place essentially on the source code. But the real need is to assert that all verified properties of the source code are also properties of the object code and, indeed, of the executable binary. Most of the time these verification activities are neither possible to do on the binary, nor on the object file. To handle this contradiction, the process requires to verify that the properties of the source code are also properties of the object code. The compiler analysis focuses on three points:
• the traceability of the object code generation: by transitivity, one can deduce from it the traceability of the requirements in the object code.
• the management of the system calls: processes for safety critical applications are very suspicious about calls of system subroutines.
• conservation of the control flow: the code coverage measurement is relevant if and only if the control flow is traceable from source to object code.
More than the choice of a programming language, a DO-178B project manager has to choose the complete development suite, integrating the code generator and test management tools which will be the most convenient to manage all the development activities, including coding but also all the verification activities about this coding. Section 5 describes how the three above requirements can be addressed.
3. Using OCaml in the development process
OCaml is a functional, imperative and object oriented ML dialect. The development environment provided by INRIA contains a native compiler dedicated to the most common architectures. As a functional language, OCaml is adapted to symbolic computation and so particularly suitable for compiler design and formal analysis tools, which rely mainly on symbolic computation. As well as for its own bootstrapping (Leroy et al. 2008), OCaml is used in Lucid Synchrone (Pouzet 2006), the à la Lustre language for reactive systems implementation, or in the implementation of the Coq proof assistant (Project 2006). Some years ago the major avionics company Dassault approached the use of OCaml in software engineering for the development of safe real-time programs. The experience of Surlog with AGFL and the usage of Astrée (Cousot et al. 2005) by AIRBUS industries show that tools written in OCaml can be integrated in a critical software development process.
3 In the sequel, "OCaml compiler" will designate the INRIA OCaml compiler.
4 www.surlog.com
The Esterel Technologies project presented in this paper is a code generator, named KCG, that translates SCADE models (dataflow with state machines) into embeddable C code. SCADE is a Lustre (Halbwachs et al. 1991) dialect (programs directed by equations with time constructions) enhanced by powerful control flow constructions (automata). KCG has a classical architecture: a front-end with several steps of type-checking, a middle-end performing a scheduling and translation of the equational and temporal source language into an imperative intermediate language, and a back-end which generates a bunch of C files. It also contains several optimization passes. A particularity of KCG compared to other compilers resides in its ability to ensure a maximum of traceability between the input model and the generated C program. KCG is specified in a 500 page document containing more than a thousand high-level requirements: one third of them describe the functional requirements of the tool, the others explain the semantics of the input language.
The high-level requirements that specify the static and dynamic semantics of the Scade language involve logical inference rules. The distance between such a form of requirements and a program written in ML is small and the implementation is very routine, even straightforward for some parts. Indeed:
• the functional abstraction and the modularity of OCaml are high-level enough to be used as architectural requirements (direct traceability).
• the extensive usage of algebraic data types and pattern matching meets the algorithmic description.
• this functional architecture based on well identified compiler phases allows an independent validation of each pass.
As any modern functional language, OCaml benefits from a compiler that produces trustable applications, safer than most of the mainstream languages which require to make use of dedicated verification tools. In particular, the safety of its static typing allows to skip some verifications that would be mandatory with other languages: among the most evident are the memory allocation, coherency and initialization checks, which are no longer relevant and can therefore be omitted when using OCaml. The OCaml code is compact, which allows to concentrate the verification efforts on the real difficulties, i.e. the algorithmic ones, and very little effort is devoted to data encoding or resource management issues. On the other hand, some of the high-level constructs of this programming language may have a bad incidence on the verification activities. We decided not to support the complete OCaml language, and thus forbade or restricted the usage of the most complex parts:
• the object-oriented paradigm is not used, for the reason that the control it offers is very difficult to manage statically;
• modules and functors constructions are allowed, but without some unnecessary constructs such as manifest types and other artifacts;
• exceptions and higher-order constructions are restricted by specific coding rules to avoid complex behaviors that would otherwise be hard to verify.
While using OCaml in a development process has undeniable advantages, it remains to answer the specific requirements of the safety-critical software context. This point is addressed in the two following sections.
4. Code Coverage for OCaml programs
An OCaml program such as KCG uses two kinds of library code: the OCaml standard library, written mainly in OCaml, and the runtime library, written in C and assembly language. Both are shipped with the OCaml compiler and linked with the final executable. The difficulty of specifying and testing such low-level library code led us to adapt and simplify it. The bulk of the modifications of the runtime library was to remove unessential features according to the coding standard of KCG. In particular, the support for concurrency and serialization was removed. Most of the work consisted in simplifying the efficient but complex memory management subsystem. We successfully replaced it by a plain Stop&Copy collector with a reasonable loss of performance. As most of the standard library is written in plain OCaml, its certification is no more difficult than that of any OCaml application.
5 http://www.esterel-technologies.com/technology/free-software/
Regarding the OCaml part, we developed a tool, called mlcov, capable of measuring the MC/DC rate of OCaml programs. The tool first allows to create an instrumented version of the source code that handles a trace file. Running the instrumented executable then
leads to (incremental) updates of the counters and structures of the trace file. Finally, the coverage results are presented through HTML reports generated from the trace file.
MC/DC for OCaml sources Since OCaml is an expression language, we have to address the coverage of expression evaluation: we state that an expression has been covered as soon as its evaluation has ended. Expressions are instrumented with a mark allowing to record by side-effect that this point of the program has been reached. Some constructions of the OCaml language (such as if then else) may introduce several execution branches. Coverage of expressions entails tracing the evaluation of each one of the branches independently. These transformations are detailed in (Pagano et al. 2008).
The mlcov implementation The mlcov tool is built on top of the front-end of the OCaml compiler. For our specific purposes, a first pass is done, prior to the instrumentation stage, in order to reject OCaml programs that do not comply with the coding standard of KCG. Figure 2 shows a source code annotated according to test programs: conditions in light gray fulfill the MC/DC criterion, while those in dark gray are not completely covered. Figure 3 gives the structural coverage and MC/DC statistics for these tests.
Figure 2. Annotated source code
Figure 3. Coverage rates
Performance Results Performances are good enough for code coverage analysis, since this activity mainly consists in applying a lot of pretty small examples targeting specific requirements.
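The following sketch suggests what such a source-to-source instrumentation can look like on a tiny example; the mark function and the counter array are illustrative stand-ins of ours and do not reproduce the actual mlcov implementation or its trace-file format.

  (* one counter per instrumentation point; mlcov would update a trace file instead *)
  let counters = Array.make 3 0
  let mark i e = counters.(i) <- counters.(i) + 1; e

  (* original expression *)
  let abs x = if x >= 0 then x else -x

  (* instrumented version: each branch, and the whole expression, records by
     side effect that its evaluation has ended *)
  let abs_instrumented x =
    mark 0 (if x >= 0 then mark 1 x else mark 2 (-x))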
5. Traceability from sources to binaries
A DO-178B level A software development imposes to give evidence about the trustability of the tools and compilers used in the process. To reach this goal, we expertized the OCaml compiling process in order to set up hints for the traceability from the source code to the object code. On the basis of this expertise, among other required documentation, test sets have been produced and are part of the bunch of documents for the certification of KCG. We present in this section the guidelines of this study, mainly focused on two points: the safe management of system calls and the traceability of the control flow.
Actually, not only does the executed object code of an OCaml program consist of the generated code, but it also includes some service assembly code, the runtime library and the so-called standard OCaml libraries. All those components are linked together at the end of the compilation step. As noticed in Section 4, the set of OCaml libraries had been slightly simplified to keep only the ones written in OCaml; thus it falls under the regular treatment of pieces of OCaml code. The runtime library, developed in C, is mainly concerned with garbage collection. A little static assembly code provides mechanisms for external calls to memory management C functions and for exception handling. As for any OCaml application, when compiling KCG, an ad hoc piece of assembly code is generated to set up the optimized mechanism of functional application of OCaml. The code for all the standard and runtime libraries used in KCG is reasonably compact, especially after the drastic simplification of the GC. External calls are well confined in small static assembly code and no use of the libc library can escape from it. So, fulfilling the two requirements cited above (traceability and safe management of system calls) for this part of OCaml programs can be done by following the usual process.
To deal with the generated code, we first benefited from the fact that the source code of the OCaml compiler is open and that its functional architecture designs a clear process of refining, step by step, the intermediate languages, from the abstract syntax tree to the assembly code. The OCaml compilation is itself traceable in the sense that all the intermediate rewritings of the source program can be pretty-printed. It is notable that the bootstrapped OCaml compiler itself naturally offers the traceability facilities that were intentionally designed for the KCG code generator (see Section 3). It is possible to stop the OCaml compiling process after the emission of assembly code. Then, one can assemble by hand and link all the components, using the same command as the one the compiler would have used, and finally obtain the same executable as the one the full compiling process would have produced. As a consequence, it is enough to establish traceability from source code to assembly code: a test set can consist in a piece of OCaml code as input and its corresponding piece of assembly code as expected output. At this level of the expertise, three main points had to be taken into consideration:
• the translation of explicit controls of the source code, including pattern matching and exception handling;
• the controls introduced by the compiler itself, which are indeed few and have been tracked;
• the so called primitive functions, which may either be translated to assembly language or generate calls to external functions.
Concerning the first point, rather than an unfeasible full correctness test of the OCaml compiler, we proceeded to a review of its design principles, deep enough to set a methodology able to ensure the above intended properties of the compiler for a given OCaml application (restricted to the coding standard of the project). Concerning the second point, a detailed review of the code led us to enumerate the few occurrences where tests are generated: memory allocation, call to the GC, division by 0, access to array or string elements and the mechanism of functional application. In each case, either one can design test sets to cover them, or the branching may stop the program in a safe state.6 Concerning the third point, all the primitives actually appear in the intermediate lambda code and an exhaustive study of their appearance in the generated assembly code has been performed.
6. Conclusion
In the field of safety-critical avionics software, the mainstream programming languages are exclusively C and ADA. Even to develop tools which are not embedded themselves but which are used to implement embedded applications, the usage of object-oriented programming languages such as Java or C++ is not considered relevant, due to the complexity of their control flow. The restrictions needed to develop safety-critical Java/C++ software remove all the features that differentiate OOP languages from C/ADA.
At the very beginning of the project, using OCaml instead of C was a challenge; the point was to have a programming language closer to the functional specifications but further away from the executable program. The main risk resided in the problems that could have been met to show the traceability between the different levels of specifications and the binary resulting from the compilation of a highly functional and polymorphic source code. This project has shown that this was not an issue, thanks to the good traceability of the OCaml compiler and its compilation schemes. Another risk was to express and reach a full code coverage with respect to the MC/DC measure. It was managed by the development of a tool and the performing of a classic test campaign, which revealed itself neither longer nor more expensive than previous experiences of code coverage involving code generators written in C. The additional cost of development of a specific tool (mlcov) is balanced by the gain obtained when qualifying as a verification tool a software that is completely designed for our purpose.
The new KCG, developed in OCaml, is certified with respect to the IEC 61508 and EN 50128 norms. It is used in several civil avionics DO-178B projects (such as the A380 Airbus plane, for instance) and will be qualified simultaneously with the project qualifications (with DO-178B, the tools are not qualified by themselves, but by their usage in a project). The project has been accomplished with the expected delays and costs. The software consists of 65k lines of OCaml code, including a lexer and a parser, plus 4k lines of C code for the runtime library. The development team was composed of 6 software engineers and 8 test engineers during almost 2 years. It is a real DO-178B project, yet with only one singularity compared to other tool developments in this certification framework: the use of OCaml as the main programming language.
There are other industrial usages of OCaml in some big companies in the field of embedded avionics systems, and they have an increasing interest in the usage of this kind of language for building software engineering tools. In the transportation domain, Prover Technology also provides certifiable solutions for automating verification activities. To meet the high level of certification (SIL 4 in the IEC 61508 standard) required by these applications, a diversified implementation of some software modules present in the toolchains is required. This diversification consists in having two implementations, each using its own implementation technology, and comparing the results. For this purpose, OCaml has been chosen jointly with the mainstream C language. This different approach of certification is another opportunity for functional languages.
The main result for the ICFP community is that the use of our favorite languages to build compilers is starting to be well understood and accepted by industrial processes and certification authorities in the context of software engineering tools. We can be optimistic to see that, in the middle of all the mainstream (and efficient for other purposes) languages, there is room for functional technologies and culture.
References
Gérard Berry. The Effectiveness of Synchronous Languages for the Development of Safety-Critical Systems. Technical report, Esterel Technologies, 2003.
Jean-Louis Camus and Bernard Dion. Efficient Development of Airborne Software with SCADE SuiteTM. Technical report, Esterel Technologies, 2003.
Jean-Louis Colaço and Marc Pouzet. Clocks as First Class Abstract Types. In Third International Conference on Embedded Software (EMSOFT'03), Philadelphia, Pennsylvania, USA, October 2003.
Jean-Louis Colaço, Bruno Pagano, and Marc Pouzet. A Conservative Extension of Synchronous Data-flow with State Machines. In ACM International Conference on Embedded Software (EMSOFT'05), Jersey City, New Jersey, USA, September 2005.
P. Cousot, R. Cousot, J. Feret, L. Mauborgne, A. Miné, D. Monniaux, and X. Rival. The Astrée analyser. In European Symposium on Programming. LNCS, April 2005.
ECMA. ECMA-367: Eiffel analysis, design and programming language. ECMA (European Association for Standardizing Information and Communication Systems), June 2005.
N. Halbwachs, P. Caspi, P. Raymond, and D. Pilaud. The synchronous dataflow programming language Lustre. In Proceedings of the IEEE, pages 1305–1320, 1991.
Kelly J. Hayhurst, Dan S. Veerhusen, John J. Chilenski, and Leanna K. Rierson. A Practical Tutorial on Modified Condition/Decision Coverage. Technical report, NASA/TM-2001-210876, May 2001.
Xavier Leroy, Damien Doligez, Jacques Garrigue, Didier Rémy, and Jérôme Vouillon. The Objective Caml system, documentation and user's manual, release 3.11. INRIA, December 2008. URL http://caml.inria.fr/pub/docs/manual-ocaml/.
Bruno Pagano, Olivier Andrieu, Benjamin Canou, Emmanuel Chailloux, Jean-Louis Colaço, Thomas Moniot, and Philippe Wang. Certified development tools implementation in Objective Caml. In Paul Hudak and David Scott Warren, editors, Tenth International Symposium on Practical Aspects of Declarative Languages (PADL), volume 4902 of Lecture Notes in Computer Science, pages 2–17. Springer, 2008.
Marc Pouzet. Lucid Synchrone version 3.0: Tutorial and Reference Manual, 2006. (www.lri.fr/~pouzet/lucid-synchrone).
The Coq Development Team, LogiCal Project. The Coq Proof Assistant Reference Manual, 2006. (coq.inria.fr/V8.1beta/refman).
RTCA/DO-178B. Software Considerations in Airborne Systems and Equipment Certification. Radio Technical Commission for Aeronautics (RTCA), December 1992.
6 This is not acceptable for embedded code, but it is for development tools in the sense that it ensures that no faulty code will ever be silently produced.
Identifying Query Incompatibilities with Evolving XML Schemas
Pierre Genevès (CNRS), Nabil Layaïda (INRIA), Vincent Quint (INRIA)
[email protected], {nabil.layaida,vincent.quint}@inria.fr
Abstract
During the life cycle of an XML application, both schemas and queries may change from one version to another. Schema evolutions may affect query results and potentially the validity of produced data. Nowadays, a challenge is to assess and accommodate the impact of these changes in evolving XML applications. Such questions arise naturally in XML static analyzers. These analyzers often rely on decision procedures such as inclusion between XML schemas, query containment and satisfiability. However, existing decision procedures cannot be used directly in this context. The reason is that they are unable to distinguish information related to the evolution from information corresponding to bugs. This paper proposes a predicate language within a logical framework that can be used to make this distinction. We present a system for monitoring the effect of schema evolutions on the set of admissible documents and on the results of queries. The system is very powerful in analyzing various scenarios where the result of a query may not be anymore what was expected. Specifically, the system is based on a set of predicates which allow a fine-grained analysis for a wide range of forward and backward compatibility issues. Moreover, the system can produce counterexamples and witness documents which are useful for debugging purposes. The current implementation has been tested with realistic use cases, where it allows identifying queries that must be reformulated in order to produce the expected results across successive schema versions.
Categories and Subject Descriptors D.3.4 [Software]: Programming Languages—Processors; D.2.4 [Software]: Engineering—Software/Program Verification
General Terms Languages, Standardization, Verification
Keywords XML, Schema, Queries, Evolution, Compatibility

1. Introduction
XML is now commonplace on the web and in many information systems where it is used for representing all kinds of information resources, ranging from simple text documents such as RSS or Atom feeds to highly structured databases. In these dynamic environments, not only data are changing steadily but their schemas also get modified to cope with the evolution of the real world entities they describe.
Schema changes raise the issue of data consistency. Existing documents and data that were valid with a certain version of a schema may become invalid on a new version of the schema (forward incompatibility). Conversely, new documents created with the latest version of a schema may be invalid on some previous versions (backward incompatibility). In addition, schemas may be written in different languages, such as DTD, XML Schema, or Relax-NG, to name only the most popular ones. And it is common practice to describe the same structure, or new versions of a structure, in different schema languages. Document formats developed by W3C provide a variety of examples: XHTML 1.0 has both DTDs and XML Schemas, while XHTML 2.0 has a Relax-NG definition; the schema for SVG Tiny 1.1 is a DTD, while version 1.2 is written in Relax-NG; MathML 1.01 has a DTD, MathML 2.0 has both a DTD and an XML Schema, and MathML 3.0 is developed with a Relax-NG schema and is expected to have also a DTD and an XML Schema. An issue then is to make sure that schemas written in different languages are equivalent, i.e. they describe the same structure, possibly with some differences due to the expressivity of the language [Murata et al. 2005]. Another issue is to clearly identify the differences between two versions of the same schema expressed in different languages. Moreover, the issues of forward and backward compatibility of instances obviously remain when schema languages change from a version to another.
Validation, and then compatibility, is not the only purpose of a schema. Validation is usually the first step for safe processing of documents and data. It makes sure that documents and data are structured as expected and can then be processed safely. The next step is to actually access and select the various parts to be handled in each phase of an application. For this, query languages play a key role. As an example, when transforming a document with XSL, XPath queries are paramount to locate in the original document the data to be produced in the transformed document.
Queries are affected by schema evolutions. The structures they return may change depending on the version of the schema used by a document. When changing schema, a query may return nothing, or something different from what was expected, and obviously further processing based on this query is at risk. These observations highlight the need for evaluating precisely and safely the impact of schema evolutions on existing and future instances of documents and data. They also show that it is important for software engineers to precisely know what parts of a processing chain have to be updated when schemas change. In this paper we focus on the XPath query language, which is used in many situations while processing XML documents and data. The XSL transformation language was already mentioned, but XPath is also present in XLink and XQuery for instance.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ICFP’09, August 31–September 2, 2009, Edinburgh, Scotland, UK. c 2009 ACM 978-1-60558-332-7/09/08. . . $10.00 Copyright
221
2. Analysis Framework

The main contribution of this paper is a framework that allows the automatic verification of properties related to XML schema and query evolution. In particular, it offers the possibility of checking fine-grained properties of the behavior of queries with respect to successive versions of a given schema. The system can be used for checking whether schema evolutions require a particular query to be updated. Whenever schema evolutions may induce query malfunctions, the system is able to generate annotated XML documents that exemplify bugs, with the goal of helping the programmer to understand and properly overcome undesired effects of schema evolutions. The system relies on a predicate language (presented in Section 4) specifically designed for studying schema and query compatibility issues when schemas evolve. In particular, predicates allow characterizing in a precise manner nodes subject to evolution. For instance, predicates allow to distinguish new nodes selected by the query after a schema change from new nodes that appear in the modified schema. Predicates also allow to describe nodes that appear in new regions of a schema compared to its original version, or even in a new context described by a particular XPath expression. Predicates, together with the composition language provided in the system, allow to express and analyze complex settings. The system has been fully implemented [Genevès and Layaïda 2009] and is outlined in Figure 1. It is composed of a parser for reading the text file description of the problem (which in turn uses specific parsers for schemas, queries, logical formulas, and predicates), compilers for translating schemas and queries into their logical representations, a solver for checking satisfiability of logical formulas, and a counter example XML tree generator (described in [Genevès et al. 2008]). We first introduce the data model we consider for XML documents, schemas and queries.

XML Trees with Attributes  An XML document is considered as a finite tree of unbounded depth and arity, with two kinds of nodes respectively named elements and attributes. In such a tree, an element may have any number of children elements, and may carry zero, one or more attributes. Attributes are leaves. Elements are ordered whereas attributes are not, as illustrated on Figure 4. In this paper, we focus on the nested structure of elements and attributes, and ignore XML data values.

Type Constraints  As an internal representation for tree grammars, we consider regular tree type expressions (in the manner of [Hosoya et al. 2005]), extended with constraints over attributes. Assuming a set of variables ranged over by x, we define a tree type expression as follows:

    τ ::=                     tree type expression
          ∅                   empty set
          ()                  empty sequence
          τ | τ               disjunction
          τ, τ                concatenation
          l(a)[τ]             element definition
          x                   variable
          let x = τ in τ      binder

The let construct allows binding one or more variables to associated formulas. Since several variables can be bound at a time, the notation x = τ is used for denoting a vector of variable bindings (possibly with mutual recursion). We impose a usual restriction on the recursive use of variables: we allow unguarded (i.e. not enclosed by a label) recursive uses of variables, but restrict them to tail positions¹. With that restriction, tree type expressions define regular tree languages. In addition, an element definition may involve simple attribute expressions that describe which attributes the defined element may (or may not) carry:

    a    ::=                  attribute expression
             ()               empty list
             list | a         disjunction
    list ::=                  attribute list
             list, list       commutative concatenation
             l?               optional attribute
             l                required attribute
             ¬l               prohibited attribute

¹ For instance, "let x = l(a)[τ], x | () in x" is allowed.

We use the usual semantics of regular tree types found in [Hosoya et al. 2005] and [Genevès et al. 2008]. Our tree type expressions capture most of the schemas in use today [Murata et al. 2005]. In practice, our system provides parsers that convert DTDs, XML Schemas, and Relax NGs to this internal tree type representation. Users may thus define constraints over XML documents with the language of their choice, and, more importantly, they may refer to most existing schemas for use with the system.

Queries  The set of XPath expressions we consider is given by the syntax shown on Figure 2. The semantics of XPath expressions is described in [Clark and DeRose 1999], and more formally in [Wadler 2000]. We observed that, in practice, many XPath expressions contain syntactic sugars that can also fit into this fragment. Figure 3 presents how our XPath parser rewrites some commonly found XPath patterns into the fragment of Figure 2, where the notation (axis::nt)^k stands for the composition of k successive path steps of the same form: axis::nt/.../axis::nt (k steps).

    query     ::=
                  /path                         absolute path
                  path                          relative path
                  query | query                 union
                  query ∩ query                 intersection
    path      ::=
                  path/path                     path composition
                  path[qualifier]               qualified path
                  axis::nt                      step
    qualifier ::=
                  qualifier and qualifier       conjunction
                  qualifier or qualifier        disjunction
                  not(qualifier)                negation
                  path                          path
                  path/@nt                      attribute path
                  @nt                           attribute step
    nt        ::=                               node test
                  σ                             node label
                  ∗                             any node label
    axis      ::=                               tree navigation axis
                  self | child | parent
                  | descendant | ancestor
                  | descendant-or-self | ancestor-or-self
                  | following-sibling | preceding-sibling
                  | following | preceding

Figure 2. XPath Expressions.
    XML Problem Description (Text File)
      (sample problem: select("a//b[ancestor::e]", type("XHTML1-strict.dtd", "html")))
      → Parsing and Compilation
      → Logical formula over binary trees with attributes
      → Satisfiability Test
          → Unsatisfiable (property proved)
          → Satisfiable → Synthesis → Satisfying binary tree with attributes
                         → binary to n-ary → Sample XML document inducing a bug

Figure 1. Framework Overview.

    XPath pattern                                              Rewriting
    nt[position() = 1]                                         nt[not(preceding-sibling::nt)]
    nt[position() = last()]                                    nt[not(following-sibling::nt)]
    nt[position() = k]   (k > 1)                               nt[(preceding-sibling::nt)^(k-1)]
    count(path) = 0                                            not(path)
    count(path) > 0                                            path
    count(nt) > k        (k > 0)                               nt/(following-sibling::nt)^k
    preceding-sibling::∗[position() = last() and qualifier]    preceding-sibling::∗[not(preceding-sibling::∗) and qualifier]

Figure 3. Syntactic Sugars and their Rewritings.

The next Section presents the logic underlying the predicate language. Section 4 describes predicates for characterizing the impact of schema changes. Finally, experiments on realistic use cases are reported in Section 5.
3. Logical Setting

It is well-known that there exist bijective encodings between unranked trees (trees of unbounded arity) and binary trees. Owing to these encodings binary trees may be used instead of unranked trees without loss of generality. In the sequel, we rely on a simple "first-child & next-sibling" encoding of unranked trees. In this encoding, the first child of an element node is preserved in the binary tree representation, whereas siblings of this node are appended as right successors in the binary representation. Attributes are left unchanged by this encoding. For instance, Figure 5 presents how the sample tree of Figure 4 is mapped.

Figure 4. Sample XML Tree with Attributes.

Figure 5. Binary Encoding of Tree of Figure 4.

The logic we introduce below, used as the core of our framework, operates on such binary trees with attributes.

3.1 Logical Formulas

The concrete syntax of logical formulas is shown on Figure 6, where the meta-syntax ⟨X⟩ means one or more occurrences of X separated by commas. The reader can directly use this syntax for encoding formulas as text files to be used with the system [Genevès and Layaïda 2009]. This concrete syntax is used as a single unifying notation throughout all the paper.

    ϕ ::=                        formula
          T                      true
          F                      false
          l                      element name
          p                      atomic proposition
          #                      start context
          ϕ | ϕ                  disjunction
          ϕ & ϕ                  conjunction
          ϕ => ϕ                 implication
          ϕ <=> ϕ                equivalence
          (ϕ)                    parenthesized formula
          ~ϕ                     negation
          <p>ϕ                   existential modality
          <l>T                   attribute named l
          $X                     variable
          let ⟨$X = ϕ⟩ in ϕ      binder for recursion
          predicate              predicate (see Section 4)
    p ::=                        program inside modalities
          1                      first child
          2                      next sibling
          -1                     parent
          -2                     previous sibling

Figure 6. Concrete Syntax of Formulas.

The semantics of logical formulas corresponds to the classical semantics of a µ-calculus interpreted over finite tree structures. A formula is satisfiable iff there exists a finite binary tree with attributes for which the formula holds at some node. This is formally defined in [Genevès et al. 2007], and we review it informally below through a series of examples.

There is a difference between an element name and an atomic proposition²: an element has one and only one element name, whereas it can satisfy multiple atomic propositions. We use atomic propositions to attach specific information to tree nodes, not related to their XML labeling. For example, the start context (a reserved atomic proposition) is used to mark the starting context nodes for evaluating XPath expressions. The logic uses modalities for navigating in binary trees. A modality <p>ϕ can be read as follows: "there exists a successor node by program p such that ϕ holds at this successor". As shown on Figure 6, a program p is simply one of the four basic programs {1, 2, -1, -2}. Program 1 allows navigating from a node down to its first successor, and program 2 allows navigating from a node down to its second successor. The logic also features converse programs -1 and -2 for navigating upward in binary trees, respectively from the first successor to its parent and from the second successor to its previous sibling. Table 1 gives some simple formulas using modalities for navigating in binary trees, together with sample satisfying trees, in binary and unranked tree representations.

² In practice, an atomic proposition must start with a "_".

Table 1. Sample Formulas and Satisfying Trees.

The logic allows expressing recursion in trees through the recursive binder. For example the recursive formula:

    let $X = b | <2>$X in $X

means that either the current node is named b or there is a sibling of the current node which is named b. For this purpose, the variable $X is bound to the subformula b | <2>$X which contains an occurrence of $X (therefore defining the recursion). The scope of this binding is the subformula that follows the "in" symbol of the formula, that is $X. The entire formula can thus be seen as a compact recursive notation for an infinitely nested formula of the form:

    b | <2>(b | <2>(b | <2>(...)))

Recursion allows expressing global properties. For instance, the recursive formula:

    ~ let $X = a | <1>$X | <2>$X in $X

expresses the absence of nodes named a in the whole subtree of the current node (including the current node). Furthermore, the fixpoint operator makes possible to bind several variables at a time, which is specifically useful for expressing mutual recursion. For example, the mutually recursive formula:

    let $X = (a & <2>$Y) | <1>$X | <2>$X,
        $Y = b | <2>$Y
    in $X

asserts that there is a node somewhere in the subtree such that this node is named a and it has at least one sibling which is named b. Binding several variables at a time provides a very expressive yet succinct notation for expressing mutually recursive structural patterns (that are common in XML Schemas, for instance). From a theoretical perspective, the recursive binder let $X = ϕ in ϕ corresponds to the fixpoint operators of the µ-calculus. It is shown in [Genevès et al. 2007] that the least fixpoint and the greatest fixpoint operators of the µ-calculus coincide over finite tree structures, for a restricted class of formulas called cycle-free formulas. Translations of XPath expressions and schemas presented in this paper always yield cycle-free formulas (see [Genevès et al. 2008] for more details).

3.2 Queries

The logic is expressive enough to capture the set of XPath expressions presented in Section 2. For example, Figure 7 illustrates how the sample XPath expression:

    child::r[child::w/@att]

is expressed in the logic. From a given context in an XML document, this expression selects all r child nodes which have at least one w child with an attribute att. Figure 7 shows how it is expressed in the logic, on the binary tree representation. The formula holds for r nodes which are selected by the expression. The first part of the formula, ϕ, corresponds to the step child::r which selects candidate r nodes. The second part, ψ, navigates downward in the subtrees of these candidate nodes to verify that they have at least one immediate w child with an attribute att.

    Translated Query: child::r[child::w/@att]

Figure 7. XPath Translation Example.

This example illustrates the need for converse programs inside modalities. The translated XPath expression only uses forward axes (child and attribute), nevertheless both forward and backward modalities are required for its logical translation. Without converse programs we would have been unable to differentiate selected nodes from nodes whose existence is simply tested. More generally, properties must often be stated on both the ancestors and the descendants of the selected node. Equipping the logic with both forward and converse programs is therefore crucial. Logics without converse programs may only be used for solving XPath emptiness but cannot be used for solving other decision problems such as containment efficiently. A systematic translation of XPath expressions into the logic is given in [Genevès et al. 2007]. In this paper, we extended it to
deal with attributes. We implemented a compiler that takes any expression of the fragment of Figure 2 and computes its logical translation. With the help of this compiler, we extend the syntax of logical formulas with a logical predicate select("query", ϕ). This predicate compiles the XPath expression query given as parameter into the logic, starting from a context that satisfies ϕ. The XPath expression to be given as parameter must match the syntax of the XPath fragment shown on Figure 2 (or Figure 3). In a similar manner, we introduce the predicate exists("query", ϕ) which tests the existence of query from a context satisfying ϕ, in a qualifier-like manner (without moving to its result). Additionally, the predicate select("query") is introduced as a shortcut for select("query", #), where # simply marks the initial context node of the XPath expression³. The predicate exists("query") is a shortcut for exists("query", T). These syntactic extensions of the logic allow the user to easily embed XPath expressions and formulate decision problems out of them (like e.g. containment or any other boolean combination). In the next sections we explain how the framework allows combining queries with schema information for formulating problems.

³ This mark is especially useful for comparing two or more XPath expressions from the same context.
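For instance, here is a sketch of such a formulation (the two XPath expressions are chosen only for illustration): to check that every node selected by the first query is also selected by the second one from the same initial context, one tests the satisfiability of

    select("child::a/child::b") & ~select("descendant::b")

If this formula is unsatisfiable, the containment holds; if it is satisfiable, the solver exhibits a witness document on which the first query selects a node that the second one misses.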
3.3 Tree Types

Tree type expressions are compiled into the logic in two steps: the first stage translates them into binary tree type expressions, and the second step actually compiles this intermediate representation into the logic. The translation procedure from tree type expressions to binary tree type expressions is well-known and detailed in [Genevès 2006]. The syntax of output expressions follows:

    τ ::=                    binary tree type expression
          ∅                  empty set
          ()                 empty tree
          τ | τ              disjunction
          l(a)[x, x]         element definition
          let x = τ in τ     binder

Attribute expressions are not concerned by this transformation to binary form: they are simply attached, unchanged, to new (binary) element definitions. Finally, binary tree type expressions are compiled into the logic. This translation step was introduced and proven correct in [Genevès et al. 2007]. Originally, the translation takes a tree type expression τ and returns the corresponding logical formula. Here, we extend it slightly but crucially: the logical translation of an expression τ is given by the function tr(τ)^ψ_ϕ defined below, that takes additional arguments ϕ and ψ:

    tr(τ)^ψ_ϕ                 =  F                                                 for τ = ∅, ()
    tr(τ1 | τ2)^ψ_ϕ           =  tr(τ1)^ψ_ϕ | tr(τ2)^ψ_ϕ
    tr(l(a)[x1, x2])^ψ_ϕ      =  (l & ϕ & tra(a) & s1(x1) & s2(x2)) | ψ
    tr(let xi = τi in τ)^ψ_ϕ  =  let $Xi = tr(τi)^ψ_ϕ in tr(τ)^ψ_ϕ

The addition of ϕ and ψ (respectively in a new conjunction and a new disjunction) is a key element for the definition of predicates in Section 4. More precisely, this allows marking type subexpressions so that they can be distinguished in predicates, as explained in Section 3.4. In addition, ϕ and ψ are either true, false, or simple atomic propositions. Thus, it is worth noticing that their addition does not affect the linear complexity of tree type translation. The function s_p(·) describes the type for each successor:

    s_p(x) =  ~<p>T              if x is bound to ()
              ~<p>T | <p>$X      if nullable(x)
              <p>$X              if not nullable(x)

according to the predicate nullable(x) which indicates whether the type T ≠ () bound to x contains the empty tree. The function tra(a) compiles attribute expressions associated with element definitions as follows:

    tra(())          =  notothers(())
    tra(list | a)    =  tra(list) & notothers(list)
    tra(list, list′) =  tra(list) & tra(list′)
    tra(l?)          =  l | ~l
    tra(l)           =  l
    tra(¬l)          =  ~l

In usual schemas (e.g. DTDs, XML Schemas) when no attribute is specified for a given element, it simply means no attribute is allowed for the defined element. This convention must be explicitly stated in the logic. This is the role of the function "notothers(list)" which returns the negated disjunction of all attributes not present in list. As a result, taking attributes into account comes at an extra cost. The above translation appends a (potentially very large) formula in which all attributes occur, for each element definition. In practice, a placeholder atomic proposition is inserted until the full set of attributes involved in the problem formulation is known. When the whole formula has been parsed, placeholders are replaced by the conjunction of negated attributes they denote. This extra cost can be observed in practice, and the system allows two modes of operation: with or without attributes⁴. Nevertheless the system is still capable of handling real world DTDs (such as the DTD of XHTML 1.0 Strict) with attributes. This is due to (1) the limited expressive power of languages such as DTD that do not allow for disjunction over attribute expressions (like "list | a"); and, more importantly, (2) the satisfiability-testing algorithm which is implemented using symbolic techniques [Genevès et al. 2008].

⁴ The optional argument "-attributes" must be supplied for attributes to be considered.

Tree type expressions form the common internal representation for a variety of XML schema definition languages. In practice, the logical translation of a tree type expression τ is obtained directly from a variety of formalisms for defining schemas, including DTD, XML Schema, and Relax NG. For this purpose, the syntax of logical formulas is extended with a predicate type("·", ·). The logical translation of an existing schema is returned by type("f", l) where f is a file path to the schema file and l is the element name to be considered as the entry point (root) of the given schema. Any occurrence of this predicate will parse the given schema, extract its internal tree type representation τ, compile it into the logic and return the logical formula tr(τ)^F_T.

3.4 Type Tagging

A tag (or "color") is introduced in the compilation of schemas with the purpose of marking all node types of a specific schema. A tag is simply a fresh atomic proposition passed as a parameter to the translation of a tree type expression. For example: tr(τ)^F_xhtml is the logical translation of τ where each element definition is annotated with the atomic proposition "xhtml". With the help of tags, it becomes possible to refer to the element types in any context. For instance, one may formulate tr(τ)^F_xhtml | tr(τ′)^F_smil for denoting the union of all τ and τ′ documents, while keeping a way to distinguish element types; even if some element names are shared by the two type expressions. Tagging becomes even more useful for characterizing evolutions between successive versions of a single schema. In this setting, we need a way to distinguish nodes allowed by a newer
schema version from nodes allowed by an older version. This distinction must not be based only on element names, but also on content models. Assume for instance that τ′ is a newer version of schema τ. If we are interested in the set of trees allowed by τ′ but not allowed by τ then we may formulate:

    tr(τ′)^F_T & ~tr(τ)^F_T

If we now want to check more fine-grained properties, we may rather be interested in the following (tagged) formulation:

    tr(τ′)^F_all & ~tr(τ)^(~_old_complement)_T

In this manner, we can distinguish elements that were added in τ′ and whose names did not occur in τ, from elements whose names already occurred in τ but whose content model changed in τ′, for instance. In practice, a type is tagged using the predicate type("f", l, ϕ, ϕ′) which parses the specified schema, converts it into its logical representation τ and returns the formula tr(τ)^ϕ′_ϕ. This kind of type tagging is useful for studying the consequences of schema updates over queries, as presented in the next sections.
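As a small illustration of tagging written in the system's concrete syntax (only a sketch: the two schema files are simply the ones used later in Section 5, and _xhtml and _mathml are arbitrary atomic propositions chosen as tags), the union of two document types with their element types kept distinguishable can be formulated as:

    type("xhtml-basic10.dtd", "html", _xhtml, F) | type("mathml2.dtd", "math", _mathml, F)

Any satisfying tree is valid against one of the two schemas, and the element types of each schema remain distinguishable via their tag.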
4. Analysis Predicates

This section introduces the basic analysis tasks offered to XML application designers for assessing the impact of schema evolutions. In particular, we propose a means for identifying the precise reasons for type mismatches or changes in query results under type constraints. For this purpose, we build on our query and type expression compilers, and define additional predicates that facilitate the formulation of decision problems at a higher level of abstraction. Specifically, these predicates are introduced as logical macros with the goal of allowing system usage while focusing (only) on the XML-side properties, and keeping underlying logical issues transparent for the user. Ultimately, we regard the set of basic logical formulas (such as modalities and recursive binders) as an assembly language, to which predicates are translated. We illustrate this principle with two simple predicates designed for checking backward-compatibility of schemas, and query satisfiability in the presence of a schema.

• The predicate backward_incompatible(τ, τ′) takes two type expressions as parameters, and assumes τ′ is an altered version of τ. This predicate is unsatisfiable iff all instances of τ′ are also valid against τ. Any occurrence of this predicate in the input formula will automatically be compiled as tr(τ′)^F_T & ~tr(τ)^F_T.

• The predicate non_empty("query", τ) takes an XPath expression (with the syntax defined on Figure 2) and a type expression as parameters, and is unsatisfiable iff the query always returns an empty set of nodes when evaluated on an XML document valid against τ. This predicate compiles into select("query", tr(τ)^F_T & #) where the top-level predicate select("query", ϕ) compiles the XPath expression query into the logic, starting from a context that satisfies ϕ, as explained in Section 3.2. This can be used to check whether the modification of the schema does not contradict any part of the query.

Notice that the predicate non_empty("query", τ) can be used for checking whether a query that is valid⁵ against a schema remains valid with an updated version of a schema. In other terms, this predicate allows determining whether a query that must always return a non-empty result (whatever the tree on which it is evaluated) keeps verifying the same property with a new version of a schema.

⁵ We say that a query is valid iff its negation is unsatisfiable.
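For example (a sketch; the query and the schema file are simply the ones used in the experiments of Section 5), one can check that a query can still select something on at least one document valid against the new schema with:

    non_empty("//apply[*[1][self::eq]]", type("mathml2.dtd", "math"))

If this formula is unsatisfiable, the query has become useless under the new schema and certainly needs to be reformulated.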
A second, more elaborate, class of predicates allows formulating problems that combine both a query query and two type expressions τ, τ′ (where τ′ is assumed to be an evolved version of τ):

• new_element_name("query", τ, τ′) is satisfied iff the query query selects elements whose names did not occur at all in τ. This is especially useful for queries whose last navigation step contains a "*" node test and may thus select unexpected elements. This predicate is compiled into:

    ~element(τ) & select("query", tr(τ′)^F_T)

where element(τ) is another predicate that builds the disjunction of all element names occurring in τ. In a similar manner, the predicate attribute(ϕ) builds the logical disjunction of all attribute names used in ϕ.

• new_region("query", τ, τ′) is satisfied iff the query query selects elements whose names already occurred in τ, but such that these nodes now occur in a new context in τ′. In this setting, the path from the root of the document to a node selected by the XPath expression query contains a node whose type is defined in τ′ but not in τ, as illustrated below:

    (Illustration: an XML document valid against τ′ but not against τ; the path from the root to the node selected by the query contains a node in τ′ \ τ.)

The predicate new_region("query", τ, τ′) is logically defined as follows:

    new_region("query", τ, τ′) =
        select("query", tr(τ)^F_all & ~tr(τ′)^(~_old_complement)_T)
        & ~added_element(τ, τ′)
        & ancestor(_old_complement)
        & ~descendant(_old_complement)
        & ~following(_old_complement)
        & ~preceding(_old_complement)

The previous definition heavily relies on the partition of tree nodes defined by XPath axes, as illustrated by Figure 8. The definition of new_region("query", τ, τ′) uses an auxiliary predicate added_element(τ, τ′) that builds the disjunction of all element names defined in τ′ but not in τ (or in other terms, elements that were added in τ′). In a similar manner, the predicate added_attribute(ϕ, ϕ′) builds the disjunction of all attribute names defined in τ′ but not in τ. The predicate new_region("query", τ, τ′) is useful for checking whether a query selects a different set of nodes with τ′ than with τ because selected elements may occur in new regions of the document due to changes brought by τ′.

Figure 8. XPath axes: partition of tree nodes.

• new_content("query", τ, τ′) is satisfied iff the query query selects elements whose names were already defined in τ, but whose content model has changed due to evolutions brought by τ′, as illustrated below:

    (Illustration: an XML document valid against τ′ but not against τ; the subtree of the node selected by the query has changed — new content model.)

The definition of new_content("query", τ, τ′) follows:

    new_content("query", τ, τ′) =
        select("query", tr(τ)^F_all & ~tr(τ′)^(~_old_complement)_T)
        & ~added_element(τ, τ′)
        & ~ancestor(added_element(τ, τ′))
        & descendant(_old_complement)
        & ~following(_old_complement)
        & ~preceding(_old_complement)

The predicate new_content("query", τ, τ′) can be used for ensuring that XPath expressions will not return nodes with a possibly new content model that may cause problems. For instance, this allows checking whether an XPath expression whose resulting node set is converted to a string value (as in, e.g. XPath expressions used in XSLT "value-of" instructions) is affected by the changes from τ to τ′.

The previously defined predicates can be used to help the programmer identify precisely how type constraint evolutions affect queries. They can even be combined with usual logical connectives to formulate even more sophisticated problems. For example, let us define the predicate exclude(ϕ) which is satisfiable iff there is no node that satisfies ϕ in the whole tree. This predicate can be used for excluding specific element names or even nodes selected by a given XPath expression. It is defined as follows:

    exclude(ϕ) = ~ancestor-or-self(descendant-or-self(ϕ))

This predicate can also be used for checking properties in an iterative manner, refining the property to be tested at each step. It can also be used for verifying fine-grained properties. For instance, one may check whether τ′ defines the same set of trees as τ modulo new element names that were added in τ′ with the following formulation:

    ~(τ <=> τ′) & exclude(added_element(τ, τ′))

This allows identifying that, during the type evolution from τ to τ′, a change in query results has not been caused by the type extension but by new compositions of nodes from the older type. In practice, instead of taking internal tree type representations (as defined in Section 2) as parameters, most predicates do actually take any logical formula as parameter, or even schema paths as parameters. We believe this facilitates predicate usage and, most notably, how predicates can be composed together. Figure 9 gives the syntax of built-in predicates as they are implemented in the system, where f is a file path to a DTD (.dtd), XML Schema (.xsd), or Relax NG (.rng). In addition to the aforementioned predicates, the predicate descendant(ϕ) forces the existence of a node satisfying ϕ in the subtree, and predicate-name(⟨ϕ⟩) is a call to a custom predicate, as explained in the next section.

    predicate ::= select("query")
                | select("query", ϕ)
                | exists("query")
                | exists("query", ϕ)
                | type("f", l)
                | type("f", l, ϕ, ϕ′)
                | forward_incompatible(ϕ, ϕ′)
                | backward_incompatible(ϕ, ϕ′)
                | element(ϕ)
                | attribute(ϕ)
                | descendant(ϕ)
                | exclude(ϕ)
                | added_element(ϕ, ϕ′)
                | added_attribute(ϕ, ϕ′)
                | non_empty("query", ϕ)
                | new_element_name("query", "f", "f′", l)
                | new_region("query", "f", "f′", l)
                | new_content("query", "f", "f′", l)
                | predicate-name(⟨ϕ⟩)

Figure 9. Syntax of Predicates for XML Reasoning.

4.1 Custom Predicates

Following the spirit of the predicates presented in the previous section, users may also define their own custom predicates. The full syntax of XML logical specifications to be used with the system is defined on Figure 10, where the meta-syntax ⟨X⟩ means one or more occurrences of X separated by commas. A global problem specification can be any formula (as defined on Figure 6), or a list of custom predicate definitions separated by semicolons and followed by a formula. A custom predicate may have parameters that are instantiated with actual formulas when the custom predicate is called (as shown on Figure 9). A formula bound to a custom predicate may include calls to other predicates, but not to the currently defined predicate (recursive definitions must be made through the let binder shown on Figure 6).
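For instance, here is a sketch of a custom definition (the predicate name is arbitrary, and the schema files are the ones analyzed in Section 5); it names a reusable check and then applies it to two schema versions:

    incompatible_modulo_additions(x, y) =
        backward_incompatible(x, y) & exclude(added_element(x, y)) ;

    incompatible_modulo_additions(type("xhtml-basic10.dtd", "html"),
                                  type("xhtml-basic11.dtd", "html"))

This is the same test as the second XHTML command of Section 5, packaged as a definition that can be reused for other pairs of schemas.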
    spec ::= ϕ                              formula (see Fig. 6)
           | def ; ϕ
    def  ::= predicate-name(⟨l⟩) = ϕ′       custom definition
           | def ; def                      list of definitions

Figure 10. Global Syntax for Specifying Problems.

    Schema                  Variables   Elements   Attributes
    XHTML 1.0 basic DTD        71          52          57
    XHTML 1.1 basic DTD        89          67          83
    MathML 1.01 DTD           137         127          72
    MathML 2.0 DTD            194         181          97

Table 2. Sizes of (Some) Considered Schemas.
5. Framework in Action

We have implemented the whole software architecture described in Section 2 and illustrated on Figure 1. The tool [Genevès and Layaïda 2009] is available online from:

    http://wam.inrialpes.fr/xml

We have carried out extensive experiments of the system with real world schemas such as XHTML, MathML, SVG, SMIL (Table 2 gives details related to their respective sizes) and queries found in transformations such as MathML content to presentation [Pietriga 2005]. We present two of them that show how the tool can be used to analyze different situations where schemas and queries evolve.

Evolution of XHTML Basic  The first test consists in analyzing the relationship (forward and backward compatibility) between XHTML basic 1.0 and XHTML basic 1.1 schemas. In particular, backward compatibility can be checked by the following command:

    backward_incompatible("xhtml-basic10.dtd", "xhtml-basic11.dtd", "html")

The test immediately yields a counter example as the new schema contains new element names. The counter example contains a style element occurring as a child of head, which is not permitted in XHTML basic 1.0.

The next step consists in focusing on the relationship between both schemas excluding these new elements. This can be formulated by the following command:

    backward_incompatible("xhtml-basic10.dtd", "xhtml-basic11.dtd", "html")
      & exclude(added_element(type("xhtml-basic10.dtd", "html"),
                              type("xhtml-basic11.dtd", "html")))

The result of the test shows a counter example document that proves that XHTML basic 1.1 is not backward compatible with XHTML basic 1.0 even if new elements are not considered. In particular, the content model of the label element cannot have an a element in XHTML basic 1.0 while it can in XHTML basic 1.1. The counter example produced by the solver triggers the following validity error:

    XHTML basic 1.0 validity error: element "a" is not declared in "label" list of possible children

Notice that we observed similar forward and backward compatibility issues with several other W3C normative schemas (in particular for the different versions of SMIL and SVG). Such backward incompatibilities suggest that applications cannot simply ignore new elements from newer schemas, as the combination of older elements may evolve significantly from one version to another.

MathML Content to Presentation Conversion  MathML is an XML format for describing mathematical notations, capturing both their structure and their graphical presentation, known respectively as Content MathML and Presentation MathML. The structure of a given equation is kept separate from the presentation, and the rendering part can be generated from the structure description. This operation is usually carried out using an XSLT transformation that achieves the conversion. In this test series, we focus on the analysis of the queries contained in such a transformation sheet and evaluate the impact of the schema change from MathML 1.0 to MathML 2.0 on these queries. Most of the queries contained in the transformation represent only a few patterns, very similar up to element names. The following three patterns are the most frequently used:

    Q1: //apply[*[1][self::eq]]
    Q2: //apply[*[1][self::apply]/inverse]
    Q3: //sin[preceding-sibling::*[position()=last() and (self::compose or self::inverse)]]

The first test is formulated by the following command:

    new_region("Q1", "mathml.dtd", "mathml2.dtd", "math")

The result of the test shows a counter example document that proves that the query may select nodes in new contexts in MathML 2.0 compared to MathML 1.0. In particular, the query Q1 selects apply elements whose ancestors can be declare elements, as indicated on the document produced by the solver. Notice that the solver automatically annotates a pair of nodes related by the query: when the query is evaluated from a node marked with the attribute solver:context, the node marked with solver:target is selected. To evaluate the effect of this change, the counter example is filled with content and passed as an input parameter to the transformation. This shows immediately a bug in the transformation, as the resulting document is not a MathML 2.0 presentation document. Based on this analysis, we know that the XSLT template associated with the match pattern Q1 must be updated to cope with MathML evolution from version 1.0 to version 2.0.

The next test consists in evaluating the impact of the MathML type evolution for the query Q2 while excluding all new elements added in MathML 2.0 from the test. This identifies whether old elements of MathML 1.0 can be composed in MathML 2.0 in a different manner. This can be performed with the following command:

    new_content("Q2", "mathml.dtd", "mathml2.dtd", "math")
      & exclude(added_element(type("mathml.dtd", "math"),
                              type("mathml2.dtd", "math")))

The test result shows an example document that effectively combines MathML 1.0 elements in a way that was not allowed in MathML 1.0 but is permitted in MathML 2.0.

Similarly, the last test consists in evaluating the impact of the MathML type evolution for the query Q3, excluding all new elements added in MathML 2.0 and counter example documents containing declare elements (to avoid trivial counter examples):

    new_region("Q3", "mathml.dtd", "mathml2.dtd", "math")
      & exclude(added_element(type("mathml.dtd", "math"),
                              type("mathml2.dtd", "math")))
      & exclude(declare)

The counter example document produced by the solver illustrates a case where the sin element occurs in a new context. Applying the transformation on such examples yields documents which are neither MathML 1.0 nor MathML 2.0 valid. As a result, the stylesheet cannot be used safely over documents of the new type without modifications. In addition, the required changes to the stylesheet are not limited to the addition of new templates for MathML 2.0 elements. The templates that deal with the composition of MathML 1.0 elements should be revised as well.

All the previous tests were processed in less than 30 seconds on an ordinary laptop computer running Mac OS X. The 30s correspond to the most complex use cases. Most complex means analyzing recursive forward/backward and qualified queries such as Q3, under evolution of large and heavily recursive schemas such as XHTML and MathML (large number of type variables, elements and attributes: see Table 2). These are the hardest cases measured in practice with the implementation. Most other schemas and queries usually found in applications are much simpler than the ones presented in this paper and will obviously be solved much faster. Given the variety of schemas occurring in practice, we focused on the most complex W3C standard schemas. The accompanying full online implementation [Genevès and Layaïda 2009] allows running all the tests described in the paper as well as user-supplied ones. It shows intermediate compilation stages, generated formulae (in particular the translation of schemas into the logic), and reports on the performance of each step of the analysis.

6. Related Work

Schema evolution is an important topic and has been extensively explored in the context of relational, object-oriented, and XML databases. Most of the previous work for XML query reformulation is approached through reductions to relational problems [Beyer et al. 2005]. This is because schema evolution was considered as a storage problem where the priority consists in ensuring data consistency across multiple relational schema versions. In such settings, two distinct schemas and an explicit description of the mapping between them are assumed as input. The problem then consists in reformulating a query expressed in terms of one schema into a semantically equivalent query in terms of the other schema: see [Yu and Popa 2005] and more recently [Moon et al. 2008] with references thereof. In addition to the fundamental differences between XML and the relational data model, in the more general case of XML processing, schemas constantly evolve in a distributed, independent, and unpredictable environment. The relations between different schemas are not only unknown but hard to track. In this context, one priority is to help maintaining query consistency during these evolutions, which is still considered as a challenging problem [Sedlar 2005, Rose 2004]. The absence of evolution analysis tools for XML/XPath contrasts with the abundance of tools and methods routinely used in relational databases. The work found in [Moro et al. 2007] discusses the impact of evolving XML schemas on query reformulation. Based on a taxonomy of XML schema changes during their evolution, the authors provide informal – not exact nor systematic – guidelines for writing queries which are less sensitive to schema evolution. In fact, studying query reformulation requires at least the ability to analyze the relationship between queries. For this reason, a closely related work is the problem of determining query containment and satisfiability under type constraints [Benedikt et al. 2005, Colazzo et al. 2006, Genevès et al. 2007]. The work found in [Benedikt et al. 2005] studies the complexity of XPath emptiness and containment for various fragments (see [Benedikt and Koch 2006] and references thereof for a survey). In [Colazzo et al. 2004, 2006], a technique is presented for statically ensuring correctness of paths. The approach deals with emptiness of XPath expressions without reverse axes. The work presented in [Genevès et al. 2007] solves the more general problem of containment, including reverse axes.
The main distinctive idea pursued in this paper is to develop a logical approach for guiding schema and query evolution. In contrast to the previous use of logics for proving properties such as query emptiness or equivalence, the goal here is different in that we seek to provide the necessary tools to produce relevant knowledge when such relations do not hold. From a complexity point of view, it is worth noticing that the addition of predicates does not increase the complexity of the underlying logic shown in [Genevès et al. 2007]. We would also like to emphasize that, to the best of our knowledge, this work is the first to provide precise analyses of XML evolution that have been tested on real-life use cases (such as XHTML and MathML types) and complex queries (involving recursive and backward navigation). As a consequence, in this context, analysis tools such as type-checkers [Hosoya and Pierce 2003, Benzaken et al. 2003, Møller and Schwartzbach 2005, Gapeyev et al. 2006, Castagna and Nguyen 2008] do not match the expressiveness, typing precision, and analysis capabilities of the work presented here.
7. Conclusion

In this article, we present an application of a logical framework for verifying forward/backward compatibility issues caused by schema and query evolution. We provide evidence that such a framework can be successfully used to overcome the obstacles of the analysis of XML type and query evolution. This kind of analysis is widely considered as a challenging problem in XML programming. As mentioned earlier, the difficulty is twofold: first it requires dealing with large and complex language constructions such as XML types and queries, and second, it requires modeling and reasoning about evolution of such constructions. The presented tool allows XML designers to identify queries that need reformulation in order to produce the expected results across successive schema versions. With this tool designers can examine precisely the impact of schema changes over queries, therefore facilitating their reformulation. We gave illustrations of how to use the tool for both schema and query evolution on realistic examples. In particular, we considered typical situations in applications involving the evolution of W3C schemas such as XHTML and MathML. The tool can be very useful for standard schema writers and maintainers in order to assist them in enforcing some level of quality assurance on compatibility between versions.

There are a number of interesting extensions to the proposed system. In particular, the set of predicates can be easily enriched to detect more precisely the impact on queries. For example, one can extend the tagging to identify separately every navigation step and qualifier in a query expression. This will help greatly in the identification and reformulation of the navigation steps or qualifiers affected by schema evolution.

References

Michael Benedikt and Christoph Koch. XPath leashed. Submitted, 2006.

Michael Benedikt, Wenfei Fan, and Floris Geerts. XPath satisfiability in the presence of DTDs. In PODS '05, pages 25–36. ACM Press, 2005.

Véronique Benzaken, Giuseppe Castagna, and Alain Frisch. CDuce: An XML-centric general-purpose language. In ICFP '03: Proceedings of the Eighth ACM SIGPLAN International Conference on Functional Programming, pages 51–63, New York, NY, USA, 2003. ACM Press.

Kevin Beyer, Fatma Özcan, Sundar Saiprasad, and Bert Van der Linden. DB2/XML: designing for evolution. In SIGMOD '05, pages 948–952. ACM, 2005.

Giuseppe Castagna and Kim Nguyen. Typed iterators for XML. In ICFP '08, pages 15–26, 2008.

James Clark and Steve DeRose. XML path language (XPath) version 1.0, W3C recommendation, November 1999. http://www.w3.org/TR/1999/REC-xpath-19991116.

Dario Colazzo, Giorgio Ghelli, Paolo Manghi, and Carlo Sartiani. Types for path correctness of XML queries. In ICFP '04: Proceedings of the Ninth ACM SIGPLAN International Conference on Functional Programming, pages 126–137, New York, NY, USA, 2004. ACM Press.

Dario Colazzo, Giorgio Ghelli, Paolo Manghi, and Carlo Sartiani. Static analysis for path correctness of XML queries. J. Funct. Program., 16(4-5):621–661, 2006.

Vladimir Gapeyev, François Garillot, and Benjamin C. Pierce. Statically typed document transformation: An Xtatic experience. In PLAN-X 2006: Proceedings of the International Workshop on Programming Language Technologies for XML, volume NS-05-6 of BRICS Notes Series, pages 2–13, Aarhus, Denmark, January 2006. BRICS.

Pierre Genevès. Logics for XML. PhD thesis, Institut National Polytechnique de Grenoble, December 2006. http://www.pierresoft.com/pierre.geneves/phd.htm.

Pierre Genevès and Nabil Layaïda. The XML reasoning solver project, February 2009. http://wam.inrialpes.fr/xml.

Pierre Genevès, Nabil Layaïda, and Alan Schmitt. Efficient static analysis of XML paths and types. In PLDI '07, pages 342–351. ACM Press, 2007.

Pierre Genevès, Nabil Layaïda, and Alan Schmitt. Efficient static analysis of XML paths and types. Long version of [Genevès et al. 2007], Research Report 6590, INRIA, July 2008. http://hal.inria.fr/inria-00305302/en/.

Haruo Hosoya and Benjamin C. Pierce. XDuce: A statically typed XML processing language. ACM Trans. Inter. Tech., 3(2):117–148, 2003.

Haruo Hosoya, Jérôme Vouillon, and Benjamin C. Pierce. Regular expression types for XML. ACM TOPLAS, 27(1):46–90, 2005.

Anders Møller and Michael I. Schwartzbach. The design space of type checkers for XML transformation languages. In Proc. Tenth International Conference on Database Theory, ICDT '05, volume 3363 of LNCS, pages 17–36, London, UK, January 2005. Springer-Verlag.

Hyun J. Moon, Carlo A. Curino, Alin Deutsch, and Chien-Yi Hou. Managing and querying transaction-time databases under schema evolution. In VLDB '08, pages 882–895. VLDB Endowment, 2008.

Mirella M. Moro, Susan Malaika, and Lipyeow Lim. Preserving XML queries during schema evolution. In WWW '07, pages 1341–1342. ACM, 2007.

Makoto Murata, Dongwon Lee, Murali Mani, and Kohsuke Kawaguchi. Taxonomy of XML schema languages using formal language theory. ACM TOIT, 5(4):660–704, 2005.

Emmanuel Pietriga. MathML content2presentation transformation, May 2005. http://www.lri.fr/~pietriga/mathmlc2p/mathmlc2p.html.

Kristoffer H. Rose. The XML world view. In DocEng '04: Proceedings of the 2004 ACM Symposium on Document Engineering, pages 34–34, New York, NY, USA, 2004. ACM. http://www.research.ibm.com/XML/Rose-DocEng2004.pdf.

Eric Sedlar. Managing structure in bits & pieces: the killer use case for XML. In SIGMOD '05, pages 818–821. ACM, 2005.

Philip Wadler. Two semantics for XPath. Internal Technical Note of the W3C XSL Working Group, http://homepages.inf.ed.ac.uk/wadler/papers/xpath-semantics/xpath-semantics.pdf, January 2000.

Cong Yu and Lucian Popa. Semantic adaptation of schema mappings when schemas evolve. In VLDB '05, pages 1006–1017. VLDB Endowment, 2005.
Commutative Monads, Diagrams and Knots Dan Piponi Industrial Light & Magic, San Francisco [email protected]
Abstract
There is a diverse class of diagrams, found in a variety of branches of mathematics, which all share this property: there is a common scheme for translating all of these diagrams into useful functional code. These diagrams include Bayesian networks, quantum computer circuits [1], trace diagrams for multilinear algebra [3], Feynman diagrams and even knot diagrams [2]. I will show how a common thread lying behind these diagrams is the presence of a commutative monad and I will show how we can use this fact to translate these diagrams directly into Haskell code making use of do-notation for monads. I will also show a number of examples of such translated code at work and use it to solve problems ranging from Bayesian inference to the topological problem of untangling tangled strings. Along the way I hope to give a little insight into the subjects mentioned above and illustrate how a functional programming language can be a valuable tool in mathematical research and experimentation.
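As a rough sketch of the kind of translation involved (this code is illustrative only and is not taken from the talk; the Dist type and the coin example are invented for the purpose), a small commutative probability monad lets a two-input diagram be written directly with do-notation, where commutativity is what allows the bindings to be reordered freely, just as the wires of a diagram have no inherent sequential order:

    -- A distribution is a list of (value, weight) pairs.
    newtype Dist a = Dist { runDist :: [(a, Double)] }

    instance Functor Dist where
      fmap f (Dist xs) = Dist [ (f x, p) | (x, p) <- xs ]

    instance Applicative Dist where
      pure x = Dist [(x, 1)]
      Dist fs <*> Dist xs = Dist [ (f x, p * q) | (f, p) <- fs, (x, q) <- xs ]

    instance Monad Dist where
      Dist xs >>= k = Dist [ (y, p * q) | (x, p) <- xs, (y, q) <- runDist (k x) ]

    coin :: Dist Bool
    coin = Dist [(True, 0.5), (False, 0.5)]

    -- Two independent inputs combined with do-notation; because Dist is
    -- commutative, swapping the two bindings yields the same distribution.
    bothHeads :: Dist Bool
    bothHeads = do { a <- coin; b <- coin; return (a && b) }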
Categories and Subject Descriptors F.3.3 [Theory of Computation]: Studies of Program Constructs

General Terms Theory

Keywords diagrams, functional, monads, linear algebra, knot theory, Haskell
References [1] Coecke, Bob. Kindergarten Quantum Mechanics. http://arxiv.org/abs/quantph/0510032 [2] Kauffman, L.H.: Knots and Physics, 3rd edn. World Scientific (2001). [3] Steven Morse and Elisha Peterson,Trace Diagrams, Matrix Minors, and Determinant Identities, http://arxiv.org/abs/0903.1373
Copyright is held by the author/owner(s). ICFP ’09 August 31–September 2, 2009, Edinburgh, Scotland, UK. ACM 978-1-60558-332-7/09/08.
Generic Programming with Fixed Points for Mutually Recursive Datatypes Alexey Rodriguez Yakushev1
Stefan Holdermans2
Andres L¨oh2
Johan Jeuring2,3
1 Vector Fabrics B.V., Paradijslaan 28, 5611 KN Eindhoven, The Netherlands of Information and Computing Sciences, Utrecht University, P.O. Box 80.089, 3508 TB Utrecht, The Netherlands 3 School of Computer Science, Open University of the Netherlands, P.O. Box 2960, 6401 DL Heerlen, The Netherlands
2 Department
{stefan,andres,johanj}@cs.uu.nl
[email protected]
Abstract
level programming patterns to different datatypes, time and time again. Datatype-generic programming alleviates this burden by enabling programmers to write generic functions, i.e., functions that are defined once, but that can be used on many different datatypes. Over the years, a vast body of work has emerged on adding support for datatype-generic programming to mainstream functional programming languages, most notably Haskell (Peyton Jones 2003). While early proposals encompassed extending the underlying language with dedicated new constructs for generic programming (Jansson and Jeuring 1997; Hinze 2000a,b), recent approaches favour the definition of generic functions in Haskell itself using Haskell’s advanced type-class system (Cheney and Hinze 2002; Hinze 2004; L¨ammel and Peyton Jones 2003). The various approaches to generic programming generally differ in the expressivity of the generic functions that can be defined and the classes of datatypes that are supported. The most prominent example is that quite a number of generic functions operate on the recursive structure of datatypes, but most approaches do not provide access to the recursive positions in a datatype’s definition. The approaches that do provide access to these recursive positions are limited in the sense that they only apply to a restricted set of datatypes. In particular, the full recursive structure of families of mutually recursive datatypes is beyond the reach of these approaches. Still, many real-life applications of functional programming do involve mutually recursive datatypes, arguably the most striking example being the representation of abstract syntax trees in compilers. Moreover, the generic functions that arise in such applications typically require access to the full recursive structure of these types; examples include navigation (Huet 1997; Hinze et al. 2004; McBride 2008), unification (Jansson and Jeuring 1998), rewriting (Jansson and Jeuring 2000; Van Noort et al. 2008), and pattern matching (Jeuring 1995) and, more generally, recursion schemes such as fold and the like (Meijer et al. 1991) and downwards accumulations (Gibbons 2000). In this paper, we present an approach to datatype-generic programming embedded in Haskell that does enable the definition of generic functions over the full recursive structure of mutually recursive datatypes. Specifically, our contributions are the following:
Many datatype-generic functions need access to the recursive positions in the structure of the datatype, and therefore adopt a fixed point view on datatypes. Examples include variants of fold that traverse the data following the recursive structure, or the Zipper data structure that enables navigation along the recursive positions. However, Hindley-Milner-inspired type systems with algebraic datatypes make it difficult to express fixed points for anything but regular datatypes. Many real-life examples such as abstract syntax trees are in fact systems of mutually recursive datatypes and therefore excluded. Using Haskell’s GADTs and type families, we describe a technique that allows a fixed-point view for systems of mutually recursive datatypes. We demonstrate that our approach is widely applicable by giving several examples of generic functions for this view, most prominently the Zipper. Categories and Subject Descriptors D.1.1 [Programming Techniques]: Applicative (Functional) Programming; D.2.13 [Software Engineering]: Reusable Software—Reusable libraries; D.3.3 [Programming Languages]: Language Constructs and Features—Data types and structures General Terms
1.
Design, Languages
Introduction
One of the most ubiquitous activities in software development is structuring data. Indeed, many programming methods and development tools center around the creation of datatypes (or XML schemas, UML models, classes, grammars, et cetera). Once the structure of the data has been decided on, a programmer adds functionality to the datatypes. Here, there is always some functionality that is specific to a datatype – and part of the reason that the datatype has been designed in the first place. Other functionality is, however, generic and similar or even the same on many datatypes. Classic examples of such generic functionality are testing for equality, ordering, parsing, and pretty printing. Implementing generic functionality can be tiresome and, therefore, error-prone: it involves adapting and applying the same high-
• We show how to generalise the encoding of regular datatypes
as fixed points of functors (reviewed in Section 2) to arbitrary families of mutually recursive types. We make use of a single higher-order fixed point operator (Section 3).
• The functors for families of mutually recursive datatypes can be
constructed from a small set of combinators, thereby enabling datatype-generic programming (Section 4).
• We present several applications of generic programming in this
As before, in the instance for expressions,
setting, most notably the Zipper for mutually recursive types in Section 5 and generic rewriting in Section 6.
instance Regular Expr where from = fromExpr to = toExpr
Related work is presented in Section 7, and future work and conclusions in Section 8. A strength of our approach is that it can be readily implemented in Haskell, making use of language extensions such as type families (Schrijvers et al. 2008) and GADTs (Peyton Jones et al. 2006). The multirec and zipper libraries that are based on this paper can be obtained from HackageDB.
the shallow conversion functions fromExpr and toExpr are trivial to define. In order to establish that PFExpr really is a functor, we make it an instance of class Functor: instance Functor PFExpr where fmap f (ConstF i) = ConstF i fmap f (AddF e e0 ) = AddF (f e) (f e0 ) fmap f (MulF e e0 ) = MulF (f e) (f e0 )
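For concreteness, the shallow conversions might be written as follows (a sketch; the paper calls them trivial and does not spell them out):
fromExpr :: Expr → PFExpr Expr
fromExpr (Const i) = ConstF i
fromExpr (Add e1 e2) = AddF e1 e2
fromExpr (Mul e1 e2) = MulF e1 e2
toExpr :: PFExpr Expr → Expr
toExpr (ConstF i) = Const i
toExpr (AddF e1 e2) = Add e1 e2
toExpr (MulF e1 e2) = Mul e1 e2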
2. Fixed points for representing regular datatypes
Let us first review generic programming using fixed points for regular datatypes. While this is well-known, it serves not only as an introduction to the terminology we are using, but also as a template for our introduction of the more general case for families of mutually recursive types in Section 4. A functor is a type constructor of kind ∗ → ∗ for which we can define a map function. Fixed points are represented by an instance of the Fix datatype:
Given fmap, many recursion schemes can be defined, for example: fold :: (Regular a, Functor (PF a)) ⇒ (PF a r → r) → (a → r) fold f = f ◦ fmap (fold f) ◦ from unfold :: (Regular a, Functor (PF a)) ⇒ (r → PF a r) → (r → a) unfold f = to ◦ fmap (unfold f) ◦ f
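As a usage sketch (evalExpr and evalAlg are our own illustrative names, not from the paper), an evaluator for the arithmetic expressions can be written as a fold over the directly defined pattern functor:
evalExpr :: Expr → Int
evalExpr = fold evalAlg
  where
    evalAlg :: PFExpr Int → Int
    evalAlg (ConstF i) = i
    evalAlg (AddF x y) = x + y
    evalAlg (MulF x y) = x ∗ y
For example, evalExpr (Add (Const 1) (Mul (Const 2) (Const 3))) yields 7.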
data Fix f = In {out :: (f (Fix f))}
Note how the conversions in the class Regular allow us to work with the original datatype Expr rather than its fixed point representation Expr0 , but it is easy to define the deep conversion functions. For instance, fold In turns a regular datatype a into its fixed-point representation Fix (PF a). Another recursion scheme we can define is compos (Bringert and Ranta 2006). Much like fold, it traverses a data structure and performs operations on the children. There are different variants of compos, the simplest is equivalent to PolyP’s mapChildren (Jansson and Jeuring 1998): it applies a function of type a → a to all children. This parameter is also responsible for performing the recursive call, because compos itself is not recursive:
Haskell’s record notation is used to introduce the selector function out :: Fix f → f (Fix f).
2.1 Defining a pattern functor directly
Using Fix, we can represent the following datatype for simple arithmetic expressions data Expr = Const Int | Add Expr Expr | Mul Expr Expr by its pattern functor: data PFExpr r = ConstF Int | AddF r r | MulF r r type Expr0 = Fix PFExpr
compos :: (Regular a, Functor (PF a)) ⇒ (a → a) → a → a compos f = to ◦ fmap f ◦ from
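As a small usage sketch (negConsts is our own name): since compos only applies its argument to the immediate children, that argument decides where to stop and where to recurse, here negating every literal in an expression:
negConsts :: Expr → Expr
negConsts (Const i) = Const (negate i)
negConsts e = compos negConsts e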
The types Expr and Expr0 are isomorphic, and in Haskell we can witness this isomorphism by instantiating a class
class Regular a where deepFrom :: a → Fix (PF a) deepTo :: Fix (PF a) → a
2.2 Building functors systematically
The approach presented above still requires us to write fmap by hand for every datatype. Furthermore, other applications such as navigation or rewriting require functions defined on the pattern functor that cannot directly be derived from fmap. Thus, having to write fmap and such other functions manually for each datatype is undesirable. Fortunately, it is also unnecessary. In the following, we present a fixed set of datatypes that can be used to construct pattern functors systematically:
where type family PF a :: ∗ → ∗ The type family PF is an open type-level function mapping a regular type a of kind ∗ to its pattern functor PF a of kind ∗ → ∗. We can instantiate it by saying
data K a r = K a
data I r = I r
data (f :×: g) r = f r :×: g r
data (f :+: g) r = L (f r) | R (g r)
infixr 7 :×:
infixr 6 :+:
type instance PF Expr = PFExpr
The functions deepFrom and deepTo are straightforward to define. In practice, converting between a datatype and its fixed-point representation occurs often when programming generically, and traversing the whole value as required by deepFrom and deepTo is often more work than is actually required. We therefore present an alternative correspondence, making use of the isomorphism
a ≅ Fix (PF a) ≅ (PF a) (Fix (PF a)) ≅ (PF a) a
The type K is used to represent occurrences of constant types, such as Int and Bool in the Expr example. The type I represents recursive positions. Using :×:, we can combine different fields of a constructor, and with :+:, we can combine constructors. Using the above datatypes, we can thus represent the pattern functor of Expr as follows:
This means that we relate a to its one-layer unfolding PF a a (a shallow conversion). We redefine class Regular to use the following conversion functions from and to:
type PFExpr = K Int :+: (I :×: I) :+: (I :×: I) type Expr0 = Fix PFExpr
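To illustrate the encoding (exampleExpr0 is our own name), the expression Add (Const 1) (Const 2) is represented in this fixed-point view as:
exampleExpr0 :: Expr0
exampleExpr0 = In (R (L (I (In (L (K 1))) :×: I (In (L (K 2))))))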
class Regular a where from :: a → PF a a to :: PF a a → a
Datatypes, such as Expr, whose recursive structure can be represented by a polynomial functor (consisting of sums, products and
with families such as this representation of an abstract syntax tree, we will now demonstrate how to generalize the representation of datatypes as fixed points of functors to such families.
constants) are often called regular datatypes. The uniform encoding allows us to define functions that work on all regular datatypes. In particular, we can now define a generic map function by declaring the following instances of the class Functor:
class Functor f where
  fmap :: (a → b) → f a → f b
instance Functor I where
  fmap f (I x) = I (f x)
instance Functor (K a) where
  fmap _ (K x) = K x
instance (Functor f, Functor g) ⇒ Functor (f :+: g) where
  fmap f (L x) = L (fmap f x)
  fmap f (R y) = R (fmap f y)
instance (Functor f, Functor g) ⇒ Functor (f :×: g) where
  fmap f (x :×: y) = fmap f x :×: fmap f y
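As a quick check (fmapDemo is our own name), mapping over one layer of the structural representation of Add increments both recursive positions:
fmapDemo :: PFExpr Int
fmapDemo = fmap (+1) (R (L (I 1 :×: I 2)))
-- fmapDemo equals R (L (I 2 :×: I 3))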
data Fix2 f g = In2 (f (Fix2 f g) (Fix2 g f)) We can easily generalize this idea further and define Fix3 , Fix4 and so on. Depending on whether we want to count Var as a full member of our abstract syntax tree family or not, we can then use either Fix2 or Fix3 to represent such a family as a fixed point of functors. The problem, however, is that we can also no longer use I, K, :+: and :×: to construct functors for arbitrary families systematically. Instead, it turns out that we require new variants of I, K, :+: and :×: for each arity. In the end, we have to rework our entire generic programming machinery for each arity of family we want to support, defeating the very purpose of generic programming. Furthermore, families of datatypes can be very large, and we cannot hope that supporting a limited amount of arities will suffice in practice.
With these declarations, we obtain fmap on PFExpr for free. Similarly, we get fmap on all datatypes as long as we express them as fixed points of pattern functors using K, I, :×: and :+:. Using fmap, we get fold, unfold and compos for free on all these datatypes. By providing one structural representation of a datatype (the instantiations of PF and Regular), we gain access to a multitude of powerful functions, and can easily define more. Being able to convert between the original datatype such as Expr and the fixed point Expr0 or the one-layer unfolding PF Expr Expr now becomes much more important, because for application-specific, non-generic functions, we want to be able to use the original constructor names rather than sequences of constructor applications for the representation. This is reflected in the fact that the conversion functions fromExpr and toExpr , while still being entirely straightforward, now become more verbose. Here is fromExpr as an example:
3.2 A uniform way to represent fixed points
At first, it looks like we cannot easily abstract over the arities of fixed-points in Haskell. However, it is well-known that an n-tuple of types can be encoded as a function taking an index (between 0 and n − 1) to a type. In other words, we have an isomorphism between the kinds ∗ⁿ and n → ∗ provided that n is a kind with exactly n different inhabitants. Let us apply this idea to fixed points. A fixed point of a single datatype is given by
fromExpr :: Expr → PF Expr Expr fromExpr (Const i) = L (K i) fromExpr (Add e e0 ) = R (L (I e :×: I e0 )) fromExpr (Mul e e0 ) = R (R (I e :×: I e0 ))
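A corresponding sketch of toExpr (not shown in the paper) simply inverts these equations:
toExpr :: PF Expr Expr → Expr
toExpr (L (K i)) = Const i
toExpr (R (L (I e1 :×: I e2))) = Add e1 e2
toExpr (R (R (I e1 :×: I e2))) = Mul e1 e2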
Fix :: (∗ → ∗) → ∗ For a family of two datatypes, we need two applications of
To facilitate the conversion, some generic programming languages automatically generate mappings that relate datatypes such as Expr with their structure representation counterparts (Expr0 ) (Jansson and Jeuring 1997; Löh 2004; Holdermans et al. 2006). In this case – if we do not want to extend the compiler – we can use a meta-programming tool such as Template Haskell (Sheard and Peyton Jones 2002) to generate the PF and Regular instances for a datatype.
3.1 Fixed points for a specific number of datatypes
Swierstra et al. (1999) have shown how to represent a family of two mutually recursive types as a fixed point in Haskell. The idea is to introduce a different fixed point datatype that abstracts over bifunctors of kind ∗ → ∗ → ∗ rather than functors of kind ∗ → ∗:
Fix2 :: (∗ → ∗ → ∗) → (∗ → ∗ → ∗) → ∗
which, modulo currying, is the same as
Fix2 :: (∗² → ∗)² → ∗
For a family of n datatypes, we need n applications of
Fixn :: (∗ⁿ → ∗)ⁿ → ∗
or a single application of
Fixn :: ((∗ⁿ → ∗)ⁿ → ∗)ⁿ
3. Fixed points for mutually recursive datatypes
Applying the isomorphism between ∗ⁿ and n → ∗, we get
In Section 2, we have shown how we can generically program with regular datatypes by expressing them as fixed points of functors. In practice, one often has to deal with large families of mutually recursive datatypes, which are not regular. As an example, consider the following extended version of our Expr datatype:
Fixn :: n → (n → ((n → ∗) → ∗)) → ∗ Reordering the arguments reveals that we have really generalized from a fixed point for kind ∗ to a fixed point for n → ∗: Fixn :: ((n → ∗) → (n → ∗)) → (n → ∗)
data Expr = Const Int | Add Expr Expr | Mul Expr Expr | EVar Var | Let Decl Expr data Decl = Var := Expr | Seq Decl Decl type Var = String
Apart from the fact that n is not available in Haskell’s kind system, we now have a uniform representation of a fixed-point combinator that is suitable to express arbitrary families of datatypes. Fortunately, the remaining gap is easy to bridge, as we show in the next section.
We now have two datatypes that are mutually recursive, Expr and Decl. Both make use of a third type Var. In order to deal
4. Indexed fixed points in Haskell
After having presented the idea of how to get a uniform representation of fixed points, we are now going to explain how to make use
instance El AST Expr where proof = Expr instance El AST Decl where proof = Decl instance El AST Var where proof = Var
of this idea in Haskell. We develop a library for generic programming with families of mutually recursive types much in the same style as we did in Section 2 for regular datatypes. We are going to use the family of abstract syntax trees from the introduction of Section 3 as our running example.
4.2 Defining a pattern functor directly
Before we discuss how to represent functors of families generically, let us show how we can represent our family for abstract syntax trees as a fixed point in terms of HFix directly. The functor for AST can be defined as follows:
data PFAST :: (∗AST → ∗) → (∗AST → ∗) where
  ConstF :: Int → PFAST r Expr
  AddF :: r Expr → r Expr → PFAST r Expr
  MulF :: r Expr → r Expr → PFAST r Expr
  EVarF :: r Var → PFAST r Expr
  LetF :: r Decl → r Expr → PFAST r Expr
  BindF :: r Var → r Expr → PFAST r Decl
  SeqF :: r Decl → r Decl → PFAST r Decl
  VF :: String → PFAST r Var
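To illustrate the representation (exprVal is our own name), the expression EVar "x" becomes the following value of the higher-order fixed point:
exprVal :: HFix PFAST Expr
exprVal = HIn (EVarF (HIn (VF "x")))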
4.1 Encoding indexed fixed points in Haskell’s kind system
First, we have to find a way to encode n in Haskell’s kind system, where n is supposed to be a kind that has exactly n types as inhabitants. Haskell offers just one base kind, namely ∗, so we are left with little choice. However, we can simply approximate n by ∗ in Haskell, as long as we promise to instantiate ∗ with only n different types. In practice, if we have a family ϕ with n different types, we use the types in the family themselves as the indices to instantiate such positions of ∗. In this paper, we will write ∗ϕ rather than ∗ for such positions in order to make it more explicit that we are using a virtual subkind of ∗ that only consists of the members of family ϕ. Thus, our uniform fixed-point combinator now has kind HFix :: ((∗ϕ → ∗) → (∗ϕ → ∗)) → (∗ϕ → ∗)
The parameter r is used to denote a recursive call. At each recursive position, we apply r to the appropriate index in order to indicate the type we recurse on. Furthermore, each constructor of the functor targets a specific member of the family. By using HFix on the pattern functor, we obtain types that are isomorphic to the original family:
and can be defined in Haskell as data HFix (f :: (∗ϕ → ∗) → (∗ϕ → ∗)) (ix :: ∗ϕ ) = HIn (f (HFix f) ix) In our abstract syntax tree example, we have a family that we choose to call AST with three different types, and we are going to write ∗AST for the subkind of ∗ consisting only of the types Expr, Decl, and Var. We go even further and introduce a family-specific GADT (that we also call ϕ) and define it such that a value of ϕ ix can serve as a proof that ix is a type that belongs to ϕ. Whenever we quantify over a variable of kind ∗ϕ , we will pass such a value of type ϕ ix to make explicit that we quantify over a limited set of types. For the example, we introduce the GADT
type Expr0 = HFix PFAST Expr type Decl0 = HFix PFAST Decl type Var0 = HFix PFAST Var The isomorphisms can be witnessed by conversion functions once more, and for this purpose, we declare a class Family that corresponds to Regular: class Family ϕ where from :: ϕ ix → ix → PF ϕ I∗ ix to :: ϕ ix → PF ϕ I∗ ix → ix type family PF (ϕ :: ∗ϕ → ∗) :: (∗ϕ → ∗) → (∗ϕ → ∗)
data AST :: ∗AST → ∗ where Expr :: AST Expr Decl :: AST Decl Var :: AST Var
Like in the class Regular, we decide to implement a shallow conversion rather than a deep conversion. Note that all conversion functions take a ϕ ix as first argument, as proof that ix is indeed a member of ϕ. In the pattern functor, we have to describe the type of the recursive positions by means of a datatype of kind ∗ϕ → ∗. The one-layer unfolding uses the original datatypes of the family in the recursive positions, and we express this by choosing I∗ :
such that a value of AST ix serves as a proof that ix is a member of the AST family. One example where we make use of explicit proofs is when defining a map function for higher-order functors. Since the type has changed, we have to define a new class
data I∗ (ix :: ∗ϕ ) = I∗ {unI∗ :: ix}
class HFunctor (ϕ :: ∗ϕ → ∗) (f :: (∗ϕ → ∗) → (∗ϕ → ∗)) where hmap :: (∀ix.ϕ ix → r ix → r0 ix) → ϕ ix → f r ix → f r0 ix
The type I∗ behaves as the identity on types so that recursive occurrences inside the functor are stored “as is”. Although the definition of I∗ is essentially the same as that of I in Section 2, we give it a different name to highlight that we are using it conceptually at kind ∗ϕ → ∗ rather than kind ∗ → ∗, even though the two kinds coincide in the Haskell code. Here is the Family instance of AST:
The function hmap has a rank-2 type. The function that is mapped is quantified over all members ix of family ϕ. If for every index ix in ϕ, this function transforms an r ix into an r0 ix, then we can transform a functor with recursive calls given by r into a functor with recursive calls given by r0 . It is perhaps instructive to note that if ϕ is a family consisting of only one type, there will be only one choice for ϕ ix, and the type of hmap reduces to the type of fmap for regular functors. Instead of using explicit proofs of type ϕ ix, it is sometimes helpful to use a type class
type instance PF AST = PFAST instance Family AST where from = fromAST to = toAST The functions fromAST and toAST are straightforward and not given here. We can now go on to define a HFunctor instance and subsequently recursion schemes such as fold and unfold for PFAST . However, since we strive for programming generically with families of datatypes, we want to avoid having to define HFunctor man-
class El (ϕ :: ∗ϕ → ∗) (ix :: ∗ϕ ) where proof :: ϕ ix and then use an implicit class constraint El ϕ ix instead of a value of type ϕ ix. For the AST family, we define the following instances:
fromAST Decl (x := e) = R (R (R (R (R (L (Tag (ci x :×: ci e))))))) fromAST Decl (Seq d d0 ) = R (R (R (R (R (R (L (Tag (ci d :×: ci d0 )))))))) fromAST Var x = R (R (R (R (R (R (R (Tag (K x)))))))) ci x = I (I∗ x)
ually for our family. Instead, we will try – as we have before in Section 2 – to build our functor systematically from a fixed set of datatypes.
4.3 Building functors systematically
It turns out that we can use almost the same datatypes as before to represent functors. The datatypes K, :×:, and :+: can be lifted from being parameterized over an r of kind ∗ to being parameterized over an r of kind ∗ϕ → ∗ and an index ix of kind ∗ϕ :
data K a (r :: ∗ϕ → ∗) (ix :: ∗ϕ ) = K a data (f :+: g) (r :: ∗ϕ → ∗) (ix :: ∗ϕ ) = L (f r ix) | R (g r ix) data (f :×: g) (r :: ∗ϕ → ∗) (ix :: ∗ϕ ) = f r ix :×: g r ix
instance El ϕ xi ⇒ HFunctor ϕ (I xi) where
  hmap f _ (I x) = I (f proof x)
instance HFunctor ϕ (K a) where
  hmap _ _ (K x) = K x
instance (HFunctor ϕ f, HFunctor ϕ g) ⇒ HFunctor ϕ (f :+: g) where
  hmap f p (L x) = L (hmap f p x)
  hmap f p (R y) = R (hmap f p y)
instance (HFunctor ϕ f, HFunctor ϕ g) ⇒ HFunctor ϕ (f :×: g) where
  hmap f p (x :×: y) = hmap f p x :×: hmap f p y
instance HFunctor ϕ f ⇒ HFunctor ϕ (f :.: ix) where
  hmap f p (Tag x) = Tag (hmap f p x)
The type I has been used to represent a recursive call. In the current situation, recursive calls can be to a specific index in the family. Therefore, I gets an additional argument xi :: ∗ϕ that is used to determine the recursive call to make: data I (xi :: ∗ϕ ) (r :: ∗ϕ → ∗) (ix :: ∗ϕ ) = I (r xi) It is perhaps surprising that xi is different from ix. But where ix projects out a certain member of the family, the type of the recursive call is independent of the type we are ultimately interested in. In fact, we have not yet used the parameter ix anywhere. If we look at the direct definition of PFAST , we see that depending on the index we choose to project out of the functor, we get different functors. Only the first five constructors of PFAST contribute to PFAST r Expr, for example. We introduce another combinator for pattern functors in order to express such constraints on the index:
Despite our generalization, the code for hmap looks almost completely identical to the code for fmap. We need an additional, but trivial case for (:.:). A slight change occurs in the case for I, where we additionally have to require that the recursive call is actually in our family via El ϕ xi, to be able to pass the required proof to the mapped function f.
infix 6 :.:
data (f :.: (xi :: ∗ϕ)) (r :: ∗ϕ → ∗) (ix :: ∗ϕ) where
  Tag :: f r xi → (f :.: xi) r xi
By tagging a functor with an index from the family, we make explicit that the tagged part only contributes to the structure of that particular member of the family. We now have all the combinators we need to give a structural representation of the AST pattern functor:
type PFAST = K Int                 :.: Expr  -- Const
         :+: (I Expr :×: I Expr)   :.: Expr  -- Add
         :+: (I Expr :×: I Expr)   :.: Expr  -- Mul
         :+: I Var                 :.: Expr  -- EVar
         :+: (I Decl :×: I Expr)   :.: Expr  -- Let
         :+: (I Var :×: I Expr)    :.: Decl  -- :=
         :+: (I Decl :×: I Decl)   :.: Decl  -- Seq
         :+: K String              :.: Var   -- V
4.4 Generic hmap
We still have to establish that our new functor combinators are actually higher-order functors themselves:
4.5 Generic compos
Using hmap, it is easy to define compos: compos :: (Family ϕ, HFunctor ϕ (PF ϕ)) ⇒ (∀ix.ϕ ix → ix → ix) → ϕ ix → ix → ix compos f p = to p ◦ hmap (λ p → I∗ ◦ f p ◦ unI∗ ) p ◦ from p
The only differences to the version in Section 2 are due to the presence of explicit proof terms of type ϕ ix and because the actual values in the structure are now wrapped in applications of the I∗ constructor. Bringert and Ranta (2006) describe in their paper on compos how to define the function on families of mutually recursive datatypes. Their solution, however, requires modifying the family of datatypes and rewriting them as a single GADT. Our version of compos works on families of mutually recursive datatypes without modification. As an example use of compos, consider the following expression:
example = Let ("x" := Mul (Const 6) (Const 9)) (Add (EVar "x") (EVar "y"))
To match the structure of the direct definition of PFAST more closely, we have chosen to tag the representation of every constructor with the index it targets. Alternatively, we could have tagged the sum of all constructors of a type just once. If we use the structural version of PFAST in the Family instance, we have to adapt the conversion functions. Again, these are straightforward, but lengthy. We only show fromAST : fromAST :: AST ix → ix → PFAST I∗ ix fromAST Expr (Const i) = L (Tag (K i)) fromAST Expr (Add e e0 ) = R (L (Tag (ci e :×: ci e0 ))) fromAST Expr (Mul e e0 ) = R (R (L (Tag (ci e :×: ci e0 )))) fromAST Expr (EVar x) = R (R (R (L (Tag (ci x))))) fromAST Expr (Let d e) = R (R (R (R (L (Tag (ci d :×: ci e))))))
The following function renames all variables in example – note how renameVar0 can use the type representation to take different actions for different nodes – in this case, filter out nodes of type Var. renameVar :: Expr → Expr renameVar = renameVar0 Expr where renameVar0 :: AST a → a → a renameVar0 Var x = x ++ "_" renameVar0 p x = compos renameVar0 p x The call renameVar example yields:
instance Apply g ⇒ Apply (I xi :×: g) where apply f (I x :×: y) = apply (f x) y instance Apply f ⇒ Apply (f :.: xi) where apply f (Tag x) = apply f x
Let ("x_" := Mul (Const 6) (Const 9)) (Add (EVar "x_") (EVar "y_"))
4.6 Generic fold
We can further facilitate the construction of algebras by defining an infix operator for pairing:
We can also define fold using hmap. Again, the definition is very similar to the single-datatype version:
infixr 1 & (&) = (, )
type Algebra ϕ r = ∀ix.ϕ ix → PF ϕ r ix → r ix fold :: (Family ϕ, HFunctor ϕ (PF ϕ)) ⇒ Algebra ϕ r → ϕ ix → ix → r ix fold f p = f p ◦ hmap (λ p (I∗ x) → fold f p x) p ◦ from p
As an example, let us specify an evaluator on our abstract syntax tree types using an algebra. Because different types in our family are mapped to different results, we need another family of datatypes for the result type of our algebra:
Using fold is slightly trickier than using compos, because we have to construct a suitable argument of type Algebra. This algebra argument involves a function operating on the pattern functor, which is itself a generically derived datatype. We therefore have to write a function that destructs a sum of products, where the fields in the products are wrapped by occurrences of K or I. It is much more natural to define an algebra by giving one function per constructor, with the functions taking as many arguments as there are fields, preferably even in a curried style. This problem is not caused by having families of many datatypes. The generic programming language PolyP (Jansson and Jeuring 1997) has a special ad-hoc construct that helps in defining algebras in a convenient style. We can do better: in the following, we will define a type-indexed datatype (Hinze et al. 2004) for algebras, as a type family inductively defined over the structure of functors. We can then define algebras in a convenient style, and use them in a generic fold. The type-indexed datatype Alg is defined as follows:
data family Value a :: ∗
data instance Value Expr = EV (Env → Int)
data instance Value Decl = DV (Env → Env)
data instance Value Var = VV Var
type Env = [(Var, Int)]
An environment maps variables to integers. Expressions can contain variables; we therefore interpret them as functions from environments to integers. Declarations can be seen as environment transformers. Variables evaluate to their names. We can now state the algebra:
evalAlg :: Algebra AST Value
evalAlg = const (apply
  (    (λ x → EV (const x))                           -- Const
     & (λ (EV x) (EV y) → EV (λ m → x m + y m))       -- Add
     & (λ (EV x) (EV y) → EV (λ m → x m ∗ y m))       -- Mul
     & (λ (VV x) → EV (fromJust ◦ lookup x))          -- EVar
     & (λ (DV e) (EV x) → EV (λ m → x (e m)))         -- Let
     & (λ (VV x) (EV v) → DV (λ m → (x, v m) : m))    -- :=
     & (λ (DV f) (DV g) → DV (g ◦ f))                 -- Seq
     & (λ x → VV x)))                                 -- V
type family Alg (f :: (∗ϕ → ∗) → ∗ϕ → ∗) (r :: ∗ϕ → ∗) (ix :: ∗) :: ∗ type instance Alg (K a) r ix = a → r ix type instance Alg (I xi) r ix = r xi → r ix type instance Alg (f :+: g) r ix = (Alg f r ix, Alg g r ix) type instance Alg (K a :×: g) r ix = a → Alg g r ix type instance Alg (I xi :×: g) r ix = r xi → Alg g r ix type instance Alg (f :.: xi) r ix = Alg f r xi
Testing
eval :: Expr → Env → Int
eval x = let (EV f) = fold evalAlg Expr x in f
in the expression eval example [("y", −12)] yields 42.
The definition shows how we want to define our algebras: Occurrences of K and I are unwrapped. An algebra on a sum is a pair of algebras on the components. In the product case, we make use of knowledge on how datatypes are built: products are always nested to the right, and the left components are always fields, either wrapped by K or I. Hence, we can give two cases that allow us to turn algebras on a product into curried functions. The case for tags simply recurses. We then have to show that we can transform such a more convenient algebra into the form that fold expects. To this end, we define the generic function apply:
4.7 Summary
We have now introduced a library for generic programming on families of mutually recursive types. The library consists of the type family PF, the classes Family and El, and the functor constructors I, K, :+:, :×:, and :.:. Furthermore, the library contains classes and instances for a number of generic functions, such as all the HFunctor code, the definitions of compos, fold and unfold. To use the library for a specific family a user has to do the following: define a GADT such as AST, instantiate the type family PF to the pattern functor, and construct Family and El instances. This may still seem a significant amount of work, but all of this code is entirely straightforward and can easily be automated. In fact, we have implemented the generation of most of this boilerplate code in Template Haskell, so that only the definition of the GADT and a call to a Template Haskell function remains. Once the library is instantiated, all generic functions that are provided by the library are available for this family without any further work.
class Apply (f :: (∗ϕ → ∗) → ∗ϕ → ∗) where apply :: Alg f r ix → f r ix → r ix instance Apply (K a) where apply f (K x) = f x instance Apply (I xi) where apply f (I x) = f x instance (Apply f, Apply g) ⇒ Apply (f :+: g) where apply (f, g) (L x) = apply f x apply (f, g) (R x) = apply g x instance Apply g ⇒ Apply (K a :×: g) where apply f (K x :×: y) = apply (f x) y
5. The Zipper
For a tree-like datatype, the Zipper (Huet 1997) is a derived data structure that allows efficient navigation through a tree, along its
LetC1 :: Expr → CtxAST Expr Decl
recursive nodes. At every moment, the Zipper keeps track of a location: a point of focus paired with a context that represents the rest of the tree. The focus can be moved up, down, left, and right. For regular datatypes, it is well-known how to define Zippers generically (Hinze et al. 2004). In the following, we first show how to define a Zipper for a system of mutually recursive datatypes using our example of abstract syntax trees (Section 5.1). Then, in Section 5.2, we give a generic algorithm in terms of the representations introduced in Section 4.3.
If, however, we descend into the second position, then Expr is the type of the hole with Decl remaining: LetC2 :: Decl → CtxAST Expr Expr
5.1 Zipper for mutually recursive datatypes
We first give a non-generic presentation of the Zipper for abstract syntax trees as defined in Section 3. A location is the current focus paired with context information. In a setting with multiple types, the type of the focus ix is not known – hence, we make it existential, and carry around a representation of type AST ix: data LocAST :: ∗AST → ∗ where Loc :: AST ix → ix → CtxsAST a ix → LocAST a
down :: LocAST ix → Maybe (LocAST ix)
down (Loc Expr (Add e e0) cs) = Just (Loc Expr e (Cons (AddC1 e0) cs))
down (Loc Expr (Mul e e0) cs) = Just (Loc Expr e (Cons (MulC1 e0) cs))
down (Loc Expr (EVar x) cs) = Just (Loc Var x (Cons EVarC cs))
down (Loc Expr (Let d e) cs) = Just (Loc Decl d (Cons (LetC1 e) cs))
down (Loc Decl (x := e) cs) = Just (Loc Var x (Cons (BindC1 e) cs))
down (Loc Decl (Seq d d0) cs) = Just (Loc Decl d (Cons (SeqC1 d0) cs))
down _ = Nothing
The type CtxsAST encodes context information for the focus as a path from the focus to the root of the full tree. The path is stored in a stack of context frames: data CtxsAST :: ∗AST → ∗AST → ∗ where Nil :: CtxsAST a a Cons :: CtxAST ix b → CtxsAST a ix → CtxsAST a b A context stack of type CtxsAST a b represents a value of type a with a b-typed hole in it. More specifically, a stack consists of frames of type CtxAST ix b that represent constructor applications that yield an ix-value with a hole of type b in it. The full tree that is represented by a location can be recovered by plugging the value in focus into the topmost context frame, plugging the resulting value into the next frame, and so on. For this to work, the target type ix of each context frame must be equal to the type of the hole in the remainder of the stack – as enforced by the type of Cons.
5.1.2 Navigation
We now define functions that move the focus, transforming a location into a new location. These functions return their result in the Maybe monad, because navigation may fail: we cannot move down from a leaf of the tree, up from the root, or right if there are no more siblings in that direction. Moving down analyzes the current focus. For all constructors that do not build leaves, we descend into the leftmost child by making it the new focus, and by pushing an appropriate frame onto the context stack. For leaves, we return Nothing.
The function right succeeds for nodes that actually have a right sibling. The size of the context stack remains unchanged: we just replace its top element with a new frame.
5.1.1 Contexts
right :: LocAST ix → Maybe (LocAST ix)
right (Loc _ e (Cons (AddC1 e0) cs)) = Just (Loc Expr e0 (Cons (AddC2 e) cs))
right (Loc _ e (Cons (MulC1 e0) cs)) = Just (Loc Expr e0 (Cons (MulC2 e) cs))
right (Loc _ d (Cons (LetC1 e) cs)) = Just (Loc Expr e (Cons (LetC2 d) cs))
right (Loc _ x (Cons (BindC1 e) cs)) = Just (Loc Expr e (Cons (BindC2 x) cs))
right (Loc _ d (Cons (SeqC1 d0) cs)) = Just (Loc Decl d0 (Cons (SeqC2 d) cs))
right _ = Nothing
A single context frame CtxAST is following the structure of the types in the AST system closely. data CtxAST :: ∗AST → ∗AST → ∗ where AddC1 :: Expr → CtxAST Expr Expr AddC2 :: Expr → CtxAST Expr Expr MulC1 :: Expr → CtxAST Expr Expr MulC2 :: Expr → CtxAST Expr Expr EVarC :: CtxAST Expr Var LetC1 :: Expr → CtxAST Expr Decl LetC2 :: Decl → CtxAST Expr Expr BindC1 :: Expr → CtxAST Decl Var BindC2 :: Var → CtxAST Decl Expr SeqC1 :: Decl → CtxAST Decl Decl SeqC2 :: Decl → CtxAST Decl Decl
The function left is very similar to right. Finally, the function up is applicable whenever the current focus is not the root of the tree, i.e., whenever the context stack is non-empty. We then analyze the top context frame and plug in the old focus, yielding the new focus, and retain the rest of the context. The definition is omitted for reasons of space.
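A sketch of the omitted up for this family (our reconstruction; each equation plugs the focus back into the topmost frame):
up :: LocAST ix → Maybe (LocAST ix)
up (Loc _ _ Nil) = Nothing
up (Loc _ e (Cons (AddC1 e2) cs)) = Just (Loc Expr (Add e e2) cs)
up (Loc _ e (Cons (AddC2 e1) cs)) = Just (Loc Expr (Add e1 e) cs)
up (Loc _ e (Cons (MulC1 e2) cs)) = Just (Loc Expr (Mul e e2) cs)
up (Loc _ e (Cons (MulC2 e1) cs)) = Just (Loc Expr (Mul e1 e) cs)
up (Loc _ x (Cons EVarC cs)) = Just (Loc Expr (EVar x) cs)
up (Loc _ d (Cons (LetC1 e) cs)) = Just (Loc Expr (Let d e) cs)
up (Loc _ e (Cons (LetC2 d) cs)) = Just (Loc Expr (Let d e) cs)
up (Loc _ x (Cons (BindC1 e) cs)) = Just (Loc Decl (x := e) cs)
up (Loc _ e (Cons (BindC2 x) cs)) = Just (Loc Decl (x := e) cs)
up (Loc _ d (Cons (SeqC1 d2) cs)) = Just (Loc Decl (Seq d d2) cs)
up (Loc _ d (Cons (SeqC2 d1) cs)) = Just (Loc Decl (Seq d1 d) cs)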
The relation between CtxAST and AST becomes even more pronounced if we also look at the directly defined pattern functor PFAST from Section 4.2. For every constructor in PFAST , we have as many constructors in CtxAST as there are recursive positions. We can descend into a recursive position. The type of the recursive position then becomes the type of the hole, the second argument of CtxAST . The other components of the original constructor are stored in the context. As an example, consider:
5.1.3 Using the Zipper
To use the Zipper, we need functions to turn syntax trees into locations, and back again. For manipulating trees, we provide an update operation that replaces the subtree in focus. To enter the tree, we place it into the empty context:
Let :: Decl → Expr → Expr LetF :: r Decl → r Expr → PFAST r Expr We have two recursive positions. If we descend into the first, then Decl is the type of the hole, while Expr remains – and so we get
enter :: AST ix → ix → LocAST ix enter p e = Loc p e Nil
To leave, we move up as far as possible and then return the expression in focus.
leave :: LocAST Expr → Expr
leave (Loc _ e Nil) = e
leave loc = leave (fromJust (up loc))
data instance Ctx (K a) ix b = CK Void data instance Ctx (f :+: g) ix b = CL (Ctx f ix b) | CR (Ctx g ix b) data instance Ctx (f :×: g) ix b = C1 (Ctx f ix b) (g I∗ ix) | C2 (f I∗ ix) (Ctx g ix b)
To update the tree, we pass in a function capable of modifying the current point of focus. Because the value in focus can have different types, this function needs to be parameterized by the type representation.
For constants, there are no recursive positions, hence we produce an empty datatype, i.e., a datatype with no constructors: data Void For a sum, we are given either an f or a g, and compute the context of that. For a product, we can descend either left or right. If we descend into f, we pair a context for f with g. If we descend into g, we pair f with a context for g. We are left with the cases for I and (:.:). According to the analogy with the derivative, the context of the identity should be the unit type. However, we are in a situation where there are multiple types involved. The type index of I fixes the type of the hole. We express this type equality as follows, by means of a GADT:1
update :: (∀ix.AST ix → ix → ix) → LocAST Expr → LocAST Expr update f (Loc p x cs) = Loc p (f p x) cs As an example, we modify the multiplication in example = Let ("x" := Mul (Const 6) (Const 9)) (Add (EVar "x") (EVar "y")) To combine the navigation and edit operations, it is helpful to make use of flipped function composition (>>>) :: (a → b) → (b → c) → (a → c) and monadic composition (>=>) :: Monad m ⇒ (a → m b) → (b → m c) → (a → m c). The call
data instance Ctx (I xi) ix b where CId :: Ctx (I xi) ix xi For the case of tags, we have a similar situation. A tag does not affect the structure of the context, it only provides information for the type system. In this case, not the type of the hole, but the type of the context itself is required to match the type index of the tag:
enter Expr >>> down >=> down >=> right >=> update solve >>> leave >>> return $ example with solve :: AST ix → ix → ix solve Expr = Const 42 solve x =x
data instance Ctx (f :.: xi) ix b where CTag :: Ctx f xi b → Ctx (f :.: xi) xi b This completes the definition of Ctx. We can convince ourselves that instantiating Ctx to PF AST results in a datatype that is isomorphic to CtxAST . It is also quite a bit more complex than the hand-written variant, but fortunately, the programmer never has to use it directly. Instead, we can interface with it using generic navigation functions.
results in Just (Let ("x" := Const 42) (Add (EVar "x") (EVar "y")))

5.2 A generic Zipper
We now define a Zipper generically for a system of mutually recursive datatypes. We make the same steps as in the example for abstract syntax trees before. The type definitions for locations and context stacks stay essentially the same:
class Zipper ϕ f where ...
data Loc :: (∗ϕ → ∗) → ∗ϕ → ∗ where Loc :: (Family ϕ, Zipper ϕ (PF ϕ)) ⇒ ϕ ix → ix → Ctxs ϕ a ix → Loc ϕ a data Ctxs :: (∗ϕ → ∗) → ∗ϕ → ∗ϕ → ∗ where Nil :: Ctxs ϕ a a Cons :: ϕ ix → Ctx (PF ϕ) ix b → Ctxs ϕ a ix → Ctxs ϕ a b
We will fill this class with methods incrementally. Down To move down in a tree, we define a generic function first in our class Zipper: class Zipper ϕ f where ... first :: (∀b.ϕ b → b → Ctx f ix b → a) → f I∗ ix → Maybe a
Instead of a specific proof term AST ix, we now store a generic proof term ϕ ix for an arbitrary family in a location. Additionally, we need a Zipper for the system ϕ. This condition is expressed by Zipper ϕ (PF ϕ) and will be explained in more detail below. In the stack Ctxs, we also require that the types of the elements are in ϕ via the field ϕ ix.
5.2.2 Navigation
The navigation functions are again generically defined on the structure of the pattern functor. Thus, we define them in a class Zipper:
The function takes a functor f I∗ ix and tries to split off its first recursive component. This is of some type b where we know ϕ b. The rest is a context of type Ctx f ix b. The function takes a continuation parameter that describes what to do with the two parts. Function down is defined in terms of first:
5.2.1 Contexts
The context type is defined generically on the pattern functor of ϕ. We thus reuse the type family PF defined in Section 3. We have to distinguish between different type constructors that make up the pattern functor, and therefore define Ctx as a datatype family:
down :: Loc ϕ ix → Maybe (Loc ϕ ix) down (Loc p x cs) = first (λ p0 z c → Loc p0 z (Cons p c cs)) (from p x) We try to split the tree in focus x. If this succeeds, we get a new focus z and a new context frame c. We push c on the stack.
data family Ctx f :: ∗ϕ → ∗ϕ → ∗ Like the context stack, a context frame is parameterized over both the type of the resulting index and the type of the hole. The simple cases are for constant types, sums and products. There is a correspondence between the context of a datatype and its formal derivative (McBride 2001):
1 Currently, GHC does not allow instances of datatype families to be defined
as GADTs. In the actual implementation, we therefore simulate the GADT by including an explicit proof of type equality (Peyton Jones et al. 2006; Baars and Swierstra 2002).
from ⊥, is impossible. In the context of our Zipper library, we can guarantee that ⊥ is never produced for Void. We therefore define:
We define first by induction on the structure of pattern functors. Constant types constitute the leaves in the tree. We cannot descend, and return Nothing.
instance Zipper ϕ (K a) where ... fill p x (CK void) = impossible void impossible :: Void → a impossible void = error "impossible"
instance Zipper ϕ (K a) where ... first f (K a) = Nothing In a sum, we descend further, and add the corresponding context constructor CL or CR to the context. instance (Zipper ϕ f, Zipper ϕ g) ⇒ Zipper ϕ (f :+: g) where ... first f (L x) = first (λ p z c → f p z (CL c)) x first f (R y) = first (λ p z c → f p z (CR c)) y
The definition of fill is very straight-forward: for I, we return the element to plug itself; for (:.:) and (:+:), we call fill recursively. In the case for products, we recurse into the context: instance (Zipper ϕ f, Zipper ϕ g) ⇒ Zipper ϕ (f :×: g) where ... fill p x (C1 c y) = fill p x c :×: y fill p y (C2 x c) = x :×: fill p y c
We want to get to the first child. Therefore, we first try to descend to the left in a product. Only if that fails (mplus), we try to split the right component.
Right As a final example of a navigation function, we define right. We again employ the same scheme as before. We define a generic function next with the following type:
instance (Zipper ϕ f, Zipper ϕ g) ⇒ Zipper ϕ (f :×: g) where ... first f (x :×: y) = first (λ p z c → f p z (C1 c y)) x ‘mplus‘ first (λ p z c → f p z (C2 x c)) y
class Zipper ϕ f where ... next :: (∀b.ϕ b → b → Ctx f ix b → a) → (ϕ b → b → Ctx f ix b → Maybe a)
In the I case, we have exactly one possibility. We split I (I∗ x) into x and the context CId and pass the two parts to the continuation f: instance El ϕ xi ⇒ Zipper ϕ (I xi) where ... first f (I (I∗ x)) = return (f proof x CId)
The function takes a context frame and an element that fits into the context. By looking at the context, it tries to move the focus one element to the right, thereby producing a new element – possibly of different type – and a new compatible context. These can, as in first, be combined using the passed continuation. With next, we can define right:
It is interesting to see why this is type correct: the type of x is xi, so applying f to x instantiates b to xi and forces the final argument of f to be of type Ctx (I xi) ix xi. But that is exactly the type of CId. Finally, for a tag, we also descend further and apply CTag to the context. instance Zipper ϕ f ⇒ Zipper ϕ (f :.: xi) where ... first f (Tag x) = first (λ p z c → f p z (CTag c)) x
right :: Loc ϕ ix → Maybe (Loc ϕ ix) right (Loc p x Nil) = Nothing right (Loc p x (Cons p0 c cs)) = next (λ p z c0 → Loc p z (Cons p0 c0 cs)) p x c We cannot move right in the root of the tree, thus right fails in an empty context. Otherwise, we only need to look at the topmost context frame, and pass it to next, together with the current focus. On success, we take the new focus, and push the new context frame back on the stack. Most cases of next are without surprises: calling next for K is again impossible; in sums and on tags we recurse. Since an I indicates a single child – a leaf in the tree – we cannot move right from there and return Nothing. The most interesting case is the case for products. If we are currently in the first component, we try to move to the next element there, but if this fails, we have to select the first child of the second component, calling first. In that case, we also have to plug the old focus x back into its context c, using fill. If, however, we are already in the right component, we do not need a case distinction and just try to move further to the right using next.
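A sketch of those routine cases (our rendering; only the product case is spelled out in the paper):
instance Zipper ϕ (K a) where ...
  next f p x (CK void) = impossible void
instance El ϕ xi ⇒ Zipper ϕ (I xi) where ...
  next f p x CId = Nothing
instance (Zipper ϕ f, Zipper ϕ g) ⇒ Zipper ϕ (f :+: g) where ...
  next f p x (CL c) = next (λ p0 z c0 → f p0 z (CL c0)) p x c
  next f p x (CR c) = next (λ p0 z c0 → f p0 z (CR c0)) p x c
instance Zipper ϕ f ⇒ Zipper ϕ (f :.: xi) where ...
  next f p x (CTag c) = next (λ p0 z c0 → f p0 z (CTag c0)) p x c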
This is type correct because Tag introduces the refinement that CTag requires: applying CTag to c results in Ctx (f :.: xi) xi b. This can be passed to f only if ix from the type of first is equal to xi. But it is, because the pattern match on Tag forces it to be. Up Now that we can move down, we also want to move up again. We employ the same scheme as before: using an inductively defined generic helper function fill, we then define up. The function fill has the following type: class Zipper ϕ f where ... fill :: ϕ b → b → Ctx f ix b → f I∗ ix The function takes a value together with a compatible context frame and plugs them together, producing a value of the pattern functor. This operation is total, so no Maybe is required in the result. With fill, we can define up as follows:
instance (Zipper ϕ f, Zipper ϕ g) ⇒ Zipper ϕ (f :×: g) where ... next f p x (C1 c y) = next (λ p0 z c0 → f p0 z (C1 c0 y )) p x c ‘mplus‘ first (λ p0 z c0 → f p0 z (C2 (fill p x c) c0 )) y next f p y (C2 x c) = next (λ p0 z c0 → f p0 z (C2 x c0 )) p y c
up :: Loc ϕ ix → Maybe (Loc ϕ ix) up (Loc p x Nil) = Nothing up (Loc p x (Cons p0 c cs)) = Just (Loc p0 (to p0 (fill p x c)) cs) We cannot move up in the root of the tree and thus fail on an empty context stack. Otherwise, we pick the topmost context frame, and call fill. Since fill results in a value of the pattern functor, we have to convert back into the original form using to. We start the definition of fill with the case for K. As an argument to fill, we need a context for K, for which we defined but one constructor CK with a Void parameter. In other words, in order to call fill on K, we have to produce a value of Void, which, apart
5.2.3 Using the Zipper
The functions enter, leave and update can be converted from the specific case for AST almost without change. The code is exactly as before, we only have to adapt the types.
enter :: (Family ϕ, Zipper ϕ (PF ϕ)) ⇒ ϕ ix → ix → Loc ϕ ix enter p x = Loc p x Nil leave :: Loc ϕ ix → ix leave (Loc p x Nil) = x leave loc = leave (fromJust (up loc)) update :: (∀ix.ϕ ix → ix → ix) → Loc ϕ ix → Loc ϕ ix update f (Loc p x cs) = Loc p (f p x) cs
type Scheme ϕ = HFix (K String :+: PF ϕ) As in the regular case, the pattern functor is extended with a metavariable representation. We want meta-variable representations to be polymorphic, so, unlike other constructors, K String is not tagged with (:.:). Now, the same representation can be used to encode meta-variables that match, for example, Expr, Decl and Var. Dealing with multiple datatypes affects the types of substitutions. We cannot use a homogeneous list of mappings as we did earlier, because different meta-variables may map to different datatypes. We get around this difficulty by existentially quantifying over the type of the matched datatype:
Let us repeat the example from before, but now use the generic Zipper: apart from the additional argument to enter, nothing changes enter Expr >>> down >=> down >=> right >=> update solve >>> leave >>> return $ example
data DynIx ϕ = ∀ix.DynIx (ϕ ix) ix type Subst ϕ = [(String, DynIx ϕ)]
and the result is also the same: Just (Let ("x" := Const 42) (Add (EVar "x") (EVar "y")))
6. Generic rewriting
Generic matching is defined as follows:
Term rewriting can be specified generically, for arbitrary regular datatypes, if these are viewed as fixed points of functors (Jansson and Jeuring 2000; Van Noort et al. 2008). In the following we show how to generalize term rewriting even further, to work on families with an arbitrary number of datatypes. For reasons of space, we do not discuss generic rewriting in complete detail, but focus on the operation of matching the left-hand side of a rule with a term.
type MatchM s a = StateT (Subst s) Maybe a
matchM :: (Family ϕ, HZip ϕ (PF ϕ)) ⇒ ϕ ix → Scheme ϕ ix → I∗ ix → MatchM ϕ ()
matchM p (HIn (L (K metavar))) (I∗ e) = do
  subst ← get
  case lookup metavar subst of
    Nothing → put ((metavar, DynIx p e) : subst)
    Just _  → fail ("repeated use: " ++ metavar)
matchM p (HIn (R r)) (I∗ e) = combine matchM r (from p e)
6.1 Schemes of regular datatypes
Before tackling matching on families of mutually recursive datatypes, we briefly sketch the ideas behind its implementation on regular datatypes. Consider how to implement matching for the simple version of the Expr datatype introduced in Section 2. First, we define expression schemes, which extend expressions with a constructor for rule meta-variables. Then we define matching of those schemes against expressions:
Generic matching tries to match a term of type I∗ ix against a scheme of corresponding type Scheme ϕ ix. The resulting information is returned in the MatchM monad. The definition of MatchM uses Maybe for indicating possible failure, and on top of that monad we use the state transformer StateT. The state monad is used to thread the substitution as we traverse the scheme and the term in parallel. The class HZip, which contains functionality for zipping, is introduced in the following subsection. Generic matching consists of two cases. When dealing with a meta-variable, we first check that there is no previous mapping for it. (For the sake of brevity, we do not show how to deal with multiple occurrences of a meta-variable.) If that is the case, we update the state with the new mapping. The second case deals with matching constructors against constructors. More specifically, this corresponds to matching Mul (Const 6) (Const 9) against MulS (MetaVar "x") (MetaVar "y"). This is handled by the generic function combine, which matches the two pattern functor representations. If the representations match (as in our example), then matchM is applied to the recursive occurrences (for instance, on MetaVar "x" and Const 6, and MetaVar "y" and Const 9). Now we can write the following wrapper on matchM to hide the use of the state monad that threads the substitution: match :: (Family ϕ, HZip ϕ (PF ϕ)) ⇒ ϕ ix → Scheme ϕ ix → ix → Maybe (Subst ϕ) match p scheme tm = execStateT (matchM p scheme (I∗ tm)) [ ]
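To connect this to the example of Section 6.1 (mulScheme and matchExample are our own names, and the constructor nesting follows the structural PFAST of Section 4.3), the left-hand side MulS (MetaVar "x") (MetaVar "y") and the match call can be rendered roughly as
mulScheme :: Scheme AST Expr
mulScheme = HIn (R (R (R (L (Tag (I (HIn (L (K "x"))) :×: I (HIn (L (K "y")))))))))
matchExample :: Maybe (Subst AST)
matchExample = match Expr mulScheme (Mul (Const 6) (Const 9))
which succeeds, binding "x" to Const 6 and "y" to Const 9 (each wrapped in DynIx Expr).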
data ExprS = MetaVar String | ConstS Int | AddS ExprS ExprS | MulS ExprS ExprS
match :: ExprS → Expr → Maybe [(String, Expr)]
On success, match returns a substitution mapping meta-variables to matched subterms. For example, the call match (MulS (MetaVar "x") (MetaVar "y")) (Mul (Const 6) (Const 9)) yields Just [("x", Const 6), ("y", Const 9)]. To implement match generically, we need to define the scheme of a datatype generically. To this end, recall that a regular datatype is isomorphic to the type Fix f, for a suitably defined f. A meta-variable can appear deep inside a scheme; this suggests that the extension with MetaVar should take place inside the recursion, and hence on f. This motivates the following definition for schemes of regular datatypes:
type Scheme a = Fix (K String :+: PF a)
For example, the expression scheme that is used above as the first argument to match can be represented by
In (R (R (R (I (In (L (K "x"))) :×: I (In (L (K "y")))))))
6.3 Generic matching
6.4 Generic zip and combine
The generic function combine is defined in terms of another function, which is a generalization of zipWith for arbitrary functors. Like hmap, the function hzipM is defined by induction on the pattern functor by means of a type class:
6.2 Schemes of a datatype family and substitutions
A family of mutually recursive datatypes requires as many sorts of meta-variables as there are datatypes. For example, for the family used in Section 3, we need three meta-variables, ranging over Expr, Decl and Var, respectively. Fortunately, we can deal with all these meta-variables in one go:
class HZip ϕ f where hzipM :: Monad m ⇒ (∀ix.ϕ ix → r ix → r0 ix → m (r00 ix)) → f r ix → f r0 ix → m (f r00 ix)
The function hzipM takes an argument that combines the r and r0 structures stored in the pattern functor. The traversal is performed in a monad to notify failure when the functor arguments do not match, and to allow the argument to use state, for example. In the case of combine, we are not interested in the resulting merged structure (r00 ix). Indeed, matchM stores information only in the state monad, so we define combine to ignore the result.
recursive datatypes in many dependently typed programming languages. Benke et al. (2003) give a formal construction for mutually recursive datatypes as indexed inductive definitions in Alfa. Some similarities with our work are that the pattern functor argument is indexed by the datatype sort, and recursive positions specify the sort index of the subtree. Altenkirch and McBride (2003) show how to do generic programming in the dependently typed programming language OLEG. We believe that it is easier to write generic programs on mutually recursive datatypes in our approach, since we do not have to deal with kind-indexed definitions, environments, type applications, datatype variables and argument variables, in addition to the cases for sums, products and constants. McBride (2001) first described a generic Zipper on regular datatypes, which was implemented in Epigram by Morris et al. (2006). The Zipper has been used as an example of a type-indexed datatype in Generic Haskell (Hinze et al. 2004), but again only for regular datatypes. The dissection operator introduced by McBride (2008) is also only defined for regular datatypes, although McBride remarks that an implementation in a dependently typed programming language for mutually recursive datatypes is possible. In most other generic programming approaches, recursion is not explicitly represented and types occurring at such positions cannot be changed. Such approaches cannot be used to define the examples in this paper. For example SYB (Lämmel and Peyton Jones 2003) and EMGM (Oliveira et al. 2006) cannot be used to define fold. In fold, the algebra function is applied to constructor fields where the recursive positions have been replaced by the carrier of the algebra. However, in these approaches, values at recursive positions cannot be transformed into values of other types. PolyLib (Norell and Jansson 2004) supports the definition of fold, but it is limited to regular datatypes. A similar argument can be made for type-indexed types. Generic approaches that support type-indexed types (Oliveira and Gibbons 2005) need to represent recursion explicitly in order to define the Zipper and generic rewriting.
data K∗ a b = K∗ {unK∗ :: a}
combine :: (Monad m, HZip ϕ f) ⇒ (∀ix.ϕ ix → r ix → r0 ix → m ()) → f r ix → f r0 ix → m ()
combine f x y = do
    hzipM wrapf x y
    return ()
  where
    wrapf p x y = do
      f p x y
      return (K∗ ())
In the above, K∗ is used to ignore the type ix in the result. The definition of hzipM does not differ much from that used when dealing with a single regular datatype:
instance El ϕ xi ⇒ HZip ϕ (I xi) where
  hzipM f (I x) (I y) = liftM I (f proof x y)
instance (HZip ϕ a, HZip ϕ b) ⇒ HZip ϕ (a :×: b) where
  hzipM f (x1 :×: x2) (y1 :×: y2) = liftM2 (:×:) (hzipM f x1 y1) (hzipM f x2 y2)
instance (HZip ϕ a, HZip ϕ b) ⇒ HZip ϕ (a :+: b) where
  hzipM f (L x) (L y) = liftM L (hzipM f x y)
  hzipM f (R x) (R y) = liftM R (hzipM f x y)
  hzipM f _ _ = fail "zip failed in :+:"
instance HZip ϕ f ⇒ HZip ϕ (f :.: ix) where
  hzipM f (Tag x) (Tag y) = liftM Tag (hzipM f x y)
instance Eq a ⇒ HZip ϕ (K a) where
  hzipM f (K x) (K y)
    | x ≡ y = return (K x)
    | otherwise = fail "zip failed in K"
8. Conclusions
Until now, many powerful generic algorithms were known, but their adoption in practice has been hindered by their restriction to regular datatypes. In this paper, we have shown that we can overcome this restriction in a way that is directly applicable in practice: using recent extensions of Haskell, we can define generic programs that exploit the recursive structure of datatypes on families of arbitrarily many mutually recursive datatypes. For instance, extensive use of generic programming becomes finally feasible for compilers, which are often based on an abstract syntax that consists of many mutually recursive datatypes. Furthermore, our approach is noninvasive: the definitions of large families of datatypes need not be modified in order to use generic programming. Additionally, we have demonstrated our approach by implementing several recursion schemes such as compos and fold, the Zipper, and rewriting functionality. Based on this paper, the libraries multirec and zipper have been developed. They are available for download from HackageDB. A version of the rewriting library based on multirec will be released soon. The multirec library contains Template Haskell code to automatically generate the boilerplate code for a family of datatypes. Also, in addition to the functionality shown here, the library offers access to the names of constructors. Furthermore, in this paper we have focused on functions that consume or transform values rather than functions that produce values. This, however, is no limitation, and we have for instance implemented a function that generates values of a particular type according to certain criteria using multirec. Initial measurements indicate that our approach is in the slower half of generic programming libraries for Haskell. On the other
In the definition above, we use liftM and liftM2 to turn the pure structure constructors into monadic functions.
7. Related work
Malcolm (1990) shows how to define two mutually recursive types as initial objects of functor-algebras. Swierstra et al. (1999) show how to implement fixed points for mutually recursive datatypes in Haskell. They introduce a new fixed point for every arity of mutually recursive datatypes. None of these approaches can be used as a basis for an implementation of fixed points for mutually recursive datatypes in Haskell suitable for implementing generic programs. Higher-order fixed points like our HFix have been used by Bird and Paterson (1999) and Johann and Ghani (2007) to model folds on nested datatypes. Several authors discuss how to generate folds and other recursive schemes on mutually recursive datatypes (Böhm and Berarducci 1985; Sheard and Fegaras 1993; Swierstra et al. 1999; Lämmel et al. 2000). Again, the definitions in these papers cannot be directly generalised to families of arbitrarily many datatypes in Haskell. Mitchell and Runciman (2007) show how to obtain traversals for mutually recursive datatypes using the class Biplate. However, the type on which an action is performed remains fixed during a traversal. In contrast, the recursion schemes from Section 4.4 can apply their function arguments to subtrees of different types. Since dependently typed programming languages have a much more powerful type system than Haskell extended with GADTs and type families, it is possible to define fixed-points for mutually
hand, some benchmarked functions expose significant optimisation opportunities to GHC. In one such example, GHC optimisations removed the intermediate use of generic values leaving code that is similar to what one would write manually. As a result, this example runs only 1.7 times slower than handwritten code (compared to 68.7 for the slowest library). More details can be found in the thesis of Rodriguez (2009). In its current form, our approach cannot be used directly on parameterized types and does not support functor composition. We have, however, prototypical code that demonstrates that our approach can be extended to support these concepts without too much difficulty, and we plan to integrate this functionality into the library in the near future. We plan to study the application of our representation using (:.:) to arbitrary GADTs, hopefully giving us fold and other generic operations on GADTs, as in the work of Johann and Ghani (2008). In parallel to the Haskell version, we have also experimented with an Agda (Norell 2007) version of our library, using dependent types. The Agda version has proved to be invaluable in thinking about the development without having to worry about Haskell limitations at the same time. As to Haskell, we hope that the support for type families, which we rely on very much, will continue to stabilize in the future, and that perhaps the kind system will be slightly improved with possibilities to encode kinds such as our ∗ϕ , or with the possibility to define kind synonyms.
Acknowledgements José Pedro Magalhães and Marcos Viera commented on a previous version of this paper. Claus Reinke suggested to us the “type families desugaring trick” to get around a problem with the type checker in an older version of GHC. The anonymous reviewers have provided several useful suggestions for improving the presentation. This research has been partially funded by the Netherlands Organisation for Scientific Research (NWO), through its projects on “Real-life Datatype-Generic Programming” (612.063.613) and “Scriptable Compilers” (612.063.406).

References

T. Altenkirch and C. McBride. Generic programming within dependently typed programming. In Generic Programming, pages 1–20. Kluwer, 2003.
A. Baars and D. Swierstra. Typing dynamic typing. In ICFP’02, pages 157–166, 2002.
M. Benke, P. Dybjer, and P. Jansson. Universes for generic programs and proofs in dependent type theory. Nordic J. of Comp., 10(4):265–289, 2003.
R. Bird and R. Paterson. Generalised folds for nested datatypes. Formal Aspects of Computing, 11:11–2, 1999.
C. Böhm and A. Berarducci. Automatic synthesis of typed Λ-programs on term algebras. Theoretical Computer Science, 39:135–154, 1985.
B. Bringert and A. Ranta. A pattern for almost compositional functions. In ICFP’06, pages 216–226, 2006.
J. Cheney and R. Hinze. A lightweight implementation of generics and dynamics. In ACM SIGPLAN Haskell Workshop, 2002.
J. Gibbons. Generic downwards accumulations. SCP, 37(1–3):37–65, 2000.
R. Hinze. A new approach to generic functional programming. In POPL’00, pages 119–132, 2000a.
R. Hinze. Polytypic values possess polykinded types. In MPC’00, volume 1837 of LNCS, pages 2–27. Springer, 2000b.
R. Hinze. Generics for the masses. In ICFP’04, pages 236–243, 2004.
R. Hinze, J. Jeuring, and A. Löh. Type-indexed data types. SCP, 51(2):117–151, 2004.
S. Holdermans, J. Jeuring, A. Löh, and A. Rodriguez. Generic views on data types. In MPC’06, volume 4014 of LNCS, pages 209–234. Springer, 2006.
G. Huet. The Zipper. JFP, 7(5):549–554, 1997.
P. Jansson and J. Jeuring. PolyP — a polytypic programming language extension. In POPL’97, pages 470–482, 1997.
P. Jansson and J. Jeuring. Polytypic unification. JFP, 8(5):527–536, 1998.
P. Jansson and J. Jeuring. A framework for polytypic programming on terms, with an application to rewriting. In WGP’00, 2000.
J. Jeuring. Polytypic pattern matching. In FPCA’95, pages 238–248, 1995.
P. Johann and N. Ghani. Initial algebra semantics is enough! In Proceedings, Typed Lambda Calculus and Applications, pages 207–222, 2007.
P. Johann and N. Ghani. Foundations for structured programming with GADTs. In POPL’08, pages 297–308, 2008.
R. Lämmel and S. Peyton Jones. Scrap your boilerplate: a practical design pattern for generic programming. Pages 26–37. ACM Press, 2003.
R. Lämmel, J. Visser, and J. Kort. Dealing with large bananas. In WGP’00, 2000.
A. Löh. Exploring Generic Haskell. PhD thesis, Utrecht University, 2004.
G. Malcolm. Data structures and program transformation. SCP, 14:255–279, 1990.
C. McBride. The derivative of a regular type is its type of one-hole contexts. strictlypositive.org/diff.pdf, 2001.
C. McBride. Clowns to the left of me, jokers to the right (pearl): dissecting data structures. In POPL’08, pages 287–295, 2008.
E. Meijer, M. Fokkinga, and R. Paterson. Functional programming with bananas, lenses, envelopes, and barbed wire. In FPCA’91, volume 523 of LNCS, pages 124–144. Springer, 1991.
N. Mitchell and C. Runciman. Uniform boilerplate and list processing. In ACM SIGPLAN Haskell Workshop, 2007.
P. Morris, T. Altenkirch, and C. McBride. Exploring the regular tree types. In Types for Proofs and Programs, LNCS. Springer, 2006.
T. van Noort, A. Rodriguez, S. Holdermans, J. Jeuring, and B. Heeren. A lightweight approach to datatype-generic rewriting. In WGP’08, 2008.
U. Norell. Towards a practical programming language based on dependent type theory. PhD thesis, Chalmers University of Technology, 2007.
U. Norell and P. Jansson. Polytypic programming in Haskell. In IFL’03, volume 3145 of LNCS, pages 168–184. Springer, 2004.
B. C. d. S. Oliveira and J. Gibbons. TypeCase: a design pattern for type-indexed functions. In ACM SIGPLAN Haskell Workshop, 2005.
B. C. d. S. Oliveira, R. Hinze, and A. Löh. Extensible and modular generics for the masses. In H. Nilsson, editor, TFP’06, pages 199–216, 2006.
S. Peyton Jones, editor. Haskell 98 Language and Libraries: The Revised Report. Cambridge University Press, Cambridge, 2003.
S. Peyton Jones, D. Vytiniotis, S. Weirich, and G. Washburn. Simple unification-based type inference for GADTs. In ICFP’06, pages 50–61, 2006.
A. Rodriguez. Towards Getting Generic Programming Ready for Prime Time. PhD thesis, Utrecht University, 2009.
T. Schrijvers, S. Peyton Jones, M. Chakravarty, and M. Sulzmann. Type checking with open type functions. In ICFP’08, pages 51–62, 2008.
T. Sheard and L. Fegaras. A fold for all seasons. In FPCA’93, pages 233–242, 1993.
T. Sheard and S. Peyton Jones. Template meta-programming in Haskell. In ACM SIGPLAN Haskell Workshop, 2002.
D. Swierstra, P. Azero, and J. Saraiva. Designing and implementing combinator languages. In AFP, volume 1608 of LNCS, pages 150–206. Springer, 1999.
Attribute Grammars Fly First-Class
How to do Aspect Oriented Programming in Haskell

Marcos Viera
Instituto de Computación, Universidad de la República, Montevideo, Uruguay
[email protected]

S. Doaitse Swierstra
Department of Computer Science, Utrecht University, Utrecht, The Netherlands
[email protected]

Wouter Swierstra
Chalmers University of Technology, Göteborg, Sweden
[email protected]
Abstract

Attribute Grammars (AGs), a general-purpose formalism for describing recursive computations over data types, avoid the trade-off which arises when building software incrementally: should it be easy to add new data types and data type alternatives or to add new operations on existing data types? However, AGs are usually implemented as a pre-processor, leaving e.g. type checking to later processing phases and making interactive development, proper error reporting and debugging difficult. Embedding AG into Haskell as a combinator library solves these problems. Previous attempts at embedding AGs as a domain-specific language were based on extensible records and thus exploited Haskell’s type system to check the well-formedness of the AG, but fell short in compactness and in the possibility to abstract over oft-occurring AG patterns. Other attempts used a very generic mapping for which the AG well-formedness could not be statically checked. We present a typed embedding of AG in Haskell satisfying all these requirements. The key lies in using HList-like typed heterogeneous collections (extensible polymorphic records) and expressing AG well-formedness conditions as type-level predicates (i.e., type-class constraints). By further type-level programming we can also express common programming patterns, corresponding to the typical use cases of monads such as Reader, Writer and State. The paper presents a realistic example of type-class-based type-level programming in Haskell.

Categories and Subject Descriptors D.3.3 [Programming languages]: Language Constructs and Features; D.1.1 [Programming techniques]: Applicative (Functional) Programming

General Terms Design, Languages, Performance, Standardization

Keywords Attribute Grammars, Class system, Lazy evaluation, Type-level programming, Haskell, HList

1. Introduction

Functional programs can be easily extended by defining extra functions. If however a data type is extended with a new alternative, each parameter position and each case expression where a value of this type is matched has to be inspected and modified accordingly. In object-oriented programming the situation is reversed: if we implement the alternatives of a data type by sub-classing, it is easy to add a new alternative by defining a new subclass in which we define a method for each part of the desired global functionality. If however we want to define a new function for a data type, we have to inspect all the existing subclasses and add a method describing the local contribution to the global computation over this data type. This problem was first noted by Reynolds (Reynolds 1975) and later referred to as “the expression problem” by Wadler (Wadler 1998). We start out by showing how the use of AGs overcomes this problem. As running example we use the classic repmin function (Bird 1984); it takes a tree argument, and returns a tree of similar shape, in which the leaf values are replaced by the minimal value of the leaves in the original tree (see Figure 1). The program was originally introduced to describe so-called circular programs, i.e. programs in which part of a result of a function is again used as one of its arguments. We will use this example to show that the computation is composed of three so-called aspects: the computation of the minimal value as the first component of the result of sem Tree (asp smin), passing down the globally minimal value from the root to the leaves as the parameter ival (asp ival), and the construction of the resulting tree as the second component of the result (asp sres).

Now suppose we want to change the function repmin into a function repavg which replaces the leaves by the average value of the leaves. Unfortunately we have to change almost every line of the program, because instead of computing the minimal value we have to compute both the sum of the leaf values and the total number of leaves. At the root level we can then divide the total sum by the total number of leaves to compute the average leaf value. However, the traversal of the tree, the passing of the value to be used in constructing the new leaves and the construction of the new tree all remain unchanged. What we are now looking for is a way to define the function repmin as:

repmin = sem Root (asp smin ⊕ asp ival ⊕ asp sres)

so we can easily replace the aspect asp smin by asp savg:

repavg = sem Root (asp savg ⊕ asp ival ⊕ asp sres)

In Figure 2 we have expressed the solution of the repmin problem in terms of a domain specific language, i.e., as an attribute grammar (Swierstra et al. 1999). Attributes are values associated with tree nodes. We will refer to a collection of (one or more) related attributes, with their defining rules, as an aspect. After defining the underlying data types by a few DATA definitions, we define the different aspects: for the two “result” aspects we
data Root = Root Tree
data Tree = Node Tree Tree | Leaf Int

repmin = sem Root
sem Root (Root tree) = let (smin, sres) = (sem Tree tree) smin
                       in sres
sem Tree (Node l r) = λival → let (lmin, lres) = (sem Tree l) ival
                                  (rmin, rres) = (sem Tree r) ival
                              in (lmin ‘min‘ rmin, Node lres rres)
sem Tree (Leaf i)   = λival → (i, Leaf ival)

Figure 1. repmin replaces leaf values by their minimal value (the accompanying picture, showing an example tree with leaf values 4, 5, 2, 8, 1, 3 and 6 mapped to a tree of the same shape in which every leaf holds 1, is not reproduced here)

DATA Root | Root tree
DATA Tree | Node l, r : Tree
          | Leaf i : {Int}

SYN Tree [smin : Int]
SEM Tree | Leaf lhs.smin = @i
         | Node lhs.smin = @l.smin ‘min‘ @r.smin

INH Tree [ival : Int]
SEM Root | Root tree.ival = @tree.smin
SEM Tree | Node l.ival = @lhs.ival
                r.ival = @lhs.ival

SYN Root Tree [sres : Tree]
SEM Root | Root lhs.sres = @tree.sres
SEM Tree | Leaf lhs.sres = Leaf @lhs.ival
         | Node lhs.sres = Node @l.sres @r.sres

Figure 2. AG specification of repmin
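To make the behaviour of Figure 1 concrete, here is a small evaluation sketch; the example tree is our own, and it assumes the handwritten definitions of Figure 1 are in scope:

example = Root (Node (Node (Leaf 4) (Leaf 1)) (Leaf 8))
-- Thanks to lazy evaluation of the circular definition in sem Root,
-- repmin example evaluates to: Node (Node (Leaf 1) (Leaf 1)) (Leaf 1)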
introduce synthesized attributes (SYN smin and SYN sres), and for the “parameter” aspect we introduce an inherited attribute (INH ival). Note that attributes are introduced separately, and that for each attribute/alternative pair we have a separate piece of code describing what to compute in a SEM rule; the defining expressions at the right hand side of the =-signs are all written in Haskell, using minimal syntactic extensions to refer to attribute values (the identifiers starting with a @). These expressions are copied directly into the generated program: only the attribute references are replaced by references to values defined in the generated program. The attribute grammar system only checks whether for all attributes a definition has been given. Type checking of the defining expressions is left to the Haskell compiler when compiling the generated program (given in Figure 1). As a consequence type errors are reported in terms of the generated program. Although this works reasonably well in practice, the question arises whether we can define a set of combinators which enables us to embed the AG formalism directly in Haskell, thus making the separate generation step unnecessary and immediately profiting from Haskell’s type checker and getting error messages referring to the original source code. A first approach to such an embedded attribute grammar notation was made by de Moor et al. (de Moor et al. 2000b). Unfortunately this approach, which is based on extensible records (Gaster and Jones 1996), necessitates the introduction of a large set of combinators, which encode the positions of children-trees explicitly. Furthermore the combinators are indexed by a number which indicates the number of children a node has where the combinator is to be applied. The first contribution of this paper is that we show how to overcome this shortcoming by making use of the Haskell class system. The second contribution is that we show how to express the previous solution in terms of heterogeneous collections, thus avoiding the use of Hugs-style extensible records, which are not supported by the main Haskell compilers.

Attribute grammars exhibit typical design patterns; an example of such a pattern is the inherited attribute ival, which is distributed to all the children of a node, and so on recursively. Other examples are attributes which thread a value through the tree, or collect information from all the children which have a specific attribute and combine this into a synthesized attribute of the father node. In normal Haskell programming this would be done by introducing a collection of monads (the Reader, State and Writer monad respectively), and by using monad transformers to combine these into a single monadic computation. Unfortunately this approach breaks down once too many attributes have to be dealt with, when the data flows backwards, and especially if we have a non-uniform grammar, i.e., a grammar which has several different non-terminals each with a different collection of attributes. In the latter case a single monad will no longer be sufficient. One way of making such computational patterns first-class is by going to a universal representation for all the attributes, and packing and unpacking them whenever we need to perform a computation. In this way all attributes have the same type at the attribute grammar level, and non-terminals can now be seen as functions which map dictionaries to dictionaries, where such dictionaries are tables mapping Strings representing attribute names to universal attribute values (de Moor et al. 2000a). Although this provides us with a powerful mechanism for describing attribute flows by Haskell functions, this comes at a huge price; all attributes have to be unpacked before the contents can be accessed, and repacked before they can be passed on. Worse still, the check that verifies that all attributes are completely defined is no longer a static check, but rather something which is implicitly done at run-time by the evaluator, as a side-effect of looking up attributes in the dictionaries. The third contribution of this paper is that we show how patterns corresponding to the mentioned monadic constructs can be described, again using the Haskell class mechanism. The fourth contribution of this paper is that it presents yet another large example of how to do type-level programming in Haskell, and what can be achieved with it. In our conclusions we will come back to this. Before going into the technical details we want to give an impression of what our embedded Domain Specific Language (DSL)
data Root = Root {tree :: Tree} deriving Show
data Tree = Node {l :: Tree, r :: Tree}
          | Leaf {i :: Int} deriving Show

$ (deriveAG ’’Root)
$ (attLabels ["smin", "ival", "sres"])
Figure 4. The DDG for Node

semantic functions, it is statically checked for correctness at the attribute grammar level, and high-level attribute evaluation patterns can be described. In Section 2 we introduce the heterogeneous collections, which are used to combine a collection of inherited or synthesised attributes into a single value. In Section 3 we show how individual attribute grammar rules are represented. In Section 4 we introduce the aforementioned ⊕ operator which combines the aspects. In Section 5 we introduce a function knit which takes the DDG associated with the production used at the root of a tree and the mappings (sem ... functions) from inherited to synthesised attributes for its children (i.e. the data flow over the children trees) and out of this constructs a data flow computation over the combined tree. In Section 6 we show how the common patterns can be encoded in our library, and in Section 7 we show how default aspects can be defined. In Section 8 we discuss related work, and in Section 9 we conclude.
asp smin = synthesize smin at {Tree}
             use min 0 at {Node}
             define at Leaf = i
asp ival = inherit ival at {Tree}
             copy at {Node}
             define at Root.tree = tree.smin
asp sres = synthesize sres at {Root, Tree}
             use Node (Leaf 0) at {Node}
             define at Root = tree.sres
                       Leaf = Leaf lhs.ival

asp repmin = asp smin ⊕ asp sres ⊕ asp ival
repmin t   = select sres from compute asp repmin t

Figure 3. repmin in our embedded DSL
looks like. In Figure 3 we give our definition of the repmin problem in a lightly sugared notation. To completely implement the repmin function the user of our library1 needs to undertake the following steps (Figure 3):

2. HList
The library HList (Kiselyov et al. 2004) implements typeful heterogeneous collections (lists, records, ...), using techniques for dependently typed programming in Haskell (Hallgren 2001; McBride 2002) which in turn make use of Haskell 98 extensions for multiparameter classes (Peyton Jones et al. 1997) and functional dependencies (Jones 2000). The idea of type-level programming is based on the use of types to represent type-level values, and classes to represent type-level types and functions. In order to be self-contained we start out with a small introduction. To represent Boolean values at the type level we define a new type for each of the Boolean values. The class HBool represents the type-level type of Booleans. We may read the instance definitions as “the type-level values HTrue and HFalse have the type-level type HBool ”:
• define the Haskell data types involved; • optionally, generate some boiler-plate code using calls to Tem-
plate Haskell; • define the aspects, by specifying whether the attribute is inher-
ited or synthesized, with which non-terminals it is associated, how to compute its value if no explicit definition is given (i.e., which computational pattern it follows), and providing definitions for the attribute at the various data type constructors (productions in grammar terms) for which it needs to be defined, resulting in asp repmin;
class HBool x data HTrue ; hTrue = ⊥ :: HTrue data HFalse; hFalse = ⊥ :: HFalse instance HBool HTrue instance HBool HFalse
• composing the aspects into a single large aspect asp repmin • define the function repmin that takes a tree, executes the se-
mantic function for the tree and the aspect asp repmin, and selects the synthesized attribute sres from the result. Together these rules define for each of the productions a socalled Data Dependency Graph (DDG). A DDG is basically a dataflow graph (Figure 4), with as incoming values has the inherited attributes of the father node and the synthesized attributes of the children nodes (indicated by closed arrows), and as outputs the inherited attributes of the children nodes and the synthesized attributes of the father node (open arrows). The semantics of our DSL is defined as the data-flow graph which results from composing all the DDGs corresponding to the individual nodes of the abstract syntax tree. Note that the semantics of a tree is thus represented by a function which maps the inherited attributes of the root node onto its synthesized attributes. The main result of this paper is a combinator based implementation of attribute grammars in Haskell; it has statically type checked 1 Available
l
p
Since we are only interested in type-level computation, we defined HTrue and HFalse as empty types. By defining an inhabitant for each value we can, by writing expressions at the value level, construct values at the type-level by referring to the types of such expressions. Multi-parameter classes can be used to describe type-level relations, whereas functional dependencies restrict such relations to functions. As an example we define the class HOr for type-level disjunction: class (HBool t, HBool t 0 , HBool t 00 ) ⇒ HOr t t 0 t 00 | t t 0 → t 00 where hOr :: t → t 0 → t 00 The context (HBool t, HBool t 0 , HBool t 00 ) expresses that the types t, t 0 and t 00 have to be type-level values of the type-level
as AspectAG in Hackage.
type HBool. The functional dependency t t 0 → t 00 expresses that the parameters t and t 0 uniquely determine the parameter t 00. This implies that once t and t 0 are instantiated, the instance of t 00 must be uniquely inferable by the type-system, and that thus we are defining a type-level function from t and t 0 to t 00. The type-level function itself is defined by the following non-overlapping instance declarations:

instance HOr HFalse HFalse HFalse where hOr _ _ = hFalse
instance HOr HTrue  HFalse HTrue  where hOr _ _ = hTrue
instance HOr HFalse HTrue  HTrue  where hOr _ _ = hTrue
instance HOr HTrue  HTrue  HTrue  where hOr _ _ = hTrue
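The same recipe extends to other type-level functions. As a small illustration (our own example in the same style, not part of HList or of the library described here), type-level negation can be written as:

class (HBool t, HBool t 0) ⇒ HNot t t 0 | t → t 0 where
  hNot :: t → t 0
instance HNot HTrue  HFalse where hNot _ = hFalse
instance HNot HFalse HTrue  where hNot _ = hTrue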
ed; here we have expressed that the second parameter should be a list again. In the next subsection we will see how to make use of this facility.
In our code we will make heavy use of non-homogeneous collections: grammars are a collection of productions, and nodes have a collection of attributes and a collection of children nodes. Such collections, which can be extended and shrunk, map typed labels to values and are modeled by an HList containing a heterogeneous list of fields, marked with the data type Record . We will refer to them as records from now on:
newtype Record r = Record r An empty record is a Record containing an empty heterogeneous list:
emptyRecord :: Record HNil emptyRecord = Record HNil
If we write (hOr hTrue hFalse), we know that t and t 0 are HTrue and HFalse, respectively. So, the second instance is chosen to select hOr from, and thus t 00 is inferred to be HTrue. Despite the fact that it looks like a computation at the value level, its actual purpose is to express a computation at the type level; no interesting value-level computation is taking place at all. If we had defined HTrue and HFalse in the following way:
A field with label l (a phantom type (Hinze 2003)) and value of type v is represented by the type: newtype LVPair l v = LVPair {valueLVPair :: v } Labels are now almost first-class objects, and can be used as typelevel values. We can retrieve the label value using the function labelLVPair , which exposes the phantom type parameter:
data HTrue  = HTrue ; hTrue  = HTrue  :: HTrue
data HFalse = HFalse; hFalse = HFalse :: HFalse
labelLVPair :: LVPair l v → l labelLVPair = ⊥
then the same computation would also be performed at the value level, resulting in the value HTrue of type HTrue.
2.2 Extensible Records
Since we need to represent many labels, we introduce a polymorphic type Proxy to represent them; by choosing a different phantom type for each label to be represented we can distinguish them:
2.1 Heterogeneous Lists
Heterogeneous lists are represented with the data types HNil and HCons, which model the structure of a normal list both at the value and type level:
data Proxy e; proxy = ⊥ :: Proxy e Thus, the following declarations define a record (myR) with two elements, labelled by Label1 and Label2 :
data HNil = HNil data HCons e l = HCons e l
data Label1; label1 = proxy :: Proxy Label1
data Label2; label2 = proxy :: Proxy Label2
field1 = LVPair True  :: LVPair (Proxy Label1) Bool
field2 = LVPair "bla" :: LVPair (Proxy Label2) [Char]
myR    = Record (HCons field1 (HCons field2 HNil))
The sequence HCons True (HCons "bla" HNil) is a correct heterogeneous list with type HCons Bool (HCons String HNil). Since we want to prevent that an expression HCons True False represents a correct heterogeneous list (the second HCons argument is not a type-level list) we introduce the class HList and its instances, and express this constraint by adding a context condition to the HCons... instance:
Since our lists will represent collections of attributes we want to express statically that we do not have more than a single definition for each attribute occurrence, and so the labels in a record should be all different. This constraint is represented by requiring an instance of the class HRLabelSet to be available when defining extendability for records:
class HList l instance HList HNil instance HList l ⇒ HList (HCons e l ) The library includes a multi-parameter class HExtend to model the extension of heterogeneous collections.
instance HRLabelSet (HCons (LVPair l v ) r ) ⇒ HExtend (LVPair l v ) (Record r ) (Record (HCons (LVPair l v ) r )) where hExtend f (Record r ) = Record (HCons f r )
class HExtend e l l 0 | e l → l 0 , l 0 → e l where hExtend :: e → l → l 0 The functional dependency e l → l 0 makes that HExtend is a type-level function, instead of a relation: once e and l are fixed l 0 is uniquely determined. It fixes the type l 0 of a collection, resulting from extending a collection of type l with an element of type e. The member hExtend performs the same computation at the level of values. The instance of HExtend for heterogeneous lists includes the well-formedness condition:
The class HasField is used to retrieve the value part corresponding to a specific label from a record: class HasField l r v | l r → v where hLookupByLabel :: l → r → v At the type-level it is statically checked that the record r indeed has a field with label l associated with a value of the type v . At value-level the member hLookupByLabel returns the value of type v . So, the following expression returns the string "bla":
instance HList l ⇒ HExtend e l (HCons e l ) where hExtend = HCons The main reason for introducing the class HExtend is to make it possible to encode constraints on the things which can be HCons-
hLookupByLabel label2 myR
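To illustrate how these pieces fit together, the record myR defined earlier can be extended with a further field and queried again; Label3 and its value are our own example, not taken from the paper:

data Label3; label3 = proxy :: Proxy Label3
field3 = LVPair 42 :: LVPair (Proxy Label3) Int
myR 0  = hExtend field3 myR
-- hLookupByLabel label3 myR 0 now returns 42, and the HRLabelSet
-- constraint statically rejects adding a second field labelled Label3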
The possibility to update an element in a record at a given label position is provided by:
class HUpdateAtLabel l v r r 0 | l v r → r 0 where hUpdateAtLabel :: l → v → r → r 0 In order to keep our programs readable we introduce infix operators for some of the previous functions:
• { v1 , ..., vn } for (v1 .∗. ... .∗. vn .∗. HNil ) • {{ v1 , ..., vn }} for (v1 .∗. ... .∗. vn .∗. emptyRecord )
myR = {{ label1 .=. True, label2 .=. "bla" }}
Rules
In this subsection we show how attributes and their defining rules are represented. An attribution is a finite mapping from attribute names to attribute values, represented as a Record , in which each field represents the name and value of an attribute.
data Att smin; smin = proxy :: Proxy Att smin data Att ival ; ival = proxy :: Proxy Att ival data Att sres; sres = proxy :: Proxy Att sres
r
(Att (Proxy Att smin) Int) (Att (Proxy Att sres) Tree) (Att (Proxy Att ival ) Int) (Att (Proxy Att ival ) Int)
type Output Node = Fam IC SP Attribute computations are defined in terms of rules. As defined by (de Moor et al. 2000a), a rule is a mapping from an input family to an output family. In order to make rules composable we define a rule as a mapping from input attributes to a function which extends a family of output attributes with the new elements defined by this rule:
data Fam c p = Fam c p
type Rule sc ip ic sp ic 0 sp 0 = Fam sc ip → Fam ic sp → Fam ic 0 sp 0
A Fam contains a single attribution for the parent and a collection of attributions for the children. Thus the type p will always be a Record with fields of type Att, and the type c a Record with fields of the type:
Thus, the type Rule states that a rule takes as input the synthesized attributes of the children sc and the inherited attributes of the parent ip and returns a function from the output constructed thus far (inherited attributes of the children ic and synthesized attributes of the parent sp) to the extended output. The composition of two rules is the composition of the two functions after applying each of them to the input family first:
type Chi ch atts = LVPair ch atts where ch is a label that represents the name of that child and atts is again a Record with the fields of type Att associated with this particular child. In our example the Root production has a single child Ch tree of type Tree, the Node production has two children labelled by Ch l and Ch r of type Tree, and the Leaf production has a single child called Ch i of type Int. Thus we generate, using template Haskell: (Ch (Ch (Ch (Ch
ival
We now have all the ingredients to define the output family for Node-s.
When inspecting what happens at a production we see that information flows from the inherited attribute of the parent and the synthesized attributes of the children (henceforth called in the input family) to the synthesized attributes of the parent and the inherited attributes of the children (together called the output family from now on). Both the input and the output attribute family is represented by an instance of:
type IC = Record (HCons (Chi (Proxy (Ch l, Tree)) IL)
                 (HCons (Chi (Proxy (Ch r, Tree)) IR) HNil))
The labels2 (attribute names) for the attributes of the example are:
The next type collects the last two children attributions into a single record:
type Att att val = LVPair att val
type SP = Record (HCons (Att (Proxy Att smin) Int)
                 (HCons (Att (Proxy Att sres) Tree) HNil))
type IL = Record (HCons (Att (Proxy Att ival) Int) HNil)
type IR = Record (HCons (Att (Proxy Att ival) Int) HNil)
So, for example the definition of myR can now be written as:
Families are used to model the input and output attributes of attribute computations. For example, Figure 5 shows the input (black arrows) and output (white arrows) attribute families of the repmin problem for the production Node. We now give the attributions associated with the output family of the Node production, which are the synthesized attributes of the parent (SP ) and the inherited attributions for the left and right child (IL and IR):
Furthermore we will use the following syntactic sugar to denote lists and records in the rest of the paper:
Figure 5. Repmin’s input and output families for Node
(.∗.) = hExtend .=. v = LVPair v r # l = hLookupByLabel l r
ext :: Rule sc ip ic 0 sp 0 ic 00 sp 00 → Rule sc ip ic sp ic 0 sp 0 → Rule sc ip ic sp ic 00 sp 00
(f ‘ext‘ g) input = f input . g input
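For example, the two synthesized-attribute rules for the Node production given in Figure 6 can be merged into a single rule with ext; this is a sketch of how rules compose, and the combined name is our own:

node syn = node smin ‘ext‘ node sres
-- node syn first lets node sres extend the output family with sres,
-- and then lets node smin extend that result with smin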
3.1 Rule Definition
We now introduce the functions syndef and inhdef , which are used to define primitive rules which define a synthesized or an inherited attribute respectively. Figure 6 lists all the rule definitions for our running example. The naming convention is such that a rule with name prod att defines the attribute att for the production prod . Without trying to completely understand the definitions we suggest the reader to compare them with their respective SEM specifications in Figure 2.
Note that we encode both the name and the type of the child in the type representing the label. (These and all needed labels can be generated automatically by Template Haskell functions available in the library.)
leaf smin (Fam chi par) = syndef smin (chi # ch i)
node smin (Fam chi par) = syndef smin (((chi # ch l) # smin) ‘min‘ ((chi # ch r) # smin))

root ival (Fam chi par) = inhdef ival { nt Tree }
                            {{ ch tree .=. (chi # ch tree) # smin }}
node ival (Fam chi par) = inhdef ival { nt Tree }
                            {{ ch l .=. par # ival, ch r .=. par # ival }}

root sres (Fam chi par) = syndef sres ((chi # ch tree) # sres)
leaf sres (Fam chi par) = syndef sres (Leaf (par # ival))
node sres (Fam chi par) = syndef sres (Node ((chi # ch l) # sres) ((chi # ch r) # sres))

Figure 6. Rule definitions for repmin

Figure 8. Rule node sres

by each child and an attribute smin can be safely added to the current synthesized attribution of the parent: 3

node sres :: (HasField (Proxy (Ch l, Tree)) sc scl
            , HasField (Proxy Att smin) scl Int
            , HasField (Proxy (Ch r, Tree)) sc scr
            , HasField (Proxy Att smin) scr Int
            , HExtend (Att (Proxy Att smin) Int) sp sp 0)
            ⇒ Rule sc ip ic sp ic sp 0

The function inhdef introduces a new inherited attribute for a collection of non-terminals. It takes the following parameters:

att the attribute which is being defined;
nts the non-terminals with which this attribute is being associated; vals a record labelled with child names and containing values, describing how to compute the attribute being defined at each of the applicable child positions.
The function syndef adds the definition of a synthesized attribute. It takes a label att representing the name of the new attribute, a value val to be assigned to this attribute, and it builds a function which updates the output constructed thus far.
The parameter nts takes over the role of the INH declaration in Figure 2. Here this extra parameter seems to be superfluous, since its value can be inferred, but adds an additional restriction to be checked (yielding to better errors) and it will be used in the introduction of default rules later. The names for the non-terminals of our example are:
nt Root = proxy :: Proxy Root
nt Tree = proxy :: Proxy Tree

The function syndef adds a synthesized attribute as follows:

syndef :: HExtend (Att att val) sp sp 0 ⇒ att → val → (Fam ic sp → Fam ic sp 0)
syndef att val (Fam ic sp) = Fam ic (att .=. val .∗. sp)

The record sp with the synthesized attributes of the parent is extended with a field with name att and value val, as shown in Figure 7. If we look at the type of the function, the check that we have not already defined this attribute is done by the constraint HExtend (Att att val) sp sp 0, meaning that sp 0 is the result of adding the field (Att att val) to sp, which cannot have any field with name att. Thus we are statically preventing duplicated attribute definitions.

Figure 7. Synthesized attribute definition

The result of inhdef again is a function which updates the output constructed thus far.

inhdef :: Defs att nts vals ic ic 0 ⇒ att → nts → vals → (Fam ic sp → Fam ic 0 sp)
inhdef att nts vals (Fam ic sp) = Fam (defs att nts vals ic) sp

The class Defs is defined by induction over the record vals containing the new definitions. The function defs inserts each definition into the attribution of the corresponding child.

class Defs att nts vals ic ic 0 | vals ic → ic 0 where
  defs :: att → nts → vals → ic → ic 0
We start out with the base case, where we have no more definitions to add. In this case the inherited attributes of the children are returned unchanged.
Let us take a look at the rule definition node smin of the attribute smin for the production Node in Figure 6. The children ch l and ch r are retrieved from the input family so we can subsequently retrieve the attribute smin from these attributions, and construct the computation of the synthesized attribute smin. This process is demonstrated in Figure 8. The attribute smin is required (underlined) in the children l and r of the input, and the parent of the output is extended with smin. If we take a look at the type which is inferred for node sres we find back all the constraints which are normally checked by an off-line attribute grammar system, i.e., an attribute smin is made available
instance Defs att nts (Record HNil) ic ic where
  defs _ _ _ ic = ic
order to keep the explanation simple we will suppose that min is not overloaded, and takes Int’s as parameter.
250
function SingleDef (and its corresponding value level function singledef ) is used to incorporate the single definition (pch) into ic 0 , resulting in a new set ic 00 . The type level functions HasLabel and HMember are used to statically check whether the child being defined (lch) exists in ic 0 and if its type (t) belongs to the non-terminals nts, respectively. The result of both functions are HBool s (either HTrue or HFalse) which are passed as parameters to SingleDef . We are now ready to give the definition for the non-empty case:
for this child with a field (Att att vch), thus binding attribute att to value vch. The type system checks, thanks to the presence of HExtend , that the attribute att was not defined before in och.
4.
Aspects
We represent aspects as records which contain for each production a rule field. type Prd prd rule = LVPair prd rule
instance (Defs att nts (Record vs) ic ic 0 , HasLabel (Proxy (lch, t)) ic 0 mch , HMember (Proxy t) nts mnts , SingleDef mch mnts att (Chi (Proxy (lch, t)) vch) ic 0 ic 00 ) ⇒ Defs att nts (Record (HCons (Chi (Proxy (lch, t)) vch) vs)) ic ic 00 where defs att nts ∼(Record (HCons pch vs)) ic = singledef mch mnts att pch ic 0 where ic 0 = defs att nts (Record vs) ic lch = labelLVPair pch mch = hasLabel lch ic 0 mnts = hMember (sndProxy lch) nts The class Haslabel can be encoded straightforwardly, together with a function which retrieves part of a phantom type:
For our example we thus introduce fresh labels to refer to repmin’s productions: data P Root; p Root = proxy :: Proxy P Root data P Node; p Node = proxy :: Proxy P Node data P Leaf ; p Leaf = proxy :: Proxy P Leaf We now can define the aspects of repmin as records with the rules of Figure 6.5 asp smin = {{ p , p asp ival = {{ p , p asp sres = {{ p , p , p 4.1
class HBool b ⇒ HasLabel l r b | l r → b instance HasLabel l r b ⇒ HasLabel l (Record r ) b instance (HEq l lp b, HasLabel l r b 0 , HOr b b 0 b 00 ) ⇒ HasLabel l (HCons (LVPair lp vp) r ) b 00 instance HasLabel l HNil HFalse hasLabel :: HasLabel l r b ⇒ l → r → b hasLabel = ⊥ sndProxy :: Proxy (a, b) → Proxy b sndProxy = ⊥
Leaf Node Root Node Root Node Leaf
.=. leaf smin .=. node smin }} .=. root ival .=. node ival }} .=. root sres .=. node sres .=. leaf sres }}
Aspects Combination
We define the class Com which will provide the instances we need for combining aspects: class Com r r 0 r 00 | r r 0 → r 00 where ( ⊕ ) :: r → r 0 → r 00 With this operator we can now combine the three aspects which together make up the repmin problem: asp repmin = asp smin ⊕ asp ival ⊕ asp sres Combination of aspects is a sort of union of records where, in case of fields with the same label (i.e., for rules for the same production), the rule combination (ext) is applied to the values. To perform the union we iterate over the second record, inserting the next element into the first one if it is new and combining it with an existing entry if it exists:
We only show the instance with both mch and mnts equal to HTrue, which is the case we expect to apply in a correct attribute grammar definition: we do not refer to children which do not exist, and this child has the type we expect.4 class SingleDef mch mnts att pv ic ic 0 | mch mnts pv ic → ic 0 where singledef :: mch → mnts → att → pv → ic → ic 0 instance (HasField lch ic och , HExtend (Att att vch) och och 0 , HUpdateAtLabel lch och 0 ic ic 0 ) ⇒ SingleDef HTrue HTrue att (Chi lch vch) ic ic 0 where singledef att pch ic = hUpdateAtLabel lch (att .=. vch .∗. och) ic where lch = labelLVPair pch vch = valueLVPair pch och = hLookupByLabel lch ic
instance Com r (Record HNil ) r where r ⊕ = r instance (HasLabel lprd r b , ComSingle b (Prd lprd rprd ) r r 000 , Com r 000 (Record r 0 ) r 00 ) ⇒ Com r (Record (HCons (Prd lprd rprd ) r 0 )) r 00 where r ⊕ (Record (HCons prd r 0 )) = r 00 where b = hasLabel (labelLVPair prd ) r r 000 = comsingle b prd r r 00 = r 000 ⊕ (Record r 0 ) We use the class ComSingle to insert a single element into the first record. The type-level Boolean parameter b is used to distinguish those cases where the left hand operand already contains a field for the rule to be added and the case where it is new. 6
We will guarantee that the collection of attributions ic (inherited attributes of the children) contains an attribution och for the child lch, and so we can use hUpdateAtlabel to extend the attribution 4 The
instances for error cases could just be left undefined, yielding to “undefined instance” type errors. In our library we use a class Fail (as defined in (Kiselyov et al. 2004), section 6) in order to get more instructive type error messages.
5 We
assume that the monomorphism restriction has been switched off. parameter can be avoided by allowing overlapping instances, but we prefer to minimize the number of Haskell extensions we use. 6 This
class ComSingle b f r r 0 | b f r → r 0 where comsingle :: b → f → r → r 0
• Define the function repmin that takes a tree, executes the
semantic function for the tree and the aspect asp repmin, and selects the synthesized attribute sres from the result.
If the first record has a field with the same label lprd , we update its value by composing the rules. 0
0
00
instance (HasField lprd r (Rule sc ip ic sp ic sp ) , HUpdateAtLabel lprd (Rule sc ip ic sp ic 00 sp 00 ) 0 rr ) ⇒ ComSingle HTrue (Prd lprd (Rule sc ip ic sp ic 0 sp 0 )) r r0 where comsingle f r = hUpdateAtLabel n ((r # n) ‘ext‘ v ) r where n = labelLVPair f v = valueLVPair f In case the first record does not have a field with the label, we just insert the element in the record. instance ComSingle HFalse f (Record r ) (Record (HCons f r )) where comsingle f (Record r ) = Record (HCons f r )
5.
5.1
The Knit Function
As said before the function knit takes the combined rules for a node and the semantic functions of the children, and builds a function from the inherited attributes of the parent to its synthesized attributes. We start out by constructing an empty output family, containing an empty attribution for each child and one for the parent. To each of these attributions we apply the corresponding part of the rules, which will construct the inherited attributes of the children and the synthesized attributes of the parent (together forming the output family). Rules however contain references to the input family, which is composed of the inherited attributes of the parent ip and the synthesized attributes of the children sc. knit :: (Empties fc ec, Kn fc ic sc) ⇒ Rule sc ip ec (Record HNil ) ic sp → fc → ip → sp knit rule fc ip = let ec = empties fc (Fam ic sp) = rule (Fam sc ip) (Fam ec emptyRecord ) sc = kn fc ic in sp
Semantic Functions
Our overall goal is to construct a Tree-algebra and a Root-algebra. For the domain associated with each non-terminal we take the function mapping its inherited to its synthesized attributes. The hard work is done by the function knit, the purpose of which is to combine the data flow defined by the DDG –which was constructed by combining all the rules for this production– with the semantic functions of the children (describing the flow of data from their inherited to their synthesized attributes) into the semantic function for the parent. With the attribute computations as first-class entities, we can now pass them as an argument to functions of the form sem . The following code follows the definitions of the data types at hand: it contains recursive calls for all children of an alternative, each of which results in a mapping from inherited to synthesized attributes for that child followed by a call to knit, which stitches everything together: sem Root asp (Root t) = knit (asp # p Root) {{ ch sem Tree asp (Node l r ) = knit (asp # p Node) {{ ch , ch sem Tree asp (Leaf i ) = knit (asp # p Leaf ) {{ ch
repmin tree = sem Root asp repmin (Root tree) () # sres
00
The function kn, which takes the semantic functions of the children (fc) and their inputs (ic), computes the results for the children (sc). The functional dependency fc → ic sc indicates that fc determines ic and sc, so the shape of the record with the semantic functions determines the shape of the other records: class Kn fc ic sc | fc → ic sc where kn :: fc → ic → sc We declare a helper instance of Kn to remove the Record tags of the parameters, in order to be able to iterate over their lists without having to tag and untag at each step: instance Kn fc ic sc ⇒ Kn (Record fc) (Record ic) (Record sc) where kn (Record fc) (Record ic) = Record $ kn fc ic
tree .=. sem Tree asp t }}
When the list of children is empty, we just return an empty list of results.
l .=. sem Tree asp l r .=. sem Tree asp r }}
instance Kn HNil HNil HNil where = hNil kn
i .=. sem Lit i }}
The function kn is a type level zipWith ($), which applies the functions contained in the first argument list to the corresponding element in the second argument list.
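At the value level the corresponding function would look as follows; this is an analogy only, written for ordinary homogeneous lists, and zipApply is our own name:

zipApply :: [a → b] → [a] → [b]
zipApply (f : fs) (x : xs) = f x : zipApply fs xs
zipApply []       []       = []
zipApply _        _        = error "argument lists of different length"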
sem Lit e (Record HNil ) = e Since this code is completely generic we provide a Template Haskell function deriveAG which automatically generates the functions such as sem Root and sem Tree, together with the labels for the non-terminals and labels for referring to children. Thus, to completely implement the repmin function we need to undertake the following steps:
instance Kn fcr icr scr ⇒ Kn (HCons (Chi lch (ich → sch)) fcr ) (HCons (Chi lch ich) icr ) (HCons (Chi lch sch) scr ) where kn ∼(HCons pfch fcr ) ∼(HCons pich icr ) = let scr = kn fcr icr lch = labelLVPair pfch fch = valueLVPair pfch ich = valueLVPair pich in HCons (newLVPair lch (fch ich)) scr
• Generate the semantic functions and the corresponding labels
by using: $ (deriveAG “Root) • Define and compose the aspects as shown in the previous sec-
tions, resulting in asp repmin.
The class Empties is used to construct the record, with an empty attribution for each child, which we have used to initialize the computation of the input attributes with.
The class Copy iterates over the record ic containing the output attribution of the children, and inserts the attribute att with value vp if the type of the child is included in the list nts of non-terminals and the attribute is not already defined for this child.
class Empties fc ec | fc → ec where empties :: fc → ec
class Copy att nts vp ic ic 0 | ic → ic 0 where cpychi :: att → nts → vp → ic → ic 0
In the same way that fc determines the shape of ic and sc in Kn, it also tells us how many empty attributions ec to produce and in which order:
instance Copy att nts vp (Record HNil ) (Record HNil ) where cpychi = emptyRecord
instance Empties fc ec ⇒ Empties (Record fc) (Record ec) where empties (Record fc) = Record $ empties fc instance Empties fcr ecr ⇒ Empties (HCons (Chi lch fch) fcr ) (HCons (Chi lch (Record HNil )) ecr ) where empties ∼(HCons pch fcr ) = let ecr = empties fcr lch = labelLVPair pch in HCons (newLVPair lch emptyRecord ) ecr instance Empties HNil HNil where empties = hNil
6.
instance (Copy att nts vp (Record ics) ics 0 , HMember (Proxy t) nts mnts , HasLabel att vch mvch , Copy 0 mnts mvch att vp (Chi (Proxy (lch, t)) vch) pch , HExtend pch ics 0 ic) ⇒ Copy att nts vp (Record (HCons (Chi (Proxy (lch, t)) vch) ics)) ic where cpychi att nts vp (Record (HCons pch ics)) = cpychi 0 mnts mvch att vp pch .∗. ics 0 where ics 0 = cpychi att nts vp (Record ics) lch = sndProxy (labelLVPair pch) vch = valueLVPair pch mnts = hMember lch nts mvch = hasLabel att vch
Common Patterns
At this point all the basic functionality of attribute grammars has been implemented. In practice however we want more. If we look at the code in Figure 2 we see that the rules for ival at the production Node are “free of semantics”, since the value is copied unmodified to its children. If we were dealing with a tree with three children instead of two the extra line would look quite similar. When programming attribute grammars such patterns are quite common and most attribute grammar systems contain implicit rules which automatically insert such “trivial” rules. As a result descriptions can decrease in size dramatically. The question now arises whether we can extend our embedded language to incorporate such more high level data flow patterns. 6.1
The function cpychi 0 updates the field pch by adding the new attribute: class Copy 0 mnts mvch att vp pch pch 0 | mnts mvch pch → pch 0 where cpychi 0 :: mnts → mvch → att → vp → pch → pch 0 When the type of the child doesn’t belong to the non-terminals for which the attribute is defined we define an instance which leaves the field pch unchanged.
Copy Rule
instance Copy 0 HFalse mvch att vp pch pch where pch = pch cpychi 0
The most common pattern is the copying of an inherited attribute from the parent to all its children. We capture this pattern with the an operator copy, which takes the name of an attribute att and an heterogeneous list of non-terminals nts for which the attribute has to be defined, and generates a copy rule for this. This corresponds closely to the introduction of a Reader monad.
We also leave pch unchanged if the attribute is already defined for this child. instance Copy 0 HTrue HTrue att vp pch pch where cpychi 0 pch = pch
copy :: (Copy att nts vp ic ic 0 , HasField att ip vp) ⇒ att → nts → Rule sc ip ic sp ic 0 sp
In other case the attribution vch is extended with the attribute (Att att vp). instance HExtend (Att att vp) vch vch 0 ⇒ Copy 0 HTrue HFalse att vp (Chi lch vch) (Chi lch vch 0 ) where cpychi 0 att vp pch = lch .=. (att .=. vp .∗. vch) where lch = labelLVPair pch vch = valueLVPair pch
Thus, for example, the rule node ival of Figure 6 can now be written as: node ival input = copy ival { nt Tree } input The function copy uses a function defcp to define the attribute att as an inherited attribute of its children. This function is similar in some sense to inhdef , but instead of taking a record containing the new definitions it gets the value vp of the attribute which is to be copied to the children: copy att nts (Fam
6.2
Other Rules
In this section we introduce two more constructs of our DSL, without giving their implementation. Besides the Reader monad, there is also the Writer monad. Often we want to collect information provided by some of the children into an attribute of the parent. This can be used to e.g. collect all identifiers contained in an expression. Such a synthesized attribute can be declared using the
ip) = defcp att nts (ip # att)
defcp :: Copy att nts vp ic ic 0 ⇒ att → nts → vp → (Fam ic sp → Fam ic 0 sp) defcp att nts vp (Fam ic sp) = Fam (cpychi att nts vp ic) sp
use rule, which combines the attribute values of the children in similar way as Haskell’s foldr1 . The use rule takes the following arguments: the attribute to be defined, the list of non-terminals for which the attribute is defined, a monoidal operator which combines the attribute values, and a unit value to be used in those cases where none of the children has such an attribute.
chnAspect att nts chns inhdefs syndefs = (defAspect (FnChn att nts) chns) ⊕ (attAspect (FnInh att nts) inhdefs) ⊕ (attAspect (FnSyn att) syndefs) 7.1
Attribute Aspects
use :: (Use att nts a sc, HExtend (Att att a) sp sp 0 ) ⇒ att → nts → (a → a → a) → a → Rule sc ip ic sp ic sp 0
Consider the explicit definitions of the aspect asp sres. The idea is that, when declaring the explicit definitions, instead of completely writing the rules, like:
Using this new combinator the rule node smin of Figure 6 becomes:
{{ p Root .=. (λinput → syndef sres ((chi input # ch tree) # sres)) , p Leaf .=. (λinput → syndef sres (Leaf (par input # ival ))) }}
node smin = use smin { nt Tree } min 0 A third common pattern corresponds to the use of the State monad. A value is threaded in a depth-first way through the tree, being updated every now and then. For this we have chained attributes (both inherited and synthesized). If a definition for a synthesized attribute of the parent with this name is missing we look for the right-most child with a synthesized attribute of this name. If we are missing a definition for one of the children, we look for the right-most of its left siblings which can provide such a value, and if we cannot find it there, we look at the inherited attributes of the father.
we just define a record with the functions from the input to the attribute value: {{ p Root .=. (λinput → (chi input # ch tree) # sres) , p Leaf .=. (λinput → Leaf (par input # ival )) }} By mapping the function ((.) (syndef sres)) over such records, we get back our previous record containing rules. The function attAspect updates all the values of a record by applying a function to them: class AttAspect rdef defs rules | rdef defs → rules where attAspect :: rdef → defs → rules instance (AttAspect rdef (Record defs) rules , Apply rdef def rule , HExtend (Prd lprd rule) rules rules 0 ) ⇒ AttAspect rdef (Record (HCons (Prd lprd def ) defs)) rules 0 where attAspect rdef (Record (HCons def defs)) = let lprd = (labelLVPair def ) in lprd .=. apply rdef (valueLVPair def ) .∗. attAspect rdef (Record defs) instance AttAspect rdef (Record HNil ) (Record HNil ) where attAspect = emptyRecord
chain :: (Chain att nts val sc ic sp ic 0 sp 0 , HasField att ip val ) ⇒ att → nts → Rule sc ip ic sp ic 0 sp 0
7.
Defining Aspects
Now we have both implicit rules to define attributes, and explicit rules which contain explicit definitions, we may want to combine these into a single attribute aspect which contains all the definitions for single attribute. We now refer to Figure 9 which is a desugared version of the notation presented in the introduction. An inherited attribute aspect, like asp ival in Figure 9, can be defined using the function inhAspect. It takes as arguments: the name of the attribute att, the list nts of non-terminals where the attribute is defined, the list cpys of productions where the copy rule has to be applied, and a record defs containing the explicit definitions for some productions:
The class Apply (from the HList library) models the function application, and it is used to add specific constraints on the types:
inhAspect att nts cpys defs = (defAspect (FnCpy att nts) cpys) ⊕ (attAspect (FnInh att nts) defs)
class Apply f a r | f a → r where apply :: f → a → r In the case of synthesized attributes we apply ((.) (syndef att)) to values of type (Fam sc ip → val ) in order to construct a rule of type (Rule sc ip ic sp ic sp 0 ). The constraint HExtend (LVPair att val ) sp sp 0 is introduced by the use of syndef . The data type FnSyn is used to determine which instance of Apply has to be chosen.
The function attAspect generates an attribute aspect given the explicit definitions, whereas defAspect constructs an attribute aspect based on the rule of a common pattern. Thus, an inherited attribute aspect is defined as a composition of two attribute aspects: one with the explicit definitions and the other with the application of the copy rule. In the following sections we will see how attAspect and defAspect are implemented. A synthesized attribute aspect, like asp smin and asp sres in Figure 9, can be defined using synAspect. Here the rule applied is the use rule, which takes op as the monoidal operator and unit as the unit value.
data FnSyn att = FnSyn att

instance HExtend (LVPair att val) sp sp'
      ⇒ Apply (FnSyn att) (Fam sc ip → val) (Rule sc ip ic sp ic sp')
  where apply (FnSyn att) f = syndef att . f
synAspect att nts op unit uses defs = (defAspect (FnUse att nts op unit) uses) ⊕ (attAspect (FnSyn att) defs)
In the case of inherited attributes the function applied to define the rule is ((.) (inhdef att nts)).

data FnInh att nt = FnInh att nt
A chained attribute definition introduces both an inherited and a synthesized attribute. In this case the pattern to be applied is the chain rule.
instance Defs att nts vals ic ic' ⇒ Apply (FnInh att nts) (Fam sc ip → vals)
asp smin = synAspect smin { nt Tree }                               -- synthesize at
             min 0 { p Node }                                       -- use at
             {{ p Leaf .=. (λ(Fam chi _) → chi # ch i) }}           -- define at

asp ival = inhAspect ival { nt Tree }                               -- inherit
             { p Node }                                             -- copy at
             {{ p Root .=. (λ(Fam chi _) →
                  {{ ch tree .=. (chi # ch tree) # smin }}) }}      -- define at

asp sres = synAspect sres { nt Root, nt Tree }                      -- synthesize at
             Node (Leaf 0) { p Node }                               -- use at
             {{ p Root .=. (λ(Fam chi _) → (chi # ch tree) # sres)
              , p Leaf .=. (λ(Fam _ par) → Leaf (par # ival)) }}    -- define at
Figure 9. Aspects definition for repmin

    (Rule sc ip ic sp ic' sp)
  where apply (FnInh att nts) f = inhdef att nts . f
hand, because Agda only permits the definition of total functions, we would need to maintain even more information in our types to make it evident that all our functions are indeed total. An open question is how easy it will be to extend the approach taken to more global strategies of accessing attribute definitions; some attribute grammar systems allow references to more remote attributes (Reps et al. 1986; Boyland 2005). Although we are convinced that we can in principle encode such systems too, the question remains how much work this turns out to be. Another thing we could have done is to make use of associated types (Chakravarty et al. 2005) in those cases where our relations are actually functions; since this feature is still experimental and has only recently become available, we have refrained from doing so for the moment.
7.2 Default Aspects
The function defAspect is used to construct an aspect given a rule and a list of production labels.

class DefAspect deff prds rules | deff prds → rules
  where defAspect :: deff → prds → rules

It iterates over the list of labels prds, constructing a record with these labels and a rule determined by the parameter deff as value. For inherited attributes we apply the copy rule copy att nts, for synthesized attributes use att nt op unit, and for chained attributes chain att nts. The following types are used, in a similar way as in attAspect, to determine the rule to be applied:
data FnCpy att nts = FnCpy att nts
data FnUse att nt op unit = FnUse att nt op unit
data FnChn att nt = FnChn att nt
In the first place we remark that we have achieved all four goals stated in the introduction:
Thus, for example in the case of the aspect asp ival , the application:
1. removing the need for a whole collection of indexed combinators as used in (de Moor et al. 2000b)
defAspect (FnCpy ival { nt Tree }) { p Node }
2. replacing extensible records completely by heterogeneous collections
generates the default aspect: {{ p Node .=. copy ival { nt Tree } }}
8. Conclusions
3. describing common attribute grammar patterns in order to reduce code size, and making them almost first-class objects
4. giving a nice demonstration of type-level programming
Related Work
We have extensive experience with attribute grammars in the construction of the Utrecht Haskell compiler (Dijkstra et al. 2009). The code of this compiler is completely factored out along the two axes mentioned in the introduction (Dijkstra and Swierstra 2004; Fokker and Swierstra 2008; Dijkstra et al. 2007), using the notation used in Figure 2. In doing so we have found the possibility to factor the code into separate pieces of text indispensable. We have also come to the conclusion that the so-called monadic approach, although it may seem attractive at first sight, in the end brings considerable complications when programs start to grow (Jones 1999). Since monad transformers are usually type based, we already run into problems if we extend a state twice with a value of the same type without taking explicit measures to avoid confusion. Another complication is that the interfaces of non-terminals are in general not uniform, thus necessitating all kinds of tricks to change the monad at the right places, keeping information to be reused later, etc. In our generated Haskell compiler (Dijkstra et al. 2009) we have non-terminals with more than 10 different attributes, and gluing all these together or selectively leaving some out turns out to be impossible to do by hand.
There have been several previous attempts at incorporating first-class attribute grammars in lazy functional languages. To the best of our knowledge all these attempts exploit some form of extensible records to collect attribute definitions. They however do not exploit the Haskell class system as we do. de Moor et al. (2000b) introduce a whole collection of functions, and as a result it is no longer possible to define copy, use and chain rules. Other approaches fail to provide some of the static guarantees that we have enforced (de Moor et al. 2000a). The exploration of the limitations of type-level programming in Haskell is still a topic of active research. For example, there has been recent work on modelling relational databases using techniques similar to those applied in this paper (Silva and Visser 2006). As is to be expected, the type-level programming performed here in Haskell can also be done in dependently typed languages such as Agda (Norell 2008; Oury and Swierstra 2008). By doing so, we could use Boolean values in type-level functions, thereby avoiding the need for a separate definition of the type-level Booleans. This would certainly simplify certain parts of our development. On the other
In our attribute grammar system (uuagc on Hackage), we perform a global flow analysis, which makes it possible to schedule the computations explicitly (Kastens 1980). Once we know the evaluation order we do not have to rely on lazy evaluation, and all parameter positions can be made strict. When combined with a uniqueness analysis we can, by reusing the space occupied by unreachable attributes, get an even further increase in speed. This leads to a considerable, albeit constant-factor, speed improvement. Unfortunately we do not see how we can perform such analyses with the approach described in this paper: the semantic functions defining the values of the attributes in principle access the whole input family, and we cannot find out which functions only access part of such a family, and if so which part.

Of course a straightforward implementation of extensible records will be quite expensive, since basically we use nested pairs to represent attributions. We think however that a not too complicated program analysis will reveal enough information to be able to transform the program into a much more efficient form by flattening such nested pairs. Note that, thanks to our type-level functions, which are completely evaluated by the compiler, we do not have to perform any run-time checks as in (de Moor et al. 2000a): once the program type-checks, there is nothing that will prevent it from running to completion, apart from logical errors in the definitions of the attributes.

Concluding, we think that the library described here is quite useful and relatively easy to experiment with. We notice furthermore that a conventional attribute grammar restriction, stating that no attribute should depend on itself, does not apply, since we build on top of a lazily evaluated language. An example of this can be found in online pretty printing (Swierstra 2004; Swierstra and Chitil 2009). Once we go for speed it may become preferable to use more conventional off-line generators. Ideally we would like to have a mixed approach in which we can use the same definitions as input for both systems.
Acknowledgments

We want to thank Oege de Moor for always lending an ear in discussing the merits of attribute grammars, and how to further profit from them. Marcos Viera wants to thank the EU project Lernet for funding his stay in Utrecht. Finally, we would like to thank the anonymous referees for their helpful reviews.

References

Richard S. Bird. Using circular programs to eliminate multiple traversals of data. Acta Informatica, 21:239–250, 1984.
John Boyland. Remote attribute grammars. Journal of the ACM (JACM), 52(4), July 2005. URL http://portal.acm.org/citation.cfm?id=1082036.1082042.
Manuel M. T. Chakravarty, Gabriele Keller, and Simon Peyton Jones. Associated type synonyms. In ICFP '05: Proceedings of the Tenth ACM SIGPLAN International Conference on Functional Programming, pages 241–253, New York, NY, USA, 2005. ACM.
Oege de Moor, Kevin Backhouse, and S. Doaitse Swierstra. First-class attribute grammars. Informatica (Slovenia), 24(3), 2000a.
Oege de Moor, Simon L. Peyton Jones, and Eric Van Wyk. Aspect-oriented compilers. In GCSE '99: Proceedings of the First International Symposium on Generative and Component-Based Software Engineering, pages 121–133, London, UK, 2000b. Springer-Verlag. ISBN 3-540-41172-0.
Atze Dijkstra and S. Doaitse Swierstra. Typing Haskell with an attribute grammar. In Advanced Functional Programming Summer School, number 3622 in LNCS. Springer-Verlag, 2004.
Atze Dijkstra, Jeroen Fokker, and S. Doaitse Swierstra. The structure of the Essential Haskell Compiler, or coping with compiler complexity. In Implementation of Functional Languages, 2007.
Atze Dijkstra, Jeroen Fokker, and S. Doaitse Swierstra. The architecture of the Utrecht Haskell compiler. In Haskell Symposium, New York, NY, USA, September 2009. ACM.
Jeroen Fokker and S. Doaitse Swierstra. Abstract interpretation of functional programs using an attribute grammar system. In Adrian Johnstone and Jurgen Vinju, editors, Language Descriptions, Tools and Applications, 2008.
Benedict R. Gaster and Mark P. Jones. A polymorphic type system for extensible records and variants. Technical Report NOTTCS-TR-96-3, University of Nottingham, 1996.
Thomas Hallgren. Fun with functional dependencies, or (draft) types as values in static computations in Haskell. In Proc. of the Joint CS/CE Winter Meeting, 2001.
Ralf Hinze. Fun with phantom types. In Jeremy Gibbons and Oege de Moor, editors, The Fun of Programming, pages 245–262. Palgrave Macmillan, 2003.
Mark P. Jones. Typing Haskell in Haskell. In Haskell Workshop, 1999. URL http://www.cse.ogi.edu/~mpj/thih/thih-sep1-1999/.
Mark P. Jones. Type classes with functional dependencies. In ESOP '00: Proceedings of the 9th European Symposium on Programming Languages and Systems, pages 230–244, London, UK, 2000. Springer-Verlag.
Uwe Kastens. Ordered attribute grammars. Acta Informatica, 13:229–256, 1980.
Oleg Kiselyov, Ralf Lämmel, and Keean Schupke. Strongly typed heterogeneous collections. In Haskell '04: Proceedings of the ACM SIGPLAN Workshop on Haskell, pages 96–107. ACM Press, 2004.
Conor McBride. Faking it: simulating dependent types in Haskell. Journal of Functional Programming, 12(5):375–392, 2002.
Ulf Norell. Dependently typed programming in Agda. In 6th International School on Advanced Functional Programming, 2008.
Nicolas Oury and Wouter Swierstra. The power of Pi. In ICFP '08: Proceedings of the Thirteenth ACM SIGPLAN International Conference on Functional Programming, 2008.
Simon Peyton Jones, Mark Jones, and Erik Meijer. Type classes: an exploration of the design space. In Haskell Workshop, June 1997.
Thomas W. Reps, Carla Marceau, and Tim Teitelbaum. Remote attribute updating for language-based editors. In POPL '86: Proceedings of the 13th ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages, pages 1–13, New York, NY, USA, 1986. ACM.
J. C. Reynolds. User defined types and procedural data as complementary approaches to data abstraction. In S. A. Schuman, editor, New Directions in Algorithmic Languages. INRIA, 1975.
Alexandra Silva and Joost Visser. Strong types for relational databases. In Haskell '06: Proceedings of the 2006 ACM SIGPLAN Workshop on Haskell, pages 25–36, New York, NY, USA, 2006. ACM. ISBN 1-59593-489-8.
S. Doaitse Swierstra and Olaf Chitil. Linear, bounded, functional pretty-printing. Journal of Functional Programming, 19(1):1–16, 2009.
S. Doaitse Swierstra, Pablo R. Azero Alcocer, and João A. Saraiva. Designing and implementing combinator languages. In S. D. Swierstra, Pedro Henriques, and José Oliveira, editors, Advanced Functional Programming, Third International School, AFP'98, volume 1608 of LNCS, pages 150–206. Springer-Verlag, 1999.
S. D. Swierstra. Linear, online, functional pretty printing (extended and corrected version). Technical Report UU-CS-2004-025a, Inst. of Information and Comp. Science, Utrecht Univ., 2004.
Phil Wadler. The Expression Problem. E-mail, available online, 1998.
Parallel Concurrent ML

John Reppy (University of Chicago, [email protected])
Claudio V. Russo (Microsoft Research, [email protected])
Yingqi Xiao (University of Chicago, [email protected])
Abstract
ML [MTHM97]. CML extends SML with synchronous message passing over typed channels and a powerful abstraction mechanism, called first-class synchronous operations, for building synchronization and communication abstractions. This mechanism allows programmers to encapsulate complicated communication and synchronization protocols as first-class abstractions, which encourages a modular style of programming where the actual underlying channels used to communicate with a given thread are hidden behind data and type abstraction. CML has been used successfully in a number of systems, including a multithreaded GUI toolkit [GR93], a distributed tuple-space implementation [Rep99], and a system for implementing partitioned applications in a distributed setting [YYS+01]. The design of CML has inspired many implementations of CML-style concurrency primitives in other languages. These include other implementations of SML [MLt], other dialects of ML [Ler00], other functional languages, such as Haskell [Rus01] and Scheme [FF04], and other high-level languages, such as Java [Dem97]. One major limitation of the CML implementation is that it is single-threaded and cannot take advantage of multicore or multiprocessor systems.1 With the advent of the multicore and manycore era, this limitation must be addressed. In a previous workshop paper, we described a partial solution to this problem; namely a protocol for implementing a subset of CML, called Asymmetric CML (ACML), that supports input operations, but not output, in choice contexts [RX08]. This paper builds on that previous result by presenting an optimistic-concurrency protocol for CML synchronization that supports both input and output operations in choices. In addition to describing this protocol, this paper makes several additional contributions beyond the previous work.
Concurrent ML (CML) is a high-level message-passing language that supports the construction of first-class synchronous abstractions called events. This mechanism has proven quite effective over the years and has been incorporated in a number of other languages. While CML provides a concurrent programming model, its implementation has always been limited to uniprocessors. This limitation is exploited in the implementation of the synchronization protocol that underlies the event mechanism, but with the advent of cheap parallel processing on the desktop (and laptop), it is time for Parallel CML. Parallel implementations of CML-like primitives for Java and Haskell exist, but build on high-level synchronization constructs that are unlikely to perform well. This paper presents a novel, parallel implementation of CML that exploits a purpose-built optimistic concurrency protocol designed for both correctness and performance on shared-memory multiprocessors. This work extends and completes an earlier protocol that supported just a strict subset of CML with synchronization on input, but not output, events. Our main contributions are a model-checked reference implementation of the protocol and two concrete implementations. This paper focuses on Manticore's functional, continuation-based implementation but briefly discusses an independent, thread-based implementation written in C# and running on Microsoft's stock, parallel runtime. Although very different in detail, both derive from the same design. Experimental evaluation of the Manticore implementation reveals good performance, despite the extra overhead of multiprocessor synchronization.

Categories and Subject Descriptors: D.3.0 [Programming Languages]: General; D.3.2 [Programming Languages]: Language Classifications—Applicative (functional) languages; Concurrent, distributed, and parallel languages; D.3.3 [Programming Languages]: Language Constructs and Features—Concurrent programming structures
• We present a reference implementation of the protocol written in SML extended with first-class continuations.
• To check the correctness of the protocol, we have used stateless model-checking techniques to test the reference code.
• We have two different parallel implementations of this protocol: one in the Manticore system and one written in C#.
General Terms: Languages, Performance
While the implementations are very different in their details — e.g., the Manticore implementation relies heavily on first-class continuations, which do not exist in C# — both implementations were derived from the reference implementation.
Keywords: concurrency, parallelism, message passing
1. Introduction
Concurrent ML (CML) [Rep91, Rep99] is a statically-typed higher-order concurrent language that is embedded in Standard
• We describe various messy, but necessary, aspects of the implementation.
• We present an empirical evaluation of the Manticore implementation, which shows that it provides acceptable performance (about 2.5× slower than the single-threaded implementation).
1 In fact, almost all of the existing implementations of events have this limitation. The only exceptions are presumably the Haskell and Java implementations, which are both built on top of concurrency substrates that support multiprocessing.
The remainder of this paper is organized as follows. In the next section, we give highlights of the CML design. We then describe the single-threaded implementation of CML that is part of the SML/NJ system in Section 3. This discussion leads to Section 4, which highlights a number of the challenges that face a parallel implementation of CML. Section 5 presents our main result, which is our optimistic-concurrency protocol for CML synchronization. We have three implementations of this protocol. In Section 6, we describe our reference implementation, which is written in SML using first-class continuations. We have model checked this implementation, which we discuss in Section 7. There are various implementation details that we omitted from the reference implementation, but which are important for a real implementation. We discuss these in Section 8. We then give an overview of our two parallel implementations of the protocol: one in the Manticore system and one in C#. We present performance data for both implementations in Section 10 and then discuss related work in Section 11.
2. A CML overview
Figure 1. A possible interaction between a client and two servers
Concurrent ML is a higher-order concurrent language that is embedded into Standard ML [Rep91, Rep99]. It supports a rich set of concurrency mechanisms, but for purposes of this paper we focus on the core mechanism of communication over synchronous channels. The interface to these operations is
let val replCh1 = channel() and nack1 = cvar()
    val replCh2 = channel() and nack2 = cvar()
in
  send (reqCh1, (req1, replCh1, nack1));
  send (reqCh2, (req2, replCh2, nack2));
  select [
      (replCh1, fn repl1 => (signal nack2; action1 repl1)),
      (replCh2, fn repl2 => (signal nack1; action2 repl2))
    ]
end
val spawn : (unit -> unit) -> unit
type 'a chan
val channel : unit -> 'a chan
val recv : 'a chan -> 'a
val send : ('a chan * 'a) -> unit
The spawn operation creates new threads, the channel function creates new channels, and the send and recv operations are used for message passing. Because channels are synchronous, both the send and recv operations are blocking.
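As a small illustration of these primitives (our own example, not from the paper; the names addServer, reqCh, and replCh are ours), the following thread repeatedly receives a pair of integers and sends back their sum:

(* A hypothetical server thread built from the primitives above: it
 * blocks in recv until a request arrives and blocks in send until
 * the client takes the reply. *)
fun addServer (reqCh : (int * int) chan, replCh : int chan) =
      spawn (fn () => let
        fun loop () = let
              val (a, b) = recv reqCh        (* wait for a request *)
              in
                send (replCh, a + b);        (* deliver the reply *)
                loop ()
              end
        in
          loop ()
        end)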
Figure 2. Implementing interaction with two servers
2.1 First-class synchronization

type 'a event
The most notable feature of CML is its support for first-class synchronous operations. This mechanism was motivated by two observations about message-passing programs [Rep88, Rep91, Rep99]:
val recvEvt : 'a chan -> 'a event
val sendEvt : ('a chan * 'a) -> unit event
1. Most inter-thread interactions involve two or more messages (e.g., client-server interactions typically require request, reply, and acknowledgment messages).
val never : 'a event
val always : 'a -> 'a event
2. Threads need to manage simultaneous communications with multiple partners (e.g., communicating with multiple servers or including the possibility of a timeout in a communication).
For example, consider the situation where a client is interacting with two servers. Since the time that a server needs to fill a request is indeterminate, the client attempts both transactions in parallel and then commits to whichever one completes first. Figure 1 illustrates this interaction for the case where the first server responds first. The client-side code for this interaction might look like that in Figure 2. In this code, we allocate fresh reply channels and condition variables2 for each server and include these with the request message. The client then waits on getting a reply from one or the other server. Once it gets a reply, it signals a negative acknowledgement to the other server to cancel its request and then applies the appropriate action function to the reply message. Notice how the interactions for the two servers are intertwined. This property
val choose : ('a event * 'a event) -> 'a event
val wrap : 'a event * ('a -> 'b) -> 'b event
val guard : (unit -> 'a event) -> 'a event
val withNack : (unit event -> 'a event) -> 'a event
val sync : ’a event -> ’a
Figure 3. CML’s event API
makes the code harder to read and maintain. Furthermore, adding a third or fourth server would greatly increase the code’s complexity. The standard Computer Science solution for this kind of problem is to create an abstraction mechanism. CML follows this approach by making synchronous operations first class. These values are called event values and are used to support more complicated interactions between threads in a modular fashion. Figure 3 gives the signature for this mechanism. Base events constructed by sendEvt and recvEvt describe simple communications on channels. There are also two special base-events: never, which is never enabled and always, which is always enabled for syn-
2 By condition variable, we mean a write-once, unit-valued synchronization variable. Waiting on the variable blocks the calling thread until it is signaled by some other thread. Once a variable has been signaled, waiting on it no longer blocks.
datatype server = S of (req * repl chan * unit event) chan
following equivalences for pushing wrapper functions to the leaves:

wrap(wrap(ev, g), f) = wrap(ev, f ◦ g)
wrap(choose(ev1, ev2), f) = choose(wrap(ev1, f), wrap(ev2, f))
fun rpcEvt (S ch, req) = withNack (fn nack => let
      val replCh = channel()
      in
        send (ch, (req, replCh, nack));
        recvEvt replCh
      end)
Figure 5 illustrates the mapping from a nesting of wrap and choose combinators to its canonical representation.
The heart of the implementation is the protocol for synchronization on a choice of events. The implementation of this protocol is split between the sync operator and the base-event constructors (e.g., sendEvt and recvEvt). As described above, the base events are the leaves of the event representation. Each base event is a record of three functions: pollFn, which tests to see if the base-event is enabled (e.g., there is a message waiting); doFn, which is used to synchronize on an enabled event; and blockFn, which is used to block the calling thread on the base event. In the single-threaded implementation of CML, we rely heavily on the fact that sync is executed as an atomic operation. The single-threaded protocol is as follows:
Figure 4. The implementation of rpcEvt
chronization. These events can be combined into more complicated event values using the event combinators:
• Event wrappers (wrap) for post-synchronization actions.
• Event generators (guard and withNack) for pre-synchronization actions and cancellation (withNack).
• Choice (choose) for managing multiple communications. In CML, this combinator takes a list of events as its argument, but we restrict it to be a binary operator here. Choice of a list of events can be constructed using choose as a "cons" operator and never as "nil," as sketched below.
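The following is a minimal sketch (ours, not part of the CML interface) of that folding, using the choose and never values of Figure 3:

(* Fold the binary choose over a list of events, with never as "nil". *)
fun chooseList (evs : 'a event list) : 'a event =
      List.foldr choose never evs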
1. Poll the base events in the choice to see if any of them are enabled. This phase is called the polling phase.
2. If one or more base events are enabled, pick one and synchronize on it using its doFn. This phase is called the commit phase.
To use an event value for synchronization, we apply the sync operator to it. Event values are pure values similar to function values. When the sync operation is applied to an event value, a dynamic instance of the event is created, which we call a synchronization event. A single event value can be synchronized on many times, but each time involves a unique synchronization event. Returning to our client-server example, we can now isolate the client-side of the protocol behind an event-valued abstraction.
3. If no base events are enabled, we execute the blocking phase, which has the following steps:
(a) Enqueue a continuation for the calling thread on each of the base events using its blockFn.
(b) Switch to some other thread.
(c) Eventually, some other thread will complete the synchronization.
type server
type req = ...
type repl = ...
val rpcEvt : (server * req) -> repl event
Because the implementation of sync is atomic, the single-threaded implementation does not have to worry about the state of a base event changing between when we poll it and when we invoke the doFn or blockFn on it.
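To make this description concrete, here is a rough sketch (ours, not the actual SML/NJ code; the exact function signatures are our assumption) of the single-threaded event representation: a canonical binary tree whose leaves carry the three base-event functions. The parallel implementation in Section 6 uses a refined version of this representation.

type 'a cont = 'a SMLofNJ.Cont.cont

datatype 'a evt
  = BEVT of {
        pollFn  : unit -> bool,      (* is this base event currently enabled? *)
        doFn    : unit -> 'a,        (* commit on an enabled base event *)
        blockFn : 'a cont -> unit    (* enqueue the calling thread's continuation *)
      }
  | CHOOSE of 'a evt * 'a evt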
With this interface, the client-side code becomes much cleaner:

sync (choose (
    wrap (rpcEvt (server1, req1), action1),
    wrap (rpcEvt (server2, req2), action2)))
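For example (our hypothetical extension; server3, req3, and action3 are assumed), interacting with a third server now only adds one more alternative to the choice, instead of further entangling the protocol code:

sync (choose (
    wrap (rpcEvt (server1, req1), action1),
    choose (
      wrap (rpcEvt (server2, req2), action2),
      wrap (rpcEvt (server3, req3), action3))))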
sync (choose (
    wrap (recvEvt ch, fn x => (1, x)),
    wrap (recvEvt ch, fn y => (2, y))))
3. The single-threaded implementation
Our parallel protocol has a similar high-level structure and event representation to the original single-threaded implementation of CML [Rep99]. In this section, we review these aspects of the single-threaded design to set the stage for the next section.
4. Issues
There are a number of challenges that must be met in the design of a protocol for CML synchronization. One issue that implementations must address is that a given event may involve multiple occurrences of the same channel. For example, the following code nondeterministically tags the message received from ch with either 1 or 2:
The implementation of the rpcEvt function is also straightforward and is given in Figure 4. Most importantly, the details of the client-server protocol are now hidden behind an abstraction, which improves the code’s readability, modularity, and maintainability.
3.2 Synchronization
A naïve implementation might lock all of the channels involved in a synchronization, which would result in deadlock, unless reentrant locks were used. One must also avoid deadlock when multiple threads are simultaneously attempting communication on the same channel. For example, if thread P is executing
3.1 Event representation
An event value is represented as a binary tree, where the leaves are wrapped base-event values and the interior nodes are choice operators.3 This canonical representation of events relies on the
sync (choose (recvEvt ch1, recvEvt ch2))
at the same time that thread Q is executing

sync (choose (recvEvt ch2, recvEvt ch1))
3 Strictly speaking, the CML implementation represents events as a two-level tree, where the root is a list of base-events, but we are treating choice as a binary operator in this paper.
we have a potential deadlock if the implementation of sync attempts to hold a lock on both channels simultaneously (i.e., where
Figure 5. The canonical-event transformation

type 'a evt
of guards until synchronization time. When evaluated, it produces a list of cvars and a primitive event. The cvars are used to signal the negative acknowledgments for the event. The primitive event, when synchronized, will yield a list of those cvars that need to be signaled and a thunk that is the suspended wrapper action for the event. More details of this implementation can be found in our previous paper [RX08].
val choose : ('a evt * 'a evt) -> 'a evt
val wrap : 'a evt * ('a -> 'b) -> 'b evt
val sync : 'a evt -> 'a
type 'a chan
val recvEvt : 'a chan -> 'a evt
val sendEvt : ('a chan * 'a) -> unit evt
Figure 6. Primitive CML operations
5. An optimistic protocol for CML
In this section, we present our main result, which is our protocol for CML synchronization on shared-memory multiprocessors. Our approach to avoiding the pitfalls described above is to use an optimistic protocol that does not hold a lock on more than one channel at a time and avoids locking whenever possible. The basic protocol has a similar structure to the sequential one described above, but it must deal with the fact that the state of a base event can change during the protocol. This fact means that the commit phase may fail and that the blocking phase may commit. As before, the synchronization protocol is split between the sync operator and the base events. The sync operator executes the following algorithm:
P holds the lock on ch1 and attempts to lock ch2, while Q holds the lock on ch2 and attempts to lock ch1). Another problem is that a thread can both offer to send and receive on the same channel at the same time, as in this example:

sync (choose (
    wrap (recvEvt ch, fn x => SOME x),
    wrap (sendEvt ch, fn () => NONE)))
In this case, it is important that the implementation not allow these two communications to match.4 Lastly, the implementation of the withNack combinator requires fairly tricky bookkeeping. Fortunately, it is possible to implement the full set of CML combinators on top of a much smaller kernel of operations, which we call primitive CML. While implementing primitive CML on a multiprocessor is challenging, it is significantly simpler than a monolithic implementation. The signature of this subset is given in Figure 6. 5 To support full CML, we use an approach that was suggested by Matthew Fluet [DF06]. His idea is to move the bookkeeping used to track negative acknowledgments out of the implementation of sync and into guards and wrappers. In this implementation, negative acknowledgments are signaled using the condition variables (cvars) described earlier. Since we must create these variables at synchronization time, we represent events as suspended computations (or thunks). The event type has the following definition:
1. The protocol starts with the polling phase, which is done in a lock-free way.
2. If one or more base events are enabled, pick one and attempt to synchronize on it using its doFn. This attempt may fail because of changes in the base-event state since the polling was done. We repeat until either we successfully commit to an event or we run out of enabled events.
3. If there are no enabled base events (or all attempts to synchronize failed), we enqueue a continuation for the calling thread on each of the base events using its blockFn. When blocking the thread on a particular base event, we may discover that synchronization is now possible, in which case we attempt to synchronize immediately.

This design is guided by the goal of minimizing synchronization overhead and maximizing concurrency. The implementations of the doFn and blockFn for a particular base-event constructor depend on the details of the underlying communication object, but we can describe the synchronization logic of these operations as state diagrams that abstract away the details of the implementation. For each dynamic instance of a synchronization, we create an event-state variable that we use to track the state of the protocol. This variable has one of three states:
datatype ’a event = E of (cvar list * (cvar list * ’a thunk) evt) thunk
where the thunk type is type 'a thunk = unit -> 'a. The outermost thunk is a suspension used to delay the evaluation
4 This problem was not an issue for the asymmetric protocol described in our previous work [RX08].
5 This subset is equivalent to the original version of first-class synchronous operations that appeared in the PML language [Rep88].
Figure 7. Allowed event-state-variable transitions

Figure 9. State diagram for the blockFn protocol
state from WAITING to SYNCHED using a CAS instruction. There are three possibilities:
b = WAITING: in this case, the CAS will have changed b to SYNCHED and the doFn has successfully committed the synchronization.
Figure 8. State diagram for the doFn protocol
b = CLAIMED: in this case, the owner is trying to synchronize on some other base event that is associated with b, so we spin until either it succeeds or fails.
WAITING — this is the initial state and signifies that the event is available for synchronization.
CLAIMED — this value signifies that the owner of the event is attempting to complete a synchronization.
b = SYNCHED: in this case, the event is already synchronized, so we try to get another event.
SYNCHED — this value signifies that the event has been synchronized.
The state diagram for the blockFns is more complicated because the state variable for the event may already be enqueued on some other communication object. For example, consider the case where thread A executes the synchronization
The state variable is supplied to the blockFn during the blocking phase and is stored in the waiting queues, etc. of the communication objects. Figure 7 shows the state transitions that are allowed for an event-state variable. This diagram illustrates an important property of state variables: a variable may change from WAITING to SYNCHED at any time (once it is made visible to other threads), but a CLAIMED variable is only changed by its owner.

An important property of the commit phase is that the event state has not yet been "published" to other threads, so it cannot change asynchronously. This fact means that the doFn part of the protocol is fairly simple, as is shown in Figure 8. We use W, C, and S to represent the event state values in this diagram; states are represented as ovals, actions as rectangles, and atomic compare-and-swap (CAS) tests as diamonds. The outgoing edges from a CAS are labelled with the cell's tested value. The first step is to attempt to get a match from the communication object. We expect that such an object exists, because of the polling results, but it might have been consumed before the doFn was called. Assuming that it is present, however, and that it has state variable b, we attempt to synchronize on the potential match. We then attempt to change its
sync (choose (recvEvt ch1, recvEvt ch2))
Assuming that A calls the blockFn for ch1 first, then some other thread may be attempting to send A a message on ch1 while A is attempting to receive a message on ch2. Figure 9 gives the state diagram for a thread with event-state variable a, attempting to match a communication being offered by a thread with event-state variable b. As with the doFn diagram, we start by attempting to get an item from the communication object. Given such an item, with state variable b, we attempt to set our own state variable a to CLAIMED to prevent other threads from synchronizing on our event. We use a CAS operation to do so and there are two possible situations:
type ’a queue
val enqueueRdy : thread -> unit
val dispatch : unit -> 'a
val queue : unit -> 'a queue
val isEmpty : 'a queue -> bool
val enqueue : ('a queue * 'a) -> unit
val dequeue : 'a queue -> 'a option
val dequeueMatch : ('a queue * ('a -> bool)) -> 'a option
val undequeue : ('a * 'a queue) -> unit
The first enqueues a ready thread in the scheduling queue and the second transfers control to the next ready thread in the queue.
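As a minimal sketch (ours, not part of the reference implementation), a spawn operation can be expressed with these two scheduling operations and the callcc primitive specified in this section:

(* Queue the parent for later resumption, run the new thread's body,
 * and hand control to the next ready thread when it finishes. *)
fun spawn (f : unit -> unit) : unit =
      callcc (fn parentK => (
        enqueueRdy parentK;
        f ();
        dispatch ()))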
Figure 10. Specification of queue operations
Once we have successfully set a to CLAIMED, we know that its value will not be changed by another thread. At this point, we attempt to change b from WAITING to SYNCHED as we did in the doFn diagram. There are three possibilities:
datatype event_status = WAITING | CLAIMED | SYNCHED
type event_state = event_status ref

datatype 'a evt
  = BEVT of {
        pollFn : unit -> bool,
        doFn : 'a cont -> unit,
        blockFn : (event_state * 'a cont) -> unit
      }
  | CHOOSE of 'a evt * 'a evt
b = CLAIMED: in this case, the owner is trying to synchronize on some other base event that is associated with b, so we reset a to WAITING and spin, trying to match b again.
In our reference implementation we use first-class continuations to represent thread state. Notice that both the doFn and blockFn functions take a continuation argument. This continuation is the resume continuation for when the event is synchronized on.
6. A reference implementation
To make the protocol more concrete, we present key excerpts from our reference implementation in this section.
6.1.1 Queues
Our implementation uses queues to track pending messages and waiting threads in channels. We omit the implementation details here, but give the interface to the queue operations that we use in Figure 10. Most of these operations are standard and have the expected semantics, but the last two are less common. The dequeueMatch function dequeues the first element of the queue that satisfies the given predicate and the undequeue operation pushes an item onto the front of the queue. 6.1.2
6.3 Implementing sync
The sync operation is given in Figure 11 and directly follows the logic described in the previous section. It starts with a polling phase, then attempts to commit on any enabled events, and, failing that, blocks the thread on the base events. The main omitted detail is that it passes its return continuation as an argument to the doFn and blockFn calls. Note that we also allocate a new event-state variable that is passed into the blockFn calls. It is worth noting that we implement the sync operation as a single pass of invoking the blockFn for each base event. The problem with this approach is that it implements a biased choice that always favors the left alternative over the right. Although we do not describe it here, the structure that we use allows us to support priorities and/or fairness mechanisms for choice (see Chapter 10 of [Rep99] for more discussion).
We present our reference implementation using SML syntax with a few extensions. To streamline the presentation, we elide several aspects that an actual implementation must address, such as thread IDs and processor affinity, but we discuss these in Section 8.
6.2 The representation of events
We start with the representation of events and event-states:
b = WAITING: in this case, the CAS will have changed b to SYNCHED, so we set a to SYNCHED to mark that we have successfully committed the synchronization.
6.1 Preliminaries
val CAS : ('a ref * 'a * 'a) -> 'a

type spin_lock
val spinLock : spin_lock -> unit
val spinUnlock : spin_lock -> unit
doing so, we need to put b back into the communication object’s queue.
6.1.3 Low-level synchronization
Our implementation also relies on the atomic compare-and-swap instruction. We also assume the existence of spin locks. These low-level operations have the following interface:
6.4 Implementing wrap
The implementation of wrap, given in Figure 12, is not directly involved in the synchronization protocol, but it is responsible for maintaining the canonical representation of event values. The wrap function pushes its action argument f to the leaves of the event, where it composes f with the base event’s doFn and blockFn functions. This composition requires some horrible continuation hacking to implement.
6.1.2 Threads and thread scheduling
As in the uniprocessor implementation of CML, we use first-class continuations to implement threads and thread scheduling. The continuation operations have the following specification:

type 'a cont
val callcc : ('a cont -> 'a) -> 'a
val throw : 'a cont -> 'a -> 'b
6.5 Implementing sendEvt
To illustrate how the synchronization protocol works in a concrete example, we examine the reference code for the sendEvt base-event constructor (the recvEvt function follows the same synchronization pattern). This operation works on the following representation of channels:
We represent the state of a suspended thread as a continuation:

type thread = unit cont
The interface to the scheduling system is two atomic operations:
262
fun doFn k = let
      fun tryLp () = (case dequeue recvq
             of NONE => spinUnlock lock
              | SOME(flg, recvK) => let
                  fun matchLp () = (case CAS (flg, WAITING, SYNCHED)
                         of WAITING => (
                              spinUnlock lock;
                              enqueueRdy k;
                              throw recvK msg)
                          | CLAIMED => matchLp ()
                          | _ => tryLp ()
                        (* end case *))
                  in
                    if (!flg <> SYNCHED) then matchLp () else tryLp ()
                  end
            (* end case *))
      in
        spinLock lock;
        tryLp ()
      end
fun sync ev = callcc (fn resumeK => let
      (* optimistically poll the base events *)
      fun poll (BEVT{pollFn, doFn, ...}, enabled) =
            if pollFn() then doFn::enabled else enabled
        | poll (CHOOSE(ev1, ev2), enabled) =
            poll(ev2, poll(ev1, enabled))
      (* attempt an enabled communication *)
      fun doEvt [] = blockThd()
        | doEvt (doFn::r) = (
            doFn resumeK;
            (* if we get here, that means that the *)
            (* attempt failed, so try the next one *)
            doEvt r)
      (* record the calling thread's continuation *)
      and blockThd () = let
            val flg = ref WAITING
            fun block (BEVT{blockFn, ...}) = blockFn (flg, resumeK)
              | block (CHOOSE(ev1, ev2)) = (block ev1; block ev2)
            in
              block ev;
              dispatch ()
            end
      in
        doEvt (poll (ev, []))
      end)
Figure 13. The sendEvt doFn code

It defines the three base-event functions for the operation and makes an event value out of them. Note that the polling function just tests to see if the queue of waiting receivers is not empty. There is no point in locking this operation, since the state may change before the doFn is invoked. The bulk of the sendEvt implementation is in the doFn and blockFn functions, which are given in Figures 13 and 14 respectively. The doFn implementation consists of a single loop (tryLp) that corresponds to the cycle in Figure 8. If the doFn is successful in matching a receive operation, it enqueues the sender in the ready queue and throws the message to the receiver's resumption continuation. The blockFn code also follows the corresponding state diagram closely. It consists of two nested loops. The outer loop (tryLp) corresponds to the left-hand-side cycle in Figure 9, while the inner loop (matchLp) corresponds to the right-hand-side cycle.
Figure 11. The reference implementation of sync

fun wrap (BEVT{pollFn, doFn, blockFn}, f) = BEVT{
      pollFn = pollFn,
      doFn = fn k => callcc (fn retK =>
        throw k (f (callcc (fn k' => (doFn k'; throw retK ()))))),
      blockFn = fn (flg, k) => callcc (fn retK =>
        throw k (f (callcc (fn k' => (blockFn(flg, k'); throw retK ())))))
    }
  | wrap (CHOOSE(ev1, ev2), f) = CHOOSE(wrap(ev1, f), wrap(ev2, f))
6.6 Asymmetric operations
In addition to synchronous message passing, CML provides a number of other communication primitives. These primitives have the property that they involve only one active thread at a time (as is the case for asymmetric CML), which simplifies synchronization. In Figure 15, we give the reference implementation for the cvar type and waitEvt event constructor. In this case, the doFn is trivial, since once a cvar has been signaled its state does not change. The blockFn is also much simpler, because there is only one event-state variable involved.
Figure 12. The reference implementation of wrap

datatype 'a chan = Ch of {
    lock : spin_lock,
    sendq : (event_state * 'a * unit cont) queue,
    recvq : (event_state * 'a cont) queue
  }
Each channel has a pair of queues: one for waiting senders and one for waiting receivers. It also has a spin lock that we use to protect the queues. It is important to note that we only lock one channel at a time, which avoids the problem of deadlock. The high-level structure of the sendEvt function is
7. Verifying the protocol
Designing and implementing a correct protocol, such as the one described in this paper, is very hard. To increase our confidence in the protocol design, we have used stateless model checking to verify the reference implementation. Our approach is based on the ideas of the CHESS model checker [MQ07], but we built our own tool tailored to our problem. We used this tool to guide the design of the protocol; in the process, we uncovered several bugs and missteps in the design that we were able to correct. Our approach to model checking was to implement a virtual machine in SML that supported a scheduling infrastructure and memory cells with both atomic and non-atomic operations. The imple-
fun sendEvt (Ch{lock, sendq, recvq, ...}, msg) = let
      fun pollFn () = not(isEmpty recvq)
      fun doFn k = ...
      fun blockFn (myFlg : event_state, k) = ...
      in
        BEVT{pollFn = pollFn, doFn = doFn, blockFn = blockFn}
      end
fun blockFn (myFlg : event_state, k) = let
      fun notMe (flg', _, _) = not(same(myFlg, flg'))
      fun tryLp () = (case dequeueMatch (recvq, notMe)
             of SOME(flg', recvK) => let
                  (* a receiver blocked since we polled *)
                  fun matchLp () = (
                        case CAS(myFlg, WAITING, CLAIMED)
                         of WAITING => (
                              (* try to claim the matching event *)
                              case CAS (flg', WAITING, SYNCHED)
                               of WAITING => ( (* we got it! *)
                                    spinUnlock lock;
                                    myFlg := SYNCHED;
                                    enqueueRdy k;
                                    throw recvK msg)
                                | CLAIMED => (
                                    myFlg := WAITING;
                                    matchLp ())
                                | SYNCHED => (
                                    myFlg := WAITING;
                                    tryLp ())
                              (* end case *))
                          | sts => (
                              undequeue ((flg', recvK), recvq);
                              spinUnlock lock;
                              dispatch ())
                        (* end case *))
                  in
                    if (!flg' <> SYNCHED) then matchLp () else tryLp ()
                  end
              | NONE => (
                  enqueue (sendq, (myFlg, msg, k));
                  spinUnlock lock)
            (* end case *))
      in
        spinLock lock;
        tryLp ()
      end

Figure 14. The sendEvt blockFn code

datatype cvar = CV of {
    lock : spin_lock,
    state : bool ref,
    waiting : (event_state * thread) list ref
  }

fun waitEvt (CV{lock, state, waiting}) = let
      fun pollFn () = !state
      fun doFn k = throw k ()
      fun blockFn (flg : event_state, waitK) = (
            spinLock lock;
            if !state
              then (
                spinUnlock lock;
                case CAS(flg, WAITING, SYNCHED)
                 of WAITING => throw waitK ()
                  | _ => dispatch ()
                (* end case *))
              else let
                val wl = !waiting
                in
                  waiting := (flg, waitK) :: wl;
                  spinUnlock lock
                end)
      in
        BEVT{
            pollFn = pollFn,
            doFn = doFn,
            blockFn = blockFn}
      end

Figure 15. The waitEvt event constructor

8. The messy bits

To keep the sample code clean and uncluttered, we have omitted several implementation details that we discuss in this section.

8.1 Locally-atomic operations

This implementation uses spin-lock-style synchronization at the lowest level. One problem with spin locks is that if a lock-holder is preempted and a thread on another processor attempts to access the lock, the second thread will spin until the first thread is rescheduled and releases the lock. To avoid this problem, the Manticore runtime provides a lightweight mechanism to mask local preemptions. We run sync as a locally-atomic operation, which has two benefits. One is that threads do not get preempted when holding a spin lock. The second is that certain scheduling structures, such as the per-processor thread queue, are used only by the owning processor and, thus, can be accessed without locking.
mentation of the virtual machine operations are allowed to inject preemptions into the computation. We used SML/NJ’s first-class continuations to implement a roll-back facility that allowed both the preempted and non-preempted execution paths to be explored. To keep the number of paths explored to a tractable number, we bound the number of preemptions to 3 on any given trace.6 Our reference implementation was then coded as a functor over the virtual machine API. On top of this we wrote a number of test cases that we ran through the checker. These tests required exploring anywhere from 20,000 to over one million distinct execution traces. Our experience with this tool was very positive. Using this tool exposed both a bug in the basic design of our protocol and a couple of failures to handle various corner cases. We strongly recommend such automated testing approaches to developers of concurrent language implementations. Perhaps the best proof of its usefulness is that when we ported the reference implementation to the Manticore runtime system, it worked “out of the box.”
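To make the virtual-machine idea slightly more concrete, here is a minimal sketch (ours; all of the names are hypothetical) of shared-memory cells whose every access first gives the checker a chance to inject a preemption:

structure CheckedCell = struct
  (* installed by the checker; decides whether to preempt at this point *)
  val preemptHook : (unit -> unit) ref = ref (fn () => ())
  fun maybePreempt () = (!preemptHook) ()
  fun read (c : 'a ref) : 'a = (maybePreempt (); !c)
  fun write (c : 'a ref, v : 'a) : unit = (maybePreempt (); c := v)
end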
8.2 Thread affinity
In the above implementation, we assume a single, global, scheduling queue for threads that are ready to run. In the Manticore runtime system, however, there is a separate thread queue for each processor. If a thread on processor P blocks on sending a message and then a thread on processor Q wakes it up by receiving the message, we want the sender to be rescheduled on P ’s queue. To this end, we include the thread’s host processor in the blocking information. 8.3
Avoiding space leaks
Another issue that the implementation must deal with is removing the “dead” elements from channel waiting queues. While setting the event-state flag to SYNCHED marks a queue item as dead, it does not get removed from the waiting queue. Consider the following loop:
6 Experience shows that bounding the preemptive context switches is an effective way to reduce the state space, while still uncovering many concurrency bugs [MQ07].
fun lp () = sync (choose (
      wrap (recvEvt ch1, fn x => ...),
      wrap (recvEvt ch2, fn x => ...)))
If there is a regular stream of messages on channel ch1, but never a sender on channel ch2, the waiting queue for channel ch2 will grow longer and longer. To fix this problem, we need to remove dead items from the waiting queues on insert. Since scanning a queue for dead items is a potentially expensive operation, we want to scan only occasionally. To achieve this goal, we add two counters to the representation of a waiting queue. The first keeps track of the number of elements in the queue and the second defines a threshold for scanning. When inserting an item, if the number of items in the queue exceeds the threshold, then we scan the queue to remove dead items. We then reset the threshold to max(n + k1, k2 * n), where n is the number of remaining items, and k1 and k2 are tuning parameters.7

For actively used channels with few senders and receivers, the threshold is never exceeded and we avoid scanning. For actively used channels that have large numbers of senders and receivers, the threshold will grow to accommodate the larger number of waiting threads and will subsequently not be exceeded. But for channels, like ch2 above, that have many dead items, the threshold will stay low (equal to k1) and the queues will not grow without bound. One should note that there is still the possibility that large data objects can be retained past their lifetime by being inserted into a queue that is only rarely used (and does not exceed its threshold). We could address this issue by making the garbage collector aware of the structure of queue items, so that the data pointer of a dead item could be nullified, but we do not believe that this problem is common enough to be worth the extra implementation complexity.
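The following is a small sketch (ours; the names scanDead and insertWaiting are ours) of the insertion-time check and threshold update described above. Here scanDead removes dead items from the queue and returns the number of live items that remain:

val k1 = 10
val k2 = 1.5

(* Insert an item and rescan only when the element count exceeds the
 * current threshold; then reset the threshold to max(n + k1, k2 * n)
 * for the n surviving items. *)
fun insertWaiting (enq : 'a -> unit, scanDead : unit -> int,
                   count : int ref, threshold : int ref, item : 'a) = (
      enq item;
      count := !count + 1;
      if !count > !threshold
        then let val n = scanDead ()
             in
               count := n;
               threshold := Int.max (n + k1, Real.ceil (k2 * real n))
             end
        else ())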
maintains (called the landing pad) and then waiting until the remote processor notices it and schedules it. The effect of this design is that message passing and remote thread creation have increased latency (cf. Section 10).
Although we described our CML implementation elegantly using first-class continuations, their use is by no means essential. Each continuation is used at most once and can readily be replaced by calls to threading primitives. To demonstrate this claim, we implemented a version of Parallel CML in C# [TG2] running on Microsoft's Common Language Runtime [CLR]. The CLR does not support first-class continuations but can make use of parallel hardware. The framework libraries provide access to low-level synchronization primitives such as CAS, spin waiting, and volatile reads and writes of machine words. This is in addition to the expected higher-level synchronization constructs such as CLR monitors that ultimately map to OS resources. The CLR thus provides a useful test-bed for our algorithms. CML's event constructors have a natural and unsurprising translation to C# classes deriving from an abstract base class of events. The main challenge in translating our CML reference implementation lies in eliminating uses of callcc. However, since CML only uses a value of type 'a cont to denote a suspended computation waiting to be thrown some value, we can represent these continuations as values of the following abstract class:

internal abstract class Cont<T> {
    internal void Throw(T res) { Throw(() => res); }
    internal abstract void Throw(Thunk<T> res);
}
8.4 Reducing bus traffic
In the reference implementation, we often spin in tight loops performing CAS instructions. In practice, such loops perform badly because of the bus traffic created by the CAS operation. It is generally recommended to spin on non-atomic operations (e.g., loads and conditionals) until it appears that the CAS will succeed [HS08].
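A minimal sketch (ours, not from the paper) of this "test-and-test-and-set" style applied to an event-state variable: spin on a plain read and only issue the CAS when the variable looks like it is still WAITING:

fun claimIfWaiting (cell : event_state) : bool = (
      case !cell                            (* ordinary load; no bus-locked operation *)
       of CLAIMED => claimIfWaiting cell    (* owner busy: keep spinning on reads *)
        | SYNCHED => false                  (* already synchronized elsewhere *)
        | WAITING =>
            (case CAS (cell, WAITING, SYNCHED)
              of WAITING => true            (* we committed the synchronization *)
               | _ => claimIfWaiting cell)) (* lost the race: re-check the state *)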
Here, Thunk<T> is the type of a first-class method (a delegate) with no arguments and return type T. In general, the thrown value res will be a delayed computation of type Thunk<T>, to accommodate the composition of post-synchronization functions using wrap: these must be lazily composed and then executed on the receiving end of a synchronization. Now we can capture a waiting thread using a concrete subclass of Cont<T>:
9. Parallel implementations of the protocol
In Section 6, we presented a reference implementation of the CML synchronization protocol described in Section 5. We have translated this reference implementation into two parallel implementations. One is a continuation-based implementation as part of the Manticore system [FFR+07]. Although very different in detail, both derive from the same design. In this section, we describe some specific aspects of these translations. We report on the performance of the Manticore and C# implementations in Section 10.
internal class SyncCont<T> : Cont<T> {
    private Thunk<T> res;
    private bool Thrown;
    internal override void Throw(Thunk<T> res) {
        lock (this) {
            this.res = res;
            Thrown = true;
            Monitor.Pulse(this);
        }
    }
    internal virtual T Wait() {
        lock (this) {
            while (!Thrown) Monitor.Wait(this);
        }
        return res();
    }
}
9.1 A continuation-based implementation
The Manticore implementation is written in a low-level functional language that serves as one of the intermediate representations of our compiler. This language can be viewed as a stripped-down version of ML with a few extensions. Specifically, it supports first-class continuations via a continuation binder and it provides access to mutable memory objects8 and operations (including CAS). While the actual code is more verbose, the translation from the reference implementation was direct. The Manticore runtime system is designed to emphasize separation between processors [FRR08]. While this design helps with scalability, it does impose certain burdens on the implementation of the CML primitives. One aspect is that each processor has its own local scheduling queue, which other processors are not allowed to access. Thus, scheduling a thread on a remote processor requires pushing it onto a concurrent stack that each processor
9.2 A thread-based implementation
In order to suspend itself, a thread allocates a new SyncCont value, k, does some work, and eventually calls k.Wait() to receive the result res() of this or another thread's intervening or future call to k.Throw(res); k is essentially a condition variable carrying a suspended computation. For example, consider the callcc-based SML implementation of sync in Figure 11. Note that the current continuation resumeK, which encloses the entire body of sync ev, simply returns to the caller of sync. The call to doFn will either transfer control to the outer resumeK continuation once, when successful, or return if it fails. Similarly, the blockFn may complete synchronization, transferring control to resumeK, or return,
7 We currently set k1 = 10 and k2 = 1.5.
8 Manticore's surface language does not have mutable storage.
in which case the call to sync ev blocks by entering the scheduler to dispatch another thread. Finally, the scheduler ensures that at most one thread will continue with resumeK. This is our C# implementation of method Sync:

public abstract class Evt<T> {
    internal abstract List Poll(List enabled);
    internal abstract bool Block(Evt_State state, Cont<T> resumeK);
    public T Sync() {
        List enabled = Poll(null);
        T t = default(T);
        while (enabled != null) {
            if (enabled.head.DoFn(ref t)) return t;
            enabled = enabled.tail;
        }
        var resumeK = new SyncCont<T>();
        Block(new Evt_State(), resumeK);
        return resumeK.Wait();
    }
}
Spawn benchmark
System            Threads/sec.   Ratio
CML               2,628,000      1.00
Manticore (1P)    1,235,000      0.47
Manticore (2P)      330,300      0.13

Ping-pong benchmark
System            Messages/sec.  Ratio
CML               1,608,000      1.00
Manticore (1P)      697,800      0.43
Manticore (2P)      271,400      0.17

Figure 16. Micro-benchmark results
The DoFn(ref t) method call cannot directly transfer control when it succeeds, unlike the doFn resumeK application in the CML code of Figure 11. Instead, DoFn returns true to indicate a successful commit, or false to indicate commit failure. As a side effect, it also updates the location t with any T-result that its caller should return. If the commit phase succeeds, the code simply returns the value of t and skips the blocking phase. Otherwise, it allocates a new SyncCont instance, resumeK, queues resumeK on all the base events, and exits with a call to resumeK.Wait(), blocking unless the Block call managed to commit. Notice that, unlike the CML code for sync, the C# code delays creating a resumeK continuation until the commit phase is known to have failed, avoiding the additional heap allocation, synchronization, and potential context switch inherent in a more direct translation of the callcc-based code. In the message-passing benchmark of Section 10.4, this optimization improves performance by at least 10% over a literal translation of the reference implementation. Since CLR threads are expensive operating-system threads, it is useful to avoid the overhead of blocking by using asynchronous calls when possible. To this end, we extended the CML event signature with an additional Async operation that, instead of blocking on the return of a value, immediately queues a callback that takes a value, to be invoked as a CLR task on completion of the event.9 Enabling this requires a new class of continuations whose Throw method queues a CLR task but that has no Wait method:
Here, Action is the type of a first-class method expecting a T argument that returns void. Instead of returning t or blocking on resumeK.Wait(), as in the code for Sync(), Async(k) immediately returns control, having either queued () => k(t) as a new asynchronous task or saved k for a future synchronization through a successful call to Block(...,resumeK): The Async method makes it possible to use C# iterators to provide a form of light-weight, user-mode threading. Although somewhat awkward, iterators let one write non-blocking tasks in a sequential style by yield-ing control to a dispatcher that advances the iterator through its states [CS05]. In particular, by yielding CML events, and having the dispatcher queue an action to resume the iteration asynchronously on completion of each event, we can arrange to multiplex a large number of lightweight tasks over a much smaller set of CLR worker threads.
10. Performance
This section presents some preliminary benchmark results for our two implementations. To test the Manticore implementation of the protocol, we compare its results against the CML implementation that is distributed as part of SML/NJ (version 110.69). These tests were run on a system with four 2GHz dual-core AMD Opteron 870 processors and 8GB of RAM, running Debian Linux (kernel version 2.6.18-6-amd64). Each benchmark was run ten times; we report the average wall-clock time.
internal class AsyncCont<T> : Cont<T> {
    private Action<T> k;
    internal AsyncCont(Action<T> k) { this.k = k; }
    internal override void Throw(Thunk<T> res) {
        P.QueueTask(() => k(res()));
    }
}
10.1 Micro-benchmarks
Our first two experiments measure the cost of basic concurrency operations: namely, thread creation and message passing.
Spawn This program repeatedly spawns a trivial thread and then waits for it to terminate. In the two-processor case, the parent thread runs on one processor and creates children on the other.
The code for method Async(k) takes a continuation action k and follows the same logic as Sync:

public void Async(Action<T> k) {
    List enabled = Poll(null);
    T t = default(T);
    while (enabled != null) {
        if (enabled.head.DoFn(ref t)) {
            QueueTask(() => k(t));
            return;
        }
        enabled = enabled.tail;
    }
    var resumeK = new AsyncCont<T>(k);
    Block(new Evt_State(), resumeK);
}
Ping-pong This program involves two threads that bounce messages back and forth. In the two-processor case, each thread runs on its own processor.
For Manticore, we measured two versions of these programs: one that runs on a single processor and one that runs on two processors. Note that these benchmarks do not exhibit parallelism; the two-processor version is designed to measure the extra overhead of working across processors (see Section 9.1). The results for these experiments are given in Figure 16. For each experiment, we report the measured rate and the ratio between the measured rate and the
9 A full implementation would also need to take a failure callback and properly plumb exceptions in the body of method Async(k).
CML version (a higher ratio is better). As can be seen from these numbers, the cost of scheduling threads on remote processors is significantly higher.
As before, the benchmark demonstrates that we will get speedups on parallel hardware when computations are independent. Our second C# benchmark is an asynchronous, task-based implementation of the primes benchmark from above. Note that the synchronous version, which uses one CLR thread per prime filter, exhausts system resources after around 1000 threads (as expected), but the task-based implementation, written using C# iterators yielding Evt values, scales better, handling larger inputs and benefiting from more processors.
10.2 Parallel ping-pong
While the above programs do not exhibit parallelism, it is possible to run multiple copies of them in parallel, which is a predictor of aggregate performance across a large collection of independently communicating threads. We ran eight copies (i.e., 16 threads) of the ping-pong benchmark simultaneously. For the multiprocessor version, each thread of a communicating pair was assigned to a different processor.

System            Messages/sec.   Ratio (vs. CML)   Ratio (vs. 1P)
CML               1,576,000       1.00              -
Manticore (1P)      724,000       0.46              1.00
Manticore (2P)      412,000       0.26              0.57
Manticore (4P)      734,000       0.47              1.01
Manticore (8P)    1,000,000       0.63              1.38
10.3 Primes
The Primes benchmark computes the first 2000 prime numbers using the Sieve of Eratosthenes. The computation is structured as a pipeline of filter threads: as each new prime is found, a new filter thread is added to the end of the pipeline. We ran both single- and multiprocessor versions of the program; the filters were assigned to processors in a round-robin fashion. We report the time and speedup relative to the CML version in the following table:

System            Time (sec.)   Speedup (vs. CML)   Speedup (vs. 1P)
CML               1.34          1.00                -
Manticore (1P)    3.08          0.43                1.00
Manticore (2P)    3.37          0.40                0.91
Manticore (4P)    1.61          0.83                1.91
Manticore (8P)    0.92          1.45                3.35
Even though the computation per message is quite low in this program, we see a speedup on multiple processors.
We also measured the performance of the C# implementation on a system with two 2.33GHz quad-core Intel Xeon E5345 processors and 4GB of memory, running 32-bit Vista Enterprise SP1 and CLR 4.0 Beta 1. Each benchmark was run ten times, allowing the OS to schedule on 1 to 8 cores; we report the average wall-clock time. Since we have no uniprocessor implementation (such as CML) to compare with, we take the single-processor runs as our baseline. Our first C# benchmark is the parallel ping-pong program from above. The implementations use proper threads synchronizing via blocking calls to Sync. The mapping of threads to processors was left to the OS scheduler.

# Procs   Messages/sec.   Ratio
1         37,100          1.00
2         68,400          1.84
4         75,000          2.02
8         84,700          2.28
10.5 Summary
11. Related work
Various authors have described implementations of choice protocols using message passing as the underlying mechanism [BS83, Bor86, Kna92, Dem98]. While these protocols could, in principle, be mapped to a shared-memory implementation, we believe that our approach is both simpler and more efficient. Russell described a monadic implementation of CML-style events on top of Concurrent Haskell [Rus01]. His implementation uses Concurrent Haskell’s M-vars for concurrency control and he uses an ordered two-phase locking scheme to commit to communications. A key difference in his implementation is that choice is biased to the left, which means that he can commit immediately to an enabled event during the polling phase. This feature greatly simplifies his implementation, since it does not have to handle changes in event status between the polling phase and the commit phase. Russell’s implementation did not support multiprocessors (because Concurrent Haskell did not support them at the time), but presumably would work on a parallel implementation of Concurrent Haskell. Donnelly and Fluet have implemented a version of events that support transactions on top of Haskell’s STM mechanism [DF06]. Their mechanism is quite powerful and, thus, their implementation is quite complicated.
10.4 C# Performance
The corresponding results for the task-based primes benchmark are:

# Procs   Time (sec.)   Speedup
1         6.68          1.00
2         4.70          1.42
4         3.07          2.17
8         2.49          2.68
The results presented in this section demonstrate that the extra overhead required to support parallel execution (i.e., atomic memory operations and more complicated protocols) does not prevent acceptable performance. As we would expect, the single-threaded implementation of CML is much faster than the parallel implementations (e.g., about 2.5 times faster than the Manticore 1P implementation). Since the performance of most real applications is not dominated by communication costs, we expect that the benefits of parallelism will easily outweigh the extra costs of the parallel implementation. We also expect that improvements in the Manticore compiler, as well as optimization techniques for message-passing programs [RX07], will reduce the performance gap between the single-threaded and multi-threaded implementations. These experiments also demonstrate that there is a significant cost in communicating across multiple processors in the Manticore system. Scheduling threads for the same processor will reduce message-passing costs. On the other hand, when the two communicating threads can compute in parallel, there is an advantage to having them on separate processors. Thus, we need scheduling policies that keep threads on the same processor when they are not concurrent, but distribute them when they are. There is some existing research on this problem by Vella [Vel98] and more recently Ritson [Rit08] that we may be able to incorporate in the Manticore runtime.
As expected, this benchmark demonstrates that we will get speedups on parallel hardware when computations are independent. It is worth noting that if we had assigned pairs of communicating threads to the same processor (instead of different ones), we would expect even better results, since we would not be paying the interprocessor communication overhead.
This paper builds on our previous protocol for asymmetric CML. In addition to generalizing the protocol to handle output guards, this paper provides a more complete story, including verification, multiple parallel implementations, and performance results. In earlier work, we reported on specialized implementations of CML’s channel operations that can be used when program analysis determines that it is safe [RX07]. Those specialized implementations fit into our framework and can be regarded as complementary.
12. Conclusion
We have described what we believe to be the first efficient parallel implementation of CML that supports fully symmetric input and output events. We found the application of stateless model checking to be a valuable tool during the development of the protocol, both uncovering bugs and increasing our confidence in the final design of a reasonably intricate and novel synchronization protocol. Our two parallel implementations, the continuation-based one in Manticore and the thread-based one in C#, demonstrate that the underlying protocols have wider applicability than just Manticore. We evaluated the performance of the continuation-based implementation and found it to be within a factor of 2.5 of the single-threaded implementation. More significantly, the parallel implementation allows speedups on parallel hardware. Interesting future work would be to further evaluate the performance of the C# implementation and to use Microsoft's CHESS framework to model-check its code.
[FF04]
Flatt, M. and R. B. Findler. Kill-safe synchronization abstractions. In PLDI ’04, June 2004, pp. 47–58.
[FFR+ 07]
Fluet, M., N. Ford, M. Rainey, J. Reppy, A. Shaw, and Y. Xiao. Status report: The Manticore project. In ML ’07. ACM, October 2007, pp. 15–24.
[FRR08]
Fluet, M., M. Rainey, and J. Reppy. A scheduling framework for general-purpose parallel languages. In ICFP '08, Victoria, BC, Canada, September 2008. ACM, pp. 241–252.
[GR93]
Gansner, E. R. and J. H. Reppy. A Multi-threaded Higherorder User Interface Toolkit, vol. 1 of Software Trends, pp. 61–80. John Wiley & Sons, 1993.
[HS08]
Herlihy, M. and N. Shavit. The Art of Multiprocessor Programming. Morgan Kaufmann Publishers, New York, NY, 2008.
[Kna92]
Knabe, F. A distributed protocol for channel-based communication with choice. Technical Report ECRC-92-16, European Computer-industry Research Center, October 1992.
[Ler00]
Leroy, X. The Objective Caml System (release 3.00), April 2000. Available from http://caml.inria.fr.
[MLt]
MLton. Concurrent ML. Available at http://mlton. org/ConcurrentML.
[MQ07]
Musuvathi, M. and S. Qadeer. Iterative context bounding for systematic testing of multithreaded programs. In PLDI ’07, San Diego, CA, June 2007. ACM, pp. 446–455.
[MTHM97] Milner, R., M. Tofte, R. Harper, and D. MacQueen. The Definition of Standard ML (Revised). The MIT Press, Cambridge, MA, 1997.
Acknowledgments The extension of the asymmetric protocol [RX08] to the symmetric case was done while the first author was a Visiting Researcher at Microsoft Research Cambridge. The machine used for the benchmarks was supported by NSF award 0454136. This research was also supported, in part, by NSF award 0811389. Mike Rainey provided help with fitting the implementation into the Manticore runtime infrastructure.
[Rep88]
Reppy, J. H. Synchronous operations as first-class values. In PLDI ’88, June 1988, pp. 250–259.
[Rep91]
Reppy, J. H. CML: A higher-order concurrent language. In PLDI ’91. ACM, June 1991, pp. 293–305.
[Rep99]
Reppy, J. H. Concurrent Programming in ML. Cambridge University Press, Cambridge, England, 1999.
[Rit08]
Ritson, C. Multicore scheduling for lightweight communicating processes. Talk at the Workshop on Language and Runtime Support for Concurrent Systems, October 2008. Slides available from http://www.mm-net.org.uk/ workshop171008/mmw07-slides.
References [Bor86]
Bornat, R. A protocol for generalized occam. SP&E, 16(9), September 1986, pp. 783–799.
[BS83]
Buckley, G. N. and A. Silberschatz. An effective implementation for the generalized input-output construct of CSP. ACM TOPLAS, 5(2), April 1983, pp. 223–235.
[Rus01]
Russell, G. Events in Haskell, and how to implement them. In ICFP '01, September 2001, pp. 157–168.
[CLR]
The .NET Common Language Runtime. See http://msdn.microsoft.com/en-gb/netframework/.
[RX07]
Reppy, J. and Y. Xiao. Specialization of CML message-passing primitives. In POPL '07. ACM, January 2007, pp. 315–326.
[CS05]
Chrysanthakopoulos, G. and S. Singh. An asynchronous messaging library for C#. In Synchronization and Concurrency in Object-Oriented Languages (SCOOL), OOPSLA 2005 Workshop. UR Research, October 2005.
[RX08]
Reppy, J. and Y. Xiao. Toward a parallel implementation of Concurrent ML. In DAMP '08. ACM, January 2008.
[TG2]
TG2, E. T. C# language specification. See http:// www.ecma-international.org/publications/ standards/Ecma-334.htm.
[Vel98]
Vella, K. Seamless parallel computing on heterogeneous networks of multiprocessor workstations. Ph.D. dissertation, University of Kent at Canterbury, December 1998.
[Dem97]
Demaine, E. D. Higher-order concurrency in Java. In WoTUG20, April 1997, pp. 34–47. Available from http: //theory.csail.mit.edu/˜edemaine/papers/ WoTUG20/.
[Dem98]
Demaine, E. D. Protocols for non-deterministic communication over synchronous channels. In IPPS/SPDP’98, March 1998, pp. 24–30. Available from http://theory. csail.mit.edu/˜edemaine/papers/IPPS98/.
[DF06]
Donnelly, K. and M. Fluet. Transactional events. In ICFP ’06, Portland, Oregon, USA, 2006. ACM, pp. 124–135.
[YYS+01] Young, C., Y. N. Lakshman, T. Szymanski, J. Reppy, R. Pike, G. Narlikar, S. Mullender, and E. Grosse. Protium, an infrastructure for partitioned applications. In HotOS-X, January 2001, pp. 41–46.
A Concurrent ML Library in Concurrent Haskell Avik Chaudhuri University of Maryland, College Park [email protected]
Abstract
In Concurrent ML, synchronization abstractions can be defined and passed as values, much like functions in ML. This mechanism admits a powerful, modular style of concurrent programming, called higher-order concurrent programming. Unfortunately, it is not clear whether this style of programming is possible in languages such as Concurrent Haskell, that support only first-order message passing. Indeed, the implementation of synchronization abstractions in Concurrent ML relies on fairly low-level, languagespecific details. In this paper we show, constructively, that synchronization abstractions can be supported in a language that supports only firstorder message passing. Specifically, we implement a library that makes Concurrent ML-style programming possible in Concurrent Haskell. We begin with a core, formal implementation of synchronization abstractions in the π-calculus. Then, we extend this implementation to encode all of Concurrent ML’s concurrency primitives (and more!) in Concurrent Haskell. Our implementation is surprisingly efficient, even without possible optimizations. In several small, informal experiments, our library seems to outperform OCaml’s standard library of Concurrent ML-style primitives. At the heart of our implementation is a new distributed synchronization protocol that we prove correct. Unlike several previous translations of synchronization abstractions in concurrent languages, we remain faithful to the standard semantics for Concurrent ML’s concurrency primitives. For example, we retain the symmetry of choose, which can express selective communication. As a corollary, we establish that implementing selective communication on distributed machines is no harder than implementing first-order message passing on such machines.
As famously argued by Reppy (1999), there is a fundamental conflict between selective communication (Hoare 1978) and abstraction in concurrent programs. For example, consider a protocol run between a client and a pair of servers. In this protocol, selective communication may be necessary for liveness—if one of the servers is down, the client should be able to interact with the other. This may require some details of the protocol to be exposed. At the same time, abstraction may be necessary for safety—the client should not be able to interact with a server in an unexpected way. This may in turn require those details to be hidden. An elegant way of resolving this conflict, proposed by Reppy (1992), is to separate the process of synchronization from the mechanism for describing synchronization protocols. More precisely, Reppy introduces a new type constructor, event, to type synchronous operations in much the same way as -> (“arrow”) types functional values. A synchronous operation, or event, describes a synchronization protocol whose execution is delayed until it is explicitly synchronized. Thus, roughly, an event is analogous to a function abstraction, and event synchronization is analogous to function application. This abstraction mechanism is the essence of a powerful, modular style of concurrent programming, called higher-order concurrent programming. In particular, programmers can describe sophisticated synchronization protocols as event values, and compose them modularly. Complex event values can be constructed from simpler ones by applying suitable combinators. For instance, selective communication can be expressed as a choice among event values—and programmer-defined abstractions can be used in such communication without breaking those abstractions (Reppy 1992). Reppy implements events, as well as a collection of such suitable combinators, in an extension of ML called Concurrent ML (CML) (Reppy 1999). We review these primitives informally in Section 2; their formal semantics can be found in (Reppy 1992). The implementation of these primitives in CML relies on fairly low-level, language-specific details, such as support for continuations and signals (Reppy 1999). In turn, these primitives immediately support higher-order concurrent programming in CML. Other languages, such as Concurrent Haskell (Peyton-Jones et al. 1996), seem to be more modest in their design. Following the π-calculus (Milner et al. 1992), such languages support only first-order message passing. While functions for first-order message passing can be encoded in CML, it is unclear whether, conversely, the concurrency primitives of CML can be expressed in those languages.
Categories and Subject Descriptors D.1.3 [Programming Techniques]: Concurrent Programming; D.3.3 [Programming Languages]: Language Constructs and Features—Concurrent programming structures; D.3.1 [Programming Languages]: Formal Definitions and Theory—Semantics General Terms
Algorithms, Languages
Keywords Concurrent ML, synchronization abstractions, distributed synchronization protocol, π-calculus, Concurrent Haskell
1. Introduction
Contributions In this paper, we show that CML-style concurrency primitives can in fact be implemented as a library, in a language that already supports first-order message passing. Such a library makes higher-order concurrent programming possible in a language such as Concurrent Haskell. Our implementation has further interesting consequences. For instance, the designers of Con-
implementations need to restrict the power of choose in order to tame it (Russell 2001; Reppy and Xiao 2008). Our implementation is designed to avoid such problems (see Section 9).
current Haskell deliberately avoid a CML-style choice primitive (Peyton-Jones et al. 1996, Section 5), partly concerned that such a primitive may complicate a distributed implementation of Concurrent Haskell. By showing that such a primitive can be encoded in Concurrent Haskell itself, we eliminate that concern. At the heart of our implementation is a new distributed protocol for synchronization of events. Our protocol is carefully designed to ensure safety, progress, and fairness. In Section 3, we formalize this protocol as an abstract state machine, and prove its correctness. Then, in Section 4, we describe a concrete implementation of this protocol in the π-calculus, and prove its correctness as well. This implementation can serve as a foundation for other implementations in related languages. Building on this implementation, in Sections 5, 6, and 7, we show how to encode all of CML’s concurrency primitives, and more, in Concurrent Haskell. Our implementation is very concise, requiring less than 150 lines of code; in contrast, a related existing implementation (Russell 2001) requires more than 1300 lines of code. In Section 8, we compare the performance of our library against OCaml’s standard library of CML-style primitives, via several small, informal experiments. Our library consistently runs faster in these experiments, even without possible optimizations. While these experiments do not account for various differences between the underlying language implementations, especially those of threads, we think that these results are nevertheless encouraging. Finally, we should point out that unlike several previous implementations of CML-style primitives in other languages, we remain faithful to the standard semantics for those primitives (Reppy 1999). For example, we retain the symmetry of choose, which can express selective communication. Indeed, we seem to be the first to implement a CML library that relies purely on first-order message passing, and preserves the standard semantics. We defer a more detailed discussion on related work to Section 9.
choose : [event tau] -> event tau
• choose V returns an event that, on synchronization, synchronizes one of the events in list V and “aborts” the other events.
Conversely, the combinator wrapabort can specify an action that is spawned if an event is aborted by a selection.
wrapabort : (() -> ()) -> event tau -> event tau
• wrapabort f v returns an event that either synchronizes the event v, or, if aborted, spawns a thread that runs the code f ().
The combinators guard and wrap can specify pre- and post-synchronization actions.
guard : (() -> event tau) -> event tau
wrap : event tau -> (tau -> tau’) -> event tau’
• guard f returns an event that, on synchronization, synchronizes the event returned by the code f ().
• wrap v f returns an event that, on synchronization, synchronizes the event v and applies the function f to the result.
Finally, the function sync can synchronize an event and return the result.
sync : event tau -> tau
By design, an event can synchronize only at some “point”, where a message is either sent or accepted on a channel. Such a point, called the commit point, may be selected among several other candidates at run time. Furthermore, some code may be run before, and after, synchronization—as specified by guard functions, by wrap functions that enclose the commit point, and by wrapabort functions that do not enclose the commit point. For example, consider the following value of type event (). (Here, c and c’ are values of type channel ().)
2. Overview of CML
In this section, we present a brief overview of CML’s concurrency primitives. (Space constraints prevent us from motivating these primitives any further; the interested reader can find a comprehensive account of these primitives, with several programming examples, in (Reppy 1999).) We provide a small example at the end of this section. Note that channel and event are polymorphic type constructors in CML, as follows:
val v = choose [
  guard (fn () -> ...;
    wrapabort ...
      (choose [wrapabort ... (transmit c ());
               wrap (transmit c’ ()) ... ] ) );
  guard (fn () -> ...;
    wrap (wrapabort ... (receive c)) ... )
  ]
• The type channel tau is given to channels that carry values of type tau.
• The type event tau is given to events that return values of type tau on synchronization (see the function sync below).
The combinators receive and transmit build events for synchronous communication.
The event v describes a fairly complicated protocol that, on synchronization, selects among the communication events transmit c (), transmit c’ (), and receive c, and runs some code (elided by ...s) before and after synchronization. Now, suppose that we run the following ML program.
receive : channel tau -> event tau
transmit : channel tau -> tau -> event ()
a message M on channel c and returns M. Such an event must synchronize with transmit c M.
val _ = spawn (fn () -> sync v); sync (receive c’)
• transmit c M returns an event that, on synchronization, sends the message M on channel c and returns () (that is, “unit”). Such an event must synchronize with receive c.
This program spawns sync v in parallel with sync (receive c’). In this case, the event transmit c’ () is selected inside v, so that it synchronizes with receive c’. The figure below depicts sync v as a tree. The point marked • is the commit point; this point is selected among the other candidates, marked ◦, at run time.
Perhaps the most powerful of CML’s concurrency primitives is the combinator choose; it can nondeterministically select an event from a list of events, so that the selected event can be synchronized. In particular, choose can express selective communication. Several
Furthermore, (only) code specified by the combinators marked in boxes is run before and after synchronization, following the semantics outlined above.
[Figure: sync v drawn as a tree of combinators rooted at choose; the commit point, the transmit c’ () communication, is marked •, and the other candidate communication points are marked ◦.]
The various states of principals are shown below. Roughly, principals in specific states react with each other to cause transitions in the machine, following rules that appear later in the section. States of principals ς ςp ::= p 7→ α ♥p α ςc ::= c ⊕c (p, q) ςs ::= s s Xs (p) ×(p) .. ^s (p) .. _s
3. A distributed protocol for synchronization
We now present a distributed protocol for synchronizing events. We focus on events that are built with the combinators receive, transmit, and choose. While the other combinators are important for describing computations, they do not fundamentally affect the nature of the protocol; we consider them later, in Sections 5 and 6.
3.1 A source language
For brevity, we simplify the syntax of the source language. Let c range over channels. We use the following notation: ϕ⃗ℓ is a sequence of the form ϕ1, . . . , ϕn, where ℓ ∈ 1..n; furthermore, {ϕ⃗ℓ} is the set {ϕ1, . . . , ϕn}, and [ϕ⃗ℓ] is the list [ϕ1, . . . , ϕn]. The syntax of the language is as follows.
Let p and s range over points and synchronizers. A synchronizer can be viewed as a partial function from points to actions; we represent this function as a parallel composition of bindings of the form p 7→ α. Further, we require that each point is associated with a unique synchronizer, that is, for any s and s0 , s 6= s0 ⇒ dom(s) ∩ dom(s0 ) = ∅. The semantics of the machine is described by the local transition rules shown below, plus the usual structural rules for parallel composition, inaction, and name creation as in the π-calculus (Milner et al. 1992).
• Actions α, β, . . . are of the form c or c̄ (input or output on c). Informally, actions model communication events built with receive and transmit.
• Programs are of the form S1 | . . . | Sm (parallel composition of S1, . . . , Sm), where each Sk (k ∈ 1..m) is either an action α, or a selection of actions, select(α⃗ᵢ). Informally, a selection of actions models the synchronization of a choice of events, following the CML function select.
Operational semantics σ −→ σ 0
select : [event tau] -> tau
select V = sync (choose V)
(1)
Further, we consider only the following local reduction rule:
(SelComm)   if c̄ ∈ {α⃗ᵢ} and c ∈ {β⃗ⱼ}, then select(α⃗ᵢ) | select(β⃗ⱼ) −→ c̄ | c
This rule models selective communication. We also consider the usual structural rules for parallel composition. However, we ignore reduction of actions at this level of abstraction.
3.2 A distributed abstract state machine for synchronization
Our synchronization protocol is run by a distributed system of principals that include channels, points, and synchronizers. Informally, every action is associated with a point, and every select is associated with a synchronizer. The reader may draw an analogy between our setting and one of arranging marriages, by viewing points as prospective brides and grooms, channels as matchmakers, and synchronizers as parents whose consents are necessary for marriages. We formalize our protocol as a distributed abstract state machine that implements the rule (S EL C OMM). Let σ range over states of the machine. These states are built by parallel composition |, inaction 0, and name creation ν (Milner et al. 1992) over various states of principals.
p 7→ c | q 7→ c | c −→ ♥p | ♥q | ⊕c (p, q) | c
(2.i)
p ∈ dom(s) ♥p | s −→ Xs (p) | s
(2.ii)
p ∈ dom(s) ♥p | s −→ × (p) | s
(3.i)
Xs (p) | Xs0 (q) | ⊕c (p, q) −→ ^s (p) | ^s0 (q)
(3.ii)
Xs (p) | × (q) | ⊕c (p, q) −→ _s
(3.iii)
×(p) | Xs (q) | ⊕c (p, q) −→ _s
(3.iv)
×(p) | × (q) | ⊕c (p, q) −→ 0
(4.i) (4.ii)
..
..
..
..
s(p) = α .. ^s (p) −→ α .. → → _s , (ν − pi ) (s | s) where dom(s) = {− pi }
Intuitively, these rules may be read as follows.
States of the machine σ σ ::= σ | σ0 0 → (ν − pi ) σ ς
states of a point active matched released states of a channel free announced states of a synchronizer open closed selected refused confirmed canceled
(1) Two points p and q, bound to complementary actions on channel c, react with c, so that p and q become matched (♥p and ♥q ) and the channel announces their match (⊕c (p, q)).
states of the machine parallel composition inaction name creation state of principals
(2.i–ii) Next, p (and likewise, q) reacts with its synchronizer s. If the synchronizer is open (s ), it now becomes closed (s ), and p is declared selected by s (Xs (p)). If the synchronizer is already closed, then p is refused (×(p)).
(Progress) if P −→ , then σ −→+ σ 0 and P −→ P 0 for some σ 0 and P 0 such that P 0 ∼ σ 0 ; (Fairness) if P −→ P 0 for some P 0 , then σ0 −→ . . . −→ σn for some σ0 , . . . , σn such that σn = σ, P ∼ σi for all 0 ≤ i < n, and σ0 −→+ σ 0 for some σ 0 such that P 0 ∼ σ 0 .
(3.i–iv) If both p and q are selected, c confirms the selections to both parties. If only one of them is selected, c cancels that selection.
(4.i–ii) If the selection of p is confirmed, the action bound to p is released. Otherwise, the synchronizer “reboots” with fresh names for the points in its domain.
Informally, the above theorem guarantees that any sequence of program reductions can be simulated by some sequence of state transitions, and vice versa, such that
3.3 Compilation
Next, we show how programs in the source language are compiled on to this machine. Let Π denote indexed parallel composition; using this notation, for example, we can write a program S1 | . . . | Sm as Πk∈1..m Sk . Suppose that the set of channels in a program Πk∈1..m Sk is C. We compile this program to the state
• from any intermediate program, it is always possible to simulate
any transition of a related intermediate state; • from any intermediate state,
∼
– it is always possible to simulate some reduction of a related intermediate program;
Πc∈C c | Πk∈1..m Sk , where 8 if S = α > > < ≈ − → → if S = select(− αi ), S , (νs, pi ) > > ( | Π (p → 7 α )Js, α ˆ K) i ∈ 1..n, and s i∈1..n i i i > : → s, − pi are fresh names
A π-calculus model of the abstract state machine
We interpret states of our machine as π-calculus processes that run at points, channels, and synchronizers. These processes reduce by communication to simulate transitions in the abstract state machine. In this setting:
Interpretation of states as processes
Let ⇑ be a partial function from processes to states that, for any state σ, maps its interpretation as a process back to σ. For any process π such that ⇑ π is defined, we define its denotation pπq to be p⇑ πq; the denotation of any other process is undefined. We then prove the following theorem (Chaudhuri 2009), closely following the proof of Theorem 3.1.
States of a point (p 7→ c)Js, i [c] K , (ν candidate [p] ) i [c] hcandidate [p] i. candidate [p] (decision [p] ). ♥p Jdecision [p] , s, cK
T HEOREM 4.1 (Correctness of the π-calculus implementation). Let C be the set of channels in a program Πk∈1..m Sk . Then
(q 7→ c)Js, o [c] K , (ν candidate [q] ) o [c] hcandidate [q] i. candidate [q] (decision [q] ). ♥q Jdecision [q] , s, cK
≈
Πk∈1..m Sk ≈ (νc∈C i [c] , o [c] ) (Πc∈C c | Πk∈1..m Sk ) where ≈ is the largest relation such that P ≈ π iff
♥p Jdecision [p] , s, αK , shp, decision [p] i. p(). α
(Invariant) π −→? π 0 for some π 0 such that pPq = pπ 0 q; (Safety) if π −→ π 0 for some π 0 , then P −→? P 0 for some P 0 such that P 0 ≈ π 0 ; (Progress) if P −→ , then π −→+ π 0 and P −→ P 0 for some π 0 and P 0 such that P 0 ≈ π 0 ; (Fairness) if P −→ P 0 for some P 0 , then π0 −→ . . . −→ πn for some π0 , . . . , πn such that πn = π, P ≈ πi for all 0 ≤ i < n, and π0 −→+ π 0 for some π 0 such that P 0 ≈ π 0 .
States of a channel c Ji [c] , o [c] K , i [c] (candidate [p] ). o [c] (candidate [q] ). ((νdecision [p] , decision [q] ) candidate [p] hdecision [p] i. candidate [q] hdecision [q] i. ⊕c (p, q)Jdecision [p] , decision [q] K [c] | c Ji , o [c] K)
5. A CML library in Concurrent Haskell
We now proceed to code a full CML-style library for events in a fragment of Concurrent Haskell with first-order message passing (Peyton-Jones et al. 1996). This fragment is close to the πcalculus, so we can simply lift our implementation of Section 4.1. Going further, we remove the restrictions on the source language: a program can be any well-typed Haskell program. We implement not only receive, transmit, choose, and sync, but also new, guard, wrap, and wrapabort. Finally, we exploit Haskell’s type system to show how events can be typed under the standard IO monad (Gordon 1994; Peyton-Jones and Wadler 1993). Before we proceed, let us briefly review Concurrent Haskell’s concurrency primitives. (The reader may wish to refer (PeytonJones et al. 1996) for details.) These primitives support concurrent I/O computations, such as forking threads and communicating on mvars—which are synchronized mutable variables, similar to πcalculus channels (see below). Note that MVar and IO are polymorphic type constructors in Concurrent Haskell, as follows:
⊕c (p, q)Jdecision [p] , decision [q] K , (decision [p] (confirm [p] , cancel [p] ). (decision [q] (confirm [q] , cancel [q] ). confirm [p] hi. confirm [q] hi. 0 | decision [q] (). cancel [p] hi. 0) | decision [p] (). (decision [q] (confirm [q] , cancel [q] ). cancel [q] hi. 0 | decision [q] (). 0)) States of a synchronizer
• The type MVar tau is given to a communication cell that car-
s , s(p, decision [p] ). (Xs (p)Jdecision [p] K | s )
ries values of type tau. • The type IO tau is given to a computation that yields results
of type tau, with possible side effects via communication.
s , s(p, decision [p] ). (×(p)Jdecision [p] K | s )
We rely on the following semantics of MVar cells.
• A cell can carry at most one value at a time, that is, it is either empty or full.
Xs (p)Jdecision [p] K , (ν confirm [p] , cancel [p] ) decision [p] hconfirm [p] , cancel [p] i. .. (confirm [p] (). ^s (p) .. | cancel [p] (). _s )
• The function newEmptyMVar :: IO (MVar tau) returns a fresh cell that is empty.
• The function takeMVar :: MVar tau -> IO tau is used to read from a cell; takeMVar m blocks if the cell m is empty, else gets the content of m (thereby emptying it).
×s (p)Jdecision [p] K , decision [p] hi. 0
• The function putMVar :: MVar tau -> tau -> IO () is used to write to a cell; putMVar m M blocks if the cell m is full, else puts the term M in m (thereby filling it).
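For instance, the following tiny usage example (not part of the library) illustrates these semantics:

import Control.Concurrent.MVar

example :: IO Int
example = do
  m <- newEmptyMVar          -- the cell starts empty
  putMVar m (41 :: Int)      -- filling an empty cell succeeds immediately
  x <- takeMVar m            -- emptying it again yields 41
  return (x + 1)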
..
^s (p) , phi. 0
Further, we rely on the following semantics of IO computations; see (Peyton-Jones and Wadler 1993) for details.
.. → _s , (νs, − pi ) (s | Πi∈1..n (pi 7→ αi )Js, α ˆ i K) → where dom(s) = {− pi }, i ∈ 1..n, and ∀i ∈ 1..n. s(pi ) = αi
• The function forkIO :: IO () -> IO () is used to spawn a concurrent computation; forkIO f forks a thread that runs the computation f.
• The function return :: tau -> IO tau is used to inject a value into a computation.
The protocol code run by points abstracts on a cell s for the associated synchronizer, and a name p for the point itself. Depending on whether the point is for input or output, the code further abstracts on an input cell i or output cell o, and an associated action alpha.
• Computations can be sequentially composed by “piping”. We use Haskell’s convenient do {...} notation for this purpose, instead of applying the underlying piping function
(>>=) :: IO tau -> (tau -> IO tau’) -> IO tau’
Thus, e.g., we write do {x <- takeMVar m; putMVar m x} instead of takeMVar m >>= \x -> putMVar m x.
Our library provides the following CML-style functions for programming with events in Concurrent Haskell.1 (Observe the differences between ML and Haskell types for these functions. Since Haskell is purely functional, we must embed types for computations, with possible side-effects via communication, within the IO monad. Further, since evaluation in Haskell is lazy, we can discard λ-abstractions that simply “delay” eager evaluation.)
receive :: channel tau -> event tau
transmit :: channel tau -> tau -> event ()
guard :: IO (event tau) -> event tau
wrap :: event tau -> (tau -> IO tau’) -> event tau’
choose :: [event tau] -> event tau
wrapabort :: IO () -> event tau -> event tau
sync :: event tau -> IO tau
The protocol code run by points has the following types:
AtPointI :: Synchronizer -> Point -> In -> IO tau -> IO tau
AtPointO :: Synchronizer -> Point -> Out -> IO () -> IO ()
We instantiate the function AtPointI in the code for receive, and the function AtPointO in the code for transmit. These associate appropriate point principals to any events constructed with receive and transmit.
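As a quick illustration of the library interface listed above, here is a small usage sketch; it assumes a channel constructor new :: IO (channel tau), as suggested later in the paper, and forkIO from Control.Concurrent.

import Control.Concurrent (forkIO)

-- Assumes the library's new, transmit, receive, wrap, and sync are in scope.
rendezvous :: IO ()
rendezvous = do
  c <- new
  _ <- forkIO (sync (transmit c "hello"))
  s <- sync (wrap (receive c) (\msg -> return (msg ++ ", world")))
  putStrLn s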
In this section, we focus on events that are built without wrapabort; the full implementation appears in Section 6.
5.1 Type definitions
We begin by defining the types of cells on which messages are exchanged in our protocol (recall the discussion in Section 4.1).2 These cells are of the form i and o (on which points initially send messages to channels), candidate (on which channels reply back to points), s (on which points forward messages to synchronizers), decision (on which synchronizers inform channels), confirm and cancel (on which channels reply back to synchronizers), and p (on which synchronizers finally signal to points).
5.2 Protocol code for points
5.3 Protocol code for channels
The protocol code run by channels abstracts on an input cell i and an output cell o for the channel.
AtChan :: In -> Out -> IO ()
(Here fix is the usual fixed-point combinator, fix :: (tau -> tau) -> tau; the term fix f reduces to f (fix f).)
synchronizer s, a fresh name for the point, the input cell for channel c, and an action that inputs on c.
AtSync :: Synchronizer -> IO () -> IO () AtSync s reboot = do { (p,decision) do { (p’,decision’) IO tau’) -> event tau’ wrap v f = \s -> do { x tau -> IO tau’) -> tau’ -> [tau] -> IO tau’. The term foldM f x [] reduces to return x, and the term foldM f x [v,V] reduces to do {x IO tau An ML channel is a Haskell MVar tagged with a pair of input and output cells. An ML event is a Haskell IO function that abstracts on a synchronizer cell. 5.6
choose :: [event tau] -> event tau choose V = \s -> do { temp \v -> forkIO (do { x \abort -> do { ...; P \v -> do { name’ \name -> \abort -> do { forkIO (do { P Abort -> IO () -> IO () AtSync s abort X = do { ...; forkIO (do { ...; fix (\iter -> do { (P,f) \name -> \abort -> do { v \name -> \abort -> do { x Name -> Abort -> IO tau Now, an ML event is a Haskell IO function that abstracts on a synchronizer, an abort cell, and a name cell that carries the list of points the event encloses. The Haskell function new does not change. We highlight minor changes in the remaining translations. We begin with the functions receive and transmit. An event built with either function is named by a singleton containing the name of the enclosed point.
Finally, in the function sync, a fresh abort cell is now passed to AtSync, and a fresh name cell is created for the event to be synchronized. sync v = do { ...; forkIO (fix (\iter -> do { ...; name do { ...; forkIO (putMVar name [p]); ... } transmit (i,o,m) M = \s -> \name -> \abort -> do { ...; forkIO (putMVar name [p]); ... }
7. Implementation of communication guards
Just decision -> do { putMVar s (p,decision); ... }
Beyond the standard primitives, some implementations of CML further consider primitives for guarded communication. In particular, Russell (2001) implements such primitives in Concurrent Haskell, but his implementation strategy is fairly specialized—for example, it requires a notion of guarded events (see Section 9 for a discussion on this issue). We show that in contrast, our implementation strategy can accommodate such primitives with little effort. Specifically, we wish to support the following receive combinator, that can carry a communication guard.
} Finally, we make minor adjustments to the type constructor channel, and the functions receive and transmit. type channel tau = (In tau, Out tau, MVar tau) receive (i,o,m) cond = \s -> \name -> \abort -> do { ...; AtPointI s p i cond (takeMVar m) } transmit (i,o,m) M = \s -> \name -> \abort -> do { ...; AtPointO s p o M (putMVar m M) }
receive :: channel tau -> (tau -> Bool) -> event tau
Intuitively, (receive c cond) synchronizes with (transmit c M) only if cond M is true. In our implementation, we make minor adjustments to the types of cells on which messages are exchanged between points and channels.
type In tau = MVar (Candidate, tau -> Bool)
type Out tau = MVar (Candidate, tau)
type Candidate = MVar (Maybe Decision)
8. Evaluation
Our implementation is derived from a formal model, constructed for the purpose of proof (see Theorem 4.1). Not surprisingly, to simplify reasoning about the correctness of our code, we overlook several possible optimizations. For example, we heavily rely on lazy evaluation and garbage collection in the underlying language for reasonable performance of our code. It is plausible that this performance can be improved with explicit management. We also rely on fair scheduling in the underlying language to prevent starvation. Nevertheless, preliminary experiments indicate that our code is already quite efficient. In particular, we compare the performance of our library against OCaml’s Event module (Leroy et al. 2008). The implementation of this module is directly based on Reppy’s original design of CML (Reppy 1999). Furthermore, it supports wrapabort, unlike recent versions of CML that favor an alternative primitive, withnack, which we do not support (see footnote 1, p.7). Finally, most other implementations of CML-style primitives do not reflect the standard semantics (Reppy 1999), which makes comparisons with them meaningless. Indeed, some of our benchmarks rely on the symmetry of choose—see, e.g., the swap channel abstraction implemented below; such benchmarks cannot work correctly on a previous implementation of events in Haskell (Russell 2001).3 For our experiments, we use several small benchmark programs that rely heavily on higher-order concurrency. We describe these benchmarks below; their code is available online (Chaudhuri 2009). These benchmarks are duplicated in Haskell and OCaml to the extent possible. Furthermore, to minimize noise due to inherent differences in the implementations of these languages, we avoid the use of extraneous constructs in these benchmarks. Still, we cannot avoid the use of threads, and thus our results may be skewed by differences in the implementations of threads in these languages. We compile these benchmarks using ghc 6.8.1 and ocamlc 3.10.2 (using the −vmthread option in the latter). All these benchmarks run faster using our library than using OCaml’s Event module. Our benchmarks are variations of the following programs.
Next, we adjust the protocol code run by points and channels. Input and output points bound to actions on c now send their conditions and messages to c. A pair of points is matched only if the message sent by one satisfies the condition sent by the other. AtChan :: In tau -> Out tau -> IO () AtChan i o = do { (candidate_i,cond) In tau -> (tau -> Bool) -> IO tau -> IO tau AtPointI s p i cond alpha = do { ...; putMVar i (candidate,cond); x AtPointI s p i cond alpha Just decision -> do { putMVar s (p,decision); ... } } AtPointO :: Synchronizer -> Point -> Out tau -> tau -> IO () -> IO () AtPointO s p o M alpha = do { ...; putMVar o (candidate,M); x AtPointO s p o M alpha
Extended example Recall the example of Section 3.5. This is a simple concurrent program that involves nondeterministic communication; either there is communication on channels x and z, or there is communication on channel y. To observe this nondeterminism, we add guard, wrap, and wrapabort functions to each communication event, which print messages such
3 Nevertheless, we did consider comparing Russell’s implementation with ours on other benchmarks, but failed to compile his implementation with recent versions of ghc; we also failed to find his contact information online.
bly wastes some rounds by matching points that have the same synchronizer (and eventually canceling these matches, since such points can never be selected together). An optimization that eliminates such matches altogether should improve the performance of our implementation. Beyond total running times, it should also be interesting to compare performance relative to each CML-style primitive, to pinpoint other possible sources of inefficiency. We defer a more detailed investigation of these issues, as well as a more robust account of implementation differences between the underlying languages (especially those of threads), to future work. All the code that appears in this paper is available online at:
as "Trying", "Succeeded", and "Failed" for that event at run time. Both the Haskell and the ML versions of the program exhibit this nondeterminism in our runs. Primes sieve This program uses the Sieve of Eratosthenes (Wikipedia 2009) to print all prime numbers up to some n ≥ 2. We implement two versions of this program: (I) uses choose, (II) does not. (I) In this version, we create a “prime” channel and a “not prime” channel for each i ∈ 2..n, for a total of 2 ∗ (n − 1) channels. Next, we spawn a thread for each i ∈ 2..n, that selects between two events: one receiving on the “prime” channel for i and printing i, the other receiving on the “not prime” channel for i and looping. Now, for each multiple j ≤ n of each i ∈ 2..n, we send on the “not prime” channel for j. Finally, we spawn a thread for each i ∈ 2..n, sending on the “prime” channel for i.
http://code.haskell.org/cml/ Additional resources on this project are available at (Chaudhuri 2009; Chaudhuri and Franksen 2009).
(II) In this version, we create a “prime/not prime” channel for each i ∈ 2..n, for a total of n − 1 channels. Next, we spawn a thread for each i ∈ 2..n, receiving a message on the “prime/not prime” channel for i, and printing i if the message is true or looping if the message is false. Now, for each multiple j ≤ n of each i ∈ 2..n, we send false on the “prime/not prime” channel for j. Finally, we spawn a thread for each i ∈ 2..n, sending true on the “prime/not prime” channel for i.
9. Related work
We are not the first to implement CML-style concurrency primitives in another language. In particular, Russell (2001) presents an implementation of events in Concurrent Haskell. The implementation provides guarded channels, which filter communication based on conditions on message values (as in Section 7). Unfortunately, the implementation requires a rather complex Haskell type for event values. In particular, a value of type event tau needs to carry a higher-order function that manipulates a continuation of type IO tau -> IO (). Further, a critical weakness of Russell’s implementation is that the choose combinator is asymmetric. As observed in (Reppy and Xiao 2008), this restriction is necessary for the correctness of that implementation. In contrast, we implement a (more expressive) symmetric choose combinator, following the standard CML semantics. Finally, we should point out that Russell’s CML library is more than 1300 lines of Haskell code, while ours is less than 150. Yet, guarded communication as proposed by Russell is already implemented in our setting, as shown in Section 7. In the end, we believe that this difference in complexity is due to the clean design of our synchronization protocol. Independently of our work, Reppy and Xiao (2008) recently pursue a parallel implementation of a subset of CML, with a distributed protocol for synchronization. As in (Reppy 1999), this implementation builds on ML machinery such as continuations, and further relies on a compare-and-swap instruction. Unfortunately, their choose combinator cannot select among transmit events, that is, their subset of CML cannot express selective communication with transmit events. It is not clear whether their implementation can be extended to account for the full power of choose. Orthogonally, Donnelly and Fluet (2006) introduce transactional events and implement them over the software transactional memory (STM) module in Concurrent Haskell. More recently, Effinger-Dean et al. (2008) implement transactional events in ML. Combining all-or-nothing transactions with CML-style concurrency primitives is attractive, since it recovers a monad. Unfortunately, implementing transactional events requires solving NP-hard problems (Donnelly and Fluet 2006), and these problems seem to interfere even with their implementation of the core CML-style concurrency primitives. In contrast, our implementation of those primitives remains rather lightweight. Other related implementations of events include those of Flatt and Findler (2004) in Scheme and of Demaine (1998) in Java. Flatt and Findler provide support for kill-safe abstractions, extending the semantics of some of the CML-style primitives. On the other hand, Demaine focuses on efficiency by exploiting communication patterns that involve either single receivers or single transmitters. It is unclear whether Demaine’s implementation of non-deterministic communication can accommodate event combinators.
Swap channels This program implements and uses a swap channel abstraction, as described in (Reppy 1994). Intuitively, if x is a swap channel, and we run the program forkIO (do {y event tau swap ch msgOut = guard (do { inCh let (msgIn, outCh) = x in do { sync (transmit outCh msgOut); return msgIn } ), wrap (transmit ch (msgOut, inCh)) (\_ -> sync (receive inCh)) ] } ) Communication over a swap channel is already highly nondeterministic, since one of the ends must choose to send its message first (and accept the message from the other end later), while the other end must make exactly the opposite choice. We add further nondeterminism by spawning multiple pairs of swap on the same swap channel. Buffered channels This program implements and uses a buffered channel abstraction, as described in (Reppy 1992). Intuitively, a buffered channel maintains a queue of messages, and chooses between receiving a message and adding it to the queue, or removing a message from the queue and sending it. Our library performs significantly better for all except one of these benchmarks—for the swap channels benchmark, the difference is only marginal. Note that in this case, our protocol possi-
Distributed protocols for implementing selective communication date back to the 1980s. The protocols of Buckley and Silberschatz (1983) and Bagrodia (1986) seem to be among the earliest in this line of work. Unfortunately, those protocols are prone to deadlock. Bornat (1986) proposes a protocol that is deadlock-free assuming communication between single receivers and single transmitters. Finally, Knabe (1992) presents the first deadlock-free protocol to implement selective communication for arbitrary channel communication. Knabe's protocol appears to be the closest to ours: channels act as locations of control, and messages are exchanged between communication points and channels to negotiate synchronization. However, Knabe assumes a global ordering on processes and maintains queues for matching communication points; our protocol requires neither of these facilities. Furthermore, as in (Demaine 1998), it is unclear whether the protocol can accommodate event combinators. Finally, our work should not be confused with Sangiorgi's translation of the higher-order π-calculus (HOπ) to the π-calculus (Sangiorgi 1993). While HOπ allows processes to be passed as values, it does not immediately support higher-order concurrency: for instance, processes cannot be modularly composed in HOπ. On the other hand, it may be possible to show alternate encodings of the process-passing primitives of HOπ in π-like languages, via an intermediate encoding with CML-style primitives.

10. Conclusion

In this paper, we show how to implement higher-order concurrency in the π-calculus, and thereby how to encode CML's concurrency primitives in Concurrent Haskell, a language with first-order message passing. We appear to be the first to implement the standard CML semantics for event combinators in this setting. An interesting consequence of our work is that implementing selective communication à la CML on distributed machines reduces to implementing first-order message passing on such machines. This resolves a question raised in (Peyton-Jones et al. 1996). At the heart of our implementation is a new, deadlock-free protocol that is run among communication points, channels, and synchronization applications. This protocol seems to be robust enough to allow implementations of sophisticated synchronization primitives, even beyond those of CML.

Acknowledgments Thanks to Cormac Flanagan for suggesting this project for his Spring 2007 Concurrent Programming course at UC Santa Cruz. Thanks to Martín Abadi, Jeff Foster, and several anonymous referees of Haskell'07 and ICFP'09 for their comments on this paper. Finally, thanks to Ben Franksen for maintaining the source package for this library at HackageDB. This work was supported in part by NSF under grants CCR-0208800 and CCF-0524078, and by DARPA under grant ODOD.HR00110810073.

References

R. Bagrodia. A distributed algorithm to implement the generalized alternative command of CSP. In ICDCS'86: International Conference on Distributed Computing Systems, pages 422–427. IEEE, 1986.
R. Bornat. A protocol for generalized Occam. Software Practice and Experience, 16(9):783–799, 1986. ISSN 0038-0644.
G. N. Buckley and A. Silberschatz. An effective implementation for the generalized input-output construct of CSP. ACM Transactions on Programming Languages and Systems, 5(2):223–235, 1983. ISSN 0164-0925.
A. Chaudhuri. A Concurrent ML library in Concurrent Haskell, 2009. Links to proofs and experiments at http://www.cs.umd.edu/~avik/projects/cmllch/.
A. Chaudhuri and B. Franksen. HackageDB cml package, 2009. Available at http://hackage.haskell.org/cgi-bin/hackage-scripts/package/cml.
E. D. Demaine. Protocols for non-deterministic communication over synchronous channels. In IPPS/SPDP'98: Symposium on Parallel and Distributed Processing, pages 24–30. IEEE, 1998.
K. Donnelly and M. Fluet. Transactional events. In ICFP'06: International Conference on Functional Programming, pages 124–135. ACM, 2006.
L. Effinger-Dean, M. Kehrt, and D. Grossman. Transactional events for ML. In ICFP'08: International Conference on Functional Programming, pages 103–114. ACM, 2008.
M. Flatt and R. B. Findler. Kill-safe synchronization abstractions. In PLDI'04: Programming Language Design and Implementation, pages 47–58. ACM, 2004. ISBN 1-58113-807-5.
A. D. Gordon. Functional programming and Input/Output. Cambridge University Press, 1994. ISBN 0-521-47103-6.
C. A. R. Hoare. Communicating sequential processes. Communications of the ACM, 21(8):666–677, 1978.
F. Knabe. A distributed protocol for channel-based communication with choice. In PARLE'92: Parallel Architectures and Languages, Europe, pages 947–948. Springer, 1992. ISBN 3-540-55599-4.
X. Leroy, D. Doligez, J. Garrigue, D. Rémy, and J. Vouillon. The Objective Caml system documentation: Event module, 2008. Available at http://caml.inria.fr/pub/docs/manual-ocaml/libref/Event.html.
R. Milner, J. Parrow, and D. Walker. A calculus of mobile processes, parts I and II. Information and Computation, 100(1):1–77, 1992.
S. L. Peyton-Jones and P. Wadler. Imperative functional programming. In POPL'93: Principles of Programming Languages, pages 71–84. ACM, 1993.
S. L. Peyton-Jones, A. D. Gordon, and S. Finne. Concurrent Haskell. In POPL'96: Principles of Programming Languages, pages 295–308. ACM, 1996.
J. H. Reppy. Higher-order concurrency. PhD thesis, Cornell University, 1992. Technical Report 92-1852.
J. H. Reppy. First-class synchronous operations. In TPPP'94: Theory and Practice of Parallel Programming. Springer, 1994.
J. H. Reppy. Concurrent programming in ML. Cambridge University Press, 1999. ISBN 0-521-48089-2.
J. H. Reppy and Y. Xiao. Towards a parallel implementation of Concurrent ML. In DAMP'08: Declarative Aspects of Multicore Programming. ACM, 2008.
G. Russell. Events in Haskell, and how to implement them. In ICFP'01: International Conference on Functional Programming, pages 157–168. ACM, 2001. ISBN 1-58113-415-0.
D. Sangiorgi. From π-calculus to higher-order π-calculus, and back. In TAPSOFT'93: Theory and Practice of Software Development, pages 151–166. Springer, 1993.
Wikipedia. Sieve of Eratosthenes, 2009. See http://en.wikipedia.org/wiki/Sieve_of_Eratosthenes.
Experience Report: OCaml for an Industrial-Strength Static Analysis Framework Pascal Cuoq ∗
Julien Signoles
with Patrick Baudin, Richard Bonichon, Géraud Canet, Loïc Correnson, Benjamin Monate, Virgile Prevosto, Armand Puccetti
CEA LIST, Software Reliability Labs, Boite 65, 91191 Gif-sur-Yvette Cedex, France [email protected]
Abstract
2.1
This experience report describes the choice of OCaml as the implementation language for Frama-C, a framework for the static analysis of C programs. OCaml became the implementation language for Frama-C because it is expressive. Most of the reasons listed in the remainder of this article are secondary reasons: features which are not specific to OCaml (modularity, availability of a C parser, control over the use of resources, ...) but which could have prevented the use of OCaml for this project if they had been missing.
It is typical for an article of this nature to include a few words to the effect that it is harder to find people who can program in functional language Y (Minsky 2007; Nanavati 2008) than in C++, sometimes nuanced by more words pointing out that this is balanced by the higher quality of Y candidates. The first proposition did not apply for us in the case of Frama-C and OCaml. CEA LIST is an applied research laboratory that recruits almost exclusively PhDs. When the choice is restricted to candidates with a PhD in the field of formal methods, it is not harder to find a candidate with the motivation to program in OCaml than in C++.
Categories and Subject Descriptors D1.1 [Programming techniques]: Applicative (Functional) Programming General Terms
1.
2.2
Design, Languages, Verification
Objectives of the Frama-C project
Although it is developed by research institutes, Frama-C tries to fulfill focused, specific needs expressed by industrial partners. It aims past the R&D departments and into the hands of the engineers who develop embedded code in any industry with criticality issues1. Frama-C is structured as a kernel to which different analysis plug-ins are connected. It is composed as a whole of 100 to 200 thousand lines of OCaml2. All this OCaml code provides a wide range of functionalities, but the plug-ins do not just sit side by side. It is more accurate to think of them as built on top of each other. To give an example, a value analysis plug-in computes supersets of possible values for the variables of the program (Canet et al. 2009), indexed by statement of the original program. Unions, structs, arrays and casts thereof are handled with the precision necessary for embedded code (Cuoq 2008). These values, especially the values of pointers and expressions used as indices in arrays, are used by another plug-in to compute synthetic functional dependencies between the inputs and the outputs of each analyzed function. These synthetic dependencies are in turn used by a slicer plug-in which produces simplified, compilable C programs that are guaranteed to be equivalent to the original for the slicing criterion. And the building blocks of this slicer are used by one of Frama-C's most sophisticated plug-ins to date, a security-aware slicer that preserves the confidentiality of information: the (functional) confidentiality is guaranteed to be exactly the same in the sliced program as in the original program (Monate and Signoles 2008). This means that it is safe to study functional confidentiality on the (smaller) sliced program, for instance for a security audit of the source code. This would not be possible with a traditional slicer, because a traditional
Introduction
Frama-C is a framework that allows static analyzers, implemented as plug-ins, to collaborate towards the study of a C program. Although it is distributed as Open Source, Frama-C is very much an industrial project, both in the size it has already reached and in its intended use for the certification, quality assurance, and reverse-engineering of industrial code. Frama-C is written in OCaml, and this article reports on the most noticeable consequences of this choice on the human (section 2) and technical (section 3) levels, as well as providing an overview of the implementation of Frama-C (section 4).
2.
Recruiting OCaml programmers
Human Context
Frama-C is developed collaboratively between the ProVal team (a joint laboratory of INRIA Saclay Île-de-France and LRI) and CEA LIST. This article describes our (CEA LIST) own analysis of this collaborative development. The Open Source nature of the software and other partnerships involving CEA LIST mean that Frama-C is in fact developed at three different sites, by around ten full-time programmers, with infrequent inter-site face-to-face meetings.
∗ This work has been supported by the French RNTL project CAT ANR05RNTL00301
1 or even without criticality issues, but so far the industrial interest has come from critical embedded systems
2 The command wc `find . -name \*.ml -o -name \*.mli` reports 220,000 lines as of this writing, which include comments, the CIL and ocamlgraph libraries, and some testing scripts
slicer might remove information leaks — in particular, a malicious programmer could insert information leaks that he knows the traditional slicer used for the audit will remove.
3.2 Control over the use of resources
3. Technical context
One of Frama-C's first plug-ins was a value analysis based on abstract interpretation. This plug-in computes supersets of possible values for expressions of the program. Among other things, these over-approximated sets are useful to exclude the possibility of a run-time error. In contrast with the heuristic techniques used in other static analysis tools, which may be very efficient but solve a different problem, shortcuts do not work when the question is to correctly — that is, without false negatives — find all possible run-time errors in large embedded C programs. The analysis has to start from the entry point of the analyzed program and unroll function calls (and often loops). In addition, the modular nature of Frama-C and the interface that the value analysis aimed at providing to other plug-ins meant that abstract states had to be memorized at each statement, which dictated the choice of persistent data structures, with sharing between identical sub-parts of states (Cuoq and Doligez 2008). This meant at the very least that a garbage-collected language had to be used. While there are popular imperative languages that are garbage-collected nowadays, and some of these languages have huge libraries of data structures ready to use, persistent data structures are often under-represented in these libraries, and are slightly annoying to write in these languages. For writing a static analyzer, one is not worse off with OCaml and a few select libraries (Conchon et al. 2008; Filliâtre and Conchon 2006) than with any of Python, the .NET framework or the Java platform, although it is by no means impossible to write static analyzers in any of these. By contrast, in the development of Caveat in C++, there were explicit deallocation functions, and sanity checks that warned at the end of a session if deallocation had been forgotten for some nodes. Programming time that could have been spent usefully was spent writing the calls to the deallocation primitives and the source code's readability was diminished, but this is negligible compared to the time that had to be spent debugging (the warnings told the developer that the deallocation had been forgotten, not where it should have been done). The solution to a subtle deallocation problem was often to make copies of existing nodes (although it is not meaningful to compare the memory consumption of Caveat to that of Frama-C's value analysis, because they work on different principles). To conclude, in Frama-C, garbage collection paradoxically enables a tighter control of memory usage than explicit deallocation because it makes sharing possible in practice. OCaml is a strict language. We do not know what the influence of lazy evaluation would be on memory use.
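To make the point about sharing concrete, the following is a minimal sketch (not Frama-C code; the module name, the state type and the update function are made up for illustration) of how a persistent map lets one memorize one abstract state per statement without copying, because an update shares the unmodified parts of the old state.

    (* Minimal sketch, not Frama-C code: persistent states share structure. *)
    module VarMap = Map.Make (String)

    type abstract_state = int list VarMap.t   (* variable -> possible values *)

    (* Updating returns a new state; unmodified branches of [st] are shared. *)
    let update (st : abstract_state) (x : string) (vs : int list) : abstract_state =
      VarMap.add x vs st

    let () =
      let s0 = VarMap.(empty |> add "x" [0] |> add "y" [1; 2]) in
      let s1 = update s0 "x" [0; 3] in
      (* Both states remain usable and nothing is deallocated explicitly;
         the binding for "y" is physically shared between s0 and s1. *)
      assert (VarMap.find "y" s0 == VarMap.find "y" s1)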
3.1 Expressivity
Figure 1. Frama-C's value analysis displaying results in the GUI
2.3 History of Frama-C
At the Software Reliability Labs, Frama-C was initially seen as a major evolution to Caveat (Baudin et al. 2002; Randimbivololona et al. 1999; Delmas et al. 2008), a software verification tool for C programs based on Hoare logic. Caveat is used by Airbus for part of the verification of part of the software embedded in the A380. As the DO-178B standard mandates, Caveat has been qualified by Airbus as a verification tool to be used for the certification of this particular software. Caveat is supported by the Software Reliability Labs but new ideas are now developed within Frama-C. OCaml was pushed as the implementation language to choose for Frama-C by the new hires, but the actual reason it was accepted is that OCaml was not completely unheard of to the senior researchers there. Indeed, OCaml was already used in the (predominantly C++) Caveat project as the scripting language that allows an interactive validation process to be re-played in batch mode.
3.3 Existence of CIL
CIL (Necula et al. 2002) is an OCaml library that provides a parser and Abstract Syntax Tree (AST)-level linker for C code. CIL is well documented and provides ready-made generic analyses that are very useful to get a prototype analyzer started. Unlike the earlier-mentioned data structures that are nice to find in OCaml but could be written quickly if they were missing, having to write a C parser when one's goal is to write a static analyzer would be a major time-sink. It would have been a significant counter-argument to the choice of OCaml if such a library had not existed. We were probably a little bit lucky, and OCaml is certainly at a disadvantage with respect to other languages from this point of view. We weren't perhaps lucky to find a C parser so much as to be working in a field that is also of interest to academia. So many researchers in software engineering use OCaml for their experiments that despite the amount of work involved, such a parser could be expected to get written someday. On the other hand, finding a library, or even bindings to an existing library, for an OCaml project in a field that does not interest students and researchers could be a problem.
OCaml’s expressivity was crucial in the adoption phase of the language. An initial one-person internal prototype was able to produce results that convinced management to extend the experiment to two programmers. Eventually, this prototype was able to persuade industrial users to get involved in a national project. To be precise, industrial partners agreed to be part of the project because of their previous experiences with the providers of static analyzers in the project. Some of the tools that they were familiar with were written in OCaml (Caduceus by the ProVal team), and some of them weren’t (Caveat). The time it takes to build these relationships should not be underestimated, and we are not saying that the choice of any programming language can shorten it. However, the progress made by Frama-C after the project had started, which can at least partly be attributed to OCaml, convinced the industrial participants to become involved beyond expectations. At each phase of the bootstrap process, OCaml’s expressivity was important in quickly writing the proof-of-concept that took the project to the next stage.
This drawback (fewer developers means fewer available libraries) is mitigated by higher code re-usability when you do find code to re-use, by the existence of the Caml Hump3, and by the fact that the grapevine works well between OCaml developers. OCaml does not have anything that begins to compare with, for instance, CPAN, the Comprehensive Perl Archive Network, but it does have a healthy community.
with the help of source distribution systems such as GARNOME or MacPorts. In this case, retrieving and installing the dependencies of the gtksourceview1 library (a GTK + widget for displaying source code) is a pain. This is not directly an OCaml problem, but another development platform could have made it possible to use the modern gtksourceview2 (which solves the dependencies problem) or provided more toolkits to choose from initially (to the best of our knowledge, OCaml only offers Tk and GTK + at this time). Now that gtksourceview2 has become stable, there is talk on the lablgtk development list about including it in lablgtk. This is anyway a very minor quibble. It should be kept in mind for comparison that Java or .NET/Mono do not come pre-installed on every platform either. Again, we have no reason to regret the choices of OCaml and GTK + from the standpoint of portability.
3.4 Portability
It would have been an annoyance if the development platform for Frama-C had allowed it to be used only on Unix variants, because many potential users only have access to Windows stations. Unix being the system used by the majority of the researchers at the Software Reliability Labs, the switch to a Windows-only platform was not considered. Motivated users — and at this time actually deploying formal methods requires motivation anyway — have found their way past this limitation for previous projects developed at the Labs. From this point of view, the choice of OCaml (and later of GTK + as the toolkit for the graphical interface, through the lablgtk bindings) was an excellent compromise, with Unix clearly being the primary platform of the compiler, and Win32 robustly supported through either Cygwin or Visual C++. Compiling a large OCaml project on Windows+Cygwin is slow. This is probably caused one way or the other by the use of Unix compilation idioms (configuration script, makefile) on an OS where things are usually done differently, and is not a limitation in this context. It should be noted that many OCaml developments are referenced in the source distribution GODI, with dependency lists and automated compilation scripts. All of those Frama-C dependencies that are written in OCaml are referenced in GODI, and Frama-C itself is. Some Frama-C users who have no interest in OCaml outside of Frama-C have found that this was the most convenient installation path for them. Frama-C has been tested under Windows, Mac OS X, Solaris (32-bit), OpenBSD and Linux. Binaries are also distributed for some of these platforms.
OCaml as a scripting language It should be noted that while this is not its strongest point, OCaml is an acceptable scripting language. In other words, when OCaml is chosen as the main language for a new project, the project may be saved the introduction of additional dependencies towards various dedicated scripting languages down the road. For instance, the HTML pages of the Frama-C web site are processed with the yamlpp preprocessor4, which is written in OCaml. For comparison, in its 15 years of development, Caveat had at one point accumulated dependencies towards Perl, Python, Bash and Zsh (in addition to C++, and in addition to OCaml, used for journalizing and re-playing). Some of these dependencies have since been removed, by replacing some tools with OCaml equivalents. Whatever the main language, discipline is obviously the foremost factor in avoiding “dependency creep”.
3.5 Module system
OCaml’s module system (Leroy 1996) has direct advantages: it creates separate namespaces and, when the modules are in separate files and interfaces have been defined, fully type-checked separate compilation. It is easy to underestimate the importance of these features in the management of a big project because they make the compiler transparent, but when they are missing, their absence is unpleasantly noticeable. We discuss these, and the (theoretically more interesting) functor system per se.
64-bit readiness A 64-bit address space is available to OCaml programs for many 64-bit platforms. Frama-C compiles in both 32- and 64-bit mode, and the resulting versions are functionally identical. This was not a big effort, as OCaml encourages to write high-level code, for instance by providing bignums. For some 64bit-aware platforms (Mac OS X), it is a simple configure option to choose between 32-bit or 64-bit pointers at the time of compiling OCaml. For others, it is troublesome to go against the default size (Linux). With Linux, getting an OCaml compiler with a word size different from the distribution default is akin to building a crosscompiler. However, efforts are under way in the OCaml community to improve support for cross-compilation, including from Linux to Win32. We are looking forward to the maturation of such initiatives. Availability of a graphical toolkit Frama-C uses the GTK + toolkit for its graphical user interface. This section does not discuss the merits or demerits of this toolkit with respect to others. The question it tries to answer is “If I choose OCaml for a software project, will I find one satisfactory toolkit to design the user interface with?”. The choice of GTK + for Frama-C was somewhat arbitrary, but it allows to give a positive answer to the question above, without prejudice to other available toolkits. Our experience is that for some Unix variants (Solaris, Mac OS X, very old Linux distributions), it is necessary to obtain and compile the missing GTK + libraries manually, or semi-manually
Separate compilation With OCaml, in bytecode, separate compilation has the same meaning as everywhere: compilation is parallelizable and only modified files need to be recompiled, with a quick final link phase. With native compilation, all the ancestors of the modified modules in the dependency graph must be recompiled, and the compilation of two files with a parenthood relationship can not be parallelized. Depending on the structure of an OCaml project, recompilation after an incremental change in a low-level module may sometimes feel longish, but in truth, it is much faster to recompile Frama-C with ocamlopt than to recompile Caveat with g++. The existence of two OCaml compilers, one with blazingly fast compilation in general, the other with acceptable recompilation time and producing reasonably fast code, allows very short modifyrecompile-test cycles. Again, it is easy to take short recompilation times for granted but with other languages, when a software project grows in size, this can sometimes be lost, and sorely missed. The OCaml compiler tries very hard not to get in the way between a programmer and his program, and it does not force the programmer to write interfaces. However, if the interface m.mli is missing for module M, the compiled interface is generated from m.ml. This means that any change to m.ml changes the compiled interface and forces the recompilation of every module that uses M, even in bytecode. In a large project, modules should always have
3 The Caml Hump is an informal central repository for OCaml libraries
4 http://www.lri.fr/~filliatr/yamlpp.en.html
interfaces, if only for the sake of separate compilation. OCaml has an option to generate automatically the interface m.mli that exports everything from M.
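As a small, hypothetical illustration (the module and its values are invented for the example), a hand-written m.mli only exposes what clients need, so edits to m.ml that leave this file untouched do not force clients to be recompiled:

    (* m.mli -- hypothetical interface for a module M (implemented in m.ml). *)
    type result                                    (* the representation stays hidden *)
    val of_int : int -> result
    val pretty : Format.formatter -> result -> unit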
when a function takes several arguments of the same type with no obvious normal order between them. The only language that we know of with a feature vaguely similar to OCaml’s labels is Objective C’s infix notation for function calls. Syntax begets style. The “OCaml style” is to write pure functions unless an exception needs to be made because the syntax rewards the use of immutable definitions, as seen in the following: let x = 2 in ... x ... let x = ref 2 in ... !x ... We argue that using labels rewards consistent naming schemes in a similar fashion. When it is common for an argument to be passed repeatedly as-is from caller to callee without any computations actually happening to it (and in a persistent setting as much of FramaC is, this happens with a lot of arguments), the labels syntax rewards the consistent choice of a unique label and eponymous variable name for this argument by a very concise syntax. In this example, the function f is being defined and calls functions g and eval.
Separate namespaces for compilation units Orthogonally to separate compilation, but as importantly for big projects, OCaml’s module system provides separate namespaces. Better yet, the compiler option -pack allows to group several compilation units into a namespace. As a consequence, compilation units that have been put into different packs may safely use the same name for a type, variable or module. For instance the types tree in files both called m.ml in packs lib1 and lib2 are seen as Lib1.M.tree and Lib2.M.tree. This feature is very useful for libraries because libraries may use very common filenames (util.ml) with the guarantee that there will not be a clash at link-time for users of this library (on condition that the pack name itself is unique). In Frama-C, plug-ins are independent from each other: each plug-in only interfaces with the Frama-C kernel, and does not see the implementation details of other plug-ins. In order to implement this separation, the Frama-C system automatically packs each plugin. Thus, two different plug-ins may use files with identical names and still be linked together within Frama-C.
    let f ~mode ~env x y =
      let context = ... in
      ... g ~mode ~context (x+y) ...
      ... eval ~mode ~context ~env ...
If the programmer deviates from this style by using different label names or variable names for mode, context, or env, he receives a gentle slap on the wrist in the form of the awkward ~context:computation_context syntax. This changes the way of reading labels-enabled OCaml programs, too. The reader can put more trust in the names of variables, without having to look for context all the time. The level of obtrusiveness of the label syntax is exactly the same as with the definition of mutable values, and it is exactly right, too. Good style is encouraged but the system can be circumvented when it needs to. Optional arguments (a syntax for giving a labeled argument a default value if it is omitted) are convenient when the consequences of the omission (and subsequent use of the default value) are visible and traceable (for instance, to provide a toolkit interface that is both powerful and beginner-friendly). It is in general a bad idea to use an optional argument to add a new mode to an existing function, both because of all the existing calls to this function — that the compiler would be glad to help the programmer inspect if s/he did not use an optional argument — and because of all the calls to be written in the future where the optional argument will be omitted by accident.
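A small invented example of the two situations described above: a default value whose omission is local and visible (a toolkit-style convenience), as opposed to hiding a new mode behind an optional argument.

    (* Illustrative only; the function and its parameters are made up. *)
    let create_window ?(width = 640) ?(height = 480) title =
      Printf.sprintf "%s (%dx%d)" title width height

    let w1 = create_window "log viewer"            (* defaults used, visibly *)
    let w2 = create_window ~width:1024 "editor"    (* one default overridden *)
    let () = print_endline w1; print_endline w2

    (* By contrast, threading a new behaviour through an ?experimental_mode
       argument would let every existing call site silently keep the old
       behaviour, which is the pitfall discussed above. *)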
Interfaces and functors The possibility to write functors (modules that are parameterized by other modules or functors), introduced before objects (at the time of Caml Special Light), has proved a workable, and completely statically checked, alternative to object-oriented programming. We use OCaml objects only when interfacing with existing OCaml code that uses objects (CIL and lablgtk), and use functors for the rest. Some very structural idioms seem destined to be expressed with objects (or, for that matter, class types): equality, prettyprinting, hashconsing or marshaling functions5 . Most of our data structure are complicated enough that automatically produced pretty-printers or equality functions would not fit the bill. Consequently, it is in our case neither more nor less tedious to write modules (and interfaces) that sport, in addition to a type t, functions such as pretty: Format.formatter -> t -> unit and equal: t -> t -> bool, and for unmarshaling values of a hashconsed type, rehash: t -> t. But, speaking of equality, it should be noted on the other hand that OCaml’s polymorphic comparison functions (including =, >= and even ==) are dangerous pitfalls. The type-checker does not complain when they are applied wrongly instead of, for instance, equal above. In OCaml, the module system allows to encapsulate the definitions of data structures, and in particular to give a purely functional interface to a sophisticated data structure that uses mutable values internally for optimization. With this compromise, the amount of stateful information that the programmer has to keep in mind is limited by the module boundaries, and the implementation’s algorithmic complexity may be better than that of all known pure implementations. Some in the OCaml community call such an impure module “persistent” (Conchon and Filliˆatre 2007). In fact, some positive reports on the industrial use of Haskell (Nanavati 2008) resonate deeply with our own programming experience, except that we attribute to OCaml’s module system the advantages attributed there to Haskell’s purity. 3.6
4. Development of Frama-C
There are a number of features in Frama-C's architecture that any Frama-C developer must be aware of. The goal of this section is not to provide a complete list — which can be found in the Frama-C plug-in development guide (Signoles 2008) — but to give a mildly technical overview of each interesting one, with reference to the OCaml feature(s) that make its implementation possible.
4.1 Software architecture
The software architecture of Frama-C is plug-in-oriented. This architecture allows fine-grained collaboration of analysis techniques (as opposed to the large-grain collaboration that happens when one technique is used in a first pass and another in a second pass). As a consequence, mutual recursion between plug-ins must be possible: a plug-in A must be able to use a plug-in B that uses A. A Frama-C plug-in is a packed compilation unit (see Section 3.5) but, unfortunately, OCaml does not support mutually-recursive compilation units. This problem is circumvented pragmatically by using references to functions placed in a module that all plug-ins are allowed to depend on.
Labels and optional arguments
OCaml allows the use of labels for function arguments. This feature does not make anything possible that was not already, but in practice, labels provide a concise way to remove the risk of confusion
5 In the presence of hashconsing, not only do you have to write your own unmarshaling functions, but they are extremely tricky to get right
This “central directory” module is called db.ml and a snippet of it may look like:
Frama-C is able to handle several ASTs simultaneously. This allows to build slicing plug-ins where an original AST is navigated through while a reduced AST is being built. Each of these ASTs has its own state (containing for instance the results of the analyses that have been run on this AST). The AST and corresponding state form what is called a project (Signoles 2009). The desirable “safety property” of projects is the absence of interference between two distinct projects. To enforce this property, each global mutable value of Frama-C must be “projectified”. A set of functors are provided to this effect (these functors add a project-aware indirection to any mutable data that is used by any of the functions made visible by the plug-in). We wish OCaml’s type system helped us enforce this rule, but we plan to move to dynamic tags to detect at least at analysis-time when a variable that should have been projectified wasn’t.
    (* db.ml: kernel database of plug-in stubs *)
    module Plugin1 : sig
      val run : (unit -> unit) ref
    end
    module Plugin2 : sig ... end

During its initialization, a plug-in registers each of its exported functions in the appropriate stub in Db — another OCaml feature is that each compilation unit can have its own initialization effects.

    (* plugin1_register.ml *)
    let run () = ... (* the analysis goes here *)

    (* registration of [run] in the kernel *)
    let () = Db.Plugin1.run := run

Though this solution is the most common way to break mutual recursion between compilation units, it has trade-offs. Firstly, polymorphic functions may not be registered this way. This is not an issue here: each plug-in is a static analyzer, and none of the analyzers we have written so far wanted to provide polymorphic functions. Secondly, the types of the registered functions have to be known by the Frama-C kernel. Here again, that is not a big issue in our context, especially because Frama-C encourages the use of ACSL (Baudin et al. 2008), a common specification language, as the lingua franca to transmit knowledge between plug-ins. Finally, the most significant trade-off with this solution is that any plug-in that wishes to provide an interface to other plug-ins (as opposed to interacting with the user only) needs to modify some well-identified parts of the Frama-C kernel. This has not been a problem so far because, for now, plug-ins written outside Frama-C's development team have all been dedicated to answering a specific problem, as opposed to providing computations to help other plug-ins.
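For completeness, a sketch of how another component would go through the stub registered above; this only reuses the Db.Plugin1.run reference from the snippet, not Frama-C's full registration machinery.

    (* some_other_plugin.ml -- sketch: invoke Plugin1 without a direct dependency *)
    let run_plugin1_if_requested enabled =
      if enabled then !Db.Plugin1.run ()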
4.5 Journalization
During an interactive session, Frama-C journalizes most of the actions that modify its global state. This means that, like Caveat, it generates an OCaml script retracing what happened during the session. The journal may be compiled and statically or dynamically linked with the Frama-C kernel in order to replay the same actions. Furthermore, the journal can be used to grasp Frama-C's internals (translating GUI actions into function calls), and it is possible to modify it before compiling and replaying it. As for dynamic loading, phantom types make it possible to implement this feature safely (see Section 4.6).
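A journal is thus just an OCaml program. A toy version of what a generated journal could contain (the call below reuses the hypothetical Db.Plugin1.run stub from Section 4.1, not the real journalization API) might be:

    (* frama_c_journal.ml -- illustrative sketch of a replayable journal *)
    let () =
      (* re-run the analysis that was triggered interactively *)
      !Db.Plugin1.run ();
      print_endline "session replayed"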
4.6 Phantom types for dynamic typing in a static setting
Both dynamic loading and journalization rely on phantom types (Rhiger 2003). Phantom types — parameterized types which employ their type variables for encoding meta-information — are used in both cases to ensure the dynamic safety of function calls which cannot be checked by the OCaml type system. Indeed we provide a library of dynamic typing. Its implementation requires the use of unsafe features (through OCaml standard library’s module Obj) but phantom types allow to provide a safe interface: the use of the library cannot break type safety (as long as there is no implementation error in the library).
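The following is a deliberately small sketch of the idea (the names and the representation are invented, and a real library needs more care): a phantom type parameter on witnesses keeps uses of Obj well typed at the interface, so clients cannot build an ill-typed value even though Obj is used internally.

    (* Sketch only: phantom-typed witnesses guarding an untyped representation. *)
    type 'a witness = { name : string }     (* 'a is phantom: unused on the right *)
    type dyn = { tag : string; repr : Obj.t }

    let int_w : int witness = { name = "int" }
    let string_w : string witness = { name = "string" }

    let pack (w : 'a witness) (v : 'a) : dyn = { tag = w.name; repr = Obj.repr v }

    let unpack (w : 'a witness) (d : dyn) : 'a =
      if d.tag = w.name then Obj.obj d.repr
      else failwith ("dynamic type error: expected " ^ w.name ^ ", got " ^ d.tag)

    let () =
      let d = pack int_w 42 in
      assert (unpack int_w d = 42)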
4.2 Dynamic loading of plug-ins
OCaml has allowed dynamic linking of bytecode compilation units (through module Dynlink) for a long time. In OCaml 3.11, dynamic linking of native code became available for a large number of target architectures. Frama-C uses dynamic linking where available in order to provide dynamic loading of plug-ins. This is an alternative way to plug analyzers into the Frama-C kernel. When dynamic linking is used for the plugging, the plug-in’s functions are registered in a global table in the kernel at load-time. Because all functions do not have the same ML type, phantom types (Rhiger 2003) are used in order to dynamically ensure the program’s safety (see Section 4.6). Dynamic linking solves two out of three issues of static linking: it ceases to be necessary for the kernel to be aware of the types of all plug-ins’ exported functions, and it becomes more convenient to distribute a plug-in separately from Frama-C (in particular a plugin no longer needs to patch the kernel). 4.3
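A minimal sketch of the load-time side, using only the standard Dynlink module (error handling and the plug-in search path are simplified; the registration itself happens in the plug-in's own initialization code, as in Section 4.1):

    (* Sketch: dynamically load one compiled plug-in; its top-level effects run
       at load time and register its functions in the kernel's tables. *)
    let load_plugin (file : string) : unit =
      try Dynlink.loadfile file
      with Dynlink.Error e -> prerr_endline (Dynlink.error_message e)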
4.4 Multi-project framework
5. Conclusion
We have not yet considered the point of view of the external Frama-C plug-in developer. We hope to see in the future many useful plug-ins written outside the circle of the initial developers. It is too early to draw conclusions on the consequences of the choice of OCaml as the platform's language for this goal. Responses so far have ranged from the enthusiastic (“and it's even written in OCaml”) to outright rejection (“[...] drawback that the extensions have to be written in Ocaml [sic]”), with, in the middle, at least one person who decided to learn OCaml because there was something s/he wanted to do with Frama-C.
4.3 Impure functional programming
Most analyses in Frama-C are written in a functional style. However Frama-C’s value analysis (whose results are used by many other plug-ins) relies on hashconsing (Filliˆatre and Conchon 2006) and memoization, which are both implemented with mutable data structures. More generally, Frama-C makes use of imperative features in order to improve efficiency. For instance, the abstract syntax tree (inherited from CIL) contains many mutable fields. Besides, Frama-C has a global state which is composed of many global tables.
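As an illustration of the memoization mentioned above (a generic sketch, not the hashconsing-aware tables Frama-C actually uses), a mutable table can hide behind an otherwise functional interface:

    (* Sketch: memoize an int-keyed function with a hidden mutable table. *)
    let memoize (f : int -> 'a) : int -> 'a =
      let table : (int, 'a) Hashtbl.t = Hashtbl.create 17 in
      fun n ->
        match Hashtbl.find_opt table n with
        | Some r -> r
        | None ->
            let r = f n in
            Hashtbl.add table n r;
            r

    let rec fib n = if n < 2 then n else fib (n - 1) + fib (n - 2)
    let fib' = memoize fib   (* only top-level calls are cached in this sketch *)
    let () = Printf.printf "%d\n" (fib' 30)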
Acknowledgments We would like to acknowledge the help of our colleagues at ProVal, at INRIA Sophia Antipolis' projects Everest and now Marelle, and at CEA LIST, in the building of Frama-C. The feedback of users of Frama-C within the CAT project, at Fraunhofer FIRST or elsewhere has been great. The anonymous referees suggested various improvements to this experience report. Special thanks go to the developers of the OCaml system.
Keywords OCaml, software architecture, plug-ins, static analysis
References
Patrick Baudin, Anne Pacalet, Jacques Raguideau, Dominique Schoen, and Nicky Williams. Caveat: a tool for software validation. In Dependable Systems and Networks, 2002, pages 537+, 2002.
Patrick Baudin, Jean-Christophe Filliâtre, Thierry Hubert, Claude Marché, Benjamin Monate, Yannick Moy, and Virgile Prevosto. ACSL: ANSI C Specification Language (preliminary design V1.4), preliminary edition, October 2008. URL http://frama-c.cea.fr/acsl.html.
Géraud Canet, Pascal Cuoq, and Benjamin Monate. A value analysis for C programs, 2009. To appear in the proceedings of SCAM 2009.
Sylvain Conchon and Jean-Christophe Filliâtre. A persistent union-find data structure. In ML '07: Proceedings of the 2007 workshop on ML, pages 37–46, New York, NY, USA, 2007. ACM. ISBN 978-1-59593-676-9. doi: http://doi.acm.org/10.1145/1292535.1292541.
Sylvain Conchon, Jean-Christophe Filliâtre, and Julien Signoles. Designing a generic graph library using ML functors. In Marco T. Morazán, editor, Trends in Functional Programming, volume 8 of Trends in Functional Programming, pages 124–140. Intellect, UK/The University of Chicago Press, USA, 2008. ISBN 978-1-84150-196-3.
Pascal Cuoq. Documentation of Frama-C's value analysis plug-in, 2008. URL http://frama-c.cea.fr/download/frama-c-manual-Lithium-en.pdf.
Pascal Cuoq and Damien Doligez. Hashconsing in an incrementally garbage-collected system: a story of weak pointers and hashconsing in OCaml 3.10.2. In ML '08: Proceedings of the 2008 ACM SIGPLAN workshop on ML, pages 13–22, New York, NY, USA, 2008. ACM. ISBN 978-1-60558-062-3.
David Delmas, Stéphane Duprat, Patrick Baudin, and Benjamin Monate. Proving temporal properties at code level for basic operators of control/command programs. In 4th European Congress on Embedded Real Time Software, 2008.
Jean-Christophe Filliâtre and Sylvain Conchon. Type-safe modular hashconsing. In ML '06: Proceedings of the 2006 workshop on ML, pages 12–19, New York, NY, USA, 2006. ACM. ISBN 1-59593-483-9.
Xavier Leroy. A syntactic theory of type generativity and sharing. Journal of Functional Programming, 6:1–32, 1996.
Yaron Minsky. Caml trading: Experiences in functional programming on Wall Street. In Wouter Swierstra, editor, The Monad.Reader, April 2007.
Benjamin Monate and Julien Signoles. Slicing for security of code. In Peter Lipp, Ahmad-Reza Sadeghi, and Klaus-Michael Koch, editors, TRUST, volume 4968 of Lecture Notes in Computer Science, pages 133–142. Springer-Verlag, March 2008.
Ravi Nanavati. Experience report: a pure shirt fits. SIGPLAN Not., 43(9):347–352, 2008. ISSN 0362-1340.
George C. Necula, Scott McPeak, Shree P. Rahul, and Westley Weimer. CIL: Intermediate language and tools for analysis and transformation of C programs. In International Conference on Compiler Construction, pages 213–228, 2002.
Famantanantsoa Randimbivololona, Jean Souyris, Patrick Baudin, Anne Pacalet, Jacques Raguideau, and Dominique Schoen. Applying formal proof techniques to avionics software: A pragmatic approach. In FM '99: Proceedings of the World Congress on Formal Methods in the Development of Computing Systems-Volume II, pages 1798–1815, London, UK, 1999. Springer-Verlag. ISBN 3-540-66588-9.
Morten Rhiger. A foundation for embedded languages. ACM Transactions on Programming Languages and Systems (TOPLAS), 25(3):291–315, 2003. ISSN 0164-0925.
Julien Signoles. Plug-in development guide, 2008. URL http://frama-c.cea.fr/download/plug-in_development_guide.pdf.
Julien Signoles. Foncteurs impératifs et composés: la notion de projets dans Frama-C. In Actes des Journées Francophones des Langages Applicatifs, pages 37–54, January 2009. In French.
Control-Flow Analysis of Function Calls and Returns by Abstract Interpretation Jan Midtgaard
Thomas P. Jensen
Roskilde University [email protected]
CNRS [email protected]
Abstract
We derive a control-flow analysis that approximates the interprocedural control-flow of both function calls and returns in the presence of first-class functions and tail-call optimization. In addition to an abstract environment, our analysis computes for each expression an abstract control stack, effectively approximating where function calls return across optimized tail calls. The analysis is systematically calculated by abstract interpretation of the stack-based Ca EK abstract machine of Flanagan et al. using a series of Galois connections. Abstract interpretation provides a unifying setting in which we 1) prove the analysis equivalent to the composition of a continuation-passing style (CPS) transformation followed by an abstract interpretation of a stack-less CPS machine, and 2) extract an equivalent constraint-based formulation, thereby providing a rational reconstruction of a constraint-based control-flow analysis from abstract interpretation principles.

let g z = z in
let f k = if b then k 1 else k 2 in
let y = f (fn x => x) in
g y

(a) Example program
(b) Call-return call graph
(c) Optimized call graph

Figure 1: The corresponding call graphs
Categories and Subject Descriptors F.3.2 [Logics and Meanings of Programs]: Semantics of Programming Languages—Program Analysis
that “will determine where the flow of control may be transferred to in the case of a function return.” The resulting analysis thereby approximates both call and return information for a higher-order, direct-style language. Interestingly it does so by approximating the control stack. Consider the example program in Fig. 1(a). The program contains three functions: two named function g and f and an anonymous function fn x => x. A standard direct-style CFA can determine that the applications of k in each branch of the conditional will call the anonymous function fn x => x at run time. Building a call-graph based on this output gives rise to Fig. 1(b), where we have named the main expression of the program main. In addition to the above resolved call, our analysis will determine that the anonymous function returns to the let-binding of y in main upon completion, rather than to its caller. The analysis hence gives rise to the call graph in Fig. 1(c). On a methodological level, we derive the analysis systematically by Cousot-Cousot-style abstract interpretation. The analysis approximates the reachable states of an existing abstract machine from the literature: the Ca EK machine of Flanagan et al. [1993]. We obtain the analysis as the result of composing the collecting semantics induced by the abstract machine with a series of Galois connections that each specifies one aspect of the abstraction in the analysis. We show how the abstract interpretation formulation lends itself to a lock-step equivalence proof between our analysis and a previously derived CPS-based CFA. More precisely, we define a relation between the abstract domains of the analyses that is a simulation between the two, reducing the proof to a fixpoint induction over the abstract interpretations.
General Terms Languages, Theory, Verification
Keywords Control flow analysis, abstract interpretation, tail-call optimization, continuation-passing style, direct style, constraint-based analysis
1. Introduction The control flow of a functional program is expressed in terms of function calls and returns. As a result, iteration in functional programs is expressed using recursive functions. In order for this approach to be feasible, language implementations perform tail-call optimization of function calls [Clinger, 1998], by not pushing a stack frame on the control stack at call sites in tail position. Consequently functions do not necessarily return control to their caller. Control-flow analysis (CFA) has long been a staple of program optimization and verification. Surprisingly, research on control-flow analysis has focused on calls: A textbook CFA “will determine where the flow of control may be transferred to in the case [...] of a function application.” [Nielson et al., 1999]. Our systematic approximation of a known operational semantics leads to a CFA
To sum up, the main contributions of this article are:
Damian and Danvy [2003], Palsberg and Wand [2003]:
• An abstract interpretation-derivation of a CFA for a higher-
exists analysis C : for all p, C(p) ∼ C(cps(p))
order functional language from a well-known operational semantics,
Present paper, Theorem 5.1: exists analyses C1 , C2 : for all p, C1 (p) ∼ C2 (cps(p))
• a resulting CFA with reachability which computes both call and
return control-flow,
Our work relates to all of the above contributions. The disciplined derivation of specialized CPS and direct-style analyses results in comparable analyses, contrary to Sabry and Felleisen [1994]. Furthermore our equivalence proof extends the results of Damian and Danvy [2003] and Palsberg and Wand [2003] in that we relate both call flow, return flow, and reachability, contrary to their relating only the call flow of standard CFAs. In addition, the systematic abstract interpretation-based approach suggests a strategy for obtaining similar equivalence results for other CFAs derived in this fashion. Formulating CFA in the traditional abstract interpretation framework was stated as an open problem by Nielson and Nielson [1997]. It has been a recurring theme in the work of the present authors. In an earlier paper Spoto and Jensen [2003] investigated class analysis of object-oriented programs as a Galois connection-based abstraction of a trace semantics. In a recent article [Midtgaard and Jensen, 2008a], the authors systematically derived a CPS-based CFA from the collecting semantics of a stack-less machine. While investigating how to derive a corresponding direct-style analysis we discovered a mismatch between the computed return information. As tail calls are identified syntactically, the additional information could also have been obtained by a subsequent analysis after a traditional direct-style CFA. However we view the need for such a subsequent analysis as a strong indication of a mismatch between the direct-style and CPS analysis formulations. Debray and Proebsting [1997] have investigated such a “return analysis” for a first-order language with tail-call optimization. This paper builds a semantics-based CFA that determines such information, and for a higher-order language. The systematic design of constraint-based analyses is a goal shared with the flow logic framework of Nielson and Nielson [2002]. In flow logic an analysis specification can be systematically transformed into a constraint-based analysis. The present paper instead extracts a constraint-based analysis from an analysis developed in the original abstract interpretation framework.
• a proof of equivalence of the analysis of programs in direct style
and the CPS analysis of their CPS counterparts, • an equivalent constraint-based analysis extracted from the
above. 1.1 Related work We separate the discussion of related analyses in two: direct-style analyses and analyses based on CPS. Direct-style CFA has a long research history. Jones [1981] initially developed methods for approximating the control flow of lambda terms. Since then Sestoft [1989] conceived the related closure analysis. Palsberg [1995] simplified the analysis and formulated an equivalent constraint-based analysis. At the same time Heintze [1994] developed a related set-based analysis formulated in terms of set constraints. For a detailed account of related work, we refer to a recent survey of the area [Midtgaard, 2007]. It is worth emphasizing that all of the above analyses focus on calls, in that they approximate the source lambdas being called at each call site. As such they do not directly determine return flow for programs in direct style. CPS-based CFA was pioneered by Shivers [1988] who formulated control-flow analysis for Scheme. Since then several analyses have been formulated for CPS [Ayers, 1992, Ashley and Dybvig, 1998, Might and Shivers, 2006]. In CPS all calls are tail calls, and even returns are encoded as calls to the current continuation. By determining “call flow” and hence the receiver functions of such continuation calls, a CPS-based CFA thereby determines return flow without additional effort. The impact of CPS transformation on static analyses originates in binding-time analysis, for which the transformation is known to have a positive effect [Consel and Danvy, 1991, Damian and Danvy, 2003]. As to the impact of CPS transformation on CFA we separate the previous work on the subject in two: 1. results relating an analysis specialized to the source language to an analysis specialized to the target language (CPS), and
The idea of CFA by control stack approximation, applies equally well to imperative or object-oriented programs, but it is beyond the scope of this paper to argue this point. Due to space limitations most calculations and proofs are also omitted. We refer the reader to the accompanying technical report [Midtgaard and Jensen, 2008b].
2. results relating the analysis of a program to the same analysis of the CPS transformed program. Sabry and Felleisen [1994] designed and compared specialized analyses and hence falls into the first category as does the present paper. Damian and Danvy [2003] related the analysis of a program and its CPS counterpart for a standard flow-logic CFA (as well as for two binding-time analyses), and Palsberg and Wand [2003] related the analysis of a program and its CPS counterpart for a standard conditional constraint CFA. Hence the latter two fall into the second category. We paraphrase the relevant theorems of Sabry and Felleisen [1994], of Damian and Danvy [2003], of Palsberg and Wand [2003], and of the present paper in order to underline the difference between the contributions (C refers to non-trivial, 0-CFA-like analyses defined in the cited papers, p ranges over direct-style programs, cps denotes CPS transformation, and ∼ denotes analysis equivalence). Our formulations should not be read as a formal system, but only as a means for elucidating the difference between the contributions. Sabry and Felleisen [1994]:
2. Language and semantics Our source language is a simple call-by-value core language known as administrative normal form (ANF). The grammar of ANF terms is given in Fig. 2(a). Following Reynolds, the grammar distinguishes serious expressions, i.e., terms whose evaluation may diverge, from trivial expressions, i.e., terms without risk of divergence. Trivial expressions include constants, variables, and functions, and serious expressions include returns, let-bindings, tail calls, and non-tail calls. Programs are serious expressions. The analysis is calculated from a simple operational semantics in the form of an abstract machine. We use the environment-based Ca EK abstract machine of Flanagan et al. [1993] given in Fig. 2 in which functional values are represented using closures, i.e., pairs of a lambda-expression and an environment. The environmentcomponent captures the (values of the) free variables of the lambda. Machine states are triples consisting of a serious expression, an
exists analyses C1 , C2 : exists p, C1 (p) ≁ C2 (cps(p))
P ∋ p ::= s (programs)
T ∋ t ::= c | x | fn x => s (trivial expressions)
C ∋ s ::= t | let x=t in s | t0 t1 | let x=t0 t1 in s (serious expressions)
(a) ANF grammar

Val ∋ w ::= c | [fn x => s, e]
Env ∋ e ::= • | e[x ↦ w]
K ∋ k ::= stop | [x, s, e] :: k
(b) Values, environments, and stacks

µ : T × Env ⇀ Val
µ(c, e) = c
µ(x, e) = e(x)
µ(fn x => s, e) = [fn x => s, e]
(c) Helper function

⟨t, e, [x, s′, e′] :: k′⟩ −→ ⟨s′, e′[x ↦ µ(t, e)], k′⟩
⟨let x=t in s, e, k⟩ −→ ⟨s, e[x ↦ µ(t, e)], k⟩
⟨t0 t1, e, k⟩ −→ ⟨s′, e′[x ↦ w], k⟩ if [fn x => s′, e′] = µ(t0, e) and w = µ(t1, e)
⟨let x=t0 t1 in s, e, k⟩ −→ ⟨s′, e′[y ↦ w], [x, s, e] :: k⟩ if [fn y => s′, e′] = µ(t0, e) and w = µ(t1, e)
(d) Machine transitions

eval(p) = w iff ⟨p, •, [xr, xr, •] :: stop⟩ −→∗ ⟨xr, •[xr ↦ w], stop⟩
(e) Machine evaluation

Figure 2: The Ca EK abstract machine
4. Approximating the Ca EK collecting semantics
environment and a control stack. The control stack is composed of elements (“stack frames”) of the form [x, s, e] where x is the variable receiving the return value w of the current function call, and s is a serious expression whose evaluation in the environment e[x 7→ w] represents the rest of the computation in that stack frame. The empty stack is represented by stop. The machine has a helper function µ for evaluation of trivial expressions. The machine is initialized with the input program, with an empty environment, and with an initial stack, that will bind the result of the program to a special variable xr before halting. Evaluation follows by repeated application of the machine transitions.
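For readers who prefer running code to transition rules, the following is a small OCaml transcription of the machine in Figure 2 (the constructor names are ours, environments are plain association lists, and constants are restricted to integers); it is a sketch for experimentation, not part of the analysis.

    (* ANF terms, values, and CaEK machine states, transcribed from Fig. 2. *)
    type triv = Const of int | Var of string | Fn of string * serious
    and serious =
      | Ret of triv                                 (* t *)
      | Let of string * triv * serious              (* let x=t in s *)
      | TailCall of triv * triv                     (* t0 t1 *)
      | Call of string * triv * triv * serious      (* let x=t0 t1 in s *)

    type value = Num of int | Clos of string * serious * env
    and env = (string * value) list
    type state = serious * env * (string * serious * env) list   (* stop = [] *)

    let mu (t : triv) (e : env) : value = match t with
      | Const c -> Num c
      | Var x -> List.assoc x e
      | Fn (x, s) -> Clos (x, s, e)

    (* One machine transition; None when the machine halts or is stuck. *)
    let step ((s, e, k) : state) : state option = match s, k with
      | Ret t, (x, s', e') :: k' -> Some (s', (x, mu t e) :: e', k')
      | Let (x, t, s'), _ -> Some (s', (x, mu t e) :: e, k)
      | TailCall (t0, t1), _ ->
          (match mu t0 e with
           | Clos (x, s', e') -> Some (s', (x, mu t1 e) :: e', k)
           | Num _ -> None)
      | Call (x, t0, t1, body), _ ->
          (match mu t0 e with
           | Clos (y, s', e') -> Some (s', (y, mu t1 e) :: e', (x, body, e) :: k)
           | Num _ -> None)
      | Ret _, [] -> None

    let rec run (st : state) : state =
      match step st with Some st' -> run st' | None -> st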
As our collecting semantics we consider the reachable states of the Ca EK machine, expressed as the least fixed point lfp F of the following transition function. F : ℘(C × Env × K) → ℘(C × Env × K) F(S) = Ip ∪ {s | ∃s′ ∈ S : s′ −→ s} where Ip = {hp, •, [xr , xr , •] :: stopi} First we formulate in Fig. 3(a) an equivalent helper function µc extended to work on sets of environments. Lemma 4.1. ∀t, e : {µ (t, e)} = µc (t, {e}) The equivalence of the two helper functions follow straightforwardly. This lemma enables us to express an equivalent collecting semantics based on µc , which appears in Fig. 3. The equivalence of F and F c follows from the lemma and by unfolding the definitions. The abstraction of the collecting semantics is staged in several steps. Figure 4 provides an overview. Intuitively, the analysis extracts three pieces of information from the set of reachable states.
3. Abstract interpretation basics We assume some familiarity with the basic mathematical facts recalled in Appendix A. Canonical abstract interpretation approximates the collecting semantics of a transition system [Cousot, 1981]. A standard example of a collecting semantics is the reachable states from a given set of initial states I. Given a transition function T defined as: T (Σ) = I ∪ {σ | ∃σ ′ ∈ Σ : σ ′ → σ }, we can compute the reachable states of T as the least fixed-point lfp T of T . The collecting semantics is ideal, in that it is the most precise analysis. Unfortunately it is in general uncomputable. Abstract interpretation therefore approximates the collecting semantics, by instead computing a fixed-point over an alternative and perhaps simpler domain. For this reason, abstract interpretation is also referred to as a theory of fixed-point approximation. Abstractions are formally represented as Galois connections which connect complete lattices through a pair of adjoint functions α and γ (see Appendix A). Galois connection-based abstract interpretation suggests that one may derive an analysis systematically by composing the transition function with these adjoints: α ◦ T ◦ γ . In this setting Galois connections allow us to gradually refine the collecting semantics into a computable analysis function by mere calculation. An alternative “recipe” consists in rewriting the composition of the abstraction function and transition function α ◦ T into something of the form T ♯ ◦ α , from which the analysis function T ♯ can be read off [Cousot and Cousot, 1992a]. Cousot [1999] has shown how to systematically construct a static analyser for a first-order imperative language using calculational abstract interpretation.
1. An approximation of the set of reachable expressions. 2. A relation between expressions and control stacks that represents where the values of expressions are returned to. 3. An abstract environment mapping variables to the expressions that may be bound to that variable. This is standard in CFA and allows to determine which functions are called at a given call site. Keeping an explicit set of reachable expressions is more precise than leaving it out, once we further approximate the expressionstack pairs. Alternatively the reachable expressions would be approximated by the expressions present in the expression-stack relation. However expressions may be in the expression-stack relation without ever being reached. An example hereof would be a diverging non-tail call. To formalize this intuition, we first perform a Cartesian abstraction of the machine states, however keeping the relation between expressions and their corresponding control stacks. The second step in the approximation consists in closing the triples by a closure operator, to ensure that (a) any saved environment on the stack or nested within another environment is itself part of the environment
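Before turning to the individual abstraction steps, we note that the least fixed points used above (lfp T, lfp F) can be computed by plain Kleene iteration whenever the domain is finite and the transfer function monotone. The following generic OCaml sketch is our own illustration; the bottom element and equality test are assumptions supplied by the caller.

  (* Generic Kleene iteration: iterate f from bottom until the result
     stabilizes. Termination is the caller's obligation, e.g., a finite
     domain with no infinite ascending chains. *)
  let lfp ~(bottom : 'a) ~(equal : 'a -> 'a -> bool) (f : 'a -> 'a) : 'a =
    let rec iterate x =
      let x' = f x in
      if equal x x' then x else iterate x'
    in
    iterate bottom

  (* Example: reachable states of a finite transition system, as in
     Section 3, with states represented as integers. *)
  module IntSet = Set.Make (Int)
  let reachable (init : IntSet.t) (succ : int -> IntSet.t) : IntSet.t =
    lfp ~bottom:IntSet.empty ~equal:IntSet.equal
      (fun s ->
        IntSet.union init
          (IntSet.fold (fun q acc -> IntSet.union (succ q) acc) s IntSet.empty))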
  µc : T × ℘(Env) → ℘(Val)
  µc(c, E) = {c}
  µc(x, E) = {w | ∃e ∈ E : w = e(x)}
  µc(fn x => s, E) = {[fn x => s, e] | e ∈ E}
(a) Helper function

  Fc : ℘(C × Env × K) → ℘(C × Env × K)
  Fc(S) = Ip
    ∪ ⋃ {⟨s′, e′[x ↦ w], k′⟩}              for ⟨t, e, [x, s′, e′]::k′⟩ ∈ S, w ∈ µc(t, {e})
    ∪ ⋃ {⟨s, e[x ↦ w], k⟩}                 for ⟨let x=t in s, e, k⟩ ∈ S, w ∈ µc(t, {e})
    ∪ ⋃ {⟨s′, e′[x ↦ w], k⟩}               for ⟨t0 t1, e, k⟩ ∈ S, [fn x => s′, e′] ∈ µc(t0, {e}), w ∈ µc(t1, {e})
    ∪ ⋃ {⟨s′, e′[y ↦ w], [x, s, e] :: k⟩}  for ⟨let x=t0 t1 in s, e, k⟩ ∈ S, [fn y => s′, e′] ∈ µc(t0, {e}), w ∈ µc(t1, {e})
(b) Transition function

Figure 3: Collecting semantics

  ℘(C × Env × K)                          coll. sem.  Fc
     ⇅ (α×, γ×)
  ℘(C) × ℘(C × K) × ℘(Env)                            F×
     ⇅ (ρ, 1)
  ρ(℘(C) × ℘(C × K) × ℘(Env))                         Fρ
     ⇅ (α⊗, γ⊗)
  ℘(C) × (C/≡ → ℘(K♯)) × Env♯             0-CFA       F♯

Figure 4: Overview of abstraction

4.1 Projecting machine states

The mapping that extracts the three kinds of information described above is defined formally as follows:
  ℘(C × Env × K)  ⇄(α×, γ×)  ℘(C) × ℘(C × K) × ℘(Env)
  α×(S) = ⟨π1 S, {⟨s, k⟩ | ∃e : ⟨s, e, k⟩ ∈ S}, π2 S⟩
  γ×(⟨C, F, E⟩) = {⟨s, e, k⟩ | s ∈ C ∧ ⟨s, k⟩ ∈ F ∧ e ∈ E}

Lemma 4.2. (α×, γ×) is a Galois connection.

The above Galois connection and the proof hereof closely resemble the independent attributes abstraction, which is a known Galois connection. We use the notation ∪× and ⊆× for the componentwise join and componentwise inclusion of triples. As is traditional [Cousot and Cousot, 1979, 1992a, 1994], we assume that the abstract product domains throughout this article have been reduced, i.e., all triples ⟨A, B, C⟩ with a bottom component (A = ⊥a ∨ B = ⊥b ∨ C = ⊥c) have been eliminated and replaced by a single bottom element ⟨⊥a, ⊥b, ⊥c⟩. Based on this abstraction we can now calculate a new transfer function F×. The resulting transition function appears in Fig. 5. By construction, the transition function satisfies the following theorem.

Theorem 4.1. ∀C, F, E : α×(Fc(γ×(⟨C, F, E⟩))) = F×(⟨C, F, E⟩)

4.2 A closure operator on machine states

For the final analysis, we are only interested in an abstraction of the information present in an expression-stack pair. More precisely, we aim at only keeping track of the link between an expression and the top stack frame in effect during its evaluation, throwing away everything below. However, we need to make this information explicit for all expressions appearing on the control stack, i.e., for a pair ⟨s, [x, s′, e] :: k⟩ we also want to retain that s′ will be evaluated with control stack k. Similarly, environments can be stored on the stack or inside other environments and will have to be extracted. We achieve this by defining a suitable closure operator on these nested structures.

For environments, we adapt the definition of a constituent relation due to Milner and Tofte [1991]. We say that each component xi of a tuple ⟨x0, . . . , xn⟩ is a constituent of the tuple, written ⟨x0, . . . , xn⟩ ≻ xi. For a partial function¹ f = [x0 ↦ w0, . . . , xn ↦ wn], we say that each wi is a constituent of the function, written f ≻ wi. We write ≻∗ for the reflexive, transitive closure of the constituent relation.

To deal with the control stack, we define an order on expression-stack pairs. Two pairs are ordered if (a) the stack component of the second is the tail of the first's stack component, and (b) the expression component of the second resides on the top stack frame of the first pair: ⟨s, [x, s′, e] :: k⟩ ⋗ ⟨s′, k⟩. We write ⋗∗ for the reflexive, transitive closure of the expression-stack pair ordering.

Next, we consider an operator ρ, defined in terms of the constituent relation and the expression-stack pair ordering. The operator ρ ensures that all constituent environments will themselves belong to the set of environments, and that any structurally smaller expression-stack pairs are also contained in the expression-stack relation.

Definition 4.1.
  ρ(⟨C, F, E⟩) = ⟨C,
                  {⟨s, k⟩ | ∃⟨s′, k′⟩ ∈ F : ⟨s′, k′⟩ ⋗∗ ⟨s, k⟩},
                  {e | ∃⟨s, k⟩ ∈ F : ⟨s, k⟩ ≻∗ e ∨ ∃e′ ∈ E : e′ ≻∗ e}⟩

We need to relate the expression-stack ordering to the constituent relation. By case analysis one can prove that ∀⟨s, k⟩, ⟨s′, k′⟩ : ⟨s, k⟩ ⋗ ⟨s′, k′⟩ =⇒ k ≻ k′. By structural induction (on the stack component) it now follows that ∀⟨s, k⟩, ⟨s′, k′⟩ : ⟨s, k⟩ ⋗∗ ⟨s′, k′⟩ =⇒ k ≻∗ k′. Based on these results we can verify that ρ is a closure operator and formulate an abstraction on the triples:
  ℘(C) × ℘(C × K) × ℘(Env)  ⇄(ρ, 1)  ρ(℘(C) × ℘(C × K) × ℘(Env))

¹ Milner and Tofte define the constituent relation for finite functions.
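As a concrete illustration of Definition 4.1 (our own example), consider a relation containing a single pair with a one-frame stack, F = {⟨s, [x, s′, e] :: stop⟩}, and an empty environment set E = ∅. Then
  ρ(⟨C, F, ∅⟩) = ⟨C, {⟨s, [x, s′, e] :: stop⟩, ⟨s′, stop⟩}, {e′′ ∈ Env | e ≻∗ e′′}⟩
i.e., the structurally smaller pair ⟨s′, stop⟩ is added to the expression-stack relation, and the saved environment e, together with any environments nested inside it, is added to the environment set.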
F× : ℘(C) × ℘(C × K) × ℘(Env) → ℘(C) × ℘(C × K) × ℘(Env)
F×(⟨C, F, E⟩) = ⟨{p}, {⟨p, [xr, xr, •] :: stop⟩}, {•}⟩
  ∪× ⋃× ⟨{s′}, {⟨s′, k′⟩}, {e′[x ↦ w]}⟩
        for ⟨{t}, {⟨t, [x, s′, e′]::k′⟩}, {e}⟩ ⊆× ⟨C, F, E⟩, w ∈ µc(t, {e})
  ∪× ⋃× ⟨{s}, {⟨s, k⟩}, {e[x ↦ w]}⟩
        for ⟨{let x=t in s}, {⟨let x=t in s, k⟩}, {e}⟩ ⊆× ⟨C, F, E⟩, w ∈ µc(t, {e})
  ∪× ⋃× ⟨{s′}, {⟨s′, k⟩}, {e′[x ↦ w]}⟩
        for ⟨{t0 t1}, {⟨t0 t1, k⟩}, {e}⟩ ⊆× ⟨C, F, E⟩, [fn x => s′, e′] ∈ µc(t0, {e}), w ∈ µc(t1, {e})
  ∪× ⋃× ⟨{s′}, {⟨s′, [x, s, e] :: k⟩}, {e′[y ↦ w]}⟩
        for ⟨{let x=t0 t1 in s}, {⟨let x=t0 t1 in s, k⟩}, {e}⟩ ⊆× ⟨C, F, E⟩, [fn y => s′, e′] ∈ µc(t0, {e}), w ∈ µc(t1, {e})

Figure 5: Abstract transition function
We use the notation ∪ρ for the join operation λX. ρ(∪× X) on the closure operator-induced complete lattice. First observe that in our case
  ∪ρ = λX. ρ(⋃×ᵢ Xᵢ) = λX. ⋃×ᵢ ρ(Xᵢ) = λX. ⋃×ᵢ Xᵢ = ∪×

Based on the closure operator-based Galois connection, we can calculate a new intermediate transfer function Fρ. The resulting transfer function appears in Fig. 6. This transfer function differs only minimally from the one in Fig. 5, in that (a) the signature has changed, (b) the set of initial states has been "closed" and now contains the structurally smaller pair ⟨xr, stop⟩, and (c) the four indexed joins now each join "closed" triples in the image of the closure operator. By construction, the new transition function satisfies the following theorem.

Theorem 4.2. ∀C, F, E : ρ ◦ F× ◦ 1(⟨C, F, E⟩) = Fρ(⟨C, F, E⟩)

4.3 Abstracting the expression-stack relation

Since stacks can grow unbounded (for non-tail recursive programs), we need to approximate the stack component and hereby the expression-stack relation. We first formulate a grammar of abstract stacks and an elementwise operator @ : C × K → C × K♯ operating on expression-stack pairs.
  K♯ ∋ k♯ ::= stop | [x, s]
  @(⟨s, stop⟩) = ⟨s, stop⟩
  @(⟨s, [x, s′, e] :: k⟩) = ⟨s, [x, s′]⟩
Based on the elementwise operator we can now use an elementwise abstraction.

Elementwise abstraction [Cousot and Cousot, 1997]: A given elementwise operator @ : C → A induces a Galois connection:
  ℘(C)  ⇄(α@, γ@)  ℘(A)
  α@(P) = {@(p) | p ∈ P}
  γ@(Q) = {p | @(p) ∈ Q}

Notice how some expressions share the same return point (read: same stack): the expressions let x=t in s and s share the same return point, and let x=t0 t1 in s and s share the same return point. In order to eliminate such redundancy we define an equivalence relation on serious expressions, grouping together expressions that share the same return point. We define the smallest equivalence relation ≡ satisfying
  let x=t in s ≡ s
  let x=t0 t1 in s ≡ s
Based hereon we define a second elementwise operator @′ : C × K♯ → C/≡ × K♯ mapping the first component of an expression-stack pair to a representative of its corresponding equivalence class:
  @′(⟨s, k♯⟩) = ⟨[s]≡, k♯⟩
We can choose the outermost expression as a representative for each equivalence class by a linear top-down traversal of the input program.

Pointwise coding of a relation [Cousot and Cousot, 1994]: A relation can be isomorphically encoded as a set-valued function by a Galois connection:
  ℘(A × B)  ⇄(αω, γω)  A → ℘(B)
  αω(r) = λa. {b | ⟨a, b⟩ ∈ r}
  γω(f) = {⟨a, b⟩ | b ∈ f(a)}

By composing the three above Galois connections we obtain our abstraction of the expression-stack relation:
  ℘(C × K)  ⇄(αst, γst)  C/≡ → ℘(K♯)
where αst = αω ◦ α@′ ◦ α@ = λF. ∪̇_{⟨s, k⟩ ∈ F} αω({@′ ◦ @(⟨s, k⟩)}) and γst = γ@ ◦ γ@′ ◦ γω. We can now prove a lemma relating the concrete and abstract expression-stack relations.

Lemma 4.3 (Control stack and saved environments). Let ⟨C, F, E⟩ ∈ ρ(℘(C) × ℘(C × K) × ℘(Env)) be given. Then
  ⟨s, [x, s′, e] :: k⟩ ∈ F =⇒ e ∈ E ∧ {⟨s′, k⟩} ⊆ F ∧ {[x, s′]} ⊆ αst(F)([s]≡)

Proof. The first half follows from the assumptions. The second half follows from monotonicity of αst, and the definitions of αst, ∪̇, @, @′, αω, and ⊆̇.

4.4 Abstracting environments

We also abstract values using an elementwise abstraction. Again we formulate a grammar of abstract values and an elementwise operator @ : Val → Val♯ mapping concrete to abstract values.
Fρ : ρ(℘(C) × ℘(C × K) × ℘(Env)) → ρ(℘(C) × ℘(C × K) × ℘(Env))
Fρ(⟨C, F, E⟩) = ⟨{p}, {⟨p, [xr, xr, •] :: stop⟩, ⟨xr, stop⟩}, {•}⟩
  ∪× ⋃× ρ(⟨{s′}, {⟨s′, k′⟩}, {e′[x ↦ w]}⟩)
        for ⟨{t}, {⟨t, [x, s′, e′]::k′⟩}, {e}⟩ ⊆× ⟨C, F, E⟩, w ∈ µc(t, {e})
  ∪× ⋃× ρ(⟨{s}, {⟨s, k⟩}, {e[x ↦ w]}⟩)
        for ⟨{let x=t in s}, {⟨let x=t in s, k⟩}, {e}⟩ ⊆× ⟨C, F, E⟩, w ∈ µc(t, {e})
  ∪× ⋃× ρ(⟨{s′}, {⟨s′, k⟩}, {e′[x ↦ w]}⟩)
        for ⟨{t0 t1}, {⟨t0 t1, k⟩}, {e}⟩ ⊆× ⟨C, F, E⟩, [fn x => s′, e′] ∈ µc(t0, {e}), w ∈ µc(t1, {e})
  ∪× ⋃× ρ(⟨{s′}, {⟨s′, [x, s, e] :: k⟩}, {e′[y ↦ w]}⟩)
        for ⟨{let x=t0 t1 in s}, {⟨let x=t0 t1 in s, k⟩}, {e}⟩ ⊆× ⟨C, F, E⟩, [fn y => s′, e′] ∈ µc(t0, {e}), w ∈ µc(t1, {e})

Figure 6: The second abstract transition function
  Val♯ ∋ w♯ ::= c | [fn x => s]
  @(c) = c
  @([fn x => s, e]) = [fn x => s]

The abstraction of environments, which are partial functions, can be composed from a series of well-known Galois connections.

Pointwise abstraction of a set of functions [Cousot and Cousot, 1994]: A given Galois connection ℘(C) ⇄(α, γ) ⟨C♯; ⊑⟩ on the co-domain induces a Galois connection on a set of functions:
  ℘(D → C)  ⇄(αΠ, γΠ)  ⟨D → C♯; ⊑̇⟩
  αΠ(F) = λd. α({f(d) | f ∈ F})
  γΠ(A) = {f | ∀d : f(d) ∈ γ(A(d))}

Subset abstraction [Cousot and Cousot, 1997]: Given a set C and a strict subset A ⊂ C hereof, the restriction to the subset induces a Galois connection:
  ℘(C)  ⇄(α⊂, γ⊂)  ℘(A)
  α⊂(X) = X ∩ A
  γ⊂(Y) = Y ∪ (C \ A)

A standard trick is to think of partial functions r : D ⇀ C as total functions r⊥ : D → (C ∪ ⊥) where ⊥ ⊑ ⊥ ⊑ c, for all c ∈ C. Consider environments e ∈ Var ⇀ Val to be total functions Var → (Val ∪ ⊥) using this idea. In this context the bottom element ⊥ denotes variable-lookup failure. Now compose a subset abstraction ℘(Val ∪ ⊥) ⇄(α⊂, γ⊂) ℘(Val) with the above value abstraction, and feed the result to the pointwise abstraction above. The result is a pointwise abstraction of a set of environments, not explicitly modelling variable-lookup failure: ℘(Env) ⇄(αΠ, γΠ) Var → ℘(Val♯). By considering only closed programs, we statically ensure against failure of variable lookup, hence disregarding ⊥ loses no information.

4.5 Abstracting the helper function

We can calculate an abstract helper function by "pushing α's" under the function definition and reading off a resulting abstract definition.

Lemma 4.4. ∀t, E : α@(µc(t, E)) = µ♯(t, αΠ(E))

The resulting helper function µ♯ : T × Env♯ → ℘(Val♯) reads:
  µ♯(c, E♯) = {c}
  µ♯(x, E♯) = E♯(x)
  µ♯(fn x => s, E♯) = {[fn x => s]}
where we write Env♯ as shorthand for Var → ℘(Val♯). We shall need a lemma relating the two helper function definitions on closed environments.

Lemma 4.5 (Helper function on closed environments, 1). Let ⟨C, F, E⟩ ∈ ρ(℘(C) × ℘(C × K) × ℘(Env)) be given. Then
  {[fn x => s, e]} ⊆ µc(t, E) =⇒ e ∈ E ∧ {[fn x => s]} ⊆ µ♯(t, αΠ(E))

The above lemma is easily extended to capture nested environments in all values returned by the helper function:

Lemma 4.6 (Helper function on closed environments, 2). Let ⟨C, F, E⟩ ∈ ρ(℘(C) × ℘(C × K) × ℘(Env)) be given. Then
  {w} ⊆ µc(t, E) ∧ w ≻∗ e′′ =⇒ e′′ ∈ E

4.6 Abstracting the machine states

We abstract the triple of sets into abstract triples by a componentwise abstraction.

Componentwise abstraction [Cousot and Cousot, 1994]: Assuming a series of Galois connections ℘(Ci) ⇄(αi, γi) Ai for i ∈ {1, . . . , n}, their componentwise composition induces a Galois connection on tuples:
  ℘(C1) × . . . × ℘(Cn)  ⇄(α⊗, γ⊗)  ⟨A1 × . . . × An; ⊆⊗⟩
  α⊗(⟨X1, . . ., Xn⟩) = ⟨α1(X1), . . ., αn(Xn)⟩
  γ⊗(⟨x1, . . ., xn⟩) = ⟨γ1(x1), . . ., γn(xn)⟩

We write ∪⊗ and ⊆⊗ for componentwise join and inclusion, respectively. For the set of expressions ℘(C) we use the identity abstraction consisting of two identity functions.
F♯ : P → (℘(C) × (C/≡ → ℘(K♯)) × Env♯) → (℘(C) × (C/≡ → ℘(K♯)) × Env♯)
F♯p(⟨C, F♯, E♯⟩) = ⟨{p}, [[p]≡ ↦ {[xr, xr]}, [xr]≡ ↦ {stop}], λ_. ∅⟩
  ∪⊗ ⋃⊗ ⟨{s′}, F♯, E♯ ∪̇ [x ↦ µ♯(t, E♯)]⟩
        for {t} ⊆ C, {[x, s′]} ⊆ F♯([t]≡)
  ∪⊗ ⋃⊗ ⟨{s}, F♯, E♯ ∪̇ [x ↦ µ♯(t, E♯)]⟩
        for {let x=t in s} ⊆ C
  ∪⊗ ⋃⊗ ⟨{s′}, F♯ ∪̇ [[s′]≡ ↦ F♯([t0 t1]≡)], E♯ ∪̇ [x ↦ µ♯(t1, E♯)]⟩
        for {t0 t1} ⊆ C, {[fn x => s′]} ⊆ µ♯(t0, E♯)
  ∪⊗ ⋃⊗ ⟨{s′}, F♯ ∪̇ [[s′]≡ ↦ {[x, s]}], E♯ ∪̇ [y ↦ µ♯(t1, E♯)]⟩
        for {let x=t0 t1 in s} ⊆ C, {[fn y => s′]} ⊆ µ♯(t0, E♯)

Figure 7: The resulting analysis function
For the expression-stack relation ℘(C × K) we use the expression-stack abstraction αst developed in Section 4.3. For the set of environments ℘(Env) we use the environment abstraction αΠ developed in Section 4.4.

Using the alternative "recipe" we can calculate the analysis by "pushing α's" under the intermediate transition function:
  α⊗(Fρ(⟨C, F, E⟩)) ⊆⊗ F♯(⟨C, αst(F), αΠ(E)⟩)
from which the final definition of F♯ can be read off. The resulting analysis appears in Fig. 7. The alert reader may have noticed that this final abstraction is not complete, in that the above equation contains an inequality. Completeness is a desirable goal in an abstract interpretation, but unfortunately it is not possible in general without refining the abstract domain [Giacobazzi et al., 2000]. Consider for example the addition operator over the standard sign domain: 0 = α(1 + (−1)) ⊑ α(1) + α(−1) = ⊤. As is traditional [Cousot, 1999], we instead limit upward judgements to a minimum. As a corollary of the construction, the analysis safely approximates the reachable states of the abstract machine.

Corollary 4.1. α⊗ ◦ ρ ◦ α×(lfp F) ⊆⊗ lfp F♯

4.7 Characteristics of the analysis

First of all, the analysis incorporates reachability: it computes an approximate set of reachable expressions and will only analyse those reachable program fragments. Reachability analyses have previously been discovered independently [Ayers, 1992, Palsberg and Schwartzbach, 1995, Gasser et al., 1997]. In our case reachability arises naturally from a projecting abstraction of a reachable-states collecting semantics.

Second, the formulation materializes monomorphism into two mappings: (a) one mapping merging all bindings to the same variable, and (b) one mapping merging all calling contexts of the same function. Both characteristics are well known, but our presentation is novel in that it literally captures this phenomenon in two approximation functions.

Third, the analysis handles returns inside-out ("callee-restore"), in that the called function restores control from the approximate control stack and propagates the obtained return values. This differs from the traditional presentations [Palsberg, 1995, Nielson et al., 1999] that handle returns outside-in ("caller-restore"), where the caller propagates the obtained return values from the body of the function to the call site (typically formulated as conditional constraints).

  CProg ∋ p ::= fn k => e                 (CPS programs)
  SExp  ∋ e ::= t0 t1 c | c t             (serious CPS expressions)
  TExp  ∋ t ::= x | v | fn x, k => e      (trivial CPS expressions)
  CExp  ∋ c ::= fn v => e | k             (continuation expressions)

  Figure 8: BNF of CPS language

5. Analysis equivalence

In previous work [Midtgaard and Jensen, 2008a] we derived an initial CFA with reachability for a CPS language from the stack-less CE-machine [Flanagan et al., 1993]. In this section we show that the present ANF analysis achieves the same precision as obtained by first transforming a program into CPS and then using the CPS analysis. This is done by defining a relation that captures how the direct-style analysis and the CPS analysis operate in lock-step.

The grammar of CPS terms is given in Fig. 8. The grammar distinguishes variables in the original source program x ∈ X from intermediate variables v ∈ V and continuation variables k ∈ K. We assume the three classes are non-overlapping. Their union constitutes the domain of CPS variables Var = X ∪ V ∪ K.

5.1 CPS transformation and back again

In order to state the relation between the ANF and CPS analyses we first recall the relevant program transformations. The presentation below is based on Danvy [1991], Flanagan et al. [1993], and Sabry and Felleisen [1994].

The CPS transformation given in Fig. 9(a) is defined by two mutually recursive functions, one for serious and one for trivial expressions. A continuation variable k is provided in the initial call to F. A fresh k is generated in V's lambda-abstraction case. To ease the expression of the relation, we choose k unique to the serious expression s, written ks. It follows that we only need one k per lambda abstraction in the original program, plus an additional k in the initial case. It is immediate from the definition of F that the CPS transformation of a let-binding let x=t in s and the CPS transformation of its body s share the same continuation identifier, and similarly for non-tail calls. Hence we shall equate the two:

Definition 5.1. ks ≡ ks′ iff s ≡ s′
(a) CPS transformation
  C : P → CProg
  C[p] = fn kp => Fkp[p]

  F : K → C → SExp
  Fk[t] = k V[t]
  Fk[let x=t in s] = (fn x => Fk[s]) V[t]
  Fk[t0 t1] = V[t0] V[t1] k
  Fk[let x=t0 t1 in s] = V[t0] V[t1] (fn x => Fk[s])

  V : T → TExp
  V[x] = x
  V[fn x => s] = fn x, ks => Fks[s]

(b) Direct-style transformation
  D : CProg → P
  D[fn k => e] = U[e]

  U : SExp → C
  U[k t] = P[t]
  U[(fn v => e) t] = let v=P[t] in U[e]
  U[t0 t1 k] = P[t0] P[t1]
  U[t0 t1 (fn v => e)] = let v=P[t0] P[t1] in U[e]

  P : TExp → T
  P[x] = x
  P[v] = v
  P[fn x, k => e] = fn x => U[e]

Figure 9: Transformations to and from CPS
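For instance (our own example, read off the equations of Fig. 9(a)), the non-tail call let a=f y in a is transformed as
  Fk[let a=f y in a] = V[f] V[y] (fn a => Fk[a]) = f y (fn a => k a)
and the direct-style transformation of Fig. 9(b) maps the result back to the original term:
  U[f y (fn a => k a)] = let a=P[f] P[y] in U[k a] = let a=f y in a.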
The direct-style transformation given in Fig. 9(b) is defined by two mutually recursive functions over serious and trivial CPS expressions. We define the direct-style transformation of a program fn k => e as the direct-style transformation of its body, U[e]. Transforming a program, a serious expression, or a trivial expression to CPS and back to direct style yields the original expression, which can be confirmed by (mutual) structural induction on trivial and serious expressions.

Lemma 5.1. D[C[p]] = p ∧ U[Fk[s]] = s ∧ P[V[t]] = t

5.2 CPS analysis

We recall the CPS analysis of Midtgaard and Jensen [2008a] in Fig. 10. It is defined as the least fixed point of a program-specific transfer function Tp♯. The definition relies on two helper functions µt♯ and µc♯ for trivial and continuation expressions, respectively. The analysis computes a pair consisting of (a) a set of serious expressions (the reachable expressions) and (b) an abstract environment. Abstract environments map variables to abstract values. Abstract values can be either the initial continuation stop, function closures [fn x, k => e], or continuation closures [fn v => e]. The definition relies on two special variables kr and vr, the first of which names the initial continuation and the second of which names the result of the program. To ensure the most precise analysis result, variables in the source program can be renamed to be distinct, as is traditional in control-flow analysis [Nielson et al., 1999].

5.3 Analysis equivalence

Before formally stating the equivalence of the two analyses we study an example run. As our example we use the ANF program
  let f=fn x => x in let a1=f cn1 in let a2=f cn2 in a2
taken from Sabry and Felleisen [1994], where we have Church encoded the integer literals. We write cn1 for fn s => fn z => s z and cn2 for fn s => fn z => let t1=s z in s t1. The analysis trace appears in the left column of Table 1. Similarly we study the CPS analysis of the CPS-transformed program. The analysis trace appears in the right column of Table 1, where we write ccn1 for V[cn1] and ccn2 for V[cn2]. Contrary to Sabry and Felleisen [1994], both the ANF and the CPS analyses achieve the same precision on this example, determining that a1 will be bound to one of the two integer literals.

We are now in a position to state our main theorem relating the ANF analysis to the CPS analysis. Intuitively the theorem relates:
• reachability in ANF to CPS reachability
• abstract stacks in ANF to CPS continuation closures
• the abstract stack bottom in ANF to the CPS initial continuation
• ANF closures to CPS function closures

Theorem 5.1. Let p be given. Let ⟨C, F♯, E♯⟩ = lfp Fp♯ and ⟨Q♯, R♯⟩ = lfp T♯_C[p]. Then
  s ∈ C ⟺ Fks[s] ∈ Q♯
  ∧ [x, s′] ∈ F♯(s) ⟺ [fn x => Fks′[s′]] ∈ R♯(ks)
  ∧ stop ∈ F♯(s) ⟺ stop ∈ R♯(ks)
  ∧ [fn x => s] ∈ E♯(y) ⟺ [fn x, ks => Fks[s]] ∈ R♯(y)

For the purpose of the equivalence we equate the special variables xr and vr, both naming the result of the computations. We prove the theorem by combining an implication in each direction with the identity from Lemma 5.1. We formulate both implications as relations and prove that both relations are preserved by the transfer functions.

5.4 ANF-CPS equivalence

We formally define a relation R^ANF_CPS that relates ANF analysis triples to CPS analysis pairs.

Definition 5.2. ⟨C, F♯, E♯⟩ R^ANF_CPS ⟨Q♯, R♯⟩ iff ∀s :
  s ∈ C ⟹ Fks[s] ∈ Q♯
  ∧ [x, s′] ∈ F♯(s) ⟹ [fn x => Fks′[s′]] ∈ R♯(ks)
  ∧ stop ∈ F♯(s) ⟹ stop ∈ R♯(ks)
  ∧ [fn x => s] ∈ E♯(y) ⟹ [fn x, ks => Fks[s]] ∈ R♯(y)

First we need a small lemma relating the ANF helper function to one of the CPS helper functions.

Lemma 5.2.
  [fn x => s] ∈ µ♯(t, E♯) ∧ ⟨C, F♯, E♯⟩ R^ANF_CPS ⟨Q♯, R♯⟩ ⟹ [fn x, ks => Fks[s]] ∈ µt♯(V[t], R♯)
Env♯ = Var → ℘(Val♯ ) ♯
(abstract environment)
♯
Val ∋ w ::= stop | [fn x, k => e] | [fn v => e]
(abstract values)
(a) Abstract domains
µt♯ : TExp × Env♯ → ℘(Val♯ )
T ♯ : CProg → ℘(SExp) × Env♯ → ℘(SExp) × Env♯
µt♯ (x, R♯ ) = R♯ (x)
♯ ♯ ♯ Tfn k => e (hQ , R i) = h{e}, [kr 7→ {stop}, k 7→ {[fn vr => kr vr ]}]i
∪⊗
[
˙ [x 7→ µt♯ (t1 , R♯ ), k′ 7→ µc♯ (c, R♯ )]i h{e }, R ∪
∪⊗
[
˙ [v 7→ µt♯ (t, R♯ )]i h{e′ }, R♯ ∪
′
♯
⊗ t0 t1 c∈Q♯ [fn x,k′ => e′ ]∈µt♯ (t0 ,R♯ )
µt♯ (v, R♯ ) = R♯ (v) µt♯ (fn x, k => e, R♯ ) = {[fn x, k => e]} µc♯ : CExp × Env♯ → ℘(Val♯ ) µc♯ (k, R♯ ) = R♯ (k)
⊗ c t∈Q♯ [fn v => e′ ]∈µc♯ (c,R♯ )
µc♯ (fn v
(b) Abstract transition function
=> e, R♯ ) = {[fn v => e]}
(c) Abstract helper functions
Figure 10: CPS analysis
The relation is preserved by the transfer functions.

Theorem 5.2.
  ⟨C, F♯, E♯⟩ R^ANF_CPS ⟨Q♯, R♯⟩ ⟹ Fp♯(⟨C, F♯, E♯⟩) R^ANF_CPS T♯_C[p](⟨Q♯, R♯⟩)

Proof. First we name the individual triples of the union in the function body of F♯. We name the first triple of results as initial: ⟨CI, FI♯, EI♯⟩ = ⟨{p}, [p ↦ {[xr, xr]}, xr ↦ {stop}], λ_. ∅⟩. The results of the second, third, fourth, and fifth joined triples, corresponding to return, binding, tail call, and non-tail call, are named ⟨Cret, F♯ret, E♯ret⟩, ⟨Cbind, F♯bind, E♯bind⟩, ⟨Ctc, F♯tc, E♯tc⟩, and ⟨Cntc, F♯ntc, E♯ntc⟩, respectively. Similarly we name the first result pair in the function body of the CPS analysis as initial: ⟨Q♯I, R♯I⟩ = ⟨{e}, [kr ↦ {stop}, k ↦ {[fn vr => kr vr]}]⟩. The results of the second and third joined pairs, corresponding to call and return, are named ⟨Q♯call, R♯call⟩ and ⟨Q♯ret, R♯ret⟩, respectively. The proof proceeds by verifying five relations:
  ⟨CI, FI♯, EI♯⟩ R^ANF_CPS ⟨Q♯I, R♯I⟩                  (1)
  ⟨Cret, F♯ret, E♯ret⟩ R^ANF_CPS ⟨Q♯ret, R♯ret⟩        (2)
  ⟨Cbind, F♯bind, E♯bind⟩ R^ANF_CPS ⟨Q♯ret, R♯ret⟩     (3)
  ⟨Ctc, F♯tc, E♯tc⟩ R^ANF_CPS ⟨Q♯call, R♯call⟩         (4)
  ⟨Cntc, F♯ntc, E♯ntc⟩ R^ANF_CPS ⟨Q♯call, R♯call⟩      (5)
Realizing that the union of related triples and pairs are related, we obtain the desired result.

After realizing that the bottom elements are related by the above relation, it follows by fixed point induction that their least fixed points (and hence the analyses) are related.

Corollary 5.1. lfp Fp♯ R^ANF_CPS lfp T♯_C[p]

5.5 CPS-ANF equivalence

Again we formally define a relation, now relating CPS analysis pairs to ANF analysis triples.

Definition 5.3. ⟨Q♯, R♯⟩ R^CPS_ANF ⟨C, F♯, E♯⟩ iff ∀e :
  e ∈ Q♯ ⟹ U[e] ∈ C
  ∧ [fn x => e] ∈ R♯(ks) ⟹ [x, U[e]] ∈ F♯(s)
  ∧ stop ∈ R♯(ks) ⟹ stop ∈ F♯(s)
  ∧ [fn x, ks => e] ∈ R♯(y) ⟹ [fn x => U[e]] ∈ E♯(y)

We again need a helper lemma relating the helper functions.

Lemma 5.3.
  [fn x, ks => e] ∈ µt♯(t, R♯) ∧ ⟨Q♯, R♯⟩ R^CPS_ANF ⟨C, F♯, E♯⟩ ⟹ [fn x => U[e]] ∈ µ♯(P[t], E♯)

This relation is also preserved by the transfer functions.

Theorem 5.3.
  ⟨Q♯, R♯⟩ R^CPS_ANF ⟨C, F♯, E♯⟩ ⟹ T♯_C[p](⟨Q♯, R♯⟩) R^CPS_ANF Fp♯(⟨C, F♯, E♯⟩)

Proof. The proof follows a structure similar to that of the earlier proof.

The bottom elements are related by the relation, and it follows by fixed point induction that their least fixed points (and hence the analyses) are related.

Corollary 5.2. lfp T♯_C[p] R^CPS_ANF lfp Fp♯

6. Extracting constraints

The resulting analysis may appear complex at first glance. However, we can express the analysis in the popular constraint formulation, extracted from the obtained definition. The formulation shown below is in terms of program-specific conditional constraints. Constraints have a (possibly empty) list of preconditions and a conclusion [Palsberg and Schwartzbach, 1995, Gasser et al., 1997]:
  {u1} ⊆ rhs1 ∧ . . . ∧ {un} ⊆ rhsn ⇒ lhs ⊆ rhs
The constraints operate on the same three domains as the above analysis. Left-hand sides lhs can be of the form {u}, F♯([s]≡), or E♯(x); right-hand sides rhs can be of the form C, F♯([s]≡), or E♯(x); and singleton elements u can be of the form s, c, [fn x => s], or [x, s]. From Fig. 7 we directly read off the following constraints.
i | ANF trace ⟨Ci, Fi♯, Ei♯⟩ | CPS trace ⟨Qi♯, Ri♯⟩

0 | C0 = {let f=fn x => x in let a1=f cn1 in let a2=f cn2 in a2}
    F0♯ = [[xr]≡ ↦ {stop}, [let f=fn x => x in let a1=f cn1 in let a2=f cn2 in a2]≡ ↦ {[xr, xr]}]
    E0♯ = λ_. ∅
  | Q0♯ = {(fn f => f ccn1 (fn a1 => f ccn2 (fn a2 => kp a2))) (fn x, kx => kx x)}
    R0♯ = [kr ↦ {stop}, kp ↦ {[fn vr => kr vr]}]

1 | C1 = C0 ∪ {let a1=f cn1 in let a2=f cn2 in a2},  F1♯ = F0♯,  E1♯ = E0♯ ∪̇ [f ↦ {[fn x => x]}]
  | Q1♯ = Q0♯ ∪ {f ccn1 (fn a1 => f ccn2 (fn a2 => kp a2))},  R1♯ = R0♯ ∪̇ [f ↦ {[fn x, kx => kx x]}]

2 | C2 = C1 ∪ {x},  F2♯ = F1♯ ∪̇ [[x]≡ ↦ {[a1, let a2=f cn2 in a2]}],  E2♯ = E1♯ ∪̇ [x ↦ {cn1}]
  | Q2♯ = Q1♯ ∪ {kx x},  R2♯ = R1♯ ∪̇ [kx ↦ {[fn a1 => f ccn2 (fn a2 => kp a2)]}, x ↦ {ccn1}]

3 | C3 = C2 ∪ {let a2=f cn2 in a2},  F3♯ = F2♯,  E3♯ = E2♯ ∪̇ [a1 ↦ {cn1}]
  | Q3♯ = Q2♯ ∪ {f ccn2 (fn a2 => kp a2)},  R3♯ = R2♯ ∪̇ [a1 ↦ {ccn1}]

4 | C4 = C3,  F4♯ = F3♯ ∪̇ [[x]≡ ↦ {[a1, let a2=f cn2 in a2], [a2, a2]}],  E4♯ = E3♯ ∪̇ [x ↦ {cn1, cn2}]
  | Q4♯ = Q3♯,  R4♯ = R3♯ ∪̇ [kx ↦ {[fn a1 => f ccn2 (fn a2 => kp a2)], [fn a2 => kp a2]}, x ↦ {ccn1, ccn2}]

5 | C5 = C4 ∪ {a2},  F5♯ = F4♯,  E5♯ = E4♯ ∪̇ [a1 ↦ {cn1, cn2}, a2 ↦ {cn1, cn2}]
  | Q5♯ = Q4♯ ∪ {kp a2},  R5♯ = R4♯ ∪̇ [a1 ↦ {ccn1, ccn2}, a2 ↦ {ccn1, ccn2}]

6 | C6 = C5 ∪ {xr},  F6♯ = F5♯,  E6♯ = E5♯ ∪̇ [xr ↦ {cn1, cn2}]
  | Q6♯ = Q5♯ ∪ {kr vr},  R6♯ = R5♯ ∪̇ [vr ↦ {ccn1, ccn2}]

7 | C7 = C6,  F7♯ = F6♯,  E7♯ = E6♯
  | Q7♯ = Q6♯,  R7♯ = R6♯

Table 1: Analysis traces of let f=fn x => x in let a1=f cn1 in let a2=f cn2 in a2 and its CPS-transformed counterpart
• For the program p:
    {p} ⊆ C
    {[xr, xr]} ⊆ F♯([p]≡)
    {stop} ⊆ F♯([xr]≡)

• For each return expression t and non-tail call let x=t0 t1 in s′ in p:
    {t} ⊆ C ∧ {[x, s′]} ⊆ F♯([t]≡) ⇒ {s′} ⊆ C ∧ µsym(t, E♯) ⊆ E♯(x)

• For each let-binding let x=t in s in p:
    {let x=t in s} ⊆ C ⇒ {s} ⊆ C ∧ µsym(t, E♯) ⊆ E♯(x)

• For each tail call t0 t1 and function fn x => s′ in p:
    {t0 t1} ⊆ C ∧ {[fn x => s′]} ⊆ µsym(t0, E♯) ⇒ {s′} ⊆ C ∧ F♯([t0 t1]≡) ⊆ F♯([s′]≡) ∧ µsym(t1, E♯) ⊆ E♯(x)

• For each non-tail call let x=t0 t1 in s and function fn y => s′ in p:
    {let x=t0 t1 in s} ⊆ C ∧ {[fn y => s′]} ⊆ µsym(t0, E♯) ⇒ {s′} ⊆ C ∧ {[x, s]} ⊆ F♯([s′]≡) ∧ µsym(t1, E♯) ⊆ E♯(y)

Here we partially evaluate the helper function, i.e., we interpret it symbolically at constraint-generation time, generating a lookup for variables and a singleton for constants and lambda expressions. The definition of the symbolic helper function µsym otherwise coincides with the abstract helper function µ♯. We may generate constraints {[fn x => s]} ⊆ {[fn y => s′]} of a form not covered by the above grammar. We therefore first preprocess the constraints in linear time, removing vacuously true inclusions {[fn x => s]} ⊆ {[fn x => s]} from each constraint, and removing constraints containing vacuously false preconditions {[fn x => s]} ⊆ {w♯}, where [fn x => s] ≠ w♯.
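The conditional constraints can be solved by a standard least-fixed-point iteration. The following OCaml sketch is our own illustration, not the authors' prototype; it uses the naive "iterate until no change" strategy rather than the worklist algorithm cited below, and it assumes that constraint variables and abstract values have both been encoded as integers.

  (* A naive solver for conditional constraints of the form
       {u1} ⊆ x1 ∧ ... ∧ {un} ⊆ xn  ⇒  lhs ⊆ x
     over finite powersets. *)
  module IS = Set.Make (Int)

  type lhs = Lit of IS.t | Var of int
  type constr = { pre : (int * int) list;   (* (value, variable) pairs *)
                  lhs : lhs;
                  rhs : int }               (* variable index *)

  let solve (nvars : int) (cs : constr list) : IS.t array =
    let sol = Array.make nvars IS.empty in
    let lookup = function Lit s -> s | Var v -> sol.(v) in
    let changed = ref true in
    while !changed do
      changed := false;
      List.iter
        (fun c ->
          (* Fire the constraint only when all preconditions hold. *)
          if List.for_all (fun (u, x) -> IS.mem u sol.(x)) c.pre then begin
            let extra = IS.diff (lookup c.lhs) sol.(c.rhs) in
            if not (IS.is_empty extra) then begin
              sol.(c.rhs) <- IS.union sol.(c.rhs) extra;
              changed := true
            end
          end)
        cs
    done;
    sol

Since every iteration either adds an element to some solution set or terminates the loop, the iteration reaches the least solution after finitely many passes over the constraints.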
The resulting constraint system is formally equivalent to the control-flow analysis, in the sense that all solutions yield correct control-flow information and that the best (smallest) solution of the constraints is as precise as the information computed by the analysis. More formally:

Theorem 6.1. A solution to the CFA constraints of program p is a safe approximation of the least fixpoint of the analysis function F♯ induced by p. Furthermore, the least solution to the CFA constraints is equal to the least fixpoint of F♯.

Implemented naively, a single constraint may take O(n) space on its own. However, by using pointers, or by labelling each sub-expression and using the pointer/label instead of the sub-expression itself, a single constraint takes only constant space. By determining a representative for each sub-expression in linear time, by generating O(n²) constraints, by linear post-processing, and by iteratively solving the constraints using a well-known algorithm [Palsberg and Schwartzbach, 1995, Nielson et al., 1999], we can compute the analysis in worst-case O(n³) time.

The extracted constraints bear similarities to existing constraint-based analyses in the literature. Consider, e.g., calls t0 t1, which usually give rise to two conditional constraints [Palsberg, 1995, Nielson et al., 1999]: (1) {[fn x => s′]} ⊆ Ĉ(t0) ⇒ Ĉ(t1) ⊆ Ê(x) and (2) {[fn x => s′]} ⊆ Ĉ(t0) ⇒ Ĉ(s′) ⊆ Ĉ(t0 t1). The first constraint resembles our third constraint for tail calls. The second "return constraint" differs in that it has an outside-in (or caller-restore) nature, i.e., propagation of return flow from the function body is handled at the call site. The extracted reachability constraints are similar to those of Gasser et al. [1997] (modulo an isomorphic encoding ℘(C) ≃ C → ℘({on}) of powersets).

7. Conclusion

We have presented a control-flow analysis determining the interprocedural control flow of both calls and returns for a direct-style language. Existing CFAs have focused on analysing which functions are called at a given call site. In contrast, the systematic derivation of our CFA has led to an analysis that provides extra information about where a function returns to, at no additional cost. In the presence of tail-call optimization, such information enables the creation of more precise call graphs.

The analysis was developed systematically using Galois connection-based abstract interpretation of a standard operational semantics for that language: the CaEK abstract machine of Flanagan et al. In addition to being more principled, such a formulation of the analysis is pedagogically pleasing, since monomorphism of the analysis is made explicit through two Galois connections: one literally merges all bindings to the same variable and one merges all calling contexts of the same function.

The analysis has been shown to provide a result equivalent to what can be obtained by first CPS transforming the program and then running a control-flow analysis derived from a CPS-based operational semantics. This extends previous results obtained by Damian and Danvy, and by Palsberg and Wand. The close correspondence between the ways the analyses operate (as illustrated by the analysis traces in Table 1) leads us to conjecture that such equivalence results can be obtained for other CFAs derived using abstract interpretation.

The functional, derived by abstract interpretation, that defines the analysis may appear rather complex at first glance. As a final result, we have shown how to extract from the analysis an equivalent constraint-based formulation expressed in terms of the more familiar conditional constraints. Nevertheless, we stress that the derived functional can be used directly to implement the analysis. We have developed a prototype implementation of the resulting analysis in OCaml, available at http://www.brics.dk/~jmi/ANF-CFA/.

The analysis has been developed for a minimalistic functional language in order to be able to focus on the abstraction of the control structure induced by function calls and returns. An obvious extension is to enrich the language with numerical operators and study how our Galois connections interact with abstractions such as the interval or polyhedral abstraction of numerical entities.

The calculations involved in the derivation of a CFA are lengthy and would benefit enormously from some form of machine support. Certified abstract interpretation [Pichardie, 2005, Cachera et al., 2005] has so far focused on proving the correctness of an analysis inside a proof assistant, by using the concretization (γ) component of the Galois connection to prove the correctness of an already defined analysis. Further work should investigate whether proof assistants such as Coq are suitable for conducting the kind of reasoning developed in this paper in a machine-checkable way.

Acknowledgments

The authors thank Matthew Fluet, Amr Sabry, Mitchell Wand, Daniel Damian, Olivier Danvy, and the anonymous referees for comments on earlier versions. Part of this work was done with the support of the Carlsberg Foundation.

References

J. M. Ashley and R. K. Dybvig. A practical and flexible flow analysis for higher-order languages. ACM Transactions on Programming Languages and Systems, 20(4):845–868, 1998.

A. E. Ayers. Efficient closure analysis with reachability. In M. Billaud, P. Castéran, M.-M. Corsini, K. Musumbu, and A. Rauzy, editors, Actes WSA'92 Workshop on Static Analysis, Bigre, pages 126–134, Bordeaux, France, Sept. 1992. Atelier Irisa, IRISA, Campus de Beaulieu.

D. Cachera, T. Jensen, D. Pichardie, and V. Rusu. Extracting a data flow analyser in constructive logic. Theoretical Computer Science, 342(1):56–78, 2005.

W. D. Clinger. Proper tail recursion and space efficiency. In K. D. Cooper, editor, Proc. of the ACM SIGPLAN 1998 Conference on Programming Languages Design and Implementation, pages 174–185, Montréal, Canada, June 1998.

C. Consel and O. Danvy. For a better support of static data flow. In J. Hughes, editor, Proc. of the Fifth ACM Conference on Functional Programming and Computer Architecture, volume 523 of LNCS, pages 496–519, Cambridge, Massachusetts, Aug. 1991. Springer-Verlag.

P. Cousot. The calculational design of a generic abstract interpreter. In M. Broy and R. Steinbrüggen, editors, Calculational System Design. NATO ASI Series F. IOS Press, Amsterdam, 1999.

P. Cousot. Semantic foundations of program analysis. In S. S. Muchnick and N. D. Jones, editors, Program Flow Analysis: Theory and Applications, chapter 10, pages 303–342. Prentice-Hall, 1981.

P. Cousot and R. Cousot. Abstract interpretation of algebraic polynomial systems. In M. Johnson, editor, Proc. of the Sixth International Conference on Algebraic Methodology and Software Technology, AMAST '97, volume 1349 of LNCS, pages 138–154, Sydney, Australia, Dec. 1997. Springer-Verlag.

P. Cousot and R. Cousot. Higher-order abstract interpretation (and application to comportment analysis generalizing strictness, termination, projection and PER analysis of functional languages), invited paper. In H. Bal, editor, Proc. of the Fifth IEEE International Conference on Computer Languages, pages 95–112, Toulouse, France, May 1994.

P. Cousot and R. Cousot. Abstract interpretation frameworks. Journal of Logic and Computation, 2(4):511–547, Aug. 1992a.

P. Cousot and R. Cousot. Abstract interpretation and application to logic programs. Journal of Logic Programming, 13(2–3):103–179, 1992b.

P. Cousot and R. Cousot. Systematic design of program analysis frameworks. In B. K. Rosen, editor, Proc. of the Sixth Annual ACM Symposium on Principles of Programming Languages, pages 269–282, San Antonio, Texas, Jan. 1979.
D. Damian and O. Danvy. Syntactic accidents in program analysis: On the impact of the CPS transformation. Journal of Functional Programming, 13(5):867–904, 2003. A preliminary version was presented at the 2000 ACM SIGPLAN International Conference on Functional Programming.

O. Danvy. Three steps for the CPS transformation. Technical Report CIS-92-2, Kansas State University, Manhattan, Kansas, Dec. 1991.

B. A. Davey and H. A. Priestley. Introduction to Lattices and Order. Cambridge University Press, Cambridge, England, second edition, 2002.

S. K. Debray and T. A. Proebsting. Interprocedural control flow analysis of first-order programs with tail-call optimization. ACM Transactions on Programming Languages and Systems, 19(4):568–585, 1997.

C. Flanagan, A. Sabry, B. F. Duba, and M. Felleisen. The essence of compiling with continuations. In D. W. Wall, editor, Proc. of the ACM SIGPLAN 1993 Conference on Programming Languages Design and Implementation, pages 237–247, Albuquerque, New Mexico, June 1993.

K. L. S. Gasser, F. Nielson, and H. R. Nielson. Systematic realisation of control flow analyses for CML. In M. Tofte, editor, Proc. of the Second ACM SIGPLAN International Conference on Functional Programming, pages 38–51, Amsterdam, The Netherlands, June 1997.

R. Giacobazzi, F. Ranzato, and F. Scozzari. Making abstract interpretations complete. J. ACM, 47(2):361–416, 2000.

N. Heintze. Set-based program analysis of ML programs. In C. L. Talcott, editor, Proc. of the 1994 ACM Conference on Lisp and Functional Programming, LISP Pointers, Vol. VII, No. 3, pages 306–317, Orlando, Florida, June 1994.

N. D. Jones. Flow analysis of lambda expressions (preliminary version). In S. Even and O. Kariv, editors, Automata, Languages and Programming, 8th Colloquium, Acre (Akko), volume 115 of LNCS, pages 114–128, Israel, July 1981. Springer-Verlag.

J. Midtgaard. Control-flow analysis of functional programs. Technical Report BRICS RS-07-18, Dept. of Comp. Sci., University of Aarhus, Aarhus, Denmark, Dec. 2007. Accepted for publication in ACM Computing Surveys.

J. Midtgaard and T. Jensen. A calculational approach to control-flow analysis by abstract interpretation. In M. Alpuente and G. Vidal, editors, Static Analysis, 15th International Symposium, SAS 2008, volume 5079 of LNCS, pages 347–362, Valencia, Spain, July 2008a. Springer-Verlag.

J. Midtgaard and T. P. Jensen. Control-flow analysis of function calls and returns by abstract interpretation. Rapport de Recherche RR-6681, INRIA Rennes – Bretagne Atlantique, Oct. 2008b.

M. Might and O. Shivers. Environmental analysis via ∆CFA. In S. Peyton Jones, editor, Proc. of the 33rd Annual ACM Symposium on Principles of Programming Languages, pages 127–140, Charleston, South Carolina, Jan. 2006.

R. Milner and M. Tofte. Co-induction in relational semantics. Theoretical Computer Science, 87(1):209–220, 1991.

F. Nielson and H. R. Nielson. Infinitary control flow analysis: a collecting semantics for closure analysis. In N. D. Jones, editor, Proc. of the 24th Annual ACM Symposium on Principles of Programming Languages, pages 332–345, Paris, France, Jan. 1997.

F. Nielson, H. R. Nielson, and C. Hankin. Principles of Program Analysis. Springer-Verlag, 1999.

H. R. Nielson and F. Nielson. Flow logic: a multi-paradigmatic approach to static analysis. In T. Æ. Mogensen, D. A. Schmidt, and I. H. Sudborough, editors, The Essence of Computation: Complexity, Analysis, Transformation. Essays Dedicated to Neil D. Jones, volume 2566 of LNCS, pages 223–244. Springer-Verlag, 2002.

J. Palsberg. Closure analysis in constraint form. ACM Transactions on Programming Languages and Systems, 17(1):47–62, 1995.

J. Palsberg and M. I. Schwartzbach. Safety analysis versus type inference. Information and Computation, 118(1):128–141, 1995.

J. Palsberg and M. Wand. CPS transformation of flow information. Journal of Functional Programming, 13(5):905–923, 2003.

D. Pichardie. Interprétation abstraite en logique intuitioniste: extraction d'analyseurs Java certifiés. PhD thesis, Université de Rennes 1, Sept. 2005.

A. Sabry and M. Felleisen. Is continuation-passing useful for data flow analysis? In V. Sarkar, editor, Proc. of the ACM SIGPLAN 1994 Conference on Programming Languages Design and Implementation, pages 1–12, Orlando, Florida, June 1994.

P. Sestoft. Replacing function parameters by global variables. In J. E. Stoy, editor, Proc. of the Fourth International Conference on Functional Programming and Computer Architecture, pages 39–53, London, England, Sept. 1989.

O. Shivers. Control-flow analysis in Scheme. In M. D. Schwartz, editor, Proc. of the ACM SIGPLAN 1988 Conference on Programming Languages Design and Implementation, pages 164–174, Atlanta, Georgia, June 1988.

F. Spoto and T. P. Jensen. Class analyses as abstract interpretations of trace semantics. ACM Transactions on Programming Languages and Systems, 25(5):578–630, 2003.

A. Underlying mathematical material

This section is based on known material [Cousot and Cousot, 1979, Cousot, 1981, Cousot and Cousot, 1992b, 1994, Davey and Priestley, 2002].

A complete lattice is a partially ordered set ⟨C; ⊑, ⊥, ⊤, ⊔, ⊓⟩ (poset), such that the least upper bound ⊔S and the greatest lower bound ⊓S exist for every subset S of C. ⊥ = ⊓C denotes the infimum of C and ⊤ = ⊔C denotes the supremum of C. The set of total functions D → C whose co-domain is a complete lattice ⟨C; ⊑, ⊥, ⊤, ⊔, ⊓⟩ is itself a complete lattice ⟨D → C; ⊑̇, ⊥̇, ⊤̇, ⊔̇, ⊓̇⟩ under the pointwise ordering f ⊑̇ f′ ⟺ ∀x. f(x) ⊑ f′(x), with bottom, top, join, and meet extended similarly. The powerset ℘(S) of a set S, ordered by set inclusion, is a complete lattice ⟨℘(S); ⊆, ∅, S, ∪, ∩⟩.

A Galois connection is a pair of functions α, γ between two posets ⟨C; ⊑⟩ and ⟨A; ≤⟩ such that for all a ∈ A, c ∈ C : α(c) ≤ a ⟺ c ⊑ γ(a). Equivalently, a Galois connection can be defined as a pair of functions satisfying (a) α and γ are monotone, (b) α ◦ γ is reductive, and (c) γ ◦ α is extensive. Galois connections are typeset ⟨C; ⊑⟩ ⇄(α, γ) ⟨A; ≤⟩; we omit the orderings when they are clear from the context. For a Galois connection between two complete lattices, α is a complete join morphism (CJM) and γ is a complete meet morphism. The composition of two Galois connections ⟨C; ⊑⟩ ⇄(α1, γ1) ⟨B; ⊆⟩ and ⟨B; ⊆⟩ ⇄(α2, γ2) ⟨A; ≤⟩ is itself a Galois connection ⟨C; ⊑⟩ ⇄(α2 ◦ α1, γ1 ◦ γ2) ⟨A; ≤⟩. Galois connections in which α is surjective (or equivalently γ is injective), and Galois connections in which γ is surjective (or equivalently α is injective), are distinguished typographically in the figures. When both α and γ are surjective, the two domains are isomorphic.

A(n upper) closure operator ρ is a map ρ : S → S on a poset ⟨S; ⊑⟩ that is (a) monotone (for all s, s′ ∈ S : s ⊑ s′ =⇒ ρ(s) ⊑ ρ(s′)), (b) extensive (for all s ∈ S : s ⊑ ρ(s)), and (c) idempotent (for all s ∈ S : ρ(s) = ρ(ρ(s))). A closure operator ρ induces a Galois connection ⟨S; ⊑⟩ ⇄(ρ, 1) ⟨ρ(S); ⊑⟩, writing ρ(S) for {ρ(s) | s ∈ S} and 1 for the identity function. Furthermore, the image of a complete lattice ⟨C; ⊑, ⊥, ⊤, ⊔, ⊓⟩ under an upper closure operator is itself a complete lattice ⟨ρ(C); ⊑, ρ(⊥), ⊤, λX. ρ(⊔X), ⊓⟩.
Automatically RESTful Web Applications
Marking Modular Serializable Continuations

Jay McCarthy
Brigham Young University
[email protected]
Abstract

Continuation-based Web servers provide distinct advantages over traditional Web application development: expressive power and modularity. This power leads to fewer errors and more interesting applications. Furthermore, these Web servers are more than prototypes; they are used in some real commercial applications. Unfortunately, they pay a heavy price for the additional power in the form of lack of scalability. We fix this key problem with a modular program transformation that produces scalable, continuation-based Web programs based on the REST architecture. Our programs use the same features as non-scalable, continuation-based Web programs, so we do not sacrifice expressive power for performance. In particular, we allow continuation marks in Web programs. Our system uses 10 percent (or less) of the memory required by previous approaches.

Categories and Subject Descriptors  D.3.3 [Language Constructs and Features]: Control structures

General Terms  Languages, Performance, Theory

Keywords  Continuations, Stack Inspection, Web Applications

1. Introduction

The functional programming community has inspired Web application developers with the insight that Web interaction corresponds to continuation invocation (Hughes 2000; Queinnec 2000; Graham 2001). This insight helps to explain what Web applications do and when ad hoc continuation capture patterns are erroneous (Krishnamurthi et al. 2007). This understanding leads to more correct Web applications.

Many continuation-based Web development frameworks try to apply this insight directly by automatically capturing continuations for Web application authors. These frameworks are often frustrated because they limit modularity, are not scalable, or only achieve correct continuation capture without more expressive power. Whole-program compilation systems are unusable by real-world developers because they sacrifice modularity and interaction with third-party libraries for performance and formal elegance (Matthews et al. 2004; Cooper et al. 2006). Modular compilation systems are unattractive to real-world developers when they do not add expressive power (Pettyjohn et al. 2005; Thiemann 2006). Continuation-based Web servers are unusable by real-world developers, though they add more expressive power, because they are inherently not scalable (Ducasse et al. 2004; Krishnamurthi et al. 2007).

This paper presents a modular program transformation that produces scalable Web applications and offers more expressive power than existing modular systems. Web applications written using our system can offload all state to clients (the gold standard of scalability) or, if necessary, keep state on the server and use ten times less memory.

2. Background

The largest problem Web developers solve is imposed by the statelessness of HTTP: when a server responds to a client's request, the connection is closed and the Web program on the server exits. When the client makes a subsequent request, the request delivered to the server must contain enough information to resume the computation. The insight of functional programmers is that this information is the continuation. Traditional Web programmers call this representational state transfer (REST) (Fielding and Taylor 2002).¹ It is naturally scalable due to the lack of per-session state. Session state is poison to scalability because each session has an absolute claim on server resources. There is no sound way to reclaim space since dormant sessions may reactivate at any time. This is clearly inefficient. Consequently, unsafe and ad hoc resource policies, like timeouts, are used to restore some scalability.

The scalability of REST also comes at a price in the form of programming difficulty. We will demonstrate this by porting a calculator to the Web.²

¹ Unfortunately, modern Web programmers have forgotten this definition of REST. They use the acronym REST to refer to a resource-based URL structure where database operations (like create, replace, update, and delete) are mapped to suggestive combinations of URLs and HTTP request types (like PUT, POST, GET, and DELETE). We will not use REST in this way.
² All program examples are written in PLT Scheme.

  (define (calc)
    (display (+ (prompt "First:")
                (prompt "Second:"))))

We must break this single coherent function into three different functions to create a Web application. Each function represents a distinct part of the control flow of the application: entering the first number, entering the second, and displaying their sum.

  (define (calc)
    (web-prompt "First:" 'get-2nd #f))
  (define (get-2nd first)
    (web-prompt "Second:" 'sum first))
  (define (sum first second)
    (display/web (+ second first)))
(define (calc) ;; new-session allocates server-side storage (web-prompt "First:" ’get-2nd (new-session)))
(define (fact n) (if (zero? n) (begin (display (c-c-m ’fact)) 1) (w-c-m ’fact n (∗ n (fact (sub1 n)))))) (fact 3) → console output: (1 2 3) computed value: 6
(define (get-2nd session-id first) ;; session-set! modifies server-side storage (session-set! session-id ’first first) (web-prompt "Second:" ’sum session-id)) (define (sum session-id second) ;; session-lookup references server-side storage (define first (session-lookup session-id ’first)) (display/web (+ second first)))
(define (fact-tr n a) (if (zero? n) (begin (display (c-c-m ’fact)) a) (w-c-m ’fact n (fact-tr (sub1 n) (∗ n a))))) (fact 3) → console output: (1) computed value: 6
Figure 1. REST Without All the REST
The continuation is encoded by the developer in the second argument of web-prompt and the free variables of the continuation in the third argument. Unfortunately, it is tiresome in most Web programming environments to marshal all data to the client and back but convenient to access the server-side store (through session objects and databases), so developers use this naturally REST ful style in an entirely nonREST ful way. (See Figure 1.) The Web’s REST ful style is a form of continuation-passing style (CPS) (Fischer 1972). There are well-known transformations from direct style code into CPS that allow Web applications to be written in the natural way but converted into the scalable style by the server before execution. Matthews et al. (2004) gave an automatic translation from direct style programs into traditional Web applications. This tool performs a CPS transformation, λ-lifting, defunctionalization, and a store-passing style transformation (to capture the store as a cookie value). These direct style Web applications are entirely REST ful because the lexical environment and store are transferred to the user between interactions. Unfortunately, the CPS transformation is not modular; the entire code-base, including libraries, must be transformed. Thus, this technique is not feasible in applications that rely on unmodifiable libraries or separate compilation. The PLT Web Server by Krishnamurthi et al. (2007) does not have this problem. It enables direct style Web applications written in PLT Scheme through first-class continuations. These implicit continuations avoid the CPS transformation and thereby provide modularity. However, the PLT implementation technique sacrifices the REST architecture. Continuations (and the environments they close over) in PLT Scheme cannot be serialized into an external format or transferred to the client. Thus, the PLT Web Server stores continuations in the server’s memory and provides the client with a unique identifier for each continuation. These continuations are per-session server state, and their unique identifiers are new GC roots. Because there is no sound way to reclaim these continuations, they must be retained indefinitely or unsoundly deleted. The memory problems associated with this un-REST ful policy are well known. For example, a recent ICFP experience report (Welsh and Gurnell 2007) concurs with our experience managing the C ONTINUE service (Krishnamurthi 2003) by reporting unreasonable memory usage. C ONTINUE is a Web application for paper submissions, reviews, and PC meetings, so there is no intrinsic reason for this memory usage. We have experimented with a number of stopgap strategies, such as explicit continuation management through the primitive send/forward (Krishnamurthi et al.
Figure 2. Factorial with Continuation Marks
2007) and a Least Recently Used (LRU) continuation management strategy. While useful remedies for some symptoms, they are not solutions. In contrast, the work presented herein reduces memory consumption by ten times for these same Web applications. Despite its memory problems, the PLT Web Server provides a valuable Web application framework, in part because of its expressive features, like continuation marks. Many programming languages and environments allow access to the runtime stack in one way or another. Examples include Java security through stack inspection, privileged access for debuggers in .NET, and exception handlers in many languages. An abstraction of all these mechanisms is provided by PLT Scheme in continuation marks (Clements et al. 2001). Using the with-continuation-mark (w-c-m) language form, a developer can attach values to the control stack. Later, the stack-walking primitive current-continuation-marks (c-c-m) can retrieve those values from the stack. Continuation marks are parameterized by keys and do not interfere with Scheme's tail-calling requirements. These two mechanisms allow marks to be used without interfering with existing code. A pedagogic example of continuation mark usage is presented in Figure 2. fact is the factorial function with instrumentation using continuation marks: w-c-m records function arguments on the stack and c-c-m collects them in the base case. fact-tr is a tail-recursive version of factorial that appears to be an identical usage of continuation marks, but because they preserve tail-calling space usage, the intermediate marks are overwritten, leaving only the final mark. Continuation marks are useful in all programs, but are particularly useful on the Web. If the control stack is isomorphic to the user's position in the application, then continuation marks can be used to record information about where the user has gone and is going. For example, in CONTINUE we use a continuation mark to hold the identity of the user. The essence of this technique is shown in Figure 3. This mark is stored on login and retrieved inside display-site for tasks like paper rating. This is more convenient than threading the state throughout the application and allows a trivial implementation of user masquerading, so an administrator can debug a user's problems, and delegation, so a reviewer can assign a sub-reviewer limited powers. Web application developers are torn between the REST architecture, direct style code, modularity, and expressive features, like
(define (resume l x) (if (empty? l) x (let ([k (first l)] [l (rest l)]) (k (w-c-m k (resume l x))))))
(define (start-server ireq) (w-c-m current-user (show-login) (display-site))) (define (who-am-i) (first (c-c-m current-user)))
This transformation produces RESTful Web applications, because standard modular λ-lifting and defunctionalization transformations encode all values into serializable representations that can be sent to the client. The great irony of the Pettyjohn et al. (2005) transformation is that, while it shows the immense power of continuation marks, it does not support continuation marks in the input language; it cannot be used for Web applications that use marks themselves. Furthermore, it is not trivial to add support for continuation marks in the transformation: a semantic insight is necessary—this is our formal contribution, in addition to the other practical extensions we have made.
(define (delegate email paper) (w-c-m current-user (list ’delegate paper (who-am-i)) (email-continuation-url email) (display-paper paper))) (define (masquerade user) (if (symbol=? (who-am-i) ’admin) (w-c-m current-user user (display-site)) (access-denied))) Figure 3. Marks in Web Applications
3.1 Capturing Marks The most intuitive strategy for supporting marks is to simply capture “all the continuation marks” whenever the marks that record the continuation are captured. However, this is not possible. Continuation marks are parameterized by a key. When a developer uses c-c-m, she or he must provide a key—and only marks associated with that key are returned. If a mark is added with a unique, or unknowable, key, such as a random value or uninterned symbol (a “gensym”), then it cannot be extracted by the context it wraps. This is an essential feature of continuation marks: they are invisible to the uninitiated. Without this property, the behavior of an expression could drastically change with a change to its context: new results could mysteriously return from c-c-m without any explanation. This would have a dynamic scope level of grotesqueness to it. We must record the continuation marks, as we record the continuation components, so they can be extracted when performing continuation capture. A simple strategy is to transform all instances of
continuation marks. In this paper, we present a modular program transformation that automatically produces RESTful versions of direct style Web programs that utilize continuation marks.
3. Transformation Intuition Our modular REST ful transformation is based on one from Pettyjohn et al. (2005). Unfortunately, their transformation does not support continuation marks in the input language, so it is not sufficient for our purposes. Our transformation is structurally similar to theirs, so we review their transformation before turning to our contribution. The Pettyjohn et al. (2005) transformation relies on the modular Administrative Normal Form (ANF) transformation (Flanagan et al. 2004) and stack inspection to simulate call/cc. ANF is a canonical form that requires all function arguments to be named. This has the implication that the entire program is a set of nested let expressions with simple function calls for bodies. If the lets are expanded into λs, then the continuation of every expression is syntactically obvious. Any expression can be modularly transformed into ANF without modifying the rest of the program. The main insight of Pettyjohn et al. (2005) was that c-c-m can “capture” the continuation, just like call/cc, if the components of the continuation are installed via w-c-m. Their transformation does this by duplicating the continuation into marks. This is easy because ANF makes these continuations obvious, and the tail-calling property of marks mirrors that of continuations themselves, so the two stay synchronized. Their work is a testament to the power of continuation marks; we will review the essence of the transformation. Function applications, like (k a), are transformed as
(w-c-m k v e) into (w-c-m k v (w-c-m (cons k v) e)) where is a key known only to the transformation. It seems straightforward to adapt the transformation so call/cc captures and restores these marks as well (e (let ([ks (c-c-m )] [cms (c-c-m )]) (λ (x) (abort (re-mark cms (λ () (resume ks x))))))) where re-mark is similar to resume: (define (re-mark l e) (if (empty? l) (e) (let∗ ([cm (first l)] [l (rest l)] [m (car cm)] [v (cdr cm)]) (w-c-m m v (w-c-m (cons m v) (re-mark l e)))))) While simple and elegant, these strategies are incorrect.
(k (w-c-m k a))
3.2 Reinstalling Marks
where is a special key known only to the transformation. This effectively duplicates the continuation in a special mark. Then (call/cc e) is transformed as
The first problem is that resume and re-mark do not interact correctly. Consider the following program: (f (w-c-m k1 v1 (g (call/cc e))))
(e (let ([ks (c-c-m )]) (λ (x) (abort (resume ks x)))))
This is transformed into (f (w-c-m f (w-c-m k1 v1 (w-c-m (cons k1 v1) (g (w-c-m g (e (let ([ks (c-c-m )] [cms (c-c-m )])
where resume restores the continuation record from the marks into an actual control stack. resume must also reinstall the marks so subsequent invocations of call/cc are correct.
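To make the duplication concrete, the following is a minimal, hypothetical PLT Scheme sketch of the instrumented output for a single application; cont-key stands in for the transformation's private mark key, and the full primitive names are used instead of the w-c-m and c-c-m abbreviations.
(define cont-key 'cont-key) ; hypothetical stand-in for the private key
(define (f n)
  ;; what c-c-m would see: the continuation components recorded so far
  (display (continuation-mark-set->list (current-continuation-marks) cont-key))
  (newline)
  (* n n))
(define k1 (λ (x) (add1 x))) ; the continuation of (f 5) once the program is in ANF
(k1 (with-continuation-mark cont-key k1
      (f 5)))
;; prints a one-element list containing k1, then evaluates to 26
The mark mirrors the pending application of k1, so a stack walk recovers enough information for resume to rebuild the context.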
e  ::=  a | (w e) | (letrec ([σ v]) e) | (w-c-m a a e) | (c-c-m [a]) | (match w l) | (abort e) | (call/cc w)
l  ::=  [(K x) e]
a  ::=  w | (K a)
w  ::=  v | x
v  ::=  (λ (x) e) | (K v) | σ | κ.E
x ∈ Variables,  σ ∈ References,  where Variables ∩ References = ∅
EΔ  ::=  (w-c-m v v Ev,Δ) where v ∉ Δ  |  [ ]  |  (v E)
Σ  ::=  ∅ | Σ[σ → v]
Figure 4. SL Syntax
(λ (x) (abort (re-mark cms (λ () (resume ks x))))))))))))) If e calls the continuation with a value, x, it reduces to (w-c-m k1 v1 (w-c-m (cons k1 v1) (f (w-c-m f (g (w-c-m g x)))))) The mark for k1 is lifted outside of f . The problem is that even though the and marks are recorded in the same stack frames, they are collected and installed separately: the s are put before the s. We can collect these together by extracting multiple keys at once. We correct the transformation of (call/cc e) as (e (let ([k∗cms (c-c-m )]) (λ (x) (abort (resume k∗cms x))))) When c-c-m is given n arguments, the marks are returned as a list of frames, where each frame is a list of length n where (list-ref l i) is the value of associated with the i argument or #f if none exists. Naturally, resume must combine the previous resume and re-mark operations.
(define (resume l x) (if (empty? l) x (let∗ ([M (car l)] [l (cdr l)] [k (car M)] [cm (cadr M)]) (cond [(and k (not cm)) (k (w-c-m k (resume l x)))] [(and (not k) cm) (let ([m (car cm)] [v (cdr cm)]) (w-c-m m v (w-c-m cm (resume l x))))] [else (resume (list∗ (list k #f) (list #f cm) l) x)]))))
We've lost the record of the k1 mark in the mark. One solution is to maintain a map from keys to values in marks and explicitly update that map with the continuation mark transformation. For example, we will transform (w-c-m k v e) into (w-c-m k v (c-w-i-c-m (λ (cms) (w-c-m (map-set cms k v) e)) empty)) where (c-w-i-c-m key-v proc default-v) (c-w-i-c-m = call-with-immediate-continuation-mark) calls proc with the value associated with key-v in the first frame of the current continuation. This is the value that would be replaced if this call were replaced with a call to w-c-m. If no such value exists in the first frame, default-v is passed to proc. The call to proc is in tail position. This function can be implemented using just w-c-m and c-c-m (Clements et al. 2008). After changing resume to operate on mark sets, we have a correct transformation of continuation marks in the input language. The rest of the transformation does not need to change dramatically for the entire PLT Scheme language. Now RESTful Web applications can be written in direct style.
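The map-set operation used above is not spelled out here; a minimal sketch, assuming a frame map is just an association list of key/value pairs, is:
;; Functionally extend a frame map, overwriting any existing entry for
;; the same key, mirroring how w-c-m overwrites a mark within a frame.
(define (map-set cms k v)
  (cond [(null? cms) (list (cons k v))]
        [(equal? (car (car cms)) k) (cons (cons k v) (cdr cms))]
        [else (cons (car cms) (map-set (cdr cms) k v))]))
;; (map-set (map-set '() 'k1 1) 'k1 2) evaluates to '((k1 . 2))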
Even though the marks are now in the correct order, there is still an error in our transformation. 3.3 The Algebra of Marks Consider the transformation of the following program: (w-c-m k1 v1 (w-c-m k2 v2 e)) This is transformed as (w-c-m k1 v1 (w-c-m (cons k1 v1) (w-c-m k2 v2 (w-c-m (cons k2 v2) e))))
4. Formal Treatment Armed with intuition, we present the continuation (mark) reconstruction transformation formally.
Because continuation marks respect the tail-calling properties of Scheme, if a frame already contains a mark for a key, the mark is overwritten. Thus, the following are equivalent:
4.1 The Source Language
(w-c-m k v (w-c-m k u e)) and (w-c-m k u e)
The language in Figure 4, dubbed SL for source language, is a modified version of A-Normal form (ANF) (Flanagan et al. 2004). It uses λ instead of let. Furthermore, we allow applications of arbitrary length. The language is extended with call/cc, abort, letrec, algebraic datatypes, and continuation marks. It differs from the source language of Pettyjohn et al. (2005) in including continuation marks and abort, which were included in the target language there. Algebraic datatypes are essential to using marks; abort is included for consistency with the target language. An overline, X, denotes zero or more occurrences of X. Instances of algebraic datatypes are created with constructors (K, Km) and destructured with match. We leave the actual set of constructors unspecified, though we assume it contains the standard list constructors cons and nil.
Similarly, marks with different keys share the same frame. Therefore, the following are equivalent: (w-c-m x v (w-c-m y u e)) and (w-c-m y u (w-c-m x v e)) Thus, the transformation is equivalent to (w-c-m k1 v1 (w-c-m k2 v2 (w-c-m (cons k1 v1) (w-c-m (cons k2 v2) e)))) which is equivalent to (w-c-m k1 v1 (w-c-m k2 v2 (w-c-m (cons k2 v2) e)))
(1)  Σ / E[((λ (x) e) v)]  →SL  Σ / E[e[x → v]]
(2)  Σ / E[(match (K v) l)]  →SL  Σ / E[e[x → v]]   where [(K x) e] ∈ l and is unique
(3)  Σ / E[(letrec ([σ v]) e)]  →SL  Σ[σ → v] / E[e]
(4)  Σ / E[(σ v)]  →SL  Σ / E[e[x → v]]   where Σ(σ) = (λ (x) e)
(5)  Σ / E[(match σ l)]  →SL  Σ / E[(match Σ(σ) l)]
(6)  Σ / E[(w-c-m vk v1 Evk[(w-c-m vk v2 e)])]  →SL  Σ / E[(w-c-m vk v2 Evk[e])]   where Evk contains only w-c-ms
(7)  Σ / E[(w-c-m vk v1 v2)]  →SL  Σ / E[v2]
(8)  Σ / E[(c-c-m [v])]  →SL  Σ / E[χv(E, (nil))]
(9)  Σ / E[(abort e)]  →SL  Σ / e
     Σ / E[(call/cc v)]  →SL  Σ / E[(v κ.E)]
     Σ / E[(κ.E′ v)]  →SL  Σ / E′[v]

χvs(E) = χvs(E, (nil))
χvs([], vl) = vl
χvs((v E), vl) = (cons vl χvs(E, (nil)))
χvs((w-c-m vk vv E), vl) = χvs(E, (cons (cons vk vv) vl))   if vk ∈ vs
χvs((w-c-m vk vv E), vl) = χvs(E, vl)   otherwise
Figure 5. SL Semantics

e  ::=  a | (w e) | (letrec ([σ v]) e) | (w-c-m a a e) | (c-c-m [a]) | (match w l) | (abort e)
l  ::=  [(K x) e]
a  ::=  w | (K a)
w  ::=  v | x
v  ::=  (λ (x) e) | (K v) | σ
x ∈ Variables,  σ ∈ References,  where Variables ∩ References = ∅
EΔ  ::=  (w-c-m v v Ev,Δ) where v ∉ Δ  |  [ ]  |  (v E)
Σ  ::=  ∅ | Σ[σ → v]
Figure 6. TL Syntax
4.2 The Target Language The target language (TL) in Figure 6 is identical to SL, except that call/cc has been removed along with the continuation values associated with it. The semantics (Figure 7) is also identical, except for the removal of the rules for continuation capture and application.
The operational semantics is specified via the rewriting system in Figure 5. It is heavily based on the target language semantics of Pettyjohn et al. (2005). The first rule is the standard βv-rewriting rule for call-by-value languages (Plotkin 1975). The second handles algebraic datatypes. Rules 3, 4, and 5 specify the semantics for letrec. Bindings established by letrec are maintained in a global store, Σ. For simplicity, store references (σ) are distinct from identifiers bound in lambda expressions (Felleisen and Hieb 1992). Furthermore, to simplify the syntax for evaluation contexts, store references are treated as values, and dereferencing is performed when a store reference appears in an application (rule 4) or in a match expression (rule 5). The next six rules implement the continuation-related operators. Recall that continuation marks allow for the manipulation of contexts. Intuitively, (w-c-m k v e) installs a mark for the key k associated with the value v into the continuation of the expression e, while (c-c-m [v]) recovers a list of all marks for the keys in v embedded in the current continuation. To preserve proper tail-call semantics, if a rewriting step results in more than one w-c-m of the same key, surrounding the same expression, the outermost mark is replaced by the inner one. Similarly, marks for different keys are considered to be a single location. The mark interleaving requirement is enforced by a grammar for evaluation contexts that consists of one parameterized nonterminal. The parameter of E (Δ) represents the keys that are not allowed to appear in the context. Thus, multiple adjacent w-c-m expressions (of the same key) must be treated as a redex. When such a redex is encountered, the redundant marks are removed, starting with the outermost (rule 6). Marks that surround a value are discarded after the evaluation of the subterm (rule 7). The evaluation of c-c-m employs the function χ to extract the marks of the keys from the evaluation context (rule 8). Marks are extracted in order, such that c-c-m evaluates to a list of lists of pairs of keys and their value, starting with the oldest. The evaluation rules for capturing and applying continuations are standard; abort abandons the context (rule 9) to facilitate reimplementing continuations.
4.3 Replacing Continuation Capture Following Pettyjohn et al. (2005), we define our translation (Figure 9) from SL to TL as CMT , for continuation mark transform. The translation decomposes a term into a context and a redex by the grammar in Figure 8. We prove that the decomposition is unique and thus the translation is well-defined. Lemma 1 (Unique Decomposition). Let e ∈ SL. Either e ∈ a or e = E[r] for some redex r. The translation rules are straightforward, except for application, continuation capture, values and marks. Continuation values are transformed using a variation of the rule for call/cc. call/cc uses c-c-m to reify the context and applies resume to reconstruct the context after a value is supplied. abort is used in the continuationas-closure to escape from the calling context, as the SL semantics does. Continuations rely on the insertion of marks to capture the continuation as it is built. This strategy employs the property of ANF that every continuation is obvious, in that it is the value in the function position of function applications. The translation marks each application, using the mark to record the continuation.
(1)  Σ / E[((λ (x) e) v)]  →TL  Σ / E[e[x → v]]
(2)  Σ / E[(match (K v) l)]  →TL  Σ / E[e[x → v]]   where [(K x) e] ∈ l and is unique
(3)  Σ / E[(letrec ([σ v]) e)]  →TL  Σ[σ → v] / E[e]
(4)  Σ / E[(σ v)]  →TL  Σ / E[e[x → v]]   where Σ(σ) = (λ (x) e)
(5)  Σ / E[(match σ l)]  →TL  Σ / E[(match Σ(σ) l)]
(6)  Σ / E[(w-c-m vk v1 Evk[(w-c-m vk v2 e)])]  →TL  Σ / E[(w-c-m vk v2 Evk[e])]   where Evk contains only w-c-ms
(7)  Σ / E[(w-c-m vk v1 v2)]  →TL  Σ / E[v2]
(8)  Σ / E[(c-c-m [v])]  →TL  Σ / E[χv(E, (nil))]
(9)  Σ / E[(abort e)]  →TL  Σ / e

χvs(E) = χvs(E, (nil))
χvs([], vl) = vl
χvs((v E), vl) = (cons vl χvs(E, (nil)))
χvs((w-c-m vk vv E), vl) = χvs(E, (cons (cons vk vv) vl))   if vk ∈ vs
χvs((w-c-m vk vv E), vl) = χvs(E, vl)   otherwise
Figure 7. TL Semantics

r  ::=  (w) | (letrec ([σ w]) e) | (w-c-m a a w) | (c-c-m [a]) | (match w l) | (abort e) | (call/cc e)
EΔ  ::=  (w-c-m v v Ev,Δ) where v ∉ Δ  |  [ ]  |  (v E)
Figure 8. Translation Decompositions

Variables and Values:
  CMT[x] = x
  CMT[σ] = σ
  CMT[(λ (x) e)] = (λ (x) CMT[e])
  CMT[κ.E] = (kont/ms χ{,}(CMT[E], (nil)))
  CMT[(K a)] = (K CMT[a])
Redexes:
  CMT[(w)] = (CMT[w])
  CMT[(letrec ([σ w]) e)] = (letrec ([σ CMT[w]]) CMT[e])
  CMT[(w-c-m a a w)] = (w-c-m CMT[a] CMT[a] CMT[w])
  CMT[(c-c-m [a])] = (c-c-m [CMT[a]])
  CMT[(match w l)] = (match CMT[w] CMT[l])
  CMT[[(K x) e]] = [(K x) CMT[e]]
  CMT[(abort e)] = (abort CMT[e])
  CMT[(call/cc w)] = (CMT[w] kont)
    where kont = (kont/ms (c-c-m [ ]))
          kont/ms = (λ (m) (λ (x) (abort (resume m x))))
Contexts:
  CMT[[]] = []
  CMT[(w E)] = (K (w-c-m K CMT[E]))   where K = (λ (x) (CMT[w] x))
  CMT[(w-c-m v v E)] = (w-c-m v v (c-w-i-c-m (λ (cms) (w-c-m (map-set cms v v) CMT[E]))))
Compositions:
  CMT[E[r]] = CMT[E][CMT[r]]
Figure 9. Translation from SL to TL

4.4 Correctness
Let
  evalx(p) = v  if ∅/p →∗ v
  evalx(p) = ⊥  if ∅/p →∗ . . .
Theorem 1. CMT[evalSL(p)] = evalTL(CMT[p]) Overview. If a source term reduces in k steps, then its translation will reduce in at least k steps, such that the result of the translation's reduction is the translation of the source's result. This is proved by induction on k. The base case is obvious, but the inductive case must be shown by arguing that TL simulates each step of SL in a finite number of steps. This argument is captured in the next lemma.
Similarly, all continuation marks are recorded with the mark. Later, these marks will be collected by c-c-m and used to reproduce the context. The resume function (Figure 10) is used by the translated program. resume faithfully reconstructs an evaluation context from a list of pairs of continuation functions and mark sets. It traverses the list and recursively applies the functions from the list and reinstalls the marks using restore-marks. It restores the original and marks as well so that the context matches exactly and subsequent call/cc operations will succeed.
Lemma 2 (Simulation). If Σ/E[e] →SL Σ′/E′[e′] then CMT[Σ]/CMT[E[e]] →+TL CMT[Σ′]/CMT[E′[e′]] Overview. This is proved by a case analysis of the →SL relation. It requires additional lemmas that cover the additional
identifier appears in the argument to CMT [], it is not transformed, but left as is, so it could be substituted after the transformation with the CMT [] of the value v.
(letrec ([resume (λ (l v) (match l [(nil) v] [(cons ms l) (match ms [(nil) (resume l v)] [(cons (cons k) nil) (k (w-c-m k (resume l v)))] [(cons (cons cms) nil) (restore-marks cms (λ () (w-c-m cms (resume l v))))] [(cons (cons k) (cons cms)) (w-c-m k (restore-marks cms (λ () (w-c-m cms (resume l v)))))] [(cons (cons cms) (cons k)) (w-c-m k (restore-marks cms (λ () (w-c-m cms (resume l v)))))])]))] [restore-marks (λ (cms thnk) (match cms [(nil) (thnk)] [(cons (cons m v) cms) (w-c-m m v (restore-marks cms thnk))]))] [c-w-i-c-m (λ (k proc default-v) . . . )] [map-set (λ (map k v) . . . )]) ...)
4.5 Defunctionalization We do not need to extend the defunctionalization defined by Pettyjohn et al. (2005) in any interesting way, but in our implementation we have extended it in the trivial way to keyed continuation marks and the numerous PLT λ forms.
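For readers unfamiliar with defunctionalization, a miniature, hypothetical illustration of the idea (not the generated code) is: each λ becomes a named record of its free variables, and application is routed through a dispatcher.
(define-struct adder (n)) ; stands for (λ (x) (+ x n)); the record holds only serializable data
(define (apply/d f . args)
  (cond [(adder? f) (+ (car args) (adder-n f))]
        [(procedure? f) (apply f args)]))
;; (apply/d (make-adder 5) 37) evaluates to 42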
5. Extensions Continuation marks, however, are not the only expressive features of PLT Scheme we aim to support. We discuss how to support fluid state, Web cells, advanced control flow, and continuation management in turn. 5.1 Parameters Web applications often use fluid state. State is fluid when it is bound only in a particular dynamic context. The continuation mark demonstration from the introduction (Figure 3) is an application of fluid state: when the user authenticates to CONTINUE, the "current user identity" is bound in the dynamic context of the display-site call. Every link presents the user with a new page using the same user identity. If another type of state were used, it would be more difficult to program or would prevent certain kinds of user behavior. For example, if the environment were used, then the user identity would need to be an argument to every function; if the store were used, then there would be only one user identity for all URLs associated with a session, thereby disallowing a "free" implementation of masquerading and delegation. PLT Scheme provides a mechanism for fluidly bound state: parameters. Parameters are effectively a short-hand for continuation mark operations. (parameterize p v e) wraps e in a continuation mark for the key associated with p bound to v. The parameter p can then be referenced inside e and will evaluate to whatever the closest mark is. Unfortunately, parameters are not implemented this way, because they also provide efficient lookup, thread-safety, and thread-local mutation. Instead, there is a single continuation mark key for all parameters. This key is not serializable, so our mark recording and reconstitution strategy fails. The key is included in the captured continuation structure but destroys its serializability. We compensate by providing an implementation of parameters using a distinct serializable key for each parameter. This way, Web servlets can effectively use fluid state, like parameters.
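A minimal sketch of this idea follows; it is not the Web Server's implementation, and the gensym below stands in for the distinct serializable key that the real system would generate per parameter.
(define (make-web-parameter default)
  (define key (gensym 'web-parameter)) ; serializable in the real system
  (define (ref)
    (define marks
      (continuation-mark-set->list (current-continuation-marks) key))
    (if (null? marks) default (car marks)))
  (values key ref))
;; Binding is just a continuation mark; lookup reads the closest mark.
(define-values (current-user-key current-user) (make-web-parameter 'anonymous))
(with-continuation-mark current-user-key 'admin
  (current-user)) ; evaluates to 'admin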
Figure 10. Necessary Definitions
steps that →TL takes to reduce a program to images of subexpressions/contexts of the original. Lemma 3 (Compositionality). CMT[Σ]/CMT[E][CMT[e]] →∗TL CMT[Σ]/CMT[E[e]] Sketch. CMT[] only introduces w-c-m into the context or abstracts the continuation of an argument to a function. These additional contexts are eventually erased as the argument is evaluated or the surrounding w-c-m is removed as a value is returned. Lemma 4 (Reconstitution). CMT[Σ]/(resume χ{,}(CMT[E]) CMT[v]) →+TL CMT[Σ]/CMT[E][CMT[v]]
5.2 Web Cells
Proof. We proceed by cases on the structure of E. Suppose E = [], then CMT[E] = [], so χ returns (nil) and resume returns CMT[v], which is equal to [][CMT[v]]. Suppose E = (w E′), then CMT[] expands to a mark that captures E as a function abstracted over E′ in the mark, which is restored by resume. E′ is preserved by induction. Suppose E = (w-c-m v v′ E′), then CMT[] expands to a mark that captures v and v′ in the mark, which is restored by resume. E′ is preserved by induction.
Sometimes fluid state (parameters), the environment (lexical variables), and the store (mutable structures) are all inappropriate for Web applications. A simple example is maintaining the preferred sort state of a list while the user is exploring the application, without forcing the user to have one sort state per session. If fluid state is used, then the entire application must be written as tail calls to ensure that the dynamic extent of sort state modifications is the “rest” of the session. This means the program must be written in CPS. If the lexical environment is used, then the sort state must be threaded throughout every part of the application, including those that are not related to the sorted list. This means the program must be written in store-passing style, an invasive and nonmodular global program transformation. If the store is used, then a single sort state will be used for all browsers displaying the same list. This means the user will not be
Lemma 5 (Substitution). CMT [e[x → v]] = CMT [e][x → CMT [v]] Sketch. The CMT [] transformation is applied to every subexpression in the transformed expression. Thus, the vs substituted in will eventually have CMT [] performed on them if x appears in e. If an
able to use the Web interactions provided by the browser, such as Open in New Window, to compare different sorts of the same list. Web cells (McCarthy and Krishnamurthi 2006) provide a kind of state appropriate for the sort state. Semantically, the set of Web cells is a store-like container that is part of the evaluation context; however, unlike the store, it is captured when continuations are captured and restored when they are invoked. Since continuation capture corresponds to Web interaction, this state is "fluid" over the Web interaction tree, rather than the dynamic call tree. It is easy to add support for Web cells in our system. We have a thread-local box that stores the cells for the execution of a continuation. Whenever a continuation is captured, these are saved. Our serializable continuation data structure contains (a) the continuation components, (b) the continuation marks, and (c) the Web cell record. Each of these is restored when the continuation is called.
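As a usage sketch of the sort-state example, assuming the Web cell operations exported by the PLT Web Server are make-web-cell, web-cell-ref, and web-cell-shadow:
(require web-server/servlet) ; assumed to provide the Web cell operations
;; One cell holds the preferred sort order; because cells are captured and
;; restored with continuations, two browser windows can hold different orders.
(define sort-order (make-web-cell 'ascending))
(define (sorted items)
  (sort items (if (eq? (web-cell-ref sort-order) 'ascending) < >)))
(define (flip-order!)
  ;; shadowing installs a new value for the rest of this interaction subtree
  (web-cell-shadow sort-order
                   (if (eq? (web-cell-ref sort-order) 'ascending)
                       'descending
                       'ascending)))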
This encoding generates one continuation per call to s/s/d (application-context) and one escape (“one-shot”) continuation per call to the embedding procedure (mk-page-context). We can do better with serializable continuations. Because everything is serializable and manipulable, we can implement s/s/d as (define (send/suspend/dispatch mk-page) (call-with-serializable-current-continuation (λ (application-context) (define (embed/url handler) (define application+handler-context (kont-cons handler application-context)) (kont→url application+handler-context)) (send/back (mk-page embed/url))))) Like before, this implementation first captures the continuation of s/s/d. embed/url accepts a procedure and returns a continuation serialized into a URL . This serialized continuation is the continuation of s/s/d with the procedure appended to the end. Since the components of the continuation are represented as a list, we can do this directly. However, the continuation components are stored in reverse order, so a logical append is a prepend on the representation. In the program,
5.3 Request Handlers send/suspend is the fundamental operator of the PLT Web Server. This function captures the current continuation, serializes it into a URL , and calls a display function with the URL . This works well for applications with a linear control-flow. However, most applications have many possible continuations (links) for each page, and therefore are difficult to encode with send/suspend. We can simulate this control-flow by dispatching on some extra data attached to a single continuation captured by send/suspend. This dispatching pattern is abstracted into send/suspend/dispatch (s/s/d) (Hopkins 2003). This function allows request handling procedures to be embedded as links; when clicked, the request is given to the procedure, and the procedure returns to s/s/d’s continuation. For example, consider the servlet
(f (g (h (s/s/d (λ (embed/url) (embed/url i)))))) application-context is (list h g f ) and application+handler-context is (list i h g f ). This captures only a single continuation regardless of how many handler procedures are embedded. This improves our time and space efficiency.
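Because the captured continuation is represented as a list of components stored newest-first, the kont-cons used in the serializable implementation can be sketched, under that simplifying assumption about the representation, as a plain prepend:
;; A continuation record is assumed to be a list of components, newest first,
;; so appending a handler to the logical end is a prepend on the representation.
(define (kont-cons handler k) (cons handler k))
;; (kont-cons i (list h g f)) evaluates to (list i h g f)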
(define (show message)
  (send/suspend/dispatch
   (λ (embed/url)
     `(html (h1 ,message)
            (a ([href ,(embed/url (λ (req) (show "One")))]) "One")
            (a ([href ,(embed/url (λ (req) (show "Two")))]) "Two")))))
5.4 Continuation Management In our system, continuations are serialized and embedded in the URLs given to the client by default. However, there are some pragmatic reasons why this is not always a good idea. First, there is, in principle, no limit to the size of a continuation. If the lexical environment contains the contents of a 100MB file, then the continuation will be at least 100MBs (modulo clever compression). Most Web clients and servers support URLs of arbitrary length, but some browsers and servers do not. In particular, Microsoft Internet Explorer (IE) limits URLs to 2,048 characters and Microsoft IIS limits them to 16,384 characters. Second, if a continuation is embedded in a URL and given to the user, then it is possible to manipulate the continuation in its serialized form. Thus, the environment and Web cell contents are not "secure" when handled directly by users. Providing security is not always appropriate, so we allow Web application developers to customize the kont→url function that is used to embed continuations in URLs with "stuffers." We provide a number of different stuffer algorithms and the ability to compose them. They compose because they produce and consume serializable values.
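A sketch of what such composable stuffers can look like follows; the names are illustrative rather than the Web Server's actual API. Each stuffer pairs an encoding direction with its inverse, and composition chains them in opposite orders.
(define-struct stuffer (in out))
(define (stuffer-compose s2 s1)
  (make-stuffer
   (λ (v) ((stuffer-in s2) ((stuffer-in s1) v)))    ; stuff: apply s1, then s2
   (λ (v) ((stuffer-out s1) ((stuffer-out s2) v))))) ; unstuff: undo s2, then s1
;; e.g., a GZip stuffer composed over a plain serializer, given such stuffers,
;; would compress the serialized bytes on the way out and inflate them on the way back in.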
This servlet generates a page with two links: one for each call to embed/url. When clients click on either, they are sent to an identical page that contains a header with the link text. s/s/d can either build a hash-table and perform dispatching with a single continuation, or it may be written (Krishnamurthi et al. 2007) succinctly as
(define (send/suspend/dispatch mk-page)
  (let/cc application-context
    (local [(define (embed/url handler)
              (let/ec mk-page-context
                (application-context
                 (handler (send/suspend mk-page-context)))))]
      (send/back (mk-page embed/url)))))
This encoding employs a clever use of continuations to embed the handler in the continuation captured by send/suspend. When embed/url is called, it captures the continuation of mk-page, mk-page-context, which is in the process of constructing a link. embed/url provides this link by giving the continuation mk-page-context to send/suspend, which calls its argument with a link to send/suspend's continuation. embed/url is arranged so that when send/suspend returns, its return value is given to the handler, whose return value is given to the caller of send/suspend/dispatch, via application-context.
Plain  The value is serialized with no special considerations.
GZip  The value is compressed with the GZip algorithm.
Sign  The value is signed with a parameterized algorithm.
Crypt  The value is encrypted with a parameterized algorithm.
Hash  The value is hashed with the MD5 or SHA1 algorithm. The value is serialized into a database addressed by the hash and the hash is embedded in the URL.
Len(s)  Stuffer s is used if the URL would be too long when stuffed with the value.
These techniques can be combined in many ways. For example, an application with small continuations and no need for secrecy could just use the Plain algorithm. An application that had larger continuations might add the GZip algorithm. An application that needed to protect against changes could add the Sign algorithm, while one that needed to guarantee the values could not be inspected might add Crypt. Finally, an application that did not want the expense in either bandwidth or computational time could just use the Hash algorithm. Every URL would be the same length, and identical continuations would be stored only once. Although the Hash method is not truly RESTful, it performs drastically better than the traditional method of storing the continuations in memory. It uses less space because the continuation representation is tailored to the particular application, in contrast to the standard C-stack copy. Furthermore, it takes less time to service requests. This might seem implausible since the operating system's virtual memory system seems morally equivalent to a continuation database because unused parts of memory are moved to disk. However, the VM considers memory unused only when it is not touched by the application. In PLT Scheme, the garbage collector stores a tag bit with objects. Thus, even though the collector doesn't need to walk all data, collection affects these tag bits, which causes the operating system to bring the entire page into main memory. This paging, which would not be present with a swap-sensitive garbage collector (Hertz et al. 2005), causes severe performance degradation. The Hash method has the additional advantage of providing multi-server scalability easily, compared to other possible server-side continuation stores. Since the Hash method guarantees that two writes to the same key must contain the same data, because otherwise the hashing algorithm would not be secure, multiple Web servers do not need to coordinate their access to the database of serialized continuations. Therefore, replication can be done lazily and efficiently, avoiding many of the problems that plague session object databases.
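A content-addressed store of the kind the Hash stuffer relies on can be sketched as follows; the in-memory hash table is a stand-in for the database mentioned above.
(require file/md5)
;; Serialized continuations are keyed by their digest, so writing the same
;; bytes twice is harmless and servers need not coordinate their writes.
(define store (make-hash))
(define (stuff-by-hash bs)
  (define key (bytes->string/utf-8 (md5 bs)))
  (hash-set! store key bs)
  key)
(define (unstuff-by-hash key)
  (hash-ref store key))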
5.4.2 Serialization Format Each continuation record is scarcely more than 100 bytes. This is split between Web cells, the continuation marks, and the continuation function components. The cells and marks are comparable to the lexical values captured in the continuation. Each function is serialized as a unique identifier that refers to the compiled code and the captured values of free variables. The continuation record has a list of these functions. A sanitized, sample record is below.
(serialized
 ((web-server/lang/abort-resume . web:kont)
  (web-server/lang/web-cells . web:frame)
  (application/servlet . web:300))
 (web:kont
  (web:frame
   (list (cons web:23-1 (path #"static-path" . unix))
         (cons web:10-0 (path #"db-path" . unix))
         (cons web:36-2 "username")))
  (list (vector (web:300) #f))))
This can be seen as a short program that constructs the serialized value. The first part records what modules contain the definitions of data-structures that are created. The module paths refer to code loaded into the PLT Scheme instance that is deserializing the continuation. If they are resolved to the wrong code, or if the modules are simply not available, then deserialization will fail. This means any PLT Scheme instance with access to the same source can deserialize this continuation. Our system protects against certain errors by including a hash of the module source in the names of continuation data structures. In the real record that this example corresponds to, the token 300 would include a hash of the source of application/servlet to result in a deserialization error if the code were changed, rather than the unsafe behavior that would result if a different kind of continuation were populated with erroneous data from this record. The second part is an expression that creates the continuation record. Its first field contains the record of the Web cells. This is an association list from identifying symbols to values. In this example, two of the values are paths, while the other is a string. The second field of the continuation is the continuation record. This is the list that will be passed to resume. In the example, there is a single function, web:300, with no accompanying continuation mark recording.
5.4.1 Replay Attacks Since the URLs of our application completely specify what the Web application will do in response to a request, it is natural to assume that our applications are particularly susceptible to replay attacks. For example, suppose we build a stock-trading application and at some point a user sells 10 shares. An adversary could capture the continuation for "sell 10 shares" and replay it n times to sell 10n shares, even with encryption in place. This seems utterly unacceptable. However, consider the same application on another platform where the continuation is specified through an ad-hoc combination of URL, form data, and cookies. In this case as well, a request may be replayed to perform this attack. On a traditional platform, this would be prevented by some server-side state. For example, each server response would include a unique identifier that would be sent back with requests; each identifier would be allowed to be received only once, and the identifier would be cryptographically tied to the incoming requests, so that new identifiers could not be used to "freshen" old requests to replay them. This same strategy can be implemented in our system as well, except perhaps better because the unique identifier can be combined with the entire continuation since it is explicitly represented, in one place, in our system. As before with the various stuffer algorithms, it is not always appropriate to disallow replays. For example, it is useful to use the browser's Refresh button and to send links to colleagues. If we provided replay protection "for free," we would also disallow many useful Web applications.
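A sketch of the nonce discipline described above, with hypothetical helper names, is:
;; The table of unspent nonces is the only server-side state; a nonce may be
;; redeemed at most once, so replaying a request with a spent nonce fails.
(define unspent-nonces (make-hash))
(define (issue-nonce!)
  (define n (number->string (random 4294967087)))
  (hash-set! unspent-nonces n #t)
  n)
(define (spend-nonce! n)
  (if (hash-ref unspent-nonces n #f)
      (hash-remove! unspent-nonces n)
      (error 'spend-nonce! "replayed or unknown nonce")))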
6. Evaluation The formal treatment of Section 4 can tell us if the transformation is correct and if it formally has the modularity properties we desire, but it cannot tell us if it is useful for producing scalable, RESTful, direct-style Web applications. 6.1 Scalability We observe that Web applications in our system written in direct style can be entirely RESTful. Their uses of the lexical environment, fluid state, and Web cells are all contained in serializable structures. These can then be stored by the client in encrypted and compressed URLs. Cookies can easily capture store state, and since nearly all data structures are serializable, any value can be stored in cookies. Finally, our programs may choose to use server state where appropriate. However, our system would not really be useful if it greatly slowed down the developer (with compilation lag) or the client (with execution lag), so we measure those. Compilation takes, on average, twice the amount of time as compiling normal Scheme. This measurement was based on compiling a collection of 15 servlets. This is because our compiler is implemented as a PLT module language (Flatt 2002) that performs
#lang scheme
(require web-server/servlet)
(define (get-number which)
  (string->number
   (extract-binding/single
    'number
    (send/suspend
     (λ (k-url)
       `(html (body (form ([action ,k-url])
                          ,which " number:"
                          (input ([name "number"]))))))))))
(define (start req)
  `(html (body ,(number->string (+ (get-number "First")
                                   (get-number "Second"))))))
#lang web-server                            ; ← different
                                            ; ← different
(define (get-number which)
  (string->number
   (extract-binding/single
    'number
    (send/suspend
     (λ (k-url)
       `(html (body (form ([action ,k-url])
                          ,which " Number:"
                          (input ([name "number"]))))))))))
(define (start req)
  `(html (body ,(number->string (+ (get-number "First")
                                   (get-number "Second"))))))
Figure 11. Add-Two-Numbers (Before)
Figure 12. Add-Two-Numbers (After)
five passes over the source code before it produces normal Scheme. These five passes cause a delay that is noticeable to developers but not prohibitive. There is no noticeable difference in the execution time of our servlets versus standard servlets, although it is possible to cause slowdown by serializing large objects. We tested scalability by comparing the space usage of a typical Web application before and after conversion to our system. Before, the LRU manager kept memory usage scarcely below 128MB. This pattern of "hugging the edge" of the limit matches our experience with CONTINUE (Krishnamurthi 2003). After conversion, the server uses about 12MB, of which approximately 10MB is the bytecode for the application and its libraries. We tested with multiple serialization regimes. When GZip is used, no continuation is larger than IE's limit, so there is no per-session state on the server. When we use Hash, the continuation store is about 24MB for approximately 150 users. This means that we use about 20 percent of the preconversion storage, while providing more service, because continuations are never revoked. But remember, we don't need to use that storage because the client can hold every continuation.
etc. If a program includes these data structures in the environment of serialized continuations, then the continuation is not serializable. In most cases this is not problematic, because these data structures are often defined at the global level or used during a computation but not between Web interactions. For example, it is much more common for the function + to be invoked during a computation than for the function + to be stored in a list that is in the environment of a continuation. Only the second prevents serialization. Since these practices are so uncommon, we have not found this constraint to prevent compatibility in practice. 6.2.2 Higher-order Third-Party Library Procedures Programs that use higher-order third-party library procedures cannot be used safely with our system. For example, (map get-number (list "First" "Second")) This does not work because the send/suspend inside of get-number relies on the mark to capture the continuation, but because map is not transformed, its part of the continuation is not recorded. We can detect this situation and signal a runtime error as described by Pettyjohn et al. (2005). However, it is always possible to recompile the necessary code (i.e., map) under our transformation.
6.2 Modularity & Compatibility The final way we evaluate our work is by its ability to run unmodified Scheme programs, in particular, Web applications. In most cases, there is no difficulty whatsoever; the user simply changes one line at the top of his or her program. Figure 11 presents a servlet and Figure 12 shows the same servlet using our transformation: the first line selects the compiler, and the second eliminates the unnecessary require specification. There are two categories of programs that lead to errors when transformed. The first category is programs that include nonserializable data structures in the environment of their captured continuations. The second category is programs that use higher-order library procedures with arguments that capture continuations.
7. Conclusion We presented a modular program transformation that produces RESTful implementations of direct style Web programs that use expressive features, like continuation marks. We have discussed how to extend this transformation into a deployable system. We have discussed the opportunities for continuation management this allows. We have evaluated the performance of our work and found that it meets the gold standard of scalability—no server-side session state—and can use as little as 10% of the memory when server-side state is desirable. This work relies on continuation marks, so it is difficult to apply it to programming languages other than PLT Scheme. However, practitioners could apply our technique easily once continuation marks were available. Since continuation marks can be implemented for both C# (Pettyjohn et al. 2005) and JavaScript (Clements et al. 2008), it should be possible to automatically produce RESTful Web applications in those languages as well.
6.2.1 Non-serialized Data Structures Our transformation implements continuations with closures and renders closures serializable through defunctionalization. However, other data structures remain unserializable: ports, foreign pointers, global and thread-local boxes, untransformed closures, parameters,
In the future, we will explore how to allow continuation capture in an untransformed context. We anticipate that the WASH approach (Thiemann 2006) of combining multiple continuation capture methods will be appropriate.
Paul Graham. Lisp for web-based applications, 2001. http://www.paulgraham.com/lwba.html. Matthew Hertz, Yi Feng, and Emery D. Berger. Garbage collection without paging. In Programming Language Design and Implementation, pages 143–153, 2005.
Acknowledgments We thank Matthew Flatt for his superlative work on PLT Scheme. We thank Greg Pettyjohn for his work on the prototype our system is based upon. We thank Matthias Felleisen, Matthew Flatt, Shriram Krishnamurthi, and the anonymous reviewers for their comments on this paper. This material is based upon work supported under a National Science Foundation Graduate Research Fellowship.
Peter Walton Hopkins. Enabling complex UI in Web applications with send/suspend/dispatch. In Scheme Workshop, 2003. John Hughes. Generalising monads to arrows. Science of Computer Programming, 37(1–3):67–111, May 2000. Shriram Krishnamurthi. The CONTINUE server. In Practical Aspects of Declarative Languages, January 2003.
References
Shriram Krishnamurthi, Peter Walton Hopkins, Jay McCarthy, Paul T. Graunke, Greg Pettyjohn, and Matthias Felleisen. Implementation and Use of the PLT Scheme Web Server. Higher-Order and Symbolic Computation, 2007.
John Clements, Matthew Flatt, and Matthias Felleisen. Modeling an algebraic stepper. In European Symposium on Programming, April 2001.
Jacob Matthews, Robert Bruce Findler, Paul T. Graunke, Shriram Krishnamurthi, and Matthias Felleisen. Automatically restructuring programs for the Web. Automated Software Engineering, 11(4):337–364, 2004.
John Clements, Ayswarya Sundaram, and David Herman. Implementing continuation marks in JavaScript. In Scheme and Functional Programming Workshop, 2008. Ezra Cooper, Sam Lindley, Philip Wadler, and Jeremy Yallop. Links: Web programming without tiers. In Formal Methods for Components and Objects, 2006.
Jay McCarthy and Shriram Krishnamurthi. Interaction-safe state for the Web. In Scheme and Functional Programming, September 2006.
Stéphane Ducasse, Adrian Lienhard, and Lukas Renggli. Seaside - a multiple control flow web application framework. In European Smalltalk User Group - Research Track, 2004.
Greg Pettyjohn, John Clements, Joe Marshall, Shriram Krishnamurthi, and Matthias Felleisen. Continuations from generalized stack inspection. In International Conference on Functional Programming, September 2005.
Matthias Felleisen and Robert Hieb. The revised report on the syntactic theories of sequential control and state. Theoretical Computer Science, 102:235–271, 1992.
Gordon D. Plotkin. Call-by-name, call-by-value, and the λ-calculus. Theoretical Computer Science, 1975.
Roy T. Fielding and Richard N. Taylor. Principled design of the modern web architecture. ACM Transactions on Internet Technology, 2(2):115–150, 2002.
Christian Queinnec. The influence of browsers on evaluators or, continuations to program web servers. In International Conference on Functional Programming, pages 23–33, 2000.
M. J. Fischer. Lambda calculus schemata. ACM SIGPLAN Notices, 7(1):104–109, 1972. In the ACM Conference on Proving Assertions about Programs.
Peter Thiemann. Wash server pages. Functional and Logic Programming, 2006. Noel Welsh and David Gurnell. Experience report: Scheme in commercial web application development. In International Conference on Functional Programming, September 2007.
Cormac Flanagan, Amr Sabry, Bruce F. Duba, and Matthias Felleisen. The essence of compiling with continuations. SIGPLAN Notices, 39(4):502–514, 2004. Matthew Flatt. Composable and compilable macros. In International Conference on Functional Programming, 2002.
Experience Report: Ocsigen, a Web Programming Framework Vincent Balat
Jérôme Vouillon
Boris Yakobowski
Laboratoire Preuves, Programmes et Systèmes Université Paris Diderot (Paris 7), CNRS Paris, France {vincent.balat, jerome.vouillon, boris.yakobowski}@pps.jussieu.fr
Abstract
Web programming is highly constrained by technology (protocols, standards and browser implementations). Commonly used Web programming tools remain very close to this technology. We believe that Web development would benefit a lot from higher-level paradigms and that a more semantical approach would result in huge gains in expressiveness. It is now widely known in our community that functional programming is a really elegant solution to some important Web interaction problems, as it offers a solution to the statelessness of the HTTP protocol (Queinnec 2000; Graham 2001; Hughes 2000). But this wisdom has not spread in the Web programming community. Almost no major Web framework is taking advantage of it.1 We believe the reason is that functional programming has never been fully exploited and that one must be very careful about the way it is integrated in a complete framework in order to match precisely the needs of Web developers. The Ocsigen project is trying to find global solutions to these needs. In this paper, we present our experience in designing Ocsigen, a general framework for Web programming in Objective Caml (Leroy et al. 2008). It provides a full-featured Web server and a framework for programming Web applications, with the aim of improving expressiveness and safety. This is done by taking advantage of functional programming and static typing as much as possible (Balat 2006). This paper is a wide and quick overview of our experience regarding this implementation. In section 2, we describe our use of functional programming. In section 3, we show how some very strong correctness properties can be encoded using Ocaml's type system. Finally, in section 4, we describe the implementation of a concrete Web application using our solutions.
The evolution of Web sites towards very dynamic applications makes it necessary to reconsider current Web programming technologies. We believe that Web development would benefit greatly from more abstract paradigms and that a more semantical approach would result in huge gains in expressiveness. In particular, functional programming provides a really elegant solution to some important Web interaction problems, but few frameworks take advantage of it. The Ocsigen project is an attempt to provide global solutions to these needs. We present our experience in designing this general framework for Web programming, written in Objective Caml. It provides a fully featured Web server and a framework for programming Web applications, with the aim of improving expressiveness and safety. This is done by taking advantage of functional programming and static typing as much as possible. Categories and Subject Descriptors D.1.1 [PROGRAMMING TECHNIQUES]: Applicative (Functional) Programming; H.3.5 [INFORMATION STORAGE AND RETRIEVAL]: Online Information Services—Web-based services General Terms
Design, Languages, Reliability, Security
Keywords Ocsigen, Web, Networking, Programming, Implementation, Objective Caml, ML, Services, Typing, Xhtml
1.
Introduction
In the last few years, the Web has evolved from a data-centric platform into a much more dynamic one. We tend now to speak more and more of Web application, rather than Web sites, which hints that the interaction between the user and the server is becoming much more complex than it used to be. What is striking is that this evolution has not been induced, nor even followed, by a corresponding evolution of the underlying technology. The RFC specifying the version of the HTTP protocol currently in use dates back to 1999 and current HTML looks very much like the one we were using ten years ago. The main change is probably the increasing use of JavaScript, mainly due to implementation improvements, which made possible the advent of a new kind of Web applications.
2.
Functional Programming for the Web
The Ocsigen project provides a Web server written in Objective Caml. This server offers all the features one would expect from a general purpose Web server, starting with a comprehensive support of the HTTP 1.1 protocol (Fielding et al. 1999) (including range requests and pipelining). Data compression, access control and authentication are all supported. The server is configured through flexible XML-based configuration files. The server is designed in a modular way. It can therefore be extended very easily just by writing new modules in Ocaml. Among the modules currently available are a module for running CGI scripts, a reverse proxy (which makes it easy to use Ocsigen with another Web server), a filter to compress contents, etc. In the remainder of this section, we highlight the concurrency model used for the Web server implementation, and then our main
1 A notable exception being Seaside (Ducasse et al. 2004).
extension to the server, that is, Eliom, a framework for writing Web applications in Ocaml.
let srv = register_new_service ~path:["bar"] ~get_params:unit f
2.1 Cooperative Threads
The first argument (labelled ~path), corresponds to the path in the URL to which the service will be bound. The second argument (labelled ~get_params) describes URL parameters (here none, as this service does not expect any parameter). The function f is used to produce the corresponding page. Inside Eliom, one can generate an anchor linking to this service by applying the HTML anchor constructor a to the service srv, the current service context sp (which contains in particular the current URL, and is used to construct relative links), the text of the anchor anchor_contents and the unit value () corresponding to the absence of parameters.
A Web server is inherently a highly concurrent application. It must be able to handle simultaneously a large number of requests. Furthermore, composing a page may take some time, for instance when several database queries are involved. The server should not be stalled in the meantime. We have chosen to use cooperative multithreading to address this issue. Indeed, cooperative threads make it possible to write multi-threaded code while avoiding most race conditions, as context switches only occur at well-specified points. Between these points, all operations are executed atomically. In particular, it is easy to use safely shared mutable data structures, such as hash tables, without using any lock. We use the Lwt thread library (Vouillon 2008), which provides a monadic API for threads. With this library, a function creating a Web page asynchronously will typically have type:
a srv sp anchor_contents () Note that the service URL can be changed just by modifying the path at a single location in the source code, and all links will remain correct as they are computed automatically.
unit → html Lwt.t. It returns immediately a promise (sometimes also called future) of type html Lwt.t, that is, a value that acts as a proxy for the value of type html eventually computed by the function. Promises are a monad. The return operator of the monad has type:
Several services can share the same URL, for instance when they expect different parameters. This means that a service will not respond if some of its arguments are missing or ill-typed, avoiding a whole class of hard-to-detect errors.2 More generally, a full range of service kinds is provided, allowing the developer to describe precisely how services are attached to URLs. This makes it possible to describe very flexible and precise Web interactions in just a few lines of code. A service can be associated to a path (and possibly parameters), to the value of a special parameter, or to both of them. The choice of the right service to invoke is performed automatically by Eliom. Services can be dynamically created in response to previous interactions with the user. Their behavior may depend for instance on the contents of previous forms submitted by the user or the result of previous computations. This is implemented by recording the behavior associated to the service as a function closure in a table on the server. This is an instance of continuation-based Web programming (Queinnec 2000; Hughes 2000; Graham 2001), which is known to be a really clean solution to the so-called back button problem, but is provided by very few Web frameworks. It is also possible to classify services with respect to the type of the value they return. Usually services return an HTML page, but it is also possible to build services sending for example files, redirections, or even no content at all. The latter are called actions, as they are used to perform an effect on the server (for example a database change). Eliom also provides a kind of action that will redisplay the page automatically after having performed the effect. It is also possible to write services that will choose dynamically the kind of output they want.
’a → ’a Lwt.t. It creates an immediately fulfilled promise. The bind operator has type: ’a Lwt.t → (’a → ’b Lwt.t) → ’b Lwt.t. It takes as arguments a promise and a function to be applied to the value of the promise when it becomes available. The promise returned by the operator gives access to the value eventually computed by the function. This operator can be used for sequencing asynchronous operations, as well as for synchronization (for waiting for the completion of some operation before performing further operations). In order to make it possible to use third-party non-cooperative libraries, Lwt also allows to detach some computations to preemptive threads. 2.2
Web Programming
Eliom (Balat 2007) is the most innovative part of the project. This Web server extension provides a high-level API for programming dynamic Web sites with Ocaml. Its design goals are twofold: to propose a new Web programming paradigm based on semantic concepts rather than relying on technical details, and to ensure the quality of Web application by using static typing as much as possible (this latter point is detailed in section 3). The main principle on which Eliom is based is the use of firstclass values for representing the services provided by the Web server. What we call a service is a proxy for a function which can be called remotely from a browser to generate a page or perform an action. Eliom keeps track of the mapping from URLs to services: instead of having one script or executable associated to each URL, like many traditional Web programming tools, Eliom’s services are programmatically associated to URLs. This lets the programmer organize the code in the most convenient way. In particular, it makes it easy to share behaviors between several pages, as the same service can be associated to several URLs. As an example, the following piece of code creates a service srv at URL http://foo/bar (on some server foo).
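As an illustration of the service srv and the link discussed above, here is a minimal sketch written against the register_new_service and a combinators used later in this paper. It is an assumption-laden example, not the paper's original listing: the page body, the name of the no-parameter specifier and the exact combinator arities are guesses.

  (* Hedged sketch: a parameterless service bound to path /bar, and a link to it.
     The page-building code is illustrative only. *)
  let srv =
    register_new_service
      ~path:["bar"]
      ~get_params:unit
      (fun sp () () ->
         Lwt.return
           (html (head (title (pcdata "Bar")) [])
                 (body [p [pcdata "Hello from srv"]])))

  (* Building an anchor to srv from within another page: *)
  let my_link sp = a srv sp (pcdata "go to bar") ()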
3. Typing a Web Application

3.1 XML Typing
Historically, browsers have treated HTML errors leniently. As a result, Web pages are often written in loosely standardized HTML dialects (so-called tag soups). However, the interpretation of malformed markup can vary markedly from one browser to the next. Ensuring that Web pages follow precisely existing specifications makes it more likely that they will be interpreted similarly by all Web browsers.

Our framework ensures statically that all Web pages served by the server are well-formed and valid with respect to W3C recommendations. Two ways are provided to the developer to this end; the functor-based API of Eliom makes it possible to support both. The first way is to use the XHTML module developed by Thorsten Ohl.³ This library provides combinators to write HTML pages. HTML element types are encoded using phantom types (Leijen and Meijer 1999) and polymorphic variants. The covariant abstract type of elements is type ’a elt. For instance, the combinator p takes as argument a list of elements which are either text or inline elements, optionally some common attributes such as class or id, and returns a paragraph element:

  val p : ?a:([< common ] attrib list) ->
          [< inline | `PCDATA ] elt list -> [> `P] elt

Here is a piece of code that builds a simple page, given a list of elements the_page_contents:

  html (head (title (pcdata "Hello world!")) [])
       (body (h1 [pcdata "Hello world"] :: the_page_contents))

By using a syntax extension based on the Camlp4 preprocessor, HTML fragments can be directly incorporated into Ocaml source code. The fragments are translated into Ocaml code that relies on the library above.

³ http://physik.uni-wuerzburg.de/~ohl/xhtml/

The second way of writing valid Web pages is to use OcamlDuce (Frisch 2006), which brings together Ocaml and the CDuce language (Benzaken et al. 2003). The latter is specifically designed for XML, and makes it possible to manipulate XML documents with very precise (in fact, exact) typing. For instance, the page above is written as follows:

  {{ <html>[ <head>[ <title>"Hello world!" ]
             <body>[ <h1>"Hello world" !{:the_page_contents:} ] ] }}

Unlike the XHTML library, which is specific to XHTML documents, OcamlDuce can be used to create any kind of XML document, for instance Atom feeds (Nottingham and Sayre 2005). The only drawback is that OcamlDuce is incompatible with Camlp4 at the moment, requiring somewhat complicated compilation schemes when OcamlDuce files are mixed with Ocaml files that use syntax extensions.

3.2 Typing Web Interactions

Eliom’s first-class notion of service makes it possible to check the validity of links and forms. A service is represented as an abstract data structure containing all the information about its kind, its URL, its parameters, etc. As we saw in section 2.2, links are built automatically by a function taking as parameter a service, rather than a URL. This makes broken links impossible! The types and names of service parameters are declared when constructing a service. Here is an example of a service with two parameters year and reverseorder, of types int and bool respectively:

  let events_info = register_new_service
    ~path:["events"]
    ~get_params:(int "year" ** bool "reverseorder")
    (fun sp (year, reverseorder) () -> ...)

The third argument is the function implementing the service. It takes three parameters: sp corresponds to the current request context (it contains all information about the request, like the IP address of the client, its user-agent, etc.). The second one is for URL parameters (GET) and the third one for the body of the HTTP request (POST parameters, here none). When a request is received, the actual arguments are automatically type-checked and converted from string to the right ML datatype by the server. Note that the type of the function implementing a service depends on the value of the second parameter: here, it expects a pair of type int * bool. This is not easy to implement in Ocaml. We have considered two solutions to this problem. The first one, used before version 0.4.0 of Ocsigen, was to rely on functional unparsing (Danvy 1998). The current solution consists in a simulation of generalized algebraic datatypes (GADTs) (Xi et al. 2003; Pottier and Régis-Gianas 2006) implemented using unsafe features of Ocaml, anticipating their future introduction in the language. In the example above, int, bool and ** are the GADT constructor functions, and the strings "year" and "reverseorder" are HTTP parameter names.

Parameters are statically checked when building a link with parameters. Concretely, the function a, which builds a link, takes as its last parameter the arguments to be given to the service. These arguments will be encoded in the URL. Again, the type of this parameter depends on the service:

  a events_info sp (pcdata "Last year seminars") (2008, true)

When generating a link, service parameter names are taken from the abstract structure representing the service. This ensures that they are always correct and makes it possible to change them without needing to update all links. Here, the generated relative link is:

  events?year=2008&reverseorder=on

Eliom also provides some static guarantees that a form corresponds to its target service. As with link functions, the function that creates an HTML form takes as parameters the service and the information sp about the request. But instead of being directly given the contents of the form, it expects a function that will build the contents of the form. This function takes as parameters the names of the different fields of the form. The following example shows how to create a form to our events_info service:

  get_form events_info sp make_form

Here is an excerpt of the definition of the make_form function:

  let make_form (year_name, reverseorder_name) =
    ...
    int_input ~input_type:`Text ~name:year_name ();
    bool_checkbox ~name:reverseorder_name ();
    ...

The functions int_input and bool_checkbox are used to create respectively an “input” form widget and a checkbox. To ensure correct typing of the fields, we use an abstract parametric type ’a param_name for names instead of simply type string. The parameter of this type is a phantom type corresponding to the type of the service parameter. Each function generating form widgets uses the appropriate type for the name of the parameter it corresponds to. For instance, the ~name parameter of the function int_input above has type int param_name, whereas it has type bool param_name for the function bool_checkbox. This ensures that the field names correspond exactly to those expected by the service, and that their types are correct. But there is no guarantee that all required parameters are present, nor that the same parameter is not used several times in the form. Indeed, this would require a very sophisticated type system for forms, which would also need to interact gracefully with the type system for HTML. Rather than trying to write a hazardous and complex extension to Ocaml’s type system, we decided to relax somewhat the static checks for forms.
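The phantom-type trick behind ’a param_name can be illustrated with a small, self-contained module. This is a simplified stand-in written for this exposition, not Eliom's actual definitions; the constructor names are invented for the example.

  (* Simplified stand-in: the phantom parameter records the OCaml type of the
     service parameter that a field name refers to. *)
  module Name : sig
    type 'a param_name
    val int_name  : string -> int param_name
    val bool_name : string -> bool param_name
    val to_string : 'a param_name -> string
  end = struct
    type 'a param_name = string
    let int_name s = s
    let bool_name s = s
    let to_string s = s
  end

  (* Widgets only accept a name of the matching type: *)
  let int_input (name : int Name.param_name) =
    Printf.sprintf "<input type=\"text\" name=%S/>" (Name.to_string name)

  let bool_checkbox (name : bool Name.param_name) =
    Printf.sprintf "<input type=\"checkbox\" name=%S/>" (Name.to_string name)

Passing a bool-typed name to int_input is then rejected by the type checker, which is exactly the guarantee described above.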
3.3 Typing Database Accesses

Database accesses are crucial for Web programming. We have found it very convenient to use PG’OCaml⁴, an interface to PostgreSQL⁵ written by Richard Jones. In particular, Ocsimore, our content management system (see section 4), relies on it. We actually included some changes in PG’OCaml to turn its implementation into monadic style, in order to make it usable with Lwt. Thus, even though we do not use preemptive threads, queries still do not block the Web server. The most noteworthy feature of PG’OCaml is that SQL statements are statically typed, with type inference. Another key point is that it is immune to SQL code injection vulnerabilities, as queries are automatically compiled into SQL prepared statements (which are pre-compiled, well-formed queries that can receive arguments). Static typing relies on the ‘DESCRIBE statement’ command provided by recent versions of PostgreSQL. This command returns the types of the placeholders and return columns of the given statement. At compile time, SQL statements inside the Ocaml code are thus fed into the PostgreSQL frontend by a Camlp4-based preprocessor, which answers with their types. Types are then converted back into Ocaml types and used to generate the appropriate code. As an example, the following code

  fun db wiki ->
    PGSQL(db) "SELECT id, contents FROM wikiboxes WHERE wiki_id = $wiki"

defines an Ocaml function of type db → wiki_id → (wikibox_id × string option) list. The field contents is an SQL text field which can be NULL. It is thus mapped to the Ocaml type string option. The fact that queries are typed proved extremely useful, as it helps to find out rapidly which queries have to be modified whenever the database structure is changed during program development. We see a few potential improvements to PG’OCaml. First, the code passed to PG’OCaml must be a valid SQL query. Hence, it is not possible to write queries as subblocks that are concatenated together, as is often done. It would be interesting to incorporate a comprehension-based query language (Trinder 1992) into Objective Caml. Second, SQL fields are often mapped to the same Ocaml type. In the example above, we have in fact wiki_id = wikibox_id = int32. Making wiki_id and wikibox_id abstract types in the Ocaml code results in a lot of (explicit) conversions, including problematic ones inside containers. We would like to use explicit coercions (id :> int32) and (int32 :> id) instead, but this is not currently possible in Ocaml.

⁴ http://developer.berlios.de/projects/pgocaml/
⁵ http://www.postgresql.org/

4. An Application: Writing a Wiki with Ocsigen

4.1 An Overview of Ocsimore

We have started the development of Ocsimore, a content management system written using Eliom. At the moment, it mostly consists of a wiki, with advanced management of users and groups. Currently in final beta state, it is already used to publish the PPS laboratory website (http://www.pps.jussieu.fr/). At the heart of Ocsimore is the notion of box. Wiki pages are composed by putting together or nesting boxes. This provides strong possibilities for modularity and code reuse: HTML code can be shared between several pages simply by putting it in a common box. Moreover, a box can be a container, that is, it can contain a hole that is to be filled by the contents provided by an outer box. In the example below, the first box has a hole (named by convention) and is included in the second box.

  Box 1: The text at the end is in bold: ****.
  Box 2: Let us call 1:

The second box is thus displayed as: “Let us call 1: The text at the end is in bold: In bold.” The default wiki syntax of Ocsimore is Wikicreole (Sauer et al. 2007). It is translated into well-formed XHTML using OcamlDuce. Ocsimore features an authentication mechanism, for either PAM, NIS or Ocsimore-specific users. Wikis and wiki pages can be assigned read, write and administration rights. The permissions system is very general, with the possibility to put users into groups, which can themselves be nested. Moreover, Ocsimore features a notion of parameterized groups. For example, the groups WikiWriters and WikiReaders are parameterized by the id of a wiki, and WikiWriters(3) is the group of the users that can write in the third wiki. One can also add generic inclusions between parameterized groups with similar arguments, and WikiWriters(i) is included in WikiReaders(i) for any i.

4.2 General Structure of the Code

Ocsimore is written using the object system of Ocaml in order to define modifiable and extensible widgets in a modular way. This makes it easy to add extensions without any modification to the core system. For instance, a comment system (forum, blogs, news) is currently being implemented. The wiki is also extensible: syntax extensions can be implemented and then included inside wiki boxes. Ocsimore makes use of Eliom’s most advanced features. For instance, it takes advantage of non-attached services, i.e. services that are not associated to any path (they are implemented using special URL parameters). These services are used for implementing a connection widget on each page of the site in just a few lines of code. Indeed, we do not have to consider that each page may optionally take credential information as parameters. Instead, the special connection service just performs the action of opening a session and then triggers the redisplay of the current page after logging in. The same kind of service is also used for editing wiki boxes. As the central notion in our wiki is not the notion of page but the notion of box (where a box can be included in several pages), it is important to keep the information of the current page while editing a box. This behavior is really easy to implement using Eliom, and is a good example of the simplification of the code induced by Eliom’s high-level features. It is noteworthy that making the implementation of common complex Web interactions so easy has an impact on the ergonomics of Web sites. The example of the connection box is eloquent: in many sites, lazy PHP programmers prefer having the connection box only in the main page of the site, rather than duplicating the code for each page.
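Tying these pieces together, here is a small hedged sketch of how a PG'OCaml query (monadic, as explained in section 3.3) can be sequenced with Lwt inside a page-producing function. The PGSQL macro is the one shown above; render is a hypothetical page builder, not part of the libraries discussed.

  (* Hedged sketch: sequencing a PG'OCaml query with Lwt's bind. *)
  let wikibox_contents db wiki =
    PGSQL(db) "SELECT id, contents FROM wikiboxes WHERE wiki_id = $wiki"

  let page db wiki =
    Lwt.bind (wikibox_contents db wiki) (fun rows ->
      Lwt.return (render rows))   (* render: hypothetical page builder *)

Because the query returns a promise, the Web server is never blocked while PostgreSQL is working, in line with the cooperative-threading design described in section 2.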
5. Conclusion

A few other projects also take advantage of functional programming for the Web. The two most closely related are Links (Cooper et al. 2006) and Hop (Serrano et al. 2006). A few other tools have also implemented continuation-based Web programming: Seaside (Ducasse et al. 2004), Wash/CGI (Thiemann 2002) and PLT Scheme (Krishnamurthi et al. 2007). Eliom departs from all these other projects in that it is based on Ocaml and proposes a very rich set of services. The wide overview of the Ocsigen project we have given in this paper demonstrates that building a Web programming framework is complex, as many very different issues have to be addressed. Continuation-based Web programming is a key notion of the system but is not sufficient in itself. It needs to be integrated within a full environment. We believe that, far beyond this particular point, functional programming is the ideal setting for defining a more high-level way of programming the Web. It allows the programmer to concentrate on semantics rather than on implementation details. Note that this abstraction from low-level technologies does not entail any limitation but offers a huge step forward in expressiveness. One of the main concerns of our project has been to improve the reliability of Web applications using static typing, which is the opposite of traditional Web programming, based on scripting languages. We think this evolution is necessary because of the growing complexity of Web applications. Our experience in writing applications with Eliom and in implementing the whole system itself shows that relying heavily on sophisticated features of the type system greatly simplifies the maintenance and evolution of large pieces of software. For the whole project, we made the choice of using the Ocaml language rather than defining a new one. This makes it possible to take full advantage of the large set of available Ocaml libraries. We were surprised to be able to encode most of the properties we wanted using Ocaml’s type system. Very few things are missing (a better typing of forms is one of them). Up to now, we have concentrated mainly on server-side programming. We intend to extend this work to other aspects of Web programming, namely database interaction and client-side programming. This last point is really challenging, as it is not obvious how to build a Web site where some parts of the code run on the server and other parts on the client, with strong safety guarantees. Our first experiment in that direction has been the implementation of a virtual machine for Ocaml in Javascript (Canou et al. 2008). Currently, the Ocsigen implementation is mature enough to be used for developing and operating real Web sites. Ocsigen is an open source project with a growing community of users, who have already developed significant Eliom-based applications, and who were a great help in building a strong and usable tool. We thank them all.

References

Vincent Balat. Ocsigen: Typing Web interaction with Objective Caml. In International Workshop on ML, pages 84–94. ACM Press, 2006. ISBN 1-59593-483-9. doi: http://doi.acm.org/10.1145/1159876.1159889.
Vincent Balat. Eliom programmer’s guide. Technical report, Laboratoire PPS, CNRS, université Paris-Diderot, 2007. URL http://ocsigen.org/eliom.
Véronique Benzaken, Giuseppe Castagna, and Alain Frisch. CDuce: An XML-centric general-purpose language. In ACM SIGPLAN International Conference on Functional Programming (ICFP), Uppsala, Sweden, pages 51–63, 2003. ISBN 1-58113-756-7.
Benjamin Canou, Vincent Balat, and Emmanuel Chailloux. O’Browser: Objective Caml on browsers. In ML ’08: Proceedings of the 2008 ACM SIGPLAN Workshop on ML, pages 69–78, New York, NY, USA, 2008. ACM. ISBN 978-1-60558-062-3. doi: http://doi.acm.org/10.1145/1411304.1411315.
Ezra Cooper, Sam Lindley, Philip Wadler, and Jeremy Yallop. Links: Web programming without tiers. In FMCO 2006, 2006.
Olivier Danvy. Functional unparsing. Journal of Functional Programming, 8(6):621–625, 1998.
Stéphane Ducasse, Adrian Lienhard, and Lukas Renggli. Seaside – a multiple control flow web application framework. In Proceedings of ESUG Research Track 2004, pages 231–257, 2004.
R. Fielding, J. Gettys, J. Mogul, H. Frystyk, L. Masinter, P. Leach, and T. Berners-Lee. Hypertext Transfer Protocol – HTTP/1.1, 1999. URL http://www.ietf.org/rfc/rfc2616.txt.
Alain Frisch. OCaml + XDuce. In International Conference on Functional Programming (ICFP), pages 192–200, New York, NY, USA, 2006. ACM. doi: http://doi.acm.org/10.1145/1160074.1159829.
Paul Graham. Beating the averages, 2001. URL http://www.paulgraham.com/avg.html.
John Hughes. Generalising monads to arrows. Science of Computer Programming, 37(1–3):67–111, 2000.
Shriram Krishnamurthi, Peter Walton Hopkins, Jay McCarthy, Paul T. Graunke, Greg Pettyjohn, and Matthias Felleisen. Implementation and use of the PLT Scheme Web server. In Higher-Order and Symbolic Computation, 2007.
Daan Leijen and Erik Meijer. Domain specific embedded compilers. In Domain-Specific Languages, pages 109–122, 1999. URL citeseer.ist.psu.edu/leijen99domain.html.
Xavier Leroy, Damien Doligez, Jacques Garrigue, Jérôme Vouillon, and Didier Rémy. The Objective Caml system. Software and documentation available on the Web, 2008. URL http://caml.inria.fr/.
Mark Nottingham and Robert Sayre. The Atom Syndication Format. RFC 4287, December 2005.
François Pottier and Yann Régis-Gianas. Stratified type inference for generalized algebraic data types. In Proceedings of the 33rd ACM Symposium on Principles of Programming Languages (POPL’06), pages 232–244, Charleston, South Carolina, January 2006.
Christian Queinnec. The influence of browsers on evaluators or, continuations to program web servers. In International Conference on Functional Programming (ICFP), pages 23–33, Montreal, Canada, September 2000.
Christoph Sauer, Chuck Smith, and Tomas Benz. Wikicreole: a common wiki markup. In WikiSym ’07: Proceedings of the 2007 International Symposium on Wikis, pages 131–142, New York, NY, USA, 2007. ACM. ISBN 978-1-59593-861-9. doi: http://doi.acm.org/10.1145/1296951.1296966.
Manuel Serrano, Erick Gallesio, and Florian Loitsch. Hop, a language for programming the Web 2.0. In Dynamic Languages Symposium, October 2006.
Peter Thiemann. Wash/CGI: Server-side Web scripting with sessions and typed, compositional forms. In PADL ’02, January 2002.
Phil Trinder. Comprehensions, a query notation for DBPLs. In DBPL3: Proceedings of the Third International Workshop on Database Programming Languages: Bulk Types & Persistent Data, pages 55–68, San Francisco, CA, USA, 1992. Morgan Kaufmann Publishers Inc. ISBN 1-55860-242-9.
Jérôme Vouillon. Lwt: a cooperative thread library. In ML ’08: Proceedings of the 2008 ACM SIGPLAN Workshop on ML, pages 3–12, New York, NY, USA, 2008. ACM. ISBN 978-1-60558-062-3. doi: http://doi.acm.org/10.1145/1411304.1411307.
Hongwei Xi, Chiyan Chen, and Gang Chen. Guarded recursive datatype constructors. In Proceedings of the 30th ACM SIGPLAN Symposium on Principles of Programming Languages, pages 224–235, New Orleans, January 2003.
Implementing First-Class Polymorphic Delimited Continuations by a Type-Directed Selective CPS-Transform

Tiark Rompf
Ingo Maier
Martin Odersky
Programming Methods Laboratory (LAMP)
École Polytechnique Fédérale de Lausanne (EPFL)
1015 Lausanne, Switzerland
{firstname.lastname}@epfl.ch
Abstract
do not embody the entire control stack but just stack fragments, so they can be used to recombine stack fragments in interesting and possibly complicated ways. To access and manipulate delimited continuations in directstyle programs, a number of control operators have been proposed, which can be broadly classified as static or dynamic, according to whether the extent of the continuations they capture is determined statically or not. The dynamic variant is due to Felleisen (1988); Felleisen et al. (1988) and the static variant to Danvy and Filinski (1990, 1992). The static variant has a direct, corresponding CPSformulation which makes it attractive for an implementation using a static code transformation and thus, this is the variant underlying the implementation described in this paper. We will not go into the details of other variants here, but refer to the literature instead (Dyvbig et al. 2007; Shan 2004; Biernacki et al. 2006); suffice it to note that the two main variants, at least in an untyped setting, are equally expressive and have been shown to be macro-expressible (Felleisen 1991) by each other (Shan 2004; Kiselyov 2005). Applying the type systems of Asai and Kameyama (2007); Kameyama and Yonezawa (2008), however, renders the dynamic control operators strictly more expressive since strong normalization holds only for the static variant (Kameyama and Yonezawa 2008). In Danvy and Filinski’s model, there are two primitive operations, shift and reset. With shift, one can access the current continuation and with reset, one can demarcate the boundary up to which continuations reach: A shift will capture the control context up to, but not including, the nearest dynamically enclosing reset (Biernacki et al. 2006; Shan 2007). Despite their undisputed expressive power, continuations (and in particular delimited ones) have not yet found their way into the majority of programming languages. Full continuations are standard language constructs in Scheme and popular ML dialects, but most other languages do not support them natively. This is partly because efficient support for continuations is assumed to require special provisions from the runtime system (Clinger et al. 1999), like the ability to capture and restore the run-time stack, which are not available in all environments. In particular, popular VM’s such as the JVM or the .NET CLR do not provide this lowlevel access to the run-time stack. One way to overcome these limitations is to simulate stack inspection with exception handlers and/or external data structures (Pettyjohn et al. 2005; Srinivasan 2006). Another avenue is to use monads instead of continuations to express custom-defined control flow. Syntactic restrictions imposed by monadic style can be overcome by supporting more language constructs in the monadic level, as is done in F#’s workflow expressions. Nevertheless, the fact remains that monads or workflows impose a certain duplication of syntax constructs that need to be
We describe the implementation of first-class polymorphic delimited continuations in the programming language Scala. We use Scala’s pluggable typing architecture to implement a simple type and effect system, which discriminates expressions with control effects from those without and accurately tracks answer type modification incurred by control effects. To tackle the problem of implementing first-class continuations under the adverse conditions brought upon by the Java VM, we employ a selective CPS transform, which is driven entirely by effect-annotated types and leaves pure code in direct style. Benchmarks indicate that this high-level approach performs competitively. Categories and Subject Descriptors D.3.3 [Programming Languages]: Language Constructs and Features—Control structures General Terms
Languages, Theory
Keywords Delimited continuations, selective CPS transform, control effects, program transformation
1.
Introduction
Continuations, and in particular delimited continuations, are a versatile programming tool. Most notably, we are interested in their ability to suspend and resume sequential code paths in a controlled way without syntactic overhead and without being tied to VM threads. Classical (or full) continuations can be seen as a functional version of the infamous GOTO-statement (Strachey and Wadsworth 2000). Delimited (or partial, or composable) continuations are more like regular functions and less like GOTOs. They do not embody the entire rest of the computation, but just a partial rest, up to a programmer-defined outer bound. Unlike their undelimited counterparts, delimited continuations will actually return control to the caller after they are invoked, and they may also return values. This means that delimited continuations can be called multiple times in succession, and the program can proceed at the call site afterwards. This ability makes delimited continuations strictly more powerful than regular ones. Operationally speaking, delimited continuations
Danvy and Filinski (1989) presented a monomorphic type system for shift and reset, which was extended by Asai and Kameyama (2007) to provide answer type polymorphism. A type and effect system for full continuations has been presented by Thielecke (2003). None of these makes it possible to identify all uses of shift (including trivial ones like in shift(k => k(3))) in a given source program. The selective CPS transform has been introduced by Nielsen (2001), but it has not been applied in the context of delimited continuations.
made available on both monadic and direct style levels. Besides that, it is common knowledge that delimited continuations are able to express any definable monad (Filinski 1994, 1999). In this paper we pursue a third, more direct alternative: transform programs using delimited continuations into CPS using a type-directed selective CPS transform. Whole program CPS transforms were previously thought to be too inefficient to be practical, unless accompanied by tailor-made compiler backends and runtimes (Appel 1992). However, we show that the more localized CPS-transforms needed for delimited continuations can be implemented on stock VMs in ways that are competitive in terms of performance. 1.1
1.3
Contributions
To the best of our knowledge, we are the first to implement directstyle shift and reset operators with full answer type polymorphism in a statically type-safe manner and integrated into a widely available language. We implement a simple type and effect system which discriminates pure from impure expressions, while accurately tracking answer type modification and answer type polymorphism. We thereby extend the work of Asai and Kameyama (2007) to a slightly different notion of purity, which identifies all uses of shift in a program and thus is a suitable basis for applying a selective CPS transform (Nielsen 2001). To the best of our knowledge, we are the first to use a selective CPS transform to implement delimited continuations. With our implementation, we present evidence that the standard CPS transform, when applied selectively, is a viable means to efficiently implement static control operators in an adverse environment like the JVM, under the condition that closures are available in the host language. 1.2
Overview
The rest of this paper is structured as follows. Section 2 gives a short overview of the Scala language and the language subset relevant to our study. Section 3 is the main part of this paper and describes the typing rules and program transformations which constitute the implementation of delimited continuations in Scala. Section 4 presents programming examples, Section 5 performance figures, and Section 6 concludes.
2.
The Host Language
This section gives a quick overview of the language constructs of Scala as far as they are necessary to understand the material in this paper. Scala is different from most other statically typed languages in that it fuses object-oriented and functional programming. Most features from traditional functional languages such as Haskell or ML are also supported by Scala, but sometimes in a slightly different way. Value definitions in Scala are written val pat = expr where pat is a pattern and expr is an expression. They correspond to let bindings. Monomorphic function definitions are written
Related Work
Filinski (1994) presented an ML-implementation of shift and reset (using callcc and mutable state), which has fixed answer types. Gasbichler and Sperber (2002) describe a direct implementation in Scheme48, which, of course, is not statically typed. Dyvbig et al. (2007) presented a monadic framework for delimited continuations in Haskell, which includes support for shift and reset among other control operators and supports multiple typed prompts. This framework does not allow answer type modification, though, and the control operators can only be used in monadic style (e.g. using Haskell’s do notation) but not in direct-style. Kiselyov et al. (2006) presented a direct-style implementation in (bytecode) OCaml, which is inspired by Dyvbig et al.’s framework. The OCaml implementation does not support answer type modification, though, and the type system does not prevent using shift outside the dynamic scope of a reset. In this case, a runtime exception will occur. All of the above implementations cannot express Asai’s type-safe implementation of printf (Asai 2007). Kiselyov (2007) gave an adaption of Asai and Kameyama (2007)’s type system (powerful enough to express printf) in Haskell, which is fully type safe and provides answer type polymorphism, but cannot be used in direct-style (in fact, due to the use of parameterized monads (Atkey 2006), do notation cannot be used either). Asai and Kameyama (2007) did not provide a publicly available implementation of their calculus. On the JVM, continuations have been implemented using a form of generalized stack inspection (Pettyjohn et al. 2005) in Kilim (Srinivasan 2006; Srinivasan and Mycroft 2008). These continuations can in fact be regarded as delimited, but there is no published account of their delimited nature. Kilim also tracks control effects using a @pausable annotation on methods but there are no explicit or definable answer types.
def f(params): T = expr
where f is the function’s name, params is its parameter list, T is its return type and expr is its body. Regular by-value parameters are of the form x: T, where T is a type. Scala also permits by-name parameters, which are written x: => T. It is possible to leave out the parameter list of a function completely. In this case, the function’s body is evaluated every time the function’s name is used. Definitions in Scala programs are always nested inside classes or objects.¹ A simple class definition is written
class C[tparams](params) extends Ds { defs }
This defines a class C with type parameters given by tparams, value parameters given by params, superclasses and -traits given by Ds and definitions given by defs. All components except the class name can be omitted. Classes are instantiated to objects using the new operator. One can also define an object directly with almost the same syntax as a class:
object O extends Ds { defs }
Such an object is an instance of an anonymous class with the given parent classes Ds and definitions defs. The object is created lazily, the first time it is referenced. The most important form of type in Scala is a reference to a class C, or C[Ts] if C is parameterized. Unary function types from S to T are type instances of class Function1[S, T], but one usually uses the abbreviated syntax S => T for them.
¹ Definitions outside classes are supported in the Scala REPL and in Scala scripts but they cannot be accessed from outside their session or script.
By-name function
types can be written (=> S) => T. Most of Scala’s libraries are written in an object-oriented style, where operations apply to an implicit this receiver object. For instance, a List class would support operations map and flatMap in the following way:

class List[T] { ...
  def map[U](f: T => U) = this match {
    case Nil => Nil
    case x :: xs => f(x) :: xs.map(f)
  }
  def flatMap[U](f: T => List[U]) = this match {
    case Nil => Nil
    case x :: xs => f(x) ::: xs.flatMap(f)
  }
}

Here, :: is list cons and ::: is list concatenation. The implementations above also show Scala’s approach to pattern matching using match expressions. Similar to Haskell and ML, pattern matching blocks that enclose a number of cases in braces can also appear outside match expressions; they are then treated as function literals. Most forms of expressions in Scala are written similarly to Java and C, except that the distinction between statements and expressions is less strict. Blocks { ... }, for example, can appear as expressions, including as function arguments. Another departure from Java is support for function literals, which are written with an infix =>. For instance, (x: Int) => x + 1 represents the incrementation function on integers. All binary operations in Scala are treated as method calls. x op y is treated as x.op(y) for every operator identifier op, no matter whether op is symbolic or alphanumeric. In fact, map and flatMap correspond closely to the operations of a monad. flatMap is monadic bind and map is bind with a unit result. Together, they are sufficient to express all monadic expressions as long as injection into the monad is handled on a per-monad basis. Therefore, all that needs to be done to implement monadic unit is to provide a corresponding constructor operation that, in this case, builds one-element lists. Similarly to Haskell and F#, Scala supports monad comprehensions, which are called for-expressions. For instance, the expression for (x <- xs; y <- ys) yield g(x, y) is translated to xs.flatMap(x => ys.map(y => g(x, y))). Definitions as well as parameters can be marked as implicit. Implicit parameters that lack an actual argument can be instantiated from an implicit definition that is accessible at the point of call and that matches the type of the parameter. Implicit parameters can simulate the key aspects of Haskell’s type classes (Moors et al. 2008). An implicit unary function can also be used as a conversion, which implicitly maps its domain type to its range. Definitions and types can be annotated. Annotations are user-defined metadata that can be attached to specific program points. Some annotations are visible at run-time where they can be accessed using reflection. Others are consumed at compile-time by compiler plugins. The Scala compiler has a standardized plugin architecture (Nielsen 2008) which lets users add additional type checking and transformation passes to the compiler. Syntactically, annotations take the form of a class constructor preceded by an @-sign. For instance, the type String @cps[Int, List[Int]] is the type String, annotated with an instance of the type cps applied to type arguments Int and List[Int].

In this paper, we study the addition of control operators shift and reset to this language framework, which together implement static delimited continuations (Danvy and Filinski 1990, 1992). The operational semantics of shift is similar to that of callcc in languages like Scheme or ML, except that a continuation is only captured up to the nearest enclosing reset and the capturing is always abortive (i.e. the continuation must be invoked explicitly). Figure 1 presents some examples to illuminate the relevant cases.

reset {
  val x = shift { k: (Int=>Int) =>
    "done here"
  }
  println(x)
}
No output (continuation not invoked)

reset {
  val x = shift { k: (Int=>Int) =>
    k(7)
  }
  println(x)
}
Output: 7

val x = reset {
  shift { k: (Int=>Int) =>
    k(7)
  } + 1
} * 2
println(x)
Output: 16

val x = reset {
  shift { k: (Int=>Int) =>
    k(k(k(7)))
  } + 1
} * 2
println(x)
Output: 20

val x = reset {
  shift { k: (Int=>Int) =>
    k(k(k(7)))
    "done"
  } + 1
}
println(x)
Output: “done”

def foo() = { 1 + shift(k => k(k(k(7)))) }
def bar() = { foo() * 2 }
def baz() = { reset(bar()) }
println(baz())
Output: 70

Figure 1. Examples: shift and reset

3.
Implementation
Broadly speaking, there are two ways to implement continuations (see (Clinger et al. 1999) for a more detailed account). One is to stick with a stack-based execution architecture and to reify the current continuation by making a copy of the stack, which is reinstated
when the continuation is invoked. This is the approach taken by many language implementations that are in direct control of the runtime system. Direct implementations of delimited continuations using an incremental stack/heap strategy have also been described (Gasbichler and Sperber 2002). In the Java world, stack-copying has been used to implement continuations on the Ovm virtual machine (Dragos et al. 2007). For Scala, though, this is not a viable option, since Scala programs need to run on plain, unmodified, JVMs, which do not permit direct access or modification of stack contents. A variant of direct stack inspection is generalized stack inspection (Pettyjohn et al. 2005), which uses auxiliary data structures to simulate continuation marks that are not available on the JVM or CLR architectures. That approach is picked up and refined by Kilim (Srinivasan 2006; Srinivasan and Mycroft 2008), which transforms compiled programs at the bytecode-level, inserting copy and restore instructions to save the stack contents into a separate data structure (called a fiber) when a continuation is to be accessed. The other approach is to transform programs into continuationpassing-style (CPS) (Appel and Jim 1989; Danvy and Filinski 1992). Unfortunately, the standard CPS-transform is a wholeprogram transformation. All explicit or implicit return statements are replaced by function calls and all state is kept in closures, completely bypassing the stack. For a stack-based architecture like the JVM, of course, this is not a good fit. On the other hand, regarding manually written CPS code shows that only a small number of functions in a program actually need to pass along continuations. What we are striving for is thus a selective CPS transform (Nielsen 2001) that is applied only where it is actually needed, and allows us to stick to a regular, stack-based runtime discipline for the majority of code. As a side effect, this by design avoids the performance problems associated with implementations of delimited continuations in terms of undelimited ones (Balat and Danvy 1997; Gasbichler and Sperber 2002). In general, a CPS transform is feasible only if the underlying architecture supports constant-space tail-calls, which is the case for the .NET CLR but not yet for the JVM2 . So far, we have not found this a problem in practice. One reason is that for many use-cases of delimited continuations, call depth tends to be rather small. Moreover, some applications lend themselves to uses of shift and reset as parts of other abstractions, which allow a transparent inclusion of a trampolining facility, in fact introducing a back-door tail-call optimization. 3.1
Syntax-Directed Selective CPS Transform
Taking a step back, we consider how we might implement delimited continuations as user-level Scala code. The technique we use comes as no surprise and is a straightforward generalization of the continuation monad to one that is parametric in the answer types. On a theoretical level, parameterized monads have been studied in (Atkey 2006). As a first step, we define a wrapper class to hold shift blocks and provide methods to extend and compose them. The method flatMap is the monadic bind operation:

class Shift[+A,-B,+C](val fun: (A => B) => C) {
  def map[A1](f: (A => A1)): Shift[A1,B,C] = {
    new Shift((k:(A1 => B)) =>
      fun((x:A) => k(f(x))))
  }
  def flatMap[A1,B1,C1 <: B](f: (A => Shift[A1,B1,C1])): Shift[A1,B1,C] = {
    new Shift((k:(A1 => B1)) =>
      fun((x:A) => f(x).fun(k)))
  }
}

Note the +/- variance annotations on the type parameters of class Shift, which make the class covariant in parameters A and C and contravariant in B. This makes Shift objects consistent with the subtyping behavior of the parameter fun. We go on by defining reset to operate on Shift objects, invoking the body of a given shift block with the identity function to pass the result back into the body (which is the standard CPS definition of reset):

def reset[A,C](c: Shift[A,A,C]) = c.fun(x:A => x)

With these definitions in place, we can use delimited continuations by placing Shift blocks in for comprehensions, which are Scala’s analog to the do notation in Haskell:

val ctx = for {
  x <- new Shift((k:Int=>Int) => k(k(k(7))))
} yield (x + 1)
reset(ctx) // 10

This works because during parsing, the Scala compiler desugars the for comprehension into invocations of map and flatMap:

val ctx = new Shift((k:Int=>Int) => k(k(k(7))))
  .map(x => x + 1)
reset(ctx) // 10

So for all practical matters, we have a perfectly workable selective CPS transform, albeit a syntax-directed one, i.e. one which is carried out by the parser on the basis of purely syntactic criteria, more specifically the placement of the keywords for and yield. Being forced to use for comprehensions everywhere continuations are accessed does not make for a pleasant programming style, though. Instead, we would like our CPS to be type-directed, i.e. carried out by the compiler on the basis of expression types.

3.2
Effect-Annotated Types
The motivation for this approach is to transparently mix code that must be transformed with code that does not. Therefore, we have to disguise the type Shift[A,B,C] as something else, notably something that is compatible with type A because A is the argument type of the expected continuation (recall the definition of Shift). Thus, we make use of Scala’s pluggable typing facilities and introduce a type annotation @cps[-B,+C], with the intention that any expression of type A @cps[B,C] should be translated to an expression of type Shift[A,B,C]. The approach of using annotated types to track control effects has a close correspondence to the work on polymorphic delimited continuations (Asai and Kameyama 2007). It has been noted early (Danvy and Filinski 1989) that in the presence of control operators, standard typing judgements of the form Γ ⊢ e : τ, which associate a single result type τ with an expression e, are insufficient to accurately describe the result of evaluating the expression e. The reason is that evaluating e may change the answer type of the enclosing computation. In the original type system by Danvy and Filinski (1989), typing judgements thus have the form

Γ; α ⊢ e : τ ; β

meaning that “if e is evaluated in a context represented by a function from τ to α, the type of the result will be β” or equivalently “In a context where the (original) result type was α, the type of e is τ and the new type of the result will be β”.

² Tail-call support for the JVM has been proposed by JSR 292 (Rose 2008).
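As a small illustration of answer type modification using the user-level Shift class defined above (this example is ours, not taken from the paper), a shift block can make the overall result of the reset have a different type than the hole it fills:

// The hole has type Int, but the shift block changes the answer type to String.
val ctx2: Shift[Int, Int, String] =
  new Shift((k: Int => Int) => "result: " + k(3).toString)

val s: String = reset(ctx2.map(x => x + 1))   // "result: 4"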
Asai and Kameyama (2007) present a polymorphic extension of this (monomorphic) type system and prove a number of desirable properties, which include strong type soundness, existence of principal types and an efficient type inference algorithm. A key observation is that if e does not modify its context, α and β will be identical and if Γ; α ` e : τ ; α is derivable for any α, the expression does not have any measurable control effect. In other words, pure expressions (e.g. values) are intuitively characterized as being polymorphic in the answer type and not modifying it (Thielecke 2003). Pure expressions such as lambda abstractions (or other values) should thus be allowed to be used polymorphically in the language. The ability to define functions that are polymorphic in how they modify the answer type when invoked plays a crucial role e.g. in the implementation of type-safe printf (Asai 2007). Asai and Kameyama therefore use two kinds of typing judgements to distinguish pure from impure expressions, and they require that only pure expressions be used in right-hand sides of let-bindings, in order to keep their (predicative) type system sound. The kinds of judgements used are Γ `p e : τ
Γ ` f : (A => B) => C Γ ` shift(f ) : A @cps[B,C]
( SHIFT )
Γ ` c : A @cps[B,C] Γ ` reify(c) : Shift[A,B,C]
( REIFY )
Γ ` f : (A => (B γ)) α Γ`e:A β δ = comp(α β γ) Γ ` f (e) : B δ Γ ` f : ((=>A β) => (B γ)) α Γ`e:A β δ = comp(α γ) Γ ` f (e) : B δ
( APP - VALUE )
( APP - NAME )
Γ ` f : ((=>A @cps[C,C]) => (B γ)) α Γ`e:A δ = comp(α γ) Γ ` f (e) : B δ ( APP - DEMOTE )
Γ; α ` e : τ ; β
for pure and impure expressions, respectively. Expressions classified as pure are variables, constants, lambda abstractions and uses of reset. In addition, pure expressions can be treated as impure ones that do not change the answer type:
Γ, x : A, f : (A=>B β) ` e : B β Γ, f : (A=>B β) ` {r} : C γ Γ ` def f (x:A): B β = e; r : C γ
Γ `p e : τ Γ; α ` e : τ ; α Instead of the standard function types σ → τ , types of the form σ/α → τ /β are used (denoting a change in the answer type from α to β when the function is applied).
( DEF - CBV )
3.3
Γ, x : A α, f : ((=>A α)=>B β) ` e : B β Γ, f : ((=>A α)=>B β) ` {r} : C γ Γ ` def f (x:=>A α): B β = e; r : C γ ( DEF - CBN )
Pure is not Pure
For our goal of applying a selective CPS transform, we need a slightly different notion of purity. Since we have to transform all expressions that actually access their continuation (and only those), we have to be able to identify them accurately. Neither the intuitive notion of purity nor the purity judgement of Asai and Kameyama does provide this classification. For example, the expression shift(k => k(3)), which needs to be transformed, would be characterized as pure in the intuitive sense (it is polymorphic in the answer type and does not modify it) but the purity judgement is not applicable. The expression id(3), however, which should not be CPS-transformed, is intuitively pure but impure as defined by applicability of the purity judgement, as are function applications in general. We thus define an expression as pure if and only if it does not reify its continuation via shift in the course of its evaluation. In order to adapt the effect typing to this modified notion of purity we use a slightly different formulation. We keep the standard typing judgements of the form Γ ` e : τ , but we enrich the types themselves. That is, τ can be either A, denoting a pure type, or A @cps[B,C], denoting an impure type that describes a change of the answer type from B to C. We will write A α when we talk about both pure and impure types. We present typing rules for a selected subset of the Scala language in Figure 2. Impure types are introduced by shift expressions (SHIFT).3 Impure types are eliminated by reify expressions
Γ`e:Aα Γ, x : A ` {r} : B β Γ ` val x: A = e; r ( VAL ) Γ`s:Aα
δ = comp(α β) :Bδ
Γ`e:Bβ δ = comp(α β) Γ ` {s; e} : B δ
( SEQ )
Figure 2. Typing rules for a selected subset of Scala expressions. Lowercase letters are expressions, uppercase letters types without annotations, greek letters are either annotations or no annotations.
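To give a feel for how these effect-annotated types appear in source code, here is a small sketch written against the @cps notation used in this section; the function names ask and program are ours, not the paper's.

// ask has a control effect: its type records that it may modify the answer
// type of the enclosing reset (here from Unit to Unit).
def ask(): Int @cps[Unit, Unit] =
  shift { k: (Int => Unit) => k(42) }

def program(): Unit = reset {
  val x = ask()   // impure right-hand side; the effect is accounted to the block
  println(x)
}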
comp() = comp(@cps[B,C]) = @cps[B,C] W B) => C V ) => W A @cps[B,C] (B β) (DEF - CBV). That the effect of applying a function is coupled to its return type is consistent with the intuitive assumption that the return type describes what happens when the function is applied. The formal parameter is not allowed to have an effect. This is also intuitively consistent, because for a by-value parameter, any effect the evaluation of an argument might have will occur at the call site (APP - VALUE), so inside the function, accessing the argument will be effect-free. On the other hand, functions in Scala can also have by-name parameters. A function with a single by-name parameter will have the type (=>A α) => (B β) (DEF - CBN), which is consistent with the assumption that the effect of evaluating the argument now happens inside the function body (APP - NAME). If a function taking an impure parameter is applied to a pure expression, the argument is demoted to impure provided that the parameter type does not demand changing the answer type (APP - DEMOTE). The typing rules for other kinds of expressions (e.g. conditionals) are similar in spirit to those presented for function applications. Since functions can be polymorphic in their answer type modification, we allow right-hand-sides of def statements to be impure. We also allow impure expressions in val definitions (these are monomorphic in Scala) that occur inside methods, but the identifier will have a pure type since the effect occurs during evaluation of the right hand side and has already happened once the identifier is assigned (VAL). The effect is instead accounted to the enclosing block (SEQ). In Scala, val definitions are also used to define object or class level fields. Contrary to those inside methods, these val definitions are required to be pure since we cannot capture a continuation while constructing an object. Subtyping between impure types takes the annotations into account, with proper variances as induced by the corresponding CPS representation (see Figure 3). There is no subtyping or general subsumption between pure and impure types, but pure expressions may be treated as impure (and thus polymorphic in the answer type) where required, as defined by the typing rules in Figure 2. This is a subtle difference that allows us to keep track of the purity of each expression and prevents the loss of accuracy associated with Asai and Kameyama’s subsumptive treatment of pure expressions as impure ones. The CPS transform will detect these conversion cases in the code and insert shifts where necessary to demote pure expressions to impure ones. In addition, impure expressions can be treated as pure ones in all by-value places, i.e. those where the expression is reduced to a value. The expression’s effect (which happens during evaluation) is then accounted to the enclosing expression. And this is exactly what will drive the selective CPS conversion: Every use of an impure expression as a pure one must reify the context as a continuation and explicitly pass it to the translated impure expression. When we say “accounted to the enclosing expression”, we are actually a bit imprecise. The correct way to put it is that every expression’s cumulated control effect (which may be none) is a composition of its by-value subexpressions’ control effects in the order of their evaluation. 
For such a composition to exist, the answer types must be compatible (according to the standard rules, also manifest in the type constraints of the class Shift and its method flatMap). During composition, pure expressions are treated neu-
f : (A => B) => C [[shift(f )]] = new Shift[A,B,C](f ) ( SHIFT ) c : A @cps[B,C] [[reify(c)]] = [[c]] ( REIFY ) e : A @cps[B,C] {[[r]]} : U @cps[V ,W ] [[val x: A = e; r]] = [[e]] .flatMap(x:A => {[[r]]}) ( VAL - IMPURE ) e : A @cps[B,C] {[[r]]} : U [[val x: A = e; r]] = [[e]] .map(x:A => {[[r]]}) ( VAL - PURE ) [[def f (x:A) = e; r]] = def f (x:A) = [[e]]; [[r]] ( DEF ) [[s; r]] = s; [[r]]
[[{r}]] = {[[r]]}
( SEQ ) Figure 4. Type-directed selective CPS transform for a subset of Scala. Lowercase italic letters denote untransformed expressions, uppercase letters expression sequences or types. Rules are applied deterministically from top to bottom.
trally. This is how we achieve answer type polymorphism in our system. If no composition exists, a type error is signaled. The rules that define the composition relation are given in Figure 3. 3.4
Type-Directed Transformation
We will define the selective CPS transform in two steps and start with the one that comes last. A subset of the transformation rules is shown in Figure 4. We denote the transformation itself by [[.]], and we let Scala programs access transformed expressions with the primitive reify (REIFY). Invocations of shift are translated to creations of Shift objects (SHIFT). If an impure expression appears on the right-hand-side of a value definition, which is followed by a sequence of expressions, then the right-hand-side is translated to a Shift object, upon which flatMap or map is invoked with the translated remaining expressions wrapped up into an anonymous function (VAL - IMPURE,VAL - PURE). Whether flatMap or map is used depends on whether the remaining expressions are translated to a Shift object or not. The semantics of shift require to flatten out nested Shift objects because otherwise, multiple nested reset handlers would be needed to get at the result of a sequence of shift blocks. In this case, all but the first shift would escape the enclosing reset4 . The use of map is an optimization. We could as well wrap the remaining code into a stub shift that behaves as identity and then use flatMap. But that would introduce unnecessary “administrative” redexes, which customary CPS-transform algorithms go to great lengths to avoid (Danvy et al. 2007). Right-hand sides of function definitions are translated independently of the context (DEF). Finally, block expressions {...} 4 This
is in fact Felleisen’s model (Felleisen 1988)
expression in question. While this is a sound premise in the description at hand, we actually make sure in the implementation that the expression itself is annotated accordingly. This is done by an annotation checker, which hooks into the typer phase of the Scala compiler, promoting CPS annotations outwards to expressions that have nested CPS expressions in positions where [[.]]Inline will be applied. In the actual implementation, the selective ANF transform is also slightly more complex than described here. One reason is that we have to accommodate for possibly erroneous programs. Therefore, the actual transform takes two additional parameters, namely an expected @cps annotation (or none if a pure expression is expected) for the current expression sequence and an actual one, which is built up as we go along. When reaching the end of an expression sequence, these two must either match, or, if an annotation is expected but none is present, an implicit shift is inserted that behaves as identity. Summing up the transformation steps, we implement reset in terms of reify, using a by-name parameter:
[[{r; e}]] = {[[r]]Inline ; [[e]]} [[s; r]]Inline = [[s]]Inline ; [[r]]Inline g : A @cps[B,C] [[e]] = r; g [[e]]Inline = r; val x: A = g; x [[f ]]Inline = r; g [[e]]Inline = s; h [[f (e)]] = r; s; g(h) [[f ]]Inline = r; g [[e]]Inline = s; h [[f (e)]] = r; g({s; h})
( BY- VALUE APPLY )
( BY- NAME APPLY )
Figure 5. Selective ANF transform (only selected rules shown). Lowercase italic letters denote untransformed expressions, uppercase letters expression sequences or types.
def reset[A,C](ctx: => A @cps[A,C]) = { reify(ctx).fun(x:A => x) }
are translated by applying the other rules to the enclosed expression sequence, possibly skipping a prefix of non-CPS terms (SEQ). Applying the transformation rules given in Figure 4, we can transform code like
Finally, we can express our working example as reset { shift(k => k(k(k(7)))) + 1 }
reset(reify { val x = shift(k => k(k(k(7)))) x+1 })
which is exactly what was intended.
into the following:
4.
reset(new Shift(k => k(k(k(7)))).map(x => x + 1))
Programming Examples
There are many well-known use cases for delimited continuations and most of them can be implemented in Scala straightforwardly.
We are still somewhat restricted, though, in that CPS expressions may only appear in value definitions. Fortunately, we can reach this form by a pre-transform step, which assigns synthetic names to the relevant intermediate values, lifting them into value definitions. In analogy to the selective CPS transform, we can describe this step as a selective ANF transform (administrative normal form (Flanagan et al. 1993)). We present a subset of the transformation rules in Figure 5. For the ANF pre-transform, we use two mutually recursive functions, [[.]] and [[.]]Inline that map expressions to expression sequences ([[.]]Inline is extended pointwise to expression sequences). The latter is used to lift nested CPS-expressions and insert a val definition inline, preceding the parent expression. Since we do not want to introduce value definitions for expressions that are already in tail position, we use either transformation depending on the context. Again, we illustrate the main principle by considering function applications. We consider application of functions with a single by-value parameter first. The function and the argument are nested expressions and thus transformed using [[.]]Inline , each of them yielding a statement sequence followed by an expression. The Scala semantics demand that the function be evaluated first, so the result is the function’s statements, followed by the argument’s statements, followed by applying the expressions. When considering functions with by-name parameters, by contrast, the statements that result from transforming the corresponding argument must not be inserted preceding the application. In this case, the whole resulting sequence is wrapped up in a block and passed as an argument to the transformed function. For other kinds of Scala expressions like conditionals, pattern matching, etc., the transformation works accordingly, depending on the context whether the by-name or byvalue style is used. Note that in Figure 5, the insertion of new value definitions is triggered by a @cps annotation on the result of transforming the
4.1
Type-Safe printf
As a first example, we present the Scala implementation of typesafe printf (Danvy 1998; Asai 2007): val int = (x: Int) => x.toString val str = (x: String) => x def format[A,B](toStr: A => String) = shift { k: (String => B) => (x:A) => k(toStr(x)) } def sprintf[A](str: =>String @cps[String,A]) = { reset(str) } val f1 = sprintf[String]("Hello World!") val f2 = sprintf("Hello " + format[String,String](str) + "!") val f3 = sprintf("The value of " + format[String,Int=>String](str) + " is " + format[Int,String](int) + ".") println(f1) println(f2("World")) println(f3("x")(3)) This example is instructive for its use of both answer type modification and answer type polymorphism. As we can see in the code above, format takes a function that converts a value of type A to a string. In addition, it modifies the answer type from any type B to a function from A to B. In a context whose result type is String, invoking format(int) will change the answer
type to Int => String; an additional integer argument has to be provided to turn the result into a string. Unfortunately, Scala's local type inference cannot reconstruct all the type parameters here, so we must give explicit type arguments for uses of format.

4.2 Direct-Style Monads

C[B]) => xs.flatMap(k) } } }

Defining an implicit conversion for iterables, we can e.g. use the list monad in direct-style. The unit constructor List is used here as Filinski's reify operation:

implicit def reflective[A](xs:Iterable[A]) =
  new Reflective[A,Iterable](xs)

reset {
  val left = List("x","y","z")
  val right = List(4,5,6)
  List((left.reflect[Any], right.reflect[Any]))
} // result: cartesian product

The same mechanism applies to other monads, too. Using the option monad, for example, we can build a custom exception handling mechanism, and the state monad could be used as an alternative to thread-local variables.

4.3 Concurrency Primitives

Using delimited continuations, we can implement a rich variety of primitives for concurrent programming. Among others, these include bounded and unbounded buffers, rendezvous cells, futures, single-assignment variables, actor mailboxes, and join patterns. Without going into the details of the implementation, we show how our implementation of extensible join patterns or dynamic functional nets (Fournet and Gonthier 1996; Odersky 2000; Rompf 2007) can integrate join-calculus based programming into Scala. The following code implements, with a common interface, synchronous rendezvous cells and asynchronous reference cells backed by a one-place buffer:

abstract class ReferenceCell[A] {
  val put = new (A ==> Unit)
  val get = new (Unit ==> A)
}

// synchronous reference cell (no buffering)
class SyncRefCell[A] extends ReferenceCell[A] { join { case put(x
Unit) val item = new (A ==> Unit) join { case put(x

4.4 Actors

Scala Actors provide an implementation of the actor model (Agha and Hewitt 1987) on the JVM (Haller and Odersky 2009). To make efficient use of VM threads, actors, when waiting for incoming messages, can suspend in event-based mode with an explicitly passed continuation closure instead of blocking the underlying thread (Haller and Odersky 2006). This is accomplished by the primitive react, which takes a message handler (the continuation closure) and suspends the current actor in event mode. The underlying (pool) thread is released to execute other runnable actors. Using react, however, imposes some restrictions on the program structure. In particular, no code following a react is ever executed, only the explicitly provided closure. For example, consider implementing a communication protocol using actors. It would be tempting to handle the connection setup in a separate method:

def establishConnection() = {
  server ! SYN
  react {
    case SYN_ACK => server ! ACK
  }
}

which is then used as part of a more complex actor behavior:

actor {
  establishConnection()
  transferData()
  ...
}

But unfortunately, this does not work as is. The use of react inside establishConnection precludes the execution of transferData. To make this example work, one would have to use explicit andThen combinators to chain the individual pieces of behavior together. In the presence of complex control structures, programming in this style quickly becomes cumbersome. In addition, the type system does not enforce the use of combinators, so errors will manifest only at runtime. Using delimited continuations, we can simplify programming event-based actors significantly. Moreover, we can do so without changing the implementation of the existing primitives, thereby maintaining the high degree of interoperability with standard Java threads (Haller and Odersky 2009). This approach, which has been suggested by Philipp Haller, introduces a higher-order function proceed that can be applied to react, such that the message handling closure is extended with the current continuation:

def proceed[A, B](fun: PartialFunction[A, Unit] => Nothing):
    PartialFunction[A, B] => B @cps[Unit, Unit] =
  (cases: PartialFunction[A, B]) =>
    shift((k: B => Unit) => fun(cases andThen k))

Wrapping each react with a proceed and inserting a reset to delineate the actor behavior's outer bound, we can actually code the above example as follows. It is worth mentioning that leaving out the reset would cause the compiler to signal a type error, since an impure expression would occur in a pure context:

def establishConnection() = {
  server ! SYN
  proceed(react) {
    case SYN_ACK => server ! ACK
  }
}

actor {
  reset {
    establishConnection()
    transferData()
    ...
  }
}

Alternatively, the implementations of react and actor could be modified to make use of the necessary control operators directly. But using proceed is a good example of incorporating delimited continuations into existing code in a backwards-compatible way.

4.5 Functional Reactive Programming

Functional reactive programming (FRP) is an effort to integrate reactive programming concepts into functional programming languages (Elliott and Hudak 1997; Courtney et al. 2003; Cooper and Krishnamurthi 2006). The two fundamental abstractions of FRP are signals and event streams. A signal represents a continuous time-varying value; it holds a value at every point in time. An event stream, on the other hand, is discrete; it yields a value only at certain times. Signals and event streams can be composed through combinators, some of which are known from functional collections, such as map, flatMap, or filter. Our implementation of a reactive library in Scala takes the basic ideas of FRP and extends them with support for imperative programming. One key abstraction to achieve this is called behaviors. A behavior can be used to conveniently react to complex event patterns in an imperative way. To give an idea how behaviors work, we take an example from our user interface toolkit, whose event handling details are implemented exclusively with our reactive programming library. The interactive behavior of a button widget can be implemented as follows:

behavior {
  next(mouse.leftDown)
  showPressed(true)
  val t = loop {
    showPressed(next(mouse.hovers.changes))
  }
  next(mouse.leftUp)
  t.done()
  showPressed(false)
  if (mouse.hovers.now) performClick()
}

The first action of the behavior above is to wait until the left mouse button is pressed down. Method next blocks the current behavior and returns the next message that becomes available in a given event stream. In our case, the behavior drops that message and then updates the button view and starts a loop. A loop is a child behavior that is automatically terminated when the current cycle of the parent behavior ends. The loop updates the button view whenever the mouse enters or leaves the button area. We do this by waiting for changes in the boolean signal mouse.hovers, which indicates whether the mouse currently hovers over the button widget. The call to changes converts that boolean signal to an event stream that yields boolean messages. We use the event message to determine whether the mouse button is currently over the button. In the parent behavior, in parallel to the loop, we wait for the left mouse button to be released. On release, we terminate the loop by calling done, which causes the child behavior to stop after it has processed all pending events. Note that this does not lead to race conditions since, in contrast to actors, behaviors are executed sequentially and should not be accessed from different threads. Eventually, we update the button view and perform a click if the mouse button has been released while inside the bounds of the button widget. The use of the CPS transform API is hidden inside behavior, next, and loop. Methods behavior and loop delimit the continuation scope, while method next captures the continuation and passes it to an event stream observer, which invokes the continuation on notification.

4.6 Asynchronous IO

Using a similar model, we can use scalable asynchronous IO primitives in a high-level declarative programming style. Below, we consider the Scala adaptation of an asynchronous webserver presented in (Rompf 2007). The basic mechanism is to request a stream of callbacks matching a set of keys from a selector. This stream can be iterated over and transformed into a higher-level stream by implementing the standard iteration methods (e.g. foreach) used by Scala's for comprehensions. A stream of sockets representing incoming connections can be implemented like this:

def acceptConnections(sel: Selector, port: Int) = new Object {
  def foreach(body: (SocketChannel => Unit @suspendable)) = {
    val serverSock = ServerSocketChannel.open()
    for (key

What type should be inferred for function f1? Alas there are two possible most-general types, neither of which is an instance of the other:

f1 :: ∀a. T a → Bool
f1 :: ∀a. T a → a

The loss of principal types is both well-known and unavoidable (CH03). Since f1 has no principal type (one that is more general than all others), the right thing must be to reject the program and ask the programmer to say which type is required by means of an explicit type signature, like this, for example:
f1 :: T a -> a
f1 (T1 n) = n>0

But exactly which programs should be rejected in this way? For example, consider f2:

f2 (T1 n) = n>0
f2 (T2 xs) = null xs

Since null :: [a] -> Bool returns a Bool, and T2 is an ordinary (non-GADT) data constructor, the only way to type f2 is with result Bool, so the programmer might be confused at being required to say so. After all, there is only one solution: why can't the compiler find it?

An exactly similar issue arises in relation to variables in the environment. Consider

h1 x (T1 n) = x && n>0
h1 x (T2 xs) = null xs

Which of these two incomparable types should we infer?

h1 :: ∀a. a → T a → Bool
h1 :: ∀a. Bool → T a → Bool

Again, since neither is more general than the other, we should reject the program. But if we somehow know from elsewhere that x is a Bool, then there is no ambiguity, and we might prefer to accept the definition. Here is an example:

h2 x (T1 n) = x && n>0
h2 x (T2 xs) = not x

The key difficulty is that a GADT pattern match brings local type constraints into scope. For example, in the T1 branch of the definition of f1, we know that the constraint a ∼ Bool holds, where the second argument of f1 has type T a. (We consistently use "∼" to denote type equalities, because "=" is used for too many other things.) Indeed, while the declaration for the GADT T above is very convenient for the programmer, it is quite helpful to re-express it with an explicit equality constraint, like this (GHC allows both forms, and treats them as equivalent):

data T :: *->* where
  T1 :: (a~Bool) => Int -> T a   -- Was: T1 :: Int -> T Bool
  T2 :: [a] -> T a

You may imagine a value of type T τ, built with T1, as a heap-allocated object with two fields: a value of type Int, and some evidence that τ ∼ Bool. When the value is constructed the evidence must be supplied; when the value is de-constructed (i.e. matched in a pattern) the evidence becomes available locally. While in many systems, including GHC, this "evidence" has no run-time existence, the vocabulary can still be helpful and GHC does use explicit evidence-passing in its intermediate language (SCPD07).

Type envt           Γ      ::= {x1 : σ1, ..., xn : σn}
Type variables      ν      ::= α | a
Monotypes           τ, υ   ::= ν | τ → υ | T τ̄
Type Schemes        σ      ::= τ | ∀ā.C ⇒ τ

Constraints         C, D   ::= τ ∼ τ | C ∧ C | ε
Impl. Constraints   F      ::= C | [ᾱ](∀b̄.C ⊃ F) | F ∧ F

Unifiers            θ      ::= ∅ | θ, {α := τ}
Substitutions       φ      ::= ∅ | φ, {ν := τ}

Terms               e      ::= K | x | λx.e | e e
                             | let {g = e} in e
                             | let {g :: σ = e1} in e2
                             | case e of [pi → ei]i∈I
Patterns            p      ::= K x1 ... xn

fuv(τ) = the free unification variables of τ (and similarly fuv(Γ))

Substitution:   φ(F1 ∧ F2) = φ(F1) ∧ φ(F2)
                φ([ᾱ]∀b̄.C ⊃ F) = [fuv(φ(ᾱ))]∀b̄.φ(C) ⊃ φ(F)
                Substitution on C and τ is conventional

Abbreviations:  ∀ā.τ abbreviates ∀ā.ε ⇒ τ, and [ᾱ](∀b̄.F) abbreviates [ᾱ](∀b̄.ε ⊃ F)

Figure 1. Syntax of Programs

3. Formal setup

Before we can present our approach, we briefly introduce our language, and the general form of its type system.

3.1 Syntax

Figure 1 gives the syntax of terms and types, which should look familiar to Haskell programmers. A program consists of a set of data type declarations together with a term e. Terms consist of the lambda calculus, together with let bindings (perhaps with a user-supplied type signature), and simple case expressions to perform pattern matching. A data type declaration introduces a type constructor T and one or more data constructors Ki, each of which is given a type signature. As described in Section 2, in the case of
GADTs the data constructor's type contains a set of constraints D, that are brought into scope when K is used in a pattern match, and required when K is used as a constructor. The syntax of types, and of constraints, is also given in Figure 1. Note that unification variables α denote unknown types and only appear during type inference, never in the resulting typings. To avoid clutter we use only equality constraints τ1 ∼ τ2 in our formalism, although in GHC there are several other sorts of constraint, including implicit parameters and type classes. We treat conjunction (∧) as a commutative and associative operator, as is conventional. Implication constraints F will be introduced in Section 4.4.

C, Γ ⊢ e : τ

(VAR)    (x : ∀ā.υ) ∈ Γ    φ = {a := τ̄}
         ────────────────────────────────
         C, Γ ⊢ x : φ(υ)

(CON)    K :: ∀ā.D ⇒ υ    φ = {a := τ̄}    C |= φ(D)
         ────────────────────────────────────────────
         C, Γ ⊢ K : φ(υ)

(ABS)    C, Γ ∪ {x : τ1} ⊢ e : τ2
         ─────────────────────────
         C, Γ ⊢ λx.e : τ1 → τ2

(APP)    C, Γ ⊢ e1 : τ1 → τ2    C, Γ ⊢ e2 : τ1
         ──────────────────────────────────────
         C, Γ ⊢ e1 e2 : τ2

(EQ)     C, Γ ⊢ e : τ1    C |= τ1 ∼ τ2
         ──────────────────────────────
         C, Γ ⊢ e : τ2

(LET)    C, Γ ∪ {g : τ1} ⊢ e1 : τ1    ā = fv(τ1) − fv(C, Γ)
         C, Γ ∪ {g : ∀ā.τ1} ⊢ e2 : τ2
         ──────────────────────────────────────────────────
         C, Γ ⊢ let {g = e1} in e2 : τ2

(LETA)   C, Γ ∪ {g : ∀ā.τ1} ⊢ e1 : τ1    C, Γ ∪ {g : ∀ā.τ1} ⊢ e2 : τ2
         ──────────────────────────────────────────────────────────────
         C, Γ ⊢ let {g :: ∀ā.τ1 = e1} in e2 : τ2

(CASE)   C, Γ ⊢ e : τ1    C, Γ ⊢p pi → ei : τ1 → τ2  for i ∈ I
         ──────────────────────────────────────────────────────
         C, Γ ⊢ case e of [pi → ei]i∈I : τ2

C, Γ ⊢p p → e : τ → υ

(PAT)    K :: ∀ā, b̄.D ⇒ υ1 → ... → υp → T ā    fv(C, Γ, τ̄, τr) ∩ b̄ = ∅
         φ = {a := τ̄}    consistent(C ∧ φ(D))
         C ∧ φ(D), Γ ∪ φ{x1 : υ1, ..., xp : υp} ⊢ e : τr
         ───────────────────────────────────────────────────────────────
         C, Γ ⊢p K x1 ... xp → e : T τ̄ → τr

Figure 2. Simple but over-permissive typing rules

3.2 Type system

The declarative specification of a type system usually takes the form of a typing judgement

Γ ⊢ e : τ

with the meaning "in type environment Γ the term e has type τ". In a system with GADTs, however, a pattern match may bring into scope some local equality constraints. The standard way to express this is with the judgement

C, Γ ⊢ e : τ

meaning "in a context where constraints C are in scope, and type environment Γ, the term e has type τ". For example, here is a valid judgement:

(a ∼ Bool), {x : a, not : Bool → Bool} ⊢ not x : Bool

The judgement only holds because of the availability of the local equality a ∼ Bool. The type system of Figure 2 takes exactly this form. For example, rule (CON) instantiates the type scheme of a data constructor in the usual way, except that it has the additional premise

C |= φ(D)

This requires that the "wanted" constraints φ(D) must be deducible from the "given" constraints C. To be concrete we give the (routine) definition of |= in Figure 3. Compare rule (CON) to rule (VAR), where the type scheme does not mention constraints.

(TRUE)    C |= ε

(REFL)    C |= τ ∼ τ

(SYM)     C |= τ ∼ υ
          ────────────
          C |= υ ∼ τ

(TRANS)   C |= τ1 ∼ τ2    C |= τ2 ∼ τ3
          ─────────────────────────────
          C |= τ1 ∼ τ3

(GIVEN)   C1 ∧ C2 |= C2

(CONJ)    C |= F1    C |= F2
          ────────────────────
          C |= F1 ∧ F2

(STRUCT)  C |= τi ∼ υi
          ──────────────────
          C |= T τ̄i ∼ T ῡi

(TCON)    C |= T τ̄i ∼ T ῡi
          ──────────────────
          C |= τi ∼ υi

(IMPL)    C ∧ C1 |= F    b̄ ∩ fv(C) = ∅
          ──────────────────────────────
          C |= [ᾱ](∀b̄.C1 ⊃ F)

Figure 3. Equality theory

Rule (EQ) allows us to use the available constraints C to adjust the result type τ1 to any equal type τ2. Finally, a case expression uses an auxiliary judgement ⊢p to typecheck the case alternatives. Notice the way that the local constraints C are extended when going inside a pattern match (in rule (PAT)), just as the type environment is augmented when going inside a lambda-term (in rule (ABS)).

Whenever we go inside a pattern match, we require the given constraint C ∧ φ(D) to be consistent, defined by the rule:

(CONSISTENT)  ∃φ. |= φ(C)
              ──────────────
              consistent(C)

i.e. C is consistent if it has a unifier. Consistency implies that we will reject programs with inaccessible case branches.

A second point to notice about rule (PAT) is that a data constructor may have existential type variables b̄ as well as universal type variables ā. (The former are called existential, despite their apparent quantification with ∀, because the constructor's type is isomorphic to K :: ∀ā.(∃b̄.D × υ1 × ... × υp) → T ā.) Rule (PAT) must check that the existential variables
are not mentioned in the environment C, Γ, or the scrutinee type T τ , or the result type τr . In the following example, fx1 is welltyped, but fx2 is not because the existential variable b escapes:
algorithm finds the most general substitution that solves the constraints. Temporarily leaving aside the question of generalization, that's all there is to type inference for ML.
data X where X1 :: forall b. b -> (b->Int) -> X
4.2
fx1 (X1 x f) = f x fx2 (X1 x f) = x 3.3
\x -> case x of { T1 n -> n>0 } recalling the type of T1:
Properties
T1 : ∀a.(Bool ∼ a) ⇒ Int → T a
The type checking problem for GADTs is decidable (CH03; SP07). However, type inference turns out to be extremely difficult. The example from Section 2 shows that GADTs lack principal types. The difficulty is that the type system can type too many terms. Hence, our goal is to restrict the type system to reject just enough programs to obtain a tractable type inference system which enjoys principal types. Nevertheless, we regard Figure 2 as the “natural” type system for GADTs, against which any such restricted system should be compared.
4.
Again we make up fresh unification variables for any unknown types:

  α    type of the entire right-hand side
  βx   type of x

Matching x against a constructor from type T imposes the constraint βx ∼ T γ, for some new unification variable γ. From the term n>0 we get the constraint α ∼ Bool, but that arises inside the branch of a case that brings into scope the constraint γ ∼ Bool. We combine these two into a new sort of constraint, called an implication constraint:
A new approach
γ ∼ Bool ⊃ α ∼ Bool
In this section we describe our new approach to type inference for GADTs. Type system designers often develop a type inference algorithm hand-in-hand with the specification of the type system: there is no point in a specification that we cannot implement, or an implementation whose specification is incomprehensible. We begin with the inference algorithm. 4.1
Constraint solving with GADTs
What happens when GADTs enter the picture? Consider our standard example term:
Now our difficulty becomes clear: there is no most-general unifier for implication constraints. The substitutions {α := Bool}
{α := γ}
On the other hand, sometimes there obviously is a unique solution. Consider f2 from Section 2:
Type inference by constraint solving
It is well known that type inference can be carried out in two stages: first generate constraints from the program text, and then solve the constraints ignoring the program text (PR05). The generated constraints involve unification variables, which stand for as-yetunknown types, and solving the constraints produces a substitution that assigns a type to each unification variable. The most basic form of constraint is a type equality constraint of form τ1 ∼ τ2 , where τ1 and τ2 are types.
\x -> case x of { T1 n -> n>0; T2 xs -> null xs }

From the two alternatives of the case we get two constraints, respectively:

  γ ∼ Bool ⊃ α ∼ Bool    and    α ∼ Bool

Since the second constraint can be solved only by {α := Bool}, there is a unique most-general unifier to this system of constraints.
For example, consider the definition
4.3
data Pair :: *->*->* where MkP :: a -> b -> Pair a b
GADT type inference is undecidable
Multiple pattern clauses give rise to a conjunction of implication constraints (C1 ⊃ C10 ) ∧ ... ∧ (Cn ⊃ Cn0 ) The task of GADT type inference is to find a substitution θ such that each θ(Ci0 ) follows from θ(Ci ). This problem is identical to the simultaneous rigid E-unification problem which is known to be undecidable (DV95). Hence, we can immediately conclude that GADT type inference is undecidable in the unrestricted type system. To restore decidability and most general solutions, we consider a restricted implication solver algorithm.
f = \x -> MkP x True

The data type declaration specifies the type of the constructor MkP, thus:

  MkP : ∀ab. a → b → Pair a b

Now consider the right-hand side of f. The constraint generator makes up unification variables as follows:

  α        type of the entire right-hand side
  βx       type of x
  γ1, γ2   instantiate a, b respectively, when instantiating the call of MkP

From the text we can generate the following equalities:

  βx ∼ γ1            First argument of MkP
  Bool ∼ γ2          Second argument of MkP
  α ∼ Pair γ1 γ2     Result of MkP
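To make this first stage concrete, the following is a minimal, self-contained sketch of such a constraint solver: syntactic first-order unification over a toy representation of the monotypes of Figure 1. It is not the paper's implementation, and all names in it (Ty, Subst, unify, and so on) are invented for illustration; the example at the end reproduces, up to the orientation of βx ∼ γ1, the substitution quoted in the text.

-- Toy monotypes: unification variables, rigid type variables, and saturated
-- constructor applications (we treat -> as just another constructor).
import qualified Data.Map as M

data Ty = UVar String | TVar String | TCon String [Ty]
  deriving (Eq, Show)

type Subst = M.Map String Ty   -- maps unification variables to types

apply :: Subst -> Ty -> Ty
apply s t@(UVar a)  = maybe t (apply s) (M.lookup a s)
apply _ t@(TVar _)  = t
apply s (TCon c ts) = TCon c (map (apply s) ts)

-- Solve a list of equalities tau1 ~ tau2, threading the substitution through.
unify :: [(Ty, Ty)] -> Maybe Subst
unify = go M.empty
  where
    go s []               = Just s
    go s ((t1, t2) : eqs) = do
      s' <- mgu (apply s t1) (apply s t2)
      go (M.union s' (M.map (apply s') s)) eqs

    mgu (UVar a) t = bind a t
    mgu t (UVar a) = bind a t
    mgu (TVar a) (TVar b) | a == b = Just M.empty
    mgu (TCon c ts) (TCon d us)
      | c == d && length ts == length us = unify (zip ts us)
    mgu _ _ = Nothing

    bind a t
      | t == UVar a = Just M.empty
      | occurs a t  = Nothing              -- occurs check
      | otherwise   = Just (M.singleton a t)

    occurs a (UVar b)    = a == b
    occurs _ (TVar _)    = False
    occurs a (TCon _ ts) = any (occurs a) ts

-- The three equalities generated for f = \x -> MkP x True:
--   beta_x ~ gamma1,  Bool ~ gamma2,  alpha ~ Pair gamma1 gamma2
example :: Maybe Subst
example = unify
  [ (UVar "beta_x",  UVar "gamma1")
  , (TCon "Bool" [], UVar "gamma2")
  , (UVar "alpha",   TCon "Pair" [UVar "gamma1", UVar "gamma2"]) ]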
and
are both solutions, but neither is more general than the other.
4.4
The OutsideIn solving algorithm
Our idea is a simple one: we must refrain from unifying a global unification variable under a local equality constraint. By “global” we mean “free in the type environment4 ”, and we must record that information in the implication constraint itself, thus
[α] (γ ∼ Bool ⊃ α ∼ Bool) because α is free in the type environment. Here γ ∼ Bool is a given equality constraint that may only locally be assumed to hold, i.e., to solve the constraint to the right of the implication sign: α ∼ Bool.
These constraints can be solved by unification, yielding the substitution {α := Pair βx Bool, γ2 := Bool, γ1 := βx }. This substitution constitutes a “solution”, because under that substitution the constraints are all of form τ ∼ τ . Not only that, but the unification
4 We
must treat the result type as part of the type environment.
When solving this constraint we must refrain from unifying {α := Bool}; hence, the constraint by itself is insoluble. It can be solved only if there is some other constraint that binds α.
`inf e : τ
The syntax of implication constraints F is given in Figure 1. An implication constraint is an ordinary constraint C, or a conjunction of implications, or has the form [α]∀ ¯ ¯b.C ⊃ F . We call the set of unification variables α ¯ the untouchables of the constraint, and the set of type variables ¯b the skolems of the constraint. Applying a substitution to an implication constraint requires a moment’s thought, because an untouchable might be mapped to a type by the substitution, so we must take the free unification variables of the result; see Figure 1. We often omit the untouchables, skolems, or C when they are empty.
(INFER)
Γ `W e : τ, F (x : ∀¯ a.τ ) ∈ Γ α fresh φ = {a := α} Γ `W x : φ(τ ),
(VAR)
K :: ∀¯ a.C ⇒ τ α fresh φ = {a := α} Γ `W K : φ(τ ), φ(C)
(C ON)
More precisely, to solve a set of implication constraints F, proceed as follows:

1. Split F into Fg ∧ Fs, where all the constraints in Fg are proper implications, and Fs are all simple. An implication is simple if it does not involve any local equalities, and proper otherwise:

   Fg ::= Fg ∧ Fg | [ᾱ]∀b̄.C ⊃ F    (C ≢ ε)
   Fs ::= Fs ∧ Fs | C | [ᾱ]∀b̄.ε ⊃ Fs
Γ `W e1 : τ1 , F1 Γ `W e2 : τ2 , F2 α fresh F = F1 ∧ F2 ∧ (τ1 ∼ τ2 → α) Γ `W e1 e2 : α, F
(A PP)
(A BS)
2. Solve the simple constraints Fs by ordinary unification, yielding a substitution θ.

3. Now apply θ to Fg, and solve each implication in θ(Fg).

In the last step, how do we solve a proper implication [ᾱ]∀b̄.C ⊃ F? Simply find φ, the most general unifier of C, and solve φ(F), under the restriction that the solution must not bind ᾱ.
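As a rough illustration of these three steps, here is a sketch of the top-level solving loop, reusing the Ty, Subst, unify and apply of the unification sketch given earlier (so it is only runnable together with that code). The constraint representation and the helper names (F, solveF, solveProper, and so on) are invented here and deliberately simplified; in particular, nested implications inside a wanted constraint are ignored.

-- Implication constraints in the spirit of Figure 1: simple equalities,
-- conjunction, and [untouchables](forall skolems. givens => wanted).
data F = Eq Ty Ty
       | And [F]
       | Impl [String] [String] [(Ty, Ty)] F
         -- untouchables, skolems, given equalities, wanted constraint

-- Step 1a: the simple part (implications with no givens are unwrapped).
simpleF :: F -> [(Ty, Ty)]
simpleF (Eq t u)        = [(t, u)]
simpleF (And fs)        = concatMap simpleF fs
simpleF (Impl _ _ [] f) = simpleF f
simpleF Impl{}          = []

-- Step 1b: the proper implications.
propersF :: F -> [F]
propersF (And fs)             = concatMap propersF fs
propersF i@(Impl _ _ (_:_) _) = [i]
propersF _                    = []

-- Steps 2 and 3: unify the simple part, then solve every proper implication
-- under the resulting substitution.
solveF :: F -> Maybe Subst
solveF f = do
  theta <- unify (simpleF f)
  mapM_ (solveProper theta) (propersF f)
  return theta

-- Solve [untch](forall sks. givens => wanted): unify the givens, apply the
-- result to the wanted equalities, and reject any solution that would bind
-- an untouchable variable.
solveProper :: Subst -> F -> Maybe ()
solveProper theta (Impl untch _sks givens wanted) = do
  phi    <- unify [ (apply theta t, apply theta u) | (t, u) <- givens ]
  theta' <- unify [ (apply phi (apply theta t), apply phi (apply theta u))
                  | (t, u) <- simpleF wanted ]
  if any (`M.member` theta') untch then Nothing else Just ()
solveProper _ _ = Just ()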
(L ETA)
This algorithm is conservative: if it finds a unifier, that solution will be most general, but the converse is not true. For example, the algorithm fails to solve the constraint (L ET)
[α] (γ ∼ Bool ⊃ α ∼ Int) but the constraint actually has a unique solution, namely {α := Int}.
5.
α fresh Γ ∪ {x : α} `W e : τ, F Γ `W λx.e : α → τ, F Γ ∪ {g : ∀¯ a.τ } `W e1 : τ 0 , F1 Γ ∪ {g : ∀¯ a.τ } `W e2 : υ, F2 F = F2 ∧ [fuv(Γ)](∀¯ a.F1 ∧ τ ∼ τ 0 )) Γ `W let {g :: ∀¯ a.τ = e1 } in e2 : υ, F
α fresh Γ ∪ {g : α} `W e1 : τ, F1 F10 = F1 ∧ α ∼ τ Fs = simple(F10 ) `s Fs : φs β = fuv(φs (τ )) − fuv(φs (Γ)) b fresh θk = {β := b} Γ ∪ {g : ∀¯b.θk (φs (τ ))} `W e2 : υ, F2 F = F2 ∧ [fuv(Γ)]∀¯b.θk (F10 ) Γ `W let {g = e1 } in e2 : υ, F
The OutsideIn approach in detail
It is time to nail down the details. Our approach relies on constraint generation and constraint solving. We specify top-level constraint generation with Γ `W e : τ, F, to be read as: in the environment Γ, we may infer type τ for the expression e and generate constraint F. Solving a constraint F to produce a substitution θ is specified with `s F : θ. The top-level inference algorithm is then given by the judgement `inf in Figure 4.
(C ASE)
T = constructor (pi ) for i ∈ I Γ `W e : τe , Fe α, ¯ β fresh Γ ` P p i → ei : T α ¯ → β, Fi for i ∈ I V F = Fe ∧ τe ∼ T α ¯ ∧ i∈I Fi Γ `W case e of [pi → ei ]i∈I : α, F Γ `P p → e : τ → υ, F
We start by discussing constraint generation (Section 5.1) and constraint solving (Section 5.2). Subsequently we present the highlevel declarative type system (Section 6). 5.1
`W e : τ, F `s F : θ `inf e : θ(τ )
(PAT)
Generating implication constraints
The constraint generation algorithm is given in Figure 4 with the judgement Γ `W e : τ, F In this judgement, thought of as an algorithm, Γ and e are inputs, while τ and F are outputs.
K::∀¯ a, ¯b.D ⇒ υ1 → ... → υp → T a ¯ b 6∈ f tv(Γ, τr ) φ = {a := τ } Γ ∪ φ{x1 : υ1 , . . . , xp : υp } `W e : τe , Fe F = [α ∪ fuv(Γ, τe )](∀¯b.φ(D) ⊃ Fe ∧ τe ∼ τr ) Γ `P K x1 . . . xp → e : T τ → τr , F Figure 4. Translation to Constraints
Rules (VAR), (C ON), (A BS), and (A PP) are straightforward. Rule (PAT) generates an implication constraint, as described informally in Section 4.2. Rule (C ASE) “peeks” inside the pattern
match alternatives to determine the constructor type T (by calling constructor (pi )) and subsequently pushes the type of e (T α) ¯ in the typing clause for each alternative (rule (PAT)). Finally rule (C ASE) returns the constraints arising from the alternatives.
5 We
start with an initially-empty environment, informally relying on a fixed, implicit global environment to specify the types of each data constructor.
Rule (L ETA) generates implication constraints for an annotated letbinding. Pay attention to two details: (a) the inferred type τ 0 must
equate to the declared type τ , and (b) the universally quantified variables a ¯ must not escape their scope. The first is captured in an additional equality constraint τ ∼τ 0 , and the latter in the degenerate implication constraint.
`s F : φ
Rule (L ET) for unannotated let-expressions is much trickier. First, it derives the constraint F1 for the bound expression e1 . The conventional thing to do at this point is to create fresh type variables ¯b for the variables β¯ that are not free in the environment with a substitution θk = {β := b}, and abstract over the constraints, inferring the following type for g (SP07): g : ∀¯b.θk (F1 ⇒ τ )
(S- SOLVE)
simple(F ) = Fs `?s Fs : φ `?s F : θ · φ
`?s φ(F ) : θ
`?s F : φ `?s F1 : φ1 φ1 (F2 ) : φ2 (S- EMPTY) (S-S PLIT) `?s : ∅ `?s F1 ∧ F2 : φ2 · φ1 V `?s ( i τi ∼ τi0 ) : φ (S-R EFL) (S-C ONS ) `?s τ ∼ τ : ∅ `?s T τ¯ ∼ T τ¯0 : φ `?s
This is correct, but by postponing all solving until the second phase we get unexpectedly complicated types for simple definitions. For example, from the definition g = \x -> x && True we would infer the type
(S-UL)
g :: ∀b.b ∼ Bool ⇒ b → Bool when the programmer would expect the equivalent but simpler type Bool → Bool. Furthermore, this approach obviously requires that types can take the form F ⇒ τ — including the possibility that F is itself an implication! It all works fine (see (SP07) for example), but it makes the types significantly more complicated and, in an evidence-passing internal language such as that used by GHC, creates much larger elaborated terms.
(S-SI MPL)
ν 6∈ τ φ = {ν := τ } `?s ν ∼ τ : φ `?s F : φ
(S-PI MPL)
Instead, we interleave constraint generation and constraint solving in rule (L ET), thus6 :
(S-UR)
ν 6∈ τ φ = {ν := τ } `?s τ ∼ ν : φ
fv(φ(α)) ¯ ∩ ¯b = ∅ ¯b ∩ dom(φ) = ∅ `?s [α](∀ ¯ ¯b.F ) : φ
C 6≡ `?s C : φ `s φ(F ) : θ α ¯ ∩ dom(θ) = ∅ `?s [α](∀ ¯ ¯b.C ⊃ F ) : θ
Figure 5. Solver algorithm
• Generate constraints for e1 under the assumption that g : α (we
allow recursion in let). • Add the constraint α ∼ τ to tie the recursive knot in the usual
bind skolem variables. This is perhaps unintuitive—after all in ordinary Hindley-Milner we would require that the substitution binds only unification variables. Nevertheless, in the presence of given equations this is not enough. Consider:

data T where
  MkT :: forall a b. (a ~ b) => a -> b -> T
way, forming F10 .
• We cannot, at this stage, guarantee to solve all the constraints in
F10, because the latter might include implications that can only be solved in the presence of information from elsewhere. So we extract from F10 the "simple" constraints, Fs:

   simple(C)               = C
   simple(F1 ∧ F2)         = simple(F1) ∧ simple(F2)
   simple([ᾱ](∀b̄.F))       = [ᾱ]∀b̄.simple(F)
   simple([ᾱ](∀b̄.C ⊃ F))   = ε            (C ≢ ε)
foo = case e of MkT y z -> let h = [y,z] in ()

Constraint generation for the inner let definition produces the constraint a ∼ b, where a and b are the existential variables introduced by the pattern match. But we must not fail at this point and hence the solver of the Fs constraint must be prepared to encounter equalities between skolem variables.
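For reference, the example can be written out as a small stand-alone GHC program (here e is turned into a parameter of foo so that the file is self-contained; the GADTs extension supplies both the declaration syntax and the ~ constraint):

{-# LANGUAGE GADTs #-}

data T where
  MkT :: forall a b. (a ~ b) => a -> b -> T

-- The unannotated inner binding h = [y,z] equates the two existential
-- (skolem) variables a and b, using the given equality a ~ b.
foo :: T -> ()
foo e = case e of
  MkT y z -> let h = [y, z] in ()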
• Solve Fs appealing to our solver ` Fs : φs . Notice that this
φs may bind skolems, a point that we will return to. Moreover, notice that if unification fails for a simple constraint (such as Bool ∼ Int) then the program is definitely untypeable.
• Second, notice that we do not apply φs to F10 . Why not? Be-
• Apply the solving substitution φs to τ and Γ, and compute the
cause φs may bind variables free in Γ, and we must not lose that information. But the very same information is present in the original F10 , so if we return F10 unchanged (apart from applying the skolemizing substitution θk , then the rest of the derivation will be able to “see” it too.
set of variables β¯ over which to quantify in the usual way.
• Skolemise the variables we can quantify over, using a substitu-
tion θk . • Typecheck the body of the let, with a suitable type for g.
Finally, notice that there is quite a bit of “junk” in F10 . Consider the definition of g given earlier in this subsection. We will get
• Lastly we figure out the constraint F to return. It includes F2
of course, and F10 suitably wrapped in a ∀ to account for the skolemized variables just as in (L ETA).
τ = β → Bool, F10 = β ∼ Bool, θs = {β := Bool} Now β plays no part in the rest of the program, but still lurks in F10 . Because of our freshness assumptions, however, it does no harm either.
There are two tricky points in this process. • First, notice that the substitution returned by solving Fs is a
φ-substitution and not merely a θ-substitution, and hence can
5.2
6 This
interleaving is not so unusual: every Haskell compiler does the same for type-class constraints. Alternatively, Pottier and R´emy (PR05) show how to defer quantification to the solving phase and avoid interleaving.
The OutsideIn Implication Solver
Figure 5 presents the rules of our implication solver. The solver judgement is of the form `s F : φ. This judgement should
be thought of as taking F as input and producing a φ, such that |= φ(F ) according to the equational theory of Figure 3. The judgement appeals to simple(F ) first, to extract the simple part of the constraint Fs . It solves the simple part using the auxiliary judgement `?s F : φ. It applies the substitution to the original constraint and tries to solve the returned constraint.7
is important to type constraints whose assumptions involve skolem variables, such as (a ∼ Int ⊃ Int ∼ a). Furthermore, in rule (S-PI MPL) the solution of the right-hand side of the constraint is required to be a θ. The reason is because this rule is triggered whenever we are trying to solve a proper implication constraint, and hence the solver is not called from rule (L ET), but rather after constraint generation has finished. In order to solve such a constraint at the end, it is unsound to bind skolem variables: Equalities that involve skolems may only be discharged by given equalities. Hence we must not return a φ, but a substitution that binds unification variables only (i.e. a θ).
Notice that the solver returns a φ substitution, which can bind both skolem variables and unification variables. As discussed in the previous section, being able to handle equalities between skolem variables is important for the interleaving of solving and constraint generation in rule (L ET). Nevertheless, only a θ is returned the second time we attempt to solve the constraints. This is because the second time the solver will attempt to solve the proper implications that remain – and solutions to those may only bind unification variables as we shall shortly see (rule (S-PI MPL)).
5.3
Example
Consider again our standard example \x -> case x of { T1 n -> n>0 } for which the type αx → β is derived, and the constraint
`?s
The judgement F : φ is the core of our constraint solver. Rules (S-EMPTY) and (S-SPLIT) are straightforward. The remaining rules deal with a single equality constraint. Rule (S-CONS) deconstructs a type constructor application, and Rule (S-REFL) discharges trivial equality constraints. Rules (S-UL) and (S-UR) actually instantiate a type variable ν with a type τ. They must be careful not to violate the occurs-check (ν ∉ τ).
F = αx ∼ T α ∧ [α, αx , β](α ∼ Bool ⊃ β ∼ Bool) If we solve first the simple constraint on the left, we get the substitution [αx := T α]. We apply this substitution on the implication constraint, yielding [α, β](α ∼ Bool ⊃ β ∼ Bool). Next, we try to solve the implication constraint. Firstly, applying the mgu of α ∼ Bool, i.e. [α := Bool], to β ∼ Bool) has no impact. Secondly, we try to solve β ∼ Bool by substituting β for Bool. Yet this fails, because β is an “untouchable”. Hence, our algorithm rejects the program.
Simple implication constraints, i.e. with empty given constraints, are treated by the (S-SI MPL) rule. A simple implication constraint is treated almost as if it were just a basic constraint with two differences. First, we make sure that the returned φ does not unify any of the skolemized variables of the constraint – it would be unsound to do otherwise. Second, we must never instantiate any of the variables captured in [α] ¯ with a type that contains some of the skolemized variables ¯b. (In this case “untouchables” for the α ¯ variables is a bad name. For example, it is fine – indeed essential – to unify α in [α]∀b.α ∼ Bool.)
Now let’s add a second branch to the example \x -> case x of { T1 n -> n>0 ; T2 xs -> null xs } Again the type αx → β is derived, now with the constraint F 0 = F ∧ αx ∼ T α0 ∧ [α0 ] ∼ [α00 ] ∧ β ∼ Bool The first additional constraint originates from the pattern T2 xs, the second and third from null xs. Solving all simple constraints first, we get the substitution [αx := T α, α0 := α, α00 := α, β := Bool]. These reduce the implication constraint to [α](α ∼ Bool ⊃ Bool ∼ Bool), which is now readily solved. Hence, the expression is accepted with type T α → Bool.
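The same two-branch example, written out as a stand-alone GHC file for reference (the signature shown is the type arrived at above; it is given explicitly here rather than relying on any particular compiler's inference):

{-# LANGUAGE GADTs #-}

data T a where
  T1 :: Int -> T Bool
  T2 :: [a] -> T a

-- The non-GADT branch for T2 fixes the result type to Bool from outside,
-- which is what lets the T1 branch be accepted as well.
f2 :: T a -> Bool
f2 x = case x of
  T1 n  -> n > 0
  T2 xs -> null xs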
Proper implication constraints are tackled by the (S-PI MPL) rule. First it computes φ that solves the assumptions C — if there is no solution, the implication constraint originates from a dead code branch. Next, it applies it to F and solves F recursively yielding θ. Finally, it checks that the solution θ does not touch any of the untouchables.
6.
There are several tricky points:
Specifying the restricted type system
It is all very well having an inference algorithm, but we must also explain to the programmer which programs are accepted by the type checker and which are not. Every GADT inference algorithm has difficulty with this point, and ours is no exception.
• There is some non-determinism in rule (S-S PLIT), but it is
harmless. When solving simple constraints, the order of solving them does not matter; and when solving conjunctions of proper constraints solutions from one can never affect the other.
Figure 6 presents the rules of the restricted type system. The toplevel typing judgement is C, Γ `R e : τ which asserts that expression e has type τ with respect to environment Γ and type constraints C. This judgement is defined by rule (R-M AIN) which in turn is defined in terms of the auxiliary judgement
• Some non-determinism appears in (S-PI MPL). For example,
consider the constraint [α]∀.C ⊃ α ∼ β. The recursive invocation of `s could return either φ = {α := β} or {β := α}, but only one will satisfy the untouchables check. In contrast, any most-general unifier of C will do for φ. Similarly, in a simple constraint []∀c.β ∼ c there is a choice to bind either β or c when we solve the constraint β ∼ c. However, because of the conditions in rule (S-SI MPL), only the solution {β := c} is acceptable.
C, Γ `r e : τ, P which should be read “under constraints C and type environment Γ, the term e has type τ and suspended typing judgements P ”. What are these suspended judgements? The idea is that we typecheck the original term simply ignoring any GADT case alternatives. Instead, these ignored alternatives, along with the current environment and result type, are collected in a set P of tuples hC, Γ, e, τ i. Suppose the original top-level program term can be typed, so that Γ `r e : τ, P holds. Then, for every such typing (or, more realistically for the principal typing) we require that all the suspended typing problems in P are soluble. That is what the (rather complicated) rule
• Rule (S-PI MPL) does not need the skolem-escape check that
appears in (S-SI MPL). Because θ does not affect α, such a check cannot fail. • In (S-PI MPL), solving C requires us to bind skolem variables
as well as unification variables, and hence we return a φ. This 7A
more realistic implementation would split the constraint, solve the simple part and use that substitution to solve the proper part – no need to re-solve the simple part. We chose our current formalism as it saves us the definition of splitting.
347
C, Γ `R e : τ
(R-M AIN)
consistent(C) C, Γ `r e : τ, P ∀τ 0 , P 0 . (C, Γ `r e : τ 0 , P 0 ) ⇒ ∀hCi , Γi , ei , τi i ∈ P 0 . Ci , Γi `R ei : τi C, Γ `R e : τ C, Γ `r e : τ, P
(R-VAR)
(x : ∀¯ a.υ) ∈ Γ φ = {a := τ } C, Γ `r x : φ(υ), ∅
(R-A PP)
(R-L ET)
(R-C ON)
K :: ∀¯ a.D ⇒ υ φ = {a := τ } C |= φ(D) C, Γ `r K : φ(υ), ∅
C, Γ `r e1 : τ1 → τ2 , P1 C, Γ `r e2 : τ1 , P2 C, Γ `r e1 e2 : τ2 , P1 ∪ P2
C, Γ ∪ {g : τ1 } `r e1 : τ1 , P1 a ¯ = fv(τ1 ) − fv(C, Γ) C, Γ ∪ {g : ∀¯ a.τ1 } `r e2 : τ2 , P2 C, Γ `r let {g = e1 } in e2 : τ2 , P1 ∪ P2 (R-C ASE)
(R-A BS)
(R-L ETA)
(R-E Q)
C, Γ `r e : τ1 , P C |= τ1 ∼ τ2 C, Γ `r e : τ2 , P
C, Γ ∪ {x : τ1 } `r e : τ2 , P C, Γ `r λx.e : τ1 → τ2 , P
C, Γ ∪ {g : ∀¯ a.τ1 } `r e1 : τ1 , P1 C, Γ ∪ {g : ∀¯ a.τ1 }, `r e2 : τ2 , P2 C, Γ `r let {g :: ∀¯ a.τ1 = e1 } in e2 : τ2 , P1 ∪ P2
C, Γ `r e : τ1 , P C, Γ `rp pi → ei : τ1 → τ2 , Pi for i ∈ I S C, Γ `r case e of [pi → ei ]i∈I : τ2 , P ∪ i∈I Pi C, Γ `rp p → e : τ → υ, P
(R-VPAT)
K : ∀¯ a, ¯b. ⇒ υ1 → ... → υp → T a ¯ fv(C, Γ, τ¯, τr ) ∩ ¯b = ∅ φ = {a := τ } C, Γ ∪ φ{x1 : υ1 , . . . , xp : υp } `r e : τr , P C, Γ `rp K x1 . . . xp → e : T τ¯ → τr , P
(R-GPAT)
K : ∀¯ a, ¯b.D ⇒ υ1 → ... → υp → T a ¯ D 6= fv(C, Γ, τ¯, τr ) ∩ ¯b = ∅ φ = {a := τ } P = {hC ∧ φ(D), Γ ∪ φ{x1 : υ1 , . . . , xl : υl }, e, τr i} C, Γ `rp K x1 . . . xl → e : T τ¯ → τr , P
Figure 6. Typing Rules for the Restricted Type System
(R-M AIN) says. It ensures that typing information from inside a GADT match does not influence the typing of code outside that match — just as the algorithm does. Observe the recursive nature of rule (R-M AIN), which defers and processes nested case expressions one layer at a time.
First, the restricted type system is sound wrt. the unrestricted type system: T HEOREM 7.1 (Soundness). If , Γ `R e : τ in the restricted type system (Figure 6), then , Γ ` e : τ in the unrestricted type system (Figure 2).
The only rule that adds a deferred typing to P is R-GPAT; it defers the typing of a branch of a case expression that matches a GADT constructor pattern. This rule only applies to GADT constructors that bring a type equality into scope. In all other cases, when no new type equalities are brought into scope, the rule R-VPAT applies, which does not defer the typing.
7.
It is fairly easy to see that this theorem holds. In addition to all the constraints of the unrestricted type system, the restricted type system imposes one more constraint on well-typing: the universal well-typing of the deferred typings discussed above. Moreover, the restricted type system has the important property that it only admits expressions that have a principal type.
Formal properties
T HEOREM 7.2 (Principal Typing in the Restricted Type System). If an expression e is typeable in the restricted type system wrt. a type environment Γ, then there is a principal type τp such that , Γ `R e : τp and such that for any other τ for which , Γ `R e : τ , there exists a substitution φ such that φ(τp ) = τ .
In this section we describe the properties of our type system and its inference algorithm. 7.1
Properties of the type system
As we have discussed, implication constraints arising from program text may have a finite or infinite set of incomparable solutions. This ambiguity makes type inference hard. Even in the case when the solutions are finite (but cannot be described by a common most general solution) modular type inference is impossible. Our restricted system however imposes conditions on the typeable programs of the unrestricted system, which ensure that we can perform tractable type inference without having to search the complete space of possibly incomparable solutions for the arising constraints.
Note that the principality is not an artifact of the restricted type system. A principal type in the restricted type system is also a principal type in the unrestricted type system: T HEOREM 7.3 (Principal Typing in the Unrestricted Type System). Assume that the type τp is the principal type of e wrt. a type environment Γ in the restricted type system. Then, for for any other τ for which , Γ ` e : τ , there exists a substitution φ such that φ(τp ) = τ .
in this section, but we refrain from presenting the generalized statements for the sake of clarity of exposition.
8.
Implementation Aspects
A key property of OutsideIn is that it is easy to implement, and the implementation is efficient. To substantiate this claim we briefly describe our implementation of OutsideIn in Haskell. Our implementation is available for download from
http://research.microsoft.com/people/dimitris/ and additionally supports bidirectional type checking, open type annotations, and type annotations with constraints. We introduce a datatype MetaTv for unification variables α, β, . . . and a datatype TyVar for skolem variables a, b, . . .. As in traditional implementations (PVWS07), the MetaTv contains a reference cell that may contain a type to which the variable is bound:
Figure 7. The space of programs (nested regions: all expressions ⊇ well-typed in the unrestricted system ⊇ principal type in the unrestricted system ⊇ well-typed in the restricted system)

Note that not all programs with a principal type in the unrestricted type system are accepted in the restricted type system.
data MetaTv = Meta Name (IORef (Maybe Type)) newtype TyVar = TyV Name
Consider the following program: data T a where MkT :: (a ~ Bool) => T a
The main type checker is written in a monad Tc a, which is a function from environments TcEnv and encapsulates IO and threading of error messages.
f :: T a -> Char f x = let h = case x of MkT -> 3 in ’a’
newtype Tc a = Tc (TcEnv -> IO (Either ErrMsg a))

data TcEnv = TcEnv { var_env      :: Map Name Type
                   , lie_env      :: IORef [Constraint]
                   , untouchables :: [MetaTv]
                   , ... }
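For completeness, one way to fill in the monad structure implied by the Tc newtype above (a reader over TcEnv combined with IO and error propagation). This is a sketch with an invented helper failTc, not necessarily the instances used in the released implementation:

instance Functor Tc where
  fmap f (Tc m) = Tc $ \env -> fmap (fmap f) (m env)

instance Applicative Tc where
  pure x = Tc $ \_ -> return (Right x)
  Tc mf <*> Tc mx = Tc $ \env -> do
    ef <- mf env
    case ef of
      Left err -> return (Left err)
      Right f  -> fmap (fmap f) (mx env)

instance Monad Tc where
  Tc m >>= k = Tc $ \env -> do
    ea <- m env
    case ea of
      Left err -> return (Left err)
      Right a  -> let Tc m' = k a in m' env

-- A failing computation simply short-circuits with an error message.
failTc :: ErrMsg -> Tc a
failTc msg = Tc $ \_ -> return (Left msg)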
The principal type for h in the unrestricted type system is Int. The restricted, ignoring the case branch, attempts to assign ∀b.b as the principal type for h. However, this does not allow the implication (a ∼ Bool) ⊃ (b ∼ Int) to be solved. Hence, the program is rejected.
Finally, we observe that nearly all well-typings in the unrestricted type system can be recovered in the restricted type system by adding additional type annotations to the program. Because our language does not provide a means to name existential type variables brought into scope by GADT pattern matches, we cannot recover thos well-typings that require mentioning them. Open type annotations8 would lift that limitation.
8.1
Constraint generation
In traditional implementations, unification variables are typically eagerly unified to types as type inference proceeds. In contrast, the algorithm of Figure 4 first generates (lots of) constraints, and then solves them, which is much less efficient. In our implementation we choose an intermediate path, which results in much more compact generated constraints. The environment TcEnv is equipped with the untouchables field, which records the untouchable variables. As type inference proceeds we perform eager unification by side effect in the usual way, except that we refrain from unifying a variable α from the untouchable set to a type τ . In that case, we defer the constraint α ∼ τ , to be dealt with after constraint generation is finished. Hence, the unifier has signature:
Properties of the inference algorithm
The solver algorithm has a number of vital properties. First, the search for solution always terminates, either in failure or success. T HEOREM 7.4 (Termination). The solver algorithm terminates. Second, when a solution is found, it is a proper well-typing in the restricted type system.
unify :: Type -> Type -> Tc [Constraint]
T HEOREM 7.5 (Soundness). If `inf e : τ then , ∅ `R e : τ Third, when a solution is found, it is not an arbitrary solution, but the principal solution.
It accepts two types to unify, unifies them (perhaps using side effects on MetaTvs that are not untouchable), and returns a list of deferred equalities for variables that belong in the untouchables field of the environment.
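A rough, self-contained sketch of a unifier in this style: unification variables carry an IORef, touchable variables are bound by side effect, and any binding of an untouchable variable is deferred as a constraint. The helper names (unifySketch, zonk) and the list-of-names representation of the untouchables are invented here, the monad is plain IO rather than Tc, and the occurs check is omitted for brevity.

import Data.IORef

data Type = TyVarT TyVar | MetaT MetaTv | FunT Type Type | ConT String [Type]
data MetaTv = Meta String (IORef (Maybe Type))
newtype TyVar = TyV String deriving Eq

data Constraint = EqC Type Type

unifySketch :: [String] -> Type -> Type -> IO [Constraint]
unifySketch untouchables = go
  where
    go t1 t2 = do
      t1' <- zonk t1
      t2' <- zonk t2
      case (t1', t2') of
        (MetaT m@(Meta n _), _)
          | n `elem` untouchables -> return [EqC t1' t2']  -- defer
          | otherwise             -> bindMeta m t2' >> return []
        (_, MetaT m@(Meta n _))
          | n `elem` untouchables -> return [EqC t1' t2']  -- defer
          | otherwise             -> bindMeta m t1' >> return []
        (FunT a b, FunT c d)      -> (++) <$> go a c <*> go b d
        (ConT c as, ConT d bs)
          | c == d && length as == length bs
                                  -> concat <$> sequence (zipWith go as bs)
        (TyVarT a, TyVarT b)
          | a == b                -> return []
        _                         -> fail "cannot unify"

    bindMeta (Meta _ ref) t = writeIORef ref (Just t)

    -- follow the bindings of meta variables that were already unified
    zonk t@(MetaT (Meta _ ref)) = do
      mb <- readIORef ref
      case mb of
        Just t' -> zonk t'
        Nothing -> return t
    zonk t = return t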
T HEOREM 7.6 (Principality). The inferred type is the principal type in the restricted type system: If `inf e : τ and , ∅ `R e : υ then υ = φ(τ ) for some φ.
How does the untouchables environment field get updated? Whenever we perform type inference for a pattern match clause with non-empty given equations, the main type checker:
Finally, if an expression is well-typed, then the solver algorithm finds a solution.
1. extends the untouchables field with the unification variables of the scrutinee and the environment and the return type, as required by Figure 4,
T HEOREM 7.7 (Completeness). If , ∅ `R e : τ then `inf e : υ. Of course, in order to prove the solver algorithm properties, we have to generalize appropriately the statements of the theorems
9 The 8 i.e.
Among other fields, the TcEnv environment contains a typing environment var_env, which is a map from term variable names to types. The field lie_env collects the set of Constraints that arise during type inference.9 The Constraint datatype holds equality and implication constraints.
Hence, the gray area in Figure 7 is non-empty. We leave it as a challenge for future work to expand the innermost area towards the dashed line.
7.2
name lie env is folklore from type class implementations, where it stands for Local Instance Environment.
containing free occurrences of lexically scoped type variables.
• If the flag is (Unif untch) then the returned Substitution
2. performs type inference for the right-hand-side of the clause and returns the deferred constraints, and 3. defers an implication constraint whose right-hand side consists of the aforementioned deferred constraints. 8.2
binds only unification variables that do not appear in the list of untouchables, untch.11 • If the flag is All then the returned Substitution binds in-
variably skolem and unification variables. Notice that it would be wrong to apply this substitution to unification variables as a side-effect – for example it is definitely wrong in the context of solving the local assumptions of an implication constraint.
Constraint solving
During type inference we need to solve the generated constraints at two points: when the constraints for the complete program have been generated (rule (I NFER), Figure 4), but also, more subtly, when we encounter a let-bound definition with no annotation (rule (L ET), Figure 4) – in the latter case we must only solve the simple constraints.
One reason we need the flag All is because solveSimple has to unify both skolem and unification variables when called on the given equalities of implication constraints. Concretely, here is the definition of solveProper (simplified):
Post-constraint-generation-solving After constraint generation is finished, the lie_env field holds the set of deferred constraints. At this point we may appeal to our constraint simplifier, which is written in a lightweight error-threading monad, implemented with Haskell’s Either datatype. By design, this monad is pure in the sense that it does not support in-place updating of MetaTvs.
solveProper (CImplicConstraint envs sks gs ws) = do { subst Either SimplifierError () solveConstraints untch cs = do { let (simples, propers) = splitCConstrs cs ; subst CConstraint -> Either SimplifierError Substitution
The Substitution datatype denotes substitutions from either MetaTv or TyVar variables to types. Notice that solveSimples is defined as a fold that starts off with the empty substitution and updates it as it solves each simple constraint. In contrast, we use mapM_ to solve each proper constraint independently in solveConstraints, because they cannot affect each other. The SimplifierMode argument to solveSimple stands for the mode of operation:
11 We could in principle apply the returned substitution as side-effect but we 10 The CConstraint datatype is a “canonicalized” variant of Constraint,
chose to not do so in order to treat this case uniformly with the case when the flag is All, to be described next.
and we can ignore their differences below.
9.3
We first type check the binding and get the resulting constraints and its type. Subsequently, we call the simplifier: we split the constraints to simples and propers, we solve the simples (using mode All) and return the propers and the resulting substitution, subst. Next, we compute the variables to quantify and the skolemizing substitution θk of rule (L ET) in Figure 4 (thetak).
We conclude that tractable type inference for completely unannotated programs is impossible. It is therefore acceptable to demand a certain amount of user-provided type information. We know of two well-documented approaches: Régis-Gianas and Pottier stratify type inference into two passes. The first figures out the "shape" of types involving GADTs, while the second performs more-or-less conventional type inference (PRG06). Régis-Gianas and Pottier present two different shape analysis procedures, the Wob and Ibis systems. The Wob system has similar expressiveness and need for annotation as in (PVWW06). The Ibis system on the other hand has similar expressiveness as our system, with a very aggressive iterated shape analysis process. This is reminiscent of our unification of simple constraints arising potentially from far away in the program text, prior to solving a particular proper constraint. In terms of expressiveness, the vast majority of programs typeable by our system are typeable in Ibis, but we conjecture that there exist programs typeable in our system that are not typeable in Ibis, because unification of simple (global) constraints may be able to figure out more about the types of expressions than the preprocessing shape analysis of Ibis. On the other hand, Ibis lacks a declarative specification that does not force the programmer to understand the intricacies of shape propagation.
Next, we need to extend the lie_env with a constraint. At this point we could in principle return the original constraint to which we have applied θk , wrapped as a simple implication constraint, as in rule (L ET). As an optimization however, we call unify on each binding ν := τ in phi_thetak; in the common case where ν is a (touchable) unification variable α unify will update α in-place, otherwise it will defer the constraints. Those deferred constraints are bound to deferred, and finally return a simple implication constraint that contains the skolemized proper part of the original (spropers) and those deferred.
9. Related Work
Since GADTs have become popular there has been a flurry of papers on inference algorithms to support them in a practical programming language.
9.1 Fully-annotated programs
One approach is to assume that the program is fully type-annotated, i.e. each sub-expression carries explicit type information. Under this (strong) assumption, we speak of type checking rather than inference. Type checking boils down to unification, which is decidable; hence type checking for GADTs is decidable. For example, consider (CH03) and (SP07).
9.2 Entirely unannotated programs
Type inference for unannotated programs turns out to be extremely hard. The difficulty lies in the fact that GADT pattern matches bring into scope local type assumptions (Section 2). Following the standard route of reducing type inference to constraint solving, GADTs require implication constraints to capture the inference problem precisely (SSS08).
Unification is no longer sufficient to solve such constraints; we require more complicated solving methods, such as constraint abduction (Mah05) and E-unification (GNRS92). It is fairly straightforward to construct examples which show that no principal solutions (and therefore no principal types) exist. We can even conclude that GADT inference is undecidable, by reduction to the simultaneous rigid E-unification problem, which is known to be undecidable (DV95).
How do previous inference approaches tackle these problems? Simonet and Pottier (SP07) solve the inference problem by admitting (much) richer constraints. They sidestep the problems of undecidability and lack of principal types altogether by reducing type inference to type checking: their inference approach only accumulates (implication) constraints and refrains from solving them. As a result, implications may appear in type schemes, which is a serious complication for the poor programmer (we elaborate in Section 5.1). Furthermore, no tractable solving algorithm is known for the constraints they generate, largely because of the (absolutely necessary) use of implications.
Sulzmann et al (SSS08) go in the other direction, keeping the constraints (in types) simple and instead applying a very powerful (abductive) solving mechanism. To avoid undecidability, they consider only a selected set of "intuitive" solutions. However, they give only an inference algorithm, and it is not clear how to give a declarative description that specifies which programs are well-typed and which are not. Furthermore, their system lacks principal types.
9.3 Practical compromises
We conclude that tractable type inference for completely unannotated programs is impossible. It is therefore acceptable to demand a certain amount of user-provided type information, and we know of two well-documented approaches. Régis-Gianas and Pottier stratify type inference into two passes: the first figures out the "shape" of types involving GADTs, while the second performs more-or-less conventional type inference (PRG06). They present two different shape analysis procedures, the Wob and Ibis systems. The Wob system has expressiveness and annotation requirements similar to those of (PVWW06). The Ibis system, on the other hand, has expressiveness similar to our system, obtained through a very aggressive, iterated shape analysis. This is reminiscent of our unification of simple constraints, arising potentially from far away in the program text, prior to solving a particular proper constraint. In terms of expressiveness, the vast majority of programs typeable by our system are typeable in Ibis, but we conjecture that there exist programs typeable in our system that are not typeable in Ibis, because unification of simple (global) constraints may be able to figure out more about the types of expressions than the preprocessing shape analysis of Ibis. On the other hand, Ibis lacks a declarative specification that does not force the programmer to understand the intricacies of shape propagation.
Peyton Jones et al require that the scrutinee of a GADT match has a "rigid" type, known to the type checker ab initio; a number of ad hoc rules describe how a type signature is propagated to control rigidity (PVWW06). Because rigidity analysis is more aggressive in our system, we type many more programs than (PVWW06), including the carefully-chosen Example 7.2 from (PRG06). On the other hand, a program fails to type check in our approach if the type of a case branch is not determined by some "outer" constraint:
data Eq a b where { Refl :: forall a. Eq a a }

test :: forall a b. Eq a b -> Int
test x = let funny_id = \z -> case x of Refl -> z
         in funny_id 3
By contrast, this program is typeable in (PVWW06). Arguably, though, this program should be rejected, because there are several incomparable types for funny_id (in the unrestricted system of Figure 2), including ∀c.c → c and a → b.
GHC's implementation is a slight variation that requires the right-hand side of a pattern-match clause to be typed in a rigid environment12. Hence, it would reject the previous example. Our system is strictly more expressive than this variation:
test :: forall a b. Eq a b -> Int
test x = (\z -> case x of Refl -> z) 34
The above program would fail to type check in GHC, as the "wobbly" variable z cannot be used in the right-hand side of a pattern-match clause, but in our system it is typeable because the "outer" constraint forces z to get type Int. In both approaches, inferred types are maximal, but not necessarily principal in the unrestricted natural GADT type system; the choice of a particular maximal type over others relies on the ad hoc rigidity analysis or shape pre-processing. By contrast, in our system only programs that enjoy principal types in the unrestricted type system are accepted.
Moreover, in both approaches the programmer is required to understand an entirely new concept (shape or rigidity, respectively), with somewhat complex and ad hoc rules (e.g. Figure 6 of (PRG06)). Nor is the implementation straightforward; e.g., GHC's implementation of (PVWW06) is known to be flawed in a non-trivial way.
12 GHC's algorithm is described in an Appendix to the online version of the paper, available from: http://research.microsoft.com/people/simonpj/papers/gadt
10. Further work
Although we have focused exclusively on GADTs, we intend to apply our ideas in the context of Haskell, and more specifically of the Glasgow Haskell Compiler. The latter embodies numerous extensions to Haskell 98, some of which are highly relevant. Notably, a data constructor can bring into scope a local type-class constraint:
class Eq a where { (==) :: a -> a -> Bool }
data D a where { D1 :: Eq a => a -> D a }

h :: a -> D a -> Bool
h x (D1 y) = x==y
The pattern match on D1 brings the (Eq a) constraint into scope, which can be used to discharge the (Eq a) constraint that arises from the occurrence of (==). Note that D1 is not a GADT; it brings into scope no new type equalities. The same thing may happen with Haskell's implicit parameters (LLMS00).
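As a hedged, self-contained illustration of this point (using the Prelude's Eq class rather than the single-method class above; the main function is ours), the dictionary captured by D1 is what discharges the use of (==) at the call site:

{-# LANGUAGE GADTs #-}

data D a where
  D1 :: Eq a => a -> D a

h :: a -> D a -> Bool
h x (D1 y) = x == y

main :: IO ()
main = print (h (3 :: Int) (D1 4))  -- prints False; the Eq Int dictionary
                                    -- stored inside D1 is used for (==)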
Since type inference for Haskell already involves gathering and solving type-class constraints, the constraint-gathering approach to inference is quite natural. The above extension to Haskell generalises the idea of local type constraints to constraints other than equalities, and these naturally map to the same implication constraints we need for GADTs.
More ambitiously, GHC also supports indexed type families and type-equality constraints between them (SJCS08). So we may write
type family F a :: *
type instance F Int = Int
type instance F [a] = F a

data E a where { E1 :: (F a ~ Int) => a -> E a }
Here, when we match on E1 we get the local constraint F a ∼ Int, which in turn gives rise to new questions for the solver (SJCS08).
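For a hedged illustration of how such a local equality can be used (get and its signature are our own illustrative names, not from the paper), consider:

{-# LANGUAGE GADTs, TypeFamilies #-}

type family F a
type instance F Int = Int
type instance F [a] = F a

data E a where
  E1 :: (F a ~ Int) => a -> E a

-- Matching on E1 brings F a ~ Int into scope, so a value of type F a
-- can be returned at type Int in that branch.
get :: E a -> F a -> Int
get (E1 _) n = n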
Unsurprisingly, these extensions raise issues similar to those we found with simple equality constraints. For example, it turns out that type classes suffer from the same lack of principal types as equality constraints (SSS06). Consider this function:
data T a where { MkT :: Eq a => T a }

f x y = case x of { MkT -> y==y } :: Bool
What type should be inferred for f? Here are two, neither of which is more general than the other:
f :: ∀a. T a → a → Bool
f :: ∀a b. Eq b ⇒ T a → b → Bool
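As a purely illustrative aside (the names f1 and f2 and their signatures are ours, not the paper's), either type can be selected by supplying an explicit annotation, after which checking is unambiguous:

{-# LANGUAGE GADTs #-}

data T a where
  MkT :: Eq a => T a

-- The local Eq a from matching MkT justifies y == y here.
f1 :: T a -> a -> Bool
f1 x y = case x of MkT -> y == y

-- Here the Eq b constraint comes from the signature instead.
f2 :: Eq b => T a -> b -> Bool
f2 x y = case x of MkT -> y == y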
Hence, the declarative specification of the type system and the inference algorithm must be extended to cope with additional kinds of constraints. Finally, in practice, our type-checking algorithm would have to be augmented with the generation of evidence, witnessing that the wanted constraints hold. In GHC's intermediate language, evidence for equality constraints takes the form of type equality coercions, while dictionaries are the evidence for type class constraints. We have omitted evidence handling here so as not to distract from the essence of the OutsideIn algorithm.
Acknowledgements
We are grateful to the anonymous ICFP 2009 reviewers, and to James McKinna's team for their comments.
References
[CH03] J. Cheney and R. Hinze. First-class phantom types. TR 1901, Cornell University, 2003.
[DV95] A. Degtyarev and A. Voronkov. Simultaneous rigid E-unification is undecidable. In Proc. of CSL'95, volume 1092 of LNCS, pages 178–190. Springer, 1995.
[GNRS92] J. H. Gallier, P. Narendran, S. Raatz, and W. Snyder. Theorem proving using equational matings and rigid E-unification. J. ACM, 39(2):377–429, 1992.
[LLMS00] J. R. Lewis, J. Launchbury, E. Meijer, and M. Shields. Implicit parameters: Dynamic scoping with static types. In POPL, pages 108–118, 2000.
[Mah05] M. Maher. Herbrand constraint abduction. In Proc. of LICS'05, pages 397–406. IEEE Comp. Soc., 2005.
[PR05] F. Pottier and D. Rémy. The essence of ML type inference. In Benjamin C. Pierce, editor, Advanced Topics in Types and Programming Languages, chapter 10, pages 389–489. MIT Press, 2005.
[PRG06] F. Pottier and Y. Régis-Gianas. Stratified type inference for generalized algebraic data types. In Proc. of POPL'06, pages 232–244. ACM, 2006.
[PVWS07] S. Peyton Jones, D. Vytiniotis, S. Weirich, and M. Shields. Practical type inference for arbitrary-rank types. J. of Func. Prog., 17:1–82, January 2007.
[PVWW06] S. Peyton Jones, D. Vytiniotis, S. Weirich, and G. Washburn. Simple unification-based type inference for GADTs. In Proc. of ICFP'06, pages 50–61. ACM, 2006.
[SCPD07] M. Sulzmann, M. Chakravarty, S. Peyton Jones, and K. Donnelly. System F with type equality coercions. In Proc. of TLDI'07. ACM, 2007.
[SJCS08] T. Schrijvers, S. Peyton Jones, M. Chakravarty, and M. Sulzmann. Type checking with open type functions. SIGPLAN Not., 43(9):51–62, 2008.
[SP07] V. Simonet and F. Pottier. A constraint-based approach to guarded algebraic data types. ACM Trans. Prog. Languages Systems, 29(1), January 2007.
[SSS06] M. Sulzmann, T. Schrijvers, and P. J. Stuckey. Principal type inference for GHC-style multi-parameter type classes. In Proc. of APLAS'06, volume 4279 of LNCS, pages 26–43. Springer, 2006.
[SSS08] M. Sulzmann, T. Schrijvers, and P. Stuckey. Type inference for GADTs via Herbrand constraint abduction. Report CW 507, K.U.Leuven, Belgium, 2008.
Author Index
Andrieu, Olivier ......................... 215
Hudak, Paul ................................. 35
Puccetti, Armand ........................281
Arts, Thomas ............................. 149
Hughes, John ............................. 149
Quint, Vincent ............................221
Balat, Vincent ............................ 311
Hur, Chung-Kil ............................ 97
Reppy, John ...............................257
Barzilay, Eli ............................... 109
Jagannathan, Suresh .................. 161
Rodriguez Yakushev, Alexey ....233
Baudin, Patrick .......................... 281
Jensen, Thomas P. ...................... 287
Rompf, Tiark ..............................317
Benton, Nick ................................ 97
Jeuring, Johan ............................ 233
Rossberg, Andreas .....................135
Bierman, Gavin M. ..................... 329
Kiselyov, Oleg ............................. 11
Russo, Claudio V. .......................257
Bonichon, Richard ..................... 281
Klein, Gerwin .............................. 91
Sampson, Curt J. .........................185
Canet, Géraud ............................ 281
Ko, Teresa ................................... 59
Schrijvers, Tom ..........................341
Canou, Benjamin ....................... 215
Krishnamurthi, Shriram ............... 47
Sculthorpe, Neil ...........................23
Chailloux, Emmanuel ................ 215
Layaïda, Nabil ........................... 221
Shan, Chung-chieh .......................11
Chaudhuri, Avik ........................ 269
Licata, Daniel R. ........................ 123
Shinnar, Avraham ........................79
Cheng, Eric .................................. 35
Liu, Hai ....................................... 35
Signoles, Julien ..........................281
Chlipala, Adam ............................ 79
Löh, Andres ............................... 233
Singh, Satnam ..............................65
Claessen, Koen .......................... 149
Maier, Ingo ................................ 317
Sivaramakrishnan, KC. ..............161
Colaço, Jean-Louis .................... 215
Malecha, Gregory ........................ 79
Smallbone, Nicholas ..................149
Correnson, Loïc ......................... 281
Manoury, Pascal ........................ 215
Steele Jr., Guy L..............................1
Cuoq, Pascal .............................. 281
Marlow, Simon ............................ 65
Sulzmann, Martin .......................341
Derrin, Philip ............................... 91
McCarthy, Jay A. ...................... 299
Svensson, Hans ..........................149
Dreyer, Derek ............................ 135
Midtgaard, Jan ........................... 287
Swamy, Nikhil ...........................329
Elliott, Conal M. ......................... 191
Monate, Benjamin ..................... 281
Swierstra, S. Doaitse ...................245
Elphinstone, Kevin ...................... 91
Moniot, Thomas ........................ 215
Swierstra, Wouter ......................245
Felleisen, Matthias ....................... 47
Morrisett, Greg ............................ 79
Viera, Marcos .............................245
Findler, Robert Bruce ........... 47, 109
Neis, Georg ................................ 135
Voigtländer, Janis ......................173
Fischer, Sebastian ........................ 11
Newton, Ryan R. .......................... 59
Vouillon, Jérôme ........................311
Flatt, Matthew ...................... 47, 109
Nilsson, Henrik ............................ 23
Vytiniotis, Dimitrios ..................341
Gazagnaire, Thomas .................. 203
Odersky, Martin ........................ 317
Wang, Philippe ...........................215
Genevès, Pierre .......................... 221
Pagano, Bruno ........................... 215
Wiger, Ulf ..................................149
Hanquez, Vincent ...................... 203
Pałka, Michał ............................. 149
Wisnesky, Ryan ...........................79
Harper, Robert ........................... 123
Peyton Jones, Simon .............65, 341
Xiao, Yingqi ...............................257
Hicks, Michael ........................... 329
Pierce, Benjamin C. ................... 121
Yakobowski, Boris .....................311
Hinze, Ralf ..................................... 3
Piponi, Dan ................................ 231
Ziarek, Lukasz ............................161
Holdermans, Stefan .................... 233
Prevosto, Virgile ........................ 281
join today!
SIGPLAN & ACM www.acm.org/sigplan
www.acm.org
The ACM Special Interest Group on Programming Languages (SIGPLAN) explores programming language concepts and tools, focusing on design, implementation, and efficient use. Its members are programming language users, developers, theoreticians, researchers, and educators. The monthly newsletter, ACM SIGPLAN Notices, publishes several conference proceedings issues, regular columns and technical correspondence (available in electronic or hardcopy versions). Members also receive a CD containing the prior year's conference proceedings and newsletter issues. SIGPLAN sponsors several annual conferences including OOPSLA, PLDI, POPL and ICFP, plus a number of other conferences and workshops.
The Association for Computing Machinery (ACM) is an educational and scientific computing society which works to advance computing as a science and a profession. Benefits include subscriptions to Communications of the ACM, MemberNet, TechNews and CareerNews, plus full access to the Guide to Computing Literature, full and unlimited access to thousands of online courses and books, discounts on conferences and the option to subscribe to the ACM Digital Library.
❑ SIGPLAN Print (ACM Member or Non-ACM Member): $65
❑ SIGPLAN Print (ACM Student Member): $40
❑ SIGPLAN Online (ACM Member or Non-ACM Member): $25
❑ SIGPLAN Online (ACM Student Member): $15
❑ ACM Professional Membership ($99) & SIGPLAN Print ($65): $164
❑ ACM Professional Membership ($99) & SIGPLAN Print ($65) & ACM Digital Library ($99): $263
❑ ACM Professional Membership ($99) & SIGPLAN Online ($25): $124
❑ ACM Professional Membership ($99) & SIGPLAN Online ($25) & ACM Digital Library ($99): $223
❑ ACM Student Membership ($19) & SIGPLAN Print ($40): $59
❑ ACM Student Membership ($19) & SIGPLAN Online ($15): $34
❑ SIGPLAN Notices only (available in electronic or hardcopy versions): $75
❑ SIGPLAN Notices Expedited Air (outside N. America): $55
❑ FORTRAN Forum (ACM or SIGPLAN Member): $10
❑ FORTRAN Forum (ACM Student Member): $6
❑ FORTRAN Forum only: $20
❑ Expedited Air for FORTRAN Forum (outside N. America): $10
❑ Expedited Air for Communications of the ACM (outside N. America): $50
payment information
Name __________________________________________________
ACM Member # __________________________________________
Mailing Address __________________________________________
_______________________________________________________
City/State/Province _______________________________________
ZIP/Postal Code/Country ___________________________________
Email __________________________________________________
Fax ____________________________________________________
Credit Card Type: ❏ AMEX   ❏ VISA   ❏ MC
Credit Card # ______________________________________________
Exp. Date _________________________________________________
Signature _________________________________________________
Mailing List Restriction
ACM occasionally makes its mailing list available to computer-related organizations, educational institutions and sister societies. All email addresses remain strictly confidential. Check one of the following if you wish to restrict the use of your name:
❏ ACM announcements only
❏ ACM and other sister society announcements
❏ ACM subscription and renewal notices only
Make check or money order payable to ACM, Inc ACM accepts U.S. dollars or equivalent in foreign currency. Prices include surface delivery charge. Expedited Air Service, which is a partial air freight delivery service, is available outside North America. Contact ACM for more information.
Questions? Contact: ACM Headquarters 2 Penn Plaza, Suite 701 New York, NY 10121-0701 voice: 212-626-0500 fax: 212-944-1318 email: [email protected]
Remit to: ACM General Post Office P.O. Box 30777 New York, NY 10087-0777 SIGAPP10
www.acm.org/joinsigs Advancing Computing as a Science & Profession